This started off as a request for help, but has ended up as a
sort of meta bug report: as well as fixing the issue, I would like
to highlight how hard it is to get useful debugging information
out of pscheduler.
The issue is between two nodes. Node A runs perfsonar-toolkit 4.2 in an Ubuntu 18.04 lxd container; node B runs perfsonar-testpoint 4.1 under Debian. Both nodes have NTP-synced clocks.
On node A, I configured a throughput test to B, with v6-only selected (since node A is behind NAT for IPv4). The problem seen in the GUI is that results are shown for the outbound direction, but nothing for the inbound direction (and no red dot for a 'failed test' either).
Logs on A and B didn't really seem to show any reason (*).
Firewall rules allow all traffic between A and B over IPv6. I was able to run tests using iperf3 -s/-c in both directions, without problems.
So I tried using pscheduler to run tests from the command line -
and sure enough, I found that tests in the outbound direction
worked, but tests in the inbound direction did not.
Here's the outbound test command, which works:
# pscheduler task throughput -d ns1.BBBB.com -s perf1.home.AAAA.net --ip-version=6
And here's the inbound test command, which fails: the same command with -s and -d reversed, plus --debug added.
--------
# pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6
2019-09-01T16:17:26 Debug started
2019-09-01T16:17:26 Assistance is from localhost
2019-09-01T16:17:26 Forcing default slip of PT5M
2019-09-01T16:17:26 Converting to spec via https://localhost/pscheduler/tests/throughput/spec
Submitting task...
2019-09-01T16:17:27 Fetching participant list
2019-09-01T16:17:27 Spec is: {"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}
2019-09-01T16:17:27 Params are: {'spec': '{"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}'}
2019-09-01T16:17:28 Got participants: {u'participants': [u'ns1.BBBB.com', u'perf1.home.AAAA.net']}
2019-09-01T16:17:28 Lead is ns1.BBBB.com
2019-09-01T16:17:28 Pinging https://ns1.BBBB.com/pscheduler/
2019-09-01T16:17:28 ns1.BBBB.com is up
2019-09-01T16:17:28 Posting task to https://ns1.BBBB.com/pscheduler/tasks
2019-09-01T16:17:28 Data is {'test': {'type': 'throughput', 'spec': {u'dest': u'perf1.home.AAAA.net', u'source': u'ns1.BBBB.com', u'ip-version': 6, u'schema': 1}}, 'schedule': {'slip': 'PT5M'}}
Task URL: https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Posted https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Submission diagnostics:
2019-09-01T16:17:40 Hints:
2019-09-01T16:17:40 requester: A:A:A:A::45
2019-09-01T16:17:40 server: B:B:B:B:feff:ff:fe00:f16
2019-09-01T16:17:40 Identified as everybody
2019-09-01T16:17:40 Classified as default
2019-09-01T16:17:40 Application: Defaults applied to non-friendly hosts
2019-09-01T16:17:40 Group 1: Limit 'innocuous-tests' failed: Passed but inverted
2019-09-01T16:17:40 Group 1: Limit 'throughput-default-time' passed
2019-09-01T16:17:40 Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-09-01T16:17:40 Group 1: Want any, 1/3 passed, 2/3 failed: PASS
2019-09-01T16:17:40 Application PASSES
2019-09-01T16:17:40 Proposal meets limits
Running with tool 'iperf3'
Fetching first run...
2019-09-01T16:17:40 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first
2019-09-01T16:17:42 Handing off: pscheduler watch --first --debug https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:42 Debug started
2019-09-01T16:17:42 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:43 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first
Next scheduled run: https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550
Starts 2019-09-01T17:17:50+01:00 (~6 seconds)
Ends 2019-09-01T17:18:09+01:00 (~18 seconds)
Waiting for result...
2019-09-01T16:18:14 Calling ['pscheduler', 'result', '--quiet', '--format', 'text/plain', '--debug', u'https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550']
Run did not complete: Failed
Diagnostics from ns1.BBBB.com:
No diagnostics.
Error from ns1.BBBB.com:
iperf3 returned an error: error - unable to connect to server: Connection refused
Diagnostics from perf1.home.AAAA.net:
Run was preempted.
Error from perf1.home.AAAA.net:
No error.
2019-09-01T16:18:15 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/next
No further runs scheduled.
--------
No useful indication of the problem, other than "Connection refused" and "Run was preempted.". Furthermore, running "ps | grep iperf3" on both nodes during the test window, I didn't see any iperf3 process running on either side.
Running tcpdump on node A, I see an inbound TCP connection
attempt which is refused, agreeing with the error above.
--------
root@perf1:~# tcpdump -i eth0 -nn tcp port 5201
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:01:40.220897 IP6 B:B:B:B:feff:ff:fe00:f16.52707 > A:A:A:A::45.5201: Flags [S], seq 3700706462, win 28800, options [mss 1440,sackOK,TS val 338337963 ecr 0,nop,wscale 7], length 0
17:01:40.221105 IP6 A:A:A:A::45.5201 > B:B:B:B:feff:ff:fe00:f16.52707: Flags [R.], seq 0, ack 3700706463, win 0, length 0
--------
More strangely: after replacing /usr/bin/iperf3 on node A with a shell script which writes to /tmp, I could see that iperf3 wasn't being run at all for inbound tests (but it *is* being run for tests in the outbound direction).
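For anyone who wants to reproduce this trick, here is a minimal sketch of the kind of logging wrapper I mean, written in Python rather than shell; the path /usr/bin/iperf3.real is my own naming convention for where the real binary gets moved, not anything pscheduler expects:

```python
#!/usr/bin/env python
# Hypothetical stand-in for /usr/bin/iperf3.  Install it only after moving
# the real binary aside, e.g.:  mv /usr/bin/iperf3 /usr/bin/iperf3.real
# It appends every invocation to a file in /tmp, then hands over to the
# real binary so the test still runs.
import os
import sys
import time

with open("/tmp/iperf3-calls.log", "a") as log:
    log.write("%s argv=%r\n" % (time.ctime(), sys.argv[1:]))

real = "/usr/bin/iperf3.real"
if os.access(real, os.X_OK):
    # Replace this process with the real iperf3, preserving arguments.
    os.execv(real, [real] + sys.argv[1:])
```

If /tmp/iperf3-calls.log stays empty during a test window, the runner never even tried to start the tool, which is exactly what I observed for inbound tests.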
Next, I found the pid of the runner process on node A:
pschedu+ 739 0.2 1.1 1168772 89124 ? Ssl Aug31 0:45 /usr/bin/python /usr/lib/pscheduler/daemons/runner --daemon --pid-file /var/run/pscheduler-server/pscheduler-runner.pid --dsn @/etc/pscheduler/database/database-dsn
and attached strace -f -s 128 -p <pid>.
For inbound tests, I could see *no* exec taking place! Yet for outbound tests, I can see it happily cloning and exec'ing various processes, including /usr/bin/iperf3. So for some reason, pscheduler isn't even attempting to run the iperf3 -s process on node A.
/var/log/pscheduler/pscheduler.log just says:
Sep 1 16:29:24 perf1 runner INFO 6945: Running https://ns1.BBBB.com/pscheduler/tasks/f7e0b30d-e247-4507-b2eb-b9c641abf8a5/runs/59bf7b96-94a0-404e-9557-3ff1e62e2a76
Sep 1 16:29:24 perf1 runner INFO 6945: With iperf3: throughput --source ns1.BBBB.com --ip-version 6 --dest perf1.home.AAAA.net
Sep 1 16:29:24 perf1 runner INFO 6945: Run was preempted.
And looking through the python code, all I find for this error is
in /usr/lib/pscheduler/daemons/runner:
db = self._get_db_conn()
with db.cursor() as cursor:
    cursor.execute("SELECT run_can_proceed(%s)", [self.id])
    assert(cursor.rowcount == 1)
    if not cursor.fetchone()[0]:
        if log_details:
            self.log.info("%d: Run was preempted.", self.id)
But why does run_can_proceed not return true? Could it be a
database problem? /var/log/postgresql/postgresql-10-main.log only
shows:
--------
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler ERROR: duplicate key value violates unique constraint "heartbeat_pkey"
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler DETAIL: Key (name)=(scheduler) already exists.
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler CONTEXT: SQL statement "INSERT INTO heartbeat (name) VALUES ('ticker'), ('scheduler'), ('runner'), ('archiver')"
    PL/pgSQL function heartbeat_cold_boot() line 4 at SQL statement
    SQL statement "SELECT heartbeat_cold_boot()"
    PL/pgSQL function cold_boot() line 3 at PERFORM
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler STATEMENT: SELECT cold_boot()
2019-08-31 12:39:51.089 UTC [845] pscheduler@pscheduler LOG: could not receive data from client: Connection reset by peer
2019-08-31 12:39:51.090 UTC [846] pscheduler@pscheduler LOG: could not receive data from client: Connection reset by peer
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond ERROR: null value in column "source" violates not-null constraint
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond DETAIL: Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond STATEMENT: INSERT INTO "ps_p2p_subject" ("metadata_id", "tool_name", "source", "destination", "measurement_agent", "input_source", "input_destination") VALUES (8, 'pscheduler/iperf3', NULL, 'A:A:A:A::45'::inet, NULL, 'ns1.BBBB.com', 'A:A:A:A::45') RETURNING "ps_p2p_subject"."id"
2019-09-01 02:30:29.021 UTC [9364] esmond@esmond LOG: could not receive data from client: Connection reset by peer
--------
- but these are error messages from yesterday and the day before.
Does the stored procedure exist? Yes.
pscheduler=# select prosrc from pg_proc where proname='run_can_proceed';

BEGIN

    IF NOT EXISTS (SELECT * FROM run WHERE id = run_id)
    THEN
        RAISE EXCEPTION 'No such run.';
    END IF;

    RETURN NOT EXISTS (
        SELECT *
        FROM
            run run1
            JOIN task task1 ON task1.id = run1.task
            JOIN test test1 ON test1.id = task1.test
            JOIN scheduling_class scheduling_class1
                ON scheduling_class1.id = test1.scheduling_class
            JOIN run run2
                ON run2.times && run1.times
                AND run2.id <> run1.id
                AND run2.priority > run1.priority
                AND NOT run_state_is_finished(run2.state)
        WHERE
            run1.id = run_id
            AND NOT scheduling_class1.anytime
    );

END;
(1 row)
Looking at recent test runs, I can see the *scheduled* throughput
tests that I made between A and B:
pscheduler=# select run.id, task, times, state, errors, status
             from run join task on run.task=task.id
             join test on task.test=test.id
             where test.name='throughput' order by times desc limit 5;

  id  | task |                        times                        | state | errors | status
------+------+-----------------------------------------------------+-------+--------+--------
 6675 |   12 | ["2019-09-02 12:54:32+00","2019-09-02 12:55:01+00") |     1 |        |
 6939 |   16 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |     1 |        |
 5427 |   16 | ["2019-09-02 04:30:31+00","2019-09-02 04:31:00+00") |     1 |        |
 5170 |   12 | ["2019-09-02 04:23:37+00","2019-09-02 04:24:06+00") |     1 |        |
 5186 |   16 | ["2019-09-02 04:21:11+00","2019-09-02 04:21:40+00") |     1 |        |
(5 rows)
pscheduler=# select run_can_proceed(6675);
run_can_proceed
-----------------
t
(1 row)
pscheduler=# select run_can_proceed(6939);
run_can_proceed
-----------------
f
(1 row)
"select json from task where id=16;" shows that task 16 is the scheduled inbound test, the one which isn't taking place, bringing me back to the original problem.
Duplicating the query from run_can_proceed:
SELECT run1.id, run1.times, run1.priority, run1.state,
       run2.id, run2.times, run2.priority, run2.state
FROM
    run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
        ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2
        ON run2.times && run1.times
        AND run2.id <> run1.id
        AND run2.priority > run1.priority
        AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 6939
    AND NOT scheduling_class1.anytime;
  id  |                        times                        | priority | state |  id  |                        times                        | priority | state
------+-----------------------------------------------------+----------+-------+------+-----------------------------------------------------+----------+-------
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4224 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4221 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4832 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4828 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
(4 rows)
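To spell out my reading of run_can_proceed: a run is blocked when any other unfinished run overlaps it in time with a strictly higher priority. Here is a minimal Python sketch of that predicate, with my own field names (including a "finished" flag standing in for run_state_is_finished); this is an illustration, not pscheduler code:

```python
# Sketch of the preemption predicate as I understand it from the SQL above.
from datetime import datetime

def overlaps(a, b):
    # Half-open interval overlap, like PostgreSQL's tstzrange && operator.
    return a[0] < b[1] and b[0] < a[1]

def can_proceed(run, others):
    # A run proceeds unless some other unfinished, higher-priority run
    # overlaps its scheduled time range.
    return not any(
        overlaps(run["times"], other["times"])
        and other["priority"] > run["priority"]
        and not other["finished"]
        for other in others
    )

# Run 6939 (the failing inbound test) vs one of the day-long priority-5 runs:
run_6939 = {"times": (datetime(2019, 9, 2, 12, 45, 0),
                      datetime(2019, 9, 2, 12, 45, 29)),
            "priority": 0}
run_4224 = {"times": (datetime(2019, 9, 1, 12, 48, 24),
                      datetime(2019, 9, 2, 12, 48, 24)),
            "priority": 5, "finished": False}

print(can_proceed(run_6939, [run_4224]))  # False: preempted
```

If that reading is right, run 6939 is preempted simply because those day-long, priority-5 runs (4224, 4221, 4832, 4828) overlap its 29-second window and aren't in a finished state.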
OK, at this point I'm stumped. I think I've dug further into the innards of pscheduler than any end-user is really supposed to go. And all I've found is: both ad-hoc and scheduled tests are being dropped on the floor with the message "Run was preempted.", and I believe this is because run_can_proceed() is returning false.
Going back to what I said at the start: whilst I would appreciate
hints to help fix the specific problem here, I also think that
pscheduler could do a better job of reporting problems. If it
decides that it's not a good idea to launch a tool, I think it
should say *why* it has decided this. Or at least document this
error message: googling "+pscheduler run was preempted" turns up
nothing.
Regards,
Brian Candler.
(*) Actually, on node A, I did find a single instance of an error
in /var/log/esmond/django.log:
2019-08-31 21:57:34,486 [ERROR] /usr/lib/python2.7/dist-packages/django/core/handlers/exception.py: Internal Server Error: /esmond/perfsonar/archive/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 249, in _legacy_get_response
    response = self._get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/viewsets.py", line 95, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 494, in dispatch
    response = self.handle_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 454, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 491, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 876, in create
    return super(ArchiveViewset, self).create(request)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 26, in perform_create
    serializer.save()
  File "/usr/lib/python2.7/dist-packages/rest_framework/serializers.py", line 214, in save
    self.instance = self.create(validated_data)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 674, in create
    self.get_fields()[subject_model].Meta.model.objects.create(metadata=metadata, **subject)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 394, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 808, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 838, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 924, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 963, in _do_insert
    using=using, raw=raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 1079, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 1112, in execute_sql
    cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
IntegrityError: null value in column "source" violates not-null constraint
DETAIL: Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).
And a pscheduler error:
/var/log/perfsonar/psconfig-pscheduler-agent-transactions.log:
2019/08/31 15:27:50 ERROR guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e action="create" msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR

/var/log/perfsonar/psconfig-pscheduler-agent.log:
2019/08/31 15:28:24 WARN pid=887 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=227 guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR
But I think those are likely red herrings: quite a few scheduled throughput tests should have run by now, yet there was only a single instance of each of those errors, and they were at different timestamps too.