
perfsonar-user - [perfsonar-user] Problems debugging pscheduler: "Run was preempted."

  • From: Brian Candler <>
  • Subject: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."
  • Date: Sun, 1 Sep 2019 18:31:42 +0100

This started off as a request for help, but has ended up as a sort of meta bug report: besides wanting the underlying issue fixed, I would like to highlight how hard it is to get useful debugging information out of pscheduler.

The issue is between two nodes.  Node A runs perfsonar-toolkit 4.2 in an lxd container on Ubuntu 18.04; node B runs perfsonar-testpoint 4.1 on Debian.  Both nodes have NTP-synced clocks.

On node A, I configured a throughput test to B, with v6-only selected (since node A is behind NAT for IPv4).  The problem as seen in the GUI is that results are shown for the outbound direction, but nothing appears for the inbound direction (and no red dot for 'failed test' either).

Logs on A and B showed no obvious cause (*).

Firewall rules allow all traffic between A and B on IPv6.  I was able to run manual tests using iperf3 -s/-c, in both directions, without problems.
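
For the record, the manual check is simple: run "iperf3 -s" on one node and, on the other, something like the following (with -6 forcing IPv6), then swap the roles for the other direction:

# iperf3 -6 -c perf1.home.AAAA.net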

So I tried using pscheduler to run tests from the command line - and sure enough, I found that tests in the outbound direction worked, but tests in the inbound direction did not.

Here's the outbound test command which works:

# pscheduler task throughput -d ns1.BBBB.com -s perf1.home.AAAA.net --ip-version=6

And here's the inbound test command, which fails: the same command with -s and -d reversed, plus --debug.

--------

# pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6
2019-09-01T16:17:26 Debug started
2019-09-01T16:17:26 Assistance is from localhost
2019-09-01T16:17:26 Forcing default slip of PT5M
2019-09-01T16:17:26 Converting to spec via https://localhost/pscheduler/tests/throughput/spec
Submitting task...
2019-09-01T16:17:27 Fetching participant list
2019-09-01T16:17:27 Spec is: {"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}
2019-09-01T16:17:27 Params are: {'spec': '{"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}'}
2019-09-01T16:17:28 Got participants: {u'participants': [u'ns1.BBBB.com', u'perf1.home.AAAA.net']}
2019-09-01T16:17:28 Lead is ns1.BBBB.com
2019-09-01T16:17:28 Pinging https://ns1.BBBB.com/pscheduler/
2019-09-01T16:17:28 ns1.BBBB.com is up
2019-09-01T16:17:28 Posting task to https://ns1.BBBB.com/pscheduler/tasks
2019-09-01T16:17:28 Data is {'test': {'type': 'throughput', 'spec': {u'dest': u'perf1.home.AAAA.net', u'source': u'ns1.BBBB.com', u'ip-version': 6, u'schema': 1}}, 'schedule': {'slip': 'PT5M'}}
Task URL:
https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Posted https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Submission diagnostics:
2019-09-01T16:17:40   Hints:
2019-09-01T16:17:40     requester: A:A:A:A::45
2019-09-01T16:17:40     server: B:B:B:B:feff:ff:fe00:f16
2019-09-01T16:17:40   Identified as everybody
2019-09-01T16:17:40   Classified as default
2019-09-01T16:17:40   Application: Defaults applied to non-friendly hosts
2019-09-01T16:17:40     Group 1: Limit 'innocuous-tests' failed: Passed but inverted
2019-09-01T16:17:40     Group 1: Limit 'throughput-default-time' passed
2019-09-01T16:17:40     Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-09-01T16:17:40     Group 1: Want any, 1/3 passed, 2/3 failed: PASS
2019-09-01T16:17:40     Application PASSES
2019-09-01T16:17:40   Proposal meets limits
Running with tool 'iperf3'
Fetching first run...
2019-09-01T16:17:40 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first
2019-09-01T16:17:42 Handing off: pscheduler watch --first --debug https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:42 Debug started
2019-09-01T16:17:42 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:43 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first

Next scheduled run:
https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550
Starts 2019-09-01T17:17:50+01:00 (~6 seconds)
Ends   2019-09-01T17:18:09+01:00 (~18 seconds)
Waiting for result...

2019-09-01T16:18:14 Calling ['pscheduler', 'result', '--quiet', '--format', 'text/plain', '--debug', u'https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550']
Run did not complete: Failed


Diagnostics from ns1.BBBB.com:
  No diagnostics.

Error from ns1.BBBB.com:
  iperf3 returned an error: error - unable to connect to server: Connection refused

Diagnostics from perf1.home.AAAA.net:
  Run was preempted.

Error from perf1.home.AAAA.net:
  No error.
2019-09-01T16:18:15 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/next

No further runs scheduled.

--------

No useful indication of the problem, other than "Connection refused" and "Run was preempted."  Furthermore, running "ps ax | grep iperf3" on both nodes during the test window, I didn't see any iperf3 process running on either side.
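
A handy way to keep watching for this during the test window is something like the following, where the [i] in the pattern stops grep from matching its own process:

# watch -n1 'ps ax | grep [i]perf3'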

Running tcpdump on node A, I see an inbound TCP connection attempt which is refused, agreeing with the error above.

--------

root@perf1:~# tcpdump -i eth0 -nn tcp port 5201
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:01:40.220897 IP6 B:B:B:B:feff:ff:fe00:f16.52707 > A:A:A:A::45.5201: Flags [S], seq 3700706462, win 28800, options [mss 1440,sackOK,TS val 338337963 ecr 0,nop,wscale 7], length 0
17:01:40.221105 IP6 A:A:A:A::45.5201 > B:B:B:B:feff:ff:fe00:f16.52707: Flags [R.], seq 0, ack 3700706463, win 0, length 0

--------

More strangely: after replacing /usr/bin/iperf3 on node A with a shell script which logs to /tmp, I could see that iperf3 wasn't being run at all for inbound tests.  (But it *was* being run for tests in the outbound direction.)
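
The wrapper can be minimal, something along these lines (the iperf3.real path is purely illustrative; the real binary gets moved aside first):

--------

#!/bin/sh
# Log every invocation, then hand over to the real iperf3 binary
echo "$(date) invoked as: $0 $*" >> /tmp/iperf3-invocations.log
exec /usr/bin/iperf3.real "$@"

--------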

Next, I found the pid of the runner process on node A:

pschedu+   739  0.2  1.1 1168772 89124 ?       Ssl  Aug31   0:45 /usr/bin/python /usr/lib/pscheduler/daemons/runner --daemon --pid-file /var/run/pscheduler-server/pscheduler-runner.pid --dsn @/etc/pscheduler/database/database-dsn

and attached to it with "strace -f -s 128 -p <pid>".

For inbound tests, I could see *no exec* taking place!  And yet for outbound tests, I could see it happily cloning and exec'ing various processes, including /usr/bin/iperf3.  So for some reason, pscheduler isn't even attempting to start the iperf3 -s process on node A.
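
For anyone repeating this exercise, strace can be narrowed to report only process launches, which makes the missing exec stand out immediately:

# strace -f -e trace=execve -p <pid>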

/var/log/pscheduler/pscheduler.log just says:

Sep  1 16:29:24 perf1 runner INFO     6945: Running https://ns1.BBBB.com/pscheduler/tasks/f7e0b30d-e247-4507-b2eb-b9c641abf8a5/runs/59bf7b96-94a0-404e-9557-3ff1e62e2a76
Sep  1 16:29:24 perf1 runner INFO     6945: With iperf3: throughput --source ns1.BBBB.com --ip-version 6 --dest perf1.home.AAAA.net
Sep  1 16:29:24 perf1 runner INFO     6945: Run was preempted.

Looking through the Python code, the only place this message is produced is in /usr/lib/pscheduler/daemons/runner:

            db = self._get_db_conn()
            with db.cursor() as cursor:
                cursor.execute("SELECT run_can_proceed(%s)", [self.id])
                assert(cursor.rowcount == 1)
                if not cursor.fetchone()[0]:
                    if log_details:
                        self.log.info("%d: Run was preempted.",
                                      self.id)

But why would run_can_proceed not return true?  Could it be a database problem?  /var/log/postgresql/postgresql-10-main.log shows only:

--------

2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler ERROR:  duplicate key value violates unique constraint "heartbeat_pkey"
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler DETAIL:  Key (name)=(scheduler) already exists.
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler CONTEXT:  SQL statement "INSERT INTO heartbeat (name)
            VALUES
                ('ticker'),
                ('scheduler'),
                ('runner'),
                ('archiver')"
        PL/pgSQL function heartbeat_cold_boot() line 4 at SQL statement
        SQL statement "SELECT heartbeat_cold_boot()"
        PL/pgSQL function cold_boot() line 3 at PERFORM
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler STATEMENT:  SELECT cold_boot()
2019-08-31 12:39:51.089 UTC [845] pscheduler@pscheduler LOG:  could not receive data from client: Connection reset by peer
2019-08-31 12:39:51.090 UTC [846] pscheduler@pscheduler LOG:  could not receive data from client: Connection reset by peer
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond ERROR:  null value in column "source" violates not-null constraint
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond DETAIL:  Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond STATEMENT:  INSERT INTO "ps_p2p_subject" ("metadata_id", "tool_name", "source", "destination", "measurement_agent", "input_source", "input_destination") VALUES (8, 'pscheduler/iperf3', NULL, 'A:A:A:A::45'::inet, NULL, 'ns1.BBBB.com', 'A:A:A:A::45') RETURNING "ps_p2p_subject"."id"
2019-09-01 02:30:29.021 UTC [9364] esmond@esmond LOG:  could not receive data from client: Connection reset by peer

--------

- but these are error messages from yesterday and the day before.
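
To poke at the database directly on a Debian-family install, something like this should get you a superuser session in the pscheduler database:

# sudo -u postgres psql pscheduler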

Does the stored procedure exist?  Yes.

pscheduler=# select prosrc from pg_proc where proname='run_can_proceed';
                           prosrc
-------------------------------------------------------------
                                                            +
 BEGIN                                                      +
                                                            +
     IF NOT EXISTS (SELECT * FROM run WHERE id = run_id)    +
     THEN                                                   +
         RAISE EXCEPTION 'No such run.';                    +
     END IF;                                                +
                                                            +
     RETURN NOT EXISTS (                                    +
         SELECT *                                           +
                                                            +
         FROM                                               +
           run run1                                         +
           JOIN task task1 ON task1.id = run1.task          +
           JOIN test test1 ON test1.id = task1.test         +
           JOIN scheduling_class scheduling_class1 ON       +
               scheduling_class1.id = test1.scheduling_class+
           JOIN run run2 ON                                 +
               run2.times && run1.times                     +
               AND run2.id <> run1.id                       +
               AND run2.priority > run1.priority            +
               AND NOT run_state_is_finished(run2.state)    +
         WHERE                                              +
             run1.id = run_id                               +
             AND NOT scheduling_class1.anytime              +
     );                                                     +
                                                            +
 END;                                                       +

(1 row)


Looking at recent test runs, I can see the *scheduled* throughput tests that I made between A and B:

pscheduler=# select run.id,task,times,state,errors,status from run join task on run.task=task.id join test on task.test=test.id where test.name='throughput' order by times desc limit 5;
  id  | task |                        times                        | state | errors | status
------+------+-----------------------------------------------------+-------+--------+--------
 6675 |   12 | ["2019-09-02 12:54:32+00","2019-09-02 12:55:01+00") |     1 |        |
 6939 |   16 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |     1 |        |
 5427 |   16 | ["2019-09-02 04:30:31+00","2019-09-02 04:31:00+00") |     1 |        |
 5170 |   12 | ["2019-09-02 04:23:37+00","2019-09-02 04:24:06+00") |     1 |        |
 5186 |   16 | ["2019-09-02 04:21:11+00","2019-09-02 04:21:40+00") |     1 |        |
(5 rows)

pscheduler=# select run_can_proceed(6675);
 run_can_proceed
-----------------
 t
(1 row)

pscheduler=# select run_can_proceed(6939);
 run_can_proceed
-----------------
 f

(1 row)

"select json from task where id=16;" shows that this is the scheduled inbound test, which isn't taking place, bringing me back to the original problem.

In other words, run_can_proceed() returns false whenever some other unfinished run with a higher priority overlaps this run's time range (and the test's scheduling class isn't 'anytime').  Duplicating the query from run_can_proceed for run 6939:

SELECT run1.id,run1.times,run1.priority,run1.state,run2.id,run2.times,run2.priority,run2.state
FROM
  run run1
  JOIN task task1 ON task1.id = run1.task
  JOIN test test1 ON test1.id = task1.test
  JOIN scheduling_class scheduling_class1 ON
      scheduling_class1.id = test1.scheduling_class
  JOIN run run2 ON
      run2.times && run1.times
      AND run2.id <> run1.id
      AND run2.priority > run1.priority
      AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 6939
    AND NOT scheduling_class1.anytime;

  id  |                        times                        | priority | state |  id  |                        times                        | priority | state
------+-----------------------------------------------------+----------+-------+------+-----------------------------------------------------+----------+-------
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4224 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4221 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4832 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4828 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
(4 rows)
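
Whatever those four runs are, their time ranges each span a full day, so they overlap essentially everything else on the schedule.  A follow-up query along these lines (run ids taken from the output above) would show which tasks they belong to:

SELECT run.id, run.times, run.priority, task.json
FROM run JOIN task ON task.id = run.task
WHERE run.id IN (4224, 4221, 4832, 4828);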

OK, at this point I'm stumped.  I think I've dug further into the innards of pscheduler than any end-user is really supposed to.  And all I've found is that both ad-hoc and scheduled tests are being dropped on the floor with the message "Run was preempted.", and I believe this is because run_can_proceed() is returning false.

Going back to what I said at the start: whilst I would appreciate hints to help fix the specific problem here, I also think that pscheduler could do a better job of reporting problems.  If it decides that it's not a good idea to launch a tool, it should say *why* it has decided this.  Or at least document this error message: googling "+pscheduler run was preempted" turns up nothing.

Regards,

Brian Candler.


(*) Actually, on node A, I did find a single instance of an error in /var/log/esmond/django.log:

2019-08-31 21:57:34,486 [ERROR] /usr/lib/python2.7/dist-packages/django/core/handlers/exception.py: Internal Server Error: /esmond/perfsonar/archive/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 249, in _legacy_get_response
    response = self._get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/viewsets.py", line 95, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 494, in dispatch
    response = self.handle_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 454, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 491, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 876, in create
    return super(ArchiveViewset, self).create(request)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 26, in perform_create
    serializer.save()
  File "/usr/lib/python2.7/dist-packages/rest_framework/serializers.py", line 214, in save
    self.instance = self.create(validated_data)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 674, in create
    self.get_fields()[subject_model].Meta.model.objects.create(metadata=metadata, **subject)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 394, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 808, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 838, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 924, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 963, in _do_insert
    using=using, raw=raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 1079, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 1112, in execute_sql
    cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
IntegrityError: null value in column "source" violates not-null constraint
DETAIL:  Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).

And a psconfig-pscheduler-agent error:

/var/log/perfsonar/psconfig-pscheduler-agent-transactions.log:2019/08/31 15:27:50 ERROR guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e action="create" msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR
/var/log/perfsonar/psconfig-pscheduler-agent.log:2019/08/31 15:28:24 WARN pid=887 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=227 guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR

But I think those are likely red herrings: quite a few scheduled throughput tests should have run by now, yet each of those errors appeared only once, and at timestamps that don't even match each other.



