perfsonar-user - Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."

Subject: perfSONAR User Q&A and Other Discussion

  • From: Brian Candler <>
  • To: "" <>
  • Subject: Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."
  • Date: Thu, 5 Sep 2019 09:45:32 +0100

TL;DR: I have made some progress on this.  It looks like there is a bug in the exclusivity testing between throughput and latency tests, and/or in setting priorities.  There is a patch at the end which fixes the problem for me, and I'd be grateful if a developer could go through this.

-=-=-=-=-

Firstly, using "pscheduler schedule --filter-test throughput -PT6H", I was able to confirm that my inbound scheduled tests were also failing because they were preempted.

The run state "preempted" is missing from the list at https://docs.perfsonar.net/pscheduler_client_schedule.html#the-basics

But I was able to get a comprehensive list, together with the corresponding database codes, like this:

$ egrep '^CREATE|RETURN' pscheduler-server/pscheduler-server/database/run_state.sql
CREATE OR REPLACE FUNCTION run_state_pending()
RETURNS INTEGER
        RETURN 1;
CREATE OR REPLACE FUNCTION run_state_on_deck()
RETURNS INTEGER
        RETURN 2;
CREATE OR REPLACE FUNCTION run_state_running()
RETURNS INTEGER
        RETURN 3;
CREATE OR REPLACE FUNCTION run_state_cleanup()
RETURNS INTEGER
        RETURN 4;
CREATE OR REPLACE FUNCTION run_state_finished()
RETURNS INTEGER
        RETURN 5;
CREATE OR REPLACE FUNCTION run_state_overdue()
RETURNS INTEGER
        RETURN 6;
CREATE OR REPLACE FUNCTION run_state_missed()
RETURNS INTEGER
        RETURN 7;
CREATE OR REPLACE FUNCTION run_state_failed()
RETURNS INTEGER
        RETURN 8;
CREATE OR REPLACE FUNCTION run_state_preempted()
RETURNS INTEGER
        RETURN 9;
CREATE OR REPLACE FUNCTION run_state_nonstart()
RETURNS INTEGER
        RETURN 10;
CREATE OR REPLACE FUNCTION run_state_canceled()
RETURNS INTEGER
        RETURN 11;

This shows me that state 3 is "running" and state 9 is "preempted".
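
For anyone who wants the same lookup table in a script, here is a minimal sketch that turns grep output of this shape into a code-to-name mapping. It is only an illustration: the sample text is an abbreviated copy of the output above, and parse_run_states is a hypothetical helper, not part of pscheduler.

```python
import re

# Abbreviated copy of the grep output shown above (three states only).
GREP_OUTPUT = """\
CREATE OR REPLACE FUNCTION run_state_running()
RETURNS INTEGER
        RETURN 3;
CREATE OR REPLACE FUNCTION run_state_finished()
RETURNS INTEGER
        RETURN 5;
CREATE OR REPLACE FUNCTION run_state_preempted()
RETURNS INTEGER
        RETURN 9;
"""

def parse_run_states(text):
    """Map numeric run-state codes to names (hypothetical helper)."""
    states = {}
    name = None
    for line in text.splitlines():
        m = re.match(r"CREATE OR REPLACE FUNCTION run_state_(\w+)\(\)", line)
        if m:
            # Remember the state name until its RETURN line appears.
            name = m.group(1)
            continue
        # "RETURNS INTEGER" does not match: the pattern needs a space after RETURN.
        m = re.match(r"\s*RETURN (\d+);", line)
        if m and name is not None:
            states[int(m.group(1))] = name
            name = None
    return states

RUN_STATES = parse_run_states(GREP_OUTPUT)
print(RUN_STATES)  # {3: 'running', 5: 'finished', 9: 'preempted'}
```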

Now, let me try a test again:

# pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6

=> gives me task URL /pscheduler/tasks/fffc5e1b-426c-4dac-aad4-510d10a73bd2

=> fails (preempted)

Looking in the database:

pscheduler=# select id,test from task where uuid='fffc5e1b-426c-4dac-aad4-510d10a73bd2';
 id | test
----+------
 84 |    8

pscheduler=# select name,scheduling_class from test where id=8;
    name    | scheduling_class
------------+------------------
 throughput |                2
(1 row)

pscheduler=# select id,state from run where task=84;
  id   | state
-------+-------
 28849 |     9
(1 row)

From above, state 9 is "preempted" as expected.  Going back to the logic from run_can_proceed():

SELECT run2.id, run2.state, run1.times, run2.times
FROM
  run run1
  JOIN task task1 ON task1.id = run1.task
  JOIN test test1 ON test1.id = task1.test
  JOIN scheduling_class scheduling_class1 ON
      scheduling_class1.id = test1.scheduling_class
  JOIN run run2 ON
      run2.times && run1.times
      AND run2.id <> run1.id
      AND run2.priority > run1.priority
      AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 28849
    AND NOT scheduling_class1.anytime;

  id   | state |                        times                        |                        times
-------+-------+-----------------------------------------------------+-----------------------------------------------------
 22432 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 12:48:24+00","2019-09-05 12:48:24+00")
 22436 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 12:48:24+00","2019-09-05 12:48:24+00")
 23028 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 15:27:17+00","2019-09-05 15:27:17+00")
 23029 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 15:27:17+00","2019-09-05 15:27:17+00")
(4 rows)
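
The `&&` in the join condition is PostgreSQL's range-overlap operator. As a rough sketch of what it is testing here, the following models the two "times" columns from the first row as plain (start, end) tuples. It is a simplification: both ends are treated as inclusive, whereas PostgreSQL tracks the bound type of each end, and the names ts, overlaps and later_run are made up for this example.

```python
from datetime import datetime, timezone

def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

def overlaps(a, b):
    """True if ranges a and b, given as (start, end) tuples, share any instant.

    Simplification: every bound is treated as inclusive.
    """
    return a[0] <= b[1] and b[0] <= a[1]

# Values copied from the first row of the query output above.
run1_times = (ts("2019-09-05 07:32:45"), ts("2019-09-05 07:32:45"))
run2_times = (ts("2019-09-04 12:48:24"), ts("2019-09-05 12:48:24"))

# Hypothetical run that starts after run2's 24-hour window has closed.
later_run = (ts("2019-09-05 13:00:00"), ts("2019-09-05 13:00:20"))

print(overlaps(run1_times, run2_times))  # True
print(overlaps(later_run, run2_times))   # False
```

Any run scheduled inside a 24-hour window like run2's will overlap it, which is why all four rows match.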


So: it looks like there are four overlapping/conflicting tests in state "running".  Those tests have a 24 hour run window!  Surely they are not throughput tests?

pscheduler=# select run.id,task.id,test.id,test.name from run join task on run.task=task.id join test on task.test=test.id where run.id in (22432,22436,23028,23029);
  id   | id | id |   name
-------+----+----+-----------
 22432 | 71 |  3 | latencybg
 22436 | 72 |  3 | latencybg
 23028 | 73 |  3 | latencybg
 23029 | 74 |  3 | latencybg
(4 rows)

OK, I am starting to see the problem.  These four latencybg tests (which are correctly running) are for some reason being treated as conflicting with my throughput test, as if they were exclusive.  But I don't see any reason why this should be the case, unless it's something to do with priorities and/or scheduling classes.

I found some basic documentation on scheduling classes here: https://docs.perfsonar.net/pscheduler_ref_tests_tools.html#test-classifications

and the database values:

pscheduler=# select * from scheduling_class;
 id |     display      |       enum       | anytime | exclusive | multi_result
----+------------------+------------------+---------+-----------+--------------
  1 | Background Multi | background-multi | t       | f         | t
  4 | Background       | background       | t       | f         | f
  2 | Exclusive        | exclusive        | f       | t         | f
  3 | Normal           | normal           | f       | f         | f
(4 rows)

My attempted throughput test (run id 28849), with task 84 and test 8, has scheduling class "exclusive", which implies anytime=false.  According to the documentation:

Exclusive - These are tests that cannot run at the same time as any other exclusive or normal test. An example is a throughput test.

That sounds fine.  What about the latencybg test it is clashing with?

pscheduler=# select id,name,scheduling_class from test where id=3;
 id |   name    | scheduling_class
----+-----------+------------------
  3 | latencybg |                1
(1 row)

That's "background-multi" so it should not clash.  Why does the function run_can_proceed() not check for this??  The logic in that function doesn't even join the run2 task and test, so it doesn't take the other test's scheduling class into consideration at all.
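
To make that observation concrete, here is a small model of the class check as it appears in the query above. This is a sketch of my reading of the WHERE clause only, not the actual pscheduler function; SchedulingClass and class_check_as_observed are names invented for the example.

```python
from collections import namedtuple

SchedulingClass = namedtuple("SchedulingClass", "enum anytime exclusive")

# Two rows from the scheduling_class table above (multi_result omitted).
EXCLUSIVE = SchedulingClass("exclusive", anytime=False, exclusive=True)
BACKGROUND_MULTI = SchedulingClass("background-multi", anytime=True, exclusive=False)

def class_check_as_observed(class1, class2):
    """Mirror of the WHERE clause 'NOT scheduling_class1.anytime'.

    class2 is accepted but deliberately unused, because the query never
    joins run2 to its task, test or scheduling class.
    """
    return not class1.anytime

# throughput (exclusive) as run1, latencybg (background-multi) as run2
print(class_check_as_observed(EXCLUSIVE, BACKGROUND_MULTI))  # True
```

With run1 exclusive, the check passes whatever class run2 belongs to, so a background test can still count against it.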

That means I must be missing something here; surely it couldn't possibly be that broken.

What about priorities?

SELECT run1.id, run1.priority, run1.state, run2.id, run2.priority, run2.state
FROM
  run run1
  JOIN task task1 ON task1.id = run1.task
  JOIN test test1 ON test1.id = task1.test
  JOIN scheduling_class scheduling_class1 ON
      scheduling_class1.id = test1.scheduling_class
  JOIN run run2 ON
      run2.times && run1.times
      AND run2.id <> run1.id
      AND run2.priority > run1.priority
      AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 28849
    AND NOT scheduling_class1.anytime;

  id   | priority | state |  id   | priority | state
-------+----------+-------+-------+----------+-------
 28849 |        0 |     9 | 22432 |        5 |     3
 28849 |        0 |     9 | 22436 |        5 |     3
 28849 |        0 |     9 | 23028 |        5 |     3
 28849 |        0 |     9 | 23029 |        5 |     3
(4 rows)

For some reason, the throughput test I'm trying to run has a lower priority (0) than the background latency test (5).  But AFAICS that shouldn't matter given that they are not exclusive.

OK, let me ask another question.  If inbound throughput tests are blocked by latency tests, why aren't outbound throughput tests similarly blocked?

So I ran an outbound test, which was successful, giving me task uuid c160743a-f89a-4303-ba44-a4d9526ff8bf

pscheduler=# select id,test from task where uuid='c160743a-f89a-4303-ba44-a4d9526ff8bf';
 id | test
----+------
 86 |    8
(1 row)

pscheduler=# select id,state from run where task=86;
  id   | state
-------+-------
 28971 |     5
(1 row)

State 5 = Finished.  And it didn't clash with background latency tasks.  Why not?

SELECT run1.id, run1.priority, run1.state, run2.id, run2.priority, run2.state
FROM
  run run1
  JOIN task task1 ON task1.id = run1.task
  JOIN test test1 ON test1.id = task1.test
  JOIN scheduling_class scheduling_class1 ON
      scheduling_class1.id = test1.scheduling_class
  JOIN run run2 ON
      run2.times && run1.times
      AND run2.id <> run1.id
WHERE
    run1.id = 28971
    AND NOT scheduling_class1.anytime;

  id   | priority | state |  id   | priority | state
-------+----------+-------+-------+----------+-------
 28971 |        5 |     5 | 28972 |        0 |     5
 28971 |        5 |     5 | 28973 |        0 |     5
 28971 |        5 |     5 | 28974 |        0 |     5
 28971 |        5 |     5 | 22432 |        5 |     3
 28971 |        5 |     5 | 22436 |        5 |     3
 28971 |        5 |     5 | 23028 |        5 |     3
 28971 |        5 |     5 | 23029 |        5 |     3
(7 rows)

This time, the throughput test I'm trying to run has a priority of 5.  Since that is greater than or equal to the priority of the latency tests, and the join condition requires run2.priority to be strictly greater than run1.priority, those tests do not preempt it.
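
The effect of the strict comparison can be checked against the two result tables with a few lines of Python. This is only a sketch of the "run2.priority > run1.priority" condition; the priority pairs are copied from the output above and outranking_runs is a made-up helper name.

```python
# (run1 priority, run2 priority) pairs copied from the two result tables above.
inbound_pairs = [(0, 5), (0, 5), (0, 5), (0, 5)]
outbound_pairs = [(5, 0), (5, 0), (5, 0), (5, 5), (5, 5), (5, 5), (5, 5)]

def outranking_runs(pairs):
    """Count overlapping runs whose priority is strictly greater than run1's."""
    return sum(1 for run1_priority, run2_priority in pairs
               if run2_priority > run1_priority)

print(outranking_runs(inbound_pairs))   # 4
print(outranking_runs(outbound_pairs))  # 0
```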

But the *only* difference I made when submitting the tests was to swap the "-s" and "-d" arguments around.

I found a little documentation on priorities here: https://docs.perfsonar.net/config_pscheduler_limits.html#priorities-which-runs-happen-and-which-do-not

This leaves two questions in my mind.

(1) What sets the "priority" on runs of manually submitted tasks?  Why does my outbound throughput test have priority 5 and my inbound test have priority 0 ?

(2) Is it correct that a lower-priority run should always be preempted by a higher-priority run, even if the scheduling classes say that they should not conflict?

Considering point (2), the more I think about it, the more I think that the logic in run_can_proceed is broken.  It checks whether run1 has "anytime"=false (i.e. if it's "normal" or "exclusive"), but surely it should also ignore run2 tests which have "anytime"=true?

More accurately, I think it should test for run1 exclusive and run2 not anytime, and vice versa.  If I'm right, the logic should change like this:

diff --git a/pscheduler-server/pscheduler-server/database/run.sql b/pscheduler-server/pscheduler-server/database/run.sql
index abe5efc8..1df85c67 100644
--- a/pscheduler-server/pscheduler-server/database/run.sql
+++ b/pscheduler-server/pscheduler-server/database/run.sql
@@ -613,9 +613,14 @@ BEGIN
              AND run2.id <> run1.id
              AND run2.priority > run1.priority
              AND NOT run_state_is_finished(run2.state)
+         JOIN task task2 ON task2.id = run2.task
+         JOIN test test2 ON test2.id = task2.test
+         JOIN scheduling_class scheduling_class2 ON
+              scheduling_class2.id = test2.scheduling_class
        WHERE
            run1.id = run_id
-           AND NOT scheduling_class1.anytime
+           AND (  (scheduling_class1.exclusive AND NOT scheduling_class2.anytime)
+               OR (scheduling_class2.exclusive AND NOT scheduling_class1.anytime) )
     );

 END;

And that fixes the problem for me - yay!  If it's right, I'm happy to submit a PR.  But it raises the question: why on earth is nobody else affected by this problem?  Which makes me worry that I've completely misunderstood something.
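
As a sanity check on the patched condition, the sketch below evaluates it for every pairing of the four scheduling classes. It models only the new WHERE clause (in the real query it applies just to overlapping, unfinished, higher-priority runs), and the names SchedulingClass and patched_check are invented for the example.

```python
from collections import namedtuple
from itertools import product

SchedulingClass = namedtuple("SchedulingClass", "enum anytime exclusive")

# Rows from the scheduling_class table shown earlier (multi_result omitted).
CLASSES = [
    SchedulingClass("background-multi", True, False),
    SchedulingClass("background", True, False),
    SchedulingClass("exclusive", False, True),
    SchedulingClass("normal", False, False),
]

def patched_check(class1, class2):
    """Mirror of the patched WHERE clause from the diff above."""
    return ((class1.exclusive and not class2.anytime)
            or (class2.exclusive and not class1.anytime))

# Every (run1 class, run2 class) pairing that the patched clause treats as a conflict.
conflicts = {(a.enum, b.enum) for a, b in product(CLASSES, repeat=2)
             if patched_check(a, b)}
for pair in sorted(conflicts):
    print(pair)
# ('exclusive', 'exclusive')
# ('exclusive', 'normal')
# ('normal', 'exclusive')
```

Under this model no pairing involving a background class conflicts, which matches the documented description of exclusive tests quoted earlier.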

As for point (1), even if priority is only supposed to be used for conflicting tests, I still don't understand yet why the priority was being set differently for my inbound and outbound tests.  Maybe it's something to do with whether the test originates from the local host or not.

Regards,

Brian Candler.



