perfsonar-user - Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."

Subject: perfSONAR User Q&A and Other Discussion

From: Brian Candler <>
To: "" <>
Subject: Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."
Date: Thu, 5 Sep 2019 09:45:32 +0100

TL;DR: I have made some progress on this. It looks like there is a bug in the exclusivity testing between throughput and latency tests, and/or in setting priorities. There is a patch at the end which fixes the problem for me, and I'd be grateful if a developer could go through this.

-=-=-=-=-

Firstly, using "pscheduler schedule --filter-test throughput -PT6H", I was able to confirm that my inbound scheduled tests were also failing because they were preempted.

The run state "preempted" is missing from the list at https://docs.perfsonar.net/pscheduler_client_schedule.html#the-basics

But I was able to get a comprehensive list, together with the corresponding database codes, like this:

$ egrep '^CREATE|RETURN' pscheduler-server/pscheduler-server/database/run_state.sql
CREATE OR REPLACE FUNCTION run_state_pending()
RETURNS INTEGER
    RETURN 1;
CREATE OR REPLACE FUNCTION run_state_on_deck()
RETURNS INTEGER
    RETURN 2;
CREATE OR REPLACE FUNCTION run_state_running()
RETURNS INTEGER
    RETURN 3;
CREATE OR REPLACE FUNCTION run_state_cleanup()
RETURNS INTEGER
    RETURN 4;
CREATE OR REPLACE FUNCTION run_state_finished()
RETURNS INTEGER
    RETURN 5;
CREATE OR REPLACE FUNCTION run_state_overdue()
RETURNS INTEGER
    RETURN 6;
CREATE OR REPLACE FUNCTION run_state_missed()
RETURNS INTEGER
    RETURN 7;
CREATE OR REPLACE FUNCTION run_state_failed()
RETURNS INTEGER
    RETURN 8;
CREATE OR REPLACE FUNCTION run_state_preempted()
RETURNS INTEGER
    RETURN 9;
CREATE OR REPLACE FUNCTION run_state_nonstart()
RETURNS INTEGER
    RETURN 10;
CREATE OR REPLACE FUNCTION run_state_canceled()
RETURNS INTEGER
    RETURN 11;

This shows me that state 3 is "running" and state 9 is "preempted".
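For reading the query output further down, the codes from the listing above can be kept in a small lookup table. This is only a convenience sketch in Python: the names RUN_STATES and state_name() are made up for illustration and are not part of pScheduler, and the values are simply copied from the run_state.sql listing.

```python
# Run state codes copied from the run_state.sql listing above.
# RUN_STATES and state_name() are illustrative names, not pScheduler APIs.
RUN_STATES = {
    1: "pending",
    2: "on_deck",
    3: "running",
    4: "cleanup",
    5: "finished",
    6: "overdue",
    7: "missed",
    8: "failed",
    9: "preempted",
    10: "nonstart",
    11: "canceled",
}


def state_name(code: int) -> str:
    """Return the state name for a numeric code, or a placeholder if unknown."""
    return RUN_STATES.get(code, f"unknown({code})")


if __name__ == "__main__":
    for code in (3, 5, 9):
        print(code, state_name(code))
```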

Now, let me try a test again:

# pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6

=> gives me task URL /pscheduler/tasks/fffc5e1b-426c-4dac-aad4-510d10a73bd2

=> fails (preempted)

Looking in the database:

pscheduler=# select id,test from task where uuid='fffc5e1b-426c-4dac-aad4-510d10a73bd2';
 id | test
----+------
 84 |    8

pscheduler=# select name,scheduling_class from test where id=8;
    name    | scheduling_class
------------+------------------
 throughput |                2
(1 row)

pscheduler=# select id,state from run where task=84;
  id   | state
-------+-------
 28849 |     9
(1 row)

From above, state 9 is "preempted" as expected. Going back to the logic from run_can_proceed():

SELECT run2.id, run2.state, run1.times, run2.times
FROM
    run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
         ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2
         ON run2.times && run1.times
            AND run2.id <> run1.id
            AND run2.priority > run1.priority
            AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 28849
    AND NOT scheduling_class1.anytime;

  id   | state |                        times                        |                        times
-------+-------+-----------------------------------------------------+-----------------------------------------------------
 22432 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 12:48:24+00","2019-09-05 12:48:24+00")
 22436 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 12:48:24+00","2019-09-05 12:48:24+00")
 23028 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 15:27:17+00","2019-09-05 15:27:17+00")
 23029 |     3 | ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"] | ["2019-09-04 15:27:17+00","2019-09-05 15:27:17+00")
(4 rows)

So: it looks like there are four overlapping/conflicting tests in state "running". Those tests have a 24 hour run window! Surely they are not throughput tests?

pscheduler=# select run.id,task.id,test.id,test.name from run join task on run.task=task.id join test on task.test=test.id where run.id in (22432,22436,23028,23029);
  id   | id | id |   name
-------+----+----+-----------
 22432 | 71 |  3 | latencybg
 22436 | 72 |  3 | latencybg
 23028 | 73 |  3 | latencybg
 23029 | 74 |  3 | latencybg
(4 rows)

OK, I am starting to see the problem. These four latencybg tests (which are correctly running) are for some reason conflicting with my throughput tests, that is, considered as exclusive. But I don't see any reason why this should be the case, unless it's something to do with priorities and/or scheduling classes.
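To make the time-window part of the query concrete, here is a rough Python model of the overlap test. It assumes that "&&" in the query is PostgreSQL's range-overlap operator and that run.times is a timestamp range, which is what the bracket notation in the output suggests. The TimeRange class and the utc() helper are invented for this sketch; the timestamps are copied from the first result row.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimeRange:
    """A non-empty time range with an inclusive lower bound (illustrative only)."""
    lower: datetime
    upper: datetime
    upper_inclusive: bool

    def _within_upper(self, t: datetime) -> bool:
        # True if t does not lie beyond this range's upper bound.
        return t <= self.upper if self.upper_inclusive else t < self.upper

    def overlaps(self, other: "TimeRange") -> bool:
        # Two non-empty ranges share a point exactly when each one's
        # lower bound falls within the other's upper bound.
        return self._within_upper(other.lower) and other._within_upper(self.lower)


def utc(*args: int) -> datetime:
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Run 28849 (throughput): ["2019-09-05 07:32:45+00","2019-09-05 07:32:45+00"]
throughput_run = TimeRange(utc(2019, 9, 5, 7, 32, 45), utc(2019, 9, 5, 7, 32, 45), True)
# Run 22432 (latencybg): ["2019-09-04 12:48:24+00","2019-09-05 12:48:24+00")
latencybg_run = TimeRange(utc(2019, 9, 4, 12, 48, 24), utc(2019, 9, 5, 12, 48, 24), False)

print(throughput_run.overlaps(latencybg_run))  # prints True
```

Because a latencybg run holds a 24-hour window, almost any throughput run scheduled on the same host will overlap one of them in this sense.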

I found some basic documentation on scheduling classes here: https://docs.perfsonar.net/pscheduler_ref_tests_tools.html#test-classifications

and the database values:

pscheduler=# select * from scheduling_class;
 id |     display      |       enum       | anytime | exclusive | multi_result
----+------------------+------------------+---------+-----------+--------------
  1 | Background Multi | background-multi | t       | f         | t
  4 | Background       | background       | t       | f         | f
  2 | Exclusive        | exclusive        | f       | t         | f
  3 | Normal           | normal           | f       | f         | f
(4 rows)

My attempted throughput test (run id 28849), with task 84 and test 8, has scheduling class "exclusive", which implies anytime=false. According to the documentation:

Exclusive - These are tests that cannot run at the same time as any other exclusive or normal test. An example is a throughput test.

That sounds fine. What about the latencybg test it is clashing with?

pscheduler=# select id,name,scheduling_class from test where id=3;
 id |   name    | scheduling_class
----+-----------+------------------
  3 | latencybg |                1
(1 row)

That's "background-multi" so it should not clash. Why does the function run_can_proceed() not check for this?? The logic in that function doesn't even join the run2 task and test, so it doesn't take the other test's scheduling class into consideration at all.

That means I must be missing something here; surely it couldn't possibly be that broken.
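The conditions of the run_can_proceed() query quoted earlier can be modelled in a few lines of Python to show where the other run's scheduling class drops out. This is a sketch under assumptions, not the real implementation: the SchedulingClass and Run types are invented here, the time ranges are taken to overlap already, and the "finished" flag stands in for run_state_is_finished(), whose definition is not shown in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingClass:
    name: str
    anytime: bool
    exclusive: bool


@dataclass(frozen=True)
class Run:
    id: int
    scheduling_class: SchedulingClass
    priority: int
    finished: bool = False  # stand-in for run_state_is_finished(state)


# Flags copied from the scheduling_class table above.
BACKGROUND_MULTI = SchedulingClass("background-multi", anytime=True, exclusive=False)
EXCLUSIVE = SchedulingClass("exclusive", anytime=False, exclusive=True)


def conflicts_as_written(run1: Run, run2: Run) -> bool:
    """Mirror the query's conditions, assuming the two time ranges overlap."""
    return (
        run2.id != run1.id
        and run2.priority > run1.priority
        and not run2.finished
        and not run1.scheduling_class.anytime
        # run2.scheduling_class is never consulted.
    )


throughput = Run(28849, EXCLUSIVE, priority=0)
latencybg = Run(22432, BACKGROUND_MULTI, priority=5)

print(conflicts_as_written(throughput, latencybg))  # prints True
```

With these inputs the background-multi run counts as a conflict purely because of its higher priority, which matches the four rows returned earlier.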

What about priorities?

SELECT run1.id, run1.priority, run1.state, run2.id, run2.priority, run2.state
FROM
    run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
         ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2
         ON run2.times && run1.times
            AND run2.id <> run1.id
            AND run2.priority > run1.priority
            AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 28849
    AND NOT scheduling_class1.anytime;

  id   | priority | state |  id   | priority | state
-------+----------+-------+-------+----------+-------
 28849 |        0 |     9 | 22432 |        5 |     3
 28849 |        0 |     9 | 22436 |        5 |     3
 28849 |        0 |     9 | 23028 |        5 |     3
 28849 |        0 |     9 | 23029 |        5 |     3
(4 rows)

For some reason, the throughput test I'm trying to run has a lower priority (0) than the background latency test (5). But AFAICS that shouldn't matter given that they are not exclusive.

OK, let me ask another question. If inbound throughput tests are blocked by latency tests, why aren't outbound throughput tests similarly blocked?

So I ran an outbound test, which was successful, giving me task uuid c160743a-f89a-4303-ba44-a4d9526ff8bf

pscheduler=# select id,test from task where uuid='c160743a-f89a-4303-ba44-a4d9526ff8bf';
 id | test
----+------
 86 |    8
(1 row)

pscheduler=# select id,state from run where task=86;
  id   | state
-------+-------
 28971 |     5
(1 row)

State 5 = Finished. And it didn't clash with the background latency tasks. Why not?

SELECT run1.id, run1.priority, run1.state, run2.id, run2.priority, run2.state
FROM
    run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
         ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2
         ON run2.times && run1.times
            AND run2.id <> run1.id
WHERE
    run1.id = 28971
    AND NOT scheduling_class1.anytime;

  id   | priority | state |  id   | priority | state
-------+----------+-------+-------+----------+-------
 28971 |        5 |     5 | 28972 |        0 |     5
 28971 |        5 |     5 | 28973 |        0 |     5
 28971 |        5 |     5 | 28974 |        0 |     5
 28971 |        5 |     5 | 22432 |        5 |     3
 28971 |        5 |     5 | 22436 |        5 |     3
 28971 |        5 |     5 | 23028 |        5 |     3
 28971 |        5 |     5 | 23029 |        5 |     3
(7 rows)

This time, the throughput test I'm trying to run has a priority of 5. Since that is not lower than the priority of the latency tests (also 5), those tests are not preempting it.

But the *only* difference I made when submitting the tests was to swap the "-s" and "-d" arguments around.
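The effect of the strict "run2.priority > run1.priority" comparison can be checked against the seven overlapping rows listed above. The Python sketch below applies only that one filter; the state filter is left out because the definition of run_state_is_finished() is not shown here, and the function name higher_priority_ids() is invented for illustration. Reusing the same rows for a priority-0 run is hypothetical, since the inbound run overlapped a slightly different set of runs.

```python
# (run id, priority, state) of the runs overlapping run 28971,
# copied from the query output above.
overlapping_runs = [
    (28972, 0, 5),
    (28973, 0, 5),
    (28974, 0, 5),
    (22432, 5, 3),
    (22436, 5, 3),
    (23028, 5, 3),
    (23029, 5, 3),
]


def higher_priority_ids(run1_priority, runs):
    """IDs of runs passing the strict `run2.priority > run1.priority` filter."""
    return [run_id for run_id, priority, _state in runs if priority > run1_priority]


# Outbound run 28971 has priority 5: nothing outranks it.
print(higher_priority_ids(5, overlapping_runs))  # prints []

# A run with priority 0, like inbound run 28849, is outranked by all four
# latencybg runs.
print(higher_priority_ids(0, overlapping_runs))  # prints [22432, 22436, 23028, 23029]
```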

I found a little documentation on priorities here: https://docs.perfsonar.net/config_pscheduler_limits.html#priorities-which-runs-happen-and-which-do-not

This leaves two questions in my mind.

(1) What sets the "priority" on runs of manually submitted tasks? Why does my outbound throughput test have priority 5 and my inbound test have priority 0 ?

(2) Is it correct that a lower-priority run should always be preempted by a higher-priority run, even if the scheduling classes say that they should not conflict?

Considering point (2), the more I think about it, the more I think that the logic in run_can_proceed is broken. It checks whether run1 has "anytime"=false (i.e. if it's "normal" or "exclusive"), but surely it should also ignore run2 tests with "anytime"?

More accurately, I think it should test for run1 exclusive and run2 not anytime, and vice versa. If I'm right, the logic should change like this:

diff --git a/pscheduler-server/pscheduler-server/database/run.sql b/pscheduler-server/pscheduler-server/database/run.sql
index abe5efc8..1df85c67 100644
--- a/pscheduler-server/pscheduler-server/database/run.sql
+++ b/pscheduler-server/pscheduler-server/database/run.sql
@@ -613,9 +613,14 @@ BEGIN
             AND run2.id <> run1.id
             AND run2.priority > run1.priority
             AND NOT run_state_is_finished(run2.state)
+        JOIN task task2 ON task2.id = run2.task
+        JOIN test test2 ON test2.id = task2.test
+        JOIN scheduling_class scheduling_class2 ON
+            scheduling_class2.id = test2.scheduling_class
     WHERE
         run1.id = run_id
-        AND NOT scheduling_class1.anytime
+        AND ( (scheduling_class1.exclusive AND NOT scheduling_class2.anytime)
+              OR (scheduling_class2.exclusive AND NOT scheduling_class1.anytime) )
     );
 
 END;

And that fixes the problem for me - yay! If it's right, I'm happy to submit a PR. But it raises the question: why on earth is nobody else affected by this problem? Which makes me worry that I've completely misunderstood something.
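As a sanity check on the proposed condition, the new WHERE clause can be evaluated for every pair of scheduling classes. The Python sketch below models only the class test from the patch (overlap, priority and state are assumed to be handled by the unchanged parts of the query). The CLASSES dictionary and the conflicts_patched() name are invented for this illustration; the flag values are copied from the scheduling_class table earlier in this message.

```python
# name -> (anytime, exclusive), copied from the scheduling_class table.
CLASSES = {
    "background-multi": (True, False),
    "background": (True, False),
    "exclusive": (False, True),
    "normal": (False, False),
}


def conflicts_patched(class1: str, class2: str) -> bool:
    """Class test from the patch: one run exclusive, the other not 'anytime'."""
    anytime1, exclusive1 = CLASSES[class1]
    anytime2, exclusive2 = CLASSES[class2]
    return (exclusive1 and not anytime2) or (exclusive2 and not anytime1)


if __name__ == "__main__":
    # Print the full conflict matrix for inspection.
    for c1 in CLASSES:
        for c2 in CLASSES:
            print(f"{c1:17} vs {c2:17}: {conflicts_patched(c1, c2)}")
```

Under this model an exclusive run conflicts with exclusive and normal runs but not with background or background-multi runs, which agrees with the documented description of the Exclusive class quoted earlier.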

As for point (1), even if priority is only supposed to be used for conflicting tests, I still don't understand yet why the priority was being set differently for my inbound and outbound tests. Maybe it's something to do with whether the test originates from the local host or not.

Regards,

Brian Candler.
