perfsonar-user - Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."
Subject: perfSONAR User Q&A and Other Discussion
- From: Brian Candler <>
- To: "" <>
- Subject: Re: [perfsonar-user] Problems debugging pscheduler: "Run was preempted."
- Date: Thu, 5 Sep 2019 09:45:32 +0100
TL;DR: I have made some progress on this. It looks like there is
a bug in the exclusivity testing between throughput and latency
tests, and/or in setting priorities. There is a patch at the end
which fixes the problem for me, and I'd be grateful if a developer
could go through this.

-=-=-=-=-

Firstly, I was also able to demonstrate, using
"pscheduler schedule --filter-test throughput -PT6H", that my inbound
scheduled tests were also definitely failing because they were
preempted.

The run state "preempted" is missing from the list at
https://docs.perfsonar.net/pscheduler_client_schedule.html#the-basics

But I was able to get a comprehensive list, together with the
corresponding database codes, like this:

    $ egrep '^CREATE|RETURN' pscheduler-server/pscheduler-server/database/run_state.sql

This shows me that state 3 is "running" and state 9 is "preempted".

Now, let me try a test again:

    # pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6

=> gives me task URL /pscheduler/tasks/fffc5e1b-426c-4dac-aad4-510d10a73bd2
=> fails (preempted)

Looking in the database:

     id | test
    ----+------
     84 |    8

    pscheduler=# select name,scheduling_class from test where id=8;

    pscheduler=# select id,state from run where task=84;

From above, state 9 is "preempted" as expected.

Going back to the logic from run_can_proceed():

    SELECT run2.id, run2.state, run1.times, run2.times

     id | state | times | times

So: it looks like there are four overlapping/conflicting tests in
state "running". Those tests have a 24 hour run window! Surely they
are not throughput tests?

    pscheduler=# select run.id,task.id,test.id,test.name from run
                 join task on run.task=task.id
                 join test on task.test=test.id
                 where run.id in (22432,22436,23028,23029);

OK, I am starting to see the problem. These four latencybg tests
(which are correctly running) are for some reason conflicting with
my throughput tests, that is, they are being treated as exclusive.

But I don't see any reason why this should be the case, unless it's
something to do with priorities and/or scheduling classes. I found
some basic documentation on scheduling classes here:
https://docs.perfsonar.net/pscheduler_ref_tests_tools.html#test-classifications

and the database values:

    pscheduler=# select * from scheduling_class;

My attempted throughput test (run id 28849), with task 84 and test
8, has scheduling class "exclusive", which implies anytime=false.
According to the documentation:

    Exclusive - These are tests that cannot run at the same time as
    any other exclusive or normal test. An example is a throughput
    test.

That sounds fine. What about the latencybg test it is clashing with?

    pscheduler=# select id,name,scheduling_class from test where id=3;

That's "background-multi", so it should not clash. Why does the
function run_can_proceed() not check for this? The logic in that
function doesn't even join the run2 task and test, so it doesn't
take the other test's scheduling class into consideration at all.
That means I must be missing something here; surely it couldn't
possibly be that broken.

What about priorities?

    SELECT run1.id, run1.priority, run1.state, run2.id, run2.priority, run2.state

     id | priority | state | id | priority | state

For some reason, the throughput test I'm trying to run has a lower
priority (0) than the background latency test (5). But AFAICS that
shouldn't matter given that they are not exclusive.

OK, let me ask another question. If inbound throughput tests are
blocked by latency tests, why aren't outbound throughput tests
similarly blocked?

So I ran an outbound test, which was successful, giving me task
uuid c160743a-f89a-4303-ba44-a4d9526ff8bf

    pscheduler=# select id,test from task where uuid='c160743a-f89a-4303-ba44-a4d9526ff8bf';
     id | test
    ----+------
     86 |    8
    (1 row)

    pscheduler=# select id,state from run where task=86;
      id   | state
    -------+-------
     28971 |     5
    (1 row)

State 5 = Finished. And it didn't clash with the background latency
tasks. Why not?

    SELECT run1.id, run1.priority, run1.state,
           run2.id, run2.priority, run2.state
    FROM run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
         ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2 ON run2.times && run1.times AND run2.id <> run1.id
    WHERE run1.id = 28971
    AND NOT scheduling_class1.anytime;

This time, the throughput test I'm trying to run has a priority of
5. And since that's greater than or equal to the priority of the
latency tests, those tests are not pre-empting it.

But the *only* difference I made when submitting the tests was to
swap the "-s" and "-d" arguments around.

I found a little documentation on priorities here:
https://docs.perfsonar.net/config_pscheduler_limits.html#priorities-which-runs-happen-and-which-do-not

This leaves two questions in my mind.

(1) What sets the "priority" on runs of manually submitted tasks?
Why does my outbound throughput test have priority 5 and my inbound
test have priority 0?

(2) Is it correct that a lower-priority run should always be
preempted by a higher-priority run, even if the scheduling classes
say that they should not conflict?

Considering point (2), the more I think about it, the more I think
that the logic in run_can_proceed is broken. It checks whether run1
has "anytime"=false (i.e. if it's "normal" or "exclusive"), but
surely it should also ignore run2 tests with "anytime"? More
accurately, I think it should test for run1 exclusive and run2 not
anytime, and vice versa.

If I'm right, the logic should change like this:

    diff --git a/pscheduler-server/pscheduler-server/database/run.sql b/pscheduler-server/pscheduler-server/database/run.sql

And that fixes the problem for me - yay! If it's right, I'm
happy to submit a PR. But it begs the question: why on earth is
nobody else affected by this problem? Which makes me worry that
I've completely misunderstood something.

As for point (1), even if priority is only supposed to be used
for conflicting tests, I still don't understand yet why the
priority was being set differently for my inbound and outbound
tests. Maybe it's something to do with whether the test
originates from the local host or not.

Regards,

Brian Candler.
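
To make the scheduling-class argument from point (2) easier to follow, here is a minimal Python sketch of the two overlap rules discussed in this message. It is not the pscheduler code: the class flags are assumptions inferred from the text (exclusive and normal have anytime=false, background-multi has anytime=true), the function names are hypothetical, priorities are ignored, and both functions assume the two runs' time ranges already overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingClass:
    """Simplified stand-in for a row of the scheduling_class table."""
    name: str
    anytime: bool    # True: may run at any time alongside other runs
    exclusive: bool  # True: must not share time with non-anytime runs


# Flag values are assumptions inferred from the message, not read
# from a real pscheduler database.
EXCLUSIVE = SchedulingClass("exclusive", anytime=False, exclusive=True)
NORMAL = SchedulingClass("normal", anytime=False, exclusive=False)
BACKGROUND_MULTI = SchedulingClass("background-multi", anytime=True,
                                   exclusive=False)


def conflicts_as_observed(run1: SchedulingClass,
                          run2: SchedulingClass) -> bool:
    """Rule as described in the message: only run1's class is consulted.

    run2's class is deliberately ignored, mirroring the observation
    that the query never joins run2's task and test.
    """
    return not run1.anytime


def conflicts_as_proposed(run1: SchedulingClass,
                          run2: SchedulingClass) -> bool:
    """Proposed rule: one run is exclusive and the other is not anytime."""
    return ((run1.exclusive and not run2.anytime)
            or (run2.exclusive and not run1.anytime))
```

Under these assumptions, an exclusive throughput run overlapping a background-multi latencybg run counts as a conflict under the first rule but not under the second, while exclusive versus normal still conflicts in either order.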