perfsonar-user - Re: [perfsonar-user] Not scheduling tests reliably (again)
Subject: perfSONAR User Q&A and Other Discussion
- From: Mark Feit <>
- To: Casey Russell <>, "" <>
- Subject: Re: [perfsonar-user] Not scheduling tests reliably (again)
- Date: Wed, 29 Aug 2018 21:09:03 +0000
Casey Russell writes:

> The 8 nodes in the mesh will just sporadically refuse to schedule some tests. Right now it appears to be primarily throughput tests. I end up with a bunch of "non-starting" tests in pscheduler, and logs like the ones below in pscheduler.log
>
> Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run
>
> As you can see, the misbehaving host is ps-wsu-bw. It just suddenly begins to believe that most of the other hosts in the mesh have "no time available" for a test. If I run a test manually, to one of the affected hosts, things seem to be fine (maybe it was a short term problem?).

When pScheduler gets a task that repeats, it’s going to start scheduling runs out to 24 hours, and as time passes it will schedule more. If you’re seeing no-time errors, it means you have times when there are more tests to run than the system can find time to schedule without breaking any of the rules. You will see this with throughput tests because they’re the only test that we schedule to have exclusive use of the system while they’re running. Running a dozen traces at the same time isn’t an issue.

The log message says it was trying to schedule something tomorrow at 14:28, which tells me that’s where the congestion is. If you run a test right now and there’s no congestion, it’ll run just fine. There could stand to be more information in that message, so I’ve opened a ticket to expand on that: https://github.com/perfsonar/pscheduler/issues/668.

In general, the best thing you can do for your tasks to make sure they get scheduled is add as much slip as you can tolerate. I’d actually recommend against using random slip unless you have a measurement-related reason to use it. With it turned on, runs will be scattered within the slip interval and could result in fragmentation that leaves gaps too small to squeeze in a measurement. With it off, pScheduler will stack up tests one after the other as early as they can be scheduled and any available time within the slip interval is one big blob at the end. There is no default slip for tasks submitted through the API; the CLI sets it to PT5M if none is explicitly provided.

Andy and I just had a short discussion about pSConfig, and he’ll follow up with his thoughts on that. If you could forward us a copy of your mesh configuration off-list, we’ll have a look at it.

pScheduler has a little-known command called plot-schedule that can be used to produce a visualization of what the schedule looks like as a PNG. Having just run it against your system, I suspect it may be buggy. (There’s also a dependency-related problem on Debian systems where it doesn’t get a recent-enough version of Gnuplot.) I’ll take a quick look at that and see if I can make it work correctly and will send you a plot of that host’s schedule at around the time where the congestion seems to be. You can also use the schedule command (see “pscheduler schedule --help”) to look at the information textually.

HTH.

--Mark
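
The point about random slip and fragmentation can be illustrated with a toy model. The sketch below is not pScheduler's scheduling algorithm and uses none of its code; the window length, run duration, run count, and the helper names free_gaps and schedule are all made up for illustration. It places equal-length exclusive runs into a single slip window, either stacked at the earliest free time or at a random free position, and reports how many runs fit and what free time is left.

```python
import random


def free_gaps(busy, window):
    """Return the free (start, end) intervals in [0, window) around busy ones."""
    gaps, cursor = [], 0
    for start, end in sorted(busy):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window:
        gaps.append((cursor, window))
    return gaps


def schedule(num_runs, duration, window, randomize, rng):
    """Place up to num_runs exclusive runs; return the busy intervals."""
    busy = []
    for _ in range(num_runs):
        # Only gaps long enough to hold a whole run are candidates.
        fits = [(s, e) for s, e in free_gaps(busy, window) if e - s >= duration]
        if not fits:
            break  # toy analogue of "no time available for this run"
        if randomize:
            # Random slip: any gap that fits, any start position inside it.
            s, e = rng.choice(fits)
            start = rng.randint(s, e - duration)
        else:
            # No random slip: stack at the earliest free time.
            start = fits[0][0]
        busy.append((start, start + duration))
    return busy


# Hypothetical numbers: a 120-minute window, four 25-minute exclusive runs.
WINDOW, DURATION, REQUESTED = 120, 25, 4

rng = random.Random(2018)
stacked = schedule(REQUESTED, DURATION, WINDOW, False, rng)
scattered = schedule(REQUESTED, DURATION, WINDOW, True, rng)

print("stacked:  ", len(stacked), "placed, free:", free_gaps(stacked, WINDOW))
print("scattered:", len(scattered), "placed, free:", free_gaps(scattered, WINDOW))
```

In this model, stacking places all four runs back to back and leaves the remaining 20 minutes as one block at the end of the window. Random placement can never fit more runs than stacking does, and depending on the draws it may fit fewer because the free time is split into pieces shorter than one run.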