
perfsonar-user - Re: [perfsonar-user] Not scheduling tests reliably (again)

Subject: perfSONAR User Q&A and Other Discussion



  • From: Mark Feit
  • To: Casey Russell
  • Subject: Re: [perfsonar-user] Not scheduling tests reliably (again)
  • Date: Wed, 29 Aug 2018 21:09:03 +0000

Casey Russell writes:

 

     The 8 nodes in the mesh will just sporadically refuse to schedule some tests.  Right now it appears to be primarily throughput tests.  I end up with a bunch of "non-starting" tests in pscheduler, and logs like the ones below in pscheduler.log

 

Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run

 

     As you can see, the misbehaving host is ps-wsu-bw.  It just suddenly begins to believe that most of the other hosts in the mesh have "no time available" for a test.  If I run a test manually, to one of the affected hosts, things seem to be fine (maybe it was a short term problem?).

 

When pScheduler gets a task that repeats, it’s going to start scheduling runs out to 24 hours, and as time passes it will schedule more.  If you’re seeing no-time errors, it means you have times when there are more tests to run than the system can find time to schedule without breaking any of the rules.  You will see this with throughput tests because they’re the only tests we schedule to have exclusive use of the system while they’re running.  Running a dozen traces at the same time isn’t an issue.  The log message says it was trying to schedule something tomorrow at 14:28 UTC (2018-08-30T14:28:09Z), which tells me that’s where the congestion is.  If you run a test right now and there’s no congestion, it’ll run just fine.  There could stand to be more information in that message, so I’ve opened a ticket to expand on that:  https://github.com/perfsonar/pscheduler/issues/668.
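
One way to see where the congestion sits is to tally the "non-starting" messages by the hour they were aimed at.  The following is a minimal sketch, assuming the log lines look like the one quoted above; the regular expression and the function name congestion_by_hour are illustrative and not part of pScheduler.

```python
import re
from collections import Counter

# Matches the message format quoted earlier in this thread.  The pattern is an
# assumption based on that single sample line, not a documented log format.
PATTERN = re.compile(
    r"Posting non-starting run at (?P<when>\S+) "
    r"for task (?P<task>[0-9a-f-]+): "
    r"(?P<host>\S+) has no time available"
)


def congestion_by_hour(lines):
    """Count no-time messages per (hour, remote host)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hour = match.group("when")[:13]  # e.g. "2018-08-30T14"
            counts[(hour, match.group("host"))] += 1
    return counts


sample = [
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26599: "
    "Posting non-starting run at 2018-08-30T14:28:09Z for task "
    "1a869753-f827-44bc-abb5-d0186075a482: "
    "ps-washburn-bw.perfsonar.kanren.net has no time available for this run",
]

for (hour, host), count in sorted(congestion_by_hour(sample).items()):
    print(hour, host, count)
```

Feeding it the contents of pscheduler.log instead of the sample list should show which hours and which remote hosts account for most of the failures.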

 

In general, the best thing you can do for your tasks to make sure they get scheduled is add as much slip (how far past its scheduled start time a run is allowed to begin) as you can tolerate.  I’d actually recommend against using random slip unless you have a measurement-related reason to use it.  With it turned on, runs will be scattered within the slip interval and could result in fragmentation that leaves gaps too small to squeeze in a measurement.  With it off, pScheduler will stack up tests one after the other as early as they can be scheduled, and any available time within the slip interval is one big blob at the end.
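
The fragmentation effect can be shown with a toy model.  This is only a sketch under simplifying assumptions: time is in arbitrary units, every run needs exclusive use of the host, and earliest_fit and fill are hypothetical helpers rather than pScheduler's actual algorithm.  The scattered start times are fixed by hand so the result is reproducible.

```python
def earliest_fit(busy, earliest, slip, duration):
    """Return the earliest start in [earliest, earliest + slip] whose run of
    the given duration overlaps no (start, end) interval in busy, or None."""
    start = earliest
    for busy_start, busy_end in sorted(busy):
        if start + duration <= busy_start:
            break                  # the run fits before this busy interval
        if busy_end > start:
            start = busy_end       # overlap: push the run past the interval
    return start if start <= earliest + slip else None


def fill(busy, earliest, slip, duration):
    """Keep adding runs until one no longer fits; return the starts added."""
    busy = list(busy)
    added = []
    while True:
        start = earliest_fit(busy, earliest, slip, duration)
        if start is None:
            return added
        busy.append((start, start + duration))
        added.append(start)


# Stacked: each run takes the earliest free time, leaving one gap at the end.
stacked = fill([], earliest=0, slip=300, duration=100)

# Scattered: two runs already placed mid-window, as random slip might do.
scattered_existing = [(50, 150), (200, 300)]
extra = fill(scattered_existing, earliest=0, slip=300, duration=100)

print(len(stacked))                          # 4 runs fit when stacked
print(len(scattered_existing) + len(extra))  # 3 runs fit when scattered
```

In the scattered case the gaps at 0-50 and 150-200 are too short for a 100-unit run, so the same window holds one fewer measurement.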

 

There is no default slip for tasks submitted through the API; the CLI sets it to PT5M if none is explicitly provided.  Andy and I just had a short discussion about pSconfig, and he’ll follow up with his thoughts on that.  If you could forward us a copy of your mesh configuration off-list, we’ll have a look at it.
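
For anyone submitting tasks through the API, a small sketch of applying the same default the CLI uses might look like the following.  The field names ("schedule", "slip", "repeat") reflect my understanding of the pScheduler task format and should be checked against the documentation for your version; the destination host and the with_default_slip helper are made up for illustration.

```python
import copy
import json


def with_default_slip(task, default="PT5M"):
    """Return a copy of the task with schedule.slip set to the default
    (an ISO 8601 duration) when no slip was given explicitly."""
    result = copy.deepcopy(task)
    schedule = result.setdefault("schedule", {})
    schedule.setdefault("slip", default)
    return result


# Hypothetical repeating throughput task with no slip specified.
task = {
    "schema": 1,
    "test": {
        "type": "throughput",
        "spec": {"schema": 1, "dest": "ps-dest.example.net"},
    },
    "schedule": {"repeat": "PT6H"},
}

print(json.dumps(with_default_slip(task), indent=2))
```

A task that already carries its own slip value is left alone, so a larger slip chosen on purpose is never overwritten.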

 

pScheduler has a little-known command called plot-schedule that can be used to produce a visualization of what the schedule looks like as a PNG.  Having just run it against your system, I suspect it may be buggy.  (There’s also a dependency-related problem on Debian systems where it doesn’t get a recent-enough version of Gnuplot.)  I’ll take a quick look at that and see if I can make it work correctly and will send you a plot of that host’s schedule at around the time where the congestion seems to be.  You can also use the schedule command (see “pscheduler schedule --help”) to look at the information textually.

 

HTH.

 

--Mark

 



