perfsonar-user - [perfsonar-user] Re: Not scheduling tests reliably (again)

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To:
  • Subject: [perfsonar-user] Re: Not scheduling tests reliably (again)
  • Date: Fri, 31 Aug 2018 09:26:03 -0500

Group,

     It's generally considered good form to come back to a mailing list (or forum) and post the solution you found, to help others who search the archive later on.  I did find my solution, so here it is in case it helps others.  If you are having terribly inconsistent scheduling results in your mesh AND you are seeing a lot of errors like the one below in the pscheduler.log files on the testing host, this may help you as well:

Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run

I badly misunderstood two options in the old-style mesh config file (the Apache-style version; for some of you it is now the pSConfig JSON version), and this misunderstanding led to my problem.  When setting up your tests, particularly bandwidth tests, make sure your schedule allows for "slip".  Slip tells your hosts: "if you are trying to schedule the (exclusive) bandwidth test, but that time is taken, allow pscheduler to slip forward a bit and try again."  (That's my own paraphrasing.)
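
To make the idea of slip concrete, here is a toy model in Python. It is only a sketch of the paraphrase above and is NOT pscheduler's actual scheduling algorithm; the function name `find_start` and the example times are made up for illustration.

```python
from datetime import datetime, timedelta

def find_start(preferred, duration, slip, busy):
    """Toy model only, not pscheduler's real algorithm.

    Return the earliest start time in [preferred, preferred + slip] at which
    a run of length `duration` avoids every (start, end) interval in `busy`,
    or None if no such start exists.
    """
    candidate = preferred
    for busy_start, busy_end in sorted(busy):
        if candidate + duration <= busy_start:
            break                   # the run fits before this busy interval
        if candidate < busy_end:
            candidate = busy_end    # overlap: slip to the end of the busy interval
    return candidate if candidate - preferred <= slip else None

# Hypothetical example: the preferred slot is taken for the next 30 seconds.
t0 = datetime(2018, 8, 29, 9, 28, 9)
busy = [(t0, t0 + timedelta(seconds=30))]
run = timedelta(seconds=10)

print(find_start(t0, run, timedelta(0), busy))           # no slip allowed
print(find_start(t0, run, timedelta(minutes=30), busy))  # 30 minutes of slip
```

In this model the first call finds no usable start (the equivalent of a "non-starting" run), while the second simply starts 30 seconds late.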

Here's what my fixed version of the KanREN.json published mesh config file looks like today.

   "schedules" : {
      "schedule_0" : {
         "repeat" : "PT600S",
         "slip" : "PT30M"
      },
      "schedule_1" : {
         "repeat" : "PT14400S",
         "slip" : "PT30M"
      },
      "schedule_2" : {
         "repeat" : "PT7200S",
         "slip" : "PT30M"
      },
      "schedule_3" : {
         "repeat" : "PT1800S",
         "slip" : "PT30M"
      },
      "schedule_4" : {
         "repeat" : "PT28800S",
         "slip" : "PT30M"
      }
   },

Here's what it looked like when it was broken:

   "schedules" : {
      "schedule_0" : {
         "repeat" : "PT600S",
         "sliprand" : true
      },
      "schedule_1" : {
         "repeat" : "PT14400S",
         "slip" : "PT14400S",
         "sliprand" : true
      },
      "schedule_2" : {
         "repeat" : "PT7200S",
         "sliprand" : true
      },
      "schedule_3" : {
         "repeat" : "PT1800S",
         "sliprand" : true
      },
      "schedule_4" : {
         "repeat" : "PT28800S",
         "slip" : "PT28800S",
         "sliprand" : true
      }
   },

You can see that not all of my schedules had a "slip" time specified.  That's because I interpreted "sliprand" to mean "you can still slip, just randomize how far forward you jump", and I presumed you didn't necessarily need a "slip" boundary (time limit) when you used "sliprand".  At any rate, as I understand it now, the best practice is to give your schedules/tests as much slip as you can tolerate, and leave the "sliprand" option alone unless you really know what you're doing with it.
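
If you want to check your own published mesh file for the same mistake, a small Python sketch along these lines may help. It only encodes the rule of thumb above (every schedule should have a "slip", and "sliprand" deserves a second look); it is not an official perfSONAR validator, and the function name `check_schedules` is made up for this example.

```python
import json

def check_schedules(config):
    """Return warnings for schedules that lack "slip" or enable "sliprand"."""
    warnings = []
    for name, sched in sorted(config.get("schedules", {}).items()):
        if "slip" not in sched:
            warnings.append(f"{name}: no slip set")
        if sched.get("sliprand"):
            warnings.append(f"{name}: sliprand is enabled")
    return warnings

# Two of the schedules from the broken config shown above.
broken = json.loads("""
{"schedules": {
   "schedule_0": {"repeat": "PT600S", "sliprand": true},
   "schedule_1": {"repeat": "PT14400S", "slip": "PT14400S", "sliprand": true}
}}
""")

for warning in check_schedules(broken):
    print(warning)
```

To run it against a real file, load the JSON with `json.load()` from wherever your mesh config is published and pass the result to `check_schedules`.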

Thanks to Mark and Andy for the help.

Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047



On Wed, Aug 29, 2018 at 3:11 PM Casey Russell <> wrote:
Group,

     Over the summer, we upgraded the hardware (CPU and memory) on all 8 of our nodes, installed them fresh with CentOS 7 and perfSONAR 4.0, and rebuilt our mesh with the new pSConfig tools a few weeks ago when 4.1 came out.

     For a few glorious weeks (when all the nodes were upgraded, but before the 4.1 upgrades) I had a green dashboard and thought all was well with the world.  I can't say for sure it was the introduction of 4.1, but something in the last 2 weeks has put me right back where I was before when I thought my primary problem was underpowered nodes.  

     The 8 nodes in the mesh will just sporadically refuse to schedule some tests.  Right now it appears to be primarily throughput tests.  I end up with a bunch of "non-starting" tests in pscheduler, and logs like the ones below in pscheduler.log

Aug 29 09:28:09 ps-wsu-bw journal: runner INFO     10012256: With iperf3: throughput --bandwidth 920000000 --duration PT10S --source ps-wsu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ku-bw.perfsonar.kanren.net --source-node ps-wsu-bw.perfsonar.kanren.net --dest-node ps-ku-bw.perfsonar.kanren.net --udp
Aug 29 09:28:11 ps-wsu-bw journal: runner WARNING  10012256: Starting 0:00:02.632591 later than scheduled
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26644: Posting non-starting run at 2018-08-30T14:28:09Z for task f6954127-5cab-4279-a61a-269c095e7426: ps-esu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26643: Posting non-starting run at 2018-08-30T14:28:09Z for task 06fb5795-bbe8-4d5e-8c5b-7696e42637db: ps-ku-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26595: Posting non-starting run at 2018-08-30T14:28:09Z for task 31a885b9-54c5-46ca-b1ec-c1935e13058e: ps-ksu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26596: Posting non-starting run at 2018-08-30T14:28:09Z for task 61482a12-3ecc-4a68-a241-49906390f7b7: ps-ku-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26645: Posting non-starting run at 2018-08-30T14:28:09Z for task 047cb255-b64a-4be5-89b6-2b4a1062a924: ps-psu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26598: Posting non-starting run at 2018-08-30T14:28:09Z for task 93f8a914-9b8a-4d93-8088-12b5b0f2b647: ps-psu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26642: Posting non-starting run at 2018-08-30T14:28:09Z for task 611ab2f6-6b46-47ce-9e9c-f6c2e00c1387: ps-ksu-bw.perfsonar.kanren.net has no time available for this run

     As you can see, the misbehaving host is ps-wsu-bw.  It just suddenly begins to believe that most of the other hosts in the mesh have "no time available" for a test.  If I run a test manually, to one of the affected hosts, things seem to be fine (maybe it was a short term problem?).
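
To see which remote hosts are being reported as busy most often, the log lines above can be tallied with a short Python sketch like the following. The regular expression is written against the message format shown in this log excerpt only; other pscheduler versions may word the message differently, and the function name `tally_no_time` is made up for this example.

```python
import re
from collections import Counter

# Matches the "non-starting run" messages exactly as they appear in the excerpt above.
NO_TIME = re.compile(
    r"Posting non-starting run at \S+ for task \S+: (\S+) has no time available"
)

def tally_no_time(lines):
    """Count 'no time available' messages per remote host."""
    counts = Counter()
    for line in lines:
        match = NO_TIME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run",
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26643: Posting non-starting run at 2018-08-30T14:28:09Z for task 06fb5795-bbe8-4d5e-8c5b-7696e42637db: ps-ku-bw.perfsonar.kanren.net has no time available for this run",
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO     26596: Posting non-starting run at 2018-08-30T14:28:09Z for task 61482a12-3ecc-4a68-a241-49906390f7b7: ps-ku-bw.perfsonar.kanren.net has no time available for this run",
]

for host, count in tally_no_time(sample).most_common():
    print(count, host)
```

The function takes any iterable of lines, so an open file handle for your own pscheduler.log can be passed in directly.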


     The web interface no longer tells me what percentage of the time throughput tests will be running, but my mesh config (I think) seems sane for these hosts.  Looking at a bandwidth graph (10s resolution) shows lots of dead time for the bandwidth interfaces on these boxes.

     I suppose it could be that for just a very short duration, there is no time available, especially if the hosts get synced up and are all pulling their mesh configs and trying to schedule their tests at roughly the same time.  I just went in today and added a slip (and sliprand) to all of my schedules in the mesh to see if that helps.  Does anyone have any idea what else I should look for?  Have you seen this before?

I'm happy to share any other info or logs if you want them.  

Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047



