perfsonar-user - [perfsonar-user] Not scheduling tests reliably (again)
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To:
- Subject: [perfsonar-user] Not scheduling tests reliably (again)
- Date: Wed, 29 Aug 2018 15:11:37 -0500
Group,
Over the summer, we upgraded the hardware (CPU and memory) on all 8 of our nodes and installed them fresh with CentOS 7 and PS 4.0. A few weeks ago, when 4.1 came out, we rebuilt our mesh with the new pSConfig tools.
For a few glorious weeks (when all the nodes were upgraded, but before the 4.1 upgrades) I had a green dashboard and thought all was well with the world. I can't say for sure it was the introduction of 4.1, but something in the last 2 weeks has put me right back where I was before, when I thought my primary problem was underpowered nodes.
The 8 nodes in the mesh will just sporadically refuse to schedule some tests. Right now it appears to be primarily throughput tests. I end up with a bunch of "non-starting" tests in pscheduler, and logs like the ones below in pscheduler.log:
Aug 29 09:28:09 ps-wsu-bw journal: runner INFO 10012256: Running https://ps-wsu-bw.perfsonar.kanren.net/pscheduler/tasks/1e3e12f4-2d58-4097-86e2-dc0b014cb964/runs/6425a5f3-df6f-4e5a-870c-5dbdbd835f3d
Aug 29 09:28:09 ps-wsu-bw journal: runner INFO 10012256: With iperf3: throughput --bandwidth 920000000 --duration PT10S --source ps-wsu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ku-bw.perfsonar.kanren.net --source-node ps-wsu-bw.perfsonar.kanren.net --dest-node ps-ku-bw.perfsonar.kanren.net --udp
Aug 29 09:28:11 ps-wsu-bw journal: runner WARNING 10012256: Starting 0:00:02.632591 later than scheduled
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26599: Posting non-starting run at 2018-08-30T14:28:09Z for task 1a869753-f827-44bc-abb5-d0186075a482: ps-washburn-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26644: Posting non-starting run at 2018-08-30T14:28:09Z for task f6954127-5cab-4279-a61a-269c095e7426: ps-esu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26643: Posting non-starting run at 2018-08-30T14:28:09Z for task 06fb5795-bbe8-4d5e-8c5b-7696e42637db: ps-ku-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26595: Posting non-starting run at 2018-08-30T14:28:09Z for task 31a885b9-54c5-46ca-b1ec-c1935e13058e: ps-ksu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26596: Posting non-starting run at 2018-08-30T14:28:09Z for task 61482a12-3ecc-4a68-a241-49906390f7b7: ps-ku-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26645: Posting non-starting run at 2018-08-30T14:28:09Z for task 047cb255-b64a-4be5-89b6-2b4a1062a924: ps-psu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26598: Posting non-starting run at 2018-08-30T14:28:09Z for task 93f8a914-9b8a-4d93-8088-12b5b0f2b647: ps-psu-bw.perfsonar.kanren.net has no time available for this run
Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26642: Posting non-starting run at 2018-08-30T14:28:09Z for task 611ab2f6-6b46-47ce-9e9c-f6c2e00c1387: ps-ksu-bw.perfsonar.kanren.net has no time available for this run
As you can see, the misbehaving host is ps-wsu-bw. It just suddenly begins to believe that most of the other hosts in the mesh have "no time available" for a test. If I manually run a test to one of the affected hosts, things seem to be fine (maybe it was a short-term problem?).
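For anyone wanting to see which remote hosts are most often reported as having "no time available", the log lines above can be tallied with a short script. This is only a sketch: the regular expression is inferred from the sample lines in this message rather than from any documented pScheduler log format, and `count_non_starting` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in this message (an assumption,
# not a documented format).
NON_STARTING = re.compile(
    r"Posting non-starting run at (?P<when>\S+) for task (?P<task>[0-9a-f-]+): "
    r"(?P<host>\S+) has no time available for this run"
)


def count_non_starting(lines):
    """Return a Counter mapping remote host -> number of non-starting runs."""
    counts = Counter()
    for line in lines:
        match = NON_STARTING.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# A few lines copied from the log excerpt above.
sample = [
    "Aug 29 09:28:11 ps-wsu-bw journal: runner WARNING 10012256: Starting 0:00:02.632591 later than scheduled",
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26644: Posting non-starting run at 2018-08-30T14:28:09Z for task f6954127-5cab-4279-a61a-269c095e7426: ps-esu-bw.perfsonar.kanren.net has no time available for this run",
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26643: Posting non-starting run at 2018-08-30T14:28:09Z for task 06fb5795-bbe8-4d5e-8c5b-7696e42637db: ps-ku-bw.perfsonar.kanren.net has no time available for this run",
    "Aug 29 09:28:37 ps-wsu-bw journal: scheduler INFO 26596: Posting non-starting run at 2018-08-30T14:28:09Z for task 61482a12-3ecc-4a68-a241-49906390f7b7: ps-ku-bw.perfsonar.kanren.net has no time available for this run",
]

counts = count_non_starting(sample)
for host, n in counts.most_common():
    print(host, n)
```

To run it against a real log, pass an open file object (which iterates line by line) to `count_non_starting` in place of `sample`.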
The web interface no longer tells me what percentage of the time throughput tests will be running, but my mesh config (I think) seems sane for these hosts. Looking at a bandwidth graph (10s resolution) shows lots of dead time for the bandwidth interfaces on these boxes.
I suppose it could be that for just a very short duration, there is no time available, especially if the hosts get synced up and are all pulling their mesh configs and trying to schedule their tests at roughly the same time. I just went in today and added a slip (and sliprand) to all of my schedules in the mesh to see if that helps. Does anyone have any idea what else I should look for? Have you seen this before?
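The reasoning behind adding slip can be illustrated with a toy model. The sketch below is not pScheduler's scheduling algorithm; it assumes a single exclusive resource, runs of equal length, and a greedy first-come placement, and `place_runs` is a hypothetical helper. It shows only that when several runs want the same start time, a larger allowed slip lets more of them be placed instead of being marked non-starting. It does not model the randomization that sliprand adds.

```python
def place_runs(preferred_starts, duration, slip):
    """Greedily place exclusive runs on one resource (toy model).

    preferred_starts: preferred start times in seconds.
    duration: length of each run in seconds.
    slip: how far (seconds) a run may start after its preferred time.
    Returns (started, non_starting): actual start times of placed runs and
    the preferred times of runs that could not be placed.
    """
    started = []
    non_starting = []
    next_free = float("-inf")  # the resource is free until the first run
    for preferred in sorted(preferred_starts):
        start = max(preferred, next_free)
        if start - preferred <= slip:
            started.append(start)
            next_free = start + duration
        else:
            non_starting.append(preferred)
    return started, non_starting


# Seven runs (one per other mesh host) all wanting to start at t=0,
# each lasting 10 seconds.
same_time = [0] * 7

for slip in (0, 30, 60):
    started, non_starting = place_runs(same_time, duration=10, slip=slip)
    print(f"slip={slip}s: {len(started)} started, {len(non_starting)} non-starting")
```

In this model, no slip leaves six of the seven runs unplaced, while a 60-second slip is enough to fit all of them back to back.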
I'm happy to share any other info or logs if you want them.
- [perfsonar-user] Not scheduling tests reliably (again), Casey Russell, 08/29/2018
- Re: [perfsonar-user] Not scheduling tests reliably (again), Mark Feit, 08/29/2018
- [perfsonar-user] Re: Not scheduling tests reliably (again), Casey Russell, 08/31/2018