
perfsonar-user - Re: [perfsonar-user] Re: Tests not running periodically



  • From: Andrew Lake <>
  • To: "" <>, Casey Russell <>
  • Subject: Re: [perfsonar-user] Re: Tests not running periodically
  • Date: Tue, 29 Aug 2017 10:27:12 -0400

Hi Casey,

Thanks for all the info. It indeed looks like you are hitting some timeout issue. For reference, the broken run for the test you shared is actually https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/f4c1d01f-08f0-4e24-a92b-9298a223516c/runs/58d5fed2-cbb2-403c-95b8-a5d1c5188047. The run you shared completed successfully, but that was the IPv4 task, which looked fine at the time; what was broken then was the IPv6 task. It does not appear to be an IPv4 vs. IPv6 issue, as there are currently IPv4 tasks broken in the same way as well (e.g. https://ps-washburn-lt.perfsonar.kanren.net/pscheduler/tasks/0cdcb1a2-5f07-4d1a-b6cc-6a8ea0af317c/runs/44bd3736-8f06-438e-b48f-5ee66eeff910).

As you noted, after 24 hours they generally appear to fix themselves. This is because the perfsonar-meshconfig-agent, the program responsible for creating the tasks in pscheduler, creates each task with an end time 24 hours in the future and then recreates the task 24 hours later. The OWAMP tests run using the tool "powstream" are a bit special in how they get scheduled. Since they run all the time, there is an initial run that pscheduler puts on the schedule that's really just a placeholder to indicate there is a task that should be running. Subsequent runs are actually posted by powstream as it gets results. If for whatever reason that initial run fails, powstream never gets started. Currently, the meshconfig-agent doesn't detect when this happens, so when it breaks like this the task will sit broken for 24 hours until the agent creates a new task. In your case there are occasionally timeouts while trying to schedule the task, likely due to a busy host. In summary, there are two things going on here:
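To make the lifecycle concrete, here is a minimal sketch of classifying a run record like the ones behind the /tasks/<uuid>/runs/<uuid> URLs above. The state names are assumptions based on what the runs API typically reports, so treat this as illustrative rather than an exact match for your version:

```python
# Minimal sketch: classify a pscheduler run record (the JSON returned by a
# /pscheduler/tasks/<task>/runs/<run> URL). The "state" values below are
# assumptions based on typical pscheduler run states; check your own output.

def run_is_broken(run: dict) -> bool:
    """Return True if this run died in a way that leaves a powstream task
    without its placeholder run (so no results until the task is recreated)."""
    state = run.get("state", "")
    # Dead states mean the run never produced results; "pending" and
    # "running" are healthy for a long-lived powstream placeholder run.
    return state in {"nonstart", "failed", "missed"}

sample = {"state": "nonstart", "errors": "participant data unavailable"}
print(run_is_broken(sample))
```

A healthy placeholder run (state "running") would come back False from the same check.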

1. Your host is too busy for some reason, at least at certain times, which is causing timeouts. I am not sure why 4.0.1 would make this worse; in our testing we actually decreased CPU usage a bit by adding some bulk requests to the mesh-config, reducing the I/O it was doing. Are these hosts tight on memory? Perhaps something is going to swap that wasn't before? That's just a guess, and it's of course possible something slipped in other areas that we missed, but unfortunately I don't know of anything "new" that would cause this behavior. I do notice that your hosts appear to be writing to both a local and a central archive. That is a perfectly reasonable thing to do, but archiving is the most CPU-intensive operation we currently do (though that is not new as of 4.0.1). We have a fix coming in 4.0.2 that changes how we spawn archiving processes and is showing some pretty significant performance gains, but that is still a month or more away from beta, so not much good to you in the immediate term.

2. The other problem is that the meshconfig-agent does not detect this situation, so it takes 24 hours to fix itself. It's not a "bug" in the sense that the behavior is by design, but it is something we need to figure out how to do better without hurting performance in other ways. Again, not a solution for the immediate term, unfortunately.
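Until the agent can do this itself, the detection logic it would need is fairly small. Here is a sketch only, with assumed state names, of the decision a watchdog (for example, a cron job polling the runs API) would make before tearing down and re-posting a task:

```python
# Sketch of the watchdog decision described above: given the states of all
# runs for a powstream task, decide whether the task should be recreated
# rather than waiting out the 24-hour task lifetime. State names are
# assumptions; adapt them to what your pscheduler actually reports.

DEAD_STATES = {"failed", "nonstart", "missed"}

def needs_recreation(run_states: list) -> bool:
    """True when every run the task has is dead, i.e. the initial
    placeholder run broke and powstream never started posting results."""
    return bool(run_states) and all(s in DEAD_STATES for s in run_states)

print(needs_recreation(["nonstart"]))  # broken placeholder run
print(needs_recreation(["running"]))   # healthy powstream task
```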

Sorry this got kind of long, and double sorry I don't have an immediate answer. We do appreciate you taking the time to dig through the API to try to figure this out, but it looks like you hit a tricky problem.

Thanks,
Andy






On August 28, 2017 at 3:19:34 PM, Casey Russell () wrote:

Sorry, it might be helpful to see my maddash grid in case you'd like to see other failed tests in the grid.



Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Mon, Aug 28, 2017 at 2:17 PM, Casey Russell <> wrote:
Group,

     I've held off sending this to the group, because I was determined I was going to solve this one myself. However, the beginning of the semester is upon us and I just haven't had the time to devote. So here it goes: I'm asking for help. This seems similar to, but perhaps not the same as, Mark Maciolek's current thread, but since I'm not certain, I didn't tie it to that thread.

     I've got an entire mesh that has started randomly failing tests. What I mean by that is this: each day about 1/6 to 1/4 of the tests in the mesh will fail to run. When a test between two hosts fails, it always fails at the same time of day, stays failed until exactly the same time the next day, then starts working again. And at that same time, a random sampling of other tests in the mesh will fail (because my hosts hate me, apparently).

     The first instance of failures I can find happened just after the 16th of August, and my hosts are running auto updates, which is why I keyed in on Mark's post. The failures start/swap each day just after noon. When I use the API to look at what failed with a run, I see either a very generic "participant-data-full" error or (paraphrasing here) a "participant data unavailable" timeout sort of error.
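For reference, this is roughly how I've been pulling the error text out of a run record; a small sketch, and the "state-display" and "errors" field names are assumptions from the records I happened to inspect:

```python
# Hypothetical helper: summarize why a run failed, given the JSON body from
# a pscheduler /tasks/<uuid>/runs/<uuid> URL. The "state-display" and
# "errors" keys are assumptions based on inspected run records.

def summarize_failure(run: dict) -> str:
    """One-line summary of a run's state and any recorded error text."""
    state = run.get("state-display") or run.get("state", "unknown")
    errors = run.get("errors") or "(no error text recorded)"
    return f"{state}: {errors}"

sample = {
    "state": "nonstart",
    "state-display": "Non-Starter",
    "errors": "participant data unavailable",
}
print(summarize_failure(sample))
```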

Here are some reference URLs

First instance I can find (just after the 16th of August)



It seems like something is busy, or the API is temporarily unavailable, when the host (or hosts) are pulling the new Mesh and scheduling tests, but I've run dry trying to figure out how to troubleshoot the individual pieces of that.

Sincerely,
Casey Russell
Network Engineer
KanREN
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047



