perfsonar-user - Re: [perfsonar-user] Re: Tests not running periodically
Subject: perfSONAR User Q&A and Other Discussion
- From: Andrew Lake <>
- To: "" <>, Casey Russell <>
- Subject: Re: [perfsonar-user] Re: Tests not running periodically
- Date: Tue, 29 Aug 2017 10:27:12 -0400
Hi Casey,

Thanks for all the info. It does indeed look like you are hitting some timeout issue. For reference, it looks like the broken run for the test you shared is actually https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/f4c1d01f-08f0-4e24-a92b-9298a223516c/runs/58d5fed2-cbb2-403c-95b8-a5d1c5188047. The run you shared appears to have completed successfully, but it belonged to the IPv4 task, which looked good at the time; what was broken then was the IPv6 task. It does not appear to be an IPv4 vs. IPv6 issue, as there are currently IPv4 tasks broken in the same way as well (e.g. https://ps-washburn-lt.perfsonar.kanren.net/pscheduler/tasks/0cdcb1a2-5f07-4d1a-b6cc-6a8ea0af317c/runs/44bd3736-8f06-438e-b48f-5ee66eeff910).

As you noted, after 24 hours they generally appear to fix themselves. This is because the perfsonar-meshconfig-agent, the program responsible for creating the tasks in pscheduler, creates the tasks with an end time 24 hours in the future. It will then recreate the task again in 24 hours.

The OWAMP tests run using the tool “powstream” are a bit special in how they get scheduled. Since they run all the time, there is an initial run that is put on the schedule by pscheduler that’s really just a placeholder to indicate there is a task that should be running. Subsequent runs are actually posted by powstream as it gets results. If for whatever reason that initial run fails, then powstream never gets started. Currently, the mesh-config doesn't detect when this happens, so when it breaks like this the task will sit broken for up to 24 hours until mesh-config creates a new task. It looks like in your case there are occasionally timeouts trying to schedule the task, likely due to a busy host.

In summary, there are two things going on here:

1. Your host is too busy for some reason, at least at certain times, which is causing timeouts. I am not sure why 4.0.1 would make this worse; we actually decreased the CPU usage a bit in our testing by adding some bulk requests to the mesh-config, reducing the I/O it was doing. Are these hosts tight on memory? Perhaps this is causing something to go to swap that wasn’t before? That’s just a guess, and it’s of course possible something slipped up in other areas that we missed, but unfortunately I don’t know of anything “new” that would cause this behavior. I do notice that your hosts appear to be writing to both a local and a central archive. This is a perfectly reasonable thing to do, but archiving is currently the most CPU-intensive operation we do, and that is not new as of 4.0.1. We actually have a fix for this coming in 4.0.2 that changes how we spawn archiving processes, and it is showing some pretty significant performance gains, but this is still a month+ away from beta (so not much good to you in the immediate term).

2. The other problem is that the meshconfig-agent is not detecting this failure, so it takes up to 24 hours to fix. It’s not a “bug” in the sense that the behavior is unexpected, but it is something we need to figure out how to do better without hurting performance in other ways. Again, not a solution for the immediate term, unfortunately.

Sorry this got kinda long, and double sorry I don't have an immediate answer. We do appreciate you taking the time to dig through the API and try to figure this out, but it looks like you hit a tricky problem.

Thanks,
Andy
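For readers following the thread, the failure mode described in the message (a failed placeholder run leaves a powstream task idle until the agent recreates it at the 24-hour mark) can be sketched roughly as below. This is only an illustration, not how perfsonar-meshconfig-agent behaves, since the message says no such detection exists yet. The function names are hypothetical, and the `state` field and the state names are assumptions that should be checked against what your pscheduler version returns for a run.

```python
from datetime import datetime, timedelta, timezone

# Run states treated here as "the placeholder run did not start powstream".
# ASSUMPTION: these names are illustrative; verify them against the run JSON
# your pscheduler version actually reports.
FAILED_STATES = {"failed", "missed", "nonstart", "overdue"}


def task_is_stuck(first_run: dict, task_until: datetime, now: datetime) -> bool:
    """Return True if the initial run failed while the task is still active.

    first_run  -- parsed JSON of the task's first run (hypothetical shape)
    task_until -- end time the agent gave the task (creation time + 24 h)
    now        -- current time, passed in so the check is easy to test
    """
    state = first_run.get("state", "")
    return state in FAILED_STATES and now < task_until


def hours_until_recreated(task_until: datetime, now: datetime) -> float:
    """Hours a stuck task would sit idle before the agent replaces it."""
    remaining = (task_until - now).total_seconds()
    return max(remaining, 0.0) / 3600.0


if __name__ == "__main__":
    now = datetime(2017, 8, 29, 14, 0, tzinfo=timezone.utc)
    # Example: the task was created 4 hours ago with a 24-hour lifetime.
    until = now + timedelta(hours=20)
    run = {"state": "failed"}  # made-up sample, not real pscheduler output
    if task_is_stuck(run, until, now):
        remaining = hours_until_recreated(until, now)
        print(f"task may stay broken for up to {remaining:.1f} more hours")
```

A check along these lines would have to be run against each task's first run, which is the kind of extra work the message says needs to be weighed against performance on an already busy host.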
- [perfsonar-user] Tests not running periodically, Casey Russell, 08/28/2017
- [perfsonar-user] Re: Tests not running periodically, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Tests not running periodically, Andrew Lake, 08/29/2017
- Re: [perfsonar-user] Re: Tests not running periodically, Casey Russell, 08/29/2017
- Re: [perfsonar-user] Re: Tests not running periodically, Andrew Lake, 08/29/2017
- [perfsonar-user] Re: Tests not running periodically, Casey Russell, 08/28/2017