
perfsonar-user - Re: [perfsonar-user] OWAMP tests not scheduling reliably in mesh

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To: Mark Feit <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] OWAMP tests not scheduling reliably in mesh
  • Date: Fri, 23 Mar 2018 09:38:49 -0500

Mark,

     It may be worth noting that all of our hosts are dual-NIC, dual-stack hosts: a single server with a latency NIC and a bandwidth NIC, each running both IPv4 and IPv6.  So it's entirely possible there's something about the way we've built the mesh that causes the UUID clash.

A single host might look like this:
DNS NAME 1 (latency) NIC 1
ps-ku-lt.perfsonar.kanren.net. 85034 IN A       164.113.32.57
ps-ku-lt.perfsonar.kanren.net. 85122 IN AAAA    2001:49d0:23c0:7::57

DNS NAME 2 (bandwidth) NIC 2
ps-ku-bw.perfsonar.kanren.net. 300 IN   A       164.113.32.145
ps-ku-bw.perfsonar.kanren.net. 300 IN   AAAA    2001:49d0:23c0:2::18

And as I sent directly to you, when tests do get scheduled, they appear to run just fine.  The graphs eventually fill in the dead spots after 24 hours or so, when tests get rescheduled, so my longer-term graphs (further out than 24 hours) look more or less normal.





Sincerely,
Casey Russell
Network Engineer
KanREN
phone 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


On Thu, Mar 22, 2018 at 6:53 PM, Mark Feit <> wrote:

Casey Russell writes:

 

     We've got a large mesh config, and for some time now (months) the owamp tests have not been scheduling reliably.  What I mean by that is tonight when the mesh config agent runs on them, somewhere around 30-40% of the latency tests in the mesh will fail to schedule (one way).  The same test, in the other direction between those hosts will probably schedule fine.  24 hours later, when it runs again, most of those will re-schedule just fine, but a new 30-40% fail to schedule.  

2018/03/21 04:21:34 (22276) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test throughput(ps-fhsu-bw.perfsonar.kanren.net->ps-ksu-bw.perfsonar.kanren.net), continuing with rest of config: 500 INTERNAL SERVER ERROR: Error while tasking ps-ksu-bw.perfsonar.kanren.net: Unable to post task to ps-ksu-bw.perfsonar.kanren.net: Task already exists.  All participants must be on separate systems.

 

2018/03/18 23:16:27 (30529) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test throughput(ps-ku-bw.perfsonar.kanren.net->ps-bryant-bw.perfsonar.kanren.net), continuing with rest of config: 500 INTERNAL SERVER ERROR: Error while tasking ps-bryant-bw.perfsonar.kanren.net: Unable to post task to ps-bryant-bw.perfsonar.kanren.net: Task already exists.  All participants must be on separate systems.

 

That’s a known error, but not expected under these circumstances.  Let me think aloud for a minute:

 

Meshconfig submits the task to the first participant (“A”), which assigns it an identifier.  A then submits the task to the second participant (“B”) under the same identifier.  The usual case is that B doesn’t have a task with that identifier and everything goes to plan.  If the task already exists on B, it will be rejected by B and, in turn, A with the error you see.  There are two things that can cause this to happen:  One is tasks from different systems having the same identifier.  The identifiers are version 4 (random) UUIDs.  The other is when the task has parameters that put both participants on the same system (e.g., pscheduler task throughput --dest localhost).  That makes A and B the same machine, and when A tries to task itself (as if it were B), it complains that the task is a duplicate.
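
The hand-off described above can be illustrated with a toy model. This is only a sketch of the logic as described in this message, not pScheduler's actual code: the `Participant` class, `post_task` function, and `DuplicateTask` exception are hypothetical names invented for the illustration. It shows why naming the same system as both participants produces a "Task already exists" rejection even though the identifier is freshly generated.

```python
import uuid


class DuplicateTask(Exception):
    """Raised when a participant is handed an identifier it already holds."""


class Participant:
    """Hypothetical stand-in for one system's task store."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}

    def accept(self, task_id, spec):
        # A participant refuses any identifier it has already stored.
        if task_id in self.tasks:
            raise DuplicateTask(f"{self.name}: Task already exists.")
        self.tasks[task_id] = spec


def post_task(lead, other, spec):
    """Lead assigns a random (version 4) UUID, then tasks the second participant."""
    task_id = str(uuid.uuid4())
    lead.accept(task_id, spec)
    try:
        other.accept(task_id, spec)
    except DuplicateTask:
        # The second participant rejected the task, so the lead drops it too.
        del lead.tasks[task_id]
        raise
    return task_id


if __name__ == "__main__":
    a, b = Participant("A"), Participant("B")
    post_task(a, b, {"test": "throughput"})      # distinct systems: accepted
    try:
        post_task(a, a, {"test": "throughput"})  # A is also "B": rejected
    except DuplicateTask as err:
        print(err)
```

In this model the second call fails with no identifier collision at all: A stores the task, then is asked to store it again as if it were B.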

 

I suspect that the latter is what’s going on here, because happenstance collisions in a 128-bit space should be exceedingly rare.  To the best of my knowledge, we’re not seeing this anywhere else, even at sites running large meshes.  This could be a case of something really weird happening network- or DNS-wise, but it’s also possible that Meshconfig is doing something silly like trying to task the wrong system.  Andy’s our resident guru on that subject and is out until next week, but I may have a peek at the Meshconfig sources to see if I can spot anything obvious.
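
To put a rough number on "exceedingly rare", here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes the identifiers are drawn uniformly at random; a version 4 UUID carries 122 random bits, since 6 of its 128 bits are fixed by the version and variant fields. The function name and the task count of one million are arbitrary choices for illustration, not figures from this thread.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID fixes 6 of its 128 bits


def collision_probability(n_tasks, bits=RANDOM_BITS):
    """Birthday approximation: chance that any two of n_tasks random IDs match."""
    space = 2 ** bits
    exponent = n_tasks * (n_tasks - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 so tiny probabilities are not rounded to 0.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # One million tasks is an arbitrary, generous figure for a mesh.
    print(collision_probability(1_000_000))
```

Even at that scale the probability is many orders of magnitude too small to account for 30-40% of tests failing, which is consistent with looking at the same-system explanation first.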

 

 

2018/03/21 04:21:35 (22276) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test latencybg(ps-fhsu-lt.perfsonar.kanren.net->ps-esu-lt.perfsonar.kanren.net), continuing with rest of config: 500 Internal Server Error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

 

For this one, you’ll probably find an error with the same timestamp (plus or minus a bit) in the Apache logs.

 

Can I take it that you’re not having trouble with the tasks that do get scheduled?

 

--Mark

 




