
perfsonar-user - RE: [perfsonar-user] bwping/owamp tests randomly stop and never restart

Subject: perfSONAR User Q&A and Other Discussion




  • From: "Andrew Lake" <>
  • To: "GarnizovIvan (RRZE)" <>
  • Cc: "perfsonar-user" <>, "Ty Bell" <>
  • Subject: RE: [perfsonar-user] bwping/owamp tests randomly stop and never restart
  • Date: Wed, 24 Jun 2015 03:48:38 -0700 (PDT)

Hi Ty,

Are you running the toolkit? Are you doing nightly restarts? And is this on a 3.4.2 host?

I’ve been debugging an issue with WLCG over the past few weeks where powstream tests sporadically start failing after the remote side’s owampd restarts. It looks like if owampd is restarted at just the right moment, the restart kills the parent but leaves behind the children to which the powstreams are connected. powstream then sits there doing nothing, connected to these orphaned processes on the other side, until it’s restarted (or the remote owampd processes are forcibly killed). This is similar to what you noted: if you kill the powstream processes (which also happens to end the orphaned process on the remote end), regular_testing spawns a new powstream, and that powstream gets a new working connection.

I think the fix/workaround is going to be to send a SIGKILL to anything that looks like an owampd process after giving it a chance to shut down nicely during the nightly restart.
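
As a rough illustration of that workaround (not the actual change going into the toolkit), a minimal sketch might look like the following. It assumes pgrep/pkill from procps are available; the function name force_stop, the 10-second grace period, and the "service owampd" name in the usage comment are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: give a daemon a chance to exit cleanly, then SIGKILL leftovers.
# Usage: force_stop NAME [GRACE_SECONDS]   (force_stop is a hypothetical helper)
force_stop() {
    name=$1
    grace=${2:-10}

    # Ask nicely first (SIGTERM). pkill exits non-zero when nothing
    # matched or could be signalled, in which case there is nothing to do.
    pkill -TERM -x "$name" 2>/dev/null || return 0

    # Poll once per second until the processes exit or the grace period ends.
    while [ "$grace" -gt 0 ] && pgrep -x "$name" >/dev/null 2>&1; do
        sleep 1
        grace=$((grace - 1))
    done

    # Anything still alive (for example orphaned children) gets SIGKILL.
    if pgrep -x "$name" >/dev/null 2>&1; then
        pkill -KILL -x "$name"
    fi
    return 0
}

# Possible use in a nightly restart job (service name assumed):
#   service owampd stop; force_stop owampd 10; service owampd start
```

Matching on the exact process name with -x keeps the kill from hitting unrelated processes whose command line merely contains "owampd".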

Ivan’s issue may or may not be the same, since from what I understand it was isolated to a single host, and this can happen to any host and appears a lot more random.

Thanks,
Andy






On Wed, Jun 24, 2015 at 4:17 AM, Garnizov, Ivan (RRZE) <> wrote:

Hi Ty,

In fact, I have reported the same issue on my instances in the issue tracker: https://github.com/perfsonar/regular-testing/issues/5
Suddenly, for no apparent reason and without any notable event in the logs, the regular_testing service stops collecting data. I have also noted that a single service restart does not help. You have to do a graceful restart, meaning:
sudo service regular_testing stop
sudo service postgresql stop
sudo service cassandra restart
sudo service postgresql start
sudo service regular_testing start

This immediately fixes all measurements. I have tested that on 2 hosts.
We still might be in different scenarios, although my issue is also around the latency tests.

Best regards,
Ivan




-----Original Message-----
From: [mailto:] On Behalf Of Ty Bell
Sent: Tuesday, 23 June 2015 16:41
To: perfsonar-user
Subject: Re: [perfsonar-user] bwping/owamp tests randomly stop and never restart

All my hosts are running the same (latest) versions of the tools and they're all sync'd with the same NTP sources. Instead of restarting the whole regular testing service, I've taken to killing the individual bwping process; regular testing fires up a new process and everything clears up.

--Ty

> On Apr 23, 2015, at 3:29 PM, Amit Khare <> wrote:
>
> Hi Ty,
>
> Are all your hosts running the same version of the toolkit? We have had
> similar issues with one of the older toolkit releases. I would also
> check if the hosts are properly synced with NTP server(s). Thanks,
>
> Amit
> --------------------------------------------------------------------------
> Amit Khare | Network Engineer | CANARIE Inc | 45 O'Connor St., Suite
> 500, Ottawa, ON K1P 1A4 | Office: 613-943-5377│Cell: 613-404-8696│CANARIE NOC:
> 613-944-5612│www.canarie.ca
>
>
>
>
>
>
> On 2015-04-23, 15:19, "Ty Bell" <> wrote:
>
>> Hi All,
>>
>> Wondering if this is something anyone else has observed. I have 10
>> hosts in a mesh all running owamp tests, and randomly (maybe once a
>> week) I’ll check on the mesh and see two hosts have stopped testing
>> in one direction. It’s never the same hosts, and never the same
>> direction, seems totally random. I can execute tests from the command
>> line and they run just fine. I’ve looked around for hung owamp
>> processes or daemon restarts and haven’t found anything.
>>
>> The only resolution I’ve found is to restart regular testing on both
>> hosts.
>>
>> Thanks,
>> --Ty
>>




