perfsonar-user - Re: [perfsonar-user] Lost all Owamp testing on Thursday at 1:00am

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To: Andrew Lake <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] Lost all Owamp testing on Thursday at 1:00am
  • Date: Mon, 13 Jul 2015 10:27:09 -0500

Andy,

Thank you for the quick reply. 

The testing hosts all appear to be running RegularTesting version 3.4.2 release 5 (see output below)

Installed Packages
Name        : perl-perfSONAR_PS-RegularTesting
Arch        : noarch
Version     : 3.4.2
Release     : 5.pSPS
Size        : 285 k
Repo        : installed
From repo   : Internet2


And they do have owamp processes running even though no data is being collected (see clipped output below). Some boxes have dozens, some have only one or two, but there's always at least one owamp process present that was started with the command line: /usr/bin/owampd -c /etc/owampd -R /var/run

[crussell@ps-bryant-bw ~]$ sudo ps auxw | grep owampd
owamp     2149  0.0  0.0   7272   688 ?        Ss   09:24   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     3384  0.0  0.0   7484   772 ?        S    09:39   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     3386  0.0  0.0   7484   768 ?        S    09:39   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     4590  0.0  0.0   7484   776 ?        S    09:49   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     4613  0.0  0.0   7484   768 ?        S    09:49   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
****dozens more similar processes clipped for brevity*******


As for the logs, Gmail didn't want to let me attach them since they're rather large, so I've temporarily moved copies to the publicly available root area of the web server on one of the affected hosts. You can see the relevant logs for that host at:

http://ps-bryant-lt.perfsonar.kanren.net/toolkit/regular_testing.log

http://ps-bryant-lt.perfsonar.kanren.net/toolkit/owamp_bwctl.log

Keep in mind, these logs are from one of the affected testing hosts. If you need to see anything from the central archive host, let me know and I'll get that to you as well.

Thank you again for any help you can lend.

Casey Russell
Network Engineer
Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS  66047

(785)856-9820  ext 9809

On Mon, Jul 13, 2015 at 9:41 AM, Andrew Lake <> wrote:
Hi Casey,

The 1AM time corresponds to the nightly restart time of the owampd and regular_testing daemons. Auto-updates happen at random times, so I don't think it's that. Can you verify the version of regular testing with "yum info perl-perfSONAR_PS-RegularTesting"? It should be at version 3.4.2-5. That particular version contains a fix for similar problems around restart times, so making sure it's installed is the first step. It's been out for a few weeks.

If that is the latest, when your hosts are in a bad state do they have owampd processes ("ps auxw | grep owampd") and powstream processes ("ps auxw | grep powstream") running? Could you also send /var/log/perfsonar/owamp_bwctl.log and /var/log/perfsonar/regular_testing.log?
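
The process check above can be sketched as a small Python helper that tallies owampd and powstream entries in "ps auxw" output. This is only an illustration: the function name count_daemons is hypothetical and not part of perfSONAR, and it assumes the usual 11-column "ps auxw" layout with the command in the last column.

```python
import os
import subprocess


def count_daemons(ps_output, names=("owampd", "powstream")):
    """Count processes in `ps auxw` text whose executable matches each name.

    Hypothetical helper for illustration; assumes the standard layout of
    ten fixed columns followed by the full command line.
    """
    counts = {name: 0 for name in names}
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # keep the command line in one piece
        if len(fields) < 11:
            continue  # blank or truncated line
        # Compare on the executable's basename so "grep owampd" is not counted.
        executable = os.path.basename(fields[10].split()[0])
        if executable in counts:
            counts[executable] += 1
    return counts


if __name__ == "__main__":
    # Run against the live process table on the host being checked.
    result = subprocess.run(["ps", "auxw"], capture_output=True, text=True)
    for name, total in count_daemons(result.stdout).items():
        print(f"{name}: {total}")
```

A host that reports owampd processes but zero powstream processes would point toward the regular testing side rather than the owamp daemon itself, though that reading is an assumption to confirm against the logs.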

Thanks,
Andy





On Mon, Jul 13, 2015 at 10:04 AM, Casey Russell <> wrote:

Group,

     I have a mesh of 4 perfSONAR nodes (and 1 collector).  Last Thursday morning (the 9th) at 1:00am, Owamp collection for half of the hosts in the mesh stopped.  I'm still collecting bandwidth data and traceroute data, just no Owamp.  At around 1:00am on Friday, the other half stopped.

     I can manually run latency tests between hosts with bwctl, using either ping or owamp as the tool, with no problems.  However, in the maddash interface, if you try to look at the details for the recent tests you get:

"Unable to find any tests in the given time range where....."

I've re-pulled the mesh config on all the testing hosts.  I've restarted the regular testing daemon on all the testing hosts.  I've restarted the local latency services.  I eventually restarted the hosts.  All to no effect.

Since it occurred on two consecutive nights at 1:00am I suspected it was an automatic update that caused the problem.  So I checked the IPtables rules that were borked in the original 3.4x release.  Although they've been all re-written since my original installs, they seem legitimate. 

I did have to make changes to the /etc/httpd/conf.d/apache-toolkit_web_gui.conf to re-enable our Radius authentication for the web interface on the boxes since a recent update had overwritten it.  But besides that, I'm unable to find anything in my own looking that would have caused this problem.

I'm going to need help from someone that knows PS a lot better than I.  I'll be happy to share any log files you need to help things along.  The dashboard is at:  http://ps-dashboard.perfsonar.kanren.net/maddash-webui/ 

Thank you,



Casey Russell
Network Engineer
Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS  66047





