perfsonar-user - Re: [perfsonar-user] Lost all Owamp testing on Thursday at 1:00am

Subject: perfSONAR User Q&A and Other Discussion

From: "Andrew Lake" <>
To: "Casey Russell" <>
Cc:
Subject: Re: [perfsonar-user] Lost all Owamp testing on Thursday at 1:00am
Date: Tue, 14 Jul 2015 11:01:36 -0700 (PDT)

Hi Casey,

Someone also noted today that those values were actually in the MeshConfig example that comes with the source code, so you likely just copied them over. We have updated them in the source tree, so in the next MeshConfig release the initial example should have saner values and others hopefully won’t encounter the same problem.

Thanks,

Andy

On Tue, Jul 14, 2015 at 1:57 PM, Casey Russell <> wrote:

Andrew,

That appears to have fixed it. Thank you. I just barely remember (perhaps) changing those values during the setup process for my mesh config file. Clearly I didn't understand what I was changing.

Thank you so much.

Casey Russell
Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

On Mon, Jul 13, 2015 at 12:25 PM, Andrew Lake <> wrote:

Hi,

It looks like you have owamp configured to send 100 packets per second and register results every 300 packets (3 seconds). I believe OWAMP won’t actually let you use such a short reporting interval and will bump it up to something like 15 seconds. Unfortunately the regular_testing daemon doesn’t know it did this, so when it doesn’t get results for 3x the specified reporting interval (9 seconds) it assumes the test timed out and restarts the process.

I would recommend increasing the packet count from 300 to something like 6000 (every 60 seconds). That’s generally the time interval we use for reporting owamp summaries. Let me know if you have any questions.
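The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not code from regular_testing or OWAMP: the function names are hypothetical, and the 15-second floor is the approximate figure mentioned above ("something like 15 seconds"), not a documented constant.

```python
# Illustrative sketch only: names are hypothetical and the 15 s floor is the
# approximate value from the message above, not a documented OWAMP constant.
ASSUMED_MIN_INTERVAL_S = 15.0


def reporting_interval_s(packet_count: int, packets_per_second: float) -> float:
    """Seconds between summaries when one is registered every packet_count packets."""
    return packet_count / packets_per_second


def restart_timeout_s(interval_s: float, multiplier: float = 3.0) -> float:
    """Time without results after which the test is assumed to have timed out (3x the interval)."""
    return interval_s * multiplier


def will_restart_loop(packet_count: int, packets_per_second: float) -> bool:
    """True if the interval OWAMP is assumed to enforce exceeds the timeout for the requested one."""
    requested = reporting_interval_s(packet_count, packets_per_second)
    effective = max(requested, ASSUMED_MIN_INTERVAL_S)
    return effective > restart_timeout_s(requested)


if __name__ == "__main__":
    for count in (300, 6000):
        interval = reporting_interval_s(count, 100)
        print(count, interval, restart_timeout_s(interval), will_restart_loop(count, 100))
```

With 300 packets at 100 packets per second, the requested interval is 3 seconds and the timeout is 9 seconds, which is shorter than the assumed 15-second effective interval, so the process keeps being restarted. With 6000 packets the interval is 60 seconds and the timeout is 180 seconds, so results arrive well before the timeout.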

Thanks,

Andy

On Mon, Jul 13, 2015 at 12:09 PM, Casey Russell <> wrote:

Andy,

     You know what? On further reflection, that was a silly plan. Linking to those enormous log files is likely to detonate most any browser via a memory overload. I've attached some abbreviated versions of the log files you requested. I just cut them down so that they don't go back as far to reduce the size.

     They now fit in Gmail's mouth nicely.

Casey Russell
Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

On Mon, Jul 13, 2015 at 10:27 AM, Casey Russell <> wrote:

Andy,

Thank you for the quick reply.

The testing hosts all appear to be running RegularTesting version 3.4.2 release 5 (see output below)

Installed Packages
Name        : perl-perfSONAR_PS-RegularTesting
Arch        : noarch
Version     : 3.4.2
Release     : 5.pSPS
Size        : 285 k
Repo        : installed
From repo   : Internet2

And they do have owamp processes running even though no data is being collected (see clipped output below). Some boxes have dozens, some have only one or two, but there's always at least one owamp process present that was started with the command line: /usr/bin/owampd -c /etc/owampd -R /var/run

[crussell@ps-bryant-bw ~]$ sudo ps auxw | grep owampd
owamp     2149 0.0 0.0   7272   688 ?        Ss   09:24   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     3384 0.0 0.0   7484   772 ?        S    09:39   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     3386 0.0 0.0   7484   768 ?        S    09:39   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     4590 0.0 0.0   7484   776 ?        S    09:49   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     4613 0.0 0.0   7484   768 ?        S    09:49   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
****dozens more similar processes clipped for brevity*******
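For comparing hosts, a count of these processes is easier to read than the raw listing. The following is a minimal sketch, assuming the standard 11-column `ps auxw` layout shown above; the function name `count_processes` is hypothetical and not part of any perfSONAR tool.

```python
# Minimal sketch, assuming the usual 11-column `ps auxw` layout
# (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND).
# count_processes is a hypothetical helper, not a perfSONAR utility.
def count_processes(ps_output: str, name: str) -> int:
    """Count lines of `ps auxw` output whose executable is called `name`."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # the 11th field is the full command line
        if len(fields) < 11:
            continue  # blank or malformed line
        executable = fields[10].split()[0]
        # Compare only the base name, so "grep owampd" itself is not counted.
        if executable.rsplit("/", 1)[-1] == name:
            count += 1
    return count


SAMPLE = """\
owamp     2149 0.0 0.0   7272   688 ?        Ss   09:24   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
owamp     3384 0.0 0.0   7484   772 ?        S    09:39   0:00 /usr/bin/owampd -c /etc/owampd -R /var/run
crussell  5001 0.0 0.0   4356   732 pts/0    S+   09:50   0:00 grep owampd
"""

if __name__ == "__main__":
    print(count_processes(SAMPLE, "owampd"))
```

The last line of `SAMPLE` is a made-up `grep` entry, included only to show that it is ignored. On a live host the input could come from `subprocess.run(["ps", "auxw"], capture_output=True, text=True).stdout`, and the same function can be called with `"powstream"`.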

As for the logs, Gmail didn't want to let me attach them since they're rather large, so I've temporarily moved copies to the publicly available root area of the web server on one of the affected hosts. You can see the relevant logs for that host at:

http://ps-bryant-lt.perfsonar.kanren.net/toolkit/regular_testing.log

http://ps-bryant-lt.perfsonar.kanren.net/toolkit/owamp_bwctl.log

Keep in mind that these logs are from one of the affected testing hosts. If you need to see anything from the central archive host, let me know and I'll get that to you as well.

Thank you again for any help you can lend.

Casey Russell
Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

On Mon, Jul 13, 2015 at 9:41 AM, Andrew Lake <> wrote:

Hi Casey,

The 1AM time corresponds to the nightly restart time of the owampd and regular_testing daemons. Auto-updates happen at random times so I don’t think it’s that. Can you verify the version of regular testing with "yum info perl-perfSONAR_PS-RegularTesting"? It should be at version 3.4.2-5. That particular version contains a fix for similar problems around restart times, so making sure that’s installed is the first step. It’s been out for a few weeks.

If that is the latest, when your hosts are in a bad state do they have owampd processes ("ps auxw | grep owampd") and powstream processes ("ps auxw | grep powstream") running? Could you also send /var/log/perfsonar/owamp_bwctl.log and /var/log/perfsonar/regular_testing.log?

Thanks,

Andy

On Mon, Jul 13, 2015 at 10:04 AM, Casey Russell <> wrote:

Group,

     I have a mesh of 4 PerfSonar nodes (and 1 collector). Last Thursday morning (9th) at 1:00am, Owamp collection for half of the hosts in the mesh stopped. I'm still collecting bandwidth data and traceroute data, just no Owamp. At around 1:00am on Friday, the other half stopped.

   I can manually run latency testing between hosts with bwctl, using either ping or owamp as the tool, with no problems. However, in the maddash interfaces, if you try to look at the details for the recent tests you get:

"Unable to find any tests in the given time range where....."

I've re-pulled the mesh config on all the testing hosts. I've restarted the regular testing daemon on all the testing hosts. I've restarted the local latency services. I eventually restarted the hosts. All to no effect.

Since it occurred on two consecutive nights at 1:00am I suspected it was an automatic update that caused the problem. So I checked the IPtables rules that were borked in the original 3.4x release. Although they've been all re-written since my original installs, they seem legitimate.

I did have to make changes to /etc/httpd/conf.d/apache-toolkit_web_gui.conf to re-enable our Radius authentication for the web interface on the boxes, since a recent update had overwritten it. But besides that, I've been unable to find anything in my own investigation that would have caused this problem.

I'm going to need help from someone that knows PS a lot better than I. I'll be happy to share any log files you need to help things along. The dashboard is at: http://ps-dashboard.perfsonar.kanren.net/maddash-webui/

Thank you,

Casey Russell
Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

<owamp_bwctl.log><regular_testing.log>
