Re: [perfsonar-user] perfCube (Cubox-i4Pro with 16G microSD) packet loss problem


  • From: Brian Tierney <>
  • To: Hyojoon Kim <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] perfCube (Cubox-i4Pro with 16G microSD) packet loss problem
  • Date: Mon, 14 Sep 2015 08:27:03 -0700


Thanks for sending this out. Please keep us posted on your results.

I'm sure several folks will find this useful, and we'll need to be aware of these issues on other low-cost devices as well.



On Fri, Sep 11, 2015 at 2:06 PM, Hyojoon Kim <> wrote:
Hello Brian & all, 

I just want to give an update on this matter (and leave a record of it). This is still an ongoing investigation, but I hope it provides some useful information and insight, especially for future development of low-cost perfSONAR nodes.

1. Based on some investigation, it seems there is too much I/O going on: rsyslogd and systemd-journald are doing a lot of writes. Big servers can handle this, but probably not a relatively weak Cubox. I assume no one saw this problem on a Cubox before because no one has actually put one into production and run a large number of tests on it constantly over a period of time. On our setup, we see a lot of syslog messages from owampd, bwctld, and powstream in /var/log/messages, sometimes 30-40 messages per second, constantly.
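As a rough sanity check of that rate (just a sketch), I count how fast /var/log/messages grows over a short window:

   # run as root; count lines added to /var/log/messages over 10 seconds
   before=$(wc -l < /var/log/messages); sleep 10; after=$(wc -l < /var/log/messages)
   echo "$(( (after - before) / 10 )) messages/sec"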

2. Our next step was to reconfigure systemd-journald to write to volatile memory instead of to the microSD storage (I did not want to lose syslog, so journald was the next target). At least the journald writes then go to RAM, so processes do not get stuck in I/O write wait as much, which had been happening often. This makes things harder to debug when the Cubox reboots, but we still have syslog, so it is an acceptable trade-off. It seemed to solve the problem for a while, until...
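For reference, the change is just switching journald's storage to RAM; a minimal sketch of the setting (the RuntimeMaxUse cap is our own choice, pick something that fits the Cubox's memory):

   # /etc/systemd/journald.conf
   [Journal]
   # keep the journal in /run (RAM) instead of writing to the microSD
   Storage=volatile
   # optional cap on the in-memory journal size (our own value, adjust as needed)
   RuntimeMaxUse=16M

After changing it, journald has to be restarted for the new storage mode to take effect (but see items 3 and 4 below for why we now reboot instead of restarting journald).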

3. There is a serious bug in NetworkManager on older Red Hat 7 / Fedora 20+ systems: NetworkManager abnormally kills the DHCP client after the systemd-journald process is restarted (https://bugzilla.gnome.org/show_bug.cgi?id=735962), and there is no official fix for Fedora 20 and below. Once the DHCP client is killed, the Cubox eventually loses its IP address and then loses network connectivity entirely. It is not even pingable.
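A quick way to spot this state before the box drops off the network completely (just a sketch) is to check whether the DHCP client process is still alive and whether the address is still assigned:

   # if NetworkManager has killed the DHCP client, this prints nothing
   pgrep -l dhclient
   # check the IPv4 address (interface name may differ on your Cubox)
   ip -4 addr show eth0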

4. Our current workaround is to do a *reboot* instead of running "systemctl restart systemd-journald" whenever we need to restart the systemd-journald process.

5. As the perfSONAR processes publish a lot of logs through rsyslogd, I also modified the rsyslogd configuration to discard all logs from perfSONAR (owampd appears to log to the local5 facility, based on /etc/owampd/owampd.conf). It would have been better if I could set the level of logs to keep (e.g., log only WARN and FATAL).
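For anyone wanting to do the same, the discard rule is a small rsyslog snippet; just a sketch, and the file name and log path are arbitrary:

   # /etc/rsyslog.d/30-perfsonar.conf
   # owampd writes to the local5 facility per /etc/owampd/owampd.conf
   # optional: keep warning-and-above in a separate file before discarding
   #local5.warning   /var/log/owamp-warn.log
   # discard everything from local5 (older rsyslog uses the legacy "~" action instead of "stop")
   local5.*   stop

followed by a restart of rsyslogd.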

6. I see that perfSONAR 3.4.13-1 included a fix that makes it possible to set the syslog logging level in owampd.conf (https://github.com/perfsonar/owamp/commit/22c5e47fbf09219ebec3396132d79a767df00612). However, since the perfCubes we have use the old pre-made image with an older version of perfSONAR, I don't think this can be applied.
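For anyone checking their own image, listing the installed owamp packages shows which version shipped with it:

   # show the owamp packages/version on the perfCube image
   rpm -qa | grep -i owamp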

We still have to see whether this solves our problem with the perfCubes. I will post again when I have updates. Please correct me if I have assessed something incorrectly, and let me know if anyone has a better suggestion.

Hope this helps anyone out there.

Thanks,
Joon

On Sep 4, 2015, at 1:31 PM, Brian Tierney <> wrote:


I only played with the perfCube for a bit, and never put one in production, so I didn't notice this problem.

Is anyone else running a perfCUBE in production?



On Thu, Sep 3, 2015 at 7:43 AM, Hyojoon Kim <> wrote:
Hello,

We have a Cubox-i4Pro with perfSONAR installed (aka, perfCube), using the pre-made image (perfCube-3.4.0.img) with a 16G microSD card. 

We run OWAMP measurements, and we see spikes of packet loss from time to time (sometimes up to 25% packet loss). Whenever there is a spike, it seems to coincide with high I/O activity and long write wait times. For example, we had 23% packet loss between 8:10pm and 8:20pm, and below is the output of "iostat -x -t 120" around that time. As you can see, there is a huge spike in w_await, up to about 9 seconds.

This happens with all three perfCube boxes we have. It varies a bit between boxes, but there seem to be 1-3 spikes like this every 12 hours.

Before I dig into which process is the culprit, I want to check: is this a known problem, or has anything like this been seen on perfCube boxes before?

Thanks,
Joon 


09/02/2015 08:13:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.77    0.00    1.54    0.74    0.00   95.95

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     6.49    0.00    7.09     0.00    57.97    16.35     3.26  459.76    0.00  459.76   7.07   5.01

09/02/2015 08:15:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.50    0.00    1.57   16.22    0.00   80.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     5.53    0.00    6.64     0.00    51.23    15.43    33.57 3268.47    0.00 3268.47  74.21  49.29

09/02/2015 08:17:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.24    0.00    1.58   25.48    0.00   70.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     5.57    0.00    6.48     0.00    52.23    16.11    49.66 9489.77    0.00 9489.77 120.91  78.39

09/02/2015 08:19:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.74    0.00    1.43   18.21    0.00   78.62

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     5.93    0.00    5.77     0.00    50.30    17.45    50.01 8292.04    0.00 8292.04 124.84  71.99

09/02/2015 08:21:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.99    0.00    1.66   11.45    0.00   84.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     5.69    0.00    7.40     0.00    57.07    15.42    50.61 7134.64    0.00 7134.64  69.05  51.09

09/02/2015 08:23:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.76    0.00    1.49    0.62    0.00   96.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
mmcblk0           0.00     5.87    0.00    6.75     0.00    54.43    16.13     1.44  213.38    0.00  213.38   5.67   3.83
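When one of these spikes is in progress, my plan for finding the culprit is roughly to look for processes blocked in uninterruptible I/O wait ("D" state); just a sketch:

   # list processes stuck in "D" state and the kernel function they are waiting in
   ps -eo state,pid,comm,wchan:32 | awk '$1 == "D"'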

-Joon




--
Brian Tierney, http://www.es.net/tierney
Energy Sciences Network (ESnet), Berkeley National Lab
http://fasterdata.es.net





