
perfsonar-user - Re: [perfsonar-user] Throughput suddenly unidirectional

  • From: Andrew Lake <>
  • To: Daniel Schmidt <>
  • Cc:
  • Subject: Re: [perfsonar-user] Throughput suddenly unidirectional
  • Date: Wed, 3 Dec 2014 16:22:55 -0500

Hi Dan,

Looking through the latest logs, there are definitely tests that try to run but fail to complete (i.e. it's not a problem of writing to the database). We actually record failure events in the measurement archive now, and I see the failures recorded at times that seem to correspond to the gaps; they seem to indicate iperf3 spit out no results or some type of malformed output: http://146.166.250.14/esmond/perfsonar/archive/dc3007648f8249deac86c3e366340290/failures/base?format=json
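(For reference, that failure feed is just a plain HTTP resource, so it can be pulled from the command line as well; this simply fetches the URL above with curl:)

    curl -s "http://146.166.250.14/esmond/perfsonar/archive/dc3007648f8249deac86c3e366340290/failures/base?format=json"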

Unfortunately your host fixed itself; otherwise, running bwctl by hand would have been useful to get a better picture of what was getting reported by bwctl/iperf3 when things are in the "bad" state. The timestamps don't seem to correlate with the "Peer cancelled test before expected" in the BWCTL logs, and I am not seeing any other errors. It seems to happen for extended periods of time, so maybe the best thing to do is keep an eye on it, and if we catch it misbehaving, try running bwctl by hand at that moment? Sorry I don't have a better answer at this point.
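(When it does act up, the by-hand test from earlier in this thread, run in both directions, should capture exactly what iperf3 emits; 1.1.1.1/2.2.2.2 are the placeholders used for the two hosts throughout the thread:)

    bwctl -f m -x -T iperf3 -t 30 -i 1 -c 1.1.1.1 -s 2.2.2.2
    bwctl -f m -x -T iperf3 -t 30 -i 1 -c 2.2.2.2 -s 1.1.1.1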

Thanks,
Andy



On Dec 3, 2014, at 1:25 PM, Daniel Schmidt <> wrote:

Thank you very much for your continued assistance.  

On Wed, Dec 3, 2014 at 11:07 AM, Jason Zurawski <> wrote:
Hi Dan;

Thanks for sending the 2nd set of logs, and the other information.  We are still not able to determine what is going on; however, could you send us one more log file:

/var/log/perfsonar/regular_testing.log

This will help us see if the issue is related to the testing itself, or the storage/graphing.
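(If you want to pre-screen that log yourself, a rough first pass, just a sketch, is to grep it for error lines around the gap times:)

    grep -i error /var/log/perfsonar/regular_testing.log | tail -n 20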

Thanks;

-jason

On Dec 2, 2014, at 3:05 PM, Daniel Schmidt <> wrote:

> If I forget to include something, please remind me
>
> * /etc/init.d/iptables stop; however, I would think that would have shown up on my bench test, no?
> * Included logs from other side
> * OWAMP?  Could you give me a bit; I need to look that up, I don't see it
> * NTP looks fine to me, I'll post it later in the message
> * Yes, I did reverse the -c & -s.  But.....
>
> I just did it again - look at this done on 2.2.2.2 (remote side)
>
> [root@localhost admin]# bwctl -f m -x -T iperf3 -t 30 -i 1 -c 1.1.1.1 -s 2.2.2.2
> bwctl: Using tool: iperf3
> bwctl: 37 seconds until test results available
>
> RECEIVER START
> -----------------------------------------------------------
> Server listening on 5601
> -----------------------------------------------------------
> Accepted connection from 2.2.2.2, port 45812
> [ 17] local 1.1.1.1 port 5601 connected to 2.2.2.2 port 52941
> [ ID] Interval           Transfer     Bandwidth
> [ 17]   0.00-1.00   sec  45.6 MBytes   382 Mbits/sec
> [ 17]   1.00-2.00   sec  47.3 MBytes   397 Mbits/sec
> [ 17]   2.00-3.00   sec  47.5 MBytes   398 Mbits/sec
> [ 17]   3.00-4.00   sec  36.8 MBytes   309 Mbits/sec
> [ 17]   4.00-5.00   sec  40.0 MBytes   336 Mbits/sec
> [ 17]   5.00-6.00   sec  31.5 MBytes   265 Mbits/sec
> [ 17]   6.00-7.00   sec  21.2 MBytes   178 Mbits/sec
> [ 17]   7.00-8.00   sec  30.2 MBytes   254 Mbits/sec
> [ 17]   8.00-9.00   sec  32.7 MBytes   274 Mbits/sec
> [ 17]   9.00-10.00  sec  36.6 MBytes   307 Mbits/sec
> [ 17]  10.00-11.00  sec  22.5 MBytes   189 Mbits/sec
> [ 17]  11.00-12.00  sec  31.1 MBytes   261 Mbits/sec
> [ 17]  12.00-13.00  sec  3.36 MBytes  28.2 Mbits/sec
> [ 17]  13.00-14.00  sec  28.1 MBytes   236 Mbits/sec
> [ 17]  14.00-15.00  sec  43.1 MBytes   361 Mbits/sec
> [ 17]  15.00-16.00  sec  35.5 MBytes   297 Mbits/sec
> [ 17]  16.00-17.00  sec  42.2 MBytes   354 Mbits/sec
> [ 17]  17.00-18.00  sec  39.9 MBytes   335 Mbits/sec
> [ 17]  18.00-19.00  sec  8.22 MBytes  69.0 Mbits/sec
> [ 17]  19.00-20.00  sec  36.8 MBytes   309 Mbits/sec
> [ 17]  20.00-21.00  sec  39.9 MBytes   335 Mbits/sec
> [ 17]  21.00-22.00  sec  38.7 MBytes   325 Mbits/sec
> [ 17]  22.00-23.00  sec  13.5 MBytes   113 Mbits/sec
> [ 17]  23.00-24.00  sec   539 KBytes  4.41 Mbits/sec
> [ 17]  24.00-25.00  sec   617 KBytes  5.05 Mbits/sec
> [ 17]  25.00-26.00  sec  29.0 MBytes   243 Mbits/sec
> [ 17]  26.00-27.00  sec  16.5 MBytes   138 Mbits/sec
> [ 17]  27.00-28.00  sec   642 KBytes  5.26 Mbits/sec
> [ 17]  28.00-29.00  sec  10.7 MBytes  89.5 Mbits/sec
> [ 17]  29.00-30.00  sec   533 KBytes  4.37 Mbits/sec
> [ 17]  30.00-30.04  sec  21.2 KBytes  4.55 Mbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [ 17]   0.00-30.04  sec   811 MBytes   226 Mbits/sec  220             sender
> [ 17]   0.00-30.04  sec   811 MBytes   226 Mbits/sec                  receiver
>
> RECEIVER END
>
> SENDER START
> Connecting to host 1.1.1.1, port 5601
> [ 16] local 2.2.2.2 port 52941 connected to 1.1.1.1 port 5601
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [ 16]   0.00-1.00   sec  47.8 MBytes   401 Mbits/sec   62   55.1 KBytes
> [ 16]   1.00-2.00   sec  47.4 MBytes   397 Mbits/sec    3   53.7 KBytes
> [ 16]   2.00-3.00   sec  46.9 MBytes   394 Mbits/sec    4   45.2 KBytes
> [ 16]   3.00-4.00   sec  36.5 MBytes   306 Mbits/sec    6   45.2 KBytes
> [ 16]   4.00-5.00   sec  39.8 MBytes   334 Mbits/sec    9   43.8 KBytes
> [ 16]   5.00-6.00   sec  32.3 MBytes   271 Mbits/sec    1   70.7 KBytes
> [ 16]   6.00-7.00   sec  19.7 MBytes   166 Mbits/sec    3   49.5 KBytes
> [ 16]   7.00-8.00   sec  30.1 MBytes   252 Mbits/sec   17   36.8 KBytes
> [ 16]   8.00-9.00   sec  34.5 MBytes   289 Mbits/sec   10   66.5 KBytes
> [ 16]   9.00-10.00  sec  36.1 MBytes   302 Mbits/sec   10   48.1 KBytes
> [ 16]  10.00-11.00  sec  22.7 MBytes   190 Mbits/sec    9   50.9 KBytes
> [ 16]  11.00-12.00  sec  31.0 MBytes   260 Mbits/sec   14   50.9 KBytes
> [ 16]  12.00-13.00  sec  2.72 MBytes  22.8 Mbits/sec    3   29.7 KBytes
> [ 16]  13.00-14.00  sec  28.9 MBytes   243 Mbits/sec    0   74.9 KBytes
> [ 16]  14.00-15.00  sec  42.4 MBytes   356 Mbits/sec    4   36.8 KBytes
> [ 16]  15.00-16.00  sec  35.7 MBytes   300 Mbits/sec    3   43.8 KBytes
> [ 16]  16.00-17.00  sec  42.0 MBytes   352 Mbits/sec    7   35.4 KBytes
> [ 16]  17.00-18.00  sec  40.8 MBytes   342 Mbits/sec    4   59.4 KBytes
> [ 16]  18.00-19.00  sec  7.26 MBytes  60.9 Mbits/sec    8   36.8 KBytes
> [ 16]  19.00-20.00  sec  37.0 MBytes   310 Mbits/sec    5   36.8 KBytes
> [ 16]  20.00-21.00  sec  40.8 MBytes   343 Mbits/sec    2   67.9 KBytes
> [ 16]  21.00-22.00  sec  38.5 MBytes   323 Mbits/sec   17   25.5 KBytes
> [ 16]  22.00-23.00  sec  11.7 MBytes  98.1 Mbits/sec    2   19.8 KBytes
> [ 16]  23.00-24.00  sec   488 KBytes  4.00 Mbits/sec    0   22.6 KBytes
> [ 16]  24.00-25.00  sec   650 KBytes  5.33 Mbits/sec    0   24.0 KBytes
> [ 16]  25.00-26.00  sec  30.3 MBytes   254 Mbits/sec    3   48.1 KBytes
> [ 16]  26.00-27.00  sec  15.2 MBytes   128 Mbits/sec   12   21.2 KBytes
> [ 16]  27.00-28.00  sec   682 KBytes  5.58 Mbits/sec    0   24.0 KBytes
> [ 16]  28.00-29.00  sec  10.7 MBytes  89.6 Mbits/sec    2   18.4 KBytes
> [ 16]  29.00-30.00  sec   488 KBytes  4.00 Mbits/sec    0   21.2 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [ 16]   0.00-30.00  sec   811 MBytes   227 Mbits/sec  220             sender
> [ 16]   0.00-30.00  sec   811 MBytes   227 Mbits/sec                  receiver
>
> This seems to indicate a problem with my circuit, as I see this in no other location.  But my root question remains: why would this cause my graph to freak out & stop graphing when it gets this data?
>
> I can't remember the iperf and nuttcp command line options offhand; if you think they would be helpful, I'll go read up & do them.  Sorry - not purposefully being lazy, I just have to put this on the back burner for a few hours.   Where is thrulay?  I can at least remember how to use thrulay.
>
> Thanks for your time.
>
>
> NTP results:
> [root@localhost admin]# ntpq -p -c rv
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> -nms-rlat.chic.n 141.142.143.138  2 u   55 1024  377   27.404   -0.670   0.619
> +eth-1.nms-rlat. .IRIG.           1 u  590 1024  377   53.188   -0.156   0.236
> -nms-rlat.losa.n .CDMA.           1 u  671 1024  377   56.644   14.561   0.207
> +nms-rlat.newy32 .CDMA.           1 u 1016 1024  377   54.850    0.031   0.360
> -chronos.es.net  .CDMA.           1 u  715 1024  377   52.939    0.814   0.220
> *saturn.es.net   .CDMA.           1 u  457 1024  377   30.607   -0.196   5.333
> associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
> version="ntpd Sat Nov 23 18:21:48 UTC 2013 (1)",
> processor="x86_64", system="Linux/2.6.32-504.1.3.el6.aufs.web100.x86_64",
> leap=00, stratum=2, precision=-23, rootdelay=30.607, rootdisp=54.621,
> refid=198.129.252.38,
> reftime=d8288af8.d66b698e  Tue, Dec  2 2014 12:01:12.837,
> clock=d82890dd.8f707797  Tue, Dec  2 2014 12:26:21.560, peer=60532,
> tc=10, mintc=3, offset=-0.109, frequency=1.125, sys_jitter=0.123,
> clk_jitter=0.164, clk_wander=0.007
>
> [root@localhost admin]# ntpq -p -c rv
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> -nms-rlat.chic.n 141.142.143.138  2 u  928 1024  377   27.017   -0.995   0.932
> +nms-rlat.hous.n .IRIG.           1 u   25 1024  377   52.872   -0.338   0.189
> -nms-rlat.losa.n .CDMA.           1 u  517 1024  377   56.317   14.298   0.320
> -nms-rlat.newy32 .CDMA.           1 u  907 1024  377   54.451   -0.227   0.271
> +chronos.es.net  .CDMA.           1 u  156 1024  377   54.627   -0.438   0.321
> *saturn.es.net   .CDMA.           1 u   19 1024  377   30.346   -0.456   0.304
> associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
> version="ntpd Sat Nov 23 18:21:48 UTC 2013 (1)",
> processor="x86_64", system="Linux/2.6.32-504.1.3.el6.aufs.web100.x86_64",
> leap=00, stratum=2, precision=-23, rootdelay=30.346, rootdisp=36.512,
> refid=198.129.252.38,
> reftime=d8289155.99671fb2  Tue, Dec  2 2014 12:28:21.599,
> clock=d8289168.31de148f  Tue, Dec  2 2014 12:28:40.194, peer=13268,
> tc=10, mintc=3, offset=-0.413, frequency=-0.412, sys_jitter=0.067,
> clk_jitter=0.133, clk_wander=0.020
>
>
>
> On Tue, Dec 2, 2014 at 10:36 AM, Jason Zurawski <> wrote:
> Hey Dan;
>
> Looking through the logs, the only suspect thing I see are lines of this nature:
>
> > Dec  2 10:05:01 localhost bwctld[12565]: FILE=endpoint.c, LINE=1314, PeerAgent: Peer cancelled test before expected
>
> Unfortunately that tells us the ‘what’ but not the ‘how’.  Could you also send the logs from the other host you are using?  That host may have more details about what is going on.  A couple of other things that came to mind:
>
> >> * No firewall between A & B
>
> iptables may be on for both sides; a quick and dirty test would be to just disable it and see if that helps?
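> (A minimal sketch of that test, assuming the stock EL6 init scripts the toolkit ships with, run on both hosts:)
>
>     service iptables stop      # temporarily disable the firewall
>     iptables -L -n             # or just inspect the current rules first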
>
> >> * I'm not familiar with "slots."  There are few throughput tests running though.  (Tests running 33% of time)
>
> Ok, this won’t be the issue I was thinking of.
>
> >> * I assumed packet loss was an issue.  So, I set up smokeping on both sides, 5 pings every 30 seconds, 1472 MTU.  However, I'm not getting loss.
>
> Do you have OWAMP going between the two hosts?  If you don’t, I would suggest setting up that test too.  OWAMP uses UDP packets which may give a different clue than the ICMP that smokeping would use.
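> (A minimal check, assuming the owamp client tools are installed on one of the hosts; by default owping runs a short UDP one-way-latency test in both directions against the remote owamp server:)
>
>     owping 2.2.2.2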
>
> >> * The pS-Performance Toolkit comes with NTP on - it appears to be running, both hosts have the same time & these machines are not behind any firewalls.
>
> Could you send the output of ‘ntpq -p -c rv’ for both?
>
> >> * I am not seeing the issue on command line bwctl.  Strange.
>
> Could you try the reverse direction as well - e.g. swap the hosts for the -c and -s flags?  Also try using ‘iperf’ and ‘nuttcp’ as the tool instead of ‘iperf3’.
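> (Concretely, something like the following, reusing the 1.1.1.1/2.2.2.2 placeholders from elsewhere in this thread:)
>
>     bwctl -f m -x -T iperf3 -t 30 -i 1 -c 2.2.2.2 -s 1.1.1.1
>     bwctl -f m -x -T iperf  -t 30 -i 1 -c 1.1.1.1 -s 2.2.2.2
>     bwctl -f m -x -T nuttcp -t 30 -i 1 -c 1.1.1.1 -s 2.2.2.2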
>
> Thanks;
>
> -jason
>
> On Dec 2, 2014, at 12:16 PM, Daniel Schmidt <> wrote:
>
> > Thank you kindly for your reply.  Some short responses:
> >
> > * No firewall between A & B
> > * I'm not familiar with "slots."  There are few throughput tests running though.  (Tests running 33% of time)
> > * I assumed packet loss was an issue.  So, I set up smokeping on both sides, 5 pings every 30 seconds, 1472 MTU.  However, I'm not getting loss.
> > * The pS-Performance Toolkit comes with NTP on - it appears to be running, both hosts have the same time & these machines are not behind any firewalls.
> > * I am not seeing the issue on command line bwctl.  Strange.
> > * Cacti minute graphs don't show any strange usage on the ICX switch.
> >
> > I would suspect hardware, but the boxes ran a solid line for hours on my bench test.  Please forgive me, but I'm reluctant to give out the IPs, as I haven't really figured out how I would prevent hackers from using these machines to DoS me.  (Has anybody had to mitigate this issue?  Sorry - off-topic question)  However, I'd be happy to privately give you root on the box.
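> > (On the DoS worry: one common mitigation is to restrict who may request tests via /etc/bwctld/bwctld.limits.  The sketch below is illustrative only; the class names and the 2.2.2.2 peer are made up for this thread, so check the bwctld.limits man page for the exact parameters on your release:)
> >
> >     limit root with bandwidth=0, duration=0      # unrestricted class
> >     limit jail with parent=root, bandwidth=8m, duration=30, allow_open_mode=off
> >     assign net 2.2.2.2/32 root                   # trusted test peer
> >     assign default jail                          # everyone else is capped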
> >
> > I have attached a png of what I see.  You can see the lines vary greatly, and around 9:30 the throughput suddenly decided to start working again.  I have also attached the log, with 1.1.1.1 substituted for the local address and 2.2.2.2 for the remote.
> >
> > Many thanks,
> > -Dan
> >
> > On Mon, Dec 1, 2014 at 4:24 PM, Jason Zurawski <> wrote:
> > Hey Daniel;
> >
> > Would you be able to provide a link to your node, or send along a screenshot, to give us a better idea of what you are seeing?
> >
> > Off the top of my head, here are a couple of typical reasons that tests could fail:
> >
> >         - Firewalls in the path denying access to ports, or not enough ports available for the number of tests that are running
> >
> >         - Lack of testing ‘slots’ available on one side or the other
> >
> >         - NTP synchronization issues
> >
> >         - Packet loss that prevents the test from starting or finishing.
> >
> > If you send along your /var/log/perfsonar/owamp_bwctl.log file, we can have a look to see what may be menacing your node.  The other thing you can try is some by-hand tests, something like:
> >
> >         bwctl -f m -x -T iperf3 -t 30 -i 1 -c HOST1 -s HOST2
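> > (And to pre-screen that log yourself, a rough grep like this, just a sketch, pulls out the failed-test lines:)
> >
> >         grep -iE "cancelled|error" /var/log/perfsonar/owamp_bwctl.log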
> >
> > Thanks;
> >
> > -jason
> >
> > On Dec 1, 2014, at 5:43 PM, Daniel Schmidt <> wrote:
> >
> > > I've noticed strange behavior on our throughput tests at one site.  Sometimes the graph turns unidirectional - i.e., one direction stops working.  Sometimes both directions stop working.  The times are random.  The site is verified up by ping and passes traffic; however, the throughput graphs vary greatly.  (We believe due to issues with this circuit)
> > >
> > > I've only seen it do this in this one case.  It's almost like it gets angry that the speed varies vastly and gives up.
> > >
> > > Has anybody else encountered this?  Any ideas greatly appreciated.

E-Mail to and from me, in connection with the transaction 
of public business, is subject to the Wyoming Public Records 
Act and may be disclosed to third parties.

<regular_testing.tgz>



