perfsonar-user - Re: [perfsonar-user] Throughput suddenly unidirectional

Subject: perfSONAR User Q&A and Other Discussion

  • From: Jason Zurawski <>
  • To: Daniel Schmidt <>
  • Cc:
  • Subject: Re: [perfsonar-user] Throughput suddenly unidirectional
  • Date: Mon, 15 Dec 2014 08:30:31 -0500

Hey Dan;

I apologize for the belated reply; Andy is on holiday and I spaced out and
didn’t respond to this sooner.

Is the host still in this state, or did it self recover? We have tried to
reproduce the situation in our testing and have noticed it does happen, just
not in a regular or predictable way.

I don’t think we need to see any logs - but I will ask a question: can you
send the output of these two commands for all of the hosts you are testing
with where you see this issue:

> [psadmin@nettest ~]$ rpm -qa | grep iperf3
> iperf3-3.0.9-1.el6.x86_64
> iperf3-devel-3.0.9-1.el6.x86_64
> [psadmin@nettest ~]$ rpm -qa | grep bwctl
> bwctl-1.5.2-10.el6.x86_64
> bwctl-server-1.5.2-10.el6.x86_64
> bwctl-client-1.5.2-10.el6.x86_64
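
Once the package lists are collected from each host, they can be compared
mechanically. The following is only a sketch in Python: the helper names
(parse_packages, version_mismatches) are hypothetical, and splitting each
line as name-version-release.arch is an assumption based on the package
names shown above.

```python
import re

# Sample `rpm -qa` lines copied from the message above.
RPM_OUTPUT = """\
iperf3-3.0.9-1.el6.x86_64
iperf3-devel-3.0.9-1.el6.x86_64
bwctl-1.5.2-10.el6.x86_64
bwctl-server-1.5.2-10.el6.x86_64
bwctl-client-1.5.2-10.el6.x86_64
"""

# Assumed layout: name-version-release.arch, where the name may contain
# hyphens but the version and release may not.
PKG_RE = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)


def parse_packages(text):
    """Map package name -> (version, release) for each recognisable line."""
    packages = {}
    for line in text.splitlines():
        match = PKG_RE.match(line.strip())
        if match:
            packages[match.group("name")] = (
                match.group("version"),
                match.group("release"),
            )
    return packages


def version_mismatches(host_a, host_b):
    """Return packages whose (version, release) differs or is missing on one host."""
    names = sorted(set(host_a) | set(host_b))
    return {
        name: (host_a.get(name), host_b.get(name))
        for name in names
        if host_a.get(name) != host_b.get(name)
    }
```

Calling version_mismatches on the parsed output of two hosts returns an
empty dict when both run the same iperf3 and bwctl builds.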

Thanks;

-jason

On Dec 9, 2014, at 6:48 PM, Daniel Schmidt
<>
wrote:

> Seems to be happening again. Should I send logs? Thanks
>
> On Wed, Dec 3, 2014 at 2:22 PM, Andrew Lake
> <>
> wrote:
> Hi Dan,
>
> Looking through the latest logs, there are definitely tests that try to run
> but fail to complete (i.e. it's not a problem of writing to the database).
> We actually record failure events in the measurement archive now, and I see
> the failures recorded at times that seem to correspond to the gaps and seem
> to indicate iperf3 spit out no results or some type of malformed output:
> http://146.166.250.14/esmond/perfsonar/archive/dc3007648f8249deac86c3e366340290/failures/base?format=json
>
> Unfortunately your host fixed itself; otherwise running bwctl by hand would
> be useful to get a better picture of what was getting reported by
> bwctl/iperf3 when things are in the "bad" state. The timestamps don't seem
> to correlate with the "Peer cancelled test before expected" in the BWCTL
> logs and I am not seeing any other errors. It seems to happen for extended
> periods of time so maybe the best thing to do is keep an eye on it and see
> if we can catch it misbehaving and try running bwctl by hand when it
> happens? Sorry I don't have a better answer at this point.
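
Checking whether those BWCTL log lines line up with a gap in the graphs
can be done by pulling their timestamps out of the log. A minimal Python
sketch, assuming syslog-style lines like the one quoted later in this
thread; the function names are hypothetical, and the year has to be
supplied by the caller because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

# Matches e.g. "Dec 2 10:05:01 localhost bwctld[12565]: ... Peer cancelled test before expected"
CANCEL_RE = re.compile(
    r"(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\s+\S+\s+bwctld\[\d+\]:"
    r".*Peer cancelled test before expected"
)

# Log line as quoted in this thread, joined onto one line.
SAMPLE_LOG = (
    "Dec 2 10:05:01 localhost bwctld[12565]: FILE=endpoint.c, LINE=1314, "
    "PeerAgent: Peer cancelled test before expected\n"
)


def cancel_times(log_text, year):
    """Return a datetime for every 'Peer cancelled' line found in log_text."""
    times = []
    for line in log_text.splitlines():
        match = CANCEL_RE.search(line)
        if match and match.group("mon") in MONTHS:
            times.append(
                datetime(
                    year,
                    MONTHS[match.group("mon")],
                    int(match.group("day")),
                    int(match.group("h")),
                    int(match.group("m")),
                    int(match.group("s")),
                )
            )
    return times


def events_in_window(times, start, end):
    """Return the events that fall inside a gap window [start, end]."""
    return [t for t in times if start <= t <= end]
```

The gap windows themselves would have to come from the graphs or the
failure records in the measurement archive; nothing here queries either.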
>
> Thanks,
> Andy
>
> On Dec 3, 2014, at 1:25 PM, Daniel Schmidt
> <>
> wrote:
>
>> Thank you very much for your continued assistance.
>>
>> On Wed, Dec 3, 2014 at 11:07 AM, Jason Zurawski
>> <>
>> wrote:
>> Hi Dan;
>>
>> Thanks for sending the 2nd set of logs, and the other information. We are
>> still not able to determine what is going on, however. Could you send us
>> one more log file:
>>
>> /var/log/perfsonar/regular_testing.log
>>
>> This will help us see if the issue is related to the testing itself, or
>> the storage/graphing.
>>
>> Thanks;
>>
>> -jason
>>
>> On Dec 2, 2014, at 3:05 PM, Daniel Schmidt
>> <>
>> wrote:
>>
>> > If I forget to include something, please remind me
>> >
>> > * /etc/init.d/iptables stop, however, I would think that would have
>> > shown up on my bench test, no?
>> > * Included logs from other side
>> > * OWAMP? Could you give me a bit - I need to look that up, I don't see
>> > it
>> > * NTP looks fine to me, I'll post later in message
>> > * Yes, I did reverse the c & f. But.....
>> >
>> > I just did it again - look at this done on 2.2.2.2 (remote side)
>> >
>> > [root@localhost admin]# bwctl -f m -x -T iperf3 -t 30 -i 1 -c 1.1.1.1 -s 2.2.2.2
>> > bwctl: Using tool: iperf3
>> > bwctl: 37 seconds until test results available
>> >
>> > RECEIVER START
>> > -----------------------------------------------------------
>> > Server listening on 5601
>> > -----------------------------------------------------------
>> > Accepted connection from 2.2.2.2, port 45812
>> > [ 17] local 1.1.1.1 port 5601 connected to 2.2.2.2 port 52941
>> > [ ID] Interval Transfer Bandwidth
>> > [ 17] 0.00-1.00 sec 45.6 MBytes 382 Mbits/sec
>> > [ 17] 1.00-2.00 sec 47.3 MBytes 397 Mbits/sec
>> > [ 17] 2.00-3.00 sec 47.5 MBytes 398 Mbits/sec
>> > [ 17] 3.00-4.00 sec 36.8 MBytes 309 Mbits/sec
>> > [ 17] 4.00-5.00 sec 40.0 MBytes 336 Mbits/sec
>> > [ 17] 5.00-6.00 sec 31.5 MBytes 265 Mbits/sec
>> > [ 17] 6.00-7.00 sec 21.2 MBytes 178 Mbits/sec
>> > [ 17] 7.00-8.00 sec 30.2 MBytes 254 Mbits/sec
>> > [ 17] 8.00-9.00 sec 32.7 MBytes 274 Mbits/sec
>> > [ 17] 9.00-10.00 sec 36.6 MBytes 307 Mbits/sec
>> > [ 17] 10.00-11.00 sec 22.5 MBytes 189 Mbits/sec
>> > [ 17] 11.00-12.00 sec 31.1 MBytes 261 Mbits/sec
>> > [ 17] 12.00-13.00 sec 3.36 MBytes 28.2 Mbits/sec
>> > [ 17] 13.00-14.00 sec 28.1 MBytes 236 Mbits/sec
>> > [ 17] 14.00-15.00 sec 43.1 MBytes 361 Mbits/sec
>> > [ 17] 15.00-16.00 sec 35.5 MBytes 297 Mbits/sec
>> > [ 17] 16.00-17.00 sec 42.2 MBytes 354 Mbits/sec
>> > [ 17] 17.00-18.00 sec 39.9 MBytes 335 Mbits/sec
>> > [ 17] 18.00-19.00 sec 8.22 MBytes 69.0 Mbits/sec
>> > [ 17] 19.00-20.00 sec 36.8 MBytes 309 Mbits/sec
>> > [ 17] 20.00-21.00 sec 39.9 MBytes 335 Mbits/sec
>> > [ 17] 21.00-22.00 sec 38.7 MBytes 325 Mbits/sec
>> > [ 17] 22.00-23.00 sec 13.5 MBytes 113 Mbits/sec
>> > [ 17] 23.00-24.00 sec 539 KBytes 4.41 Mbits/sec
>> > [ 17] 24.00-25.00 sec 617 KBytes 5.05 Mbits/sec
>> > [ 17] 25.00-26.00 sec 29.0 MBytes 243 Mbits/sec
>> > [ 17] 26.00-27.00 sec 16.5 MBytes 138 Mbits/sec
>> > [ 17] 27.00-28.00 sec 642 KBytes 5.26 Mbits/sec
>> > [ 17] 28.00-29.00 sec 10.7 MBytes 89.5 Mbits/sec
>> > [ 17] 29.00-30.00 sec 533 KBytes 4.37 Mbits/sec
>> > [ 17] 30.00-30.04 sec 21.2 KBytes 4.55 Mbits/sec
>> > - - - - - - - - - - - - - - - - - - - - - - - - -
>> > [ ID] Interval Transfer Bandwidth Retr
>> > [ 17] 0.00-30.04 sec 811 MBytes 226 Mbits/sec 220 sender
>> > [ 17] 0.00-30.04 sec 811 MBytes 226 Mbits/sec receiver
>> >
>> > RECEIVER END
>> >
>> > SENDER START
>> > Connecting to host 1.1.1.1, port 5601
>> > [ 16] local 2.2.2.2 port 52941 connected to 1.1.1.1 port 5601
>> > [ ID] Interval Transfer Bandwidth Retr Cwnd
>> > [ 16] 0.00-1.00 sec 47.8 MBytes 401 Mbits/sec 62 55.1 KBytes
>> > [ 16] 1.00-2.00 sec 47.4 MBytes 397 Mbits/sec 3 53.7 KBytes
>> > [ 16] 2.00-3.00 sec 46.9 MBytes 394 Mbits/sec 4 45.2 KBytes
>> > [ 16] 3.00-4.00 sec 36.5 MBytes 306 Mbits/sec 6 45.2 KBytes
>> > [ 16] 4.00-5.00 sec 39.8 MBytes 334 Mbits/sec 9 43.8 KBytes
>> > [ 16] 5.00-6.00 sec 32.3 MBytes 271 Mbits/sec 1 70.7 KBytes
>> > [ 16] 6.00-7.00 sec 19.7 MBytes 166 Mbits/sec 3 49.5 KBytes
>> > [ 16] 7.00-8.00 sec 30.1 MBytes 252 Mbits/sec 17 36.8 KBytes
>> > [ 16] 8.00-9.00 sec 34.5 MBytes 289 Mbits/sec 10 66.5 KBytes
>> > [ 16] 9.00-10.00 sec 36.1 MBytes 302 Mbits/sec 10 48.1 KBytes
>> > [ 16] 10.00-11.00 sec 22.7 MBytes 190 Mbits/sec 9 50.9 KBytes
>> > [ 16] 11.00-12.00 sec 31.0 MBytes 260 Mbits/sec 14 50.9 KBytes
>> > [ 16] 12.00-13.00 sec 2.72 MBytes 22.8 Mbits/sec 3 29.7 KBytes
>> > [ 16] 13.00-14.00 sec 28.9 MBytes 243 Mbits/sec 0 74.9 KBytes
>> > [ 16] 14.00-15.00 sec 42.4 MBytes 356 Mbits/sec 4 36.8 KBytes
>> > [ 16] 15.00-16.00 sec 35.7 MBytes 300 Mbits/sec 3 43.8 KBytes
>> > [ 16] 16.00-17.00 sec 42.0 MBytes 352 Mbits/sec 7 35.4 KBytes
>> > [ 16] 17.00-18.00 sec 40.8 MBytes 342 Mbits/sec 4 59.4 KBytes
>> > [ 16] 18.00-19.00 sec 7.26 MBytes 60.9 Mbits/sec 8 36.8 KBytes
>> > [ 16] 19.00-20.00 sec 37.0 MBytes 310 Mbits/sec 5 36.8 KBytes
>> > [ 16] 20.00-21.00 sec 40.8 MBytes 343 Mbits/sec 2 67.9 KBytes
>> > [ 16] 21.00-22.00 sec 38.5 MBytes 323 Mbits/sec 17 25.5 KBytes
>> > [ 16] 22.00-23.00 sec 11.7 MBytes 98.1 Mbits/sec 2 19.8 KBytes
>> > [ 16] 23.00-24.00 sec 488 KBytes 4.00 Mbits/sec 0 22.6 KBytes
>> > [ 16] 24.00-25.00 sec 650 KBytes 5.33 Mbits/sec 0 24.0 KBytes
>> > [ 16] 25.00-26.00 sec 30.3 MBytes 254 Mbits/sec 3 48.1 KBytes
>> > [ 16] 26.00-27.00 sec 15.2 MBytes 128 Mbits/sec 12 21.2 KBytes
>> > [ 16] 27.00-28.00 sec 682 KBytes 5.58 Mbits/sec 0 24.0 KBytes
>> > [ 16] 28.00-29.00 sec 10.7 MBytes 89.6 Mbits/sec 2 18.4 KBytes
>> > [ 16] 29.00-30.00 sec 488 KBytes 4.00 Mbits/sec 0 21.2 KBytes
>> > - - - - - - - - - - - - - - - - - - - - - - - - -
>> > [ ID] Interval Transfer Bandwidth Retr
>> > [ 16] 0.00-30.00 sec 811 MBytes 227 Mbits/sec 220 sender
>> > [ 16] 0.00-30.00 sec 811 MBytes 227 Mbits/sec receiver
>> >
>> > This seems to indicate a problem with my circuit, as I see this in no
>> > other location. But my root question remains: why would this cause my
>> > graph to freak out & stop graphing when it gets this data?
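
To see at a glance where a run like the one above stalls, the
per-interval lines can be parsed and compared against the best interval.
This is a minimal Python sketch: the function names and the 10% cutoff
are assumptions for illustration, not part of bwctl or iperf3, and the
sample lines are copied from the sender output above.

```python
import re

# Matches iperf3 interval lines such as
# "[ 16] 12.00-13.00 sec 2.72 MBytes 22.8 Mbits/sec 3 29.7 KBytes"
INTERVAL_RE = re.compile(
    r"\[\s*\d+\]\s+(?P<start>\d+\.\d+)-(?P<end>\d+\.\d+)\s+sec\s+"
    r"[\d.]+\s+[KMG]?Bytes\s+(?P<rate>[\d.]+)\s+(?P<unit>[KMG]?)bits/sec"
)
TO_MBITS = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}

SAMPLE = """\
[ 16] 11.00-12.00 sec 31.0 MBytes 260 Mbits/sec 14 50.9 KBytes
[ 16] 12.00-13.00 sec 2.72 MBytes 22.8 Mbits/sec 3 29.7 KBytes
[ 16] 13.00-14.00 sec 28.9 MBytes 243 Mbits/sec 0 74.9 KBytes
[ 16] 22.00-23.00 sec 11.7 MBytes 98.1 Mbits/sec 2 19.8 KBytes
[ 16] 23.00-24.00 sec 488 KBytes 4.00 Mbits/sec 0 22.6 KBytes
[ 16] 0.00-30.00 sec 811 MBytes 227 Mbits/sec 220 sender
"""


def parse_intervals(text, max_length=1.5):
    """Return (start, end, mbits_per_sec) tuples, skipping summary lines.

    Lines covering more than max_length seconds are treated as summaries;
    this assumes the test was run with a 1 second reporting interval (-i 1).
    """
    intervals = []
    for line in text.splitlines():
        match = INTERVAL_RE.search(line)
        if not match:
            continue
        start, end = float(match.group("start")), float(match.group("end"))
        if end - start > max_length:
            continue
        rate = float(match.group("rate")) * TO_MBITS[match.group("unit")]
        intervals.append((start, end, rate))
    return intervals


def low_intervals(intervals, fraction=0.1):
    """Return intervals whose rate is below fraction * the best interval."""
    if not intervals:
        return []
    best = max(rate for _, _, rate in intervals)
    return [iv for iv in intervals if iv[2] < fraction * best]
```

On the sample, low_intervals(parse_intervals(SAMPLE)) picks out the
12.00-13.00 and 23.00-24.00 intervals, the same dips visible by eye.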
>> >
>> > I can't remember the iperf and nuttcp command line options off hand, if
>> > you think they would be helpful, I'll go read up & do them. Sorry - not
>> > purposefully being lazy, I just have to put this on the back burner for
>> > a few hours. Where is thrulay? I can remember how to use thrulay.
>> >
>> > Thanks for your time.
>> >
>> >
>> > NTP results:
>> > ifco
>> > [root@localhost admin]# ntpq -p -c rv
>> > remote refid st t when poll reach delay offset jitter
>> > ==============================================================================
>> > -nms-rlat.chic.n 141.142.143.138 2 u 55 1024 377 27.404 -0.670 0.619
>> > +eth-1.nms-rlat. .IRIG. 1 u 590 1024 377 53.188 -0.156 0.236
>> > -nms-rlat.losa.n .CDMA. 1 u 671 1024 377 56.644 14.561 0.207
>> > +nms-rlat.newy32 .CDMA. 1 u 1016 1024 377 54.850 0.031 0.360
>> > -chronos.es.net .CDMA. 1 u 715 1024 377 52.939 0.814 0.220
>> > *saturn.es.net .CDMA. 1 u 457 1024 377 30.607 -0.196 5.333
>> > associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
>> > version="ntpd
>> >
>> > Sat Nov 23 18:21:48 UTC 2013 (1)",
>> > processor="x86_64", system="Linux/2.6.32-504.1.3.el6.aufs.web100.x86_64",
>> > leap=00, stratum=2, precision=-23, rootdelay=30.607, rootdisp=54.621,
>> > refid=198.129.252.38,
>> > reftime=d8288af8.d66b698e Tue, Dec 2 2014 12:01:12.837,
>> > clock=d82890dd.8f707797 Tue, Dec 2 2014 12:26:21.560, peer=60532,
>> > tc=10, mintc=3, offset=-0.109, frequency=1.125, sys_jitter=0.123,
>> > clk_jitter=0.164, clk_wander=0.007
>> >
>> > [root@localhost admin]# ntpq -p -c rv
>> > remote refid st t when poll reach delay offset jitter
>> > ==============================================================================
>> > -nms-rlat.chic.n 141.142.143.138 2 u 928 1024 377 27.017 -0.995 0.932
>> > +nms-rlat.hous.n .IRIG. 1 u 25 1024 377 52.872 -0.338 0.189
>> > -nms-rlat.losa.n .CDMA. 1 u 517 1024 377 56.317 14.298 0.320
>> > -nms-rlat.newy32 .CDMA. 1 u 907 1024 377 54.451 -0.227 0.271
>> > +chronos.es.net .CDMA. 1 u 156 1024 377 54.627 -0.438 0.321
>> > *saturn.es.net .CDMA. 1 u 19 1024 377 30.346 -0.456 0.304
>> > associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
>> > version="ntpd
>> >
>> > Sat Nov 23 18:21:48 UTC 2013 (1)",
>> > processor="x86_64", system="Linux/2.6.32-504.1.3.el6.aufs.web100.x86_64",
>> > leap=00, stratum=2, precision=-23, rootdelay=30.346, rootdisp=36.512,
>> > refid=198.129.252.38,
>> > reftime=d8289155.99671fb2 Tue, Dec 2 2014 12:28:21.599,
>> > clock=d8289168.31de148f Tue, Dec 2 2014 12:28:40.194, peer=13268,
>> > tc=10, mintc=3, offset=-0.413, frequency=-0.412, sys_jitter=0.067,
>> > clk_jitter=0.133, clk_wander=0.020
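
The peer listings above can be checked mechanically for the selected
system peer (tally code '*') and for any peer with a large offset. A
minimal Python sketch follows; the function names and the 5 ms cutoff
are assumptions for illustration, and the sample rows are copied from
the second listing with each peer on a single line.

```python
# Tally codes ntpq prints in the first column of each peer line.
TALLY_CODES = set("*+-#ox.")

SAMPLE = """\
remote refid st t when poll reach delay offset jitter
==============================================================================
-nms-rlat.chic.n 141.142.143.138 2 u 928 1024 377 27.017 -0.995 0.932
+nms-rlat.hous.n .IRIG. 1 u 25 1024 377 52.872 -0.338 0.189
-nms-rlat.losa.n .CDMA. 1 u 517 1024 377 56.317 14.298 0.320
*saturn.es.net .CDMA. 1 u 19 1024 377 30.346 -0.456 0.304
"""


def parse_peers(text):
    """Parse `ntpq -p` peer lines into dicts (offset is in milliseconds).

    Assumes the first character of each line is the tally column, as ntpq
    prints it. If leading whitespace was stripped from a line with a blank
    tally, a remote name starting with a tally character would be misread.
    """
    peers = []
    for line in text.splitlines():
        if not line:
            continue
        tally = line[0] if line[0] in TALLY_CODES else ""
        fields = (line[1:] if tally else line).split()
        if len(fields) != 10:
            continue  # separator or wrapped line
        try:
            offset = float(fields[8])
        except ValueError:
            continue  # header line
        peers.append({"tally": tally, "remote": fields[0], "offset_ms": offset})
    return peers


def large_offsets(peers, limit_ms=5.0):
    """Return peers whose absolute offset exceeds limit_ms."""
    return [p for p in peers if abs(p["offset_ms"]) > limit_ms]


def system_peer(peers):
    """Return the peer marked '*' (the current system peer), or None."""
    for peer in peers:
        if peer["tally"] == "*":
            return peer
    return None
```

On the sample, only nms-rlat.losa.n exceeds the cutoff, and ntpd has
already marked it as an outlier ('-'), so it does not affect the clock.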
>> >
>> >
>> >
>> > On Tue, Dec 2, 2014 at 10:36 AM, Jason Zurawski
>> > <>
>> > wrote:
>> > Hey Dan;
>> >
>> > Looking through the logs, the only suspect thing I see are lines of this
>> > nature:
>> >
>> > > Dec 2 10:05:01 localhost bwctld[12565]: FILE=endpoint.c, LINE=1314,
>> > > PeerAgent: Peer cancelled test before expected
>> >
>> > Unfortunately that tells us the ‘what’ but not the ‘how’. Could you
>> > also send the logs from the other host you are using? That host may
>> > have more details about what is going on. Couple other things that came
>> > to mind:
>> >
>> > >> * No firewall between A & B
>> >
>> > IPTables may be on for both sides; a quick and dirty test would be to
>> > just disable it and see if that helps.
>> >
>> > >> * I'm not familiar with "slots." There are few throughput tests
>> > >> running though. (Tests running 33% of time)
>> >
>> > Ok, this won’t be the issue I was thinking of.
>> >
>> > >> * I assumed packet loss was an issue. So, I setup smokeping on both
>> > >> sides, 5 every 30 seconds, 1472 MTU. However, I'm not getting loss.
>> >
>> > Do you have OWAMP going between the two hosts? If you don’t, I would
>> > suggest setting up that test too. OWAMP uses UDP packets which may give
>> > a different clue than the ICMP that smokeping would use.
>> >
>> > >> * PsPerformance comes with ntp on - appears to be running, they have
>> > >> the same time & these machines are not behind any firewalls.
>> >
>> > Could you send the output of ‘ntpq -p -c rv’ for both?
>> >
>> > >> * I am not seeing the issue on command line bwctl. Strange.
>> >
>> > Could you try the reverse direction as well - e.g. swap the hosts for
>> > the -c and -s flags? Also try using ‘iperf’ and ‘nuttcp’ as the tool
>> > instead of ‘iperf3’.
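
The full set of by-hand tests suggested here (three tools, both
directions) can be generated rather than typed. A minimal Python sketch
that only builds the command lines and does not run them; the function
name is hypothetical, and the flags are the ones used in the bwctl
command quoted elsewhere in this thread.

```python
from itertools import product


def bwctl_commands(host_a, host_b, tools=("iperf3", "iperf", "nuttcp"),
                   seconds=30, interval=1):
    """Build one bwctl argv list per (tool, direction) combination.

    -c names the receiving host and -s the sending host, matching the
    RECEIVER/SENDER output shown for `-c 1.1.1.1 -s 2.2.2.2` in this thread.
    """
    directions = [(host_a, host_b), (host_b, host_a)]
    commands = []
    for tool, (receiver, sender) in product(tools, directions):
        commands.append([
            "bwctl", "-f", "m", "-x",
            "-T", tool,
            "-t", str(seconds),
            "-i", str(interval),
            "-c", receiver,
            "-s", sender,
        ])
    return commands


if __name__ == "__main__":
    for argv in bwctl_commands("HOST1", "HOST2"):
        print(" ".join(argv))
```

Each list can be passed to subprocess.run on a host where bwctl is
installed; HOST1 and HOST2 are placeholders, as in the original command.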
>> >
>> > Thanks;
>> >
>> > -jason
>> >
>> > On Dec 2, 2014, at 12:16 PM, Daniel Schmidt
>> > <>
>> > wrote:
>> >
>> > > Thank you kindly for your reply. Some short responses:
>> > >
>> > > * No firewall between A & B
>> > > * I'm not familiar with "slots." There are few throughput tests
>> > > running though. (Tests running 33% of time)
>> > > * I assumed packet loss was an issue. So, I setup smokeping on both
>> > > sides, 5 every 30 seconds, 1472 MTU. However, I'm not getting loss.
>> > > * PsPerformance comes with ntp on - appears to be running, they have
>> > > the same time & these machines are not behind any firewalls.
>> > > * I am not seeing the issue on command line bwctl. Strange.
>> > > * Cacti minute graphs don't show any strange usage on the ICX switch.
>> > >
>> > > I would suspect hardware, but the boxes ran a solid line for hours
>> > > on my bench test. Please forgive me, but I'm reluctant to give the
>> > > IPs as I haven't really figured out how I would prevent hackers from
>> > > using these machines to DoS me. (Does anybody have to mitigate this
>> > > issue? Sorry - off topic question) However, I'd be happy to
>> > > privately give you root on the box.
>> > >
>> > > I have attached a png of what I see. You can see the lines vary
>> > > greatly, and around 9:30 the throughput suddenly decided to start
>> > > working again. I have also attached the log, substituting 1.1.1.1 for
>> > > local and 2.2.2.2 for remote.
>> > >
>> > > Many thanks,
>> > > -Dan
>> > >
>> > > On Mon, Dec 1, 2014 at 4:24 PM, Jason Zurawski
>> > > <>
>> > > wrote:
>> > > Hey Daniel;
>> > >
>> > > Would you be able to provide a link to your node, or send along a
>> > > screenshot, to give us a better idea of what you are seeing?
>> > >
>> > > Off the top of my head, here are a couple of typical reasons that
>> > > tests could fail:
>> > >
>> > > - Firewalls in the path denying access to ports, or not enough
>> > > ports available for the number of tests that are running
>> > >
>> > > - Lack of testing ‘slots’ available on one side or the other
>> > >
>> > > - NTP synchronization issues
>> > >
>> > > - Packet loss that prevents the test from starting or
>> > > finishing.
>> > >
>> > > If you send along your /var/log/perfsonar/owamp_bwctl.log file, we can
>> > > have a look to see what may be menacing your node. The other thing
>> > > you can try is some by-hand tests, something like:
>> > >
>> > > bwctl -f m -x -T iperf3 -t 30 -i 1 -c HOST1 -s HOST2
>> > >
>> > > Thanks;
>> > >
>> > > -jason
>> > >
>> > > On Dec 1, 2014, at 5:43 PM, Daniel Schmidt
>> > > <>
>> > > wrote:
>> > >
>> > > > I've noticed strange behavior on our throughput tests at one site.
>> > > > Sometimes the graphs turn unidirectional, i.e. one direction stops
>> > > > working. Sometimes both directions stop working. The times are
>> > > > random. The site is verified up by ping and passes traffic, but
>> > > > the throughput graphs vary greatly. (We believe this is
>> > > > due to issues with this circuit.)
>> > > >
>> > > > I've only seen it do this in this one case. It's almost like it
>> > > > gets angry that the speed varies vastly and gives up.
>> > > >
>> > > > Has anybody else encountered this? Any ideas greatly appreciated.
>>
>> E-Mail to and from me, in connection with the transaction
>> of public business, is subject to the Wyoming Public Records
>> Act and may be disclosed to third parties.
>>
>>
>> <regular_testing.tgz>
>
>


