perfsonar-user - Re: [perf-node-users] Re: [perfsonar-user] Help with inconsistent bwctl measurements
Subject: perfSONAR User Q&A and Other Discussion
- From: "Roderick Mooi" <>
- To: "Jason Zurawski" <>, "Shawn McKee" <>
- Cc: "perf-node-users" <>, <>
- Subject: Re: [perf-node-users] Re: [perfsonar-user] Help with inconsistent bwctl measurements
- Date: Thu, 17 Oct 2013 15:30:51 +0200
Hey Jason, Shawn
Just FYI the network between the 2 nodes looks like this:
           jnb2pe1      jnb2(p1)     pta1p1           192.96.2.241
                                                       (gateway)
switch ----- router ===== router ----- router === DWDM -- router
  |            |                                            |
perfsonara   155.232.40.58                            192.96.2.247
196.21.48.249
So 196.21.48.249 (perfsonara) and 155.232.40.58 are only 1 hop apart
(different machines though)...
I'm checking out the ECMP with one of our network engineers. In the meantime,
I looked at the traceroute results and made similar observations:
https://192.96.2.247/toolkit/gui/psTracerouteViewer/index.cgi?mahost=http%3A%2F%2Flocalhost%3A8086%2FperfSONAR_PS%2Fservices%2FtracerouteMA&stime=yesterday&etime=now&tzselect=Africa%2FJohannesburg&epselect=c87e55d65b2f8349319ea8e75518982c
I see some "error:requestTimedOut" entries and also some "(ECMP)"s. Could
this be pointing us to the source of the problem? (I didn't learn much from
Googling this error previously)
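
For anyone following the ECMP theory, here is a rough sketch of why it could produce an all-or-nothing pattern per test. With ECMP, a router typically picks one of several equal-cost links per flow by hashing header fields, so every packet of one TCP connection follows the same link while a connection on a different port may land on another. The hash function (CRC32), the choice of fields, and the count of two parallel links below are all assumptions for illustration; real routers use vendor-specific hashing. The addresses and ports are those of the manual tests quoted further down.

```python
import zlib


def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Pick one of n_paths equal-cost links for a flow (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # CRC32 is a stand-in; real routers use vendor-specific hash functions and inputs.
    return zlib.crc32(key) % n_paths


if __name__ == "__main__":
    # Addresses and ports come from the tests in this thread;
    # the number of parallel links (2) is an assumption.
    for port in (5149, 5152):
        link = ecmp_path("192.96.2.247", "196.21.48.249", port, port, "tcp", 2)
        print(f"port {port} -> link {link}")
```

If one of the parallel links were bad, only the flows hashed onto it would suffer, which would fit whole tests being either good or bad rather than degrading partway through.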
Something else that is concerning is that tracepath reports 60 hops back
(although testing from the other direction also takes 5 hops). In the reverse
direction this is reported correctly (5 hops to and from). I've seen this
before when the MTU was incorrect somewhere on the path, but that's not
applicable here.
$ tracepath perfsonara.sanren.ac.za
1: 192.96.2.247 (192.96.2.247) 0.052ms pmtu 1500
1: 192.96.2.241 (192.96.2.241) 0.776ms
1: 192.96.2.241 (192.96.2.241) 0.743ms
2: pta1p1-t0100-pta1pe1-t84.net.tenet.ac.za (155.232.6.129) 27.925ms asymm 6
3: jnb2-t75-pta1-t42.net.tenet.ac.za (155.232.6.25) 4.984ms asymm 5
4: jnb2pe1-te82-jnb2p1-t01200.net.tenet.ac.za (155.232.7.158) 2.202ms
5: 196.21.48.249 (196.21.48.249) 2.640ms reached
Resume: pmtu 1500 hops 5 back 60
$ tracepath 155.232.40.58
1: 192.96.2.247 (192.96.2.247) 0.059ms pmtu 1500
1: 192.96.2.241 (192.96.2.241) 3.563ms
1: 192.96.2.241 (192.96.2.241) 0.749ms
2: pta1p1-t01200-pta1pe1-t94.net.tenet.ac.za (155.232.6.137) 3.842ms asymm 6
3: jnb2-t75-pta1-t42.net.tenet.ac.za (155.232.6.25) 3.423ms asymm 5
4: jnb2pe1-t101-jnb2p1-t0100.net.tenet.ac.za (155.232.7.154) 2.187ms
5: 155.232.40.58 (155.232.40.58) 2.057ms reached
Resume: pmtu 1500 hops 5 back 60
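
To compare runs like these over time, a small parser can pull out the "Resume" summary and the per-hop "asymm" annotations. This is only a sketch: the regular expressions are written against the output format shown above and may need adjusting for other tracepath versions, and the function name is made up for this example. Note that the tracepath man page describes the "back" figure as a guess, so a mismatch is a hint to investigate rather than proof of an asymmetric path.

```python
import re

RESUME_RE = re.compile(r"Resume:\s+pmtu\s+(\d+)\s+hops\s+(\d+)\s+back\s+(\d+)")
ASYMM_RE = re.compile(r"^\s*(\d+):.*\basymm\s+(\d+)")


def parse_tracepath(output):
    """Return the Resume summary and any per-hop 'asymm' annotations."""
    result = {"pmtu": None, "hops": None, "back": None, "asymm": {}}
    for line in output.splitlines():
        m = RESUME_RE.search(line)
        if m:
            result["pmtu"], result["hops"], result["back"] = map(int, m.groups())
            continue
        m = ASYMM_RE.match(line)
        if m:
            # Map hop number -> hop count tracepath guessed for the return path.
            result["asymm"][int(m.group(1))] = int(m.group(2))
    return result


# Output copied from the first run above (wrapped lines rejoined).
SAMPLE = """\
 2:  pta1p1-t0100-pta1pe1-t84.net.tenet.ac.za (155.232.6.129)  27.925ms asymm 6
 3:  jnb2-t75-pta1-t42.net.tenet.ac.za (155.232.6.25)   4.984ms asymm 5
 4:  jnb2pe1-te82-jnb2p1-t01200.net.tenet.ac.za (155.232.7.158)   2.202ms
 5:  196.21.48.249 (196.21.48.249)   2.640ms reached
     Resume: pmtu 1500 hops 5 back 60
"""

if __name__ == "__main__":
    summary = parse_tracepath(SAMPLE)
    if summary["hops"] != summary["back"]:
        print(f"forward {summary['hops']} hops, back guess {summary['back']}")
```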
I don't see any dropped packets on the router interfaces 155.232.6.129 and
155.232.7.158. Power levels etc also check out ok...
How do I check if the machines are happy with their drivers?
Thanks!
--
Roderick Mooi | SANREN Engineer
--
| +27 12 841 4111 | www.sanren.ac.za
>>> On 2013-10-17 at 13:29, Jason Zurawski wrote:
> Hey Roderick;
>
> Besides Shawn's suggestion (which is good, and something to look into), I
> would add the classic suggestions of being sure the cables/fibers are clean
> and un-crimped, and that the local machines are happy with their drivers.
>
> Digging a little more, I was comparing these two graphs (for the second, be
> sure to check 'show reverse direction data', and maybe slide zoom in on a
> 1-2 hr chunk):
>
> http://192.96.2.247/serviceTest/bandwidthGraph.cgi?url=http://localhost:8085/perfSONAR_PS/services/pSB&key=d9013ce7df20b8bbe45defeaeae785d6&keyR=0a0ed6c928edf28976414a2cc7e87d6f&dstIP=192.96.2.247&srcIP=196.21.48.249&dst=192.96.2.247&src=perfsonara.sanren.ac.za&type=TCP&length=7776000
>
> https://192.96.2.247/serviceTest/delayGraph.cgi?url=http://localhost:8085/perfSONAR_PS/services/pSB&key=de3625b0c481ef8338aab14be049313d&keyR=2d31fa42a62b9188f88046b2f24b7510&dstIP=192.96.2.247&srcIP=196.21.48.249&dst=192.96.2.247&src=196.21.48.249&type=TCP&length=604800&bucket_width=0.0001
>
> Zooming in on the OWAMP graph shows the near constant loss in the
> 192.96.2.247 -> 196.21.48.249 direction, which matches what BWCTL notes. The
> only appreciable difference I can see is when doing traceroutes originating
> from 192.96.2.247 and going to 196.21.48.249 and 155.232.40.58:
>
> http://192.96.2.247/toolkit/gui/reverse_traceroute.cgi?target=196.21.48.249&function=traceroute
> http://192.96.2.247/toolkit/gui/reverse_traceroute.cgi?target=155.232.40.58&function=traceroute
>
> While basically the same, hops 2 and 4 report a slightly different answer
> (which lends credence to the ECMP idea - or a bad interface).
>
> Thanks;
>
> -jason
>
> On Oct 17, 2013, at 5:37 AM, Shawn McKee wrote:
>
>> Could there be some kind of ECMP (Equal Cost Multi-Pathing) between this
>> source and destination and one of the alternate links is not good?
>>
>> Shawn
>>
>>
>> On Thu, Oct 17, 2013 at 5:22 AM, Roderick Mooi wrote:
>> Hi Alan, Eli
>>
>> I'm not seeing fluctuations in "good" or "bad" measurements.
>>
>> Good:
>>
>> RECEIVER START
>> bwctl: exec_line: iperf -B 196.21.48.249 -s -f m -m -p 5152 -t 20 -i 1
>> bwctl: start_tool: 3590989992.044877
>> ------------------------------------------------------------
>> Server listening on TCP port 5152
>> Binding to local address 196.21.48.249
>> TCP window size: 0.08 MByte (default)
>> ------------------------------------------------------------
>> [ 14] local 196.21.48.249 port 5152 connected with 192.96.2.247 port 5152
>> [ ID] Interval Transfer Bandwidth
>> [ 14] 0.0- 1.0 sec 111 MBytes 929 Mbits/sec
>> [ 14] 1.0- 2.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
>> [ 14] 3.0- 4.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 4.0- 5.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 5.0- 6.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 6.0- 7.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 7.0- 8.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 8.0- 9.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 9.0-10.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 10.0-11.0 sec 112 MBytes 942 Mbits/sec
>> [ 14] 11.0-12.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 12.0-13.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 13.0-14.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 14.0-15.0 sec 112 MBytes 942 Mbits/sec
>> [ 14] 15.0-16.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 16.0-17.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 17.0-18.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 18.0-19.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 19.0-20.0 sec 112 MBytes 941 Mbits/sec
>> [ 14] 0.0-20.5 sec 2298 MBytes 941 Mbits/sec
>> [ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
>> bwctl: stop_exec: 3590990016.831918
>>
>> RECEIVER END
>>
>>
>> Bad:
>>
>> RECEIVER START
>> bwctl: exec_line: iperf -B 196.21.48.249 -s -f m -m -p 5149 -t 20 -i 1
>> bwctl: start_tool: 3590989696.322229
>> ------------------------------------------------------------
>> Server listening on TCP port 5149
>> Binding to local address 196.21.48.249
>> TCP window size: 0.08 MByte (default)
>> ------------------------------------------------------------
>> [ 14] local 196.21.48.249 port 5149 connected with 192.96.2.247 port 5149
>> [ ID] Interval Transfer Bandwidth
>> [ 14] 0.0- 1.0 sec 12.8 MBytes 107 Mbits/sec
>> [ 14] 1.0- 2.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 14] 2.0- 3.0 sec 13.9 MBytes 116 Mbits/sec
>> [ 14] 3.0- 4.0 sec 18.1 MBytes 152 Mbits/sec
>> [ 14] 4.0- 5.0 sec 14.7 MBytes 124 Mbits/sec
>> [ 14] 5.0- 6.0 sec 16.1 MBytes 135 Mbits/sec
>> [ 14] 6.0- 7.0 sec 14.9 MBytes 125 Mbits/sec
>> [ 14] 7.0- 8.0 sec 10.3 MBytes 86.3 Mbits/sec
>> [ 14] 8.0- 9.0 sec 16.6 MBytes 139 Mbits/sec
>> [ 14] 9.0-10.0 sec 19.7 MBytes 165 Mbits/sec
>> [ 14] 10.0-11.0 sec 15.0 MBytes 126 Mbits/sec
>> [ 14] 11.0-12.0 sec 21.2 MBytes 178 Mbits/sec
>> [ 14] 12.0-13.0 sec 13.3 MBytes 112 Mbits/sec
>> [ 14] 13.0-14.0 sec 12.2 MBytes 102 Mbits/sec
>> [ 14] 14.0-15.0 sec 12.7 MBytes 107 Mbits/sec
>> [ 14] 15.0-16.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 14] 16.0-17.0 sec 10.9 MBytes 91.6 Mbits/sec
>> [ 14] 17.0-18.0 sec 13.5 MBytes 114 Mbits/sec
>> [ 14] 18.0-19.0 sec 11.7 MBytes 97.8 Mbits/sec
>> [ 14] 19.0-20.0 sec 12.0 MBytes 100 Mbits/sec
>> [ 14] 0.0-20.1 sec 282 MBytes 118 Mbits/sec
>> [ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
>> bwctl: stop_exec: 3590989721.229266
>>
>> RECEIVER END
>>
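
One way to put a number on the "clean and stable" versus "wild fluctuation" distinction for the two runs above is to compute the mean and the coefficient of variation of the per-second rates. The sketch below parses iperf interval lines in the format shown in this thread; the regular expression, the function names, and the 1.5 second cutoff used to skip the summary row are assumptions made for this example, and only the first few rows of each run are included as sample data.

```python
import re
import statistics

ROW_RE = re.compile(
    r"\[\s*\d+\]\s+(\d+\.\d+)-\s*(\d+\.\d+)\s+sec\s+[\d.]+\s+MBytes\s+([\d.]+)\s+Mbits/sec"
)


def interval_rates(report, max_len=1.5):
    """Extract per-interval Mbits/sec values, skipping the whole-run summary row."""
    rates = []
    for line in report.splitlines():
        m = ROW_RE.search(line)
        if not m:
            continue
        start, end, mbps = map(float, m.groups())
        if end - start <= max_len:
            rates.append(mbps)
    return rates


def summarize(rates):
    """Return (mean, coefficient of variation) for a list of rates."""
    mean = statistics.mean(rates)
    return mean, statistics.pstdev(rates) / mean


# First rows of the "good" and "bad" runs quoted above, plus their summary rows.
GOOD = """
[ 14] 0.0- 1.0 sec 111 MBytes 929 Mbits/sec
[ 14] 1.0- 2.0 sec 112 MBytes 941 Mbits/sec
[ 14] 2.0- 3.0 sec 112 MBytes 942 Mbits/sec
[ 14] 0.0-20.5 sec 2298 MBytes 941 Mbits/sec
"""
BAD = """
[ 14] 0.0- 1.0 sec 12.8 MBytes 107 Mbits/sec
[ 14] 1.0- 2.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 14] 2.0- 3.0 sec 13.9 MBytes 116 Mbits/sec
[ 14] 3.0- 4.0 sec 18.1 MBytes 152 Mbits/sec
[ 14] 0.0-20.1 sec 282 MBytes 118 Mbits/sec
"""

if __name__ == "__main__":
    for name, report in (("good", GOOD), ("bad", BAD)):
        mean, cv = summarize(interval_rates(report))
        print(f"{name}: mean {mean:.1f} Mbits/sec, cv {cv:.3f}")
```

A low coefficient of variation at a low mean would point toward a clean bottleneck or a host issue, while a high one would be more consistent with loss, following the rule of thumb quoted below from Eli.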
>> Referring to my complementary email to Brian and Ivan, do you have further
>> suggestions?
>>
>> " I'm still puzzled by the fact that all tests between these 2 nodes and
>> other nodes on the network path are fine (i.e. I don't see this up-down
>> behaviour).
>> [see:
>> http://perfsonara.sanren.ac.za/serviceTest/index.cgi?eventType=bwctl
>> and
>> https://192.96.2.247/serviceTest/index.cgi?eventType=bwctl
>> ]
>> "
>>
>> Thanks!
>>
>> Roderick
>>
>> >>> On 2013-10-16 at 19:07, Eli Dart wrote:
>>
>> >
>> > On 10/16/13 9:46 AM, Alan Whinery wrote:
>> >> You might also reveal something useful by using periodic reports in your
>> >> bwctl invocations (like " -i 1 ") you may find that the per second
>> >> reports show burstiness, or the lack of it.
>> >
>> > I find this to be very very helpful.
>> >
>> > There is a big difference between a clean ramp to a stable speed, and
>> > wild fluctuation that is averaged.
>> >
>> > A clean ramp to a stable speed argues against the presence of packet
>> > loss. If performance is poor but stable, I would check the hosts and
>> > the application, and then check for a clean bottleneck link. Wild
>> > fluctuation points toward loss - check your router and switch buffers.
>> > (And if "poor but stable" means fluctuating between 10Kbps and 30Kbps,
>> > there is probably loss too :)
>> >
>> > None of this is set in stone of course. However, I find that telling
>> > bwctl to give periodic reports is very helpful indeed.
>> >
>> > --eli
>> >
>> >
>> >>
>> >> On 10/16/2013 6:26 AM, Wefel, Paul wrote:
>> >>> Couple ideas
>> >>>
>> >>> Run owamp between these two hosts looking for packet loss in only one
>> >>> direction.
>> >>> Check the switch interface that Dst is connected to looking for queue
>> >>> drops and pause frames being sent.
>> >>>
>> >>> I have also seen strange issues with some NICs when offloading is
>> >>> enabled on them.
>> >>>
>> >>> good luck, let us know what you find.
>> >>>
>> >>> -paul
>> >>> NCSA @ UIUC
>> >>>
>> >>> -----Original Message-----
>> >>> From: Roderick Mooi
>> >>> Date: Wednesday, October 16, 2013 5:07 AM
>> >>> Subject: [perfsonar-user] Help with inconsistent bwctl measurements
>> >>>
>> >>>> Hi
>> >>>>
>> >>>> I have been trying to locate the cause of inconsistent measurements
>> >>>> between two nodes for a few weeks now without success. The pattern I'm
>> >>>> seeing is available at:
>> >>>>
>> >>>> https://192.96.2.247/serviceTest/bandwidthGraph.cgi?url=http://localhost:8085/perfSONAR_PS/services/pSB&key=d9013ce7df20b8bbe45defeaeae785d6&keyR=0a0ed6c928edf28976414a2cc7e87d6f&dstIP=192.96.2.247&srcIP=196.21.48.249&dst=192.96.2.247&src=perfsonara.sanren.ac.za&type=TCP&length=2592000
>> >>>>
>> >>>> Src-Dst is consistent but Dst-Src is not.
>> >>>>
>> >>>> Manual tests (attached) show the same behaviour without any
>> >>>> indication of cause - measures 941 Mbps then drops to 189 Mbps (end)
>> >>>> and back to 941 (nothing different in the logs between "good"
>> >>>> measurements and "bad" ones). The only time I've seen something
>> >>>> similar is when I was testing from a 10 G interface to a 1 G interface
>> >>>> which was subsequently being flooded. In this case both interfaces are
>> >>>> 1 G. I'm also not seeing any problems with measurements along the path
>> >>>> or between these nodes and any other nodes. Additionally, there is
>> >>>> very little (< 50 Mbps) real traffic between these 2 nodes.
>> >>>>
>> >>>> Any ideas?
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>> Roderick
>> >>>>
>> >>>> --
>> >>>> Roderick Mooi | SANREN Engineer
>> >>>> --
>> >>>>
>> >>>> | +27 12 841 4111 | www.sanren.ac.za
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> This message is subject to the CSIR's copyright terms and conditions,
>> >>>> e-mail legal notice, and implemented Open Document Format (ODF)
>> >>>> standard.
>> >>>> The full disclaimer details can be found at
>> >>>> http://www.csir.co.za/disclaimer.html.
>> >>>>
>> >>>> This message has been scanned for viruses and dangerous content by
>> >>>> MailScanner,
>> >>>> and is believed to be clean.
>> >>>>
>> >>>> Please consider the environment before printing this email.
>> >>>>
>> >>>
>> >>
>> >
>> > --
>> > Eli Dart, Network Engineer NOC: (510) 486-7600
>> > ESnet Office of the CTO (AS293) (800) 333-7638
>> > Lawrence Berkeley National Laboratory
>> > PGP Key fingerprint = C970 F8D3 CFDD 8FFF 5486 343A 2D31 4478 5F82 B2B3
--
This message is subject to the CSIR's copyright terms and conditions, e-mail
legal notice, and implemented Open Document Format (ODF) standard.
The full disclaimer details can be found at
http://www.csir.co.za/disclaimer.html.
This message has been scanned for viruses and dangerous content by
MailScanner,
and is believed to be clean.
Please consider the environment before printing this email.