
Re: Using NDT with 10 gigabit interfaces


  • From: Matt Mathis <>
  • To: Brian Tierney <>
  • Cc: Rich Carlson <>, NDT users <>
  • Subject: Re: Using NDT with 10 gigabit interfaces
  • Date: Wed, 1 Jun 2011 15:05:54 -0400

Be aware that there is an 8:1 conversion bug in some versions of the
Java client. It only affects the displayed data rate, not the rate
computed and saved at the server.
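
(A worked example, assuming the bug is a bits-vs-bytes mixup; the ratio
is all that is stated above. A transfer that actually ran at 3400 Mb/s
could display as 3400 / 8 = 425 Mb/s, while the server-side log would
still record the correct 3400 Mb/s.)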

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay




On Wed, Jun 1, 2011 at 9:56 AM, Brian Tierney <> wrote:
>
> Ah, right, I forgot that interrupt coalescing was off for a reason.
>
> It's still surprising to me that throughput is 18x faster in one direction
> than the other with coalescing off.
>
>
> On Jun 1, 2011, at 6:10 AM, Rich Carlson wrote:
>
>> Brian;
>>
>> The NDT server tries to determine the bottleneck link capacity by timing
>> every packet.  Either the NIC needs to add timestamps, or the BPF needs
>> to.  In order to get the BPF timestamps, the NIC needs to forward every
>> packet as it arrives (no coalescing).
>>
>> I would not be opposed to a better bottleneck detection algorithm that
>> reduces the need for per-packet forwarding by the NIC.  However, until
>> that exists, turning coalescing on will disable the bottleneck link
>> detection function.
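>>
>> As a sketch, coalescing is typically toggled with ethtool (eth2 is a
>> placeholder interface name here, and exact parameters vary by driver):
>>
>>   # show the current interrupt coalescing settings
>>   ethtool -c eth2
>>   # disable receive coalescing so BPF can timestamp every packet
>>   ethtool -C eth2 rx-usecs 0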
>>
>> I will again note that NDT was not meant to be the ultimate bandwidth
>> tester.  It was designed to give you a quick look at the e2e path so you
>> can determine if further investigation is required.  A 10% hit over a 10G
>> path, with the details to show that TCP ran slow start up over the link
>> capacity and then went into congestion avoidance mode, should be enough
>> to show the link isn't the problem.  If you want the max throughput
>> numbers, but no data to back them up, then run iperf, nuttcp, ...
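>>
>> As a reference sketch (the server name below is a placeholder), a minimal
>> iperf (v2) run of that kind looks like:
>>
>>   # on the server
>>   iperf -s
>>   # on the client: a 10-second test, then the same in the reverse direction
>>   iperf -c ndt-server.example.net -t 10
>>   iperf -c ndt-server.example.net -t 10 -r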
>>
>> Rich
>>
>> On 5/31/2011 11:11 PM, Brian Tierney wrote:
>>>
>>>
>>> On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
>>>
>>>> I am just guessing here, but NDT is actually quite busy: it reads
>>>> Web100 vars every millisecond and runs 2 different packet capture
>>>> tools.  Although one would hope that all of these activities run on
>>>> different cores, it would not surprise me to discover that the maximum
>>>> data rate is somewhat depressed.
>>>>
>>>> Web100 and related tools can't do any meaningful performance debugging
>>>> when the bottleneck is very fine-grained resource contention within
>>>> the sender itself, especially CPU, bus bandwidth, and lock contention.
>>>
>>>
>>> This seems plausible to me, and I think it explains the asymmetry (which
>>> I was not clear about in my last email):
>>>
>>> Using the web100clt tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms),
>>> I consistently see results similar to this:
>>>
>>> running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
>>> running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
>>>
>>> while iperf is consistently around 8.3 Gbps in both directions
>>>
>>> (results are the same if I swap client and server hosts, btw)
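>>>
>>> For anyone reproducing this: the command-line NDT client is typically
>>> invoked along these lines, with the server name again a placeholder:
>>>
>>>   web100clt -n ndt-server.example.net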
>>>
>>>
>>> vmstat output from server during 'client to server' testing:
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>>> 1 0 0 3000756 153796 284476 0 0 0 0 275956 106682 2 27 71 0 0
>>> 3 0 0 3000012 153796 284476 0 0 0 184 278421 125647 3 29 69 0 0
>>> 2 0 0 3000016 153796 284492 0 0 0 0 281350 102942 2 27 71 0 0
>>> 2 0 0 2999024 153796 284492 0 0 0 0 281674 103412 2 28 70 0 0
>>> 2 0 0 2999768 153796 284492 0 0 0 0 281432 103257 2 27 71 0 0
>>> 2 0 0 2999148 153796 284492 0 0 0 0 281082 102463 2 28 70 0 0
>>> 2 0 0 2999148 153796 284492 0 0 0 56 281413 102872 2 27 71 0 0
>>> 1 0 0 3001616 153796 284492 0 0 0 64 218677 114352 2 20 78 0 0
>>>
>>> vmstat output on server during 'server to client' testing (same columns):
>>>
>>>
>>> 1 0 0 3002236 153796 284492 0 0 0 0 193199 142030 2 16 83 0 0
>>> 0 0 0 3002484 153796 284492 0 0 0 0 193191 142068 2 15 83 0 0
>>> 1 0 0 2999880 153796 284492 0 0 0 240 193065 142319 2 16 82 0 0
>>> 1 0 0 2994672 153796 284492 0 0 0 0 193231 142132 2 16 83 0 0
>>> 1 0 0 2993316 153796 284492 0 0 0 64 193451 142211 1 16 82 0 0
>>> 1 0 0 2996664 153796 284492 0 0 0 0 191818 145425 2 15 83 0 0
>>> 0 0 0 2996420 153796 284496 0 0 0 0 189887 143033 2 15 83 0 0
>>>
>>> Note the very high context-switches-per-second values (cs), particularly
>>> while sending,
>>>
>>> and compare with iperf:
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>>> 1 0 0 3024664 153856 286472 0 0 0 0 213989 5348 0 11 89 0 0
>>> 0 0 0 3024416 153856 286472 0 0 0 0 213440 4019 0 11 89 0 0
>>> 0 0 0 3024168 153856 286472 0 0 0 0 213908 3239 0 11 89 0 0
>>> 1 0 0 3023796 153856 286472 0 0 0 0 213721 2613 0 11 89 0 0
>>> 2 0 0 3023548 153856 286472 0 0 0 48 213933 2113 0 11 89 0 0
>>> 0 0 0 3022804 153856 286472 0 0 0 0 213921 1758 0 11 89 0 0
>>> 0 0 0 3022432 153856 286472 0 0 0 0 213864 1531 0 12 88 0 0
>>> 0 0 0 3021936 153856 286472 0 0 0 240 213558 1331 0 11 89 0 0
>>> 2 0 0 3021564 153856 286472 0 0 0 0 213885 1202 0 11 89 0 0
>>>
>>>
>>> That is a dramatic difference in context switches, as expected given the
>>> per-millisecond web100 calls.  These hosts have 6 "Intel(R) Core(TM) i7
>>> CPU X 980 @ 3.33GHz" CPUs.
>>>
>>> Using mpstat, we see CPU load on 2 processors, and some additional
>>> interrupts on a 3rd:
>>>
>>> 06:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
>>> 06:43:40 PM 0 8.00 0.00 19.50 0.00 4.50 40.00 0.00 28.00 191739.50
>>> 06:43:40 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>> 06:43:40 PM 2 2.01 0.00 18.59 0.00 0.00 7.54 0.00 71.86 0.00
>>> 06:43:40 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
>>> 06:43:40 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>> 06:43:40 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
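>>>
>>> (Per-CPU output like the above typically comes from something like
>>> "mpstat -P ALL 2", which samples every processor at 2-second intervals;
>>> exact flags vary with the sysstat version.)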
>>>
>>>
>>> So then I tried increasing the interrupt coalescing to 100ms (it was set
>>> to 0), and this made a big difference:
>>>
>>> running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
>>> running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
>>>
>>> and brought the number of intr/sec down by around 20x:
>>>
>>> 08:06:53 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
>>> 08:06:55 PM 0 5.47 0.00 14.43 0.00 1.49 22.39 0.00 56.22 9907.96
>>> 08:06:55 PM 1 3.00 0.00 31.00 0.00 0.00 6.50 0.00 59.50 0.00
>>> 08:06:55 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>> 08:06:55 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 996.02
>>> 08:06:55 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>> 08:06:55 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>>
>>>
>>>
>>> But inbound is still 4x slower than outbound.  (iperf is now 9.6 Gbps in
>>> both directions.)
>>>
>>> Anyone know any other Myricom tuning knobs to try?
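>>>
>>> A few generic knobs that are often worth checking with ethtool, sketched
>>> with eth2 as a placeholder (the myri10ge driver may not expose all of
>>> them):
>>>
>>>   # view every tunable coalescing parameter the driver exposes
>>>   ethtool -c eth2
>>>   # raise the receive interrupt coalescing timer (in microseconds)
>>>   ethtool -C eth2 rx-usecs 100
>>>   # list offload settings (LRO, GRO, etc.), which also affect per-packet timing
>>>   ethtool -k eth2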
>>>
>>> Is the conclusion to all this: "to do NDT/web100 at 10G requires a
>>> web10G kernel" ?
>>>
>>>>
>>>> On Sat, May 28, 2011 at 7:02 PM, Brian Tierney <> wrote:
>>>>>
>>>>>
>>>>> I'm seeing the same thing (much higher results reported by iperf
>>>>> compared to NDT)
>>>>>
>>>>> Is this expected?
>>>>>
>>>>>
>>>>> On May 2, 2011, at 8:26 AM, <> wrote:
>>>>>
>>>>>> Dear members:
>>>>>> I have tried several approaches to use NDT on a server with a 10 gigabit
>>>>>> interface. I wonder if there are any limitations on the server-to-client
>>>>>> tests. I have not been able to get more than around 2.6 gigs
>>>>>> server-to-client. The client-to-server test can go over 9 gigs even
>>>>>> without extensive tuning. On the same server, I can get over 9 gigs in
>>>>>> each direction to a neighbor server using iperf tests.
>>>>>>
>>>>>> Are there any tips on running NDT on a 10gig capable server?
>>>>>>
>>>>>> Thanks,
>>>>>> Nat Stoddard
>>>>>
>>>>>
>>>>>
>>>
>
>


