ndt-users - Re: Using NDT with 10 gigabit interfaces

From: Brian Tierney
To: Rich Carlson
Subject: Re: Using NDT with 10 gigabit interfaces
Date: Wed, 1 Jun 2011 06:56:44 -0700

Ah, right, I forgot that interrupt coalescing was off for a reason.

It's still surprising to me that throughput is 18x faster in one direction than
the other with coalescing off.
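
The ratios discussed in this thread can be re-derived from the web100clt and mpstat figures quoted further down. The snippet below is only a quick arithmetic check; the dictionary keys and the `asymmetry` helper are illustrative names, and the interrupt figures are the CPU 0 `intr/s` values from the two mpstat samples.

```python
# Figures copied from the web100clt and mpstat output quoted in this thread.
coalescing_off = {"outbound_mbps": 7748.44, "inbound_mbps": 425.89,
                  "intr_per_s": 191739.50}
coalescing_on = {"outbound_mbps": 9394.79, "inbound_mbps": 2523.48,
                 "intr_per_s": 9907.96}


def asymmetry(result):
    """Return outbound throughput divided by inbound throughput."""
    return result["outbound_mbps"] / result["inbound_mbps"]


ratio_off = asymmetry(coalescing_off)
ratio_on = asymmetry(coalescing_on)
# How much the CPU 0 interrupt rate dropped once coalescing was enabled.
intr_reduction = coalescing_off["intr_per_s"] / coalescing_on["intr_per_s"]

print(f"asymmetry, coalescing off: {ratio_off:.1f}x")
print(f"asymmetry, coalescing on:  {ratio_on:.1f}x")
print(f"CPU 0 interrupt reduction: {intr_reduction:.1f}x")
```

The results line up with the rounded "18x", "4x" and "around 20x" figures used in the messages.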

On Jun 1, 2011, at 6:10 AM, Rich Carlson wrote:

> Brian;
>
> The NDT server tries to determine the bottleneck link capacity by timing
> every packet. Either the NIC needs to add timestamps, or the BPF needs to.
> In order to get the BPF timestamps, the NIC needs to forward every packet
> as it arrives (no coalescing).
>
> I would not be opposed to a better bottleneck detection algorithm that
> reduces the need for per-packet forwarding by the NIC. However, until that
> exists, turning on coalescing will disable the bottleneck link detection
> function.
>
> I will again note that NDT was not meant to be the ultimate bandwidth
> tester. It was designed to give you a quick look at the e2e path so you
> can determine if further investigation is required. A 10% hit over a 10 G
> path with the details to show that TCP ran slow start up over the link
> capacity and then went into congestion avoidance mode should be enough to
> show the link isn't the problem. If you want the max throughput numbers, but
> no data to back them up, then run iperf, nuttcp, ...
>
> Rich
>
> On 5/31/2011 11:11 PM, Brian Tierney wrote:
>>
>>
>> On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
>>
>>> I am just guessing here but NDT is actually quite busy: it reads
>>> Web100 vars every millisecond, and runs 2 different packet capture
>>> tools. Although one would hope that all of these activities run in
>>> different cores it would not surprise me to discover that the maximum
>>> data rate is somewhat depressed.
>>>
>>> Web100 and related tools can't do any meaningful performance debugging
>>> when the bottleneck is very fine grained resource contention within
>>> the sender itself, especially CPU, bus bandwidth and lock contention.
>>
>>
>> This seems plausible to me, and I think it explains the asymmetry (which I
>> was not clear about in my last email):
>>
>> using the web100clt tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms)
>>
>> I consistently see results similar to this:
>>
>> running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
>> running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
>>
>> while iperf is consistently around 8.3 Gbps both directions
>>
>> (results are the same if I swap client and server hosts, btw)
>>
>>
>> vmstat output from server during 'client to server' testing:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>> 1 0 0 3000756 153796 284476 0 0 0 0 275956 106682 2 27 71 0 0
>> 3 0 0 3000012 153796 284476 0 0 0 184 278421 125647 3 29 69 0 0
>> 2 0 0 3000016 153796 284492 0 0 0 0 281350 102942 2 27 71 0 0
>> 2 0 0 2999024 153796 284492 0 0 0 0 281674 103412 2 28 70 0 0
>> 2 0 0 2999768 153796 284492 0 0 0 0 281432 103257 2 27 71 0 0
>> 2 0 0 2999148 153796 284492 0 0 0 0 281082 102463 2 28 70 0 0
>> 2 0 0 2999148 153796 284492 0 0 0 56 281413 102872 2 27 71 0 0
>> 1 0 0 3001616 153796 284492 0 0 0 64 218677 114352 2 20 78 0 0
>>
>> vmstat output on server during 'server to client' testing:
>>
>>
>> 1 0 0 3002236 153796 284492 0 0 0 0 193199 142030 2 16 83 0 0
>> 0 0 0 3002484 153796 284492 0 0 0 0 193191 142068 2 15 83 0 0
>> 1 0 0 2999880 153796 284492 0 0 0 240 193065 142319 2 16 82 0 0
>> 1 0 0 2994672 153796 284492 0 0 0 0 193231 142132 2 16 83 0 0
>> 1 0 0 2993316 153796 284492 0 0 0 64 193451 142211 1 16 82 0 0
>> 1 0 0 2996664 153796 284492 0 0 0 0 191818 145425 2 15 83 0 0
>> 0 0 0 2996420 153796 284496 0 0 0 0 189887 143033 2 15 83 0 0
>>
>> Note the very high context switches per second (cs), particularly while
>> sending,
>>
>> and compare with iperf:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>> 1 0 0 3024664 153856 286472 0 0 0 0 213989 5348 0 11 89 0 0
>> 0 0 0 3024416 153856 286472 0 0 0 0 213440 4019 0 11 89 0 0
>> 0 0 0 3024168 153856 286472 0 0 0 0 213908 3239 0 11 89 0 0
>> 1 0 0 3023796 153856 286472 0 0 0 0 213721 2613 0 11 89 0 0
>> 2 0 0 3023548 153856 286472 0 0 0 48 213933 2113 0 11 89 0 0
>> 0 0 0 3022804 153856 286472 0 0 0 0 213921 1758 0 11 89 0 0
>> 0 0 0 3022432 153856 286472 0 0 0 0 213864 1531 0 12 88 0 0
>> 0 0 0 3021936 153856 286472 0 0 0 240 213558 1331 0 11 89 0 0
>> 2 0 0 3021564 153856 286472 0 0 0 0 213885 1202 0 11 89 0 0
>>
>>
>> Which is a dramatic difference in context switches (as expected due to
>> the web100 calls).
>> These hosts have 6 "Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz" CPUs.
>>
>> Using mpstat we see CPU load on 2 processors, and some additional
>> interrupts on a 3rd:
>>
>> 06:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
>> 06:43:40 PM 0 8.00 0.00 19.50 0.00 4.50 40.00 0.00 28.00 191739.50
>> 06:43:40 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>> 06:43:40 PM 2 2.01 0.00 18.59 0.00 0.00 7.54 0.00 71.86 0.00
>> 06:43:40 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
>> 06:43:40 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>> 06:43:40 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>
>>
>> So then I tried increasing the interrupt coalescing to 100ms (it was set
>> to 0), and this made a big difference:
>>
>> running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
>> running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
>>
>> and brought the number of intr/sec down by around 20x
>>
>> 08:06:53 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
>> 08:06:55 PM 0 5.47 0.00 14.43 0.00 1.49 22.39 0.00 56.22 9907.96
>> 08:06:55 PM 1 3.00 0.00 31.00 0.00 0.00 6.50 0.00 59.50 0.00
>> 08:06:55 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>> 08:06:55 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 996.02
>> 08:06:55 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>> 08:06:55 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
>>
>>
>>
>> But inbound is still 4x slower than outbound. (iperf is now 9.6 Gbps
>> both directions).
>>
>> Anyone know any other Myricom tuning knobs to try?
>>
>> Is the conclusion to all this: "to do NDT/web100 at 10G requires a
>> web10G kernel" ?
>>
>>
>>
>>
>>>
>>>
>>>
>>> On Sat, May 28, 2011 at 7:02 PM, Brian Tierney wrote:
>>>>
>>>>
>>>> I'm seeing the same thing (much higher results reported by iperf
>>>> compared to NDT)
>>>>
>>>> Is this expected?
>>>>
>>>>
>>>> On May 2, 2011, at 8:26 AM, Nat Stoddard wrote:
>>>>
>>>>> Dear members:
>>>>> I have tried several approaches to use NDT on a server with a 10 gigabit
>>>>> interface. I wonder if there are any limitations on the server to client
>>>>> tests. I have not been able to get more than around 2.6 gigs
>>>>> server-to-client. The client-to-server test can go over 9 gigs even
>>>>> without extensive tuning. On the same server, I can get over 9 gigs in
>>>>> each direction to a neighbor server using iperf tests.
>>>>>
>>>>> Are there any tips on running NDT on a 10gig capable server?
>>>>>
>>>>> Thanks,
>>>>> Nat Stoddard
>>>>
>>>>
>>>>
>>
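
Rich's point about per-packet timing can be illustrated with a generic packet-dispersion calculation. This is a sketch under stated assumptions, not NDT's actual detection code: the function names, the 1500-byte packet size, the batch size of 4 and the 10 ns stamp spacing inside a batch are all made up for illustration. It shows that when each packet is timestamped as it arrives, the inter-arrival gap reflects the link rate, while timestamps taken only when a coalesced batch is handed up no longer do.

```python
def capacity_estimates_mbps(arrival_times_s, packet_bytes):
    """Estimate link rate (Mb/s) from each consecutive packet pair.

    rate = bits in one packet / gap between the two arrivals.
    Pairs with a zero or negative gap are skipped because they carry
    no usable timing information.
    """
    estimates = []
    for earlier, later in zip(arrival_times_s, arrival_times_s[1:]):
        gap = later - earlier
        if gap > 0:
            estimates.append(packet_bytes * 8 / gap / 1e6)
    return estimates


def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical input: 1500-byte packets arriving back to back on a
# 10 Gb/s link, i.e. one packet every 1.2 microseconds.
PACKET_BYTES = 1500
SPACING_S = PACKET_BYTES * 8 / 10e9
per_packet = [i * SPACING_S for i in range(8)]

# Same packets, but stamped only when a coalesced interrupt delivers a
# batch of 4: stamps inside a batch end up about 10 ns apart (assumed).
coalesced = []
for i in range(8):
    batch_delivery = (i // 4 + 1) * 4 * SPACING_S
    coalesced.append(batch_delivery + (i % 4) * 10e-9)

per_packet_mbps = median(capacity_estimates_mbps(per_packet, PACKET_BYTES))
coalesced_mbps = median(capacity_estimates_mbps(coalesced, PACKET_BYTES))
print(f"per-packet timestamps: {per_packet_mbps:.0f} Mb/s")
print(f"coalesced timestamps:  {coalesced_mbps:.0f} Mb/s")
```

With per-packet stamps the estimate lands on the assumed 10 Gb/s link rate; with batched stamps it is wildly inflated, which is the trade-off between accurate bottleneck detection and a lower interrupt load described above.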
