
ndt-users - Re: Using NDT with 10 gigabit interfaces





  • From: Rich Carlson <>
  • To:
  • Subject: Re: Using NDT with 10 gigabit interfaces
  • Date: Wed, 01 Jun 2011 09:10:04 -0400

Brian;

The NDT server tries to determine the bottleneck link capacity by timing every packet. Either the NIC needs to add timestamps, or the BPF needs to. In order to get the BPF timestamps, the NIC needs to forward every packet as it arrives (no coalescing).

I would not be opposed to a better bottleneck detection algorithm that reduces the need for per-packet forwarding by the NIC. However, until that exists, turning on coalescing will disable the bottleneck link detection function.
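On Linux the relevant knob is the NIC's interrupt coalescing setting, which ethtool can inspect and change. A minimal sketch (the interface name is a placeholder, and which parameters a given driver accepts varies):

  # show the NIC's current coalescing settings
  ethtool -c eth2

  # disable receive coalescing so the NIC forwards every packet as it arrives
  ethtool -C eth2 rx-usecs 0 rx-frames 1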

I will again note that NDT was not meant to be the ultimate bandwidth tester. It was designed to give you a quick look at the e2e path so you can determine if further investigation is required. A 10% hit over a 10 G path, with details showing that TCP ran slow-start up to the link capacity and then went into congestion avoidance mode, should be enough to show the link isn't the problem. If you want the max throughput number, but no data to back it up, then run iperf, nuttcp, ...
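For example, typical invocations look something like this (hostnames are placeholders):

  iperf -s                          # on the server
  iperf -c ndt2.example.net -t 10   # on the client, 10-second test

  nuttcp -S                         # or nuttcp's server mode
  nuttcp -T10 ndt2.example.net      # 10-second transmit test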

Rich

On 5/31/2011 11:11 PM, Brian Tierney wrote:


On May 31, 2011, at 8:10 AM, Matt Mathis wrote:

I am just guessing here, but NDT is actually quite busy: it reads Web100 vars every millisecond and runs 2 different packet capture tools. Although one would hope that all of these activities run on different cores, it would not surprise me to discover that the maximum data rate is somewhat depressed.

Web100 and related tools can't do any meaningful performance debugging when the bottleneck is very fine-grained resource contention within the sender itself, especially CPU, bus bandwidth, and lock contention.


This seems plausible to me, and I think it explains the asymmetry (which I was not clear about in my last email):

Using the web100clt tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms), I consistently see results similar to this:

running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
running 10s inbound test (server to client) . . . . . . 425.89 Mb/s

while iperf is consistently around 8.3 Gbps in both directions

(results are the same if I swap client and server hosts, btw)
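For reference, the NDT numbers come from the command-line client, invoked roughly like this (the server hostname is a placeholder, and exact options vary by NDT version):

  web100clt -n ndt-server.example.net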


vmstat output from server during 'client to server' testing:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3000756 153796 284476 0 0 0 0 275956 106682 2 27 71 0 0
3 0 0 3000012 153796 284476 0 0 0 184 278421 125647 3 29 69 0 0
2 0 0 3000016 153796 284492 0 0 0 0 281350 102942 2 27 71 0 0
2 0 0 2999024 153796 284492 0 0 0 0 281674 103412 2 28 70 0 0
2 0 0 2999768 153796 284492 0 0 0 0 281432 103257 2 27 71 0 0
2 0 0 2999148 153796 284492 0 0 0 0 281082 102463 2 28 70 0 0
2 0 0 2999148 153796 284492 0 0 0 56 281413 102872 2 27 71 0 0
1 0 0 3001616 153796 284492 0 0 0 64 218677 114352 2 20 78 0 0

vmstat output on server during 'server to client' testing (same columns as above):


1 0 0 3002236 153796 284492 0 0 0 0 193199 142030 2 16 83 0 0
0 0 0 3002484 153796 284492 0 0 0 0 193191 142068 2 15 83 0 0
1 0 0 2999880 153796 284492 0 0 0 240 193065 142319 2 16 82 0 0
1 0 0 2994672 153796 284492 0 0 0 0 193231 142132 2 16 83 0 0
1 0 0 2993316 153796 284492 0 0 0 64 193451 142211 1 16 82 0 0
1 0 0 2996664 153796 284492 0 0 0 0 191818 145425 2 15 83 0 0
0 0 0 2996420 153796 284496 0 0 0 0 189887 143033 2 15 83 0 0

Note the very high context switches per second (cs column), particularly while sending.

and compare with iperf:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3024664 153856 286472 0 0 0 0 213989 5348 0 11 89 0 0
0 0 0 3024416 153856 286472 0 0 0 0 213440 4019 0 11 89 0 0
0 0 0 3024168 153856 286472 0 0 0 0 213908 3239 0 11 89 0 0
1 0 0 3023796 153856 286472 0 0 0 0 213721 2613 0 11 89 0 0
2 0 0 3023548 153856 286472 0 0 0 48 213933 2113 0 11 89 0 0
0 0 0 3022804 153856 286472 0 0 0 0 213921 1758 0 11 89 0 0
0 0 0 3022432 153856 286472 0 0 0 0 213864 1531 0 12 88 0 0
0 0 0 3021936 153856 286472 0 0 0 240 213558 1331 0 11 89 0 0
2 0 0 3021564 153856 286472 0 0 0 0 213885 1202 0 11 89 0 0


That is a dramatic difference in context switches (as expected, due to the web100 calls). These hosts have six "Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz" CPUs.
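The vmstat samples above were collected with something like the following, run on the server while each test was in progress (the 1-second interval is arbitrary):

  vmstat 1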

Using mpstat, we see CPU load on 2 processors, and some additional interrupts on a 3rd:

06:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
06:43:40 PM 0 8.00 0.00 19.50 0.00 4.50 40.00 0.00 28.00 191739.50
06:43:40 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
06:43:40 PM 2 2.01 0.00 18.59 0.00 0.00 7.54 0.00 71.86 0.00
06:43:40 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
06:43:40 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
06:43:40 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
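Per-CPU breakdowns like this come from sysstat's mpstat, invoked with something like (the 2-second interval is arbitrary):

  mpstat -P ALL 2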


So then I tried increasing the interrupt coalescing to 100ms (it was set to 0), and this made a big difference:

running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s

and brought the number of intr/sec down by around 20x:

08:06:53 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
08:06:55 PM 0 5.47 0.00 14.43 0.00 1.49 22.39 0.00 56.22 9907.96
08:06:55 PM 1 3.00 0.00 31.00 0.00 0.00 6.50 0.00 59.50 0.00
08:06:55 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
08:06:55 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 996.02
08:06:55 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
08:06:55 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
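A coalescing change like this is typically made with ethtool. A sketch, assuming the driver takes the value in microseconds (the interface name is a placeholder and the accepted range is driver-specific):

  ethtool -C eth2 rx-usecs 100000   # ~100 ms between receive interrupts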



But inbound is still 4x slower than outbound. (iperf is now 9.6 Gbps in both directions.)

Anyone know any other Myricom tuning knobs to try?

Is the conclusion to all this: "to do NDT/web100 at 10G requires a web10G kernel"?

On Sat, May 28, 2011 at 7:02 PM, Brian Tierney wrote:


I'm seeing the same thing (much higher results reported by iperf compared to NDT).

Is this expected?


On May 2, 2011, at 8:26 AM, Nat Stoddard wrote:

Dear members:
I have tried several approaches to use NDT on a server with a 10 gigabit interface. I wonder if there are any limitations on the server-to-client tests. I have not been able to get more than around 2.6 gigs server-to-client. The client-to-server test can go over 9 gigs even without extensive tuning. On the same server, I can get over 9 gigs in each direction to a neighbor server using iperf tests.

Are there any tips on running NDT on a 10gig-capable server?

Thanks,
Nat Stoddard






