ndt-users - Re: Using NDT with 10 gigabit interfaces
- From: John Heffner <>
- To: Aaron Brown <>
- Cc: Brian Tierney <>, Matt Mathis <>, NDT users <>
- Subject: Re: Using NDT with 10 gigabit interfaces
- Date: Fri, 3 Jun 2011 10:25:04 -0400
Brian, did you try out Aaron's suggestion?
Another thing to try would be to comment out the lock_sock and
unlock_sock lines in fs/proc/web100.c:connection_file_rw() in the
kernel. This will get rid of the web100/TCP lock contention at the
expense of no longer providing correct atomic snapshots. It might be
worth a try to see what the performance impact is.
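For intuition, here is a small userspace demo of that tradeoff (illustration
only, not the web100 kernel code; the struct and field names are made up).
With use_lock = 0, mimicking the commented-out locks, the reader stops
contending with the writer but can observe torn, internally inconsistent
snapshots:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for a connection's web100 variables. */
struct stats { long pkts_out; long bytes_out; };

static struct stats live;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int use_lock = 0;  /* 0 mimics commenting out lock/unlock */

/* Writer plays the role of the TCP stack updating the vars; it
 * maintains the invariant bytes_out == pkts_out * 1500. */
static void *writer(void *arg)
{
    (void)arg;
    for (long i = 0; ; i++) {
        if (use_lock) pthread_mutex_lock(&lock);
        live.pkts_out = i;
        live.bytes_out = i * 1500;
        if (use_lock) pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    long torn = 0;
    for (int i = 0; i < 1000000; i++) {
        struct stats snap;
        if (use_lock) pthread_mutex_lock(&lock);
        memcpy(&snap, &live, sizeof snap);   /* the "snapshot" read */
        if (use_lock) pthread_mutex_unlock(&lock);
        if (snap.bytes_out != snap.pkts_out * 1500)
            torn++;   /* inconsistent (torn) snapshot observed */
    }
    printf("torn snapshots: %ld of 1000000\n", torn);
    return 0;
}

(Build with gcc -pthread; with use_lock = 1 the torn count goes to zero,
at the cost of the reader and writer serializing on the mutex.)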
-John
On Wed, Jun 1, 2011 at 8:11 AM, Aaron Brown <> wrote:
>
> On May 31, 2011, at 11:11 PM, Brian Tierney wrote:
>
>
> On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
>
> I am just guessing here, but NDT is actually quite busy: it reads
> Web100 vars every millisecond and runs 2 different packet-capture
> tools. Although one would hope that all of these activities run on
> different cores, it would not surprise me to discover that the maximum
> data rate is somewhat depressed.
>
> Web100 and related tools can't do any meaningful performance debugging
> when the bottleneck is very fine-grained resource contention within
> the sender itself, especially CPU, bus bandwidth and lock contention.
>
>
> This seems plausible to me, and I think it explains the asymmetry (which I
> was not clear about in my last email):
> using the web100clt tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms),
> I consistently see results similar to this:
> running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
> running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
>
> while iperf is consistently around 8.3 Gbps in both directions
> (results are the same if I swap client and server hosts, btw)
>
> vmstat output from server during 'client to server' testing:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd    free   buff  cache   si   so    bi    bo     in     cs us sy id wa st
>  1  0      0 3000756 153796 284476    0    0     0     0 275956 106682  2 27 71  0  0
>  3  0      0 3000012 153796 284476    0    0     0   184 278421 125647  3 29 69  0  0
>  2  0      0 3000016 153796 284492    0    0     0     0 281350 102942  2 27 71  0  0
>  2  0      0 2999024 153796 284492    0    0     0     0 281674 103412  2 28 70  0  0
>  2  0      0 2999768 153796 284492    0    0     0     0 281432 103257  2 27 71  0  0
>  2  0      0 2999148 153796 284492    0    0     0     0 281082 102463  2 28 70  0  0
>  2  0      0 2999148 153796 284492    0    0     0    56 281413 102872  2 27 71  0  0
>  1  0      0 3001616 153796 284492    0    0     0    64 218677 114352  2 20 78  0  0
> vmstat output on server during 'server to client' testing:
>
>  1  0      0 3002236 153796 284492    0    0     0     0 193199 142030  2 16 83  0  0
>  0  0      0 3002484 153796 284492    0    0     0     0 193191 142068  2 15 83  0  0
>  1  0      0 2999880 153796 284492    0    0     0   240 193065 142319  2 16 82  0  0
>  1  0      0 2994672 153796 284492    0    0     0     0 193231 142132  2 16 83  0  0
>  1  0      0 2993316 153796 284492    0    0     0    64 193451 142211  1 16 82  0  0
>  1  0      0 2996664 153796 284492    0    0     0     0 191818 145425  2 15 83  0  0
>  0  0      0 2996420 153796 284496    0    0     0     0 189887 143033  2 15 83  0  0
>
> Note the very high context-switches-per-second values (cs), particularly
> while sending, and compare with iperf:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd    free   buff  cache   si   so    bi    bo     in     cs us sy id wa st
>  1  0      0 3024664 153856 286472    0    0     0     0 213989   5348  0 11 89  0  0
>  0  0      0 3024416 153856 286472    0    0     0     0 213440   4019  0 11 89  0  0
>  0  0      0 3024168 153856 286472    0    0     0     0 213908   3239  0 11 89  0  0
>  1  0      0 3023796 153856 286472    0    0     0     0 213721   2613  0 11 89  0  0
>  2  0      0 3023548 153856 286472    0    0     0    48 213933   2113  0 11 89  0  0
>  0  0      0 3022804 153856 286472    0    0     0     0 213921   1758  0 11 89  0  0
>  0  0      0 3022432 153856 286472    0    0     0     0 213864   1531  0 12 88  0  0
>  0  0      0 3021936 153856 286472    0    0     0   240 213558   1331  0 11 89  0  0
>  2  0      0 3021564 153856 286472    0    0     0     0 213885   1202  0 11 89  0  0
>
> Which is a dramatic difference in context switches (as expected, due to the
> web100 calls): roughly 100,000-145,000/s under NDT versus 1,000-5,000/s
> under iperf.
> These hosts have 6 "Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz" CPUs.
> Using mpstat, we see CPU load on 2 processors, and some additional
> interrupts on a 3rd:
> 06:43:38 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 06:43:40 PM    0    8.00    0.00   19.50    0.00    4.50   40.00    0.00   28.00 191739.50
> 06:43:40 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 06:43:40 PM    2    2.01    0.00   18.59    0.00    0.00    7.54    0.00   71.86      0.00
> 06:43:40 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.00
> 06:43:40 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 06:43:40 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
>
> So then I tried increasing the interrupt coalescing to 100ms (it was set to
> 0), and this made a big difference:
> running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
> running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
> and brought the number of intr/sec down by around 20x:
> 08:06:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:06:55 PM    0    5.47    0.00   14.43    0.00    1.49   22.39    0.00   56.22   9907.96
> 08:06:55 PM    1    3.00    0.00   31.00    0.00    0.00    6.50    0.00   59.50      0.00
> 08:06:55 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 08:06:55 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.02
> 08:06:55 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 08:06:55 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
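>
> (For anyone reproducing this: the standard knob is "ethtool -C <if>
> rx-usecs <N>". Below is a minimal C sketch of the same setting via the
> ETHTOOL ioctl; the interface name "eth2" is a placeholder, the value is
> in microseconds, it needs root, and whether the Myricom driver honors
> rx-usecs is an assumption.)
>
> #include <stdio.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/socket.h>
> #include <net/if.h>
> #include <linux/ethtool.h>
> #include <linux/sockios.h>
>
> int main(void)
> {
>     int fd = socket(AF_INET, SOCK_DGRAM, 0);
>     struct ifreq ifr;
>     struct ethtool_coalesce ec;
>
>     memset(&ifr, 0, sizeof ifr);
>     strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);  /* placeholder NIC */
>     ifr.ifr_data = (void *)&ec;
>
>     memset(&ec, 0, sizeof ec);
>     ec.cmd = ETHTOOL_GCOALESCE;        /* read the current setting */
>     if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }
>     printf("rx-usecs was %u\n", ec.rx_coalesce_usecs);
>
>     ec.cmd = ETHTOOL_SCOALESCE;        /* write the new setting */
>     ec.rx_coalesce_usecs = 100000;     /* 100000 us = 100 ms */
>     if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }
>     return 0;
> }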
>
>
> But inbound is still 4x slower than outbound (iperf is now 9.6 Gbps in
> both directions).
> Does anyone know any other Myricom tuning knobs to try?
> Is the conclusion to all this: "to do NDT/web100 at 10G requires a web10G
> kernel"?
>
> What happens if you set interrupt coalescing to zero again, but change the
> "snap delay" from 5 to 20? On a toolkit host, you should be able to edit
> "/etc/sysconfig/ndt" and add "--snapdelay 20" to the WEB100SRV_OPTIONS line.
> That should decrease how often NDT collects web100 data from once every
> 5 ms to once every 20 ms (note: I've no clue how this will affect the
> quality of the data it collects).
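>
> For concreteness, the edit would look something like this (a sketch;
> keep whatever other options your install already has on that line):
>
>     # in /etc/sysconfig/ndt
>     WEB100SRV_OPTIONS="--snapdelay 20"
>
> At 5 ms per snapshot that is ~2000 web100 reads over a 10 s test; at
> 20 ms it drops to ~500.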
> Cheers,
> Aaron
>
> On Sat, May 28, 2011 at 7:02 PM, Brian Tierney <> wrote:
>
> I'm seeing the same thing (much higher results reported by iperf compared
> to NDT).
>
> Is this expected?
>
> On May 2, 2011, at 8:26 AM, <> wrote:
>
> Dear members:
>
> I have tried several approaches to use NDT on a server with a 10 gigabit
> interface. I wonder if there are any limitations on the server-to-client
> tests. I have not been able to get more than around 2.6 gigs
> server-to-client. The client-to-server test can go over 9 gigs even without
> extensive tuning. On the same server, I can get over 9 gigs in each
> direction to a neighbor server using iperf tests.
>
> Are there any tips on running NDT on a 10gig-capable server?
>
> Thanks,
>
> Nat Stoddard