ndt-users - Re: Using NDT with 10 gigabit interfaces


Re: Using NDT with 10 gigabit interfaces


  • From: John Heffner <>
  • To: Aaron Brown <>
  • Cc: Brian Tierney <>, Matt Mathis <>, NDT users <>
  • Subject: Re: Using NDT with 10 gigabit interfaces
  • Date: Fri, 3 Jun 2011 10:25:04 -0400

Brian, did you try out Aaron's suggestion?

Another thing to try would be to comment out the lock_sock and
unlock_sock calls in fs/proc/web100.c:connection_file_rw() in the
kernel. This gets rid of the web100/tcp lock contention at the
expense of no longer providing correct atomic snapshots. It might be
worth a try to see what the performance impact is.
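
For anyone who wants to experiment with this, a rough sketch of the
workflow; the source path and build steps are assumptions for a
web100-patched kernel tree, only the file and function names come
from the suggestion above:

    # locate the two calls inside connection_file_rw()
    # (path is a placeholder for wherever your web100 tree lives)
    cd /usr/src/linux-web100
    grep -nE 'lock_sock|unlock_sock' fs/proc/web100.c

    # comment out those calls by hand, then rebuild and install
    make -j"$(nproc)" bzImage modules
    sudo make modules_install install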

-John


On Wed, Jun 1, 2011 at 8:11 AM, Aaron Brown <> wrote:
>
> On May 31, 2011, at 11:11 PM, Brian Tierney wrote:
>
>
> On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
>
> I am just guessing here, but NDT is actually quite busy: it reads
> Web100 vars every millisecond and runs 2 different packet capture
> tools.  Although one would hope that all of these activities run on
> different cores, it would not surprise me to discover that the
> maximum data rate is somewhat depressed.
>
> Web100 and related tools can't do any meaningful performance
> debugging when the bottleneck is very fine-grained resource
> contention within the sender itself, especially CPU, bus bandwidth
> and lock contention.
>
>
> This seems plausible to me, and I think it explains the asymmetry
> (which I was not clear about in my last email). Using the web100clt
> tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms), I consistently
> see results similar to this:
> running 10s outbound test (client to server) . . . . .  7748.44 Mb/s
> running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
>
> while iperf is consistently around 8.3 Gbps in both directions
> (results are the same if I swap the client and server hosts, btw).
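>
> For reference, the two measurements can be reproduced with something
> like the following sketch (hostnames are placeholders; the flags are
> the standard web100clt and iperf options):
>
>     # NDT command-line client against the remote NDT server
>     web100clt -n ndt-host.example.net
>
>     # iperf comparison between the same pair of hosts
>     iperf -s                              # on the server
>     iperf -c ndt-host.example.net -t 10   # on the client
>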
>
> vmstat output from server during 'client to server' testing:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo     in     cs us sy id wa st
>  1  0      0 3000756 153796 284476    0    0     0     0 275956 106682  2 27 71  0  0
>  3  0      0 3000012 153796 284476    0    0     0   184 278421 125647  3 29 69  0  0
>  2  0      0 3000016 153796 284492    0    0     0     0 281350 102942  2 27 71  0  0
>  2  0      0 2999024 153796 284492    0    0     0     0 281674 103412  2 28 70  0  0
>  2  0      0 2999768 153796 284492    0    0     0     0 281432 103257  2 27 71  0  0
>  2  0      0 2999148 153796 284492    0    0     0     0 281082 102463  2 28 70  0  0
>  2  0      0 2999148 153796 284492    0    0     0    56 281413 102872  2 27 71  0  0
>  1  0      0 3001616 153796 284492    0    0     0    64 218677 114352  2 20 78  0  0
> vmstat output on server during 'server to client' testing:
>
>  1  0      0 3002236 153796 284492    0    0     0     0 193199 142030  2 16 83  0  0
>  0  0      0 3002484 153796 284492    0    0     0     0 193191 142068  2 15 83  0  0
>  1  0      0 2999880 153796 284492    0    0     0   240 193065 142319  2 16 82  0  0
>  1  0      0 2994672 153796 284492    0    0     0     0 193231 142132  2 16 83  0  0
>  1  0      0 2993316 153796 284492    0    0     0    64 193451 142211  1 16 82  0  0
>  1  0      0 2996664 153796 284492    0    0     0     0 191818 145425  2 15 83  0  0
>  0  0      0 2996420 153796 284496    0    0     0     0 189887 143033  2 15 83  0  0
>
> Note the very high context-switches-per-second (cs) values,
> particularly while sending, and compare with iperf:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo     in    cs us sy id wa st
>  1  0      0 3024664 153856 286472    0    0     0     0 213989 5348  0 11 89  0  0
>  0  0      0 3024416 153856 286472    0    0     0     0 213440 4019  0 11 89  0  0
>  0  0      0 3024168 153856 286472    0    0     0     0 213908 3239  0 11 89  0  0
>  1  0      0 3023796 153856 286472    0    0     0     0 213721 2613  0 11 89  0  0
>  2  0      0 3023548 153856 286472    0    0     0    48 213933 2113  0 11 89  0  0
>  0  0      0 3022804 153856 286472    0    0     0     0 213921 1758  0 11 89  0  0
>  0  0      0 3022432 153856 286472    0    0     0     0 213864 1531  0 12 88  0  0
>  0  0      0 3021936 153856 286472    0    0     0   240 213558 1331  0 11 89  0  0
>  2  0      0 3021564 153856 286472    0    0     0     0 213885 1202  0 11 89  0  0
>
> Which is a dramatic difference in context switches (as expected, due
> to the web100 calls). These hosts have 6 "Intel(R) Core(TM) i7 CPU
> X 980 @ 3.33GHz" CPUs. Using mpstat, we see CPU load on 2 processors
> and some additional interrupts on a 3rd:
> 06:43:38 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 06:43:40 PM    0    8.00    0.00   19.50    0.00    4.50   40.00    0.00   28.00 191739.50
> 06:43:40 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 06:43:40 PM    2    2.01    0.00   18.59    0.00    0.00    7.54    0.00   71.86      0.00
> 06:43:40 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.00
> 06:43:40 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 06:43:40 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
>
> So then I tried increasing the interrupt coalescing to 100ms (it was
> set to 0), and this made a big difference:
> running 10s outbound test (client to server) . . . . .  9394.79 Mb/s
> running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
> and brought the number of intr/sec down by around 20x:
> 08:06:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:06:55 PM    0    5.47    0.00   14.43    0.00    1.49   22.39    0.00   56.22   9907.96
> 08:06:55 PM    1    3.00    0.00   31.00    0.00    0.00    6.50    0.00   59.50      0.00
> 08:06:55 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 08:06:55 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.02
> 08:06:55 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 08:06:55 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
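>
> A sketch of the knob involved, in case it helps anyone reproduce
> this; the interface name is a placeholder and the value is
> illustrative (check the units your driver expects with "ethtool -c"
> before setting anything):
>
>     # show the current coalescing settings
>     ethtool -c eth2
>
>     # raise the receive interrupt-coalescing delay
>     sudo ethtool -C eth2 rx-usecs 100
>
>     # watch the interrupt and context-switch rates settle
>     mpstat -P ALL 2
>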
>
>
> But inbound is still 4x slower than outbound (iperf is now 9.6 Gbps
> in both directions).
> Does anyone know any other Myricom tuning knobs to try?
> Is the conclusion to all this: "to do NDT/web100 at 10G requires a
> web10G kernel"?
>
> What happens if you set interrupt coalescing to zero again, but change the
> "snap delay" from 5 to 20? On a toolkit host, you should be able to edit
> "/etc/sysconfig/ndt" and add "--snapdelay 20" to the WEB100SRV_OPTIONS line.
> That should reduce how often NDT collects web100 data, from once every
> 5 ms to once every 20 ms (note: I've no clue how this will affect the
> quality of the data it collects).
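>
> A minimal sketch of that edit (the existing contents of the line may
> differ on your install, so append the option rather than replacing
> the line; the service name "ndt" is an assumption):
>
>     # /etc/sysconfig/ndt
>     WEB100SRV_OPTIONS="--snapdelay 20"
>
>     # restart NDT so web100srv picks up the new option
>     sudo /sbin/service ndt restart
>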
> Cheers,
> Aaron
>
> On Sat, May 28, 2011 at 7:02 PM, Brian Tierney <> wrote:
>
>
> I'm seeing the same thing (much higher results reported by iperf
> compared to NDT).
>
> Is this expected?
>
>
> On May 2, 2011, at 8:26 AM, <> wrote:
>
> Dear members:
>
> I have tried several approaches to use NDT on a server with a 10 gigabit
> interface.  I wonder if there are any limitations on the server-to-client
> tests.  I have not been able to get more than around 2.6 gigs
> server-to-client.  The client-to-server test can go over 9 gigs even
> without extensive tuning.  On the same server, I can get over 9 gigs in
> each direction to a neighbor server using iperf tests.
>
> Are there any tips on running NDT on a 10gig-capable server?
>
> Thanks,
>
> Nat Stoddard
>
> Summer 2011 ESCC/Internet2 Joint Techs
> Hosted by the University of Alaska-Fairbanks
> http://events.internet2.edu/2011/jt-uaf
>


