Re: Using NDT with 10 gigabit interfaces


  • From: Aaron Brown <>
  • To: Brian Tierney <>
  • Cc: Matt Mathis <>, NDT users <>
  • Subject: Re: Using NDT with 10 gigabit interfaces
  • Date: Wed, 1 Jun 2011 08:11:08 -0400


On May 31, 2011, at 11:11 PM, Brian Tierney wrote:



On May 31, 2011, at 8:10 AM, Matt Mathis wrote:

I am just guessing here, but NDT is actually quite busy: it reads
Web100 vars every millisecond and runs 2 different packet capture
tools.  Although one would hope that all of these activities run on
different cores, it would not surprise me to discover that the maximum
data rate is somewhat depressed.
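
(For reference, the per-millisecond polling described here amounts to roughly the following. This is only a sketch: it assumes a Web100-patched kernel, which exposes per-connection stats under /proc/web100/<cid>; web100srv actually takes snapshots through libweb100 rather than a shell loop, and the connection ID below is hypothetical.)

    CID=4123                          # hypothetical Web100 connection ID
    while sleep 0.001; do             # ~1 ms between reads, as described above
        cat /proc/web100/$CID/read > /dev/null
    done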

Web100 and related tools can't do any meaningful performance debugging
when the bottleneck is very fine-grained resource contention within
the sender itself, especially CPU, bus bandwidth, and lock contention.


This seems plausible to me, and I think it explains the asymmetry (which I was not clear about in my last email):

Using the web100clt tool between 2 nearby 10G NDT hosts (RTT = 0.02 ms),
I consistently see results similar to this:

running 10s outbound test (client to server) . . . . .  7748.44 Mb/s
running 10s inbound test (server to client) . . . . . . 425.89 Mb/s

while iperf is consistently around 8.3 Gbps in both directions

(results are the same if I swap the client and server hosts, btw)
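
(For reference, the commands behind these numbers would look something like the following sketch; the hostname is a placeholder, and the iperf flags are the classic iperf2 ones.)

    # NDT command-line client, run against each server in turn
    web100clt -n ndt-host.example.net

    # iperf comparison: start the server side, then run a 10-second test
    iperf -s
    iperf -c ndt-host.example.net -t 10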


vmstat output from server during 'client to server' testing:

procs -----------memory---------- ---swap-- -----io---- --system--   -----cpu------
 r  b   swpd   free   buff  cache   si   so     bi    bo   in     cs   us sy id wa st
 1  0      0 3000756 153796 284476    0    0     0     0 275956 106682  2 27 71  0  0
 3  0      0 3000012 153796 284476    0    0     0   184 278421 125647  3 29 69  0  0
 2  0      0 3000016 153796 284492    0    0     0     0 281350 102942  2 27 71  0  0
 2  0      0 2999024 153796 284492    0    0     0     0 281674 103412  2 28 70  0  0
 2  0      0 2999768 153796 284492    0    0     0     0 281432 103257  2 27 71  0  0
 2  0      0 2999148 153796 284492    0    0     0     0 281082 102463  2 28 70  0  0
 2  0      0 2999148 153796 284492    0    0     0    56 281413 102872  2 27 71  0  0
 1  0      0 3001616 153796 284492    0    0     0    64 218677 114352  2 20 78  0  0

vmstat output on the server during 'server to client' testing:

procs -----------memory---------- ---swap-- -----io---- --system--   -----cpu------
 r  b   swpd   free   buff  cache   si   so     bi    bo   in     cs   us sy id wa st
 1  0      0 3002236 153796 284492    0    0     0     0 193199 142030  2 16 83  0  0
 0  0      0 3002484 153796 284492    0    0     0     0 193191 142068  2 15 83  0  0
 1  0      0 2999880 153796 284492    0    0     0   240 193065 142319  2 16 82  0  0
 1  0      0 2994672 153796 284492    0    0     0     0 193231 142132  2 16 83  0  0
 1  0      0 2993316 153796 284492    0    0     0    64 193451 142211  1 16 82  0  0
 1  0      0 2996664 153796 284492    0    0     0     0 191818 145425  2 15 83  0  0
 0  0      0 2996420 153796 284496    0    0     0     0 189887 143033  2 15 83  0  0


Note the very high context-switches-per-second (cs) values, particularly while sending,

and compare with iperf:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs  us sy id wa st
 1  0      0 3024664 153856 286472    0    0     0     0 213989 5348  0 11 89  0  0
 0  0      0 3024416 153856 286472    0    0     0     0 213440 4019  0 11 89  0  0
 0  0      0 3024168 153856 286472    0    0     0     0 213908 3239  0 11 89  0  0
 1  0      0 3023796 153856 286472    0    0     0     0 213721 2613  0 11 89  0  0
 2  0      0 3023548 153856 286472    0    0     0    48 213933 2113  0 11 89  0  0
 0  0      0 3022804 153856 286472    0    0     0     0 213921 1758  0 11 89  0  0
 0  0      0 3022432 153856 286472    0    0     0     0 213864 1531  0 12 88  0  0
 0  0      0 3021936 153856 286472    0    0     0   240 213558 1331  0 11 89  0  0
 2  0      0 3021564 153856 286472    0    0     0     0 213885 1202  0 11 89  0  0


That is a dramatic difference in context switches (as expected, given the web100 calls).
These hosts have 6 "Intel(R) Core(TM) i7 CPU   X 980  @ 3.33GHz" CPUs.
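
(For reference, the tables above look like one-second vmstat samples; a quick way to watch just the context-switch column, assuming the field layout shown above, is:)

    vmstat 1                        # one-second samples, as in the tables above
    vmstat 1 | awk '{ print $12 }'  # "cs" is the 12th field in this layout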

Using mpstat, we see CPU load on 2 processors, and some additional interrupts on a 3rd:

06:43:38 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
06:43:40 PM    0    8.00    0.00   19.50    0.00    4.50   40.00    0.00   28.00 191739.50
06:43:40 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
06:43:40 PM    2    2.01    0.00   18.59    0.00    0.00    7.54    0.00   71.86      0.00
06:43:40 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.00
06:43:40 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
06:43:40 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
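
(A per-CPU breakdown like this comes from sysstat's mpstat; the two-second timestamps above match an invocation like:)

    mpstat -P ALL 2   # per-processor utilization and interrupt rates every 2 seconds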


So then I tried increasing the interrupt coalescing to 100ms (it was set to 0), and this made a big difference:

running 10s outbound test (client to server) . . . . .  9394.79 Mb/s
running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s

and brought the number of intr/sec down by around 20x:

08:06:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:06:55 PM    0    5.47    0.00   14.43    0.00    1.49   22.39    0.00   56.22   9907.96
08:06:55 PM    1    3.00    0.00   31.00    0.00    0.00    6.50    0.00   59.50      0.00
08:06:55 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
08:06:55 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.02
08:06:55 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
08:06:55 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
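
(For reference, on a Myricom myri10ge NIC the receive interrupt coalescing is normally adjusted through ethtool. This is only a sketch, with eth2 as a placeholder interface name; note that ethtool's rx-usecs value is in microseconds:)

    ethtool -c eth2               # show the current coalescing settings
    ethtool -C eth2 rx-usecs 100  # raise rx interrupt coalescing from 0 (microseconds)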



But inbound is still 4x slower than outbound.  (iperf is now 9.6 Gbps in both directions.)

Does anyone know any other Myricom tuning knobs to try?

Is the conclusion to all this: "to do NDT/web100 at 10G requires a web10G kernel" ?

What happens if you set interrupt coalescing to zero again, but change the "snap delay" from 5 to 20? On a toolkit host, you should be able to edit "/etc/sysconfig/ndt" and add "--snapdelay 20" to the WEB100SRV_OPTIONS line. That should decrease how often NDT collects web100 data, from once every 5 ms to once every 20 ms (note: I've no clue how this will affect the quality of the data it collects).
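
For reference, a sketch of that edit (whatever else is already on the WEB100SRV_OPTIONS line in your install should be kept):

    # in /etc/sysconfig/ndt
    WEB100SRV_OPTIONS="--snapdelay 20"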

Cheers,
Aaron


On Sat, May 28, 2011 at 7:02 PM, Brian Tierney <> wrote:


I'm seeing the same thing (much higher results reported by iperf compared to NDT).

Is this expected?


On May 2, 2011, at 8:26 AM, <> <> wrote:

Dear members:
I have tried several approaches to use NDT on a server with a 10 gigabit
interface.  I wonder if there are any limitations on the server-to-client
tests.  I have not been able to get more than around 2.6 gigs
server-to-client.  The client-to-server test can go over 9 gigs even without
extensive tuning.  On the same server, I can get over 9 gigs in each direction
to a neighboring server using iperf tests.

Are there any tips on running NDT on a 10-gigabit-capable server?

Thanks,
Nat Stoddard








