
Re: Using NDT with 10 gigabit interfaces


  • From: Brian Tierney <>
  • To: John Heffner <>
  • Cc: Aaron Brown <>, Matt Mathis <>, NDT users <>
  • Subject: Re: Using NDT with 10 gigabit interfaces
  • Date: Fri, 3 Jun 2011 16:52:39 -0700


On Jun 3, 2011, at 7:25 AM, John Heffner wrote:

> Brian, did you try out Aaron's suggestion?

I finally tested this.

With rx-usecs=0, increasing the snapdelay takes performance from 425 Mbps to 2.3 Gbps.
With rx-usecs=100, increasing the snapdelay has no effect: both settings give about 2.5 Gbps.
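
(For context: rx-usecs is the NIC receive interrupt coalescing setting, and it can be adjusted with ethtool along these lines. The interface name below is just a placeholder, not the actual device on these hosts.)

    # adjust receive interrupt coalescing (replace eth2 with the real 10G device)
    ethtool -C eth2 rx-usecs 0      # fire an interrupt as soon as packets arrive
    ethtool -C eth2 rx-usecs 100    # hold interrupts for up to 100 microseconds
    ethtool -c eth2                 # show the current coalescing settings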


>
> Another thing to try would be to comment out the lock_sock and
> unlock_sock lines in fs/proc/web100.c:connection_file_rw() in the
> kernel. This would get rid of the web100/TCP lock contention, at the
> expense of no longer providing correct atomic snapshots. It might be
> worth a try to see what the performance impact is.

These are not my systems, so I don't want to muck with the kernel...

>
> -John
>
>
> On Wed, Jun 1, 2011 at 8:11 AM, Aaron Brown <> wrote:
>>
>> On May 31, 2011, at 11:11 PM, Brian Tierney wrote:
>>
>>
>> On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
>>
>> I am just guessing here, but NDT is actually quite busy: it reads
>> Web100 vars every millisecond and runs two different packet capture
>> tools. Although one would hope that all of these activities run on
>> different cores, it would not surprise me to discover that the maximum
>> data rate is somewhat depressed.
>>
>> Web100 and related tools can't do any meaningful performance debugging
>> when the bottleneck is very fine-grained resource contention within
>> the sender itself, especially CPU, bus bandwidth, and lock contention.
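>>
>> (As a quick sanity check, something like the command below shows which core the
>> NDT server process last ran on; the process name web100srv is an assumption,
>> inferred from the WEB100SRV_OPTIONS variable mentioned further down.)
>>
>>     # last CPU (psr) and CPU usage for the NDT server process (name assumed)
>>     ps -o pid,psr,pcpu,comm -C web100srv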
>>
>>
>> This seems plausible to me, and I think it explains the asymmetry (which I was
>> not clear about in my last email). Using the web100clt tool between two nearby
>> 10G NDT hosts (RTT = 0.02 ms), I consistently see results similar to this:
>>
>> running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
>> running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
>>
>> while iperf is consistently around 8.3 Gbps in both directions
>> (results are the same if I swap the client and server hosts, btw).
>>
>> vmstat output from server during 'client to server' testing:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd    free   buff  cache   si   so    bi    bo     in     cs us sy id wa st
>>  1  0      0 3000756 153796 284476    0    0     0     0 275956 106682  2 27 71  0  0
>>  3  0      0 3000012 153796 284476    0    0     0   184 278421 125647  3 29 69  0  0
>>  2  0      0 3000016 153796 284492    0    0     0     0 281350 102942  2 27 71  0  0
>>  2  0      0 2999024 153796 284492    0    0     0     0 281674 103412  2 28 70  0  0
>>  2  0      0 2999768 153796 284492    0    0     0     0 281432 103257  2 27 71  0  0
>>  2  0      0 2999148 153796 284492    0    0     0     0 281082 102463  2 28 70  0  0
>>  2  0      0 2999148 153796 284492    0    0     0    56 281413 102872  2 27 71  0  0
>>  1  0      0 3001616 153796 284492    0    0     0    64 218677 114352  2 20 78  0  0
>>
>> vmstat output on server during 'server to client' testing:
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd    free   buff  cache   si   so    bi    bo     in     cs us sy id wa st
>>  1  0      0 3002236 153796 284492    0    0     0     0 193199 142030  2 16 83  0  0
>>  0  0      0 3002484 153796 284492    0    0     0     0 193191 142068  2 15 83  0  0
>>  1  0      0 2999880 153796 284492    0    0     0   240 193065 142319  2 16 82  0  0
>>  1  0      0 2994672 153796 284492    0    0     0     0 193231 142132  2 16 83  0  0
>>  1  0      0 2993316 153796 284492    0    0     0    64 193451 142211  1 16 82  0  0
>>  1  0      0 2996664 153796 284492    0    0     0     0 191818 145425  2 15 83  0  0
>>  0  0      0 2996420 153796 284496    0    0     0     0 189887 143033  2 15 83  0  0
>>
>> Note the very high context-switches-per-second values (cs), particularly while
>> sending, and compare with iperf:
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd    free   buff  cache   si   so    bi    bo     in    cs us sy id wa st
>>  1  0      0 3024664 153856 286472    0    0     0     0 213989  5348  0 11 89  0  0
>>  0  0      0 3024416 153856 286472    0    0     0     0 213440  4019  0 11 89  0  0
>>  0  0      0 3024168 153856 286472    0    0     0     0 213908  3239  0 11 89  0  0
>>  1  0      0 3023796 153856 286472    0    0     0     0 213721  2613  0 11 89  0  0
>>  2  0      0 3023548 153856 286472    0    0     0    48 213933  2113  0 11 89  0  0
>>  0  0      0 3022804 153856 286472    0    0     0     0 213921  1758  0 11 89  0  0
>>  0  0      0 3022432 153856 286472    0    0     0     0 213864  1531  0 12 88  0  0
>>  0  0      0 3021936 153856 286472    0    0     0   240 213558  1331  0 11 89  0  0
>>  2  0      0 3021564 153856 286472    0    0     0     0 213885  1202  0 11 89  0  0
>>
>> That is a dramatic difference in context switches (as expected, due to the
>> web100 calls). These hosts have 6 "Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz" CPUs.
>> Using mpstat, we see CPU load on 2 processors, and some additional interrupts on a 3rd:
>> 06:43:38 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle     intr/s
>> 06:43:40 PM    0    8.00    0.00   19.50    0.00    4.50   40.00    0.00   28.00  191739.50
>> 06:43:40 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       0.00
>> 06:43:40 PM    2    2.01    0.00   18.59    0.00    0.00    7.54    0.00   71.86       0.00
>> 06:43:40 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    1001.00
>> 06:43:40 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       0.00
>> 06:43:40 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       0.00
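>>
>> (The per-CPU breakdown above is standard mpstat output; the exact invocation
>> isn't shown here, but something like the following produces it, with the
>> 2-second interval inferred from the timestamps.)
>>
>>     mpstat -P ALL 2     # per-CPU utilization and interrupt rates every 2 seconds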
>>
>> So then I tried increasing the interrupt coalescing (rx-usecs) from 0 to 100,
>> and this made a big difference:
>> running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
>> running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
>> and brought the number of interrupts per second down by around 20x:
>> 08:06:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
>> 08:06:55 PM    0    5.47    0.00   14.43    0.00    1.49   22.39    0.00   56.22  9907.96
>> 08:06:55 PM    1    3.00    0.00   31.00    0.00    0.00    6.50    0.00   59.50     0.00
>> 08:06:55 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
>> 08:06:55 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   996.02
>> 08:06:55 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
>> 08:06:55 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
>>
>>
>> But inbound is still 4x slower than outbound (iperf is now 9.6 Gbps in both
>> directions). Does anyone know any other Myricom tuning knobs to try?
>> Or is the conclusion to all this: "doing NDT/web100 at 10G requires a web10G
>> kernel"?
>>
>> What happens if you set interrupt coalescing to zero again, but change the
>> "snap delay" from 5 to 20? On a toolkit host, you should be able to edit
>> "/etc/sysconfig/ndt" and add "--snapdelay 20" to the WEB100SRV_OPTIONS line.
>> That should decrease how frequently NDT collects web100 data, from once every
>> 5 ms to once every 20 ms (note: I've no clue how this will affect the quality
>> of the data it collects).
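>>
>> (A minimal sketch of that edit; the file's existing contents aren't shown in
>> this thread, so treat the line below as illustrative only.)
>>
>>     # /etc/sysconfig/ndt -- append --snapdelay 20 to the options already present
>>     WEB100SRV_OPTIONS="--snapdelay 20"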
>> Cheers,
>> Aaron
>>
>> On Sat, May 28, 2011 at 7:02 PM, Brian Tierney <> wrote:
>>
>> I'm seeing the same thing (much higher results reported by iperf than by NDT).
>>
>> Is this expected?
>>
>>
>> On May 2, 2011, at 8:26 AM, <> wrote:
>>
>> Dear members:
>>
>> I have tried several approaches to using NDT on a server with a 10 gigabit
>> interface. I wonder if there are any limitations on the server-to-client
>> tests. I have not been able to get more than around 2.6 Gbps
>> server-to-client. The client-to-server test can go over 9 Gbps even without
>> extensive tuning. On the same server, I can get over 9 Gbps in each direction
>> to a neighboring server using iperf tests.
>>
>> Are there any tips on running NDT on a 10gig-capable server?
>>
>> Thanks,
>>
>> Nat Stoddard
>>
>> Summer 2011 ESCC/Internet2 Joint Techs
>> Hosted by the University of Alaska-Fairbanks
>> http://events.internet2.edu/2011/jt-uaf
>>



