On May 31, 2011, at 8:10 AM, Matt Mathis wrote:
I am just guessing here, but NDT is actually quite busy: it reads
Web100 vars every millisecond and runs two different packet-capture
tools. Although one would hope that all of these activities run on
different cores, it would not surprise me to discover that the maximum
data rate is somewhat depressed.
Web100 and related tools can't do any meaningful performance debugging
when the bottleneck is very fine-grained resource contention within
the sender itself, especially CPU, bus bandwidth, and lock contention.
This seems plausible to me, and I think it explains the asymmetry (which I was
not clear about in my last email):
Using the web100clt tool between two nearby 10G NDT hosts (RTT = 0.02 ms),
I consistently see results similar to this:
running 10s outbound test (client to server) . . . . . 7748.44 Mb/s
running 10s inbound test (server to client) . . . . . . 425.89 Mb/s
while iperf is consistently around 8.3 Gbps in both directions
(the results are the same if I swap the client and server hosts, btw)
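For reference, the invocations were along these lines (a sketch from memory;
the server name is a placeholder and the exact web100clt flags may differ):

  # NDT command-line client against a remote NDT server
  web100clt -n ndt-server.example.net

  # iperf comparison: start the server side, then run a 10s TCP test
  iperf -s
  iperf -c ndt-server.example.net -t 10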
vmstat output from the server during 'client to server' testing:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3000756 153796 284476 0 0 0 0 275956 106682 2 27 71 0 0
3 0 0 3000012 153796 284476 0 0 0 184 278421 125647 3 29 69 0 0
2 0 0 3000016 153796 284492 0 0 0 0 281350 102942 2 27 71 0 0
2 0 0 2999024 153796 284492 0 0 0 0 281674 103412 2 28 70 0 0
2 0 0 2999768 153796 284492 0 0 0 0 281432 103257 2 27 71 0 0
2 0 0 2999148 153796 284492 0 0 0 0 281082 102463 2 28 70 0 0
2 0 0 2999148 153796 284492 0 0 0 56 281413 102872 2 27 71 0 0
1 0 0 3001616 153796 284492 0 0 0 64 218677 114352 2 20 78 0 0
vmstat output from the server during 'server to client' testing:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3002236 153796 284492 0 0 0 0 193199 142030 2 16 83 0 0
0 0 0 3002484 153796 284492 0 0 0 0 193191 142068 2 15 83 0 0
1 0 0 2999880 153796 284492 0 0 0 240 193065 142319 2 16 82 0 0
1 0 0 2994672 153796 284492 0 0 0 0 193231 142132 2 16 83 0 0
1 0 0 2993316 153796 284492 0 0 0 64 193451 142211 1 16 82 0 0
1 0 0 2996664 153796 284492 0 0 0 0 191818 145425 2 15 83 0 0
0 0 0 2996420 153796 284496 0 0 0 0 189887 143033 2 15 83 0 0
Note the very high context-switches-per-second (cs) values, particularly while
sending, and compare with iperf:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3024664 153856 286472 0 0 0 0 213989 5348 0 11 89 0 0
0 0 0 3024416 153856 286472 0 0 0 0 213440 4019 0 11 89 0 0
0 0 0 3024168 153856 286472 0 0 0 0 213908 3239 0 11 89 0 0
1 0 0 3023796 153856 286472 0 0 0 0 213721 2613 0 11 89 0 0
2 0 0 3023548 153856 286472 0 0 0 48 213933 2113 0 11 89 0 0
0 0 0 3022804 153856 286472 0 0 0 0 213921 1758 0 11 89 0 0
0 0 0 3022432 153856 286472 0 0 0 0 213864 1531 0 12 88 0 0
0 0 0 3021936 153856 286472 0 0 0 240 213558 1331 0 11 89 0 0
2 0 0 3021564 153856 286472 0 0 0 0 213885 1202 0 11 89 0 0
That is a dramatic difference in context switches (as expected, given the
Web100 calls).
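To pin those context switches on a specific process, something like this
should work (a sketch; I'm assuming the NDT server daemon is named web100srv
and that sysstat's pidstat is installed):

  # per-process voluntary/involuntary context-switch rates, 1-second samples
  pidstat -w -p "$(pidof web100srv)" 1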
These hosts have 6 "Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz" CPUs.
Using mpstat, we see CPU load on two processors, and some additional interrupts on a third:
06:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
06:43:40 PM 0 8.00 0.00 19.50 0.00 4.50 40.00 0.00 28.00 191739.50
06:43:40 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
06:43:40 PM 2 2.01 0.00 18.59 0.00 0.00 7.54 0.00 71.86 0.00
06:43:40 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1001.00
06:43:40 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
06:43:40 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
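(For anyone repeating this: the table above is sysstat's mpstat; a sketch of
the invocation, plus a quick way to see which IRQs land on which CPU:)

  # per-CPU utilization and interrupt rate, 2-second samples
  mpstat -P ALL 2

  # which interrupt lines are being serviced by which CPU
  cat /proc/interrupts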
So then I tried increasing the interrupt coalescing delay to 100 µs (it had
been set to 0), and this made a big difference:
running 10s outbound test (client to server) . . . . . 9394.79 Mb/s
running 10s inbound test (server to client) . . . . . . 2523.48 Mb/s
and brought the number of interrupts per second down by around 20x
(from ~192K to ~10K on CPU 0):
08:06:53 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
08:06:55 PM 0 5.47 0.00 14.43 0.00 1.49 22.39 0.00 56.22 9907.96
08:06:55 PM 1 3.00 0.00 31.00 0.00 0.00 6.50 0.00 59.50 0.00
08:06:55 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
08:06:55 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 996.02
08:06:55 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
08:06:55 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
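For the record, the coalescing change itself was a one-liner; a sketch,
assuming the Myricom interface is eth2 and that the driver honors the
standard ethtool knob:

  # show current coalescing settings, then raise the rx interrupt delay
  ethtool -c eth2
  ethtool -C eth2 rx-usecs 100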
But inbound is still roughly 4x slower than outbound. (iperf is now 9.6 Gbps
in both directions.)
Does anyone know of any other Myricom tuning knobs to try?
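The ones I would try next are roughly these (a sketch, not yet verified on
this driver; eth2, the IRQ number, the CPU masks, and the web100srv path are
all placeholders):

  # ring buffers: see what the hardware supports, then raise rx/tx
  ethtool -g eth2
  ethtool -G eth2 rx 4096 tx 4096

  # pin the NIC's IRQ to one core and keep the NDT processes off it
  echo 2 > /proc/irq/24/smp_affinity
  taskset -c 2,3 /usr/local/sbin/web100srv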
Is the conclusion to all this: "to do NDT/Web100 at 10G requires a Web10G kernel"?