Re: [perfsonar-user] iperf3 performance tuning for small packets


  • From: Eli Dart <>
  • To: Casey Russell <>
  • Cc: <>
  • Subject: Re: [perfsonar-user] iperf3 performance tuning for small packets
  • Date: Tue, 14 Feb 2017 09:24:01 -0800

Hi Casey,

I know you said you've been through the Linux tuning page on fasterdata... have you checked the send and receive rings on the Intel NIC?  http://fasterdata.es.net/host-tuning/nic-tuning/

I've seen those settings make a significant difference, especially in UDP applications (e.g. netflow collectors, large syslog hosts).
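
A quick way to check, assuming the interface is eth0 (the maximums depend on the NIC and driver):

    # show current vs. hardware-maximum RX/TX ring sizes
    ethtool -g eth0
    # raise them toward the reported maximums (4096 is just an example value)
    ethtool -G eth0 rx 4096 tx 4096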

Eli



On Mon, Feb 13, 2017 at 4:50 PM, Casey Russell <> wrote:
Group,  

     I have recently needed to stress test some new firewalls and intended to use iperf3 to do it.  I'm pushing iperf3 harder than I've personally pushed it before and I'm running into some challenges.  I know this isn't an iperf3 list, specifically, but many of you will have used the tool pretty extensively, so if you'll indulge me...

     Interestingly (at least to me), I'm having trouble generating a full 1G of traffic in a "worst case" scenario through the firewalls, or even directly box to box.  By worst case, I mean 64-byte (or near-64-byte) UDP packets.  Fortunately (or unfortunately, depending on your love for IPsec), the circuit between the firewalls is going to be an IPsec tunnel, so there is a massive amount of protocol overhead at the small packet sizes.  I only need to generate something less than 500 Mbit/s of iperf3 traffic to fill that pipe.
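
     (Back-of-the-envelope for scale, counting only the iperf3 payload: at -l 64 each datagram carries 64 bytes = 512 bits, so 500 Mbit/s of payload is roughly 500,000,000 / 512, or about 977,000 packets per second, and a full 1 Gbit/s would be close to twice that, before any UDP/IP/Ethernet/IPsec overhead is added.)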

     The servers I'm using are a few years old, but I didn't expect to have as much trouble as I'm having getting near gig speeds.  When I push the machines at all, one of two things happens.

     If I keep the setup simple, iperf3 simply refuses to push the requested bandwidth.  For instance, what should have been an 800 Mb/s test runs at 240 Mb/s with little or no loss.

     Because I can see, in that scenario, that a single CPU core is getting hammered, I create multiple sender and receiver processes on both ends and use the -A flag to assign each a different CPU affinity.  This gets me a bit more bandwidth (500-600 Mb/s) but massive loss (20-40% or more).  In particular, I notice that the loss begins in spades any time I'm running multiple sender (or receiver) processes on the same physical socket (even if they're on different logical cores).
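
     (To see which logical CPUs share a physical socket when picking -A values, either of these should do it:

         lscpu | grep -i 'socket\|numa'
         grep . /sys/devices/system/cpu/cpu*/topology/physical_package_id

     The second one prints each cpuN path next to its physical package ID.)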

      As an example, I've experimented with running as many as 6 servers and 6 clients on each physical host.  I used the -A flag to set CPU affinity, since each host has 16+ cores.  I'm using the -w flag to increase the receive buffer to 1M or larger, and I'm using zero-copy (-Z) to reduce CPU load as much as I can.
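
      A sketch of that kind of layout (the port numbers, core numbers, and per-stream rates here are placeholders, not the exact values I used):

          # receiver side: one iperf3 server per port, each pinned to its own core with -A
          iperf3 -s -p 5201 -A 2 &
          iperf3 -s -p 5202 -A 4 &
          iperf3 -s -p 5203 -A 6 &

          # sender side: one client per server port; -A local,remote pins the local and the remote core
          iperf3 -u -b 150M -l 64 -t 30 -Z -w 1M -A 2,2 -p 5201 -c 10.18.49.10 &
          iperf3 -u -b 150M -l 64 -t 30 -Z -w 1M -A 4,4 -p 5202 -c 10.18.49.10 &
          iperf3 -u -b 150M -l 64 -t 30 -Z -w 1M -A 6,6 -p 5203 -c 10.18.49.10 &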

     I also experimented with fewer send/receive processes, using the -P flag to send more parallel streams per process.  However, I'm finding it nearly impossible to get more than about 600 Mb/s between the two hosts, even with them connected back to back via a Cat6 cable.

     My question for the group is:  does this sound like a "meh, that sounds about right" scenario, or should I definitely be able to squeeze more performance out of these boxes and I'm just missing a tuning option somewhere?  I've followed the ESnet tuning guide here for the CentOS 6 host:  https://fasterdata.es.net/host-tuning/linux/
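
     For UDP, the pieces of that guide that seem to matter most are the core socket-buffer and backlog sysctls; the values below are examples rather than the guide's exact current recommendations:

         # allow large socket buffers so the -w 1M request actually takes effect
         sysctl -w net.core.rmem_max=67108864
         sysctl -w net.core.wmem_max=67108864
         # let the kernel queue more received packets per interface before dropping
         sysctl -w net.core.netdev_max_backlog=250000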


     I realize that's a brief overview, but I don't want to drown you with an even bigger wall of text.  I'll be happy to provide more specifics if requested, either on or off the list.  You'll find an example run below as well as my system specs.


For those of you who made it to the bottom, thank you.
Sincerely,
Casey Russell
Network Engineer
KanREN
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


My system specs are as follows:

Host1 (sender) (pci express NIC)
CentOS release 6.8
Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
2 physical CPUs (sockets), 4 cores/socket, 2 threads/core (16 CPUs)
126G RAM  (DDR3 1600)
Intel Corporation 82576 Gigabit Network Connection (rev 01)
igb 0000:05:00.0: Intel(R) Gigabit Ethernet Network Connection
igb 0000:05:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 00:1b:21:8e:63:08

Host2 (receiver) (onboard NIC)
CentOS Linux release 7.3.1611
Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz 
2 physical CPUs (sockets), 6 cores/socket, 2 threads/core (24 CPUs)
12G RAM  (DDR3 1333)
Broadcom Limited NetXtreme II BCM5716 Gigabit Ethernet (rev 20) 
bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found ....


TEST RUN
You'll notice I requested 90 Mb/s per stream and ran 4 streams (360 Mb/s total), but only got a total of 224 Mb/s.  I'm certain that's because CPU 7 is pegged on both hosts, since I only have a single iperf3 process on each box for this test.

[sender]
Cpu7  : 11.0%us, 89.0%sy, 
[receiver]
Cpu7  : 11.4%us, 88.2%sy, 
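
For reference, mpstat from the sysstat package gives the same per-core user/system split; one-second samples of core 7:

    mpstat -P 7 1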


[crussell@localhost ~]$ iperf3 -i 10 -u -b 90M -l 64 -t 30 -Z -P 4 -w 1M -A 7,7 -p 5195 -c 10.18.49.10
Connecting to host 10.18.49.10, port 5195
[  4] local 10.18.48.10 port 41757 connected to 10.18.49.10 port 5195
[  6] local 10.18.48.10 port 56923 connected to 10.18.49.10 port 5195
[  8] local 10.18.48.10 port 38524 connected to 10.18.49.10 port 5195
[ 10] local 10.18.48.10 port 58086 connected to 10.18.49.10 port 5195
[ ID] Interval           Transfer     Bandwidth       Total Datagrams

(Middle redacted for brevity)

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-30.00  sec   200 MBytes  55.9 Mbits/sec  0.004 ms  0/3275660 (0%)  
[  4] Sent 3275660 datagrams
[  6]   0.00-30.00  sec   200 MBytes  55.9 Mbits/sec  0.003 ms  0/3275660 (0%)  
[  6] Sent 3275660 datagrams
[  8]   0.00-30.00  sec   200 MBytes  55.9 Mbits/sec  0.003 ms  0/3275660 (0%)  
[  8] Sent 3275660 datagrams
[ 10]   0.00-30.00  sec   200 MBytes  55.9 Mbits/sec  0.004 ms  0/3275660 (0%)  
[ 10] Sent 3275660 datagrams
[SUM]   0.00-30.00  sec   800 MBytes   224 Mbits/sec  0.003 ms  0/13102640 (0%)  




--
Eli Dart, Network Engineer                          NOC: (510) 486-7600
ESnet Science Engagement Group                           (800) 333-7638
Lawrence Berkeley National Laboratory 


