ndt-users - RE: bandwidth asymmetry on a 10G link
- From: "Gholmieh, Nathalie" <>
- To: 'Rich Carlson' <>
- Cc: "''" <>
- Subject: RE: bandwidth asymmetry on a 10G link
- Date: Wed, 21 Apr 2010 08:56:41 -0700
- Accept-language: en-US
Hi Rich-
Thank you for your reply.
> 1) what are the flow control settings for the various interfaces (sudo
> /sbin/ethtool -a ethx for the Myricom cards).
[root@M]# ethtool -a eth5
Pause parameters for eth5:
Autonegotiate: off
RX: on
TX: on
[root@T]# ethtool -a eth4
Pause parameters for eth4:
Autonegotiate: off
RX: on
TX: on
> 2) what drivers are you using on the NICs (ethtool -i ethx)
[root@M]# ethtool -i eth5
driver: myri10ge
version: 1.5.0
firmware-version: 1.4.43 -- 2009/05/26 23:56:28 m
bus-info: 0000:07:00.0
[root@T]# ethtool -i eth4
driver: myri10ge
version: 1.4.4-1.401
firmware-version: 1.4.43 -- 2009/05/26 23:56:28 m
bus-info: 0000:07:00.0
> 3) have you tried making the txqueuelen bigger (say 10000)?
I have increased txqueuelen on the two machines to 10000, and the results are
still the same.
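(For reference, raising the queue length can be done along these lines, using the interface names from the ethtool output above:

[root@M]# ifconfig eth5 txqueuelen 10000
[root@T]# ifconfig eth4 txqueuelen 10000

or equivalently "ip link set dev eth5 txqueuelen 10000"; ifconfig then reports the new value.)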
I have tried installing NPAD on these two boxes, but the Python web100 test
is failing on both, so NPAD is not working. I'm looking into that further.
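(A minimal sanity check for the Python web100 piece NPAD depends on -- assuming the bindings are installed as a module named Web100, which may differ on your build -- is:

python -c "import Web100"

If that import fails, the web100-based NPAD tests cannot run.)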
Nathalie~
-----Original Message-----
From: Rich Carlson [mailto:]
Sent: Monday, April 19, 2010 10:42 AM
To: Gholmieh, Nathalie
Cc: ''
Subject: Re: bandwidth asymmetry on a 10G link
Nathalie;
Thanks for the correction. A few quick comments/questions:
1) what are the flow control settings for the various interfaces (sudo
/sbin/ethtool -a ethx for the Myricom cards).
2) what drivers are you using on the NICs (ethtool -i ethx)
3) have you tried making the txqueuelen bigger (say 10000)?
4) the 1st output shows
> minCWNDpeak: 8960
> maxCWNDpeak: 12615680
> CWNDpeaks: 26
Which says the CWND dropped down to 1 packet and grew to 1408 packets.
Also, the flow went through 26 growth/loss cycles. (Note: the NDT
analysis is based on the use of Reno.)
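(From the reported values: minCWNDpeak / MSS = 8960 / 8960 = 1 packet, and
maxCWNDpeak / MSS = 12615680 / 8960 = 1408 packets.)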
5) the 2nd output shows
> minCWNDpeak: -1
> maxCWNDpeak: -1
> CWNDpeaks: -1
Which means the connection never went into CA mode. A host buffer limit
was reached before any packets were lost.
6) both outputs also show that the NDT server is resource constrained
and that is why the speed is limited.
7) have you tried running an NPAD test to see if there are switch/router
buffers impacting the path?
Rich
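For reference on item 1 and the earlier suggestion about flow control: pause frames can be toggled with ethtool's -A option (a sketch using the eth5 name from the output above; -a reads the state back):

sudo /sbin/ethtool -A eth5 autoneg off rx on tx on
sudo /sbin/ethtool -a eth5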
On 4/19/2010 12:59 PM, Gholmieh, Nathalie wrote:
> Hi Rich-
>
> I am sorry my email was not clear.
>
> The two servers are located in different parts of campus and they
> communicate on a 10G fiber link through 2 cisco layer3 switches.
> The two servers are running
> - linux kernel 2.6.30 with web100 patch
> - Myricom 10G-PCIE2-8B2-2S+E
>
> NDT tests run on the two linux machines show the same asymmetry in the
> bandwidth. In summary:
> Web100clt run on machine 1 with machine 2 as server returns:
> C2S bandwidth ~ 10Gbps
> S2C bandwidth< 3Gbps
> Web100clt run on machine 2 with machine 1 as server returns:
> C2S bandwidth ~ 10Gbps
> S2C bandwidth< 3Gbps
>
> I have set the Myricom NICs on both machines to use jumbo frames.
> Write combining is enabled on the NICs.
> I have modified the network buffer sizes, as sent in my first email, to the
> values recommended by Myricom for better performance.
> I have also played with the congestion control protocol: I have read that
> cubic might enhance performance, so I set net.ipv4.tcp_congestion_control =
> cubic. I still get the same results.
>
> txqueuelen:1000 on both servers
>
> here are the results of NDT tests run on the two servers:
>
> --------
> M is the client and T is the server:
> --------
>
> [root@M ~]# web100clt -n T.ucsd.edu -lll
> Testing network path for configuration and performance problems -- Using
> IPv4 address
> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
> checking for firewalls . . . . . . . . . . . . . . . . . . . Done
> running 10s outbound test (client to server) . . . . . 9564.84 Mb/s
> running 10s inbound test (server to client) . . . . . . 2901.84 Mb/s
> The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
> Ethernet/OC-192 subnet
> Information [S2C]: Packet queuing detected: 70.32% (local buffers)
> Server 'T.ucsd.edu' is not behind a firewall. [Connection to the ephemeral
> port was successful]
> Client is not behind a firewall. [Connection to the ephemeral port was
> successful]
>
> ------ Web100 Detailed Analysis ------
>
> Web100 reports the Round trip time = 6.85 msec;the Packet size = 8960
> Bytes; and
> There were 96 packets retransmitted, 6893 duplicate acks received, and 7002
> SACK blocks received
> Packets arrived out-of-order 3.00% of the time.
> This connection is sender limited 81.76% of the time.
> This connection is network limited 17.63% of the time.
>
> Web100 reports TCP negotiated the optional Performance Settings to:
> RFC 2018 Selective Acknowledgment: ON
> RFC 896 Nagle Algorithm: ON
> RFC 3168 Explicit Congestion Notification: OFF
> RFC 1323 Time Stamping: OFF
> RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9
> The theoretical network limit is 1970.18 Mbps
> The NDT server has a 16384 KByte buffer which limits the throughput to
> 18688.86 Mbps
> Your PC/Workstation has a 12282 KByte buffer which limits the throughput to
> 14009.23 Mbps
> The network based flow control limits the throughput to 14053.15 Mbps
>
> Client Data reports link is ' 9', Client Acks report link is ' 9'
> Server Data reports link is ' 9', Server Acks report link is ' 9'
> Packet size is preserved End-to-End
> Information: Network Address Translation (NAT) box is modifying the
> Server's IP address
> Server says [<IP1>] but Client says [ T.ucsd.edu]
> Information: Network Address Translation (NAT) box is modifying the
> Client's IP address,
> Server says [<IP2>] but Client says [ M.ucsd.edu]
> CurMSS: 8960
> X_Rcvbuf: 87380
> X_Sndbuf: 16777216
> AckPktsIn: 230148
> AckPktsOut: 0
> BytesRetrans: 843008
> CongAvoid: 0
> CongestionOverCount: 0
> CongestionSignals: 35
> CountRTT: 223133
> CurCwnd: 9309440
> CurRTO: 208
> CurRwinRcvd: 12481024
> CurRwinSent: 17920
> CurSsthresh: 8296960
> DSACKDups: 0
> DataBytesIn: 0
> DataBytesOut: -660222984
> DataPktsIn: 0
> DataPktsOut: 1363901
> DupAcksIn: 6893
> ECNEnabled: 0
> FastRetran: 35
> MaxCwnd: 12615680
> MaxMSS: 8960
> MaxRTO: 211
> MaxRTT: 11
> MaxRwinRcvd: 12576256
> MaxRwinSent: 17920
> MaxSsthresh: 9551360
> MinMSS: 8960
> MinRTO: 201
> MinRTT: 0
> MinRwinRcvd: 17920
> MinRwinSent: 17920
> NagleEnabled: 1
> OtherReductions: 144
> PktsIn: 230148
> PktsOut: 1363901
> PktsRetrans: 96
> RcvWinScale: 9
> SACKEnabled: 3
> SACKsRcvd: 7002
> SendStall: 0
> SlowStart: 0
> SampleRTT: 8
> SmoothedRTT: 8
> SndWinScale: 9
> SndLimTimeRwin: 61464
> SndLimTimeCwnd: 1773632
> SndLimTimeSender: 8227091
> SndLimTransRwin: 544
> SndLimTransCwnd: 28627
> SndLimTransSender: 29117
> SndLimBytesRwin: 41668440
> SndLimBytesCwnd: -1770059740
> SndLimBytesSender: 1068168316
> SubsequentTimeouts: 0
> SumRTT: 1528314
> Timeouts: 0
> TimestampsEnabled: 0
> WinScaleRcvd: 9
> WinScaleSent: 9
> DupAcksOut: 0
> StartTimeUsec: 903832
> Duration: 10062211
> c2sData: 9
> c2sAck: 9
> s2cData: 9
> s2cAck: 9
> half_duplex: 0
> link: 100
> congestion: 0
> bad_cable: 0
> mismatch: 0
> spd: -524.91
> bw: 1970.18
> loss: 0.000025662
> avgrtt: 6.85
> waitsec: 0.00
> timesec: 10.00
> order: 0.0300
> rwintime: 0.0061
> sendtime: 0.8176
> cwndtime: 0.1763
> rwin: 95.9492
> swin: 128.0000
> cwin: 96.2500
> rttsec: 0.006849
> Sndbuf: 16777216
> aspd: 0.00000
> CWND-Limited: 115198.00
> minCWNDpeak: 8960
> maxCWNDpeak: 12615680
> CWNDpeaks: 26
> [root@M ~]#
>
> -----------------------
> T is the client and M is the server (I had restarted M so some of the
> values were reset to the default):
> -----------------------
>
> [root@T ~]# web100clt -n M.ucsd.edu -lll
> Testing network path for configuration and performance problems -- Using
> IPv4 address
> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
> checking for firewalls . . . . . . . . . . . . . . . . . . . Done
> running 10s outbound test (client to server) . . . . . 9687.49 Mb/s
> running 10s inbound test (server to client) . . . . . . 1919.56 Mb/s
> The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
> Ethernet/OC-192 subnet
> Information [S2C]: Packet queuing detected: 78.17% (local buffers)
> Server 'M.ucsd.edu' is not behind a firewall. [Connection to the ephemeral
> port was successful]
> Client is not behind a firewall. [Connection to the ephemeral port was
> successful]
>
> ------ Web100 Detailed Analysis ------
>
> Web100 reports the Round trip time = 0.79 msec;the Packet size = 8960
> Bytes; and
> No packet loss was observed.
> This connection is sender limited 99.62% of the time.
>
> Web100 reports TCP negotiated the optional Performance Settings to:
> RFC 2018 Selective Acknowledgment: ON
> RFC 896 Nagle Algorithm: ON
> RFC 3168 Explicit Congestion Notification: OFF
> RFC 1323 Time Stamping: OFF
> RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9
> The theoretical network limit is 8658517.00 Mbps
> The NDT server has a 16384 KByte buffer which limits the throughput to
> 162025.31 Mbps
> Your PC/Workstation has a 12288 KByte buffer which limits the throughput to
> 121518.98 Mbps
> The network based flow control limits the throughput to 121835.44 Mbps
>
> Client Data reports link is ' 9', Client Acks report link is ' 9'
> Server Data reports link is ' 9', Server Acks report link is ' 9'
> Packet size is preserved End-to-End
> Information: Network Address Translation (NAT) box is modifying the
> Server's IP address
> Server says [<IP2>] but Client says [ M.ucsd.edu]
> Information: Network Address Translation (NAT) box is modifying the
> Client's IP address
> Server says [<IP1>] but Client says [ T]
> CurMSS: 8960
> X_Rcvbuf: 87380
> X_Sndbuf: 16777216
> AckPktsIn: 224114
> AckPktsOut: 0
> BytesRetrans: 0
> CongAvoid: 0
> CongestionOverCount: 0
> CongestionSignals: 0
> CountRTT: 224114
> CurCwnd: 12615680
> CurRTO: 201
> CurRwinRcvd: 12558848
> CurRwinSent: 17920
> CurSsthresh: -256
> DSACKDups: 0
> DataBytesIn: 0
> DataBytesOut: -1871234080
> DataPktsIn: 0
> DataPktsOut: 1237850
> DupAcksIn: 0
> ECNEnabled: 0
> FastRetran: 0
> MaxCwnd: 12615680
> MaxMSS: 8960
> MaxRTO: 211
> MaxRTT: 11
> MaxRwinRcvd: 12582912
> MaxRwinSent: 17920
> MaxSsthresh: 0
> MinMSS: 8960
> MinRTO: 201
> MinRTT: 0
> MinRwinRcvd: 17920
> MinRwinSent: 17920
> NagleEnabled: 1
> OtherReductions: 0
> PktsIn: 224114
> PktsOut: 1237850
> PktsRetrans: 0
> RcvWinScale: 9
> SACKEnabled: 3
> SACKsRcvd: 0
> SendStall: 0
> SlowStart: 0
> SampleRTT: 0
> SmoothedRTT: 1
> SndWinScale: 9
> SndLimTimeRwin: 29301
> SndLimTimeCwnd: 8996
> SndLimTimeSender: 10054035
> SndLimTransRwin: 602
> SndLimTransCwnd: 99
> SndLimTransSender: 701
> SndLimBytesRwin: 49860420
> SndLimBytesCwnd: 14403440
> SndLimBytesSender: -1935497940
> SubsequentTimeouts: 0
> SumRTT: 176939
> Timeouts: 0
> TimestampsEnabled: 0
> WinScaleRcvd: 9
> WinScaleSent: 9
> DupAcksOut: 0
> StartTimeUsec: 614339
> Duration: 10094355
> c2sData: 9
> c2sAck: 9
> s2cData: 9
> s2cAck: 9
> half_duplex: 0
> link: 100
> congestion: 0
> bad_cable: 0
> mismatch: 0
> spd: -1483.29
> bw: 8658516.76
> loss: 0.000000000
> avgrtt: 0.79
> waitsec: 0.00
> timesec: 10.00
> order: 0.0000
> rwintime: 0.0029
> sendtime: 0.9962
> cwndtime: 0.0009
> rwin: 96.0000
> swin: 128.0000
> cwin: 96.2500
> rttsec: 0.000790
> Sndbuf: 16777216
> aspd: 0.00000
> CWND-Limited: 121894.00
> minCWNDpeak: -1
> maxCWNDpeak: -1
> CWNDpeaks: -1
> [root@T ~]#
>
> -------------------------------
>
> thanks!
>
>
> Nathalie~
>
>
>
> -----Original Message-----
> From: Rich Carlson [mailto:]
> Sent: Thursday, April 15, 2010 11:17 AM
> To: Gholmieh, Nathalie
> Cc: ''
> Subject: Re: bandwidth asymmetry on a 10G link
>
> Hi Nathalie;
>
> I'm not sure I understand the configuration and the problem so let me
> ask for clarification.
>
> You have 2 hosts connected back-to-back with a cross-over cable (fiber
> or copper?) You have installed an NDT server on both nodes and from
> either node you get asymmetric results as shown below. If this is not
> correct, then please clarify.
>
> A couple of questions.
> 1) What Linux kernel version are you using?
> 2) What Myricom driver version are you using?
> 3) Have you tuned any of the Myricom parameters?
>
> more comments in-line
>
> On 4/15/2010 1:27 PM, Gholmieh, Nathalie wrote:
>> Hi-
>>
>> I have setup 2 NDT servers interconnected with a 10G link, both using
>> Myricom 10G NICs, on our local network. the two servers have the same
>> versions of NDT 3.6.1.
>>
>> when running NDT tests between the 2 servers, I get a C2S bandwidth of
>> approximately 10Gbps, but the S2C bandwidth is not exceeding 3 Gbps, and
>> that is on BOTH machines:
>>
>> [root@M ~]# web100clt -n T -4 -l
>>
>> Testing network path for configuration and performance problems -- Using
>> IPv4 address
>>
>> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>>
>> checking for firewalls . . . . . . . . . . . . . . . . . . . Done
>>
>> *running 10s outbound test (client to server) . . . . . 9351.25 Mb/s*
>>
>> *running 10s inbound test (server to client) . . . . . . 2605.19 Mb/s*
>
> If the results in one direction showed these rates and a test in the
> opposite direction showed an inverted state (c2s lower than s2c), then I
> would suspect a problem with pacing or flow control in one direction or a
> configuration problem on one node. However, if both nodes report the same
> results (c2s is always greater than s2c), then it is either (1) a problem
> with my code in the xmit loop or (2) an unknown problem.
>
>> The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
>> Ethernet/OC-192 subnet
>>
>> *Information [S2C]: Packet queuing detected: 72.52% (local buffers)*
>>
>> Server 'T' is not behind a firewall. [Connection to the ephemeral port
>> was successful]
>>
>> Client is not behind a firewall. [Connection to the ephemeral port was
>> successful]
>>
>> ------ Web100 Detailed Analysis ------
>>
>> Web100 reports the Round trip time = 4.09 msec;the Packet size = 8960
>> Bytes; and
>
> The RTT includes host queuing time (~5.6 MB in queue) using jumbo
> frames. What is the txqueuelen value for this interface (ifconfig command)?
>
>> There were 337 packets retransmitted, 9749 duplicate acks received, and
>> 10089 SACK blocks received
>>
>> Packets arrived out-of-order 4.32% of the time.
>
> Packets are being reordered. This is probably due to the pkt processing
> by multiple cores.
>
>> The connection stalled 1 times due to packet loss.
>>
>> The connection was idle 0.20 seconds (2.00%) of the time.
>
> The sending node went through at least 1 timeout. Add a 2nd -l to the
> command line and look at the last 3 variables that get reported (*CWND*);
> this will tell you the number of times TCP invoked the CA algorithm and
> what the high and low watermarks were.
>
>> This connection is receiver limited 2.41% of the time.
>>
>> This connection is sender limited 76.06% of the time.
>
> This is saying that the sender has limited resources, probably
> txqueuelen limits that prevent it from sending more data. Note with
> jumbo frames 4 msec is about 5.6 MB and 624 packets.
>
>> This connection is network limited 21.54% of the time.
>>
>> Web100 reports TCP negotiated the optional Performance Settings to:
>>
>> RFC 2018 Selective Acknowledgment: ON
>>
>> RFC 896 Nagle Algorithm: ON
>>
>> RFC 3168 Explicit Congestion Notification: OFF
>>
>> RFC 1323 Time Stamping: OFF
>>
>> RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9
>>
>> The theoretical network limit is 2148.73 Mbps
>
> This is from the Mathis equation (rate = pkt-size / (rtt * sqrt(loss))).
> It is about the same as the measured rate, so this is the limiting factor.
>
>> The NDT server has a 16384 KByte buffer which limits the throughput to
>> 31295.84 Mbps
>>
>> Your PC/Workstation has a 12282 KByte buffer which limits the throughput
>> to 23459.46 Mbps
>>
>> The network based flow control limits the throughput to 23533.01 Mbps
>
> Buffer space, from your tuning parms below, is adequate.
>
>> Client Data reports link is ' 9', Client Acks report link is ' 9'
>>
>> Server Data reports link is ' 9', Server Acks report link is ' 9'
>>
>> Packet size is preserved End-to-End
>>
>> Information: Network Address Translation (NAT) box is modifying the
>> Server's IP address
>>
>> Server says [<IP>] but Client says [ T]
>>
>> Information: Network Address Translation (NAT) box is modifying the
>> Client's IP address
>>
>> Server says [<IP2>] but Client says [M]
>>
>> [root@M ~]#
>>
>> I have these sysctl values set on both servers:
>>
>> net.core.rmem_max = 16777216
>>
>> net.core.wmem_max = 16777216
>>
>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>
>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>
>> net.core.netdev_max_backlog = 250000
>>
>> net.ipv4.tcp_no_metrics_save = 1
>
> Run the ifconfig command and report the txqueuelen value.
>
>> I have also noticed that same asymmetry in the bandwidth while
>> transferring an FTP file back and forth on the same link between the two
>> servers.
>>
>> Note that the traffic both ways is using the same path.
>>
>> I am wondering why there is a difference between the sent and received
>> bandwidth, and what parameters I should tune to use the full 10G both ways.
>>
>> Any ideas are very appreciated.
>>
>> Thanks!
>
> I don't have a good clue right now. Check the things listed
> (txqueuelen, version info, NIC tuning) and also run with more logging
> (-ll instead of -l). Turning on flow control may help. Also consider
> running an NPAD test. The NPAD system probes for pkt queues and other
> system configuration settings and it may point out more details that can
> help you understand what is going on here.
>
> Rich
>
>> Nathalie~
>>
>