
ndt-users - Re: bandwidth asymmetry on a 10G link



Re: bandwidth asymmetry on a 10G link


  • From: Rich Carlson <>
  • To: "Gholmieh, Nathalie" <>
  • Cc: "''" <>
  • Subject: Re: bandwidth asymmetry on a 10G link
  • Date: Wed, 21 Apr 2010 13:06:27 -0400

Hi Nathalie;

Have you tried updating the driver on T?

What is the CPU utilization on M? When M is the server, the connection is limited, yet there is no packet loss and no TX queue buildup.

Finally, flow control on the NICs is on, but what about the various switches/routers in the path? You will need to look at the vendor docs to find out how to check/set this parameter.
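[Editor's note: for reference, this is how the host-side pause settings can be checked and toggled with ethtool; the interface name is the one used later in this thread, and switch-side syntax varies by vendor.]

```shell
# Check the current 802.3x pause-frame (flow control) settings on the NIC
ethtool -a eth5

# Enable RX and TX pause on the NIC (takes effect immediately)
ethtool -A eth5 rx on tx on
```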

Rich

On 4/21/2010 11:56 AM, Gholmieh, Nathalie wrote:
Hi Rich-

thank you for your reply.

1) what are the flow control settings for the various interfaces (sudo
/sbin/ethtool -a ethx for the Myricom cards).

[root@M]# ethtool -a eth5
Pause parameters for eth5:
Autonegotiate: off
RX: on
TX: on

[root@T]# ethtool -a eth4
Pause parameters for eth4:
Autonegotiate: off
RX: on
TX: on

2) what drivers are you using on the NICs (ethtool -i ethx)

[root@M]# ethtool -i eth5
driver: myri10ge
version: 1.5.0
firmware-version: 1.4.43 -- 2009/05/26 23:56:28 m
bus-info: 0000:07:00.0

[root@T]# ethtool -i eth4
driver: myri10ge
version: 1.4.4-1.401
firmware-version: 1.4.43 -- 2009/05/26 23:56:28 m
bus-info: 0000:07:00.0

3) have you tried making the txqueuelen bigger (say 10000)?

I have increased txqueuelen on the two machines to 10000, and the results are
still the same.
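[Editor's note: for readers reproducing this step, the queue length can be raised on the fly; the setting is not persistent across reboots, and the interface name is the one from this thread.]

```shell
# Raise the transmit queue length to 10000 (the default here was 1000)
ifconfig eth5 txqueuelen 10000

# Verify the new value
ifconfig eth5 | grep -o 'txqueuelen:[0-9]*'
```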

I have tried installing NPAD on these two boxes, but the python web100 test
is failing on both and NPAD is not working. I'm looking more into that.


Nathalie~

-----Original Message-----
From: Rich Carlson
[mailto:]
Sent: Monday, April 19, 2010 10:42 AM
To: Gholmieh, Nathalie
Cc:
''
Subject: Re: bandwidth asymmetry on a 10G link

Nathalie;

Thanks for the correction. A few quick comments/questions:

1) what are the flow control settings for the various interfaces (sudo
/sbin/ethtool -a ethx for the Myricom cards).

2) what drivers are you using on the NICs (ethtool -i ethx)

3) have you tried making the txqueuelen bigger (say 10000)?

4) the 1st output shows
> minCWNDpeak: 8960
> maxCWNDpeak: 12615680
> CWNDpeaks: 26

Which says the CWND dropped down to 1 packet and grew up to 1408 packets.
Also, the flow went through 26 growth/loss cycles. (Note: the NDT
analysis is based on the use of Reno.)
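[Editor's note: the packet counts follow directly from the CWND byte values above divided by the 8960-byte CurMSS reported in the log.]

```shell
# CWND peaks in packets = CWND bytes / MSS (CurMSS = 8960 from the output)
echo $(( 8960     / 8960 ))   # minCWNDpeak in packets
echo $(( 12615680 / 8960 ))   # maxCWNDpeak in packets
```

This prints 1 and 1408, the figures Rich quotes.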

5) the 2nd output shows
> minCWNDpeak: -1
> maxCWNDpeak: -1
> CWNDpeaks: -1

Which means the connection never went into CA mode. A host buffer limit
was reached before any packets were lost.

6) both outputs also show that the NDT server is resource constrained
and that is why the speed is limited.

7) have you tried running an NPAD test to see if there are switch/router
buffers impacting the path?

Rich

On 4/19/2010 12:59 PM, Gholmieh, Nathalie wrote:
Hi Rich-

I am sorry my email was not clear.

The two servers are located in different parts of campus and communicate
over a 10G fiber link through two Cisco Layer 3 switches.
The two servers are running
- linux kernel 2.6.30 with web100 patch
- Myricom 10G-PCIE2-8B2-2S+E

NDT tests run on the two linux machines show the same asymmetry in the
bandwidth. In summary:
Web100clt run on machine 1 with machine 2 as server returns:
C2S bandwidth ~ 10Gbps
S2C bandwidth< 3Gbps
Web100clt run on machine 2 with machine 1 as server returns:
C2S bandwidth ~ 10Gbps
S2C bandwidth< 3Gbps

I have set the Myricom NICs on both machines to use jumbo frames.
Write combining is enabled on the NICs.
I have modified the network buffer sizes, as sent in my first email, to the
values recommended by Myricom for better performance.
I have also played with the congestion control algorithm: I have read that
cubic might improve performance, so I set net.ipv4.tcp_congestion_control =
cubic. I still get the same results.

txqueuelen:1000 on both servers

here are the results of NDT tests run on the two servers:

--------
M is the client and T is the server:
--------

[root@M ~]# web100clt -n T.ucsd.edu -lll
Testing network path for configuration and performance problems -- Using
IPv4 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
checking for firewalls . . . . . . . . . . . . . . . . . . . Done
running 10s outbound test (client to server) . . . . . 9564.84 Mb/s
running 10s inbound test (server to client) . . . . . . 2901.84 Mb/s
The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
Ethernet/OC-192 subnet
Information [S2C]: Packet queuing detected: 70.32% (local buffers)
Server 'T.ucsd.edu' is not behind a firewall. [Connection to the ephemeral
port was successful]
Client is not behind a firewall. [Connection to the ephemeral port was
successful]

------ Web100 Detailed Analysis ------

Web100 reports the Round trip time = 6.85 msec;the Packet size = 8960 Bytes;
and
There were 96 packets retransmitted, 6893 duplicate acks received, and 7002
SACK blocks received
Packets arrived out-of-order 3.00% of the time.
This connection is sender limited 81.76% of the time.
This connection is network limited 17.63% of the time.

Web100 reports TCP negotiated the optional Performance Settings to:
RFC 2018 Selective Acknowledgment: ON
RFC 896 Nagle Algorithm: ON
RFC 3168 Explicit Congestion Notification: OFF
RFC 1323 Time Stamping: OFF
RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9
The theoretical network limit is 1970.18 Mbps
The NDT server has a 16384 KByte buffer which limits the throughput to
18688.86 Mbps
Your PC/Workstation has a 12282 KByte buffer which limits the throughput to
14009.23 Mbps
The network based flow control limits the throughput to 14053.15 Mbps

Client Data reports link is ' 9', Client Acks report link is ' 9'
Server Data reports link is ' 9', Server Acks report link is ' 9'
Packet size is preserved End-to-End
Information: Network Address Translation (NAT) box is modifying the Server's
IP address
Server says [<IP1>] but Client says [ T.ucsd.edu]
Information: Network Address Translation (NAT) box is modifying the Client's
IP address,
Server says [<IP2>] but Client says [ M.ucsd.edu]
CurMSS: 8960
X_Rcvbuf: 87380
X_Sndbuf: 16777216
AckPktsIn: 230148
AckPktsOut: 0
BytesRetrans: 843008
CongAvoid: 0
CongestionOverCount: 0
CongestionSignals: 35
CountRTT: 223133
CurCwnd: 9309440
CurRTO: 208
CurRwinRcvd: 12481024
CurRwinSent: 17920
CurSsthresh: 8296960
DSACKDups: 0
DataBytesIn: 0
DataBytesOut: -660222984
DataPktsIn: 0
DataPktsOut: 1363901
DupAcksIn: 6893
ECNEnabled: 0
FastRetran: 35
MaxCwnd: 12615680
MaxMSS: 8960
MaxRTO: 211
MaxRTT: 11
MaxRwinRcvd: 12576256
MaxRwinSent: 17920
MaxSsthresh: 9551360
MinMSS: 8960
MinRTO: 201
MinRTT: 0
MinRwinRcvd: 17920
MinRwinSent: 17920
NagleEnabled: 1
OtherReductions: 144
PktsIn: 230148
PktsOut: 1363901
PktsRetrans: 96
RcvWinScale: 9
SACKEnabled: 3
SACKsRcvd: 7002
SendStall: 0
SlowStart: 0
SampleRTT: 8
SmoothedRTT: 8
SndWinScale: 9
SndLimTimeRwin: 61464
SndLimTimeCwnd: 1773632
SndLimTimeSender: 8227091
SndLimTransRwin: 544
SndLimTransCwnd: 28627
SndLimTransSender: 29117
SndLimBytesRwin: 41668440
SndLimBytesCwnd: -1770059740
SndLimBytesSender: 1068168316
SubsequentTimeouts: 0
SumRTT: 1528314
Timeouts: 0
TimestampsEnabled: 0
WinScaleRcvd: 9
WinScaleSent: 9
DupAcksOut: 0
StartTimeUsec: 903832
Duration: 10062211
c2sData: 9
c2sAck: 9
s2cData: 9
s2cAck: 9
half_duplex: 0
link: 100
congestion: 0
bad_cable: 0
mismatch: 0
spd: -524.91
bw: 1970.18
loss: 0.000025662
avgrtt: 6.85
waitsec: 0.00
timesec: 10.00
order: 0.0300
rwintime: 0.0061
sendtime: 0.8176
cwndtime: 0.1763
rwin: 95.9492
swin: 128.0000
cwin: 96.2500
rttsec: 0.006849
Sndbuf: 16777216
aspd: 0.00000
CWND-Limited: 115198.00
minCWNDpeak: 8960
maxCWNDpeak: 12615680
CWNDpeaks: 26
[root@M ~]#

-----------------------
T is the client and M is the server (I had restarted M so some of the values
were reset to the default):
-----------------------

[root@T ~]# web100clt -n M.ucsd.edu -lll
Testing network path for configuration and performance problems -- Using
IPv4 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
checking for firewalls . . . . . . . . . . . . . . . . . . . Done
running 10s outbound test (client to server) . . . . . 9687.49 Mb/s
running 10s inbound test (server to client) . . . . . . 1919.56 Mb/s
The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
Ethernet/OC-192 subnet
Information [S2C]: Packet queuing detected: 78.17% (local buffers)
Server 'M.ucsd.edu' is not behind a firewall. [Connection to the ephemeral
port was successful]
Client is not behind a firewall. [Connection to the ephemeral port was
successful]

------ Web100 Detailed Analysis ------

Web100 reports the Round trip time = 0.79 msec;the Packet size = 8960 Bytes;
and
No packet loss was observed.
This connection is sender limited 99.62% of the time.

Web100 reports TCP negotiated the optional Performance Settings to:
RFC 2018 Selective Acknowledgment: ON
RFC 896 Nagle Algorithm: ON
RFC 3168 Explicit Congestion Notification: OFF
RFC 1323 Time Stamping: OFF
RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9
The theoretical network limit is 8658517.00 Mbps
The NDT server has a 16384 KByte buffer which limits the throughput to
162025.31 Mbps
Your PC/Workstation has a 12288 KByte buffer which limits the throughput to
121518.98 Mbps
The network based flow control limits the throughput to 121835.44 Mbps

Client Data reports link is ' 9', Client Acks report link is ' 9'
Server Data reports link is ' 9', Server Acks report link is ' 9'
Packet size is preserved End-to-End
Information: Network Address Translation (NAT) box is modifying the Server's
IP address
Server says [<IP2>] but Client says [ M.ucsd.edu]
Information: Network Address Translation (NAT) box is modifying the Client's
IP address
Server says [<IP1>] but Client says [ T]
CurMSS: 8960
X_Rcvbuf: 87380
X_Sndbuf: 16777216
AckPktsIn: 224114
AckPktsOut: 0
BytesRetrans: 0
CongAvoid: 0
CongestionOverCount: 0
CongestionSignals: 0
CountRTT: 224114
CurCwnd: 12615680
CurRTO: 201
CurRwinRcvd: 12558848
CurRwinSent: 17920
CurSsthresh: -256
DSACKDups: 0
DataBytesIn: 0
DataBytesOut: -1871234080
DataPktsIn: 0
DataPktsOut: 1237850
DupAcksIn: 0
ECNEnabled: 0
FastRetran: 0
MaxCwnd: 12615680
MaxMSS: 8960
MaxRTO: 211
MaxRTT: 11
MaxRwinRcvd: 12582912
MaxRwinSent: 17920
MaxSsthresh: 0
MinMSS: 8960
MinRTO: 201
MinRTT: 0
MinRwinRcvd: 17920
MinRwinSent: 17920
NagleEnabled: 1
OtherReductions: 0
PktsIn: 224114
PktsOut: 1237850
PktsRetrans: 0
RcvWinScale: 9
SACKEnabled: 3
SACKsRcvd: 0
SendStall: 0
SlowStart: 0
SampleRTT: 0
SmoothedRTT: 1
SndWinScale: 9
SndLimTimeRwin: 29301
SndLimTimeCwnd: 8996
SndLimTimeSender: 10054035
SndLimTransRwin: 602
SndLimTransCwnd: 99
SndLimTransSender: 701
SndLimBytesRwin: 49860420
SndLimBytesCwnd: 14403440
SndLimBytesSender: -1935497940
SubsequentTimeouts: 0
SumRTT: 176939
Timeouts: 0
TimestampsEnabled: 0
WinScaleRcvd: 9
WinScaleSent: 9
DupAcksOut: 0
StartTimeUsec: 614339
Duration: 10094355
c2sData: 9
c2sAck: 9
s2cData: 9
s2cAck: 9
half_duplex: 0
link: 100
congestion: 0
bad_cable: 0
mismatch: 0
spd: -1483.29
bw: 8658516.76
loss: 0.000000000
avgrtt: 0.79
waitsec: 0.00
timesec: 10.00
order: 0.0000
rwintime: 0.0029
sendtime: 0.9962
cwndtime: 0.0009
rwin: 96.0000
swin: 128.0000
cwin: 96.2500
rttsec: 0.000790
Sndbuf: 16777216
aspd: 0.00000
CWND-Limited: 121894.00
minCWNDpeak: -1
maxCWNDpeak: -1
CWNDpeaks: -1
[root@T ~]#

-------------------------------

thanks!


Nathalie~



-----Original Message-----
From: Rich Carlson
[mailto:]
Sent: Thursday, April 15, 2010 11:17 AM
To: Gholmieh, Nathalie
Cc:
''
Subject: Re: bandwidth asymmetry on a 10G link

Hi Nathalie;

I'm not sure I understand the configuration and the problem so let me
ask for clarification.

You have 2 hosts connected back-to-back with a cross-over cable (fiber
or copper?) You have installed an NDT server on both nodes and from
either node you get asymmetric results as shown below. If this is not
correct, then please clarify.

A couple of questions.
1) What Linux kernel version are you using?
2) What Myricom driver version are you using?
3) Have you tuned any of the Myricom parameters?

more comments in-line

On 4/15/2010 1:27 PM, Gholmieh, Nathalie wrote:
Hi-

I have setup 2 NDT servers interconnected with a 10G link, both using
Myricom 10G NICs, on our local network. the two servers have the same
versions of NDT 3.6.1.

when running NDT tests between the 2 servers, I get a C2S bandwidth of
approximately 10Gbps, but the S2C bandwidth is not exceeding 3 Gbps, and
that is on BOTH machines:

[root@M ~]# web100clt -n T -4 -l

Testing network path for configuration and performance problems -- Using
IPv4 address

Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done

checking for firewalls . . . . . . . . . . . . . . . . . . . Done

*running 10s outbound test (client to server) . . . . . 9351.25 Mb/s*

*running 10s inbound test (server to client) . . . . . . 2605.19 Mb/s*

If the results in one direction showed these rates and a test in the
opposite direction showed the inverted state (c2s lower than s2c), then I
would suspect a problem with pacing or flow control in one direction, or a
configuration problem on one node. However, if both nodes report the same
results (c2s is always greater than s2c), then it is either (1) a problem
with my code in the xmit loop, or (2) an unknown problem.

The slowest link in the end-to-end path is a 10 Gbps 10 Gigabit
Ethernet/OC-192 subnet

*Information [S2C]: Packet queuing detected: 72.52% (local buffers)*

Server 'T' is not behind a firewall. [Connection to the ephemeral port
was successful]

Client is not behind a firewall. [Connection to the ephemeral port was
successful]

------ Web100 Detailed Analysis ------

Web100 reports the Round trip time = 4.09 msec;the Packet size = 8960
Bytes; and

The RTT includes host queuing time (~5.6 MB in queue) using jumbo
frames. What is the txqueuelen value for this interface (ifconfig command)?

There were 337 packets retransmitted, 9749 duplicate acks received, and
10089 SACK blocks received

Packets arrived out-of-order 4.32% of the time.

Packets are being reordered. This is probably due to packet processing
by multiple cores.

The connection stalled 1 times due to packet loss.

The connection was idle 0.20 seconds (2.00%) of the time.

The sending node went through at least 1 timeout. Add a 2nd -l to the
command line and look at the last 3 variables that get reported (*CWND*);
these will tell you the number of times TCP invoked the CA algorithm and
what the high and low watermarks were.

This connection is receiver limited 2.41% of the time.

This connection is sender limited 76.06% of the time.

This is saying that the sender has limited resources, probably txqueuelen
limits, that prevent it from sending more data. Note that with jumbo
frames, 4 msec at 10 Gb/s is about 5.6 MB, or roughly 624 packets.
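[Editor's note: this is bandwidth-delay-product arithmetic, assuming a 10 Gb/s line rate and the jumbo MSS from the output; with the measured 4.09 msec RTT it comes out slightly lower than Rich's figures, which correspond to an RTT of roughly 4.5 msec.]

```shell
# Bytes in flight = rate/8 * RTT; packets in flight = BDP / MSS
awk 'BEGIN {
  rate = 10e9       # assumed line rate, bits/sec
  rtt  = 0.00409    # measured avg RTT from this test, sec
  mss  = 8960       # jumbo-frame MSS from the NDT output, bytes
  bdp  = rate / 8 * rtt
  printf "BDP: %.1f MB, %d packets\n", bdp / 1e6, bdp / mss
}'
```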

This connection is network limited 21.54% of the time.

Web100 reports TCP negotiated the optional Performance Settings to:

RFC 2018 Selective Acknowledgment: ON

RFC 896 Nagle Algorithm: ON

RFC 3168 Explicit Congestion Notification: OFF

RFC 1323 Time Stamping: OFF

RFC 1323 Window Scaling: ON; Scaling Factors - Server=9, Client=9

The theoretical network limit is 2148.73 Mbps

This is from the Mathis equation ((pkt-size)/(rtt*sqrt(loss))). It is
about the same as the measured rate, so this is the limiting factor.
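[Editor's note: a quick check of the formula using the loss and RTT values printed in the more detailed run earlier in this thread (loss 0.000025662, avgrtt 6.85 msec, MSS 8960 bytes); it lands in the same ballpark as NDT's reported 1970.18 Mbps theoretical limit, though NDT's exact constant differs slightly.]

```shell
# Mathis et al. throughput bound: rate <= MSS / (RTT * sqrt(loss))
awk 'BEGIN {
  mss  = 8960          # bytes, from the NDT output
  rtt  = 0.006849      # sec, avgrtt from the detailed run
  loss = 0.000025662   # packet loss rate from the same run
  printf "%.0f Mbps\n", mss * 8 / (rtt * sqrt(loss)) / 1e6
}'
```

This gives roughly 2.1 Gbps, matching the observed ~2-3 Gbps S2C rates.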

The NDT server has a 16384 KByte buffer which limits the throughput to
31295.84 Mbps

Your PC/Workstation has a 12282 KByte buffer which limits the throughput
to 23459.46 Mbps

The network based flow control limits the throughput to 23533.01 Mbps

Buffer space, from your tuning parms below, is adequate.
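[Editor's note: the "buffer ... limits the throughput to" lines above are simply window size divided by RTT; NDT expresses the window in binary megabits. Reproducing the server-buffer figure from this run:]

```shell
# Window-limited throughput = send buffer / RTT
awk 'BEGIN {
  sndbuf = 16384 * 1024               # NDT server send buffer, bytes
  rtt    = 0.00409                    # measured avg RTT, sec
  mbits  = sndbuf * 8 / (1024 * 1024) # 128 "binary" megabits
  printf "%.2f Mbps\n", mbits / rtt
}'
```

This reproduces the 31295.84 Mbps server-buffer limit reported above.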

Client Data reports link is ' 9', Client Acks report link is ' 9'

Server Data reports link is ' 9', Server Acks report link is ' 9'

Packet size is preserved End-to-End

Information: Network Address Translation (NAT) box is modifying the
Server's IP address

Server says [<IP>] but Client says [ T]

Information: Network Address Translation (NAT) box is modifying the
Client's IP address

Server says [<IP2>] but Client says [M]

[root@M ~]#

I have these sysctl values set on both servers:

net.core.rmem_max = 16777216

net.core.wmem_max = 16777216

net.ipv4.tcp_wmem = 4096 65536 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

net.core.netdev_max_backlog = 250000

net.ipv4.tcp_no_metrics_save = 1

Run the ifconfig command and report the txqueuelen value.

I have also noticed that same asymmetry in the bandwidth while
transferring an FTP file back and forth on the same link between the two
servers.

Note that the traffic both ways is using the same path.

I am wondering why there is a difference between the sent and received
bandwidth, and what parameters I should tune to use the full 10G both ways.

Any ideas are very appreciated.

Thanks!

I don't have a good clue right now. Check the things listed
(txqueuelen, version info, NIC tuning) and also run with more logging
(-ll instead of -l). Turning on flow control may help. Also consider
running an NPAD test. The NPAD system probes for packet queues and other
system configuration settings, and it may point out more details that can
help you understand what is going on here.

Rich

Nathalie~






Archive powered by MHonArc 2.6.16.
