ndt-users - Re: Slow Inbound Tests
- From: Clayton Keller
- Subject: Re: Slow Inbound Tests
- Date: Wed, 19 Oct 2005 12:01:19 -0500
Just a report back:
Here is what I changed the values in sysctl.conf to:
net.core.wmem_max = 4194304
net.core.rmem_max = 4194304
net.ipv4.tcp_wmem = 4096 87380 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_default_win_scale = 7
net.ipv4.tcp_moderate_rcvbuf = 1
Looking at the Java Console, this has cut the time roughly in half:
Old sysctl.conf settings:
wait flag received = 0
server ports 32847 32848
calling in2Socket.getLocalAddress()
339 Kbs outbound got 19
12640256 bytes 1506.2268827454718 Kb/s 67.136 secs
Calling InetAddress.getLocalHost() twice
New sysctl.conf settings:
wait flag received = 0
server ports 32849 32850
calling in2Socket.getLocalAddress()
382 Kbs outbound got 19
6324224 bytes 1349.7436773023157 Kb/s 37.484 secs
Calling InetAddress.getLocalHost() twice
I also ran the test without the "-m" flag and got similar results. So
the sysctl change did help some. We cut the maximum buffer in half (8 MB
to 4 MB) and the time to return results was also cut roughly in half
(67.1 secs to 37.5 secs).
Clay
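
The Java Console figures above can be checked against Rich's send-queue explanation with a little arithmetic. The sketch below is not part of NDT: the function names are made up for illustration, and it assumes the worst case, where the whole tcp_wmem maximum from sysctl.conf is still queued when the server stops sending after 10 seconds and the queue then drains at the rate the client measured.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in Kb/s (1 Kb = 1000 bits) for `bytes` received in `secs`. */
double throughput_kbps(double bytes, double secs)
{
    assert(secs > 0.0);
    return bytes * 8.0 / secs / 1000.0;
}

/* Seconds needed to drain `queued_bytes` from a send queue at `kbps` Kb/s. */
double drain_seconds(double queued_bytes, double kbps)
{
    assert(kbps > 0.0);
    return queued_bytes * 8.0 / (kbps * 1000.0);
}

/* Hypothetical helper: print the estimate for one Java Console sample.
 * Assumption: the entire tcp_wmem maximum is still queued when the
 * server stops sending at the end of its 10 second test. */
void report(const char *label, double bytes, double secs,
            double wmem_max_bytes)
{
    double kbps = throughput_kbps(bytes, secs);
    double drain = drain_seconds(wmem_max_bytes, kbps);

    printf("%s: %.1f Kb/s, worst-case drain %.1f s, est. total %.1f s "
           "(observed %.1f s)\n",
           label, kbps, drain, 10.0 + drain, secs);
}
```

Calling `report("old", 12640256, 67.136, 8388608)` and `report("new", 6324224, 37.484, 4194304)` from a `main` reproduces the Kb/s values the console printed and gives worst-case drain times of roughly 45 and 25 seconds. Added to the 10 second send window, that is in the same ballpark as the observed 67 and 37 seconds, which fits the idea that the wait is mostly queue drain time.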
Clayton Keller wrote:
> Rich,
>
> The server currently is not doing much else. Load averages on the server
> sit pretty much at 0.00. It is a Pentium-4 3.40GHz with 2GB of RAM.
> There is not anything else running on it that is causing any heavy loads
> or additional traffic at this time.
>
> Currently, I have the following lines added to the /etc/sysctl.conf
> file, which I acquired from the README:
>
> # Recommended sysctl settings from web100 README
> net.core.wmem_max = 8388608
> net.core.rmem_max = 8388608
> net.ipv4.tcp_wmem = 4096 65536 8388608
> net.ipv4.tcp_rmem = 4096 87380 8388608
> net.ipv4.tcp_default_win_scale = 7
> net.ipv4.tcp_moderate_rcvbuf = 1
>
> I can go ahead and make the adjustments that you recommended, but didn't
> know if I should be making any further changes as well.
>
> I will run some further tests with the new settings and also with the
> "-m" flag removed. However, I wanted to run the sysctl.conf settings
> that we currently have by you first, and see if I should look at further
> changes there.
>
> Clay
>
> Richard Carlson wrote:
>
>>Hi Clay
>>
>>OK, I looked at the traces and the web100 stats and there are a couple
>>of things that stand out.
>>
>>1) your server is set to use 16 MB buffers.
>>2) this inbound test ran for 18 seconds (Duration and SndLimTimeCwnd)
>>3) the trace (.2790) shows that data stops flowing, but the connection
>>isn't closing gracefully (no TCP FIN packets being exchanged). [This
>>might be another bug in my server code]
>>
>>It's not clear to me why the test is running so long. What else is
>>running on this server? Is it very busy? What does "/usr/bin/top"
>>report? Finally, what messages appear in the client's Java console
>>window? The client will report how long it spent reading data from the
>>network.
>>
>>Things to try:
>>* One thing would be to reduce the maximum sender buffer size. Try
>>making the max 4 MB instead of 16. Edit the /etc/sysctl.conf file and
>>change the following lines.
>># increase Linux autotuning TCP buffer limits
>>net.ipv4.tcp_rmem = 4096 87380 16777216
>>net.ipv4.tcp_wmem = 4096 87380 16777216
>>to
>># increase Linux autotuning TCP buffer limits
>>net.ipv4.tcp_rmem = 4096 87380 4194304
>>net.ipv4.tcp_wmem = 4096 87380 4194304
>>
>>and then run the "/sbin/sysctl -p" command.
>>
>>One possible problem is that the server is faster than the network so
>>data is being placed in the send queue. The connection wouldn't
>>shut-down until the queue is empty. So even if the NDT process stops
>>sending after 10 seconds, it could take some time to drain the queue.
>>With a 4 MB queue it would take less time to drain.
>>
>>That said, it isn't clear why the client is hanging for so long. I
>>guess it's also possible that my shutdown patch isn't working properly
>>in the multi-client mode. Can you try running the web100srv process
>>without the -m flag? This will cause the server to handle clients in a
>>FIFO manner. If the server is busy the incoming clients will receive a
>>message saying the server is busy and a test will begin in xx seconds.
>>The client is updated every time another client's test finishes. I know
>>the shutdown() patch fixed a hang there, so if possible give it a try and
>>let me know what happens.
>>
>>That's all I can think of right now, I'll think about it some more
>>tonight and run some tests tomorrow.
>>
>>Rich
>>
>>At 09:08 AM 10/18/2005, Clayton Keller wrote:
>>
>>
>>>Rich,
>>>
>>>We are still seeing issues with the Inbound tests even after reverting
>>>to the 2.6.12.5 kernel. This is not the Fedora Source kernel that
>>>Martyn is using, but the stock kernel.org download.
>>>
>>>I would like to go ahead and submit another trace for you. Is there a
>>>possibility that the issues we are seeing are network/bandwidth issues
>>>on our part?
>>>
>>>From my connection which is on a different network, the Outbound test
>>>took approx. 10 seconds while the Inbound test took well over one
>>>minute. The info you are receiving is from a connection on that same
>>>network. The Inbound test took about one minute before it reported its
>>>results back to the user.
>>>
>>>I apologize, but I am not quite sure what all info is found in the
>>>trace so I guess that is why I am asking you if there are external
>>>issues on our end that may be part of the cause.
>>>
>>>Also, I could look at using one of the Fedora kernels and patch it as
>>>Martyn did.
>>>
>>>Clay
>>>
>>>
>>>
>>>Richard Carlson wrote:
>>>
>>>
>>>>Hi Clay;
>>>>The trace you sent does show a problem. At this point I don't see a
>>>>need for more, but it would be useful to see what the 2.6.12 kernel
>>>>does. So I'd suggest you revert back to the 2.6.12 kernel and I'll
>>>>try and figure out how to get the kernel problem resolved.
>>>>Rich
>>>>At 09:21 AM 10/17/2005, Clayton Keller wrote:
>>>>
>>>>
>>>>>Richard Carlson wrote:
>>>>>
>>>>>
>>>>>>Hi Craig;
>>>>>>No, this NDT bug affects all servers. I ran into it while testing
>>>>>>from multiple clients. Clients 2, 3, & 4 would get the "Other
>>>>>>client testing please wait..." type message. Client 2 would not
>>>>>>get the final results until client 4 finished. I'll add this patch
>>>>>>to my next distribution, or you can apply it now if you are
>>>>>>experiencing some problems.
>>>>>>Since this didn't fix Clay's problem, I may need to rethink how the
>>>>>>tests are done. Right now the server simply streams data out for
>>>>>>10 seconds, sending as much as it can. Given the way TCP works,
>>>>>>there is a probability that the server will build up a queue in the
>>>>>>Send buffer (the bus is faster than the wire). This buffer will
>>>>>>need to drain before the test is complete. Packet loss, or other
>>>>>>factors could mean that this draining takes a long time so the
>>>>>>client simply sits there waiting. If it takes too long, the server
>>>>>>process will time-out and terminate so the client will never get
>>>>>>the final results.
>>>>>>More later.
>>>>>>Rich
>>>>>>At 08:26 AM 10/14/2005, Pepmiller, Craig E. wrote:
>>>>>>
>>>>>>
>>>>>>>Ok, so this is only seen when the NDT machine is configured for
>>>>>>>multiple simultaneous clients?
>>>>>>>
>>>>>>>Thanks-
>>>>>>>-Craig
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Richard Carlson
>>>>>>>Sent: Wednesday, October 12, 2005 2:56 PM
>>>>>>>To: Clayton Keller;
>>>>>>>
>>>>>>>Subject: Re: Slow Inbound Tests
>>>>>>>
>>>>>>>Hi Clayton;
>>>>>>>
>>>>>>>This is a bug in the web100srv code. I forgot to shut down the control
>>>>>>>socket at the end of the test. If there are multiple clients then the
>>>>>>>final results are sent in a LIFO manner, so the first client needs to
>>>>>>>wait until all subsequent clients are done before the results are
>>>>>>>returned.
>>>>>>>
>>>>>>>I'll issue a patched version soon. In the meantime you can patch your
>>>>>>>version by hand by adding the line "shutdown(ctlsockfd, SHUT_RDWR);" to
>>>>>>>the web100srv.c file (on line 1126).
>>>>>>>
>>>>>>>Let me know if that fixes things.
>>>>>>>
>>>>>>>Rich
>>>>>>>
>>>>>>>
>>>>>>>---------------------------------------------------------------
>>>>>>>Original code:
>>>>>>> if (admin_view == 1) {
>>>>>>>     totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>>         PktsOut, DupAcksIn, AckPktsIn,
>>>>>>>         CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>>>>>>>         MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>>         mismatch, bad_cable,
>>>>>>>         (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>>     gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>>         Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch, bad_cable,
>>>>>>>         totalcnt, debug);
>>>>>>> }
>>>>>>>
>>>>>>> /* printf("Saved data to log file\n"); */
>>>>>>>
>>>>>>> /* exit(0); */
>>>>>>>}
>>>>>>>
>>>>>>>main(argc, argv)
>>>>>>>
>>>>>>>----------------------------------------------------------
>>>>>>>Modified code
>>>>>>> if (admin_view == 1) {
>>>>>>>     totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>>         PktsOut, DupAcksIn, AckPktsIn,
>>>>>>>         CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>>>>>>>         MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>>         mismatch, bad_cable,
>>>>>>>         (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>>     gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>>         Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch, bad_cable,
>>>>>>>         totalcnt, debug);
>>>>>>> }
>>>>>>> shutdown(ctlsockfd, SHUT_RDWR);
>>>>>>> /* printf("Saved data to log file\n"); */
>>>>>>>
>>>>>>> /* exit(0); */
>>>>>>>}
>>>>>>>
>>>>>>>main(argc, argv)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>At 01:54 PM 10/12/2005, Clayton Keller wrote:
>>>>>>>
>>>>>>>>I wanted to address this to the list. I believe there was a similar
>>>>>>>>post a week or so back but I wanted to address this clean.
>>>>>>>>
>>>>>>>>I currently have web100srv running from /etc/init.d/ndt with the
>>>>>>>>following:
>>>>>>>>
>>>>>>>>/usr/local/sbin/web100srv -a -m -l /var/log/web100/web100srv.log
>>>>>>>>
>>>>>>>>The system is running on Fedora Core 4 using a patched 2.6.13 kernel
>>>>>>>>from kernel.org.
>>>>>>>>
>>>>>>>>The server itself is also sitting behind a PIX firewall.
>>>>>>>>
>>>>>>>>We have noticed that the Outbound Test will run rather quickly, but
>>>>>>>>when the Inbound (server to client) test is run it can take upwards
>>>>>>>>of several minutes to complete, many times as much as 4 minutes.
>>>>>>>>There are other times where, from the end user's point of view, it
>>>>>>>>appears the test never completes, although you can see results for
>>>>>>>>the test appear in the web100.log file. The test, though, will
>>>>>>>>continue to sit on the "running 10s inbound test (server to client)
>>>>>>>>. . . . . ." portion of the test, and many users are beginning to
>>>>>>>>just close out the window.
>>>>>>>>
>>>>>>>>At this point I am looking for general issues that I can look into
>>>>>>>>and possibly run debug against as far as these tests are concerned.
>>>>>>>>
>>>>>>>>Clayton Keller
>>>>>>>
>>>>>>>------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>Richard
>>>>>
>>>>>Did you want me to grab any more traces on newer versions of the
>>>>>2.6.13.x kernel or more on the current kernel it is running? Or
>>>>>should I revert back to my 2.6.12.5 kernel and see how performance
>>>>>improves?
>>>>>
>>>>>I saw from an earlier post to a different thread that it appears
>>>>>you are seeing some items in the traces that are alluding to issues
>>>>>pertaining to the 2.6.13.x kernel.
>>>>>
>>>>>Clay
>>>>
>>>>
>>>>------------------------------------
>>>>
>>>>Richard A. Carlson
>>>>
>>>>Network Engineer phone: (734) 352-7043
>>>>Internet2 fax: (734) 913-4255
>>>>1000 Oakbrook Dr; Suite 300
>>>>Ann Arbor, MI 48104
>>>
>>>
>>>
>>>
>>>TCP/Web100 Network Diagnostic Tool v5.3.3e
>>>click START to begin
>>>Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>>>running 10s outbound test (client to server) . . . . . 894.71Kb/s
>>>running 10s inbound test (server to client) . . . . . . 3.86Mb/s
>>>Your PC is connected to a Cable/DSL modem
>>>Information: Other network traffic is congesting the link
>>>
>>>
>>>WEB100 Kernel Variables:
>>>Client: localhost/127.0.0.1
>>>AckPktsIn: 3330
>>>AckPktsOut: 0
>>>BytesRetrans: 81420
>>>CongAvoid: 2639
>>>CongestionOverCount: 0
>>>CongestionSignals: 27
>>>CountRTT: 2802
>>>CurCwnd: 22080
>>>CurMSS: 1380
>>>CurRTO: 248
>>>CurRwinRcvd: 258060
>>>CurRwinSent: 5888
>>>CurSsthresh: 16560
>>>DSACKDups: 0
>>>DataBytesIn: 0
>>>DataBytesOut: 8879328
>>>DataPktsIn: 0
>>>DataPktsOut: 6192
>>>DupAcksIn: 481
>>>ECNEnabled: 0
>>>FastRetran: 27
>>>MaxCwnd: 63480
>>>MaxMSS: 1380
>>>MaxRTO: 295
>>>MaxRTT: 111
>>>MaxRwinRcvd: 258060
>>>MaxRwinSent: 5888
>>>MaxSsthresh: 41400
>>>MinMSS: 1380
>>>MinRTO: 229
>>>MinRTT: 20
>>>MinRwinRcvd: 238740
>>>MinRwinSent: 5888
>>>NagleEnabled: 1
>>>OtherReductions: 0
>>>PktsIn: 3330
>>>PktsOut: 6192
>>>PktsRetrans: 59
>>>X_Rcvbuf: 16777216
>>>RcvWinScale: 8
>>>SACKEnabled: 3
>>>SACKsRcvd: 510
>>>SendStall: 0
>>>SlowStart: 152
>>>SampleRTT: 42
>>>SmoothedRTT: 48
>>>X_Sndbuf: 16777216
>>>SndWinScale: 2
>>>SndLimTimeRwin: 0
>>>SndLimTimeCwnd: 18404625
>>>SndLimTimeSender: 8258
>>>SndLimTransRwin: 0
>>>SndLimTransCwnd: 1
>>>SndLimTransSender: 1
>>>SndLimBytesRwin: 0
>>>SndLimBytesCwnd: 8879328
>>>SndLimBytesSender: 0
>>>SubsequentTimeouts: 0
>>>SumRTT: 127937
>>>Timeouts: 0
>>>TimestampsEnabled: 0
>>>WinScaleRcvd: 2
>>>WinScaleSent: 8
>>>DupAcksOut: 0
>>>StartTimeUsec: 118172
>>>Duration: 18416093
>>>c2sData: 2
>>>c2sAck: 2
>>>s2cData: 9
>>>s2cAck: 3
>>>half_duplex: 0
>>>link: 100
>>>congestion: 1
>>>bad_cable: 0
>>>mismatch: 0
>>>spd: 0.00
>>>bw: 3.49
>>>loss: 0.004360465
>>>avgrtt: 45.66
>>>waitsec: 0.00
>>>timesec: 18.00
>>>order: 0.1444
>>>rwintime: 0.0000
>>>sendtime: 0.0004
>>>cwndtime: 0.9996
>>>rwin: 1.9688
>>>swin: 128.0000
>>>cwin: 0.4843
>>>rttsec: 0.045659
>>>Sndbuf: 16777216
>>>aspd: 8.63416
>>
>>
>>------------------------------------
>>
>>
>>
>>Richard A. Carlson
>>
>>Network Engineer phone: (734) 352-7043
>>Internet2 fax: (734) 913-4255
>>1000 Oakbrook Dr; Suite 300
>>Ann Arbor, MI 48104
>>
>
>