
ndt-users - Re: Slow Inbound Tests

Subject: ndt-users list created


Re: Slow Inbound Tests


  • From: Clayton Keller <>
  • To: Richard Carlson <>
  • Cc:
  • Subject: Re: Slow Inbound Tests
  • Date: Wed, 19 Oct 2005 08:43:50 -0500

Rich,

The server currently is not doing much else. Load averages on the server
sit pretty much at 0.00. It is a Pentium-4 3.40GHz with 2GB of RAM.
There is not anything else running on it that is causing any heavy loads
or additional traffic at this time.

Currently, I have the following lines added to the /etc/sysctl.conf
file, which I acquired from the README:

# Recommended sysctl settings from web100 README
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_default_win_scale = 7
net.ipv4.tcp_moderate_rcvbuf = 1
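
For a rough sense of scale, the sketch below parses the limits listed above and compares the largest TCP send buffer with the bandwidth-delay product of the path. The 100 Mb/s link speed and the 45.66 ms round-trip time are assumptions, taken from the "link" and "avgrtt" values in the web100 output quoted at the end of this message; the helper names parse_sysctl and bdp_bytes are made up for the example.

```python
SYSCTL_TEXT = """\
# Recommended sysctl settings from web100 README
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_default_win_scale = 7
net.ipv4.tcp_moderate_rcvbuf = 1
"""


def parse_sysctl(text):
    """Map each sysctl key to its list of integer values, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings


def bdp_bytes(link_bits_per_second, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return link_bits_per_second * rtt_seconds / 8


settings = parse_sysctl(SYSCTL_TEXT)
max_send_buffer = settings["net.ipv4.tcp_wmem"][2]  # min, default, max
bdp = bdp_bytes(100e6, 0.04566)  # assumed: 100 Mb/s link, 45.66 ms RTT

print(f"max send buffer: {max_send_buffer} bytes")
print(f"bandwidth-delay product: about {bdp:.0f} bytes")
print(f"buffer is about {max_send_buffer / bdp:.1f}x the BDP")
```

Under those assumptions the bandwidth-delay product is a little over half a megabyte, so both the 8 MB maximum above and a 4 MB maximum leave plenty of headroom for this path.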

I can go ahead and make the adjustments that you recommended, but didn't
know if I should be making any further changes as well.

I will run some further tests with the new settings and also with the
"-m" flag removed. However, I wanted to run the sysctl.conf settings
that we currently have by you first, and see if I should look at further
changes there.

Clay

Richard Carlson wrote:
> Hi Clay
>
> OK, I looked at the traces and the web100 stats and there are a couple
> of things that stand out.
>
> 1) your server is set to use 16 MB buffers.
> 2) this inbound test ran for 18 seconds (Duration and SndLimTimeCwnd)
> 3) the trace (.2790) shows that data stops flowing, but the connection
> isn't closing gracefully (no TCP FIN packets being exchanged). [This
> might be another bug in my server code]
>
> It's not clear to me why the test is running so long. What else is
> running on this server? Is it very busy? What does "/usr/bin/top"
> report? Finally, what messages appear in the client's Java console
> window? The client will report how long it spent reading data from the
> network.
>
> Things to try:
> * One thing would be to reduce the maximum sender buffer size. Try
> making the max 4 MB instead of 16. Edit the /etc/sysctl.conf file and
> change the following lines.
> # increase Linux autotuning TCP buffer limits
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 87380 16777216
> to
> # increase Linux autotuning TCP buffer limits
> net.ipv4.tcp_rmem = 4096 87380 4194304
> net.ipv4.tcp_wmem = 4096 87380 4194304
>
> and then run the "/sbin/sysctl -p" command.
>
> One possible problem is that the server is faster than the network so
> data is being placed in the send queue. The connection wouldn't
> shut-down until the queue is empty. So even if the NDT process stops
> sending after 10 seconds, it could take some time to drain the queue.
> With a 4 MB queue it would take less time to drain.
>
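
As a rough illustration of the drain-time argument above, the sketch below divides a full send buffer by the path rate. It assumes the whole buffer is queued when the sender stops, that the path sustains the 3.86 Mb/s reported by the inbound test quoted later in this thread, and that 1 Mb/s means 10^6 bits per second. The function name drain_seconds is made up for the example.

```python
def drain_seconds(queued_bytes, path_bits_per_second):
    """Time needed to empty a send queue of queued_bytes at the given path rate."""
    if path_bits_per_second <= 0:
        raise ValueError("path rate must be positive")
    return queued_bytes * 8 / path_bits_per_second


PATH_RATE_BPS = 3.86e6  # assumed: inbound rate reported by the NDT client

for label, size in (("16 MB", 16 * 2**20), ("4 MB", 4 * 2**20)):
    seconds = drain_seconds(size, PATH_RATE_BPS)
    print(f"{label} buffer: about {seconds:.1f} s to drain")
```

Under those assumptions a full 16 MB queue needs roughly 35 seconds to drain and a full 4 MB queue roughly 9 seconds. This is a worst case, since the kernel may never fill the buffer completely.
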
> That said, it isn't clear why the client is hanging for so long. I
> guess it's also possible that my shutdown patch isn't working properly
> in the multi-client mode. Can you try running the web100srv process
> without the -m flag? This will cause the server to handle clients in a
> FIFO manner. If the server is busy the incoming clients will receive a
> message saying the server is busy and a test will begin in xx seconds.
> The client is updated every time another client's test finishes. I know
> the shutdown() patch fixed a hang there, so if possible give it a try and
> let me know what happens.
>
> That's all I can think of right now. I'll think about it some more
> tonight and run some tests tomorrow.
>
> Rich
>
> At 09:08 AM 10/18/2005, Clayton Keller wrote:
>
>> Rich,
>>
>> We are still seeing issues with the Inbound tests even after reverting
>> to the 2.6.12.5 kernel. This is not the Fedora Source kernel that
>> Martin is using, but the stock kernel.org download.
>>
>> I would like to go ahead and submit another trace for you. Is there a
>> possibility that the issues we are seeing are network/bandwidth issues
>> on our part?
>>
>> From my connection which is on a different network, the Outbound test
>> took approx. 10 seconds while the Inbound test took well over one
>> minute. The info you are receiving is from a connection on that same
>> network. The Inbound test took about one minute before it reported its
>> results back to the user.
>>
>> I apologize, but I am not quite sure what all info is found in the
>> trace so I guess that is why I am asking you if there are external
>> issues on our end that may be part of the cause.
>>
>> Also, I could look at using one of the Fedora kernels and patch it as
>> Martyn did.
>>
>> Clay
>>
>>
>>
>> Richard Carlson wrote:
>>
>>> Hi Clay;
>>> The trace you sent does show a problem. At this point I don't see a
>>> need for more, but it would be useful to see what the 2.6.12 kernel
>>> does. So I'd suggest you revert back to the 2.6.12 kernel and I'll
>>> try and figure out how to get the kernel problem resolved.
>>> Rich
>>> At 09:21 AM 10/17/2005, Clayton Keller wrote:
>>>
>>>> Richard Carlson wrote:
>>>>
>>>>> Hi Craig;
>>>>> No, this NDT bug affects all servers. I ran into it while testing
>>>>> from multiple clients. Clients 2, 3, & 4 would get the "Other
>>>>> client testing please wait..." type message. Client 2 would not
>>>>> get the final results until client 4 finished. I'll add this patch
>>>>> to my next distribution, or you can apply it now if you are
>>>>> experiencing some problems.
>>>>> Since this didn't fix Clay's problem, I may need to rethink how the
>>>>> tests are done. Right now the server simply streams data out for
>>>>> 10 seconds, sending as much as it can. Given the way TCP works,
>>>>> there is a probability that the server will build up a queue in the
>>>>> Send buffer (the bus is faster than the wire). This buffer will
>>>>> need to drain before the test is complete. Packet loss, or other
>>>>> factors could mean that this draining takes a long time so the
>>>>> client simply sits there waiting. If it takes too long, the server
>>>>> process will time-out and terminate so the client will never get
>>>>> the final results.
>>>>> More later.
>>>>> Rich
>>>>> At 08:26 AM 10/14/2005, Pepmiller, Craig E. wrote:
>>>>>
>>>>>> Ok, so this is only seen when the NDT machine is configured for
>>>>>> multiple simultaneous clients?
>>>>>>
>>>>>> Thanks-
>>>>>> -Craig
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Richard Carlson
>>>>>> [mailto:]
>>>>>> Sent: Wednesday, October 12, 2005 2:56 PM
>>>>>> To: Clayton Keller;
>>>>>>
>>>>>> Subject: Re: Slow Inbound Tests
>>>>>>
>>>>>> Hi Clayton;
>>>>>>
>>>>>> This is a bug in the web100srv code. I forgot to shut down the control
>>>>>> socket at the end of the test. If there are multiple clients then the
>>>>>> final results are sent in a LIFO manner, so the first client needs to
>>>>>> wait until all subsequent clients are done before the results are
>>>>>> returned.
>>>>>>
>>>>>> I'll issue a patched version soon. In the meantime you can patch your
>>>>>> version by hand by adding the line "shutdown(ctlsockfd, SHUT_RDWR);" to
>>>>>> the web100srv.c file (on line 1126).
>>>>>>
>>>>>> Let me know if that fixes things.
>>>>>>
>>>>>> Rich
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------
>>>>>> Original code:
>>>>>>     if (admin_view == 1) {
>>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>                 PktsOut, DupAcksIn, AckPktsIn,
>>>>>>                 CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd,
>>>>>>                 SndLimTimeSender,
>>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>                 mismatch, bad_cable,
>>>>>>                 (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>                 Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch,
>>>>>>                 bad_cable, totalcnt, debug);
>>>>>>     }
>>>>>>
>>>>>>     /* printf("Saved data to log file\n"); */
>>>>>>
>>>>>>     /* exit(0); */
>>>>>> }
>>>>>>
>>>>>> main(argc, argv)
>>>>>>
>>>>>> ----------------------------------------------------------
>>>>>> Modified code
>>>>>>     if (admin_view == 1) {
>>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>                 PktsOut, DupAcksIn, AckPktsIn,
>>>>>>                 CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd,
>>>>>>                 SndLimTimeSender,
>>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>                 mismatch, bad_cable,
>>>>>>                 (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>                 Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch,
>>>>>>                 bad_cable, totalcnt, debug);
>>>>>>     }
>>>>>>     shutdown(ctlsockfd, SHUT_RDWR);
>>>>>>     /* printf("Saved data to log file\n"); */
>>>>>>
>>>>>>     /* exit(0); */
>>>>>> }
>>>>>>
>>>>>> main(argc, argv)
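
One way an explicit shutdown() can matter where close() alone does not: if another process still holds a copy of the descriptor, close() in one process leaves the connection open, while shutdown() ends it for every holder. Whether that is what happened inside web100srv is an assumption; the thread does not say. The sketch below uses Python's socket module on a Unix-like system, with a local socketpair() standing in for the control connection and dup() standing in for a descriptor inherited by a forked child. The helper name peer_sees_eof is made up for the example.

```python
import socket


def peer_sees_eof(sock):
    """Return True if a non-blocking read on sock reports end-of-stream."""
    sock.setblocking(False)
    try:
        return sock.recv(1) == b""
    except BlockingIOError:
        return False  # connection is still open, there is just no data yet


server_end, client_end = socket.socketpair()
inherited_copy = server_end.dup()  # stands in for a copy held by another process

# close() only drops this one reference; the duplicate keeps the socket open.
server_end.close()
after_close = peer_sees_eof(client_end)

# shutdown() acts on the connection itself, whoever else holds a descriptor.
inherited_copy.shutdown(socket.SHUT_RDWR)
after_shutdown = peer_sees_eof(client_end)

print("peer sees EOF after close():   ", after_close)
print("peer sees EOF after shutdown():", after_shutdown)

inherited_copy.close()
client_end.close()
```

On Linux the peer should see end-of-stream only after the shutdown() call, which is the behaviour the one-line patch above relies on.
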
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> At 01:54 PM 10/12/2005, Clayton Keller wrote:
>>>>>> >I wanted to address this to the list. I believe there was a similar
>>>>>> >post a week or so back but I wanted to address this clean.
>>>>>> >
>>>>>> >I currently have web100srv running from /etc/init.d/ndt with the
>>>>>> >following:
>>>>>> >
>>>>>> >/usr/local/sbin/web100srv -a -m -l /var/log/web100/web100srv.log
>>>>>> >
>>>>>> >The system is running on Fedora Core 4 using a patched 2.6.13 kernel
>>>>>> >from kernel.org.
>>>>>> >
>>>>>> >The server itself is also sitting behind a PIX firewall.
>>>>>> >
>>>>>> >We have noticed that the Outbound Test will run rather quickly, but
>>>>>> >when the Inbound, server to client, test is run it can take upwards of
>>>>>> >several minutes to complete, many times as much as 4 minutes. There are
>>>>>> >other times where from the end user's point-of-view it appears the test
>>>>>> >never completes although you can see results for the test appear in the
>>>>>> >web100.log file. The test though will continue to sit on the
>>>>>> >"running 10s inbound test (server to client) . . . . . ." portion of
>>>>>> >the test, and many users are beginning to just close out the window.
>>>>>> >
>>>>>> >At this point I am looking for general issues that I can look into and
>>>>>> >possibly run debug against as far as these tests are concerned.
>>>>>> >
>>>>>> >Clayton Keller
>>>>>>
>>>>>> ------------------------------------
>>>>
>>>>
>>>>
>>>> Richard
>>>>
>>>> Did you want me to grab any more traces on newer versions of the
>>>> 2.6.13.x kernel or more on the current kernel it is running? Or
>>>> should I revert back to my 2.6.12.5 kernel and see how performance
>>>> improves?
>>>>
>>>> I saw from an earlier post to a different thread that it appears
>>>> you are seeing some items in the traces that are alluding to issues
>>>> pertaining to the 2.6.13.x kernel.
>>>>
>>>> Clay
>>>
>>>
>>> ------------------------------------
>>>
>>> Richard A. Carlson e-mail:
>>>
>>> Network Engineer phone: (734) 352-7043
>>> Internet2 fax: (734) 913-4255
>>> 1000 Oakbrook Dr; Suite 300
>>> Ann Arbor, MI 48104
>>
>>
>>
>>
>> TCP/Web100 Network Diagnostic Tool v5.3.3e
>> click START to begin
>> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>> running 10s outbound test (client to server) . . . . . 894.71Kb/s
>> running 10s inbound test (server to client) . . . . . . 3.86Mb/s
>> Your PC is connected to a Cable/DSL modem
>> Information: Other network traffic is congesting the link
>>
>>
>> WEB100 Kernel Variables:
>> Client: localhost/127.0.0.1
>> AckPktsIn: 3330
>> AckPktsOut: 0
>> BytesRetrans: 81420
>> CongAvoid: 2639
>> CongestionOverCount: 0
>> CongestionSignals: 27
>> CountRTT: 2802
>> CurCwnd: 22080
>> CurMSS: 1380
>> CurRTO: 248
>> CurRwinRcvd: 258060
>> CurRwinSent: 5888
>> CurSsthresh: 16560
>> DSACKDups: 0
>> DataBytesIn: 0
>> DataBytesOut: 8879328
>> DataPktsIn: 0
>> DataPktsOut: 6192
>> DupAcksIn: 481
>> ECNEnabled: 0
>> FastRetran: 27
>> MaxCwnd: 63480
>> MaxMSS: 1380
>> MaxRTO: 295
>> MaxRTT: 111
>> MaxRwinRcvd: 258060
>> MaxRwinSent: 5888
>> MaxSsthresh: 41400
>> MinMSS: 1380
>> MinRTO: 229
>> MinRTT: 20
>> MinRwinRcvd: 238740
>> MinRwinSent: 5888
>> NagleEnabled: 1
>> OtherReductions: 0
>> PktsIn: 3330
>> PktsOut: 6192
>> PktsRetrans: 59
>> X_Rcvbuf: 16777216
>> RcvWinScale: 8
>> SACKEnabled: 3
>> SACKsRcvd: 510
>> SendStall: 0
>> SlowStart: 152
>> SampleRTT: 42
>> SmoothedRTT: 48
>> X_Sndbuf: 16777216
>> SndWinScale: 2
>> SndLimTimeRwin: 0
>> SndLimTimeCwnd: 18404625
>> SndLimTimeSender: 8258
>> SndLimTransRwin: 0
>> SndLimTransCwnd: 1
>> SndLimTransSender: 1
>> SndLimBytesRwin: 0
>> SndLimBytesCwnd: 8879328
>> SndLimBytesSender: 0
>> SubsequentTimeouts: 0
>> SumRTT: 127937
>> Timeouts: 0
>> TimestampsEnabled: 0
>> WinScaleRcvd: 2
>> WinScaleSent: 8
>> DupAcksOut: 0
>> StartTimeUsec: 118172
>> Duration: 18416093
>> c2sData: 2
>> c2sAck: 2
>> s2cData: 9
>> s2cAck: 3
>> half_duplex: 0
>> link: 100
>> congestion: 1
>> bad_cable: 0
>> mismatch: 0
>> spd: 0.00
>> bw: 3.49
>> loss: 0.004360465
>> avgrtt: 45.66
>> waitsec: 0.00
>> timesec: 18.00
>> order: 0.1444
>> rwintime: 0.0000
>> sendtime: 0.0004
>> cwndtime: 0.9996
>> rwin: 1.9688
>> swin: 128.0000
>> cwin: 0.4843
>> rttsec: 0.045659
>> Sndbuf: 16777216
>> aspd: 8.63416
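
The derived figures at the bottom of this listing (loss, avgrtt, order, the three time fractions, the window sizes and bw) can be reproduced from the raw counters. The formulas in the sketch below are inferred by matching the numbers in this message, not taken from the NDT source, so treat them as an assumption. In particular, bw appears to follow the familiar MSS / (RTT * sqrt(loss)) estimate, and the window sizes appear to be reported in megabits with 1 Mb = 2^20 bits. The function name derive is made up for the example.

```python
import math

# Raw counters copied from the web100 output above.
counters = {
    "SumRTT": 127937, "CountRTT": 2802,
    "CongestionSignals": 27, "PktsOut": 6192,
    "DupAcksIn": 481, "AckPktsIn": 3330,
    "CurMSS": 1380,
    "SndLimTimeRwin": 0, "SndLimTimeCwnd": 18404625, "SndLimTimeSender": 8258,
    "MaxRwinRcvd": 258060, "MaxCwnd": 63480, "Sndbuf": 16777216,
}

BYTES_TO_MBITS = 8 / 1024 / 1024  # assumed unit: 1 Mb = 2**20 bits


def derive(c):
    """Recompute the summary values from the raw counters (inferred formulas)."""
    limited_total = (c["SndLimTimeRwin"] + c["SndLimTimeCwnd"]
                     + c["SndLimTimeSender"])
    avgrtt_ms = c["SumRTT"] / c["CountRTT"]
    loss = c["CongestionSignals"] / c["PktsOut"]
    return {
        "avgrtt": avgrtt_ms,
        "loss": loss,
        "order": c["DupAcksIn"] / c["AckPktsIn"],
        "rwintime": c["SndLimTimeRwin"] / limited_total,
        "sendtime": c["SndLimTimeSender"] / limited_total,
        "cwndtime": c["SndLimTimeCwnd"] / limited_total,
        "rwin": c["MaxRwinRcvd"] * BYTES_TO_MBITS,
        "swin": c["Sndbuf"] * BYTES_TO_MBITS,
        "cwin": c["MaxCwnd"] * BYTES_TO_MBITS,
        # Loss-based throughput estimate in Mb/s.
        "bw": c["CurMSS"] / (avgrtt_ms / 1000 * math.sqrt(loss)) * BYTES_TO_MBITS,
    }


for name, value in derive(counters).items():
    print(f"{name}: {value:.4f}")
```

Read this way, cwndtime says the sender was limited by the congestion window for nearly all of the roughly 18 seconds recorded, and the loss-based estimate of about 3.5 Mb/s is close to the 3.86 Mb/s the client measured.
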
>
>
> ------------------------------------
>
>
>
> Richard A. Carlson e-mail:
>
> Network Engineer phone: (734) 352-7043
> Internet2 fax: (734) 913-4255
> 1000 Oakbrook Dr; Suite 300
> Ann Arbor, MI 48104
>


