ndt-users - Re: Slow Inbound Tests
- From: Richard Carlson <>
- To: Clayton Keller <>
- Cc:
- Subject: Re: Slow Inbound Tests
- Date: Wed, 19 Oct 2005 15:21:14 -0400
Hi Clay;
At 02:33 PM 10/19/2005, Clayton Keller wrote:
Rich,
I've confused myself a little between the two threads, maybe. The
additions in the INSTALL file: should I apply all of these tunings, or
leave what I had in my sysctl.conf file but with the 4M change, and then
also include the other recommendations from #9?
Sorry for the confusion. No, don't apply these settings now. Use the 4M buffer size or, as I said in my last email, reduce it even more. The issue is that your CPU can write data to the network faster than the network can deliver it to the remote client. This leads to a situation where the server streams data for 10 seconds, but it takes 60 seconds to drain the queue. Setting the buffer to a smaller value reduces the queue size and therefore improves the response time. This is certainly a stop-gap measure until I can figure out a better sending strategy.
Also, 3.1.4b is including the patch you sent to me that I added in myself
for the shutdown issues you saw, correct?
Yes, 3.1.4b includes the shutdown() patch I sent to the list.
Rich
Clay
Richard Carlson wrote:
> Hi Clay;
>
> At 09:43 AM 10/19/2005, Clayton Keller wrote:
>
>> Rich,
>>
>> The server currently is not doing much else. Load averages on the server
>> sit pretty much at 0.00. It is a Pentium-4 3.40GHz with 2GB of RAM.
>> There is not anything else running on it that is causing any heavy loads
>> or additional traffic at this time.
>>
>> Currently, I have the following lines added to the /etc/sysctl.conf
>> file, which I acquired from the README:
>>
>> # Recommended sysctl settings from web100 README
>> net.core.wmem_max = 8388608
>> net.core.rmem_max = 8388608
>> net.ipv4.tcp_wmem = 4096 65536 8388608
>> net.ipv4.tcp_rmem = 4096 87380 8388608
>> net.ipv4.tcp_default_win_scale = 7
>> net.ipv4.tcp_moderate_rcvbuf = 1
>
>
> OK, the changes I suggest are minor. Just change the tcp_wmem and
> tcp_rmem max value to 4M (4194304) from the current 8M value. You can
> also change the tcp_default_win_scale value to 6.
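With those two changes applied, the sysctl.conf block quoted above would read (reload with "/sbin/sysctl -p" afterward; the core max values stay at 8M since only the tcp_wmem/tcp_rmem caps and win_scale were mentioned):

```
# Recommended sysctl settings from web100 README,
# adjusted per Rich's suggestion (4M cap, win_scale 6)
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_wmem = 4096 65536 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_default_win_scale = 6
net.ipv4.tcp_moderate_rcvbuf = 1
```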
>
> Let me know what happens.
>
> Rich
>
>> I can go ahead and make the adjustments that you recommended, but didn't
>> know if I should be making any further changes as well.
>>
>> I will run some further tests with the new settings and also with the
>> "-m" flag removed. However, I wanted to run the sysctl.conf settings
>> that we currently have by you first, and see if I should look at further
>> changes there.
>>
>> Clay
>>
>> Richard Carlson wrote:
>> > Hi Clay
>> >
>> > OK, I looked at the traces and the web100 stats and there are a couple
>> > of things that stand out.
>> >
>> > 1) your server is set to use 16 MB buffers.
>> > 2) this inbound test ran for 18 seconds (Duration and SndLimTimeCwnd)
>> > 3) the trace (.2790) shows that data stops flowing, but the connection
>> > isn't closing gracefully (no TCP FIN packets being exchanged). [This
>> > might be another bug in my server code]
>> >
>> > It's not clear to me why the test is running so long. What else is
>> > running on this server? Is it very busy? What does "/usr/bin/top"
>> > report? Finally, what messages appear in the client's Java console
>> > window? The client will report how long it spent reading data from the
>> > network.
>> >
>> > Things to try:
>> > * One thing would be to reduce the maximum sender buffer size. Try
>> > making the max 4 MB instead of 16. Edit the /etc/sysctl.conf file and
>> > change the following lines.
>> > # increase Linux autotuning TCP buffer limits
>> > net.ipv4.tcp_rmem = 4096 87380 16777216
>> > net.ipv4.tcp_wmem = 4096 87380 16777216
>> > to
>> > # increase Linux autotuning TCP buffer limits
>> > net.ipv4.tcp_rmem = 4096 87380 4194304
>> > net.ipv4.tcp_wmem = 4096 87380 4194304
>> >
>> > and then run the "/sbin/sysctl -p" command.
>> >
>> > One possible problem is that the server is faster than the network so
>> > data is being placed in the send queue. The connection wouldn't
>> > shut down until the queue is empty. So even if the NDT process stops
>> > sending after 10 seconds, it could take some time to drain the queue.
>> > With a 4 MB queue it would take less time to drain.
>> >
>> > That said, it isn't clear why the client is hanging for so long. I
>> > guess it's also possible that my shutdown patch isn't working properly
>> > in the multi-client mode. Can you try running the web100srv process
>> > without the -m flag? This will cause the server to handle clients in a
>> > FIFO manner. If the server is busy, incoming clients will receive a
>> > message saying the server is busy and a test will begin in xx seconds.
>> > The client is updated every time another client's test finishes. I know
>> > the shutdown() patch fixed a hang there, so if possible give it a try
>> > and let me know what happens.
>> >
>> > That's all I can think of right now, I'll think about it some more
>> > tonight and run some tests tomorrow.
>> >
>> > Rich
>> >
>> > At 09:08 AM 10/18/2005, Clayton Keller wrote:
>> >
>> >> Rich,
>> >>
>> >> We are still seeing issues with the Inbound tests even after reverting
>> >> to the 2.6.12.5 kernel. This is not the Fedora source kernel that
>> >> Martyn is using, but the stock kernel.org download.
>> >>
>> >> I would like to go ahead and submit another trace for you. Is there a
>> >> possibility that the issues we are seeing are network/bandwidth issues
>> >> on our part?
>> >>
>> >> From my connection, which is on a different network, the Outbound test
>> >> took approx. 10 seconds while the Inbound test took well over one
>> >> minute. The info you are receiving is from a connection on that same
>> >> network. The Inbound test took about one minute before it reported its
>> >> results back to the user.
>> >>
>> >> I apologize, but I am not quite sure what info is found in the
>> >> trace, so I guess that is why I am asking you if there are external
>> >> issues on our end that may be part of the cause.
>> >>
>> >> Also, I could look at using one of the Fedora kernels and patch it
>> >> like Martyn did.
>> >>
>> >> Clay
>> >>
>> >>
>> >>
>> >> Richard Carlson wrote:
>> >>
>> >>> Hi Clay;
>> >>> The trace you sent does show a problem. At this point I don't see a
>> >>> need for more, but it would be useful to see what the 2.6.12 kernel
>> >>> does. So I'd suggest you revert back to the 2.6.12 kernel and I'll
>> >>> try and figure out how to get the kernel problem resolved.
>> >>> Rich
>> >>> At 09:21 AM 10/17/2005, Clayton Keller wrote:
>> >>>
>> >>>> Richard Carlson wrote:
>> >>>>
>> >>>>> Hi Craig;
>> >>>>> No, this NDT bug affects all servers. I ran into it while testing
>> >>>>> from multiple clients. Clients 2, 3, & 4 would get the "Other
>> >>>>> client testing please wait..." type message. Client 2 would not
>> >>>>> get the final results until client 4 finished. I'll add this patch
>> >>>>> to my next distribution, or you can apply it now if you are
>> >>>>> experiencing some problems.
>> >>>>> Since this didn't fix Clay's problem, I may need to rethink how the
>> >>>>> tests are done. Right now the server simply streams data out for
>> >>>>> 10 seconds, sending as much as it can. Given the way TCP works,
>> >>>>> there is a probability that the server will build up a queue in the
>> >>>>> Send buffer (the bus is faster than the wire). This buffer will
>> >>>>> need to drain before the test is complete. Packet loss, or other
>> >>>>> factors could mean that this draining takes a long time so the
>> >>>>> client simply sits there waiting. If it takes too long, the server
>> >>>>> process will time-out and terminate so the client will never get
>> >>>>> the final results.
>> >>>>> More later.
>> >>>>> Rich
>> >>>>> At 08:26 AM 10/14/2005, Pepmiller, Craig E. wrote:
>> >>>>>
>> >>>>>> Ok, so this is only seen when the NDT machine is configured for
>> >>>>>> multiple
>> >>>>>> simultaneous clients?
>> >>>>>>
>> >>>>>> Thanks-
>> >>>>>> -Craig
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: Richard Carlson [mailto:]
>> >>>>>> Sent: Wednesday, October 12, 2005 2:56 PM
>> >>>>>> To: Clayton Keller;
>> >>>>>> Subject: Re: Slow Inbound Tests
>> >>>>>>
>> >>>>>> Hi Clayton;
>> >>>>>>
>> >>>>>> This is a bug in the web100srv code. I forgot to shutdown the
>> >>>>>> control socket at the end of the test. If there are multiple
>> >>>>>> clients then the final results are sent in a LIFO manner, so the
>> >>>>>> first client needs to wait until all subsequent clients are done
>> >>>>>> before the results are returned.
>> >>>>>>
>> >>>>>> I'll issue a patched version soon. In the meantime you can patch
>> >>>>>> your version by hand by adding the line
>> >>>>>> "shutdown(ctlsockfd, SHUT_RDWR);" to the web100srv.c file (on
>> >>>>>> line 1126).
>> >>>>>>
>> >>>>>> Let me know if that fixes things.
>> >>>>>>
>> >>>>>> Rich
>> >>>>>>
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------
>> >>>>>> Original code:
>> >>>>>>     if (admin_view == 1) {
>> >>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>> >>>>>>                 PktsOut, DupAcksIn, AckPktsIn,
>> >>>>>>                 CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd,
>> >>>>>>                 SndLimTimeSender,
>> >>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>> >>>>>>                 mismatch, bad_cable,
>> >>>>>>                 (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>> >>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>> >>>>>>                 Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch,
>> >>>>>>                 bad_cable, totalcnt, debug);
>> >>>>>>     }
>> >>>>>>
>> >>>>>>     /* printf("Saved data to log file\n"); */
>> >>>>>>
>> >>>>>>     /* exit(0); */
>> >>>>>> }
>> >>>>>>
>> >>>>>> main(argc, argv)
>> >>>>>> ----------------------------------------------------------
>> >>>>>> Modified code:
>> >>>>>>     if (admin_view == 1) {
>> >>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>> >>>>>>                 PktsOut, DupAcksIn, AckPktsIn,
>> >>>>>>                 CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd,
>> >>>>>>                 SndLimTimeSender,
>> >>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>> >>>>>>                 mismatch, bad_cable,
>> >>>>>>                 (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>> >>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>> >>>>>>                 Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch,
>> >>>>>>                 bad_cable, totalcnt, debug);
>> >>>>>>     }
>> >>>>>>     shutdown(ctlsockfd, SHUT_RDWR);
>> >>>>>>
>> >>>>>>     /* printf("Saved data to log file\n"); */
>> >>>>>>
>> >>>>>>     /* exit(0); */
>> >>>>>> }
>> >>>>>>
>> >>>>>> main(argc, argv)
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> At 01:54 PM 10/12/2005, Clayton Keller wrote:
>> >>>>>> >I wanted to address this to the list. I believe there was a
>> >>>>>> >similar post a week or so back, but I wanted to address this
>> >>>>>> >cleanly.
>> >>>>>> >
>> >>>>>> >I currently have web100srv running from /etc/init.d/ndt with the
>> >>>>>> >following:
>> >>>>>> >
>> >>>>>> >/usr/local/sbin/web100srv -a -m -l /var/log/web100/web100srv.log
>> >>>>>> >
>> >>>>>> >The system is running on Fedora Core 4 using a patched 2.6.13
>> >>>>>> >kernel from kernel.org.
>> >>>>>> >
>> >>>>>> >The server itself is also sitting behind a PIX firewall.
>> >>>>>> >
>> >>>>>> >We have noticed that the Outbound Test will run rather quickly,
>> >>>>>> >but when the Inbound, server to client, test is run it can take
>> >>>>>> >upwards of several minutes to complete, many times as much as 4
>> >>>>>> >minutes. There are other times where, from the end user's point of
>> >>>>>> >view, it appears the test never completes although you can see
>> >>>>>> >results for the test appear in the web100.log file. The test will
>> >>>>>> >continue to sit on the "running 10s inbound test (server to
>> >>>>>> >client) . . . . . ." portion of the test, and many users are
>> >>>>>> >beginning to just close out the window.
>> >>>>>> >
>> >>>>>> >At this point I am looking for general issues that I can look
>> >>>>>> >into and possibly run debug against as far as these tests are
>> >>>>>> >concerned.
>> >>>>>> >
>> >>>>>> >Clayton Keller
>> >>>>>>
>> >>>>>> ------------------------------------
>> >>>>
>> >>>>
>> >>>>
>> >>>> Richard
>> >>>>
>> >>>> Did you want me to grab any more traces on newer versions of the
>> >>>> 2.6.13.x kernel or more on the current kernel it is running? Or
>> >>>> should I revert back to my 2.6.12.5 kernel and see how performance
>> >>>> improves?
>> >>>>
>> >>>> I saw from an earlier post to a different thread that it appears
>> >>>> you are seeing some items in the traces that point to issues
>> >>>> pertaining to the 2.6.13.x kernel.
>> >>>>
>> >>>> Clay
>> >>>
>> >>>
>> >>> ------------------------------------
>> >>>
>> >>> Richard A. Carlson        e-mail:
>> >>> Network Engineer          phone: (734) 352-7043
>> >>> Internet2                 fax:   (734) 913-4255
>> >>> 1000 Oakbrook Dr; Suite 300
>> >>> Ann Arbor, MI 48104
>> >>
>> >>
>> >>
>> >>
>> >> TCP/Web100 Network Diagnostic Tool v5.3.3e
>> >> click START to begin
>> >> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>> >> running 10s outbound test (client to server) . . . . . 894.71Kb/s
>> >> running 10s inbound test (server to client) . . . . . . 3.86Mb/s
>> >> Your PC is connected to a Cable/DSL modem
>> >> Information: Other network traffic is congesting the link
>> >>
>> >>
>> >> WEB100 Kernel Variables:
>> >> Client: localhost/127.0.0.1
>> >> AckPktsIn: 3330
>> >> AckPktsOut: 0
>> >> BytesRetrans: 81420
>> >> CongAvoid: 2639
>> >> CongestionOverCount: 0
>> >> CongestionSignals: 27
>> >> CountRTT: 2802
>> >> CurCwnd: 22080
>> >> CurMSS: 1380
>> >> CurRTO: 248
>> >> CurRwinRcvd: 258060
>> >> CurRwinSent: 5888
>> >> CurSsthresh: 16560
>> >> DSACKDups: 0
>> >> DataBytesIn: 0
>> >> DataBytesOut: 8879328
>> >> DataPktsIn: 0
>> >> DataPktsOut: 6192
>> >> DupAcksIn: 481
>> >> ECNEnabled: 0
>> >> FastRetran: 27
>> >> MaxCwnd: 63480
>> >> MaxMSS: 1380
>> >> MaxRTO: 295
>> >> MaxRTT: 111
>> >> MaxRwinRcvd: 258060
>> >> MaxRwinSent: 5888
>> >> MaxSsthresh: 41400
>> >> MinMSS: 1380
>> >> MinRTO: 229
>> >> MinRTT: 20
>> >> MinRwinRcvd: 238740
>> >> MinRwinSent: 5888
>> >> NagleEnabled: 1
>> >> OtherReductions: 0
>> >> PktsIn: 3330
>> >> PktsOut: 6192
>> >> PktsRetrans: 59
>> >> X_Rcvbuf: 16777216
>> >> RcvWinScale: 8
>> >> SACKEnabled: 3
>> >> SACKsRcvd: 510
>> >> SendStall: 0
>> >> SlowStart: 152
>> >> SampleRTT: 42
>> >> SmoothedRTT: 48
>> >> X_Sndbuf: 16777216
>> >> SndWinScale: 2
>> >> SndLimTimeRwin: 0
>> >> SndLimTimeCwnd: 18404625
>> >> SndLimTimeSender: 8258
>> >> SndLimTransRwin: 0
>> >> SndLimTransCwnd: 1
>> >> SndLimTransSender: 1
>> >> SndLimBytesRwin: 0
>> >> SndLimBytesCwnd: 8879328
>> >> SndLimBytesSender: 0
>> >> SubsequentTimeouts: 0
>> >> SumRTT: 127937
>> >> Timeouts: 0
>> >> TimestampsEnabled: 0
>> >> WinScaleRcvd: 2
>> >> WinScaleSent: 8
>> >> DupAcksOut: 0
>> >> StartTimeUsec: 118172
>> >> Duration: 18416093
>> >> c2sData: 2
>> >> c2sAck: 2
>> >> s2cData: 9
>> >> s2cAck: 3
>> >> half_duplex: 0
>> >> link: 100
>> >> congestion: 1
>> >> bad_cable: 0
>> >> mismatch: 0
>> >> spd: 0.00
>> >> bw: 3.49
>> >> loss: 0.004360465
>> >> avgrtt: 45.66
>> >> waitsec: 0.00
>> >> timesec: 18.00
>> >> order: 0.1444
>> >> rwintime: 0.0000
>> >> sendtime: 0.0004
>> >> cwndtime: 0.9996
>> >> rwin: 1.9688
>> >> swin: 128.0000
>> >> cwin: 0.4843
>> >> rttsec: 0.045659
>> >> Sndbuf: 16777216
>> >> aspd: 8.63416
>> >
>> >
>
>