ndt-users - Re: Slow Inbound Tests

  • From: Clayton Keller <>
  • To:
  • Subject: Re: Slow Inbound Tests
  • Date: Wed, 19 Oct 2005 14:31:06 -0500

With the following settings, the tests are finishing up pretty close to
the estimated 20 seconds:

net.core.wmem_max = 2097152
net.core.rmem_max = 2097152
net.ipv4.tcp_wmem = 4096 87380 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_ecn = 1
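
As a rough check on why the smaller maximum helps, here is a minimal sketch of the worst-case time needed to drain a full send buffer. It assumes the buffer is completely full when the server stops writing, and that the path keeps delivering at the 3.86 Mb/s reported for the inbound test earlier in this thread. The name drain_seconds is illustrative only and does not come from NDT.

```python
def drain_seconds(buffer_bytes: int, rate_bits_per_sec: float) -> float:
    """Worst-case seconds to empty a full send buffer at a steady delivery rate."""
    if rate_bits_per_sec <= 0:
        raise ValueError("rate must be positive")
    return buffer_bytes * 8 / rate_bits_per_sec


if __name__ == "__main__":
    # Assumed delivery rate: the 3.86 Mb/s inbound figure quoted in this thread.
    rate = 3.86e6
    # Buffer maximums that appear in this thread: 16 MB, 8 MB, 4 MB, 2 MB.
    for size in (16777216, 8388608, 4194304, 2097152):
        print(f"{size:>9} bytes -> {drain_seconds(size, rate):6.2f} s")
```

Under those assumptions a 16 MB buffer needs about 35 seconds to drain, while a 2 MB buffer needs between 4 and 5 seconds. Real queues are rarely completely full, so these are upper bounds rather than predictions.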

I'll be doing some testing tonight from other connections, and will
have some others run follow-up tests for me again.

Things are looking pretty promising though at this point.

Clay

Richard Carlson wrote:
> Hi Clay;
>
> At 02:33 PM 10/19/2005, Clayton Keller wrote:
>
>> Rich,
>>
>> I've confused myself a little between the two threads maybe. The
>> additions in the INSTALL file, should I apply all of these tunings, or
>> leave what I had in my sysctl.conf file but with the 4M change, and then
>> also include the other recommendations from #9?
>
>
> Sorry for the confusion. No, don't apply these settings now. Use the
> 4M buffer size, or as I said in my last email reduce it more. The issue
> is that your CPU can write data to the network faster than the network
> can deliver it to the remote client. This leads to a situation where
> the server streams data for 10 seconds, but it takes 60 seconds to drain
> the queue. Setting the buffer to a smaller value reduces the queue size
> and therefore improves the response time. This is certainly a stop-gap
> measure until I can figure out a better sending strategy.
>
>> Also, 3.1.4b includes the patch you sent to me, which I added in myself
>> for the shutdown issues you saw, correct?
>
>
> Yes, 3.1.4b includes the shutdown() patch I sent to the list.
>
> Rich
>
>> Clay
>>
>> Richard Carlson wrote:
>> > Hi Clay;
>> >
>> > At 09:43 AM 10/19/2005, Clayton Keller wrote:
>> >
>> >> Rich,
>> >>
>> >> The server currently is not doing much else. Load averages on the
>> >> server sit pretty much at 0.00. It is a Pentium-4 3.40GHz with 2GB
>> >> of RAM. There is not anything else running on it that is causing any
>> >> heavy loads or additional traffic at this time.
>> >>
>> >> Currently, I have the following lines added to the /etc/sysctl.conf
>> >> file, which I acquired from the README:
>> >>
>> >> # Recommended sysctl settings from web100 README
>> >> net.core.wmem_max = 8388608
>> >> net.core.rmem_max = 8388608
>> >> net.ipv4.tcp_wmem = 4096 65536 8388608
>> >> net.ipv4.tcp_rmem = 4096 87380 8388608
>> >> net.ipv4.tcp_default_win_scale = 7
>> >> net.ipv4.tcp_moderate_rcvbuf = 1
>> >
>> >
>> > OK, the changes I suggest are minor. Just change the tcp_wmem and
>> > tcp_rmem max value to 4M (4194304) from the current 8M value. You can
>> > also change the tcp_default_win_scale value to 6.
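
The pairing of a 4M buffer with a window scale of 6 follows from how TCP window scaling works: the 16-bit window field is shifted left by the scale factor. The sketch below shows that arithmetic only. It assumes tcp_default_win_scale sets that shift count, and max_window_bytes is an illustrative name.

```python
def max_window_bytes(scale: int) -> int:
    """Largest window that can be advertised with a given TCP window scale shift."""
    if not 0 <= scale <= 14:
        raise ValueError("TCP window scale must be between 0 and 14")
    # The window field is 16 bits wide, so its largest value is 65535.
    return 65535 << scale


if __name__ == "__main__":
    for scale in (6, 7):
        print(scale, max_window_bytes(scale))
```

A scale of 7 covers windows just under 8 MB (8388480 bytes) and a scale of 6 covers windows just under 4 MB (4194240 bytes), which lines up with the 8M and 4M buffer maximums discussed here.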
>> >
>> > Let me know what happens.
>> >
>> > Rich
>> >
>> >> I can go ahead and make the adjustments that you recommended, but
>> >> didn't know if I should be making any further changes as well.
>> >>
>> >> I will run some further tests with the new settings and also with the
>> >> "-m" flag removed. However, I wanted to run the sysctl.conf settings
>> >> that we currently have by you first, and see if I should look at
>> >> further changes there.
>> >>
>> >> Clay
>> >>
>> >> Richard Carlson wrote:
>> >> > Hi Clay
>> >> >
>> >> > OK, I looked at the traces and the web100 stats and there are a
>> >> > couple of things that stand out.
>> >> >
>> >> > 1) your server is set to use 16 MB buffers.
>> >> > 2) this inbound test ran for 18 seconds (Duration and SndLimTimeCwnd)
>> >> > 3) the trace (.2790) shows that data stops flowing, but the
>> >> > connection isn't closing gracefully (no TCP FIN packets being
>> >> > exchanged). [This might be another bug in my server code]
>> >> >
>> >> > It's not clear to me why the test is running so long. What else is
>> >> > running on this server? Is it very busy? What does "/usr/bin/top"
>> >> > report? Finally, what messages appear in the client's Java console
>> >> > window? The client will report how long it spent reading data from
>> >> > the network.
>> >> >
>> >> > Things to try:
>> >> > * One thing would be to reduce the maximum sender buffer size. Try
>> >> > making the max 4 MB instead of 16. Edit the /etc/sysctl.conf file
>> >> > and change the following lines
>> >> >
>> >> > # increase Linux autotuning TCP buffer limits
>> >> > net.ipv4.tcp_rmem = 4096 87380 16777216
>> >> > net.ipv4.tcp_wmem = 4096 87380 16777216
>> >> >
>> >> > to
>> >> >
>> >> > # increase Linux autotuning TCP buffer limits
>> >> > net.ipv4.tcp_rmem = 4096 87380 4194304
>> >> > net.ipv4.tcp_wmem = 4096 87380 4194304
>> >> >
>> >> > and then run the "/sbin/sysctl -p" command.
>> >> >
>> >> > One possible problem is that the server is faster than the network
>> >> > so data is being placed in the send queue. The connection wouldn't
>> >> > shut-down until the queue is empty. So even if the NDT process
>> >> > stops sending after 10 seconds, it could take some time to drain
>> >> > the queue. With a 4 MB queue it would take less time to drain.
>> >> >
>> >> > That said, it isn't clear why the client is hanging for so long. I
>> >> > guess it's also possible that my shutdown patch isn't working
>> >> > properly in the multi-client mode. Can you try running the web100srv
>> >> > process without the -m flag? This will cause the server to handle
>> >> > clients in a FIFO manner. If the server is busy the incoming clients
>> >> > will receive a message saying the server is busy and a test will
>> >> > begin in xx seconds. The client is updated every time another
>> >> > client's test finishes. I know the shutdown() patch fixed a hang
>> >> > there; if possible give it a try and let me know what happens.
>> >> >
>> >> > That's all I can think of right now, I'll think about it some more
>> >> > tonight and run some tests tomorrow.
>> >> >
>> >> > Rich
>> >> >
>> >> > At 09:08 AM 10/18/2005, Clayton Keller wrote:
>> >> >
>> >> >> Rich,
>> >> >>
>> >> >> We are still seeing issues with the Inbound tests even after
>> >> >> reverting to the 2.6.12.5 kernel. This is not the Fedora Source
>> >> >> kernel that Martin is using, but the stock kernel.org download.
>> >> >>
>> >> >> I would like to go ahead and submit another trace for you. Is
>> >> >> there a possibility that the issues we are seeing are
>> >> >> network/bandwidth issues on our part?
>> >> >>
>> >> >> From my connection, which is on a different network, the Outbound
>> >> >> test took approx. 10 seconds while the Inbound test took well over
>> >> >> one minute. The info you are receiving is from a connection on that
>> >> >> same network. The Inbound test took about one minute before it
>> >> >> reported its results back to the user.
>> >> >>
>> >> >> I apologize, but I am not quite sure what all info is found in the
>> >> >> trace, so I guess that is why I am asking you if there are external
>> >> >> issues on our end that may be part of the cause.
>> >> >>
>> >> >> Also, I could look at using one of the Fedora kernels and patch
>> >> >> it as Martyn did.
>> >> >>
>> >> >> Clay
>> >> >>
>> >> >>
>> >> >>
>> >> >> Richard Carlson wrote:
>> >> >>
>> >> >>> Hi Clay;
>> >> >>> The trace you sent does show a problem. At this point I don't
>> >> >>> see a need for more, but it would be useful to see what the
>> >> >>> 2.6.12 kernel does. So I'd suggest you revert back to the 2.6.12
>> >> >>> kernel and I'll try and figure out how to get the kernel problem
>> >> >>> resolved.
>> >> >>> Rich
>> >> >>> At 09:21 AM 10/17/2005, Clayton Keller wrote:
>> >> >>>
>> >> >>>> Richard Carlson wrote:
>> >> >>>>
>> >> >>>>> Hi Craig;
>> >> >>>>> No, this NDT bug affects all servers. I ran into it while
>> >> >>>>> testing from multiple clients. Clients 2, 3, & 4 would get the
>> >> >>>>> "Other client testing please wait..." type message. Client 2
>> >> >>>>> would not get the final results until client 4 finished. I'll
>> >> >>>>> add this patch to my next distribution, or you can apply it now
>> >> >>>>> if you are experiencing some problems.
>> >> >>>>> Since this didn't fix Clay's problem, I may need to rethink how
>> >> >>>>> the tests are done. Right now the server simply streams data out
>> >> >>>>> for 10 seconds, sending as much as it can. Given the way TCP
>> >> >>>>> works, there is a probability that the server will build up a
>> >> >>>>> queue in the Send buffer (the bus is faster than the wire). This
>> >> >>>>> buffer will need to drain before the test is complete. Packet
>> >> >>>>> loss, or other factors could mean that this draining takes a
>> >> >>>>> long time so the client simply sits there waiting. If it takes
>> >> >>>>> too long, the server process will time-out and terminate so the
>> >> >>>>> client will never get the final results.
>> >> >>>>> More later.
>> >> >>>>> Rich
>> >> >>>>> At 08:26 AM 10/14/2005, Pepmiller, Craig E. wrote:
>> >> >>>>>
>> >> >>>>>> Ok, so this is only seen when the NDT machine is configured for
>> >> >>>>>> multiple simultaneous clients?
>> >> >>>>>>
>> >> >>>>>> Thanks-
>> >> >>>>>> -Craig
>> >> >>>>>>
>> >> >>>>>> -----Original Message-----
>> >> >>>>>> From: Richard Carlson
>> >> >>>>>> [mailto:]
>> >> >>>>>> Sent: Wednesday, October 12, 2005 2:56 PM
>> >> >>>>>> To: Clayton Keller;
>> >> >>>>>>
>> >> >>>>>> Subject: Re: Slow Inbound Tests
>> >> >>>>>>
>> >> >>>>>> Hi Clayton;
>> >> >>>>>>
>> >> >>>>>> This is a bug in the web100srv code. I forgot to shut down the
>> >> >>>>>> control socket at the end of the test. If there are multiple
>> >> >>>>>> clients then the final results are sent in a LIFO manner, so the
>> >> >>>>>> first client needs to wait until all subsequent clients are done
>> >> >>>>>> before the results are returned.
>> >> >>>>>>
>> >> >>>>>> I'll issue a patched version soon. In the meantime you can patch
>> >> >>>>>> your version by hand by adding the line
>> >> >>>>>> "shutdown(ctlsockfd, SHUT_RDWR);" to the web100srv.c file (on
>> >> >>>>>> line 1126).
>> >> >>>>>>
>> >> >>>>>> Let me know if that fixes things.
>> >> >>>>>>
>> >> >>>>>> Rich
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> ---------------------------------------------------------------
>> >> >>>>>> Original code:
>> >> >>>>>>     if (admin_view == 1) {
>> >> >>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>> >> >>>>>>                 PktsOut, DupAcksIn, AckPktsIn, CurrentMSS,
>> >> >>>>>>                 SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>> >> >>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>> >> >>>>>>                 mismatch, bad_cable, (int)bwout, (int)bwin,
>> >> >>>>>>                 c2sdata, s2cack, 1, debug);
>> >> >>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans,
>> >> >>>>>>                 Timeouts, Sndbuf, MaxRwinRcvd, CurrentCwnd,
>> >> >>>>>>                 mismatch, bad_cable, totalcnt, debug);
>> >> >>>>>>     }
>> >> >>>>>>
>> >> >>>>>>     /* printf("Saved data to log file\n"); */
>> >> >>>>>>
>> >> >>>>>>     /* exit(0); */
>> >> >>>>>> }
>> >> >>>>>>
>> >> >>>>>> main(argc, argv)
>> >> >>>>>>
>> >> >>>>>> ----------------------------------------------------------
>> >> >>>>>> Modified code:
>> >> >>>>>>     if (admin_view == 1) {
>> >> >>>>>>         totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>> >> >>>>>>                 PktsOut, DupAcksIn, AckPktsIn, CurrentMSS,
>> >> >>>>>>                 SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>> >> >>>>>>                 MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>> >> >>>>>>                 mismatch, bad_cable, (int)bwout, (int)bwin,
>> >> >>>>>>                 c2sdata, s2cack, 1, debug);
>> >> >>>>>>         gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans,
>> >> >>>>>>                 Timeouts, Sndbuf, MaxRwinRcvd, CurrentCwnd,
>> >> >>>>>>                 mismatch, bad_cable, totalcnt, debug);
>> >> >>>>>>     }
>> >> >>>>>>     shutdown(ctlsockfd, SHUT_RDWR);
>> >> >>>>>>     /* printf("Saved data to log file\n"); */
>> >> >>>>>>
>> >> >>>>>>     /* exit(0); */
>> >> >>>>>> }
>> >> >>>>>>
>> >> >>>>>> main(argc, argv)
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
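
The thread does not say why the control connection stayed open without the explicit shutdown() call. One common cause, assumed here purely for illustration, is that a second descriptor still refers to the same socket, so closing one copy does not signal end-of-stream to the peer. The sketch below is not web100srv code: it uses Python's socket module and a local socketpair, and eof_visible_to_peer is a hypothetical helper name.

```python
import socket


def eof_visible_to_peer(use_shutdown: bool) -> bool:
    """Return True if the peer sees end-of-stream after one holder closes."""
    server_end, client_end = socket.socketpair()
    # A duplicate descriptor stands in for a second holder of the same
    # socket (an assumption, for example a parent process after fork).
    second_holder = server_end.dup()
    client_end.settimeout(0.2)
    try:
        if use_shutdown:
            # shutdown() acts on the connection itself, not on one descriptor.
            server_end.shutdown(socket.SHUT_RDWR)
        server_end.close()
        try:
            return client_end.recv(1) == b""
        except (socket.timeout, BlockingIOError):
            # No data and no end-of-stream arrived: the peer is left waiting.
            return False
    finally:
        second_holder.close()
        client_end.close()


if __name__ == "__main__":
    print("close only:      ", eof_visible_to_peer(False))
    print("shutdown + close:", eof_visible_to_peer(True))
```

With close alone, the peer keeps waiting while the duplicate descriptor exists. With shutdown first, the peer sees end-of-stream immediately.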
>> >> >>>>>> At 01:54 PM 10/12/2005, Clayton Keller wrote:
>> >> >>>>>> >I wanted to address this to the list. I believe there was a
>> >> >>>>>> >similar post a week or so back but I wanted to address this
>> >> >>>>>> >clean.
>> >> >>>>>> >
>> >> >>>>>> >I currently have web100srv running from /etc/init.d/ndt with the
>> >> >>>>>> >following:
>> >> >>>>>> >
>> >> >>>>>> >/usr/local/sbin/web100srv -a -m -l /var/log/web100/web100srv.log
>> >> >>>>>> >
>> >> >>>>>> >The system is running on Fedora Core 4 using a patched 2.6.13
>> >> >>>>>> >kernel from kernel.org.
>> >> >>>>>> >
>> >> >>>>>> >The server itself is also sitting behind a PIX firewall.
>> >> >>>>>> >
>> >> >>>>>> >We have noticed that the Outbound Test will run rather quickly,
>> >> >>>>>> >but when the Inbound, server to client, test is run it can take
>> >> >>>>>> >upwards of several minutes to complete, many times as much as 4
>> >> >>>>>> >minutes. There are other times where from the end user's
>> >> >>>>>> >point-of-view it appears the test never completes although you
>> >> >>>>>> >can see results for the test appear in the web100.log file. The
>> >> >>>>>> >test though will continue to sit on the "running 10s inbound
>> >> >>>>>> >test (server to client) . . . . . ." portion of the test, and
>> >> >>>>>> >many users are beginning to just close out the window.
>> >> >>>>>> >
>> >> >>>>>> >At this point I am looking for general issues that I can look
>> >> >>>>>> >into and possibly run debug against as far as these tests are
>> >> >>>>>> >concerned.
>> >> >>>>>> >
>> >> >>>>>> >Clayton Keller
>> >> >>>>>>
>> >> >>>>>> ------------------------------------
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> Richard
>> >> >>>>
>> >> >>>> Did you want me to grab any more traces on newer versions of the
>> >> >>>> 2.6.13.x kernel or more on the current kernel it is running? Or
>> >> >>>> should I revert back to my 2.6.12.5 kernel and see how
>> >> >>>> performance improves?
>> >> >>>>
>> >> >>>> I saw from an earlier post to a different thread that it appears
>> >> >>>> you are seeing some items in the traces that are pointing to
>> >> >>>> issues pertaining to the 2.6.13.x kernel.
>> >> >>>>
>> >> >>>> Clay
>> >> >>>
>> >> >>>
>> >> >>> ------------------------------------
>> >> >>>
>> >> >>> Richard A. Carlson e-mail:
>> >> >>>
>> >> >>> Network Engineer phone: (734) 352-7043
>> >> >>> Internet2 fax: (734) 913-4255
>> >> >>> 1000 Oakbrook Dr; Suite 300
>> >> >>> Ann Arbor, MI 48104
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> TCP/Web100 Network Diagnostic Tool v5.3.3e
>> >> >> click START to begin
>> >> >> Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>> >> >> running 10s outbound test (client to server) . . . . . 894.71Kb/s
>> >> >> running 10s inbound test (server to client) . . . . . . 3.86Mb/s
>> >> >> Your PC is connected to a Cable/DSL modem
>> >> >> Information: Other network traffic is congesting the link
>> >> >>
>> >> >>
>> >> >> WEB100 Kernel Variables:
>> >> >> Client: localhost/127.0.0.1
>> >> >> AckPktsIn: 3330
>> >> >> AckPktsOut: 0
>> >> >> BytesRetrans: 81420
>> >> >> CongAvoid: 2639
>> >> >> CongestionOverCount: 0
>> >> >> CongestionSignals: 27
>> >> >> CountRTT: 2802
>> >> >> CurCwnd: 22080
>> >> >> CurMSS: 1380
>> >> >> CurRTO: 248
>> >> >> CurRwinRcvd: 258060
>> >> >> CurRwinSent: 5888
>> >> >> CurSsthresh: 16560
>> >> >> DSACKDups: 0
>> >> >> DataBytesIn: 0
>> >> >> DataBytesOut: 8879328
>> >> >> DataPktsIn: 0
>> >> >> DataPktsOut: 6192
>> >> >> DupAcksIn: 481
>> >> >> ECNEnabled: 0
>> >> >> FastRetran: 27
>> >> >> MaxCwnd: 63480
>> >> >> MaxMSS: 1380
>> >> >> MaxRTO: 295
>> >> >> MaxRTT: 111
>> >> >> MaxRwinRcvd: 258060
>> >> >> MaxRwinSent: 5888
>> >> >> MaxSsthresh: 41400
>> >> >> MinMSS: 1380
>> >> >> MinRTO: 229
>> >> >> MinRTT: 20
>> >> >> MinRwinRcvd: 238740
>> >> >> MinRwinSent: 5888
>> >> >> NagleEnabled: 1
>> >> >> OtherReductions: 0
>> >> >> PktsIn: 3330
>> >> >> PktsOut: 6192
>> >> >> PktsRetrans: 59
>> >> >> X_Rcvbuf: 16777216
>> >> >> RcvWinScale: 8
>> >> >> SACKEnabled: 3
>> >> >> SACKsRcvd: 510
>> >> >> SendStall: 0
>> >> >> SlowStart: 152
>> >> >> SampleRTT: 42
>> >> >> SmoothedRTT: 48
>> >> >> X_Sndbuf: 16777216
>> >> >> SndWinScale: 2
>> >> >> SndLimTimeRwin: 0
>> >> >> SndLimTimeCwnd: 18404625
>> >> >> SndLimTimeSender: 8258
>> >> >> SndLimTransRwin: 0
>> >> >> SndLimTransCwnd: 1
>> >> >> SndLimTransSender: 1
>> >> >> SndLimBytesRwin: 0
>> >> >> SndLimBytesCwnd: 8879328
>> >> >> SndLimBytesSender: 0
>> >> >> SubsequentTimeouts: 0
>> >> >> SumRTT: 127937
>> >> >> Timeouts: 0
>> >> >> TimestampsEnabled: 0
>> >> >> WinScaleRcvd: 2
>> >> >> WinScaleSent: 8
>> >> >> DupAcksOut: 0
>> >> >> StartTimeUsec: 118172
>> >> >> Duration: 18416093
>> >> >> c2sData: 2
>> >> >> c2sAck: 2
>> >> >> s2cData: 9
>> >> >> s2cAck: 3
>> >> >> half_duplex: 0
>> >> >> link: 100
>> >> >> congestion: 1
>> >> >> bad_cable: 0
>> >> >> mismatch: 0
>> >> >> spd: 0.00
>> >> >> bw: 3.49
>> >> >> loss: 0.004360465
>> >> >> avgrtt: 45.66
>> >> >> waitsec: 0.00
>> >> >> timesec: 18.00
>> >> >> order: 0.1444
>> >> >> rwintime: 0.0000
>> >> >> sendtime: 0.0004
>> >> >> cwndtime: 0.9996
>> >> >> rwin: 1.9688
>> >> >> swin: 128.0000
>> >> >> cwin: 0.4843
>> >> >> rttsec: 0.045659
>> >> >> Sndbuf: 16777216
>> >> >> aspd: 8.63416
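
Several of the lowercase values at the end of this dump can be reproduced from the raw counters above them. The formulas in the sketch below are inferred from the numbers in this one dump, not taken from the NDT source, so treat them as an assumption; derive and SAMPLE are illustrative names.

```python
# Raw counters copied from the dump above.
SAMPLE = {
    "CongestionSignals": 27, "PktsOut": 6192,
    "SumRTT": 127937, "CountRTT": 2802,
    "DupAcksIn": 481, "AckPktsIn": 3330,
    "SndLimTimeRwin": 0, "SndLimTimeCwnd": 18404625, "SndLimTimeSender": 8258,
    "MaxRwinRcvd": 258060, "Sndbuf": 16777216, "MaxCwnd": 63480,
}


def derive(v: dict) -> dict:
    """Recompute summary values from raw counters (inferred formulas)."""
    limited_total = (v["SndLimTimeRwin"] + v["SndLimTimeCwnd"]
                     + v["SndLimTimeSender"])
    megabit = 1024 * 1024  # window sizes appear to be reported in 2**20 bits
    return {
        "loss": v["CongestionSignals"] / v["PktsOut"],
        "avgrtt": v["SumRTT"] / v["CountRTT"],            # milliseconds
        "order": v["DupAcksIn"] / v["AckPktsIn"],
        "rwintime": v["SndLimTimeRwin"] / limited_total,
        "sendtime": v["SndLimTimeSender"] / limited_total,
        "cwndtime": v["SndLimTimeCwnd"] / limited_total,
        "rwin": v["MaxRwinRcvd"] * 8 / megabit,
        "swin": v["Sndbuf"] * 8 / megabit,
        "cwin": v["MaxCwnd"] * 8 / megabit,
    }


if __name__ == "__main__":
    for name, value in derive(SAMPLE).items():
        print(f"{name}: {value:.6f}")
```

Read this way, the dump says the sender spent almost all of its time limited by the congestion window (cwndtime close to 1), and that the 128 Mbit send buffer (swin) was far larger than the congestion window ever grew (cwin under 0.5 Mbit).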
>> >> >
>> >> >
>> >> >
>> >
>> >
>> >
>
>
>


