
ndt-users - Re: Slow Inbound Tests


Re: Slow Inbound Tests


  • From: Richard Carlson <>
  • To: Clayton Keller <>,
  • Subject: Re: Slow Inbound Tests
  • Date: Wed, 19 Oct 2005 15:15:53 -0400

Hi Clay;

Great! This is saying that there really is a bunch of data getting stuck in the server's transmit queue. On my older (1.2 GHz) server I can't get as far ahead of the network as your 3 GHz CPU can, so I'm not seeing as big an impact.

I'll look at my sending strategy to see if I can't come up with some way to reduce the amount of unsent data that can exist when the test ends. In the meantime you can try reducing the TCP buffer size again. Even reducing it to the 1-2 MB range should give you good performance without building up a large queue.

Rich
At 01:01 PM 10/19/2005, Clayton Keller wrote:
Just a report back:

Here is what I changed my values in sysctl.conf to:

net.core.wmem_max = 4194304
net.core.rmem_max = 4194304
net.ipv4.tcp_wmem = 4096 87380 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_default_win_scale = 7
net.ipv4.tcp_moderate_rcvbuf = 1

Looking at the Java Console, the new settings have cut the time in half:

Old sysctl.conf settings:
wait flag received = 0
server ports 32847 32848
calling in2Socket.getLocalAddress()
339 Kbs outbound got 19
12640256 bytes 1506.2268827454718 Kb/s 67.136 secs
Calling InetAddress.getLocalHost() twice

New sysctl.conf settings:
wait flag received = 0
server ports 32849 32850
calling in2Socket.getLocalAddress()
382 Kbs outbound got 19
6324224 bytes 1349.7436773023157 Kb/s 37.484 secs
Calling InetAddress.getLocalHost() twice

I did run the test without the "-m" and it returned similar results. So
this did help out some: we cut the buffer in half, and the time to return
results was also cut in half.

Clay

Clayton Keller wrote:
> Rich,
>
> The server currently is not doing much else. Load averages on the server
> sit pretty much at 0.00. It is a Pentium-4 3.40GHz with 2GB of RAM.
> There is not anything else running on it that is causing any heavy loads
> or additional traffic at this time.
>
> Currently, I have the following lines added to the /etc/sysctl.conf
> file, which I acquired from the README:
>
> # Recommended sysctl settings from web100 README
> net.core.wmem_max = 8388608
> net.core.rmem_max = 8388608
> net.ipv4.tcp_wmem = 4096 65536 8388608
> net.ipv4.tcp_rmem = 4096 87380 8388608
> net.ipv4.tcp_default_win_scale = 7
> net.ipv4.tcp_moderate_rcvbuf = 1
>
> I can go ahead and make the adjustments that you recommended, but didn't
> know if I should be making any further changes as well.
>
> I will run some further tests with the new settings and also with the
> "-m" flag removed. However, I wanted to run the sysctl.conf settings
> that we currently have by you first, and see if I should look at further
> changes there.
>
> Clay
>
> Richard Carlson wrote:
>
>>Hi Clay
>>
>>OK, I looked at the traces and the web100 stats and there are a couple
>>of things that stand out.
>>
>>1) your server is set to use 16 MB buffers.
>>2) this inbound test ran for 18 seconds (Duration and SndLimTimeCwnd)
>>3) the trace (.2790) shows that data stops flowing, but the connection
>>isn't closing gracefully (no TCP FIN packets being exchanged). [This
>>might be another bug in my server code]
>>
>>It's not clear to me why the test is running so long. What else is
>>running on this server? Is it very busy? What does "/usr/bin/top"
>>report? Finally, what messages appear in the client's Java console
>>window? The client will report how long it spent reading data from
>>the network.
>>
>>Things to try:
>>* One thing would be to reduce the maximum sender buffer size. Try
>>making the max 4 MB instead of 16. Edit the /etc/sysctl.conf file and
>>change the following lines.
>># increase Linux autotuning TCP buffer limits
>>net.ipv4.tcp_rmem = 4096 87380 16777216
>>net.ipv4.tcp_wmem = 4096 87380 16777216
>>
>>to:
>>
>># increase Linux autotuning TCP buffer limits
>>net.ipv4.tcp_rmem = 4096 87380 4194304
>>net.ipv4.tcp_wmem = 4096 87380 4194304
>>
>>and then run the "/sbin/sysctl -p" command.
>>
>>One possible problem is that the server is faster than the network so
>>data is being placed in the send queue. The connection wouldn't
>>shut down until the queue is empty. So even if the NDT process stops
>>sending after 10 seconds, it could take some time to drain the queue.
>>With a 4 MB queue it would take less time to drain.
>>
>>That said, it isn't clear why the client is hanging for so long. I
>>guess it's also possible that my shutdown patch isn't working properly
>>in the multi-client mode. Can you try running the web100srv process
>>without the -m flag? This will cause the server to handle clients in a
>>FIFO manner. If the server is busy the incoming clients will receive a
>>message saying the server is busy and a test will begin in xx seconds.
>>The client is updated every time another client's test finishes. I know
>>the shutdown() patch fixed a hang there, if possible give it a try and
>>let me know what happens.
>>
>>That's all I can think of right now; I'll think about it some more
>>tonight and run some tests tomorrow.
>>
>>Rich
>>
>>At 09:08 AM 10/18/2005, Clayton Keller wrote:
>>
>>
>>>Rich,
>>>
>>>We are still seeing issues with the Inbound tests even after reverting
>>>to the 2.6.12.5 kernel. This is not the Fedora Source kernel that
>>>Martin is using, but the stock kernel.org download.
>>>
>>>I would like to go ahead and submit another trace for you. Is there a
>>>possibility that the issues we are seeing are network/bandwidth issues
>>>on our part?
>>>
>>>From my connection, which is on a different network, the Outbound test
>>>took approx. 10 seconds while the Inbound test took well over one
>>>minute. The info you are receiving is from a connection on that same
>>>network. The Inbound test took about one minute before it reported its
>>>results back to the user.
>>>
>>>I apologize, but I am not quite sure what all info is found in the
>>>trace, so I guess that is why I am asking you if there are external
>>>issues on our end that may be part of the cause.
>>>
>>>Also, I could look at using one of the Fedora kernels and patch it
>>>like Martin did.
>>>
>>>Clay
>>>
>>>
>>>
>>>Richard Carlson wrote:
>>>
>>>
>>>>Hi Clay;
>>>>The trace you sent does show a problem. At this point I don't see a
>>>>need for more, but it would be useful to see what the 2.6.12 kernel
>>>>does. So I'd suggest you revert back to the 2.6.12 kernel and I'll
>>>>try and figure out how to get the kernel problem resolved.
>>>>Rich
>>>>At 09:21 AM 10/17/2005, Clayton Keller wrote:
>>>>
>>>>
>>>>>Richard Carlson wrote:
>>>>>
>>>>>
>>>>>>Hi Craig;
>>>>>>No, this NDT bug affects all servers. I ran into it while testing
>>>>>>from multiple clients. Clients 2, 3, & 4 would get the "Other
>>>>>>client testing please wait..." type message. Client 2 would not
>>>>>>get the final results until client 4 finished. I'll add this patch
>>>>>>to my next distribution, or you can apply it now if you are
>>>>>>experiencing some problems.
>>>>>>Since this didn't fix Clay's problem, I may need to rethink how the
>>>>>>tests are done. Right now the server simply streams data out for
>>>>>>10 seconds, sending as much as it can. Given the way TCP works,
>>>>>>there is a probability that the server will build up a queue in the
>>>>>>Send buffer (the bus is faster than the wire). This buffer will
>>>>>>need to drain before the test is complete. Packet loss or other
>>>>>>factors could mean that this draining takes a long time, so the
>>>>>>client simply sits there waiting. If it takes too long, the server
>>>>>>process will time-out and terminate so the client will never get
>>>>>>the final results.
>>>>>>More later.
>>>>>>Rich
>>>>>>At 08:26 AM 10/14/2005, Pepmiller, Craig E. wrote:
>>>>>>
>>>>>>
>>>>>>>Ok, so this is only seen when the NDT machine is configured for
>>>>>>>multiple
>>>>>>>simultaneous clients?
>>>>>>>
>>>>>>>Thanks-
>>>>>>>-Craig
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Richard Carlson [mailto:]
>>>>>>>Sent: Wednesday, October 12, 2005 2:56 PM
>>>>>>>To: Clayton Keller;
>>>>>>>Subject: Re: Slow Inbound Tests
>>>>>>>
>>>>>>>Hi Clayton;
>>>>>>>
>>>>>>>This is a bug in the web100srv code. I forgot to shut down the
>>>>>>>control socket at the end of the test. If there are multiple
>>>>>>>clients then the final results are sent in a LIFO manner, so the
>>>>>>>first client needs to wait until all subsequent clients are done
>>>>>>>before the results are returned.
>>>>>>>
>>>>>>>I'll issue a patched version soon. In the meantime you can patch
>>>>>>>your version by hand by adding the line "shutdown(ctlsockfd,
>>>>>>>SHUT_RDWR);" to the web100srv.c file (on line 1126).
>>>>>>>
>>>>>>>Let me know if that fixes things.
>>>>>>>
>>>>>>>Rich
>>>>>>>
>>>>>>>
>>>>>>>---------------------------------------------------------------
>>>>>>>Original code:
>>>>>>>    if (admin_view == 1) {
>>>>>>>      totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>>          PktsOut, DupAcksIn, AckPktsIn,
>>>>>>>          CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>>>>>>>          MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>>          mismatch, bad_cable,
>>>>>>>          (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>>      gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>>          Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch, bad_cable,
>>>>>>>          totalcnt, debug);
>>>>>>>    }
>>>>>>>
>>>>>>>    /* printf("Saved data to log file\n"); */
>>>>>>>
>>>>>>>    /* exit(0); */
>>>>>>>}
>>>>>>>
>>>>>>>main(argc, argv)
>>>>>>>
>>>>>>>----------------------------------------------------------
>>>>>>>Modified code:
>>>>>>>    if (admin_view == 1) {
>>>>>>>      totalcnt = calculate(SumRTT, CountRTT, CongestionSignals,
>>>>>>>          PktsOut, DupAcksIn, AckPktsIn,
>>>>>>>          CurrentMSS, SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSender,
>>>>>>>          MaxRwinRcvd, CurrentCwnd, Sndbuf, DataBytesOut,
>>>>>>>          mismatch, bad_cable,
>>>>>>>          (int)bwout, (int)bwin, c2sdata, s2cack, 1, debug);
>>>>>>>      gen_html((int)bwout, (int)bwin, MinRTT, PktsRetrans, Timeouts,
>>>>>>>          Sndbuf, MaxRwinRcvd, CurrentCwnd, mismatch, bad_cable,
>>>>>>>          totalcnt, debug);
>>>>>>>    }
>>>>>>>    shutdown(ctlsockfd, SHUT_RDWR);
>>>>>>>
>>>>>>>    /* printf("Saved data to log file\n"); */
>>>>>>>
>>>>>>>    /* exit(0); */
>>>>>>>}
>>>>>>>
>>>>>>>main(argc, argv)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>At 01:54 PM 10/12/2005, Clayton Keller wrote:
>>>>>>>
>>>>>>>>I wanted to address this to the list. I believe there was a
>>>>>>>>similar post a week or so back, but I wanted to address this
>>>>>>>>clean.
>>>>>>>>
>>>>>>>>I currently have web100srv running from /etc/init.d/ndt with the
>>>>>>>>following:
>>>>>>>>
>>>>>>>>/usr/local/sbin/web100srv -a -m -l /var/log/web100/web100srv.log
>>>>>>>>
>>>>>>>>The system is running on Fedora Core 4 using a patched 2.6.13
>>>>>>>>kernel from kernel.org.
>>>>>>>>
>>>>>>>>The server itself is also sitting behind a PIX firewall.
>>>>>>>>
>>>>>>>>We have noticed that the Outbound Test will run rather quickly,
>>>>>>>>but when the Inbound, server-to-client, test is run it can take
>>>>>>>>upwards of several minutes to complete, many times as much as 4
>>>>>>>>minutes. There are other times where, from the end user's
>>>>>>>>point of view, it appears the test never completes, although you
>>>>>>>>can see results for the test appear in the web100.log file. The
>>>>>>>>test, though, will continue to sit on the "running 10s inbound
>>>>>>>>test (server to client) . . . . . ." portion of the test, and
>>>>>>>>many users are beginning to just close out the window.
>>>>>>>>
>>>>>>>>At this point I am looking for general issues that I can look
>>>>>>>>into and possibly run debug against as far as these tests are
>>>>>>>>concerned.
>>>>>>>>
>>>>>>>>Clayton Keller
>>>>>>>
>>>>>>>------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>Richard
>>>>>
>>>>>Did you want me to grab any more traces on newer versions of the
>>>>>2.6.13.x kernel or more on the current kernel it is running? Or
>>>>>should I revert back to my 2.6.12.5 kernel and see how performance
>>>>>improves?
>>>>>
>>>>>I saw from an earlier post to a different thread that it appears
>>>>>you are seeing some items in the traces that are alluding to issues
>>>>>pertaining to the 2.6.13.x kernel.
>>>>>
>>>>>Clay
>>>>
>>>>
>>>>------------------------------------
>>>>
>>>>Richard A. Carlson e-mail:
>>>>
>>>>Network Engineer phone: (734) 352-7043
>>>>Internet2 fax: (734) 913-4255
>>>>1000 Oakbrook Dr; Suite 300
>>>>Ann Arbor, MI 48104
>>>
>>>
>>>
>>>
>>>TCP/Web100 Network Diagnostic Tool v5.3.3e
>>>click START to begin
>>>Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
>>>running 10s outbound test (client to server) . . . . . 894.71Kb/s
>>>running 10s inbound test (server to client) . . . . . . 3.86Mb/s
>>>Your PC is connected to a Cable/DSL modem
>>>Information: Other network traffic is congesting the link
>>>
>>>
>>>WEB100 Kernel Variables:
>>>Client: localhost/127.0.0.1
>>>AckPktsIn: 3330
>>>AckPktsOut: 0
>>>BytesRetrans: 81420
>>>CongAvoid: 2639
>>>CongestionOverCount: 0
>>>CongestionSignals: 27
>>>CountRTT: 2802
>>>CurCwnd: 22080
>>>CurMSS: 1380
>>>CurRTO: 248
>>>CurRwinRcvd: 258060
>>>CurRwinSent: 5888
>>>CurSsthresh: 16560
>>>DSACKDups: 0
>>>DataBytesIn: 0
>>>DataBytesOut: 8879328
>>>DataPktsIn: 0
>>>DataPktsOut: 6192
>>>DupAcksIn: 481
>>>ECNEnabled: 0
>>>FastRetran: 27
>>>MaxCwnd: 63480
>>>MaxMSS: 1380
>>>MaxRTO: 295
>>>MaxRTT: 111
>>>MaxRwinRcvd: 258060
>>>MaxRwinSent: 5888
>>>MaxSsthresh: 41400
>>>MinMSS: 1380
>>>MinRTO: 229
>>>MinRTT: 20
>>>MinRwinRcvd: 238740
>>>MinRwinSent: 5888
>>>NagleEnabled: 1
>>>OtherReductions: 0
>>>PktsIn: 3330
>>>PktsOut: 6192
>>>PktsRetrans: 59
>>>X_Rcvbuf: 16777216
>>>RcvWinScale: 8
>>>SACKEnabled: 3
>>>SACKsRcvd: 510
>>>SendStall: 0
>>>SlowStart: 152
>>>SampleRTT: 42
>>>SmoothedRTT: 48
>>>X_Sndbuf: 16777216
>>>SndWinScale: 2
>>>SndLimTimeRwin: 0
>>>SndLimTimeCwnd: 18404625
>>>SndLimTimeSender: 8258
>>>SndLimTransRwin: 0
>>>SndLimTransCwnd: 1
>>>SndLimTransSender: 1
>>>SndLimBytesRwin: 0
>>>SndLimBytesCwnd: 8879328
>>>SndLimBytesSender: 0
>>>SubsequentTimeouts: 0
>>>SumRTT: 127937
>>>Timeouts: 0
>>>TimestampsEnabled: 0
>>>WinScaleRcvd: 2
>>>WinScaleSent: 8
>>>DupAcksOut: 0
>>>StartTimeUsec: 118172
>>>Duration: 18416093
>>>c2sData: 2
>>>c2sAck: 2
>>>s2cData: 9
>>>s2cAck: 3
>>>half_duplex: 0
>>>link: 100
>>>congestion: 1
>>>bad_cable: 0
>>>mismatch: 0
>>>spd: 0.00
>>>bw: 3.49
>>>loss: 0.004360465
>>>avgrtt: 45.66
>>>waitsec: 0.00
>>>timesec: 18.00
>>>order: 0.1444
>>>rwintime: 0.0000
>>>sendtime: 0.0004
>>>cwndtime: 0.9996
>>>rwin: 1.9688
>>>swin: 128.0000
>>>cwin: 0.4843
>>>rttsec: 0.045659
>>>Sndbuf: 16777216
>>>aspd: 8.63416
>>
>>
>>------------------------------------
>>
>>
>>
>>Richard A. Carlson e-mail:
>>
>>Network Engineer phone: (734) 352-7043
>>Internet2 fax: (734) 913-4255
>>1000 Oakbrook Dr; Suite 300
>>Ann Arbor, MI 48104
>>
>
>

------------------------------------



Richard A. Carlson e-mail:

Network Engineer phone: (734) 352-7043
Internet2 fax: (734) 913-4255
1000 Oakbrook Dr; Suite 300
Ann Arbor, MI 48104


