perfsonar-user - RE: [perfsonar-user] Bandwidth system failing to run some tests
Subject: perfSONAR User Q&A and Other Discussion
- From: "Garnizov, Ivan (RRZE)" <>
- To: Trey Dockendorf <>
- Cc: Sowmya Balasubramanian <>, perfsonar-user <>
- Subject: RE: [perfsonar-user] Bandwidth system failing to run some tests
- Date: Wed, 5 Aug 2015 16:16:04 +0000
- Accept-language: en-GB, de-DE, en-US
Hi Trey,
Yes, please check with your campus network engineers.

"I guess from the point of view of detecting a failure or issue on the network, why would no data exist in the measurement archive when command line bwctl appears to work but produce 0Mbps?"

Please note that if you see such a discrepancy, with tests succeeding but no data in the MA, this is a faulty state. Please check your regular_testing.log for failures. It might be that this is a bug in the system, but it is unlikely.

"Is the check_rec_count.pl not distributed with the perl-perfSONAR_PS-Nagios RPM?"

check_rec_count.pl is a very recent addition to the Nagios module, so it has not yet been packaged as an RPM.

Best regards,
Ivan

From: Trey Dockendorf [mailto:]
Ivan,

Thanks, I had seen mention of check_throughput.pl when experimenting with MaDDash, but it is very useful to see working examples.

So the issue of why performance is so bad and/or tests are missing is something my campus' network engineers may be able to solve. The issue reminds me of one I've seen in the past, when I misconfigured my local switch's MTU and it caused odd problems.

So far most of the "check_throughput.pl" commands I've issued produce "Unable to find any tests with data in the given time range", which matches what I see in the measurement archive graphs. The only time they differ is when I query a host with a time range of one month. Then check_throughput.pl shows data [1] while the web interface shows nothing. Actually the web interface does show the data, but only once I click the "Link to this chart", which opens the graph in a new tab.

I guess from the point of view of detecting a failure or issue on the network, why would no data exist in the measurement archive when command line bwctl appears to work but produce 0Mbps? I can see where check_rec_count.pl becomes useful if I can define a failure as either low throughput or low record count.

Is the check_rec_count.pl not distributed with the perl-perfSONAR_PS-Nagios RPM? I'm guessing by the commit history on the github repo that it's just not been put into repos yet.

Thanks,
- Trey

[1]:
# /opt/perfsonar_ps/nagios/bin/check_throughput.pl -u http://psonar-bwctl.brazos.tamu.edu/esmond/perfsonar/archive -d 74.200.187.98 -s 165.91.55.6 -a 165.91.55.6 -r 2592000 -c 0.8: -w 1:
PS_CHECK_THROUGHPUT OK - Average throughput is 6.997Gbps | Count=53;; Min=4.50272;; Max=8.9316;; Average=6.99655094339623;; Standard_Deviation=0.877849216582703;;
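
A note on reading the plugin output above: the "-w 1:" and "-c 0.8:" arguments look like the usual Nagios plugin range syntax, where "N:" raises the alert when the value falls below N. The following is only a minimal Python sketch under that assumption. It is not the code of check_throughput.pl or check_rec_count.pl, and the function names are hypothetical. It parses the "Label=value;;" performance data after the "|" and reproduces the three states quoted in this thread.

```python
import re


def below_threshold(value, spec):
    """Return True if value violates a Nagios-style 'N:' range (value < N).

    Only the 'N:' form that appears in this thread is handled; that the
    perfSONAR plugins follow this convention is an assumption.
    """
    if not spec.endswith(":"):
        raise ValueError("this sketch only handles the 'N:' form")
    return value < float(spec[:-1])


def classify(value, warning, critical):
    """Map a value to OK / WARNING / CRITICAL, checking critical first."""
    if below_threshold(value, critical):
        return "CRITICAL"
    if below_threshold(value, warning):
        return "WARNING"
    return "OK"


def parse_perfdata(line):
    """Parse 'Label=value;;' pairs found after the '|' into a dict of floats."""
    _, _, perf = line.partition("|")
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-+0-9.eE]+);", perf)}


line = ("PS_CHECK_THROUGHPUT OK - Average throughput is 6.997Gbps | Count=53;; "
        "Min=4.50272;; Max=8.9316;; Average=6.99655094339623;; "
        "Standard_Deviation=0.877849216582703;;")
stats = parse_perfdata(line)

# One-month average from the output above, checked with -w 1: -c 0.8:
print(classify(stats["Average"], "1:", "0.8:"))
# Average from the GEANT check_throughput example, same thresholds
print(classify(0.992792, "1:", "0.8:"))
# Record count from the check_rec_count example, checked with -w 3: -c 2:
print(classify(1, "3:", "2:"))
```

Under this reading, a month of data averaging about 7 Gbps is OK, 0.993 Gbps sits between the warning and critical limits, and a single record in the window is below both record-count limits, which matches the states the plugins printed.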
=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email:
Jabber:

On Wed, Aug 5, 2015 at 1:50 AM, Garnizov, Ivan (RRZE) <> wrote:

Hi Trey,

[1] Yes. TCP self-adjusts the CWND in order to keep down failed deliveries and retransmits. I see your host allows for higher CWND values, obviously. You need to make sure your router also allows for this throughput.

With tests towards only one destination you are unable to localize the problem. Please check against other toolkits in the US for higher bandwidth results, to at least eliminate the possibility that the problem is in your own garden.

Also, have you checked with the FNAL guys whether they are happy having 10G tests with you? (From what I understand you are running regular tests with them.) What are the results in the opposite direction (FNAL to psonar-owamp.brazos.tamu.edu)?

[2] Sorry for missing this yesterday, but there are 2 tools that you can use to check for data in the MA.

nagios/bin/check_throughput.pl -u http://geant.org/esmond/perfsonar/archive -d 62.40.106.147 -s 62.40.106.131 -a 62.40.106.147 -r 18000 -c 0.8: -w 1:
PS_CHECK_THROUGHPUT WARNING - Average throughput is 0.993Gbps | Count=2;; Min=0.991869;; Max=0.993715;; Average=0.992792;; Standard_Deviation=0.00130531911807037;;

The test counts the number of throughput tests found in a specified period: -r <period>
It allows specifying the initiator of the test, so we know exactly which instance requested the test: -a <requestor>

Usage:
check_rec_count.pl -u|--url <service-url> -s|--source <source-addr> -d|--destination <dest-addr> -r <number-seconds-in-past> --type (bw|owd|rttd|loss|trcrt) -w|--warning <threshold> -c|--critical <threshold> <options>

~/nagios/bin/check_rec_count.pl -u http://geant.net/esmond/perfsonar/archive -d 62.40.106.147 -s 62.40.106.131 --type bw -r 9000 -w 3: -c 2:
PS_CHECK_ESMOND_REC_COUNT CRITICAL - Total number of records is 1.000 record(s) | Count=1;; Min=1;; Max=1;; Average=1;; Standard_Deviation=0;;

Both of them basically do the same thing, but the latter is a customized version of the first, made for exactly this task of checking the DB. The ridiculous thing is that the latter was written by me and I still did not remember it yesterday.

Best regards,
Ivan

From: Trey Dockendorf [mailto:]
So I guess what I'm not sure of is whether something on the network, external to my host, could influence CWND. I start an iperf3 server on my latency host via 'iperf3 -p 5001 -s' and then run iperf3 on my bandwidth system [1]. So if this is a configuration issue on psonar-bwctl, it doesn't seem to have an impact when I run basic tests to a system on the same local 10Gbps network.

What I've noticed is that many of my tests have stopped working while others work just fine. I seem to have very few hosts with traceroute data, but a basic command line traceroute to working and non-working hosts shows that the same Texas LEARN POP is used.

Thanks,
- Trey

[1]:
# iperf3 -c psonar-owamp.brazos.tamu.edu -p 5001 -i 1
Connecting to host psonar-owamp.brazos.tamu.edu, port 5001
[  4] local 165.91.55.6 port 55768 connected to 165.91.55.4 port 5001
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.15 GBytes  9.91 Gbits/sec    0   4.04 MBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.07 MBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.16 MBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.83 Gbits/sec    0   4.74 MBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.74 MBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.74 MBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.74 MBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.74 MBytes
[  4]   8.00-9.00   sec  1.15 GBytes  9.90 Gbits/sec    0   4.74 MBytes
[  4]   9.00-10.00  sec  1.13 GBytes  9.74 Gbits/sec    0   4.74 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.88 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec                  receiver
On Tue, Aug 4, 2015 at 11:52 AM, Garnizov, Ivan (RRZE) <> wrote:

Hi Trey,

Yes, the congestion window has a huge impact on TCP communications. I am not sure if you have other toolkits to verify with, and I am too lazy to calculate the maximum traffic your config allows (which would also require the RTT to FNAL), but here is an example from a test between two 1G hosts (I believe yours are 10G):

bwctl: Using tool: iperf3
bwctl: 17 seconds until test results available
SENDER START
Connecting to host 2001:798:fc00:2c::6, port 5372
[ 15] local 2001:798:fc00:23::6 port 46317 connected to 2001:798:fc00:2c::6 port 5372
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  21.5 MBytes   181 Mbits/sec    0   4.32 MBytes
[ 15]   1.00-2.00   sec   116 MBytes   975 Mbits/sec    0   6.84 MBytes
[ 15]   2.00-3.00   sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
[ 15]   3.00-4.00   sec   119 MBytes   996 Mbits/sec    0   6.84 MBytes
[ 15]   4.00-5.00   sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
[ 15]   5.00-6.00   sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
[ 15]   6.00-7.00   sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
[ 15]   7.00-8.00   sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
[ 15]   8.00-9.00   sec   119 MBytes   996 Mbits/sec    0   6.84 MBytes
[ 15]   9.00-10.00  sec   118 MBytes   986 Mbits/sec    0   6.84 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec  1.05 GBytes   906 Mbits/sec    0             sender
[ 15]   0.00-10.00  sec  1.05 GBytes   903 Mbits/sec                  receiver
iperf Done.
SENDER END

The Cwnd is bounded by configuration on your system and is then adjusted by TCP within those limits. BUT in fact, even with your CWND size, it seems that you should at least be getting 0.79 Mbits/sec, which you are not:

"[ 14] 0.00-1.00 sec 96.1 KBytes 0.79 Mbits/sec 2 26.2 KBytes"

[2] About the missing data: well, I hope Andy will be able to guide you here.

Best regards,
Ivan

From: Trey Dockendorf [mailto:]
Ivan,

Thanks for the response. Does the low congestion window size indicate anything? I'd like to try to rule out the issue being specific to my PerfSONAR host before going to my campus' networking group to see whether networking issues are causing the problems. We are seeing data transfer issues between my site and FNAL, and I usually reference the PerfSONAR data to rule out networking issues.

The graphs I'm viewing are from the toolkit measurement archive [1] on psonar-bwctl.brazos.tamu.edu. I do not yet have MADDASH set up for my systems. The graph for tests with FNAL [2] shows things breaking around July 18th. The graphs for the Houston LEARN host show the "No data to plot" message [3]. A couple of weeks ago I used those graphs to identify a network issue occurring in early July, so I know there was data at one point.

The IPs of my PerfSONAR hosts have not changed. The only change I've applied to these systems was on July 23rd, when I updated to the latest web100 kernel and applied other pending updates (including esmond).

Thanks,
- Trey
On Tue, Aug 4, 2015 at 8:44 AM, Garnizov, Ivan (RRZE) <> wrote:

Hi Trey,

[1] It is obvious here that the communication between the 2 endpoints was successful. In your results the disturbing value is in fact the congestion window size, which is extremely low.

[2] When you are reviewing the graphs, please select the 1 month view and then use the link "Previous 1m". If there is no plot, then there is no data to plot. In fact you are not telling us which interface you are using: the MADDASH one or the toolkit measurement archive. Have you changed IPs?

Please also check for the most ridiculous case, where the bandwidth line is hidden behind the loss line, meaning there is 0 loss and almost 0 traffic. (MADDASH)

Best regards,
Ivan

From: [mailto:]
On Behalf Of Trey Dockendorf

All tests were working at one point. What's odd is that an endpoint like ps1-hardy-hstn.tx-learn.net shows "ERROR: No data to plot for the hosts and time range selected." when I try to view the graph, and clicking "1m" for a month of data shows nothing.

Command line tests [1] show what appears to be no bandwidth. This is for a system with seemingly no data on graphs. Maybe the graphs are showing the correct data, which is 0Mbps. Another host with "No data to plot" on its graphs also runs from the command line [2], but with the same 0Mbps.

If I start iperf3 on my latency host and run tests from my bandwidth system I get back nearly 10Gbps, as expected.

Thanks,
- Trey

[1]:
# bwctl -T iperf3 -f m -t 10 -i 1 -c psonar3.fnal.gov
bwctl: Using tool: iperf3
bwctl: 16 seconds until test results available
SENDER START
Connecting to host 131.225.205.23, port 5726
[ 14] local 165.91.55.6 port 58249 connected to 131.225.205.23 port 5726
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 14]   0.00-1.00   sec  96.1 KBytes  0.79 Mbits/sec    2   26.2 KBytes
[ 14]   1.00-2.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 14]   2.00-3.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 14]   3.00-4.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 14]   4.00-5.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 14]   5.00-6.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 14]   6.00-7.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 14]   7.00-8.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 14]   8.00-9.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 14]   9.00-10.00  sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 14]   0.00-10.00  sec  96.1 KBytes  0.08 Mbits/sec    5             sender
[ 14]   0.00-10.00  sec  0.00 Bytes   0.00 Mbits/sec                  receiver
iperf Done.
SENDER END

[2]:
# bwctl -T iperf3 -f m -t 10 -i 1 -c ps1-hardy-hstn.tx-learn.net
bwctl: Using tool: iperf3
bwctl: 37 seconds until test results available
SENDER START
Connecting to host 74.200.187.98, port 5579
[ 15] local 165.91.55.6 port 50987 connected to 74.200.187.98 port 5579
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  87.4 KBytes  0.72 Mbits/sec    2   26.2 KBytes
[ 15]   1.00-2.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 15]   2.00-3.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 15]   3.00-4.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 15]   4.00-5.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 15]   5.00-6.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 15]   6.00-7.00   sec  0.00 Bytes   0.00 Mbits/sec    1   26.2 KBytes
[ 15]   7.00-8.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 15]   8.00-9.00   sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
[ 15]   9.00-10.00  sec  0.00 Bytes   0.00 Mbits/sec    0   26.2 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec  87.4 KBytes  0.07 Mbits/sec    5             sender
[ 15]   0.00-10.00  sec  0.00 Bytes   0.00 Mbits/sec                  receiver
iperf Done.
SENDER END
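
As a rough cross-check of the Cwnd column in the two outputs above: when TCP is limited by its congestion window, it can send at most about one window per round trip, so throughput is bounded by cwnd / RTT. The Python sketch below works through that arithmetic. The RTT values are hypothetical, since the thread does not state the RTT to FNAL or to the Houston LEARN host, and the 1024-byte KByte is an assumption chosen because it reproduces iperf3's own "96.1 KBytes in 1 s = 0.79 Mbits/sec" line.

```python
def window_limited_mbps(cwnd_bytes, rtt_seconds):
    """Upper bound on throughput (Mbit/s) if one congestion window is sent per RTT."""
    return cwnd_bytes * 8 / rtt_seconds / 1e6


def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return target_bps * rtt_seconds / 8


# Cwnd reported in the failing tests above (assumes 1 KByte = 1024 bytes).
CWND_BYTES = 26.2 * 1024

# Hypothetical round-trip times; none of these were measured in this thread.
for rtt_ms in (10, 25, 50):
    rtt = rtt_ms / 1000
    limit = window_limited_mbps(CWND_BYTES, rtt)
    needed_mb = window_needed_bytes(10e9, rtt) / 1e6
    print(f"RTT {rtt_ms} ms: 26.2 KByte window allows about {limit:.1f} Mbit/s; "
          f"filling 10 Gbit/s needs about {needed_mb:.1f} MB in flight")
```

For any of these assumed RTTs the 26.2 KByte window alone would still allow a few Mbit/s, so a sustained 0.00 Mbits/sec with retransmits points at packets not getting through rather than at the window size by itself.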
On Mon, Aug 3, 2015 at 3:32 PM, Sowmya Balasubramanian <> wrote:

Hi Trey,

Was the test working at some point? Can you try running a test from the command line from your host to the other host and send the results? There is a possibility that the firewall rules or BWCTL limits on the other side are preventing your host from running the test.

Thanks,
Sowmya

On Mon, Aug 3, 2015 at 10:17 AM, Trey Dockendorf <> wrote:

I just discovered my bandwidth testing host is failing to run some of the configured tests. I'm seeing errors like this in /var/log/perfsonar/regular_testing.log:

2015/08/03 11:57:35 (21746) ERROR> MeasurementArchiveChild.pm:125 perfSONAR_PS::RegularTesting::Master::MeasurementArchiveChild::__ANON__ - Problem handling test results: Problem storing results: Error writing metadata: Error running test to psonar3.fnal.gov with output bwctl: start_endpoint: 3634164453.980392

Attached is my regular_testing.log

The tests I was hoping to look at were against psonar3.fnal.gov. I've noticed other tests are missing data too, like tests to tx-learn.net hosts. Tests to psnr-bw01.slac.stanford.edu do show data, but with lots of "red dots" at the top of the graphs.

I saw some mention of NTP problems in the logs, so I forced NTP server updates via the interface, hoping to get closer NTP servers used.

# ntpq -p -c rv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
-nms-rlat.chic.n 141.142.143.138  2 u   11   64  377   24.055   -3.209   0.407
*nms-rlat.hous.n .IRIG.           1 u    6   64  377   18.186    1.227   0.230
-nms-rlat.salt.n 128.138.140.44   2 u   65   64  377   33.024   -4.304   0.435
+time2.chpc.utah 198.60.22.240    2 u    8   64  377   34.591   -2.486   0.335
+time3.chpc.utah 198.60.22.240    2 u    2   64  377   34.534   -2.771   0.219
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd Mon Mar 16 14:53:03 UTC 2015 (1)",
processor="x86_64", system="Linux/2.6.32-504.30.3.el6.web100.x86_64",
leap=00, stratum=2, precision=-24, rootdelay=18.186, rootdisp=18.007,
refid=64.57.16.162,
reftime=d96a2001.60695f02 Mon, Aug 3 2015 12:14:41.376,
clock=d96a2090.760490a5 Mon, Aug 3 2015 12:17:04.461, peer=21380, tc=6,
mintc=3, offset=-0.105, frequency=41.314, sys_jitter=2.330,
clk_jitter=1.141, clk_wander=0.151

# ntpstat
synchronised to NTP server (64.57.16.162) at stratum 2
   time correct to within 27 ms
   polling server every 64 s

Let me know what other information would be useful to debug this.

Thanks,
- Trey