perfsonar-user - Re: [perfsonar-user] Still looking at pS-SB scheduled BWCTL tests

Subject: perfSONAR User Q&A and Other Discussion

  • From: Richard Carlson <>
  • To: "Jeff W. Boote" <>
  • Cc: Aaron Brown <>, John F Bigrow <>, Shawn McKee <>,
  • Subject: Re: [perfsonar-user] Still looking at pS-SB scheduled BWCTL tests
  • Date: Fri, 27 Feb 2009 14:02:24 -0600

Hi Jeff;

I don't recall the 3rd party results, but nothing jumped out at me either. I can say that my tests to AA were showing the expected results (100 Mbps, limited by the AA node).

I'll keep on it. One question: is there likely to be info in the UC logs that would help isolate the problem? The UC folks just completed the rebuild process and want to know if they need to send in any logs.

Rich

On Feb 27, 2009, at 12:17 PM, Jeff W. Boote wrote:


On Feb 27, 2009, at 10:58 AM, Richard Carlson wrote:

Hi Jeff;

I have run bwctl tests on the command line using both direct connects to each site and 3rd party tests between BNL & UC. The direct tests to UC seem to work most of the time; getting to BNL is more of a problem because of scheduling issues (no time slots). 3rd party tests work sometimes, and other times I get a session rejected message. I assume that this is due to a shortage of time slots, but I don't have any evidence to back this assumption up.

What kind of performance are you getting when it works? (I'm wondering if performance might be so bad that the iperf test is lasting longer than it should, so it doesn't complete in time and is therefore killed before it can print out any data...)

I believe this was suggested before, but I would recommend reducing the number of test peers substantially (like perhaps to one) and get it working before scaling up.

It is much easier to debug the connectivity issues one peer-wise pair at a time if that is part of this. And it makes it easier to augment the debugging of problems with on-demand command-line tests without having the scheduling issues be such a problem.

jeff



Rich

On Feb 27, 2009, at 10:49 AM, Jeff W. Boote wrote:

I would recommend attempting to run bwctl tests on the command line. And if that doesn't work, try iperf directly with each one of the ports that bwctl is defined to use.

Every single one of the errors I see in this email looks like the iperf data is null - either because the iperf connection didn't happen at all, or because no data was received.

My off-the-cuff, not enough data to really determine, guess is that there is a firewall interfering with the iperf communication.
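One rough way to check the firewall guess is to probe the iperf ports from the sending side while an iperf server is listening on the receiving side (for example `iperf -s -p 5008`, using the same flags that appear in the exec_line in the log further down). The sketch below is only an illustration, not part of bwctl or pS-SB: the function names are made up here, and the port range to test is whatever range bwctl is configured to use at each site. A connect that times out hints at packets being dropped silently (often a firewall), while a refusal usually just means nothing was listening on that port at that moment.

```python
import socket


def classify_port(host, port, timeout=3.0):
    """Try one TCP connect and return 'open', 'refused', or 'filtered'.

    'filtered' covers timeouts and other socket errors, which is what a
    firewall that silently drops packets tends to look like.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener.
        return "refused"
    except OSError:
        # Includes socket.timeout and unreachable-network errors.
        return "filtered"
    finally:
        sock.close()


def scan(host, ports, timeout=3.0):
    """Classify every port in `ports` and return a {port: state} dict."""
    return {port: classify_port(host, port, timeout) for port in ports}
```

Calling `scan("receiver.example.org", range(5001, 5011))` would cover ten ports; both the host name and the range are placeholders, so substitute the peer and the iperf port range actually configured for bwctl.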

jeff

On Feb 27, 2009, at 8:13 AM, Richard Carlson wrote:

All;

John sent me a tarball with the contents of the /var/log directory. I looked through the syslog files and see the bwcollector.pl script writing log entries. I can put this tarball somewhere if that helps. Any suggestions (my home dir on packrat?)

I did notice that the BNL node is contacting the UC node, but it usually fails. Most of the time it says:
Feb 23 08:49:28 NPToolkit bwcollector.pl[3983]: Use of uninitialized value in numeric lt (<) at /usr/local/AMI/script/bwcollector.pl line 960, <SESS> line 16.
Feb 23 08:49:28 NPToolkit bwcollector.pl[3983]: Use of uninitialized value in subtraction (-) at /usr/local/AMI/script/bwcollector.pl line 960, <SESS> line 16.
Feb 23 08:49:28 NPToolkit bwcollector.pl[3983]: IGNORED: termination problem KNOPPIX-BWTCP4_UCT2-NET1_UCHICAGO_EDU_KNOPPIX file = /var/lib/pSB_MP/bwctl/upload/p8gFNrjVZI at /usr/local/AMI/script/bwcollector.pl line 961, <SESS> line 16.

A few times it recorded a contact with the UC node, but no test was run:
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: BEGIN IGNORED: empty KNOPPIX-BWTCP4_KNOPPIX_UCT2-NET1_UCHICAGO_EDU file = /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN at /usr/local/AMI/script/bwcollector.pl line 941, <SESS> line 10.
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: bwctl: exec_line: /usr/local/bin/iperf -B 192.12.15.23 -s -f b -m -p 5008 -t 10 -i 2
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: bwctl: start_tool: 3444388022.863175
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: bind failed: Address already in use
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: ------------------------------------------------------------
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: Server listening on TCP port 5008
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: Binding to local address 192.12.15.23
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: TCP window size: 87380 Byte (default)
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: ------------------------------------------------------------
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: bwctl: remote peer cancelled test
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN: bwctl: stop_exec: 3444388027.477541
Feb 23 09:27:07 NPToolkit bwcollector.pl[3983]: END IGNORED: empty KNOPPIX-BWTCP4_KNOPPIX_UCT2-NET1_UCHICAGO_EDU file = /var/lib/pSB_MP/bwctl/upload/lt0YQJGqAN at /usr/local/AMI/script/bwcollector.pl line 946, <SESS> line 20.
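To get a feel for how often each kind of failure shows up across a whole syslog file, the entries above can be tallied by message type. This is a minimal sketch that assumes the log lines look like the excerpts quoted here; the category names and the `tally` function are invented for illustration and are not part of bwcollector.pl.

```python
import re
from collections import Counter

# Message fragments taken from the log excerpts quoted in this thread.
# The category names on the left are made up for this sketch.
PATTERNS = {
    "termination_problem": re.compile(r"IGNORED: termination problem"),
    "empty_result": re.compile(r"BEGIN IGNORED: empty"),
    "bind_failed": re.compile(r"bind failed: Address already in use"),
    "peer_cancelled": re.compile(r"remote peer cancelled test"),
}


def tally(lines):
    """Count bwcollector.pl log lines that match each known failure pattern."""
    counts = Counter()
    for line in lines:
        if "bwcollector.pl" not in line:
            continue  # skip syslog entries written by other programs
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Something like `tally(open("/var/log/messages"))` would then give a count per category; the path follows the suggestion later in this thread and may differ on other installs. A high `bind_failed` count would point at overlapping iperf servers on the same port rather than at a firewall.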

I did find at least 1 example where the test successfully ran.

What's the next step?

Rich

On Feb 25, 2009, at 1:27 PM, Jeff W. Boote wrote:

Then initially, I would suggest that the BNL host is not even contacting the Chicago host. (At least not the one that corresponds to this log file.) Otherwise, there would be more connections logged.

jeff

On Feb 25, 2009, at 12:20 PM, Richard Carlson wrote:

Jeff;

These are the log files from the receiving side. The pS-SB is running on the BNL node.

John, would you please create a tarball of the /var/log dir on lhcmon and email/post it on a URL?
Thanks.

Rich

On Feb 25, 2009, at 12:32 PM, Jeff W. Boote wrote:

Rich - in looking at that log dir, I don't see any of the regularly scheduled tests... Only some on-demand ones. These logs don't seem to match the data you describe below... where are you seeing this data?

jeff

On Feb 25, 2009, at 10:21 AM, Richard Carlson wrote:

Hi Aaron;

Charles created a tarball with a complete dump from the /var/log directory on the UC machine. Unfortunately I can't tell what is causing the tests to fail. I can see requests coming in from the BNL node, but nothing jumps out at me as to why the test failed. What should we be looking for? (I can also point you at the tarball if that would help).

Rich

On Feb 25, 2009, at 9:02 AM, Aaron Brown wrote:


On Feb 25, 2009, at 9:57 AM, Richard Carlson wrote:

Hi John;

Thanks for the logs. I see from the log extract that your server is running 15 sec tests. However, I also see 10 & 60 sec requests, so I'm assuming that the 10 sec tests are from people using the command line version (10 sec is the default test period) and that some other servers are scheduling tests via the perfSONAR-Buoy interface on their machines.

Here's what I see right now. Testing to/from MSU is working, but we are losing data. The current graphs show 7 successful tests from MSU --> BNL starting at 1:47 am and ending at 8:27. The 4:45 am test is missing. The graphs also show 4 tests in the opposite direction, missing 1, 3, & 6 am.

I also see tests to/from OU with 4 tests from OU -> BNL at 1, 2, 7, & 9 am with 3 in the opposite direction at midnight, 1, & 7 am. The rest were unsuccessful.

I did notice that you are testing to the psum01 node, which I think Shawn set up as the delay server, so that's why none of those work. I think you want to test to the psum02 node.

I'm still wondering if your server is just trying to run too many tests (19 peers in your config) and I don't know how many other servers are requesting tests. As I see it, we can try reducing the number of peers, or we can get the pS folks to help guide us on determining why some tests are failing and why some peers aren't responding at all.

Jeff/Aaron, what log files should we be looking at to determine what is going on?

I'd take a look in /var/log/messages; the output should be going there.

Cheers,
Aaron

Richard Carlson

1000 Oakbrook Dr
Ann Arbor, MI 48104

P: 734-352-7043
C: 630-251-4572


