Skip to Content.
Sympa Menu

perfsonar-user - Re: [perfsonar-user] More on missing tests

Subject: perfSONAR User Q&A and Other Discussion

List archive

Re: [perfsonar-user] More on missing tests


Chronological Thread 
  • From: Brian Candler <>
  • To: Aaron Brown <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] More on missing tests
  • Date: Sat, 20 Dec 2014 23:58:51 +0000
  • Domainkey-signature: a=rsa-sha1; c=nofws; d=pobox.com; h=message-id:date :from:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; q=dns; s=sasl; b=JdJdYS 94CbzUCuXAs265LuzeBgcmlEuyXt7PR23S1Ff0AdkSBfiKTQgU580oDDPp6wurKy CEuIBuDe9kE6TBkTlwuByPEnvQppPlx44plIFh9OqP+pUtodHLp/mHyIBh42331I b8hfDEIn6ckC19Mi5KkJgsJYXs6F9YXkr0Rhc=

I finally found the problem with empty iperf3 tests, and I can't believe how stupid it was :-)

Symptom: bwctl shows empty data between "SERVER START" and "SERVER END", and terminates in around 6 seconds.

While I was running the daemons with valgrind and strace attached, the problem recurred. Picking through strace output, I found that the two ends had negotiated to use TCP port 5624, but the connection was being rejected - and it turns out the iptables rules which PS Toolkit 3.4.1 installs only permit ports 5000 to 5600.

#### /etc/sysconfig/ip[6]tables ####
-A perfSONAR -m state --state NEW,ESTABLISHED -m tcp -p tcp --dport 5000:5600 -j ACCEPT
-A perfSONAR -m udp -p udp --dport 5000:5900 -j ACCEPT

Given that the port range is up to 5900 for UDP, and bwctld.conf sets test_port 5001-5900, I'm pretty sure that 5600 for TCP is a typo, and it needs to be changed to 5900.

After a one-character edit and /etc/init.d/iptables restart, it was working. If you're interested, the relevant bits of strace output are attached.

This still doesn't explain the problems I saw previously with increasing reservation time windows, possibly to do with unsynced clocks - I think that needs further investigation (if I can replicate it). I also have a live system which is only pinging in one direction and not on v6, but that may be related to OSX.

Finally: as part of this work I have made a number of cleanups in a branch at
https://code.google.com/r/bcandler100-bwctl/source/list?name=candlerb-cleanup
which fix problems identified by compiler warnings and valgrind, and I have a few more I'd like to make. I think they're pretty uncontentious and hopefully can be merged. (This is the source I was testing with).

Regards,

Brian Candler.

------------------------------------------------
Test timestamps (converted into human-readable):

current time = 13:09:37
reservation start = 13:09:40
test start = 13:09:42
reservation end = 13:09:54

On the client side, strace shows:

[pid 14839] 13:09:38.564166 nanosleep({3, 684728967}, 0xffeffc2b0) = 0
[pid 14839] 13:09:42.335900 gettid() = 14839
... snip
[pid 14839] 13:09:42.415783 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
[pid 14839] 13:09:42.416649 bind(15, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.56.13")}, 16) = 0
[pid 14839] 13:09:42.424289 gettid() = 14839
[pid 14839] 13:09:42.424378 write(1029, "F", 1) = 1
[pid 14839] 13:09:42.424520 connect(15, {sa_family=AF_INET, sin_port=htons(5624), sin_addr=inet_addr("192.168.56.6")}, 16) = -1 ECONNREFUSED (Connection refused)

(Note: the gettid() and write(1029) calls are artefacts of valgrind, but I don't want to detach it while the problem is replicating nicely)

On the listening side, strace shows:

[pid 9593] 13:09:40.524099 close(15) = 0
[pid 9593] 13:09:40.525178 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
[pid 9593] 13:09:40.526047 setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 9593] 13:09:40.526952 bind(15, {sa_family=AF_INET, sin_port=htons(5624), sin_addr=inet_addr("192.168.56.6")}, 16) = 0
[pid 9593] 13:09:40.528252 listen(15, 5) = 0
[pid 9593] 13:09:40.535577 fstat(1, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
[pid 9593] 13:09:40.535745 mmap(0x4c15000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4c15000
[pid 9593] 13:09:40.548760 gettimeofday({1419080980, 548841}, NULL) = 0
[pid 9593] 13:09:40.552338 times({tms_utime=8, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 431938808
[pid 9593] 13:09:40.555662 getrusage(RUSAGE_SELF, {ru_utime={0, 87986}, ru_stime={0, 34994}, ...}) = 0
[pid 9593] 13:09:40.560897 gettimeofday({1419080980, 561096}, NULL) = 0
[pid 9593] 13:09:40.563604 gettid() = 9593
[pid 9593] 13:09:40.564053 write(1029, "J", 1) = 1
[pid 9593] 13:09:40.564352 select(16, [15], [], NULL, NULL <unfinished ...>
[pid 9592] 13:09:44.429012 <... select resumed> ) = 1 (in [14])
... snip valgrind artefacts in pid 9592 ...
[pid 9592] 13:09:45.440407 kill(4294957703, SIGTERM <unfinished ...>
[pid 9593] 13:09:45.440446 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
[pid 9593] 13:09:45.440464 --- SIGTERM (Terminated) @ 0 (0) ---

So clearly the server calls listen() on a bound socket and was waiting in select(), but didn't see any connection. At 13:09:44, 2 seconds after the test was due to start, it gets cancelled (presumably an internal timeout)




Archive powered by MHonArc 2.6.16.

Top of Page