perfsonar-user - [perfsonar-user] Another IPv6 throughput test problem
Subject: perfSONAR User Q&A and Other Discussion
List archive
- From: Brian Candler <>
- To: "" <>
- Subject: [perfsonar-user] Another IPv6 throughput test problem
- Date: Sun, 8 Sep 2019 18:57:05 +0100
- Domainkey-signature: a=rsa-sha1; c=nofws; d=pobox.com; h=to:from:subject :message-id:date:mime-version:content-type; q=dns; s=sasl; b=twx G0FeycE3SMVAa13PvbPSLJPvXCn93gS7+tgQuRQ6q1/kJwbVIxjvam4rrdLpUIjB WAzp6MefwRiHKy78CdfeqCBsD5Txfr3e5AFPmyq5YsSDZ8BGtpwjsfbjZYZKGBoX ILR7AbcT1kYm3+K6oB0PqDpdQKoMu6RzrFQamtjY=
Just to make things more fun, I also had a third (completely different) problem with inbound tests not working from another host. That remote host is running ps-toolkit 4.0.2 under CentOS 6. The symptom was that again outbound tests were fine; but this
time, inbound tests were not even being scheduled. It turns out this was another problem caused by IPv6, and again it was a long problem to debug, but I've put the thread below in case anyone is interested in more perfsonar internals. TL;DR version is: - the local GUI updates tests in a file, while a background
process picks up the file and converts it into pscheduler
requests. Any errors happen in the background, and are not
reported back in the GUI. (Also, a change in 4.2 means the error
detail is removed from the log file too) - when you schedule an throughput test, the scheduled task for the inbound direction is actually installed on the *remote* node. That is, you are reliant on the third-party scheduler implementing your requested schedule. - perfsonar 4.0 toolkit makes a call back to the requesting host to see whether it's running pscheduler or bwctl- this callback fails if the other node's address is an IPv6
literal - the server returns a 400 response. The code to make the
callback has been fixed in perfsonar 4.2, so it's not important to
fix going forward, but it is very simple to reproduce. (I'm not sure why it was using a literal, but I suspect it's
because the originating node has forward DNS but not reverse DNS).
Brian Candler. -=-=-=-=- LONG VERSION: The remote host is running toolkit 4.0.2. This time, the problem is that the inbound task is not scheduled at all. In fact, when I add the remote host in the GUI, then check the "task" table in my local pscheduler postgres database, only one new task has been added. But if I add a different remote host (toolkit 4.1.6), two tasks are added - one inbound, one outbound. In the GUI, no error is displayed. It turns out that the GUI
just writes out some XML and JSON, and relies on a background task
(psconfig-pschduler-agent) to pick this up and apply this to the
database. I found an error in
/var/log/perfsonar/psconfig-pscheduler-agent-transactions.log: 2019/09/08 15:37:40 INFO
guid=7910e517-d841-46e0-86e1-32474618ec60 action="create"
test_type=throughput
lead_url=https://wrn-perf1.CCCC.net/pscheduler
checksum=Zb3vQyp2KcNPld8yKyvoJg
task={"tools":["bwctliperf3","iperf3"],"archives":[{"archiver":"esmond","data":{"measurement-agent":"A:A:A:A::45","url":"https://[A:A:A:A::45]/esmond/perfsonar/archive/","_auth-token":"b49e1d83e44321a56a6aa8d6ebb7957bb1bb9f35"}}],"reference":{"psconfig":{"created-by":{"uuid":"b042dea5-40fd-4c63-9b0b-1e0f63bf59a6","user-agent":"psconfig-pscheduler-agent"}}},"schedule":{"slip":"PT43200S","repeat":"PT86400S","sliprand":true,"until":"2019-09-10T15:37:29Z"},"test":{"spec":{"dest-node":"[A:A:A:A::45]","ip-version":6,"duration":"PT20S","source-node":"wrn-perf1.CCCC.net","schema":1,"source":"wrn-perf1.CCCC.net","dest":"A:A:A:A::45"},"type":"throughput"}} Perfsonar just ignores the problem and continues regardless :-( Unfortunately, "INTERNAL SERVER ERROR" doesn't tell me anything useful. If this error is from the local esmond, nothing is logged in apache error log, nor in /var/log/esmond/*.log I added some debugging, and found that this post is being made to /pscheduler/tasks on the *remote* host. It seems that if I request in my local GUI to have a regular throughput test to a third-party host, the scheduled task for inbound throughput is created on *their* system. That is: I have to trust that other system to run the scheduled task on my behalf, at the interval I requested! Anyway, now I need to look on the remote host for the error (which fortunately I also control). Over there, apache ssl_access_log shows a 500, but there's nothing in ssl_error_log. With some more debugging I was able to reproduce the request with
curl and see the response body: < HTTP/1.1 500 INTERNAL SERVER ERROR Aha. That would have been a much more helpful message to see. I've made a PR to fix that - clearly it ought to be logged. Now it is: 2019/09/08 16:34:17 ERROR
guid=e09abb00-aed4-41c3-b963-2ac3953713f7 action="create"
msg=Problem adding test
throughput(wrn-perf1.CCCC.net->A:A:A:A::45), continuing with
rest of config: INTERNAL SERVER ERROR: Unable to determine
participants: Can't find pScheduler or BWCTL on [A:A:A:A::45] (Aside: it *was* logged in perfsonar 4.0, this ability was
removed in May 2019 in commit 97fbb726) The next question is: why is this error happening? Firewall is completely open from that remote host, and tcpdump shows it calling back on port 443 successfully. Looking on the remote host (remember that's the old perfsonar 4.0.2 node), it has been logging this error for other hosts too, for some of its meshconfig tests. This error comes from
pscheduler-server/pscheduler-server/api-server/pschedulerapiserver/tasks.py,
and at that point it invokes external command: pscheduler internal invoke test <type> participants # with the test spec on stdin On the 4.0.2 node, /var/www/pscheduler-server/pschedulerapiserver/tasks.pyc exists but not tasks.py. That's *really* annoying, as I can't look at or modify the source. But it runs /usr/libexec/pscheduler/classes/test/throughput/participants, and that's where the message "Can't find pScheduler or BWCTL on %s" comes from. It's calling pscheduler.api_has_services(). Adding some more debugging, I see it returns {'bwctl': False, 'pscheduler': False} for my 4.2 node - which is clearly wrong. And the problem is, it doesn't work with IPv6 address literals. Simple reproducer on perfsonar 4.0: >> import pscheduler >>>
pscheduler.url_get(pscheduler.api_url_hostport('2001:400:4201:1158::3'),timeout=3,throw=False,json=False,bind=None) Oddly, the 400 error is coming from the remote side. But if I
use a full hostname instead of an IPv6 literal it's fine: note the
200 response to the second request below. C:C:C:C::231 - - [08/Sep/2019:17:33:05 +0000] "GET
/pscheduler/ HTTP/1.1" 400 1279 "-" "python-requests/2.6.0
CPython/2.6.6 Linux/2.6.32-754.15.3.el6.x86_64" Sadly I can't tcpdump the requests to see what's different, because it's SSL. But what I can do is test whether the code has been fixed in perfsonar 4.2, and it has: >>>
pscheduler.url_get(pscheduler.api_url_hostport('2001:400:4201:1158::3'),timeout=3,throw=False,json=False,bind=None) Either that, or the problem was with the underlying python2.6
libraries in CentOS 6, and a newer OS with python2.7 has fixed it. |
- [perfsonar-user] Another IPv6 throughput test problem, Brian Candler, 09/08/2019
- Re: [perfsonar-user] Another IPv6 throughput test problem, Mark Feit, 09/09/2019
- Re: [perfsonar-user] Another IPv6 throughput test problem, Brian Candler, 09/09/2019
- Re: [perfsonar-user] Another IPv6 throughput test problem, Mark Feit, 09/09/2019
Archive powered by MHonArc 2.6.19.