perfsonar-user - [perfsonar-user] Another IPv6 throughput test problem

Subject: perfSONAR User Q&A and Other Discussion

List archive

[perfsonar-user] Another IPv6 throughput test problem

From: Brian Candler <>
To: "" <>
Subject: [perfsonar-user] Another IPv6 throughput test problem
Date: Sun, 8 Sep 2019 18:57:05 +0100
Domainkey-signature: a=rsa-sha1; c=nofws; d=pobox.com; h=to:from:subject :message-id:date:mime-version:content-type; q=dns; s=sasl; b=twx G0FeycE3SMVAa13PvbPSLJPvXCn93gS7+tgQuRQ6q1/kJwbVIxjvam4rrdLpUIjB WAzp6MefwRiHKy78CdfeqCBsD5Txfr3e5AFPmyq5YsSDZ8BGtpwjsfbjZYZKGBoX ILR7AbcT1kYm3+K6oB0PqDpdQKoMu6RzrFQamtjY=

Just to make things more fun, I also had a third (completely different) problem with inbound tests not working from another host. That remote host is running ps-toolkit 4.0.2 under CentOS 6.

The symptom was that again outbound tests were fine; but this time, inbound tests were not even being scheduled.

It turns out this was another problem caused by IPv6, and again it was a long problem to debug, but I've put the thread below in case anyone is interested in more perfsonar internals.

TL;DR version is:

- the local GUI updates tests in a file, while a background process picks up the file and converts it into pscheduler requests. Any errors happen in the background, and are not reported back in the GUI. (Also, a change in 4.2 means the error detail is removed from the log file too)

- when you schedule an throughput test, the scheduled task for the inbound direction is actually installed on the *remote* node. That is, you are reliant on the third-party scheduler implementing your requested schedule.

- perfsonar 4.0 toolkit makes a call back to the requesting host to see whether it's running pscheduler or bwctl

- this callback fails if the other node's address is an IPv6 literal - the server returns a 400 response. The code to make the callback has been fixed in perfsonar 4.2, so it's not important to fix going forward, but it is very simple to reproduce.

(I'm not sure why it was using a literal, but I suspect it's because the originating node has forward DNS but not reverse DNS).

Cheers,

Brian Candler.

-=-=-=-=-

LONG VERSION:

The remote host is running toolkit 4.0.2. This time, the problem is that the inbound task is not scheduled at all. In fact, when I add the remote host in the GUI, then check the "task" table in my local pscheduler postgres database, only one new task has been added. But if I add a different remote host (toolkit 4.1.6), two tasks are added - one inbound, one outbound.

In the GUI, no error is displayed. It turns out that the GUI just writes out some XML and JSON, and relies on a background task (psconfig-pschduler-agent) to pick this up and apply this to the database. I found an error in /var/log/perfsonar/psconfig-pscheduler-agent-transactions.log:

2019/09/08 15:37:40 INFO guid=7910e517-d841-46e0-86e1-32474618ec60 action="create" test_type=throughput lead_url=https://wrn-perf1.CCCC.net/pscheduler checksum=Zb3vQyp2KcNPld8yKyvoJg task={"tools":["bwctliperf3","iperf3"],"archives":[{"archiver":"esmond","data":{"measurement-agent":"A:A:A:A::45","url":"https://[A:A:A:A::45]/esmond/perfsonar/archive/","_auth-token":"b49e1d83e44321a56a6aa8d6ebb7957bb1bb9f35"}}],"reference":{"psconfig":{"created-by":{"uuid":"b042dea5-40fd-4c63-9b0b-1e0f63bf59a6","user-agent":"psconfig-pscheduler-agent"}}},"schedule":{"slip":"PT43200S","repeat":"PT86400S","sliprand":true,"until":"2019-09-10T15:37:29Z"},"test":{"spec":{"dest-node":"[A:A:A:A::45]","ip-version":6,"duration":"PT20S","source-node":"wrn-perf1.CCCC.net","schema":1,"source":"wrn-perf1.CCCC.net","dest":"A:A:A:A::45"},"type":"throughput"}}2019/09/08 15:37:42 ERROR guid=7910e517-d841-46e0-86e1-32474618ec60 action="create" msg=Problem adding test throughput(wrn-perf1.CCCC.net->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR2019/09/08 15:37:42 INFO guid=7910e517-d841-46e0-86e1-32474618ec60 action="create" msg=Done creating tasks

Perfsonar just ignores the problem and continues regardless :-(

Unfortunately, "INTERNAL SERVER ERROR" doesn't tell me anything useful. If this error is from the local esmond, nothing is logged in apache error log, nor in /var/log/esmond/*.log

I added some debugging, and found that this post is being made to /pscheduler/tasks on the *remote* host. It seems that if I request in my local GUI to have a regular throughput test to a third-party host, the scheduled task for inbound throughput is created on *their* system. That is: I have to trust that other system to run the scheduled task on my behalf, at the interval I requested!

Anyway, now I need to look on the remote host for the error (which fortunately I also control). Over there, apache ssl_access_log shows a 500, but there's nothing in ssl_error_log.

With some more debugging I was able to reproduce the request with curl and see the response body:

< HTTP/1.1 500 INTERNAL SERVER ERROR...Unable to determine participants: Can't find pScheduler or BWCTL on [A:A:A:A::45]

Aha. That would have been a much more helpful message to see. I've made a PR to fix that - clearly it ought to be logged. Now it is:

2019/09/08 16:34:17 ERROR guid=e09abb00-aed4-41c3-b963-2ac3953713f7 action="create" msg=Problem adding test throughput(wrn-perf1.CCCC.net->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR: Unable to determine participants: Can't find pScheduler or BWCTL on [A:A:A:A::45]

(Aside: it *was* logged in perfsonar 4.0, this ability was removed in May 2019 in commit 97fbb726)

The next question is: why is this error happening? Firewall is completely open from that remote host, and tcpdump shows it calling back on port 443 successfully.

Looking on the remote host (remember that's the old perfsonar 4.0.2 node), it has been logging this error for other hosts too, for some of its meshconfig tests.

This error comes from pscheduler-server/pscheduler-server/api-server/pschedulerapiserver/tasks.py, and at that point it invokes external command:

pscheduler internal invoke test <type> participants # with the test spec on stdin

On the 4.0.2 node, /var/www/pscheduler-server/pschedulerapiserver/tasks.pyc exists but not tasks.py. That's *really* annoying, as I can't look at or modify the source. But it runs /usr/libexec/pscheduler/classes/test/throughput/participants, and that's where the message "Can't find pScheduler or BWCTL on %s" comes from.

It's calling pscheduler.api_has_services(). Adding some more debugging, I see it returns {'bwctl': False, 'pscheduler': False} for my 4.2 node - which is clearly wrong. And the problem is, it doesn't work with IPv6 address literals. Simple reproducer on perfsonar 4.0:

>> import pscheduler>>> pscheduler.api_has_pscheduler("albq-pt1-v6.es.net")True>>> pscheduler.api_has_pscheduler("[2001:400:4201:1158::3]")False

>>> pscheduler.url_get(pscheduler.api_url_hostport('2001:400:4201:1158::3'),timeout=3,throw=False,json=False,bind=None)(400, u'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>400 Bad Request</title>\n</head><body>\n<h1>Bad Request</h1>\n<p>Your browser sent a request that this server could not understand.<br />\n</p>\n<p>Additionally, a 400 Bad Request\nerror was encountered while trying to use an ErrorDocument to handle the request.</p>\n</body></html>\n')>>> pscheduler.api_url_hostport('2001:400:4201:1158::3')'https://[2001:400:4201:1158::3]/pscheduler/'>>> pscheduler.url_get(pscheduler.api_url_hostport('albq-pt1-v6.es.net'),timeout=3,throw=False,json=False,bind=None)(200, u'"This is the pScheduler API server on albq-pt1-v6.es.net (albq-pt1-v4.es.net)."\n')

Oddly, the 400 error is coming from the remote side. But if I use a full hostname instead of an IPv6 literal it's fine: note the 200 response to the second request below.

C:C:C:C::231 - - [08/Sep/2019:17:33:05 +0000] "GET /pscheduler/ HTTP/1.1" 400 1279 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-754.15.3.el6.x86_64"C:C:C:C::231 - - [08/Sep/2019:17:33:25 +0000] "GET /pscheduler/ HTTP/1.1" 200 1555 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-754.15.3.el6.x86_64"

Sadly I can't tcpdump the requests to see what's different, because it's SSL.

But what I can do is test whether the code has been fixed in perfsonar 4.2, and it has:

>>> pscheduler.url_get(pscheduler.api_url_hostport('2001:400:4201:1158::3'),timeout=3,throw=False,json=False,bind=None)(200, '"This is the pScheduler API server on 2001:400:4201:1158::3 (albq-pt1.es.net)."\n') >>> pscheduler.api_has_pscheduler("[2001:400:4201:1158::3]") True

Either that, or the problem was with the underlying python2.6 libraries in CentOS 6, and a newer OS with python2.7 has fixed it.

[perfsonar-user] Another IPv6 throughput test problem, Brian Candler, 09/08/2019
- Re: [perfsonar-user] Another IPv6 throughput test problem, Mark Feit, 09/09/2019
  - Re: [perfsonar-user] Another IPv6 throughput test problem, Brian Candler, 09/09/2019

List archive

[perfsonar-user] Another IPv6 throughput test problem