This started off as a request for help, but has ended up as a
sort of meta bug report: as well as fixing the issue, I would like
to highlight how hard it is to get useful debugging information
out of pscheduler.
The issue is between two nodes. Node A runs perfsonar-toolkit 4.2 in an Ubuntu 18.04 lxd container; node B runs perfsonar-testpoint 4.1 under Debian. Both nodes have NTP-synced clocks.
On node A, I configured a throughput test to B, with v6-only selected (since node A is behind NAT for IPv4). The problem seen in the GUI is that results are shown for the outbound direction, but nothing for the inbound direction (and no red dot for a 'failed test' either).
Logs on A and B didn't really seem to show any reason (*).
Firewall rules allow all traffic between A and B over IPv6. I was able to run tests using iperf3 -s/-c in both directions, without problems.
So I tried using pscheduler to run tests from the command line -
and sure enough, I found that tests in the outbound direction
worked, but tests in the inbound direction did not.
Here's the outbound test command, which works:
# pscheduler task throughput -d ns1.BBBB.com -s perf1.home.AAAA.net --ip-version=6
And here's the inbound test command, which fails: the same command with -s and -d reversed, plus --debug added.
--------
# pscheduler task --debug throughput -s ns1.BBBB.com -d perf1.home.AAAA.net --ip-version=6
2019-09-01T16:17:26 Debug started
2019-09-01T16:17:26 Assistance is from localhost
2019-09-01T16:17:26 Forcing default slip of PT5M
2019-09-01T16:17:26 Converting to spec via https://localhost/pscheduler/tests/throughput/spec
Submitting task...
2019-09-01T16:17:27 Fetching participant list
2019-09-01T16:17:27 Spec is: {"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}
2019-09-01T16:17:27 Params are: {'spec': '{"dest": "perf1.home.AAAA.net", "source": "ns1.BBBB.com", "ip-version": 6, "schema": 1}'}
2019-09-01T16:17:28 Got participants: {u'participants': [u'ns1.BBBB.com', u'perf1.home.AAAA.net']}
2019-09-01T16:17:28 Lead is ns1.BBBB.com
2019-09-01T16:17:28 Pinging https://ns1.BBBB.com/pscheduler/
2019-09-01T16:17:28 ns1.BBBB.com is up
2019-09-01T16:17:28 Posting task to https://ns1.BBBB.com/pscheduler/tasks
2019-09-01T16:17:28 Data is {'test': {'type': 'throughput', 'spec': {u'dest': u'perf1.home.AAAA.net', u'source': u'ns1.BBBB.com', u'ip-version': 6, u'schema': 1}}, 'schedule': {'slip': 'PT5M'}}
Task URL: https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Posted https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:40 Submission diagnostics:
2019-09-01T16:17:40 Hints:
2019-09-01T16:17:40 requester: A:A:A:A::45
2019-09-01T16:17:40 server: B:B:B:B:feff:ff:fe00:f16
2019-09-01T16:17:40 Identified as everybody
2019-09-01T16:17:40 Classified as default
2019-09-01T16:17:40 Application: Defaults applied to non-friendly hosts
2019-09-01T16:17:40 Group 1: Limit 'innocuous-tests' failed: Passed but inverted
2019-09-01T16:17:40 Group 1: Limit 'throughput-default-time' passed
2019-09-01T16:17:40 Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-09-01T16:17:40 Group 1: Want any, 1/3 passed, 2/3 failed: PASS
2019-09-01T16:17:40 Application PASSES
2019-09-01T16:17:40 Proposal meets limits
Running with tool 'iperf3'
Fetching first run...
2019-09-01T16:17:40 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first
2019-09-01T16:17:42 Handing off: pscheduler watch --first --debug https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:42 Debug started
2019-09-01T16:17:42 Fetching https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7
2019-09-01T16:17:43 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/first
Next scheduled run: https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550
Starts 2019-09-01T17:17:50+01:00 (~6 seconds)
Ends 2019-09-01T17:18:09+01:00 (~18 seconds)
Waiting for result...
2019-09-01T16:18:14 Calling ['pscheduler', 'result', '--quiet', '--format', 'text/plain', '--debug', u'https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/051b6594-67e1-44eb-891f-a80a3d195550']
Run did not complete: Failed
Diagnostics from ns1.BBBB.com:
No diagnostics.
Error from ns1.BBBB.com:
iperf3 returned an error: error - unable to connect to server: Connection refused
Diagnostics from perf1.home.AAAA.net:
Run was preempted.
Error from perf1.home.AAAA.net:
No error.
2019-09-01T16:18:15 Fetching next run from https://ns1.BBBB.com/pscheduler/tasks/b68280cb-f9ad-4b74-bfda-7bb0e11063c7/runs/next
No further runs scheduled.
--------
No useful indication of the problem, other than "Connection refused" and "Run was preempted.". Furthermore, running "ps | grep iperf3" on both nodes during the test window, I didn't see any iperf3 process running on either side.
Running tcpdump on node A, I see an inbound TCP connection
attempt which is refused, agreeing with the error above.
--------
root@perf1:~# tcpdump -i eth0 -nn tcp port 5201
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:01:40.220897 IP6 B:B:B:B:feff:ff:fe00:f16.52707 > A:A:A:A::45.5201: Flags [S], seq 3700706462, win 28800, options [mss 1440,sackOK,TS val 338337963 ecr 0,nop,wscale 7], length 0
17:01:40.221105 IP6 A:A:A:A::45.5201 > B:B:B:B:feff:ff:fe00:f16.52707: Flags [R.], seq 0, ack 3700706463, win 0, length 0
--------
More strangely: after replacing /usr/bin/iperf3 on node A with a shell script which writes to /tmp, I could see that iperf3 wasn't being run at all for inbound tests (but it *is* being run for tests in the outbound direction).
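For anyone who wants to reproduce this trick, here is a minimal sketch of the kind of logging wrapper I mean, written in Python rather than shell; the path /usr/bin/iperf3.real is my own naming convention for where the real binary gets moved, not anything pscheduler expects:

```python
#!/usr/bin/env python
# Hypothetical stand-in for /usr/bin/iperf3.  Install it only after moving
# the real binary aside, e.g.:  mv /usr/bin/iperf3 /usr/bin/iperf3.real
# It appends every invocation to a file in /tmp, then hands over to the
# real binary so the test still runs.
import os
import sys
import time

with open("/tmp/iperf3-calls.log", "a") as log:
    log.write("%s argv=%r\n" % (time.ctime(), sys.argv[1:]))

real = "/usr/bin/iperf3.real"
if os.access(real, os.X_OK):
    # Replace this process with the real iperf3, preserving arguments.
    os.execv(real, [real] + sys.argv[1:])
```

If /tmp/iperf3-calls.log stays empty during a test window, the runner never even tried to start the tool, which is exactly what I observed for inbound tests.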
Next, I found the pid of the runner process on node A:
pschedu+ 739 0.2 1.1 1168772 89124 ? Ssl Aug31 0:45 /usr/bin/python /usr/lib/pscheduler/daemons/runner --daemon --pid-file /var/run/pscheduler-server/pscheduler-runner.pid --dsn @/etc/pscheduler/database/database-dsn
and attached strace -f -s 128 -p <pid>.
For inbound tests, I could see *no* exec taking place! Yet for outbound tests, I can see it happily cloning and exec'ing various processes, including /usr/bin/iperf3. So for some reason, pscheduler isn't even attempting to run the iperf3 -s process on node A.
/var/log/pscheduler/pscheduler.log just says:
Sep 1 16:29:24 perf1 runner INFO 6945: Running https://ns1.BBBB.com/pscheduler/tasks/f7e0b30d-e247-4507-b2eb-b9c641abf8a5/runs/59bf7b96-94a0-404e-9557-3ff1e62e2a76
Sep 1 16:29:24 perf1 runner INFO 6945: With iperf3: throughput --source ns1.BBBB.com --ip-version 6 --dest perf1.home.AAAA.net
Sep 1 16:29:24 perf1 runner INFO 6945: Run was preempted.
And looking through the python code, all I find for this error is
in /usr/lib/pscheduler/daemons/runner:
db = self._get_db_conn()
with db.cursor() as cursor:
    cursor.execute("SELECT run_can_proceed(%s)", [self.id])
    assert(cursor.rowcount == 1)
    if not cursor.fetchone()[0]:
        if log_details:
            self.log.info("%d: Run was preempted.", self.id)
But why does run_can_proceed not return true? Could it be a
database problem? /var/log/postgresql/postgresql-10-main.log only
shows:
--------
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler ERROR: duplicate key value violates unique constraint "heartbeat_pkey"
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler DETAIL: Key (name)=(scheduler) already exists.
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler CONTEXT: SQL statement "INSERT INTO heartbeat (name) VALUES ('ticker'), ('scheduler'), ('runner'), ('archiver')"
    PL/pgSQL function heartbeat_cold_boot() line 4 at SQL statement
    SQL statement "SELECT heartbeat_cold_boot()"
    PL/pgSQL function cold_boot() line 3 at PERFORM
2019-08-31 12:39:50.832 UTC [845] pscheduler@pscheduler STATEMENT: SELECT cold_boot()
2019-08-31 12:39:51.089 UTC [845] pscheduler@pscheduler LOG: could not receive data from client: Connection reset by peer
2019-08-31 12:39:51.090 UTC [846] pscheduler@pscheduler LOG: could not receive data from client: Connection reset by peer
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond ERROR: null value in column "source" violates not-null constraint
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond DETAIL: Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).
2019-08-31 21:57:34.485 UTC [30594] esmond@esmond STATEMENT: INSERT INTO "ps_p2p_subject" ("metadata_id", "tool_name", "source", "destination", "measurement_agent", "input_source", "input_destination") VALUES (8, 'pscheduler/iperf3', NULL, 'A:A:A:A::45'::inet, NULL, 'ns1.BBBB.com', 'A:A:A:A::45') RETURNING "ps_p2p_subject"."id"
2019-09-01 02:30:29.021 UTC [9364] esmond@esmond LOG: could not receive data from client: Connection reset by peer
--------
- but these are error messages from yesterday and the day before.
Does the stored procedure exist? Yes.
pscheduler=# select prosrc from pg_proc where proname='run_can_proceed';

BEGIN

    IF NOT EXISTS (SELECT * FROM run WHERE id = run_id)
    THEN
        RAISE EXCEPTION 'No such run.';
    END IF;

    RETURN NOT EXISTS (
        SELECT *
        FROM
            run run1
            JOIN task task1 ON task1.id = run1.task
            JOIN test test1 ON test1.id = task1.test
            JOIN scheduling_class scheduling_class1
                ON scheduling_class1.id = test1.scheduling_class
            JOIN run run2
                ON run2.times && run1.times
                AND run2.id <> run1.id
                AND run2.priority > run1.priority
                AND NOT run_state_is_finished(run2.state)
        WHERE
            run1.id = run_id
            AND NOT scheduling_class1.anytime
    );

END;
(1 row)
Looking at recent test runs, I can see the *scheduled* throughput
tests that I made between A and B:
pscheduler=# select run.id, task, times, state, errors, status
             from run join task on run.task=task.id
             join test on task.test=test.id
             where test.name='throughput' order by times desc limit 5;

  id  | task |                        times                        | state | errors | status
------+------+-----------------------------------------------------+-------+--------+--------
 6675 |   12 | ["2019-09-02 12:54:32+00","2019-09-02 12:55:01+00") |     1 |        |
 6939 |   16 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |     1 |        |
 5427 |   16 | ["2019-09-02 04:30:31+00","2019-09-02 04:31:00+00") |     1 |        |
 5170 |   12 | ["2019-09-02 04:23:37+00","2019-09-02 04:24:06+00") |     1 |        |
 5186 |   16 | ["2019-09-02 04:21:11+00","2019-09-02 04:21:40+00") |     1 |        |
(5 rows)
pscheduler=# select run_can_proceed(6675);
run_can_proceed
-----------------
t
(1 row)
pscheduler=# select run_can_proceed(6939);
run_can_proceed
-----------------
f
(1 row)
"select json from task where id=16;" shows that task 16 is the scheduled inbound test, the one which isn't taking place, bringing me back to the original problem.
Duplicating the query from run_can_proceed:
SELECT run1.id, run1.times, run1.priority, run1.state,
       run2.id, run2.times, run2.priority, run2.state
FROM
    run run1
    JOIN task task1 ON task1.id = run1.task
    JOIN test test1 ON test1.id = task1.test
    JOIN scheduling_class scheduling_class1
        ON scheduling_class1.id = test1.scheduling_class
    JOIN run run2
        ON run2.times && run1.times
        AND run2.id <> run1.id
        AND run2.priority > run1.priority
        AND NOT run_state_is_finished(run2.state)
WHERE
    run1.id = 6939
    AND NOT scheduling_class1.anytime;
  id  |                        times                        | priority | state |  id  |                        times                        | priority | state
------+-----------------------------------------------------+----------+-------+------+-----------------------------------------------------+----------+-------
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4224 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4221 | ["2019-09-01 12:48:24+00","2019-09-02 12:48:24+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4832 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
 6939 | ["2019-09-02 12:45:00+00","2019-09-02 12:45:29+00") |        0 |     1 | 4828 | ["2019-09-01 15:27:17+00","2019-09-02 15:27:17+00") |        5 |     3
(4 rows)
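To spell out my reading of run_can_proceed: a run is blocked when any other unfinished run overlaps it in time with a strictly higher priority. Here is a minimal Python sketch of that predicate, with my own field names (including a "finished" flag standing in for run_state_is_finished); this is an illustration, not pscheduler code:

```python
# Sketch of the preemption predicate as I understand it from the SQL above.
from datetime import datetime

def overlaps(a, b):
    # Half-open interval overlap, like PostgreSQL's tstzrange && operator.
    return a[0] < b[1] and b[0] < a[1]

def can_proceed(run, others):
    # A run proceeds unless some other unfinished, higher-priority run
    # overlaps its scheduled time range.
    return not any(
        overlaps(run["times"], other["times"])
        and other["priority"] > run["priority"]
        and not other["finished"]
        for other in others
    )

# Run 6939 (the failing inbound test) vs one of the day-long priority-5 runs:
run_6939 = {"times": (datetime(2019, 9, 2, 12, 45, 0),
                      datetime(2019, 9, 2, 12, 45, 29)),
            "priority": 0}
run_4224 = {"times": (datetime(2019, 9, 1, 12, 48, 24),
                      datetime(2019, 9, 2, 12, 48, 24)),
            "priority": 5, "finished": False}

print(can_proceed(run_6939, [run_4224]))  # False: preempted
```

If that reading is right, run 6939 is preempted simply because those day-long, priority-5 runs (4224, 4221, 4832, 4828) overlap its 29-second window and aren't in a finished state.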
OK, at this point I'm stumped. I think I've dug further into the innards of pscheduler than any end-user is really supposed to go. And all I've found is: both ad-hoc and scheduled tests are being dropped on the floor with the message "Run was preempted.", and I believe this is because run_can_proceed() is returning false.
Going back to what I said at the start: whilst I would appreciate
hints to help fix the specific problem here, I also think that
pscheduler could do a better job of reporting problems. If it
decides that it's not a good idea to launch a tool, I think it
should say *why* it has decided this. Or at least document this
error message: googling "+pscheduler run was preempted" turns up
nothing.
Regards,
Brian Candler.
(*) Actually, on node A, I did find a single instance of an error
in /var/log/esmond/django.log:
2019-08-31 21:57:34,486 [ERROR] /usr/lib/python2.7/dist-packages/django/core/handlers/exception.py: Internal Server Error: /esmond/perfsonar/archive/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 249, in _legacy_get_response
    response = self._get_response(request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/viewsets.py", line 95, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 494, in dispatch
    response = self.handle_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 454, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/lib/python2.7/dist-packages/rest_framework/views.py", line 491, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 876, in create
    return super(ArchiveViewset, self).create(request)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/usr/lib/python2.7/dist-packages/rest_framework/mixins.py", line 26, in perform_create
    serializer.save()
  File "/usr/lib/python2.7/dist-packages/rest_framework/serializers.py", line 214, in save
    self.instance = self.create(validated_data)
  File "/usr/lib/python2.7/dist-packages/esmond/api/perfsonar/api_v2.py", line 674, in create
    self.get_fields()[subject_model].Meta.model.objects.create(metadata=metadata, **subject)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 394, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 808, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 838, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 924, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 963, in _do_insert
    using=using, raw=raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 1079, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 1112, in execute_sql
    cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/lib/python2.7/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/lib/python2.7/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
IntegrityError: null value in column "source" violates not-null constraint
DETAIL: Failing row contains (8, pscheduler/iperf3, null, A:A:A:A::45, null, ns1.BBBB.com, A:A:A:A::45, 8).
And a pscheduler error:
/var/log/perfsonar/psconfig-pscheduler-agent-transactions.log:
2019/08/31 15:27:50 ERROR guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e action="create" msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR

/var/log/perfsonar/psconfig-pscheduler-agent.log:
2019/08/31 15:28:24 WARN pid=887 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=227 guid=8df7d348-85f9-4bdb-9ce9-62db18bc4f3e msg=Problem adding test throughput(ns1.BBBB.com->A:A:A:A::45), continuing with rest of config: INTERNAL SERVER ERROR
But I think those are likely red herrings: quite a few scheduled throughput tests should have run by now, yet there was only a single instance of each of those errors, and they were at different timestamps too.