perfsonar-user - Re: [perfsonar-user] Problem with PS node move
- From: Casey Russell <>
- To: "Garnizov, Ivan" <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Problem with PS node move
- Date: Fri, 12 Apr 2019 10:59:26 -0500
Ivan,

I wouldn't say there are necessarily problems with the perfSONAR documentation. Generally I find it pretty helpful. It's just that in this very specific case I couldn't find what I needed: as you can see, the area that gives examples of how to run pscheduler tests by hand doesn't have a specific example for latencybg, and using the built-in --help I got close, but the run still seems to fail, even between hosts that are working. I did actually try the import-task approach a time or two even before you mentioned it, but it assumes you HAVE a (working) exported JSON task to work with, and when I tried using the task definitions created by psconfig and offered up by the API, those also failed.

Taking some cues from your email this morning, I stripped everything out of a task from the API that was related to psconfig (archivers, reference numbers, schedules and such) and cut it down to just this:

{
  "schema": 1,
  "test": {
    "spec": {
      "bucket-width": 0.001,
      "data-ports": {
        "lower": 8760,
        "upper": 9960
      },
      "dest": "ps-ksu-lt.perfsonar.kanren.net",
      "dest-node": "ps-ksu-lt.perfsonar.kanren.net",
      "flip": false,
      "ip-version": 6,
      "packet-count": 600,
      "packet-interval": 0.1,
      "packet-padding": 0,
      "schema": 1,
      "source": "ps-esu-lt.perfsonar.kanren.net",
      "source-node": "ps-esu-lt.perfsonar.kanren.net"
    },
    "type": "latencybg"
  },
  "tool": "powstream"
}

The --import switch takes that just fine, except that the default length of the test is to run for a full day (86399 seconds), and I can't seem to figure out where in that JSON task definition to stick the -t (duration) statement to shorten it. Every time I do, it fails with:

[crussell@ps-esu-bw ~]$ pscheduler task --debug --import testtask-ksu.json latencybg
2019-04-12T10:36:30 Debug started
2019-04-12T10:36:30 Assistance is from localhost
Invalid JSON: No JSON object could be decoded

and all /var/log/perfsonar/pscheduler.log shows is:

Apr 12 10:36:30 ps-esu-bw journal: task DEBUG Debug started
Apr 12 10:36:30 ps-esu-bw journal: task DEBUG Assistance is from localhost

On the question of whether or not pscheduler spawns a powstream process, I'm not sure I know a command that will tell me for that particular instance of pscheduler. The boxes I'm testing with are part of a mid-sized mesh and have 70+ powstream processes running at any given time, and that number fluctuates up and down every 4-5 seconds.
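As an aside on the duration question: in the hand-run task below, the CLI ends up putting the duration inside the test spec itself ("duration": "PT30S"). Based on that, a trimmed, untested sketch of the imported task with a 30-second duration added to the spec (all other values copied from the task above) would be:

{
  "schema": 1,
  "test": {
    "spec": {
      "dest": "ps-ksu-lt.perfsonar.kanren.net",
      "dest-node": "ps-ksu-lt.perfsonar.kanren.net",
      "source": "ps-esu-lt.perfsonar.kanren.net",
      "source-node": "ps-esu-lt.perfsonar.kanren.net",
      "duration": "PT30S",
      "ip-version": 6,
      "packet-count": 600,
      "packet-interval": 0.1,
      "schema": 1
    },
    "type": "latencybg"
  },
  "tool": "powstream"
}

Note that "Invalid JSON: No JSON object could be decoded" comes from the JSON parser itself, so it usually points to a syntax slip in the edited file (a stray or missing comma or brace) rather than to a rejected field.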
But in case it helps, here is the debug output from a hand-run instance of a latencybg test (30 seconds only) that fails between two hosts that ARE working with each other in the mesh (but fail the handrun test).

[crussell@ps-esu-bw ~]$ pscheduler task --debug latencybg -t PT30S --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 Debug started
2019-04-12T10:41:22 Assistance is from localhost
2019-04-12T10:41:22 Forcing default slip of PT5M
2019-04-12T10:41:22 Converting to spec via https://localhost/pscheduler/tests/latencybg/spec
Submitting task...
2019-04-12T10:41:22 Fetching participant list
2019-04-12T10:41:22 Spec is: {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}
2019-04-12T10:41:22 Params are: {'spec': '{"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}'}
2019-04-12T10:41:22 Got participants: {u'participants': [u'ps-esu-lt.perfsonar.kanren.net']}
2019-04-12T10:41:22 Lead is ps-esu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 Pinging https://ps-esu-lt.perfsonar.kanren.net/pscheduler/
2019-04-12T10:41:22 ps-esu-lt.perfsonar.kanren.net is up
2019-04-12T10:41:22 Posting task to https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks
2019-04-12T10:41:22 Data is {"test": {"type": "latencybg", "spec": {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}}, "schema": 1, "schedule": {"slip": "PT5M"}}
Task URL:
2019-04-12T10:41:24 Posted https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:24 Submission diagnostics:
2019-04-12T10:41:24 Hints:
2019-04-12T10:41:24 requester: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24 server: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24 Identified as everybody, local-interfaces, KanREN-PS, r-and-e
2019-04-12T10:41:24 Classified as default, friendlies
2019-04-12T10:41:24 Application: Hosts we trust to do everything
2019-04-12T10:41:24 Group 1: Limit 'always' passed
2019-04-12T10:41:24 Group 1: Want all, 1/1 passed, 0/1 failed: PASS
2019-04-12T10:41:24 Application PASSES
2019-04-12T10:41:24 Application: Defaults applied to non-friendly hosts
2019-04-12T10:41:24 Group 1: Limit 'innocuous-tests' passed
2019-04-12T10:41:24 Group 1: Limit 'throughput-default-tcp' failed: Test is not 'throughput'
2019-04-12T10:41:24 Group 1: Limit 'throughput-default-udp' failed: Test is not 'throughput'
2019-04-12T10:41:24 Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-04-12T10:41:24 Group 1: Want any, 1/4 passed, 3/4 failed: PASS
2019-04-12T10:41:24 Application PASSES
2019-04-12T10:41:24 Proposal meets limits
Running with tool 'powstream'
Fetching first run...
2019-04-12T10:41:24 Fetching https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/first
2019-04-12T10:41:25 Handing off: pscheduler watch --first --format text/plain --debug https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:25 Debug started
2019-04-12T10:41:25 Fetching https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:25 Fetching next run from https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/first
Next scheduled run:
Starts 2019-04-12T15:41:34Z (~8 seconds)
Ends 2019-04-12T15:42:04Z (~29 seconds)
Waiting for result...
Run has not completed.
2019-04-12T10:42:35 Fetching next run from https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/next
No further runs scheduled.

Could this failure be occurring because latencybg/powstream is a daemon and there is already a (day-long) latencybg test running between the hosts?

On Fri, Apr 12, 2019 at 3:58 AM Garnizov, Ivan <> wrote:

Hello Casey,
“I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).”
I was not aware that there are problems with the pS documentation. These can certainly be reported on the pS issue tracker; I can do this, but I guess you can as well. In any case, more concrete information is needed.
Here I am attaching a task specification based on the configuration you shared (the URL for the runs).
You can use it as described in the documentation ;) http://docs.perfsonar.net/pscheduler_client_tasks.html#importing-tasks-from-json
pscheduler task --debug --import kanren-latbg-task.json latencybg
Of course you can adjust the fields however you like; it goes without saying, but this JSON is prepared exactly as it was submitted to pscheduler.
Also, in my previous email I asked whether pscheduler produced/spawned a powstream process. From your response I still cannot tell.
In general, when debugging issues, I would suggest always using the '--debug' option as in my example.
The question about the overdue status relates to some findings I made myself recently.
Best regards,
Ivan
From: Casey Russell [mailto:]
Sent: Thursday, April 11, 2019 4:01 PM
To: Garnizov, Ivan (RRZE) <>
Cc:
Subject: Re: [perfsonar-user] Problem with PS node move
Ivan,
I was aware that latency and latencybg use different tools, but I was never able to find any examples in the documentation for running latencybg tasks between hosts, and when I tried with the built-in documentation, I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).
[crussell@ps-esu-bw ~]$ pscheduler task latencybg --duration 10 --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
Submitting task...
Task URL:
Running with tool 'powstream'
Fetching first run...
Next scheduled run:
Starts 2019-04-11T13:56:06Z (~7 seconds)
Ends 2019-04-11T13:56:16Z (~9 seconds)
Waiting for result...
Run has not completed.
No further runs scheduled.
So, unfortunately, I haven't replicated (exactly) the test that's failing; I used owping, since (as I understand it) it's the closest to replicating powstream and its related ports and protocol setups.
As for the overdue status messages, I generally have been looking at those runs in the API when I get in in the morning, so it will have been several hours after the mesh config (psconfig) kicked them off in the early morning hours.
On Thu, Apr 11, 2019 at 7:16 AM Garnizov, Ivan <> wrote:
Hello Casey,
I had a look into your task spec and the run.
As you know, “latency” and “latencybg” tests use different tools. You can also request a “latencybg” measurement from the CLI. The task will then be given an ID, and there must be a process spawned to run this measurement.
Are you able to verify this?
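One way to check, as a sketch (the grep pattern is illustrative; it just narrows the running powstream processes down to the destination in question):

# On the source/lead host, look for a powstream spawned toward this particular destination
ps aux | grep '[p]owstream' | grep ps-ksu-lt.perfsonar.kanren.net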
When are you getting this overdue status? Immediately after the request, or only later?
Regards,
Ivan Garnizov
GEANT WP6T3: pS development team
GEANT WP7T1: pS deployments GN Operations
GEANT WP9T2: Software governance in GEANT
From: [mailto:] On Behalf Of Casey Russell
Sent: Wednesday, April 10, 2019 4:41 PM
To:
Subject: Re: [perfsonar-user] Problem with PS node move
Group,
An update here, and another request for assistance. I still have the problem with my (moved) host. The other hosts in the mesh still can't successfully run latency tests to it. But I have gathered a bit more info.
Review:
The host was moved across campus
subnets and IPs moved with it.
routing is good, traceroute and ping are fine.
throughput tests, traceroute tests and OUTbound latency tests to other hosts in the mesh are fine.
inbound tests to the moved host never get posted to the Central MA.
inbound tests to the moved host DO get created on the other hosts in the mesh as proved by "pscheduler schedule"
I can run "one off" latency tests from remote hosts INTO the moved host by hand just fine.
New info:
So I've done some more digging around in the API and discovered that the inbound latencybg tests get created on my remote hosts, but never seem to generate any "runs" or any "results posted" entries in the pscheduler.log file. Here are the URLs for one of my testing hosts. The first is a latencybg test created this morning at 05:41 to the moved host. Notice it only ever generates a single run. It also never generates any "results posted" entries in pscheduler.log.
The second is a similar latencybg test, also created at 05:41 this morning, to another host in the mesh. It has created and posted many runs in the pscheduler.log file.
Does anyone have any insight into why that first run is failing/sticking? Any thoughts on how to see what's going on with it, or next troubleshooting steps?
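One way to dig into that stuck first run, sketched here against the pscheduler REST API (the task and run UUIDs below are placeholders, not values from this thread):

# List the run URLs pscheduler has created for the task
curl -s -k "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs"

# Fetch one run's JSON, which includes its state and any error text
curl -s -k "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs/RUN-UUID"

# pscheduler can also render the result for a run given its full URL
pscheduler result "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs/RUN-UUID"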
On Wed, Apr 3, 2019 at 12:21 PM Casey Russell <> wrote:
Group,
On 3/23 we moved one of our PS nodes from one building on a campus to another. The subnets moved with it, and it went from being direct-attached to one KanREN router, to being direct-attached to another KanREN router. After that move we observed the following change:
All tests in our mesh continued to operate normally except for latency tests Inbound to this node. If you look at our dashboard, you'll note that IPv4 and IPv6 latency tests work fine when initiated outbound from KSU (the node we moved), but fail when every other host in the mesh tries to initiate a test inbound. My first thought was an ACL didn't get applied properly, but I've reviewed them and they seem sane. For reference, we use the host-based firewall and Juniper MX firewall filters on the router side to secure the hosts. The Juniper firewall rule seems the same as it was before, and I can't see any reason the host-based filtering would have changed with a physical move.
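If the ACLs are suspect, one quick check is to probe the relevant ports from a remote mesh host toward the moved node. This is only a sketch: it assumes nmap is installed and that owampd is listening on its default control port (TCP 861); the UDP range comes from the data-ports in the latencybg tasks shown in the schedule output below.

# Is the OWAMP control port reachable on the moved node?
nmap -Pn -p 861 ps-ksu-lt.perfsonar.kanren.net

# Spot-check part of the UDP test-port range the latencybg tasks use (8760-9960); -sU needs root
sudo nmap -Pn -sU -p 8760-8770 ps-ksu-lt.perfsonar.kanren.net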
Strangely, the psconfig-pscheduler-agent.log on one of our non-KSU hosts reports on 03/21 (two days before the move) that it scheduled 70 tasks. Yesterday (and every day since the move) it reports scheduling the same number of tasks. So the external hosts don't seem to have a problem reaching KSU or setting up the tests initially. However, the latency tests don't show any results in MaDDash or in the individual hosts' test results.
2019/03/21 18:01:27 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=3060949A-4C2C-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
2019/04/02 17:52:39 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=ED1E3036-5598-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
Similarly, I don't have any problem hand-running latency tests to KSU from our outside hosts using owping or any of the other latency tools (I don't have the output from a latencybg test since its default behavior is to run for a full day).
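For reference, a hand-run check of that kind might look like the sketch below; the packet count and interval mirror the mesh's latencybg parameters (600 packets, 0.1 s apart) rather than coming from an actual command in this thread:

# One-off OWAMP latency test into the moved node, run from an external host
owping -c 600 -i 0.1 ps-ksu-lt.perfsonar.kanren.net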
To further troubleshoot, I went to one of my external hosts to verify that the "pscheduler schedule" shows tests to KSU. It does.
]# pscheduler schedule | grep -v throughput | grep -v trace | grep -A 5 -B 2 ps-ksu-lt
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 4 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
--
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 6 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
(To see the public URLs, replace localhost with ps-fhsu-lt.perfsonar.kanren.net.)
When I go to those URLs, of course, they're still running, so I'll have to try them again tomorrow to see what the results were.
Does anyone have any thoughts on what could be happening here other than an inbound ACL or MTU issue? The tests should be storing data to a central repository,