perfsonar-user - Re: [perfsonar-user] Problem with PS node move
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To: "Garnizov, Ivan" <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Problem with PS node move
- Date: Fri, 12 Apr 2019 10:46:03 -0500
Hello Casey,
“I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).”
I was not aware that there are problems with the pS documentation. These can certainly be reported on the pS issue tracker. I can do this, but I guess you can as well. In any case, more concrete information is needed.
Attached is a task specification based on the configuration you shared (the URL for the runs).
You can use it as described in the documentation ;) http://docs.perfsonar.net/pscheduler_client_tasks.html#importing-tasks-from-json
pscheduler task --debug --import kanren-latbg-task.json latencybg
You can of course adjust the fields however you like, but this JSON is prepared exactly as it was submitted to pscheduler.
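For readers without the attachment, here is a minimal sketch of what such a task file might look like. It is a reconstruction from the latencybg options shown in the "pscheduler schedule" output later in this thread, not the file that was attached. The file name latbg-sketch.json and the five-minute duration are assumptions, and the field names should be checked against the latencybg test documentation before use.

```shell
# Write a hypothetical latencybg task file (NOT the original attachment).
# Values mirror the schedule output quoted in this thread; "duration" is an
# assumed value so the task does not run for its default full day.
cat > latbg-sketch.json <<'EOF'
{
  "schema": 1,
  "test": {
    "type": "latencybg",
    "spec": {
      "schema": 1,
      "source": "ps-fhsu-lt.perfsonar.kanren.net",
      "dest": "ps-ksu-lt.perfsonar.kanren.net",
      "ip-version": 4,
      "packet-count": 600,
      "packet-interval": 0.1,
      "packet-padding": 0,
      "bucket-width": 0.001,
      "data-ports": { "lower": 8760, "upper": 9960 },
      "flip": true,
      "duration": "PT5M"
    }
  },
  "schedule": {}
}
EOF

# Import it the same way as in the command above, only if pscheduler is installed.
if command -v pscheduler >/dev/null 2>&1; then
    pscheduler task --debug --import latbg-sketch.json latencybg
fi
```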
Also, in my previous email I asked you whether pscheduler spawned a powstream process. From your response I still cannot tell.
In general, when debugging issues, I would suggest always using the '--debug' option, as in my example.
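One way to check for a spawned powstream process on the host is sketched below. It assumes the tool appears in the process table under the name powstream and that a procps-style ps is available. The function names are hypothetical helpers, not part of perfSONAR.

```shell
# Keep only "PID ARGS" lines (read from stdin) whose command name is the given
# tool, with or without a leading path. This avoids matching "grep powstream".
filter_tool() {
    awk -v tool="$1" '$2 ~ ("(^|/)" tool "$") { print }'
}

# List running processes for a tool, one "PID ARGS" line per process.
list_tool_procs() {
    ps -eo pid,args | filter_tool "$1"
}

# Usage: prints nothing if no powstream process exists.
# list_tool_procs powstream
```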
The overdue question is related to some findings I myself had recently.
Best regards,
Ivan
From: Casey Russell [mailto:]
Sent: Thursday, April 11, 2019 4:01 PM
To: Garnizov, Ivan (RRZE) <>
Cc:
Subject: Re: [perfsonar-user] Problem with PS node move
Ivan,
I was aware that latency and latencybg use different tools, but I was never able to find any examples in the documentation for running latencybg tasks between hosts. When I tried with the built-in documentation, I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).
No further runs scheduled.
[crussell@ps-esu-bw ~]$ pscheduler task latencybg --duration 10 --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
Submitting task...
Task URL:
Running with tool 'powstream'
Fetching first run...
Next scheduled run:
Starts 2019-04-11T13:56:06Z (~7 seconds)
Ends 2019-04-11T13:56:16Z (~9 seconds)
Waiting for result...
Run has not completed.
No further runs scheduled.
So, unfortunately, I haven't replicated (exactly) the test that's failing. I used owping since, as I understand it, it's the closest to replicating powstream and its related port and protocol setup.
As for the overdue status messages, I generally have been looking at those runs in the API when I get in in the morning, so it will have been several hours after the mesh config (psconfig) kicked them off in the early morning hours.
On Thu, Apr 11, 2019 at 7:16 AM Garnizov, Ivan <> wrote:
Hello Casey,
I had a look into your task spec and the run.
As you know, “latency” and “latencybg” tests use different tools. You can also request a “latencybg” measurement from the CLI. The task will then be given an ID, and a process must be spawned to run this measurement.
Are you able to verify this?
When are you getting this overdue status? Immediately after the request, or later?
Regards,
Ivan Garnizov
GEANT WP6T3: pS development team
GEANT WP7T1: pS deployments GN Operations
GEANT WP9T2: Software governance in GEANT
From: [mailto:] On Behalf Of Casey Russell
Sent: Wednesday, April 10, 2019 4:41 PM
To:
Subject: Re: [perfsonar-user] Problem with PS node move
Group,
An update here, and another request for assistance. I still have the problem with my (moved) host. The other hosts in the mesh still can't successfully run latency tests to it. But I have gathered a bit more info.
Review:
The host was moved across campus
subnets and IPs moved with it.
routing is good, traceroute and ping are fine.
throughput tests, traceroute tests and OUTbound latency tests to other hosts in the mesh are fine.
inbound tests to the moved host never get posted to the Central MA.
inbound tests to the moved host DO get created on the other hosts in the mesh as proved by "pscheduler schedule"
I can run "one off" latency tests from remote hosts INTO the moved host by hand just fine.
New info:
So I've done some more digging around in the API and discovered that the inbound latencybg tests get created on my remote hosts, but never seem to generate any "runs" or any "results posted" entries in the pscheduler.log file. Here are the URLs for one of my testing hosts. The first is a latencybg test created this morning at 05:41 to the moved host. Notice that it only ever generates a single run. It also never generates any "results posted" entries in the pscheduler.log file.
The second is a similar latencybg test, also created at 05:41 this morning, to another host in the mesh. It has created many runs and posted them, as shown in the pscheduler.log file.
Does anyone have any insight into why that first run is failing or sticking? Any thoughts on how to see what's going on with it, or on next troubleshooting steps?
On Wed, Apr 3, 2019 at 12:21 PM Casey Russell <> wrote:
Group,
On 3/23 we moved one of our PS nodes from one building on a campus to another. The subnets moved with it, and it went from being direct-attached to one KanREN router, to being direct-attached to another KanREN router. After that move we observed the following change:
All tests in our mesh continued to operate normally except for latency tests Inbound to this node. If you look at our dashboard, you'll note that IPv4 and IPv6 latency tests work fine when initiated outbound from KSU (the node we moved), but fail when every other host in the mesh tries to initiate a test inbound. My first thought was an ACL didn't get applied properly, but I've reviewed them and they seem sane. For reference, we use the host-based firewall and Juniper MX firewall filters on the router side to secure the hosts. The Juniper firewall rule seems the same as it was before, and I can't see any reason the host-based filtering would have changed with a physical move.
Strangely, the psconfig-pscheduler-agent.log on one of our non-KSU hosts reports on 03/21 (two days before the move) that it scheduled 70 tasks. Yesterday (and every day since the move) it reports scheduling the same number of tasks. So the external hosts don't seem to have a problem reaching KSU or setting up the test initially. However, the latency test doesn't show any results in maddash or in the individual host test results.
2019/03/21 18:01:27 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=3060949A-4C2C-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
2019/04/02 17:52:39 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=ED1E3036-5598-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
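To compare the task counts across every agent run rather than two hand-picked lines, a small filter along these lines could help. It is a sketch that relies only on the message format shown in the two log lines above. The function name is hypothetical, and the log path in the usage comment is an assumption about a default install.

```shell
# Read agent log lines on stdin and print "DATE TIME ADDED DELETED"
# for each "Added N new tasks, and deleted M old tasks" summary line.
task_counts() {
    sed -n 's/^\([0-9/]* [0-9:]*\) .*msg=Added \([0-9]*\) new tasks, and deleted \([0-9]*\) old tasks.*/\1 \2 \3/p'
}

# Usage (path assumed, adjust to your install):
# task_counts < /var/log/perfsonar/psconfig-pscheduler-agent.log
```

A drop in the "added" column after the move date would point to a scheduling problem. Identical counts, as seen here, suggest the tasks are still being created and the failure happens later.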
Similarly, I don't have any problem hand-running latency tests to KSU from our outside hosts using owping or any of the other latency tools (I don't have the output from a latencybg test since its default behavior is to run for a full day).
To further troubleshoot, I went to one of my external hosts to verify that the "pscheduler schedule" shows tests to KSU. It does.
]# pscheduler schedule | grep -v throughput | grep -v trace | grep -A 5 -B 2 ps-ksu-lt
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 4 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
--
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 6 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
(To see the public URLs, replace localhost with ps-fhsu-lt.perfsonar.kanren.net.)
When I go to those URLs, of course, the tests are still running, so I'll have to try them again tomorrow to see what the results were.
Does anyone have any thoughts on what could be happening here other than an inbound ACL or MTU issue? The tests should be storing data to a central repository,