perfsonar-user - Re: [perfsonar-user] Problem with PS node move
- From: Casey Russell <>
- To: "Garnizov, Ivan" <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Problem with PS node move
- Date: Fri, 12 Apr 2019 10:59:26 -0500
Ivan,

I wouldn't say there are necessarily problems with the perfSONAR documentation. Generally I find it pretty helpful. It's just that in this very specific case I couldn't find what I needed: as you can see, the area that gives examples of how to run pscheduler tests by hand doesn't have a specific example for latencybg, and using the built-in --help I got close, but the run still seems to fail, even between hosts that are working. I did actually try the import-task approach a time or two even before you mentioned it, but it assumes you HAVE a (working) exported JSON task to work with, and when I tried using the task definitions created by psconfig and offered up by the API, those also failed.

Taking some cues from your email this morning, I stripped everything out of a task from the API that was related to psconfig (archivers, reference numbers, schedules and such) and cut it down to just this:

{
  "schema": 1,
  "test": {
    "spec": {
      "bucket-width": 0.001,
      "data-ports": {
        "lower": 8760,
        "upper": 9960
      },
      "dest": "ps-ksu-lt.perfsonar.kanren.net",
      "dest-node": "ps-ksu-lt.perfsonar.kanren.net",
      "flip": false,
      "ip-version": 6,
      "packet-count": 600,
      "packet-interval": 0.1,
      "packet-padding": 0,
      "schema": 1,
      "source": "ps-esu-lt.perfsonar.kanren.net",
      "source-node": "ps-esu-lt.perfsonar.kanren.net"
    },
    "type": "latencybg"
  },
  "tool": "powstream"
}

The --import switch takes that just fine, except that the default length of the test is to run for a full day (86399 seconds), and I can't seem to figure out where in that JSON task definition to stick the -t (duration) statement to shorten it. Every time I do, it fails with:

[crussell@ps-esu-bw ~]$ pscheduler task --debug --import testtask-ksu.json latencybg
2019-04-12T10:36:30 Debug started
2019-04-12T10:36:30 Assistance is from localhost
Invalid JSON: No JSON object could be decoded

and all /var/log/perfsonar/pscheduler.log shows is:

Apr 12 10:36:30 ps-esu-bw journal: task DEBUG Debug started
Apr 12 10:36:30 ps-esu-bw journal: task DEBUG Assistance is from localhost

On the question of whether or not pscheduler spawns a powstream process, I'm not sure I know a command that will tell me for that particular instance of pscheduler. The boxes I'm testing with are part of a mid-sized mesh and have 70+ powstream processes running at any given time, and that number fluctuates up and down every 4-5 seconds.
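As an aside on the duration question: in the hand-run task below, the CLI ends up putting the duration inside the test spec itself ("duration": "PT30S"). Based on that, a trimmed, untested sketch of the imported task with a 30-second duration added to the spec (all other values copied from the task above) would be:

{
  "schema": 1,
  "test": {
    "spec": {
      "dest": "ps-ksu-lt.perfsonar.kanren.net",
      "dest-node": "ps-ksu-lt.perfsonar.kanren.net",
      "source": "ps-esu-lt.perfsonar.kanren.net",
      "source-node": "ps-esu-lt.perfsonar.kanren.net",
      "duration": "PT30S",
      "ip-version": 6,
      "packet-count": 600,
      "packet-interval": 0.1,
      "schema": 1
    },
    "type": "latencybg"
  },
  "tool": "powstream"
}

Note that "Invalid JSON: No JSON object could be decoded" comes from the JSON parser itself, so it usually points to a syntax slip in the edited file (a stray or missing comma or brace) rather than to a rejected field.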
But in case it helps, here is the debug output from a hand-run instance of a latencybg test (30 seconds only) that fails between two hosts that ARE working with each other in the mesh (but fail the handrun test).

[crussell@ps-esu-bw ~]$ pscheduler task --debug latencybg -t PT30S --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 Debug started
2019-04-12T10:41:22 Assistance is from localhost
2019-04-12T10:41:22 Forcing default slip of PT5M
2019-04-12T10:41:22 Converting to spec via https://localhost/pscheduler/tests/latencybg/spec
Submitting task...
2019-04-12T10:41:22 Fetching participant list
2019-04-12T10:41:22 Spec is: {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}
2019-04-12T10:41:22 Params are: {'spec': '{"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}'}
2019-04-12T10:41:22 Got participants: {u'participants': [u'ps-esu-lt.perfsonar.kanren.net']}
2019-04-12T10:41:22 Lead is ps-esu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 Pinging https://ps-esu-lt.perfsonar.kanren.net/pscheduler/
2019-04-12T10:41:22 ps-esu-lt.perfsonar.kanren.net is up
2019-04-12T10:41:22 Posting task to https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks
2019-04-12T10:41:22 Data is {"test": {"type": "latencybg", "spec": {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}}, "schema": 1, "schedule": {"slip": "PT5M"}}
Task URL:
2019-04-12T10:41:24 Posted https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:24 Submission diagnostics:
2019-04-12T10:41:24 Hints:
2019-04-12T10:41:24 requester: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24 server: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24 Identified as everybody, local-interfaces, KanREN-PS, r-and-e
2019-04-12T10:41:24 Classified as default, friendlies
2019-04-12T10:41:24 Application: Hosts we trust to do everything
2019-04-12T10:41:24 Group 1: Limit 'always' passed
2019-04-12T10:41:24 Group 1: Want all, 1/1 passed, 0/1 failed: PASS
2019-04-12T10:41:24 Application PASSES
2019-04-12T10:41:24 Application: Defaults applied to non-friendly hosts
2019-04-12T10:41:24 Group 1: Limit 'innocuous-tests' passed
2019-04-12T10:41:24 Group 1: Limit 'throughput-default-tcp' failed: Test is not 'throughput'
2019-04-12T10:41:24 Group 1: Limit 'throughput-default-udp' failed: Test is not 'throughput'
2019-04-12T10:41:24 Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-04-12T10:41:24 Group 1: Want any, 1/4 passed, 3/4 failed: PASS
2019-04-12T10:41:24 Application PASSES
2019-04-12T10:41:24 Proposal meets limits
Running with tool 'powstream'
Fetching first run...
2019-04-12T10:41:24 Fetching https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/first
2019-04-12T10:41:25 Handing off: pscheduler watch --first --format text/plain --debug https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:25 Debug started
2019-04-12T10:41:25 Fetching https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:25 Fetching next run from https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/first
Next scheduled run:
Starts 2019-04-12T15:41:34Z (~8 seconds)
Ends 2019-04-12T15:42:04Z (~29 seconds)
Waiting for result...
Run has not completed.
2019-04-12T10:42:35 Fetching next run from https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1/runs/next
No further runs scheduled.

Could this failure be occurring because latencybg/powstream is a daemon and there is already a (day-long) latencybg test running between the hosts?

On Fri, Apr 12, 2019 at 3:58 AM Garnizov, Ivan <> wrote:

Hello Casey,
“I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).”
I was not aware that there are problems with the pS documentation. These can certainly be reported on the pS issue tracker; I can do this, but I guess you can as well. In any case, more concrete information is needed.
Here I am attaching a task specification based on the configuration you shared (the URL for the runs).
You can use it as described in the documentation ;) http://docs.perfsonar.net/pscheduler_client_tasks.html#importing-tasks-from-json
pscheduler task --debug --import kanren-latbg-task.json latencybg
Of course you can adjust the fields however you like; it goes without saying, but this JSON is prepared exactly as it was submitted to pscheduler.
Also, in my previous email I asked whether pscheduler produced/spawned a powstream process. From your response I still cannot tell.
In general, when debugging issues, I would suggest always using the '--debug' option as in my example.
The question about the overdue status relates to some findings I made myself recently.
Best regards,
Ivan
From: Casey Russell [mailto:]
Sent: Thursday, April 11, 2019 4:01 PM
To: Garnizov, Ivan (RRZE) <>
Cc:
Subject: Re: [perfsonar-user] Problem with PS node move
Ivan,
I was aware that latency and latencybg use different tools, but I was never able to find any examples in the documentation for running latencybg tasks between hosts, and when I tried with the built-in documentation, I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).
[crussell@ps-esu-bw ~]$ pscheduler task latencybg --duration 10 --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
Submitting task...
Task URL:
Running with tool 'powstream'
Fetching first run...
Next scheduled run:
Starts 2019-04-11T13:56:06Z (~7 seconds)
Ends 2019-04-11T13:56:16Z (~9 seconds)
Waiting for result...
Run has not completed.
No further runs scheduled.
So, unfortunately, I haven't replicated (exactly) the test that's failing; I used owping, since (as I understand it) it's the closest to replicating powstream and its related ports and protocol setups.
As for the overdue status messages, I generally have been looking at those runs in the API when I get in in the morning, so it will have been several hours after the mesh config (psconfig) kicked them off in the early morning hours.
On Thu, Apr 11, 2019 at 7:16 AM Garnizov, Ivan <> wrote:
Hello Casey,
I had a look into your task spec and the run.
As you know, “latency” and “latencybg” tests use different tools. You can also request a “latencybg” measurement from the CLI. The task will then be given an ID, and there must be a process spawned to run this measurement.
Are you able to verify this?
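One way to check, as a sketch (the grep pattern is illustrative; it just narrows the running powstream processes down to the destination in question):

# On the source/lead host, look for a powstream spawned toward this particular destination
ps aux | grep '[p]owstream' | grep ps-ksu-lt.perfsonar.kanren.net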
When are you getting this overdue status? Immediately after the request, or only later?
Regards,
Ivan Garnizov
GEANT WP6T3: pS development team
GEANT WP7T1: pS deployments GN Operations
GEANT WP9T2: Software governance in GEANT
From: [mailto:] On Behalf Of Casey Russell
Sent: Wednesday, April 10, 2019 4:41 PM
To:
Subject: Re: [perfsonar-user] Problem with PS node move
Group,
An update here, and another request for assistance. I still have the problem with my (moved) host. The other hosts in the mesh still can't successfully run latency tests to it. But I have gathered a bit more info.
Review:
The host was moved across campus
subnets and IPs moved with it.
routing is good, traceroute and ping are fine.
throughput tests, traceroute tests and OUTbound latency tests to other hosts in the mesh are fine.
inbound tests to the moved host never get posted to the Central MA.
inbound tests to the moved host DO get created on the other hosts in the mesh as proved by "pscheduler schedule"
I can run "one off" latency tests from remote hosts INTO the moved host by hand just fine.
New info:
So I've done some more digging around in the API and discovered that the inbound latencybg tests get created on my remote hosts, but never seem to generate any "runs" or any "results posted" entries in the pscheduler.log file. Here are the URLs for one of my testing hosts. The first is a latencybg test created this morning at 05:41 to the moved host. Notice it only ever generates a single run. It also never generates any "results posted" entries in pscheduler.log.
The second is a similar latencybg test, also created at 05:41 this morning, to another host in the mesh. It has created and posted many runs in the pscheduler.log file.
Does anyone have any insight into why that first run is failing/sticking? Any thoughts on how to see what's going on with it, or next troubleshooting steps?
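One way to dig into that stuck first run, sketched here against the pscheduler REST API (the task and run UUIDs below are placeholders, not values from this thread):

# List the run URLs pscheduler has created for the task
curl -s -k "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs"

# Fetch one run's JSON, which includes its state and any error text
curl -s -k "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs/RUN-UUID"

# pscheduler can also render the result for a run given its full URL
pscheduler result "https://ps-fhsu-lt.perfsonar.kanren.net/pscheduler/tasks/TASK-UUID/runs/RUN-UUID"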
On Wed, Apr 3, 2019 at 12:21 PM Casey Russell <> wrote:
Group,
On 3/23 we moved one of our PS nodes from one building on a campus to another. The subnets moved with it, and it went from being direct-attached to one KanREN router, to being direct-attached to another KanREN router. After that move we observed the following change:
All tests in our mesh continued to operate normally except for latency tests Inbound to this node. If you look at our dashboard, you'll note that IPv4 and IPv6 latency tests work fine when initiated outbound from KSU (the node we moved), but fail when every other host in the mesh tries to initiate a test inbound. My first thought was an ACL didn't get applied properly, but I've reviewed them and they seem sane. For reference, we use the host-based firewall and Juniper MX firewall filters on the router side to secure the hosts. The Juniper firewall rule seems the same as it was before, and I can't see any reason the host-based filtering would have changed with a physical move.
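If the ACLs are suspect, one quick check is to probe the relevant ports from a remote mesh host toward the moved node. This is only a sketch: it assumes nmap is installed and that owampd is listening on its default control port (TCP 861); the UDP range comes from the data-ports in the latencybg tasks shown in the schedule output below.

# Is the OWAMP control port reachable on the moved node?
nmap -Pn -p 861 ps-ksu-lt.perfsonar.kanren.net

# Spot-check part of the UDP test-port range the latencybg tasks use (8760-9960); -sU needs root
sudo nmap -Pn -sU -p 8760-8770 ps-ksu-lt.perfsonar.kanren.net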
Strangely, the psconfig-pscheduler-agent.log on one of our non-KSU hosts reports on 03/21 (two days before the move) that it scheduled 70 tasks. Yesterday (and every day since the move) it reports scheduling the same number of tasks. So the external hosts don't seem to have a problem reaching KSU or setting up the tests initially. However, the latency tests don't show any results in MaDDash or in the individual hosts' test results.
2019/03/21 18:01:27 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=3060949A-4C2C-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
2019/04/02 17:52:39 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=ED1E3036-5598-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
Similarly, I don't have any problem hand-running latency tests to KSU from our outside hosts using owping or any of the other latency tools (I don't have the output from a latencybg test since its default behavior is to run for a full day).
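For reference, a hand-run check of that kind might look like the sketch below; the packet count and interval mirror the mesh's latencybg parameters (600 packets, 0.1 s apart) rather than coming from an actual command in this thread:

# One-off OWAMP latency test into the moved node, run from an external host
owping -c 600 -i 0.1 ps-ksu-lt.perfsonar.kanren.net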
To further troubleshoot, I went to one of my external hosts to verify that the "pscheduler schedule" shows tests to KSU. It does.
]# pscheduler schedule | grep -v throughput | grep -v trace | grep -A 5 -B 2 ps-ksu-lt
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 4 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
--
2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z (Running)
latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net
--dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width
0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-
lt.perfsonar.kanren.net --ip-version 6 --packet-interval 0.1 --packet-count 600
(Run with tool 'powstream')
(To see the public URLs, replace localhost with ps-fhsu-lt.perfsonar.kanren.net.)
When I go to those URLs, of course, they're still running, so I'll have to try them again tomorrow to see what the results were.
Does anyone have any thoughts on what could be happening here other than an inbound ACL or MTU issue? The tests should be storing data to a central repository,