
perfsonar-user - Re: [perfsonar-user] Problem with PS node move

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To: "Garnizov, Ivan" <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] Problem with PS node move
  • Date: Fri, 12 Apr 2019 10:46:03 -0500

Ivan,

     I wouldn't say there are necessarily problems with the perfSONAR documentation.  Generally I find it pretty helpful.  It's just that in this very specific case I couldn't find what I needed.  The section that gives examples of how to run pscheduler tests by hand doesn't include a specific example for latencybg.  Using the built-in --help output I got close, but the run still seems to fail, even between hosts that are working.  I did actually try the task import a time or two even before you mentioned it, but it assumes you HAVE a (working) exported JSON task to work with, and when I tried using the task definitions created by psconfig and offered up by the API, those also failed.

     Taking some cues from your email this morning I stripped everything out of a task from the API that was related to psconfig (archivers, reference numbers, schedules and such) and cut it down to just this.
{
    "schema": 1,
    "test": {
        "spec": {
            "bucket-width": 0.001,
            "data-ports": {
                "lower": 8760,
                "upper": 9960
            },
            "dest": "ps-ksu-lt.perfsonar.kanren.net",
            "dest-node": "ps-ksu-lt.perfsonar.kanren.net",
            "flip": false,
            "ip-version": 6,
            "packet-count": 600,
            "packet-interval": 0.1,
            "packet-padding": 0,
            "schema": 1,
            "source": "ps-esu-lt.perfsonar.kanren.net",
            "source-node": "ps-esu-lt.perfsonar.kanren.net"
        },
        "type": "latencybg"
    },
    "tool": "powstream"
}

and the --import switch takes that just fine, except that the default length of the test is a full day (86399 seconds), and I can't seem to figure out where in that JSON task definition to put the -t (duration) setting to shorten it.  Every time I try, it fails with:

[crussell@ps-esu-bw ~]$ pscheduler task --debug --import testtask-ksu.json latencybg
2019-04-12T10:36:30 Debug started
2019-04-12T10:36:30 Assistance is from localhost
Invalid JSON: No JSON object could be decoded

and all /var/log/perfsonar/pscheduler.log shows is:
Apr 12 10:36:30 ps-esu-bw journal: task DEBUG    Debug started
Apr 12 10:36:30 ps-esu-bw journal: task DEBUG    Assistance is from localhost
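
For what it's worth, the debug output of the hand-run task later in this message shows the CLI placing the duration inside the test spec as "duration": "PT30S", so that appears to be where the setting belongs in an imported task. The "No JSON object could be decoded" message is what Python 2's json module reports for text that does not parse at all, so a hand edit may have left a missing or trailing comma behind. Below is a minimal Python sketch that makes the change programmatically instead. The key names in PSCONFIG_KEYS and the helper name trim_task are assumptions for illustration, not part of pscheduler.

```python
import json

# Top-level keys ASSUMED to come from psconfig or the API rather than being
# needed for a hand-run task; adjust to whatever your task actually contains.
PSCONFIG_KEYS = ("archives", "reference", "schedule", "detail", "href")


def trim_task(task, duration=None):
    """Return a copy of a task dict without the psconfig extras.

    If duration is given (an ISO 8601 string such as "PT30S"), it is stored
    in test.spec, which is where the CLI debug output shows it.
    """
    trimmed = {k: v for k, v in task.items() if k not in PSCONFIG_KEYS}
    if duration is not None:
        # Copy the nested dicts so the caller's task is left untouched.
        test = dict(trimmed["test"])
        spec = dict(test["spec"])
        spec["duration"] = duration
        test["spec"] = spec
        trimmed["test"] = test
    return trimmed


# Small in-memory example (hostnames shortened, values illustrative only).
example = {
    "schema": 1,
    "schedule": {"slip": "PT5M"},
    "test": {
        "type": "latencybg",
        "spec": {"schema": 1, "source": "src.example", "dest": "dst.example"},
    },
    "tool": "powstream",
}

# json.dumps always emits text that parses, so no stray commas can sneak in.
print(json.dumps(trim_task(example, "PT30S"), indent=4, sort_keys=True))
```

Loading testtask-ksu.json with json.load, passing the result through trim_task(task, "PT30S"), and writing it back with json.dump should give a file that at least parses cleanly. Whether the shortened run then completes is a separate question.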

On the question of whether or not pscheduler spawns a powstream process, I'm not sure I know a command that will tell me for that particular instance of pscheduler.  The boxes I'm testing with are part of a mid-sized mesh and have 70+ powstream processes running at any given time, and that number fluctuates up and down every 4-5 seconds.  But in case it helps, here is the debug output from a hand-run latencybg test (30 seconds only) between two hosts that ARE working with each other in the mesh, yet fail the hand-run test.

[crussell@ps-esu-bw ~]$ pscheduler task --debug latencybg -t PT30S --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 Debug started
2019-04-12T10:41:22 Assistance is from localhost
2019-04-12T10:41:22 Forcing default slip of PT5M
2019-04-12T10:41:22 Converting to spec via https://localhost/pscheduler/tests/latencybg/spec
Submitting task...
2019-04-12T10:41:22 Fetching participant list
2019-04-12T10:41:22 Spec is: {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}
2019-04-12T10:41:22 Params are: {'spec': '{"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}'}
2019-04-12T10:41:22 Got participants: {u'participants': [u'ps-esu-lt.perfsonar.kanren.net']}
2019-04-12T10:41:22 Lead is ps-esu-lt.perfsonar.kanren.net
2019-04-12T10:41:22 ps-esu-lt.perfsonar.kanren.net is up
2019-04-12T10:41:22 Data is {"test": {"type": "latencybg", "spec": {"dest": "ps-fhsu-lt.perfsonar.kanren.net", "source": "ps-esu-lt.perfsonar.kanren.net", "duration": "PT30S", "schema": 1}}, "schema": 1, "schedule": {"slip": "PT5M"}}
Task URL:
2019-04-12T10:41:24 Submission diagnostics:
2019-04-12T10:41:24   Hints:
2019-04-12T10:41:24     requester: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24     server: 2001:49d0:23c0:1002::2
2019-04-12T10:41:24   Identified as everybody, local-interfaces, KanREN-PS, r-and-e
2019-04-12T10:41:24   Classified as default, friendlies
2019-04-12T10:41:24   Application: Hosts we trust to do everything
2019-04-12T10:41:24     Group 1: Limit 'always' passed
2019-04-12T10:41:24     Group 1: Want all, 1/1 passed, 0/1 failed: PASS
2019-04-12T10:41:24     Application PASSES
2019-04-12T10:41:24   Application: Defaults applied to non-friendly hosts
2019-04-12T10:41:24     Group 1: Limit 'innocuous-tests' passed
2019-04-12T10:41:24     Group 1: Limit 'throughput-default-tcp' failed: Test is not 'throughput'
2019-04-12T10:41:24     Group 1: Limit 'throughput-default-udp' failed: Test is not 'throughput'
2019-04-12T10:41:24     Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
2019-04-12T10:41:24     Group 1: Want any, 1/4 passed, 3/4 failed: PASS
2019-04-12T10:41:24     Application PASSES
2019-04-12T10:41:24   Proposal meets limits
Running with tool 'powstream'
Fetching first run...
2019-04-12T10:41:25 Handing off: pscheduler watch --first --format text/plain --debug https://ps-esu-lt.perfsonar.kanren.net/pscheduler/tasks/810398a2-15b4-41fe-882e-1238c5be40a1
2019-04-12T10:41:25 Debug started

Next scheduled run:
Starts 2019-04-12T15:41:34Z (~8 seconds)
Ends   2019-04-12T15:42:04Z (~29 seconds)
Waiting for result...

Run has not completed.

No further runs scheduled.


Could this failure be occurring because latencybg/powstream is a daemon and there is already a (day-long) latencybg test running between the hosts? 
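
One rough way to see which powstream processes exist for a given pair of hosts is to filter the process list by destination. The Python sketch below assumes the destination hostname appears as a separate argument on the powstream command line; that is an assumption about how the tool is invoked, and the sample lines are made up for illustration. The helper name find_powstream is hypothetical.

```python
def find_powstream(ps_output, dest):
    """Return (pid, command) pairs for powstream processes that mention dest.

    ps_output is expected to look like the output of `ps -eo pid,args`:
    a header line, then one "PID COMMAND..." line per process.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skips the header line and anything malformed
        pid, command = int(parts[0]), parts[1].rstrip()
        argv = command.split()
        # Match on the executable name, then look for dest as its own argument.
        if "powstream" in argv[0] and dest in argv[1:]:
            matches.append((pid, command))
    return matches


# Made-up sample; real argument lists will differ.
sample = (
    "  PID COMMAND\n"
    "  101 /usr/bin/powstream -c 600 -i 0.1 ps-ksu-lt.perfsonar.kanren.net\n"
    "  102 /usr/bin/powstream -c 600 -i 0.1 ps-fhsu-lt.perfsonar.kanren.net\n"
)
for pid, command in find_powstream(sample, "ps-fhsu-lt.perfsonar.kanren.net"):
    print(pid, command)
```

On a live host the input could come from subprocess.check_output(["ps", "-eo", "pid,args"], universal_newlines=True). Comparing the matches before and after submitting the hand-run task would show whether a new process appeared, and whether a day-long one to the same destination was already running.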


Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047



On Fri, Apr 12, 2019 at 3:58 AM Garnizov, Ivan <> wrote:

Hello Casey,

 

I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).

I was not aware that there are problems with the pS documentation. Certainly these can be reported on the pS issue tracker. I can do this, but I guess you can do it as well. In any case, more concrete information is needed.

 

Here I am attaching a task specification based on the configuration you shared (the URL for the runs).

You can use it as described in the documentation ;) http://docs.perfsonar.net/pscheduler_client_tasks.html#importing-tasks-from-json

 

pscheduler task --debug --import kanren-latbg-task.json latencybg

 

Of course, it goes without saying that you can adjust the fields however you like, but this JSON is prepared exactly as it was submitted to pscheduler.

Also, in my previous email I asked you whether pscheduler produced / spawned a powstream process. From your response I still have no idea about it.

In general, when debugging issues I would suggest always using the '--debug' option, as in my example.

 

The overdue question is related to some findings I myself had recently.

 

 

Best regards,

Ivan

 

From: Casey Russell [mailto:]
Sent: Thursday, April 11, 2019 4:01 PM
To: Garnizov, Ivan (RRZE) <>
Cc:
Subject: Re: [perfsonar-user] Problem with PS node move

 

Ivan,

 

     I was aware that latency and latencybg used different tools, but I was never able to find any examples in the documentation for running latencybg tasks between hosts.  When I tried with the built-in documentation, I was never able to get the syntax right and get a latencybg test to run, even between my working hosts in the mesh (example below).

 

No further runs scheduled.

[crussell@ps-esu-bw ~]$ pscheduler task latencybg --duration 10 --source ps-esu-lt.perfsonar.kanren.net --dest ps-fhsu-lt.perfsonar.kanren.net                    

Submitting task...

Task URL:

Running with tool 'powstream'

Fetching first run...

 

Next scheduled run:

Starts 2019-04-11T13:56:06Z (~7 seconds)

Ends   2019-04-11T13:56:16Z (~9 seconds)

Waiting for result...

 

Run has not completed.

 

No further runs scheduled.

 

So, unfortunately, I haven't replicated (exactly) the test that's failing.  I used owping since, as I understand it, it's the closest to replicating powstream and its related port and protocol setup.

 

As for the overdue status messages, I generally have been looking at those runs in the API when I get in in the morning, so it will have been several hours after the mesh config (psconfig) kicked them off in the early morning hours.

 

 

 

Sincerely,

Casey Russell

Network Engineer

KanREN

phone: 785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

 

 

On Thu, Apr 11, 2019 at 7:16 AM Garnizov, Ivan <> wrote:

Hello Casey,

 

I had a look into your task spec and the run.

As you know, “latency” and “latencybg” tests use different tools. You can also request a “latencybg” measurement from the CLI. The task will then be given an ID, and a process must be spawned to run this measurement.

Are you able to verify this?

When are you getting this overdue status? Immediately after the request, or later?

 

 

Regards,

Ivan Garnizov

 

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

 

From: [mailto:] On Behalf Of Casey Russell
Sent: Wednesday, April 10, 2019 4:41 PM
To:
Subject: Re: [perfsonar-user] Problem with PS node move

 

Group,

 

     An update here, and another request for assistance.  I still have the problem with my (moved) host.  The other hosts in the mesh still can't successfully run latency tests to it.  But I have gathered a bit more info.

 

Review:

  • The host was moved across campus.
  • Subnets and IPs moved with it.
  • Routing is good; traceroute and ping are fine.
  • Throughput tests, traceroute tests and OUTbound latency tests to other hosts in the mesh are fine.
  • Inbound tests to the moved host never get posted to the Central MA.
  • Inbound tests to the moved host DO get created on the other hosts in the mesh, as proved by "pscheduler schedule".
  • I can run "one off" latency tests from remote hosts INTO the moved host by hand just fine.

 

New info:

     So I've done some more digging around in the API and discovered that the inbound latencybg tests get created on my remote hosts, but never seem to generate any "runs" or any "results posted" entries in the pscheduler.log file.  Here are the URLs for one of my testing hosts.  The first is a latencybg test created this morning at 05:41 to the moved host.  Notice it only ever generates a single run.  It also never generates any "results posted" entries in the pscheduler.log file.

 

The second is a similar latencybg test, also created at 05:41 this morning, to another host in the mesh.  It has created and posted many runs in the pscheduler.log file.

 

Does anyone have any insight on why that first run is failing/sticking?  Any thoughts on how to see what's going on with it, or on next troubleshooting steps?

 


 

Sincerely,

Casey Russell

Network Engineer

KanREN

phone: 785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

 

 

On Wed, Apr 3, 2019 at 12:21 PM Casey Russell <> wrote:

Group,

 

     On 3/23 we moved one of our PS nodes from one building on a campus to another.  The subnets moved with it, and it went from being direct-attached to one KanREN router, to being direct-attached to another KanREN router.  After that move we observed the following change:

 

     All tests in our mesh continued to operate normally except for latency tests Inbound to this node.  If you look at our dashboard, you'll note that IPv4 and IPv6 latency tests work fine when initiated outbound from KSU (the node we moved), but fail when every other host in the mesh tries to initiate a test inbound.  My first thought was an ACL didn't get applied properly, but I've reviewed them and they seem sane.  For reference, we use the host-based firewall and Juniper MX firewall filters on the router side to secure the hosts.  The Juniper firewall rule seems the same as it was before, and I can't see any reason the host-based filtering would have changed with a physical move.  

 

 

      Strangely, the psconfig-pscheduler-agent.log on one of our non-KSU hosts reports on 03/21 (two days before the move) that it scheduled 70 tasks.  Yesterday (and every day since the move) it reports scheduling the same number of tasks.  So the external hosts don't seem to have a problem reaching KSU or setting up the test initially.  However, the latency test doesn't show any results in MaDDash or the individual host test results.

 

2019/03/21 18:01:27 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=3060949A-4C2C-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks

 

2019/04/02 17:52:39 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=ED1E3036-5598-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks
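
To compare task counts across more than two days, the agent log can be scanned for these summary lines. The Python sketch below is written against the line format shown in the two entries above and nothing else; the helper name task_counts is hypothetical, and other log lines are simply ignored.

```python
import re

# Matches the "Added N new tasks, and deleted M old tasks" summary lines
# in the format shown in the two log entries above.
SUMMARY_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2} "
    r".*?msg=Added (?P<added>\d+) new tasks, "
    r"and deleted (?P<deleted>\d+) old tasks"
)


def task_counts(lines):
    """Return (date, added, deleted) tuples for every summary line found."""
    counts = []
    for line in lines:
        match = SUMMARY_RE.match(line.strip())
        if match:
            counts.append(
                (match.group("date"),
                 int(match.group("added")),
                 int(match.group("deleted")))
            )
    return counts


log_lines = [
    "2019/03/21 18:01:27 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=3060949A-4C2C-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks",
    "2019/04/02 17:52:39 INFO pid=7183 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=226 guid=ED1E3036-5598-11E9-BD6C-F71A4CD1F608 msg=Added 70 new tasks, and deleted 0 old tasks",
]
for date, added, deleted in task_counts(log_lines):
    print(date, added, deleted)
```

A steady count before and after the move would support the reading that task creation itself is unaffected and the problem lies in the runs.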

 

Similarly, I don't have any problem hand-running latency tests to KSU from our outside hosts using owping or any of the other latency tools (I don't have the output from a latencybg test since its default behavior is to run for a full day).

 

To further troubleshoot, I went to one of my external hosts to verify that the "pscheduler schedule" shows tests to KSU.  It does.

 

]# pscheduler schedule | grep -v throughput | grep -v trace | grep -A 5 -B 2 ps-ksu-lt

2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z  (Running)

latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net

  --dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width

  0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-

  lt.perfsonar.kanren.net --ip-version 4 --packet-interval 0.1 --packet-count 600

  (Run with tool 'powstream')

 

--

2019-04-03T12:14:35Z - 2019-04-04T12:14:35Z  (Running)

latencybg --data-ports 8760-9960 --source-node ps-fhsu-lt.perfsonar.kanren.net

  --dest ps-ksu-lt.perfsonar.kanren.net --packet-padding 0 --flip --bucket-width

  0.001 --dest-node ps-ksu-lt.perfsonar.kanren.net --source ps-fhsu-

  lt.perfsonar.kanren.net --ip-version 6 --packet-interval 0.1 --packet-count 600

  (Run with tool 'powstream')

 

(to see the public URLs, replace localhost with ps-fhsu-lt.perfsonar.kanren.net)

 

When I go to those URLs, of course, they're still running, so I'll have to try them again tomorrow to see what the results were.

 

Does anyone have any thoughts on what could be happening here other than an inbound ACL or MTU issue?  The tests should be storing data to a central repository, 

 

 

Sincerely,

Casey Russell

Network Engineer

KanREN

phone: 785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 



