perfsonar-user - [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade

Subject: perfSONAR User Q&A and Other Discussion


From: "Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.]" <>
To: "" <>
Subject: [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade
Date: Tue, 20 Jul 2021 14:20:03 +0000

Ivan,

Thanks again for your help. We will monitor the GitHub site for outcomes on this issue. We do run a large number of tests to non-agent hosts, including quite a few GEANT perfSONAR test nodes. I’m thinking the best option for us might be to roll back to perfSONAR version 4.3.4 in the interim. On the night of July 8/9, yum updated 95 packages, 84 of which were perfSONAR-specific, but the upgrade also included updates to some Python and PostgreSQL packages. Doing yum rollbacks carries a risk of breaking things, and I’m particularly concerned that the PostgreSQL upgrades might not downgrade cleanly with yum. Has anyone had any experience doing something like this?

Thanks,

George

From: "Garnizov, Ivan" <>
Date: Monday, July 19, 2021 at 8:48 AM
To: "George.D.Uhl" <>
Cc: "Jackson, Wayne P. (GSFC-590.0)[Arctic Slope Technical Services, Inc.]" <>
Subject: RE: [EXTERNAL] RE: Test scheduling behavior post 4.4.0 upgrade

Hello George,

“On a multi-homed server can tests assigned to each individual interface run a schedule that’s independent of the other test interfaces”

No, this is not possible with the current implementation of pScheduler, even in version 4.4. The only way to achieve this is with multiple containers.

“and reverse direction tests are ignoring the test frequency specification in the mesh for tests run against a no-agent host”

This is correct. There is an open issue on GitHub about it.

About pScheduler being a lot less tolerant: it is difficult for me to comment. Please send this feedback to the perfSONAR user list so that more people can share their experience.

Regards,

Ivan

From: Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.] []
Sent: Friday, July 16, 2021 4:35 PM
To: Garnizov, Ivan (RRZE) <>
Cc: Jackson, Wayne P. (GSFC-590.0)[Arctic Slope Technical Services, Inc.] <>
Subject: Re: [EXTERNAL] RE: Test scheduling behavior post 4.4.0 upgrade

Hi Ivan,

Thanks for the feedback! We do have a very busy server, but it was better for us to run a single multi-homed server than multiple single-homed servers. That leads me to a question about scheduling on multi-homed systems. On a multi-homed server can tests assigned to each individual interface run a schedule that’s independent of the other test interfaces, or is there a single test schedule that’s applied to all the test interfaces? I’d like to balance the throughput test load between two test interfaces on my server and have them schedule and execute tests independently of each other.

By the way, these scheduling test failures only started occurring after we upgraded to pS 4.4.0. I’ve dug through the logs and I’ve run ad hoc tests with debug mode turned on but I couldn’t find anything that would pinpoint a cause. I’ve set the slip time up to PT30M when running ad hoc tests and that hasn’t resulted in consistently successful tests either. I’ve also reduced the test duration down to PT10S with some success on ad hoc pscheduler tests but after setting the duration to PT10S for tests in my mesh, they continue to fail. It seems that the scheduler is a lot less tolerant than it was prior to the 4.4 upgrade and reverse direction tests are ignoring the test frequency specification in the mesh for tests run against a no-agent host. Instead they seem to be scheduled haphazardly.
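For readers unfamiliar with the notation, the slip and duration values quoted above (PT30M, PT10S) are ISO 8601 durations. The following is a minimal sketch, assuming only the simple day/time forms that appear in this thread; `parse_duration` is a hypothetical helper written for illustration and is not part of pScheduler.

```python
import re
from datetime import timedelta

# Handles only simple forms such as PT30M, PT10S, PT2H or P1DT12H.
# This is not a full ISO 8601 parser (no weeks, months, years or fractions).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    # Keep only the components that were actually present in the string.
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    if not parts:
        raise ValueError(f"empty duration: {text!r}")
    return timedelta(**parts)


# Values mentioned in this thread: slip, test duration, and the 2-hour cycle.
for label, value in (("slip", "PT30M"), ("duration", "PT10S"), ("interval", "PT2H")):
    seconds = int(parse_duration(value).total_seconds())
    print(f"{label}: {value} = {seconds} seconds")
```

Seen this way, a slip of PT30M lets a run start up to 1800 seconds later than requested, while a PT10S test occupies only a 10-second measurement window.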

Regards,

George

From: "Garnizov, Ivan" <>
Date: Friday, July 16, 2021 at 5:19 AM
To: "George.D.Uhl" <>
Cc: "Jackson, Wayne P. (GSFC-590.0)[Arctic Slope Technical Services, Inc.]" <>
Subject: [EXTERNAL] RE: Test scheduling behavior post 4.4.0 upgrade

Dear George,

Indeed, I can confirm there is an issue when running throughput tests with versions of pS prior to 4.4.

With respect to the outcomes you share:

Gave up after too many scheduling conflicts

This means one or both of these systems are quite busy, with a well-populated schedule.
You could try increasing the “slip” time so that pScheduler is able to negotiate a common free slot.
This one is difficult to relate to the 4.4 upgrade. It happens often with popular organisations like yours ;)

Run not found; task may have been canceled

This may mean many things and requires more details from the pScheduler output. Use the --debug option to get more details.
A very common case is a run being overruled when priorities are applied.
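The two suggestions above (a longer slip and the --debug option) can be combined in a single ad hoc throughput task. This is a minimal sketch that only assembles and prints the command line; `throughput_task_command` is a hypothetical helper and `ps.example.net` is a placeholder host, and the option names should be checked against `pscheduler task --help` on the installed version.

```python
import shlex


def throughput_task_command(dest, slip="PT30M", duration="PT10S", debug=True):
    """Return the argument list for an ad hoc pScheduler throughput task."""
    argv = ["pscheduler", "task"]
    if debug:
        # Task-level option: ask pScheduler for more detailed diagnostic output.
        argv.append("--debug")
    # --slip is a task option; --dest and --duration belong to the throughput test.
    argv += ["--slip", slip, "throughput", "--dest", dest, "--duration", duration]
    return argv


# Placeholder destination; substitute the remote perfSONAR host under test.
argv = throughput_task_command("ps.example.net")
print(shlex.join(argv))

# To actually run it on a host with pScheduler installed:
#   import subprocess
#   subprocess.run(argv, check=False)
```

Building the argument list separately from running it makes it easy to log exactly what was attempted alongside the debug output.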

Regards,

Ivan Garnizov

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

From: [] On Behalf Of "Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.]"
Sent: Thursday, July 15, 2021 9:44 PM
To:
Cc: Jackson, Wayne P. (GSFC-590.0)[Arctic Slope Technical Services, Inc.] <>
Subject: [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade

Hi,

A number of my tests have begun to fail after I upgraded my pS testnode software from 4.3.4 to 4.4.0 on the night of July 8/9. It appears that something fairly dramatic has changed with test scheduling, causing outbound tests to fail while inbound tests run on a haphazard schedule, sometimes only a few minutes apart. The graphs below show results for tests that are scheduled to run on a 2-hour cycle. These tests are generated through my pSConfig test mesh. There are several instances of this in my mesh, and one commonality is that the remote perfSONAR servers are designated as no-agent.
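One way to quantify a "haphazard" schedule is to compare the gaps between consecutive run start times against the configured interval. The following is a minimal sketch under that assumption; `interval_deviations` is a hypothetical helper, and the timestamps are made up for illustration, not taken from the results discussed here.

```python
from datetime import datetime, timedelta


def interval_deviations(start_times, expected, tolerance):
    """Return (previous, current, gap) for consecutive runs whose gap
    differs from the expected interval by more than the tolerance."""
    ordered = sorted(start_times)
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if abs(gap - expected) > tolerance:
            flagged.append((previous, current, gap))
    return flagged


# Made-up start times for illustration only; real values would come from
# the measurement archive or from the pScheduler schedule.
runs = [
    datetime.fromisoformat(stamp)
    for stamp in (
        "2021-07-10T00:00:00",
        "2021-07-10T02:03:00",
        "2021-07-10T02:09:00",
        "2021-07-10T04:05:00",
    )
]

# A 2-hour cycle with a 15-minute allowance for normal scheduling slip.
for previous, current, gap in interval_deviations(
    runs, expected=timedelta(hours=2), tolerance=timedelta(minutes=15)
):
    print(f"{previous} -> {current}: gap {gap}")
```

The tolerance is a judgment call: it should be at least as large as the slip configured for the tests, otherwise ordinary slip will be reported as an irregularity.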

One week’s throughput test results prior to upgrade:

One week’s throughput test results post upgrade:

Latest one day’s worth of test results:

Running a troubleshoot shows no issues:

$ pscheduler troubleshoot 198.124.238.154

Performing basic troubleshooting of localhost and 198.124.238.154.

localhost:

Measuring MTU... 65535 (Local)

Looking for pScheduler... OK.

Fetching API level... 5

Checking clock... OK.

Exercising API... Archivers... Clock... Contexts... Tests... Tools... OK.

Fetching service status... OK.

Checking services... Ticker... Scheduler... Runner... Archiver... OK.

Checking limits... OK.

Idle test.... 9 seconds.... Checking archiving... OK.

xxx.xxx.xxx.xxx:

Measuring MTU... 1500+

Looking for pScheduler... OK.

Fetching API level... 5

Checking clock... OK.

Exercising API... Archivers... Clock... Contexts... Tests... Tools... OK.

Fetching service status... OK.

Checking services... Ticker... Scheduler... Runner... Archiver... OK.

Checking limits... OK.

Idle test.... 5 seconds.... Checking archiving... OK.

localhost and xxx.xxx.xxx.xxx:

Checking IP addresses... IPv4

Measuring MTU... 1500+

Checking timekeeping... OK.

Simple stream test.... 13 seconds.... OK.

pScheduler on both hosts appears to be functioning normally.

When I run tests between the same two servers on the command line with pscheduler, they fail. Every so often I get a successful test using the CLI, but it’s rare and inconsistent. They usually fail with the following errors:

Gave up after too many scheduling conflicts.

Run not found; task may have been canceled.

Thanks,

George Uhl

NASA GSFC

