perfsonar-user - [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade
Subject: perfSONAR User Q&A and Other Discussion
- From: "Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.]" <>
- To: "" <>
- Subject: [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade
- Date: Tue, 20 Jul 2021 14:20:03 +0000
Ivan,
Thanks again for your help. We will monitor the GitHub site for outcomes on this issue. We do run a large number of tests to no-agent hosts, including quite a few GEANT perfSONAR test nodes. I’m thinking the best option for us might be to roll back to perfSONAR version 4.3.4 in the interim. On the night of July 8/9, yum updated 95 packages; 84 of them were perfSONAR-specific, but the upgrade also included updates to some Python and Postgres packages. Yum rollbacks carry a risk of breaking things, and I’m particularly concerned that the Postgres upgrades might not downgrade cleanly with yum. Has anyone had any experience doing something like this?
Thanks, George
From: "Garnizov, Ivan" <>
Hello George,
“On a multi-homed server can tests assigned to each individual interface run a schedule that’s independent of the other test interfaces”: No, this is not possible with the current implementation of pScheduler, even in version 4.4. The only way to achieve this is with multiple containers.
“and reverse direction tests are ignoring the test frequency specification in the mesh for tests run against a no-agent host”: This is correct. There is an open issue on GitHub about it.
As for pScheduler being a lot less tolerant, it is difficult for me to comment. Please send this feedback to the pS user list, so that more people are able to share their experience.
Regards, Ivan
From: Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.]
Hi Ivan,
Thanks for the feedback! We do have a very busy server, but it was better for us to run a single multi-homed server than multiple single-homed servers, which leads me to a question about scheduling on multi-homed systems. On a multi-homed server, can tests assigned to each individual interface run a schedule that’s independent of the other test interfaces, or is there a single test schedule that’s applied to all the test interfaces? I’d like to balance throughput test load between two test interfaces on my server and have them schedule and execute tests independently of each other.
By the way, these scheduling test failures only started occurring after we upgraded to pS 4.4.0. I’ve dug through the logs and I’ve run ad hoc tests with debug mode turned on, but I couldn’t find anything that would pinpoint a cause. I’ve raised the slip time to PT30M when running ad hoc tests, and that hasn’t resulted in consistently successful tests either. I’ve also reduced the test duration to PT10S, with some success on ad hoc pscheduler tests, but after setting the duration to PT10S for tests in my mesh, they continue to fail. It seems that the scheduler is a lot less tolerant than it was prior to the 4.4 upgrade, and reverse direction tests are ignoring the test frequency specification in the mesh for tests run against a no-agent host. Instead they seem to be scheduled haphazardly.
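For readers unfamiliar with the PT30M and PT10S notation: these are ISO 8601 durations. The following is a minimal Python sketch, not part of pScheduler, that converts the time-only subset of that notation into seconds. The function name `parse_duration` is hypothetical, and `PT2H` is only an assumed way of writing the 2-hour cycle; the actual mesh value is not shown in this thread.

```python
import re
from datetime import timedelta

# Matches only the time part of ISO 8601 durations, e.g. PT30M, PT10S, PT1H30M.
# Date components (days, months, years) are deliberately not handled.
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> timedelta:
    """Convert a time-only ISO 8601 duration such as 'PT30M' to a timedelta."""
    match = _DURATION_RE.match(text)
    # A bare "PT" matches the pattern but carries no value, so reject it.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


slip = parse_duration("PT30M")      # slip used for the ad hoc tests
duration = parse_duration("PT10S")  # reduced test duration
interval = parse_duration("PT2H")   # assumed notation for the 2-hour cycle

for name, value in (("slip", slip), ("duration", duration), ("interval", interval)):
    print(f"{name}: {int(value.total_seconds())} seconds")
```

Seen this way, a 10-second test with 1800 seconds of slip leaves the scheduler a wide window in which to place a run, which is why the repeated scheduling failures are surprising.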
Regards, George
From: "Garnizov, Ivan" <>
Dear George,
Indeed, I can confirm there is an issue when running throughput tests with versions of pS prior to 4.4. With respect to the outcomes you share:
Regards, Ivan Garnizov
GEANT WP6T3: pS development team
GEANT WP7T1: pS deployments GN Operations
GEANT WP9T2: Software governance in GEANT
From: [] On Behalf Of "Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.]"
Hi,
A number of my tests have begun to fail since I upgraded my pS testnode software from 4.3.4 to 4.4.0 on the night of July 8/9. It appears that something fairly dramatic has changed with test scheduling, causing outbound tests to fail while inbound tests run on a haphazard schedule, sometimes only a few minutes apart. The graphs below show results for tests that are scheduled to run on a 2-hour cycle. These tests are generated through my psconfig test mesh. There are several instances of this in my mesh, and one commonality is that the remote perfSONAR servers are designated as no-agent.
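One way to put a number on "haphazard" is to compare the spacing between consecutive runs with the expected 2-hour cycle. The sketch below is plain Python and does not query pScheduler or an archive; the function name `irregular_gaps`, the 10-minute tolerance, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def irregular_gaps(start_times, expected, tolerance):
    """Return (previous, current, gap) for consecutive runs whose spacing
    differs from `expected` by more than `tolerance`."""
    ordered = sorted(start_times)
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if abs(gap - expected) > tolerance:
            flagged.append((previous, current, gap))
    return flagged


# Hypothetical start times, invented for this example only.
runs = [
    datetime(2021, 7, 14, 0, 0),
    datetime(2021, 7, 14, 2, 1),  # roughly 2 hours after the previous run
    datetime(2021, 7, 14, 2, 6),  # only 5 minutes after the previous run
    datetime(2021, 7, 14, 4, 5),  # roughly 2 hours after the previous run
]

for previous, current, gap in irregular_gaps(
    runs, expected=timedelta(hours=2), tolerance=timedelta(minutes=10)
):
    print(f"{previous} -> {current}: gap {gap}")
```

With real start times exported from the measurement archive, the same check would list every pair of runs that landed only minutes apart.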
One week’s throughput test results prior to upgrade:
One week’s throughput test results post upgrade:
Latest one day’s worth of test results:
No issues are reported when running pscheduler troubleshoot:
$ pscheduler troubleshoot 198.124.238.154
Performing basic troubleshooting of localhost and 198.124.238.154.

localhost:

  Measuring MTU... 65535 (Local)
  Looking for pScheduler... OK.
  Fetching API level... 5
  Checking clock... OK.
  Exercising API... Archivers... Clock... Contexts... Tests... Tools... OK.
  Fetching service status... OK.
  Checking services... Ticker... Scheduler... Runner... Archiver... OK.
  Checking limits... OK.
  Idle test.... 9 seconds.... Checking archiving... OK.

xxx.xxx.xxx.xxx:

  Measuring MTU... 1500+
  Looking for pScheduler... OK.
  Fetching API level... 5
  Checking clock... OK.
  Exercising API... Archivers... Clock... Contexts... Tests... Tools... OK.
  Fetching service status... OK.
  Checking services... Ticker... Scheduler... Runner... Archiver... OK.
  Checking limits... OK.
  Idle test.... 5 seconds.... Checking archiving... OK.

localhost and xxx.xxx.xxx.xxx:

  Checking IP addresses... IPv4
  Measuring MTU... 1500+
  Checking timekeeping... OK.
  Simple stream test.... 13 seconds.... OK.

pScheduler on both hosts appears to be functioning normally.
When I run tests between the same two servers on the command line with pscheduler, they fail. Every so often I get a successful test using the CLI, but it’s rare and inconsistent. They usually fail with the following errors:
  Gave up after too many scheduling conflicts.
  Run not found; task may have been canceled.
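When repeating ad hoc runs to see how often each error appears, it can help to tally the captured output by failure type. The sketch below only matches the two messages quoted above against saved text; it does not invoke pscheduler, and the names `classify` and `tally` and the sample strings are hypothetical.

```python
from collections import Counter

# The two error messages quoted in the report, mapped to short labels.
KNOWN_FAILURES = {
    "Gave up after too many scheduling conflicts": "scheduling-conflict",
    "Run not found; task may have been canceled": "run-not-found",
}


def classify(output: str) -> str:
    """Label one captured CLI output by the first known error it contains."""
    for message, label in KNOWN_FAILURES.items():
        if message in output:
            return label
    # Anything else (including successful runs) is left unclassified here.
    return "unclassified"


def tally(outputs):
    """Count how many captured outputs fall under each label."""
    return Counter(classify(text) for text in outputs)


# Hypothetical captured outputs, invented for this example only.
samples = [
    "Gave up after too many scheduling conflicts.",
    "Run not found; task may have been canceled.",
    "Gave up after too many scheduling conflicts.",
    "some other output",
]

for label, count in tally(samples).items():
    print(f"{label}: {count}")
```

A count per error type over a few dozen attempts would give the GitHub issue something more concrete than "rare and inconsistent".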
Thanks, George Uhl NASA GSFC
- [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade, Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.], 07/15/2021
- <Possible follow-up(s)>
- [perfsonar-user] Test scheduling behavior post 4.4.0 upgrade, Uhl, George D. (GSFC-423.0)[Arctic Slope Technical Services, Inc.], 07/20/2021