perfsonar-user - Re: [perfsonar-user] Added tests not listed in the Test Results
Subject: perfSONAR User Q&A and Other Discussion
- From: Mark Feit <>
- To: Onno Zweers <>, "" <>
- Subject: Re: [perfsonar-user] Added tests not listed in the Test Results
- Date: Wed, 11 Aug 2021 21:49:31 +0000
Onno Zweers writes:
[root@perfsonar ~]# pscheduler task throughput --source perfsonar-bandwidth.grid.surfsara.nl --dest luisteraar.nikhef.nl
…
Gave up after too many scheduling conflicts.
Every so often when two pSchedulers are trying to arrange a run, the time slot they were planning to use gets consumed by something else in the middle of the process. The usual process is to start over and try again, and most times it gets resolved because everyone trying to grab schedule time backs off for a random interval. Sometimes these conflicts happen too many times in a row and the scheduler is forced to give up.
I pulled the schedule from both machines and found that the one at SURF is fairly crowded, which makes this a more-likely occurrence.
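To make the retry behavior described above concrete, here is a minimal sketch of the general "try, back off for a random interval, try again, eventually give up" pattern. It is not pScheduler's actual code: the names SlotConflict, schedule_with_backoff, try_schedule, max_attempts and max_backoff are all hypothetical and chosen only for illustration.

```python
import random
import time


class SlotConflict(Exception):
    """Hypothetical signal: the proposed time slot was taken mid-negotiation."""


def schedule_with_backoff(try_schedule, max_attempts=5, max_backoff=2.0,
                          sleep=time.sleep, rng=random.random):
    """Call try_schedule() until it succeeds or max_attempts is reached.

    After each conflict, wait a random interval in [0, max_backoff) seconds
    so that competing schedulers are unlikely to collide again on the retry.
    The limits here are illustrative, not pScheduler's real values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return try_schedule()
        except SlotConflict:
            if attempt == max_attempts:
                break
            sleep(rng() * max_backoff)
    raise RuntimeError("Gave up after too many scheduling conflicts.")
```

The more crowded a schedule is, the more likely each attempt is to hit a conflict, so the chance of exhausting every attempt goes up accordingly.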
[root@perfsonar ~]# pscheduler task throughput --source perfsonar-bandwidth.grid.surfsara.nl --dest luisteraar.nikhef.nl
…
Unable to fetch runs: No such run.
This happens when the CLI is trying to fetch the first run for a task and the API says it can’t find one. It can happen if the scheduling process takes too long and is something of a side effect of what I described above.
There’s been a latent bug in the scheduler where the API makes a run that isn’t fully scheduled yet available, and the CLI grabs it before a conflict causes it to be deleted. This became a more visible problem in 4.4.0 when we made part of the scheduling process more efficient. The upcoming 4.4.1 includes an interlock to prevent it, so you’ll get the “too many conflicts” message instead.
[root@perfsonar ~]# pscheduler task throughput --source perfsonar-bandwidth.grid.surfsara.nl --dest luisteraar.nikhef.nl … Run failed.
Error: … iperf3: error - unable to send control message: Bad file descriptor … Diagnostics from perfsonar-bandwidth.grid.surfsara.nl: numactl -N 0 /usr/bin/iperf3 -p 5201 -B 2001:610:108:203a::32 -c 2a07:8504:120:e068::72 -t 10 --json --rsa-public-key-path /run/pscheduler-server/runner/tmp/tmp80cl8f6t/tmp0xaouzpa/public-key --username H0ZOxEmIfGkflgY1leYT
Error from luisteraar.nikhef.nl: iperf3 returned an error:
Process took too long to run.
Without a lot more detail, it’s hard to say precisely why this is happening, but my guess is that it has something to do with IPv6 at SURF. But if I can pontificate:
We’ve seen instances where a program tries to do a DNS lookup and takes an excessively long time to do it because of a misconfiguration. That forces pScheduler to kill it so it doesn’t bleed over into any other run’s time slot. Unfortunately, there isn’t a way for us to get good diagnostics in that situation. Running the same task forced to IPv4 by adding --ip-version 4 to the command line results in a successful test. Forcing IPv6 resulted in the same failure, which leads me to believe there might be a misconfiguration on the network somewhere. I’d have a look at IPv6 DNS, firewalls and routing, in that order.
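As a starting point for the DNS part of that check, a rough sketch like the following times a name lookup separately for IPv4 and IPv6 through the system resolver, which is roughly what a test tool would experience. The function names time_lookup and compare_families are made up for this example; only the standard-library socket and time modules are used.

```python
import socket
import time


def time_lookup(host, family):
    """Resolve host for one address family and time it.

    Returns (elapsed_seconds, addresses, error), where error is None on
    success or the resolver's message on failure.
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
        addresses = sorted({info[4][0] for info in infos})
        error = None
    except socket.gaierror as exc:
        addresses = []
        error = str(exc)
    return time.monotonic() - start, addresses, error


def compare_families(host):
    """Run the lookup for both IPv4 and IPv6 and return the results by label."""
    return {
        "IPv4": time_lookup(host, socket.AF_INET),
        "IPv6": time_lookup(host, socket.AF_INET6),
    }
```

Calling compare_families() on each end of the test, for example with "perfsonar-bandwidth.grid.surfsara.nl" and "luisteraar.nikhef.nl", and comparing the elapsed times should show whether the IPv6 lookup is the slow or failing one. A lookup that takes several seconds, or fails for IPv6 only, would point at DNS; if both are quick, firewalls and routing are the next things to look at.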
Hope that helps.
--Mark