perfsonar-user - RE: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler

Subject: perfSONAR User Q&A and Other Discussion

  • From: "Pennington, Mike" <>
  • To: Mark Feit <>, "" <>
  • Subject: RE: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler
  • Date: Wed, 13 Feb 2019 17:35:43 +0000

I personally haven’t made any changes to this thing in ages, but it is part of a couple meshes.  The Quilt and another one.  Let me work on that other stuff and send you the info off list, thanks!

 

Mike Pennington

CEN | Network Engineer

Hartford CT | 06105-3702

p 860 622 4566

Member Conference - May 10th 2019 – Register:  https://t.co/laqXY47EZl

 

 

From: Mark Feit [mailto:]
Sent: Wednesday, February 13, 2019 12:17 PM
To: Pennington, Mike <>;
Subject: Re: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler

 

Pennington, Mike writes:

 

Also saw this in the pscheduler.log:

 

Feb 13 10:37:32 perfsonar-hartford journal: runner ERROR    84864837: Failed to post run for result: Database connection pool exhausted. Unable to get connection after 60 attempts.

Feb 13 10:37:32 perfsonar-hartford journal: runner ERROR    84828777: Failed to post run for result: Database connection pool exhausted. Unable to get connection after 60 attempts.

 

You have two things happening, which I suspect are both related to having more workload than the machine can handle.

 

The runner problem came up last month, and I discussed some of the under-the-hood implications here:  https://lists.internet2.edu/sympa/arc/perfsonar-user/2019-01/msg00013.html.

 

I can’t say what’s happening with the scheduler other than “it’s very busy,” which isn’t particularly helpful.  The scheduler takes its to-do list from the tasks in the database and isn’t prone to doing excess work, so this might be something as simple as one or more tasks being configured with a repeat interval short enough that all of the work is legitimate.  If this system is part of a mesh, have there been any changes to its configuration, such as an artificially low repeat interval for some of the tasks or a switch from MeshConfig to pSConfig format?  (Andy Lake is working on a bug in the latter that might be related.)

 

Could I ask you to do a couple of things to give me some insight into the problem?  As root, run “pscheduler debug on scheduler”, wait 30 seconds and then run “pscheduler debug off”.  Then grep the string ‘ scheduler ‘ (with spaces on either side) out of /var/log/pscheduler/pscheduler.log and send me the results off-list.  If the file is more than 5-10 MB, just send the last few thousand lines.  If the machine is reachable from the outside, please also send me its FQDN and I’ll take a look at what it’s up to through the API.
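
For anyone who would rather script the filtering step, here is a minimal sketch of the grep-and-trim part in Python. It assumes the debug window has already been opened and closed with the two pscheduler commands above. The function names, the output filename, and the 5000-line cap are illustrative assumptions ("a few thousand lines"), not part of pScheduler.

```python
from collections import deque
from typing import Iterable, List

# Log path and search string are taken from the message above.
LOG_PATH = "/var/log/pscheduler/pscheduler.log"
MARKER = " scheduler "  # spaces on either side, as requested


def scheduler_lines(lines: Iterable[str], keep_last: int = 5000) -> List[str]:
    """Return at most the last `keep_last` lines that contain MARKER.

    keep_last=5000 is an assumed stand-in for "the last few thousand lines".
    """
    matching = (line for line in lines if MARKER in line)
    # A bounded deque keeps only the newest matches without holding
    # the whole log in memory.
    return list(deque(matching, maxlen=keep_last))


def write_scheduler_excerpt(out_path: str, log_path: str = LOG_PATH,
                            keep_last: int = 5000) -> int:
    """Write the filtered excerpt to out_path and return the line count."""
    with open(log_path, errors="replace") as log:
        excerpt = scheduler_lines(log, keep_last)
    with open(out_path, "w") as out:
        out.writelines(excerpt)
    return len(excerpt)
```

Run as root (the log is usually not world-readable), a call such as write_scheduler_excerpt("scheduler-debug.txt") would produce a file suitable for sending off-list.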

 

--Mark

 

 



