
perfsonar-user - Re: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler

Subject: perfSONAR User Q&A and Other Discussion

  • From: William Abbott <>
  • To: "Pennington, Mike" <>, Mark Feit <>, "" <>
  • Subject: Re: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler
  • Date: Wed, 13 Feb 2019 21:50:47 +0000

Please cc me on the off-list messages, as the mesh is hosted on one of my systems. It's underpowered and is possibly the cause of the database connection pool exhaustion. I'll take a look at the logs if you can tell me which one; there are so many different perfSONAR components that it can be a chore to track things down.

Bill

On 2/13/19 12:35 PM, Pennington, Mike wrote:

I personally haven’t made any changes to this thing in ages, but it is part of a couple of meshes: The Quilt and another one. Let me work on that other stuff and send you the info off-list, thanks!

 


Mike Pennington

CEN | Network Engineer

Hartford CT | 06105-3702

p 860 622 4566

Member Conference - May 10th 2019 – Register:  https://t.co/laqXY47EZl

 

 

From: Mark Feit []
Sent: Wednesday, February 13, 2019 12:17 PM
To: Pennington, Mike ;
Subject: Re: [perfsonar-user] Perfsonar node showing 100% CPU by scheduler

 

Pennington, Mike writes:

 

Also saw this in the pscheduler.log:

 

Feb 13 10:37:32 perfsonar-hartford journal: runner ERROR    84864837: Failed to post run for result: Database connection pool exhausted. Unable to get connection after 60 attempts.

Feb 13 10:37:32 perfsonar-hartford journal: runner ERROR    84828777: Failed to post run for result: Database connection pool exhausted. Unable to get connection after 60 attempts.

 

You have two things happening, which I suspect are both related to having more workload than the machine can handle.

 

The runner problem came up last month, and I discussed some of the under-the-hood implications here:  https://lists.internet2.edu/sympa/arc/perfsonar-user/2019-01/msg00013.html.

 

I can’t say what’s happening with the scheduler other than “it’s very busy,” which isn’t particularly helpful. The scheduler takes its to-do list from the tasks in the database and isn’t prone to doing excess work, so this might be something as simple as one or more tasks being configured with a repeat interval short enough that all of the work is legitimate. If this system is part of a mesh, have there been any changes to its configuration, such as an artificially low repeat interval for some of the tasks or a switch from MeshConfig to pSConfig format? (Andy Lake is working on a bug in the latter that might be related.)

 

Could I ask you to do a couple of things to give me some insight into the problem? As root, run “pscheduler debug on scheduler”, wait 30 seconds, and then run “pscheduler debug off”. Then grep the string ‘ scheduler ’ (with spaces on either side) out of /var/log/pscheduler/pscheduler.log and send me the results off-list. If the file is more than 5-10 MB, just send the last few thousand lines. If the machine is reachable from the outside, please also send me its FQDN and I’ll take a look at what it’s up to through the API.
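
For anyone who would rather script the filtering step than run grep by hand, here is a minimal Python sketch of it. It only does the log filtering described above (keep lines containing ' scheduler ' with a space on either side, and trim to the last few thousand); it does not run the pscheduler debug commands. The function names and the default of 5000 lines are my own assumptions for illustration and are not part of pScheduler.

```python
from collections import deque
from typing import Iterable, List

# The string to look for, with a space on either side as described above.
NEEDLE = " scheduler "


def scheduler_lines(lines: Iterable[str], keep_last: int = 5000) -> List[str]:
    """Return at most keep_last matching lines, keeping the most recent ones.

    keep_last=5000 is an arbitrary stand-in for "the last few thousand lines".
    """
    tail = deque(maxlen=keep_last)  # older matches fall off the front
    for line in lines:
        if NEEDLE in line:
            tail.append(line.rstrip("\n"))
    return list(tail)


def scheduler_lines_from_file(path: str, keep_last: int = 5000) -> List[str]:
    """Apply scheduler_lines() to a log file, tolerating odd bytes in the log."""
    with open(path, errors="replace") as log:
        return scheduler_lines(log, keep_last)
```

Calling scheduler_lines_from_file("/var/log/pscheduler/pscheduler.log") after the debug window has closed should give roughly what the grep would, already trimmed to a size that is reasonable to send by mail.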

 

--Mark

 

 

