
perfsonar-user - Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

  • From: "Garnizov, Ivan" <>
  • To: Phil Reese <>, "" <>
  • Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
  • Date: Wed, 30 Oct 2019 10:24:05 +0000

Hello Phil,

 

With regard to the RabbitMQ presence:

Please note that pS task runs are scheduled 24h ahead with their configuration, meaning you should expect to see “old” run configuration still present.

 

With regard to http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt  <-- log file contents after reboot (no debug)

You should note that the archival of results in the output is not for the successful run, but for some runs that happened earlier… Meaning you still have a big queue of results to be processed, which exhausts the pScheduler archiver resources. Proof of this is again: http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt  <-- log file contents from failed agent with debug started

Please also note that RabbitMQ has a different pool of resources.

 

 

With regards http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json   <-- after reboot, archive is successful

I notice you have removed the pS archival retries; I can see this also in the mesh configuration you shared. IMO this is not a good idea, despite the fact that you see some successful archivals after a restart.

This only suggests to me that you are most likely seeing some exhaustion on the central Esmond archiver. I can imagine all of your MPs running with 15 workers towards the Esmond archiver… or are there only 1 or 2 exceptions with these failures?

 

I would also expect that, among the failing archivals, you are having some successful ones, which would support the thesis of reaching limitations on the Esmond server side.

 

In all cases I would suggest bringing back the pScheduler archival retries, with several attempts spanned over at least 2h (depending on your use case), but in no case intensive retries in the first 15 min. The proper balance you can find after multiple iterations, and it depends on your requirements for the project… Perhaps results older than 10 min are of no use to you?
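A retry schedule along these lines can be sketched in the archive specification's "retry-policy" list, as pScheduler's archiving supports it. The URL below is a placeholder, and the exact attempt counts and ISO 8601 wait durations are only an illustration of "several attempts spanned over at least 2h, no intensive retries in the first 15 min" — tune them to your own deployment:

```json
{
  "archiver": "esmond",
  "data": {
    "url": "https://archive.example.edu/esmond/perfsonar/archive/"
  },
  "retry-policy": [
    { "attempts": 1, "wait": "PT15M" },
    { "attempts": 2, "wait": "PT1H" }
  ]
}
```

Each entry retries the listed number of times with the given wait between attempts, so this example keeps trying for roughly 2h15m after the first failure before the result is dropped.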

 

 

 

Regards,

Ivan Garnizov

 

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

 

 

 

 

 

From: Phil Reese [mailto:]
Sent: Tuesday, 29 October 2019 18:41
To: Garnizov, Ivan (RRZE) <>;
Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

 

Hi Ivan,

Hmm, not so sure what is going on now.

I've removed all mention of the RabbitMQ archive option from the initial .json config file.  You can see the config file here:
http://srcf-ps.stanford.edu/config-file-sups2.json

Since that change, I've had two more hosts stop archiving data.  I've redone the data collection you had originally suggested. 

More temporary files on my web server:
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt  <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-rwc-srtr.txt  <-- more log file with debug running
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt  <-- log file contents after reboot (no debug)

http://srcf-ps.stanford.edu/psched-log0-sups-wc.txt        <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-wc.txt        <-- more log file with debug running

Of note is the inclusion of RabbitMQ archive references in the log file despite the original config file no longer containing any references to RabbitMQ.

http://srcf-ps.stanford.edu/sups-rwc-srtr-log0.json                       <-- example .json file from failed archive attempt
http://srcf-ps.stanford.edu/sups-rwc-srtr-log2-succeeded.json             <-- pscheduler log file shows this HTTP link as 'succeeded', though it really didn't do the archive

http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json   <-- after reboot, archive is successful

http://srcf-ps.stanford.edu/sups-wc-log.json                              <-- .json file showing failure to archive to esmond, no mention of RabbitMQ
http://srcf-ps.stanford.edu/sups-wc-afterreboot.json                      <-- reboot, then successful archive to esmond AND successful archive to RabbitMQ!!!!!!  WHY???

----
It is now looking like having only an esmond archive is still too much for the current level of 'workers' (and we want to add more agents!).  Is there a variable that can be tweaked to increase the number of 'workers' available?  Our agents are small but pretty mighty, i5 CPU and 8 GB of RAM, so they should be capable of more 'workers' than the average agent.

I'd really like to make progress on this issue!  Do let me know if you'd like more logs or more specific lines in the logs or .json files.

I do appreciate your efforts sorting this out with me.

Phil



On 10/29/19 8:08 AM, Garnizov, Ivan wrote:

Yes, we should be on the right track, especially if the rate of the “a full slate of workers” message has dropped.

Still, having only 2 attempts for archival is too small: you are still quite easily/quickly dropping measurement results. I would suggest spreading attempts over 1 day — 2 more attempts at an interval of 1-2h, in addition to the ones you have.

 

Once you reduce the rate of the “full slate of workers” failures, you should also be able to spot more easily another failure, which should be the real cause of the problem. Obviously there is more to it than the exhaustion of pScheduler archiver workers. It may be that not all of the attempts fail, but some still do.

Perhaps there is exhaustion/overload on your Esmond server, if the failure is a timeout.

 

 



