perfsonar-user - Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

Subject: perfSONAR User Q&A and Other Discussion

  • From: Phil Reese <>
  • To: "Garnizov, Ivan" <>, "" <>
  • Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
  • Date: Tue, 29 Oct 2019 10:40:46 -0700

Hi Ivan,

Hmm, not so sure what is going on now.

I've removed all mention of the RabbitMQ archive option from the initial .json config file.  You can see the config file here:
http://srcf-ps.stanford.edu/config-file-sups2.json

Since that change, I've had two more hosts stop archiving data.  I've redone the data collection you had originally suggested. 

More temporary files on my web server:
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt  <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-rwc-srtr.txt  <-- more log file with debug running
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt  <-- log file contents after reboot (no debug)

http://srcf-ps.stanford.edu/psched-log0-sups-wc.txt        <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-wc.txt        <-- more log file with debug running

Of note is the inclusion of RabbitMQ archive references in the log file despite the original config file no longer containing any references to RabbitMQ.

http://srcf-ps.stanford.edu/sups-rwc-srtr-log0.json                       <-- example .json file from failed archive attempt
http://srcf-ps.stanford.edu/sups-rwc-srtr-log2-succeeded.json             <-- pscheduler log file shows this HTTP link as 'succeeded', though it did not actually do the archive
http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json   <-- after reboot, archive is successful

http://srcf-ps.stanford.edu/sups-wc-log.json             <-- .json file showing failure to archive to esmond, no mention of RabbitMQ
http://srcf-ps.stanford.edu/sups-wc-afterreboot.json     <-- reboot, then successful archive to esmond AND successful archive to RabbitMQ!!!!!!  WHY???

----
It now looks as though even a single esmond archive is too much for the current number of 'workers' (and we want to add more agents!).  Is there a setting that can be tweaked to increase the number of 'workers' available?  Our agents are small but pretty mighty (i5 CPU and 8 GB of RAM), so they should be able to handle more 'workers' than the typical agent.

I'd really like to make progress on this issue!  Do let me know if you'd like more logs or more specific lines in the logs or .json files.

I do appreciate your efforts sorting this out with me.

Phil




On 10/29/19 8:08 AM, Garnizov, Ivan wrote:

Yes, we should be heading in the right direction, especially if the “a full slate of workers” message has disappeared.

Still, having only 2 archival attempts is too few; you are still dropping the measurement results quite quickly. I would suggest adding 2 more attempts within a 1-day window, at an interval of 1-2 hours, in addition to the ones you have.
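
To make the effect of the extra attempts concrete, here is a minimal sketch of the arithmetic. It assumes the archive spec carries a "retry-policy" list of entries with "attempts" and an ISO 8601 "wait", as the esmond archiver is understood to accept. The starting policy (2 attempts, 60 seconds apart) is a hypothetical stand-in, since the actual values from the config file are not quoted in this thread, and the window calculation is only a rough estimate of how long a result keeps being retried.

```python
import json
from datetime import timedelta

# Hypothetical current policy: two quick retries (real values not shown in this thread).
existing = [{"attempts": 2, "wait": timedelta(seconds=60)}]

# Suggested addition: two more attempts spaced one hour apart.
suggested = existing + [{"attempts": 2, "wait": timedelta(hours=1)}]


def retry_window(policy):
    """Rough total time a result keeps being retried before it is dropped."""
    return sum((p["wait"] * p["attempts"] for p in policy), timedelta())


def iso8601(duration):
    """Render a timedelta as a whole-second ISO 8601 duration, e.g. PT3600S."""
    return f"PT{int(duration.total_seconds())}S"


def to_retry_policy(policy):
    """Convert the policy into the JSON-ready form assumed for the archive spec."""
    return [{"attempts": p["attempts"], "wait": iso8601(p["wait"])} for p in policy]


print("existing window: ", retry_window(existing))
print("suggested window:", retry_window(suggested))
print(json.dumps({"retry-policy": to_retry_policy(suggested)}, indent=2))
```

Under these assumed numbers the retry window grows from about two minutes to about two hours, which gives an overloaded or briefly unreachable archive much more time to recover before a result is discarded.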

 

Once you reduce the rate of the “full slate of workers” failure, you should also be able to spot another failure more easily, which should be the real cause of the problem. Obviously there is more to it than the exhaustion of pScheduler archiver workers. It might be that not all of the attempts fail, but some still do.

Perhaps your Esmond server is exhausted or overloaded, if the failure is a timeout.

 




