perfsonar-user - Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
Subject: perfSONAR User Q&A and Other Discussion
- From: Phil Reese <>
- To: "Garnizov, Ivan" <>, "" <>
- Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
- Date: Tue, 29 Oct 2019 10:40:46 -0700
Hi Ivan,
Hmm, not so sure what is going on now.
I've removed all mention of the RabbitMQ archive option from the initial .json config file. You can see the config file here:
http://srcf-ps.stanford.edu/config-file-sups2.json
Since that change, I've had two more hosts stop archiving data. I've redone the data collection you had originally suggested.
More temporary files on my web server:
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-rwc-srtr.txt <-- more log file with debug running
http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt <-- log file contents after reboot (no debug)
http://srcf-ps.stanford.edu/psched-log0-sups-wc.txt <-- log file contents from failed agent with debug started
http://srcf-ps.stanford.edu/psched-log2-sups-wc.txt <-- more log file with debug running
Of note is the inclusion of RabbitMQ archive references in the log file despite the original config file no longer containing any references to RabbitMQ.
http://srcf-ps.stanford.edu/sups-rwc-srtr-log0.json <-- example .json file from failed archive attempt
http://srcf-ps.stanford.edu/sups-rwc-srtr-log2-succeeded.json <-- the pscheduler log file shows this HTTP link as 'succeeded', though it didn't actually do the archive
http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json <-- after reboot, archive is successful
http://srcf-ps.stanford.edu/sups-wc-log.json <-- .json file showing failure to archive to esmond, no mention of RabbitMQ
http://srcf-ps.stanford.edu/sups-wc-afterreboot.json <-- reboot, then successful archive to esmond AND successful archive to RabbitMQ!!!!!! WHY???
----
It now looks like even an esmond-only archive is too much for the current number of 'workers' (and we want to add more agents!). Is there a variable that can be tweaked to increase the number of 'workers' available? Our agents are small but pretty mighty, with an i5 CPU and 8 GB of RAM, so they should be able to handle more 'workers' than the typical agent.
I'd really like to make progress on this issue! Do let me know if you'd like more logs or more specific lines in the logs or .json files.
I do appreciate your efforts sorting this out with me.
Phil
Yes, we should be heading in the right direction, especially if the “a full slate of workers” message has stopped appearing.
Still, having only 2 archival attempts is too few. You are dropping the measurement results too easily and too quickly. I would suggest extending the retries to cover up to 1 day by adding 2 more attempts at intervals of 1-2h, in addition to the ones you have.
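For illustration, here is a minimal Python sketch of how such an extended retry schedule could be written out. It assumes the archive definition accepts a "retry-policy" list of objects with "attempts" and "wait" (an ISO 8601 duration) fields; check that against the archiver documentation for your version. The two short "existing" steps are hypothetical placeholders, not the values from the config file linked above.

```python
import json
from datetime import timedelta


def iso_duration(td):
    """Format a timedelta as an ISO 8601 duration in whole seconds, e.g. PT3600S."""
    return f"PT{int(td.total_seconds())}S"


def build_retry_policy(steps):
    """Turn (attempts, wait) pairs into a list of retry-policy entries.

    The field names "attempts" and "wait" are an assumption about the
    archive schema, not something confirmed in this thread.
    """
    return [{"attempts": n, "wait": iso_duration(w)} for n, w in steps]


def total_retry_window(steps):
    """Total time covered by all retries if every attempt fails."""
    return sum((w * n for n, w in steps), timedelta())


# Hypothetical existing policy: two quick retries.
existing = [(1, timedelta(minutes=1)), (1, timedelta(minutes=5))]
# Suggested addition: two more attempts, two hours apart.
extra = [(2, timedelta(hours=2))]

steps = existing + extra
policy = build_retry_policy(steps)

print(json.dumps({"retry-policy": policy}, indent=4))
print("Retry window:", total_retry_window(steps))
```

The printed fragment would go wherever the archive definition keeps its retry settings. Longer waits keep a result queued instead of dropping it, at the cost of holding it on the agent for longer.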
Once you reduce the rate of the “full slate of workers” failure, you should also find it easier to spot the other failure, which should be the real cause of the problem. Obviously there is more to it than the exhaustion of pScheduler archiver workers. It may be that not all of the attempts fail, but some still do.
If the failure is a timeout, perhaps your Esmond server is exhausted or overloaded.
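To check whether the rate of the “full slate of workers” message is really dropping, one option is to count matching log lines per hour. The sketch below assumes syslog-style lines that begin with a timestamp such as "Oct 29 09:01:12". The sample lines are made-up placeholders, not real pScheduler output; only the quoted phrase comes from this thread.

```python
import re
from collections import Counter

PHRASE = "full slate of workers"
# Assumed syslog-style prefix: "Mon DD HH:MM:SS"; captures "Mon DD HH".
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}):\d{2}:\d{2}")


def count_by_hour(lines, phrase=PHRASE):
    """Count lines containing `phrase`, grouped by the hour in their timestamp."""
    counts = Counter()
    for line in lines:
        if phrase not in line:
            continue
        match = STAMP.match(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Placeholder lines for illustration only (not real log output).
sample = [
    "Oct 29 09:01:12 host-a ... a full slate of workers ...",
    "Oct 29 09:30:45 host-a ... a full slate of workers ...",
    "Oct 29 10:02:03 host-a ... some other message ...",
    "Oct 29 10:15:00 host-a ... a full slate of workers ...",
]

hourly = count_by_hour(sample)
for hour, n in sorted(hourly.items()):
    print(hour, n)
```

For a real log, pass an open file object instead of the sample list. Once this count reaches zero, any archive failures that remain point to the other cause.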