perfsonar-user - AW: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
Subject: perfSONAR User Q&A and Other Discussion
- From: "Garnizov, Ivan" <>
- To: Phil Reese <>, "" <>
- Subject: AW: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
- Date: Wed, 30 Oct 2019 10:24:05 +0000
Hello Phil,
Regarding the RabbitMQ presence: please note that pS task runs are scheduled 24 hours ahead, together with their configuration. This means you should expect to still see runs carrying the “old” configuration.
Regarding http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt (log file contents after reboot, no debug): note that the archival of results in that output is not for the successful run shown there, but for runs that happened earlier. This means you still have a large queue of results to be processed, which exhausts the pScheduler archiver's resources. Further evidence of this is http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt (log file contents from the failed agent, with debug started). Please also note that RabbitMQ has a different pool of resources.
Regarding http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json (after reboot, archive is successful): I notice you have removed the pS archival retries. I can also see this from the mesh configuration you shared. IMO this is not a good idea, even though you see some successful archival after a restart. To me this only shows that you are most likely seeing some exhaustion on the central Esmond archiver. I can imagine all of your MPs running with 15 workers towards the Esmond archiver. Or are there only 1 or 2 exceptions with these failures?
I would also expect that, among the failing archival attempts, you have some successful ones, which would support the thesis that you are reaching limits on the Esmond server side.
In any case, I would suggest bringing back the pScheduler archival retries, with several attempts spread over at least 2 hours (depending on your use case), but in no case intensive retries in the first 15 minutes. You will find the proper balance after multiple iterations, and it depends on your requirements for the project. Perhaps results older than 10 minutes are of no use to you?
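To make the retry suggestion concrete, here is a minimal sketch that checks a candidate retry schedule against the two constraints above: a total span of at least 2 hours and only a few retries in the first 15 minutes. The list-of-stages shape (`attempts` plus an ISO 8601 `wait`) is assumed to mirror the archiver retry-policy format; the specific values, the helper names, and the reading of `wait` as the pause before each retry are illustrative assumptions, not settings taken from this thread.

```python
import re
from datetime import timedelta

# Matches simple time-only ISO 8601 durations such as "PT5M" or "PT1H30M".
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text):
    """Parse a time-only ISO 8601 duration into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def retry_offsets(policy):
    """Return the elapsed time of each retry, counted from the first failure.

    Assumption: every retry in a stage is preceded by that stage's wait.
    """
    offsets = []
    elapsed = timedelta()
    for stage in policy:
        wait = parse_duration(stage["wait"])
        for _ in range(stage["attempts"]):
            elapsed += wait
            offsets.append(elapsed)
    return offsets


# Hypothetical policy: sparse early retries, then longer waits.
RETRY_POLICY = [
    {"attempts": 2, "wait": "PT5M"},
    {"attempts": 3, "wait": "PT20M"},
    {"attempts": 2, "wait": "PT30M"},
]

offsets = retry_offsets(RETRY_POLICY)
early = [o for o in offsets if o <= timedelta(minutes=15)]

print("retries:", len(offsets))
print("retries in first 15 min:", len(early))
print("total span:", offsets[-1])
print("spans at least 2h:", offsets[-1] >= timedelta(hours=2))
```

Adjusting the stages and re-running the check is one way to iterate towards a balance that fits the project's requirements, for example dropping the later stages if results older than a few minutes are of no use.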
Regards, Ivan Garnizov
GEANT WP6T3: pS development team GEANT WP7T1: pS deployments GN Operations GEANT WP9T2: Software governance in GEANT
From: Phil Reese [mailto:]
Hi Ivan, On 10/29/19 8:08 AM, Garnizov, Ivan wrote: