
perfsonar-user - Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

  • From: Phil Reese <>
  • To: "Garnizov, Ivan" <>, "" <>
  • Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
  • Date: Wed, 30 Oct 2019 15:05:21 -0700

Hi Ivan,

I think this long-running shaggy dog story is coming to a close at last!  The grid has been stable, but for those two systems yesterday, for almost two days!

Now, if I might, I'd like to ask you for some clarification on a few points.

The forward scheduling of runs certainly explains why RabbitMQ archiving was still taking place from a few agents.  I also see what you mean by the log message 'success' and the archiving being two different processes, which can be independent of each other.

Now my questions.  You suggest RabbitMQ uses a different pool of resources, yet it was only when I got rid of the RabbitMQ archiving that things returned to normal.  It seems to me there is a shared worker pool, which I'd guess is on the agent, not the database system.  Do I have that wrong?

This comment -- "I notice you have removed the pS archival retries. I can see this also from the mesh configuration you shared. IMO this is not a good idea, despite the fact that you have some successful archival on restart." -- has me scratching my head.  I didn't change anything about the default perfSONAR agent or MaDDash configs; the effort was only to remove the RabbitMQ archiving from the core .json file (http://srcf-ps.stanford.edu/config-file-sups2.json).  Looking at the Archiver page in the docs, I do see how to add retry options (note that the provided Examples and Skeleton .json files don't use the retry stanza).  Missing a data point, when we collect so many, doesn't bother me too much.  Would adding the retry stanza potentially make the lack of 'worker' processes worse?
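For concreteness, here is roughly what I understand a retry stanza would look like if I added it to the esmond archive entry in that .json (just a sketch based on my reading of the Archiver docs; the host, token, and agent address below are placeholders, not my real values):

    "archives": {
        "archive_esmond": {
            "archiver": "esmond",
            "data": {
                "url": "https://ARCHIVE-HOST/esmond/perfsonar/archive/",
                "_auth-token": "PLACEHOLDER-API-KEY",
                "measurement-agent": "AGENT-ADDRESS",
                "retry-policy": [
                    { "attempts": 2, "wait": "PT10M" },
                    { "attempts": 2, "wait": "PT30M" }
                ]
            }
        }
    }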

Yes, I know we're making the Cassandra/esmond server work hard, but it should be able to handle it.  Htop shows a load average of 2.12 1.18 0.67 (1 min, 5 min, 15 min), memory use is 13G of 64G, and there are 14 cores / 28 threads in the single-CPU Dell server.

We currently have 17 agents running owamp packet-loss tests to all the others every 5 minutes, plus the usual ping, throughput, and traceroute tests.  This is a lot of tests, and we want to add another 7-10 agents, AND we'd like to pass all the data to RabbitMQ!
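For reference, the RabbitMQ archive entry we'd want to add back would be along these lines (a rough sketch only; the host, exchange, and routing key are placeholders, and the exact field names should be double-checked against the pScheduler RabbitMQ archiver docs):

    "archive_rabbit": {
        "archiver": "rabbitmq",
        "data": {
            "_url": "amqp://USER:PASSWORD@RABBIT-HOST/VHOST",
            "exchange": "perfsonar",
            "routing-key": "perfsonar.results",
            "retry-policy": [
                { "attempts": 2, "wait": "PT10M" }
            ]
        }
    }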

Is there a variable that dictates the number of archiving workers a system can use?  What are the risks of bumping up that variable (assuming there is one)?

Thank you very much for your patience with me in going through my problem.  I know you respond to many questions each day, so your efforts, day after day, on my problem are very much appreciated.

Phil





On 10/30/19 3:24 AM, Garnizov, Ivan wrote:

Hello Phil,


With regard to the RabbitMQ presence:

Please note that pS task runs are scheduled 24h ahead with their configuration, meaning you should expect to see “old” run configurations still present.


With regard to http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt  <-- log file contents after reboot (no debug)

You should note that the archival of results is not for the successful run in the output, but for some runs that happened earlier, meaning you still have a big queue of results to be processed, which exhausts the pScheduler archiver resources.  Proof of this is again: http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt  <-- log file contents from the failed agent with debug enabled

Please also note that the RabbitMQ archiver has a different pool of resources.


With regard to http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json   <-- after reboot, the archive is successful

I notice you have removed the pS archival retries. I can see this also in the mesh configuration you shared. IMO this is not a good idea, despite the fact that you have some successful archivals on restart.

This only suggests to me that you are most likely seeing some exhaustion on the central Esmond archiver. I can imagine all of your MPs running with 15 workers towards the Esmond archiver... or are there only 1 or 2 exceptions with these failures?


I would also expect that among the failing archivals you are having some successful ones, which should support the thesis that you are reaching limitations on the Esmond server side.


In all cases I would suggest bringing back the pScheduler archival retries, with several attempts spread over at least 2h (depending on your use case), but in no case intensive retries in the first 15 min. You can find the proper balance after multiple iterations, and it depends on your requirements for the project... perhaps results older than 10 min are of no use to you?
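As an illustration only (tune the numbers to your own needs and check the syntax against the Archiver docs), a retry policy along these lines would match what I describe: nothing intensive in the first 15 min, with the attempts spread over a bit more than 2h:

    "retry-policy": [
        { "attempts": 1, "wait": "PT15M" },
        { "attempts": 2, "wait": "PT30M" },
        { "attempts": 1, "wait": "PT1H" }
    ]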





