
[perfsonar-user] Modest sized grid has agent failure to archive once or twice a day


  • From: "Garnizov, Ivan" <>
  • To: Phil Reese <>, "" <>
  • Subject: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
  • Date: Thu, 31 Oct 2019 15:27:11 +0000

Hello Phil,

 

Regarding documentation on configuring pScheduler to make multiple archiving attempts, please check http://docs.perfsonar.net/pscheduler_ref_archivers.html#pscheduler-ref-archivers-archivers-esmond-data. A sketch of such a configuration follows below.
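As an illustrative sketch only (the URL, token, agent address, and intervals are placeholders, not values from your deployment), an esmond archiver spec with a retry policy could look like this:

    {
        "archiver": "esmond",
        "data": {
            "url": "https://archive.example.net/esmond/perfsonar/archive/",
            "_auth-token": "REPLACE-WITH-YOUR-API-KEY",
            "measurement-agent": "192.0.2.10",
            "retry-policy": [
                { "attempts": 2, "wait": "PT10M" },
                { "attempts": 2, "wait": "PT30M" },
                { "attempts": 1, "wait": "PT1H" }
            ]
        }
    }

Each retry-policy entry gives a number of attempts and an ISO 8601 wait between them, so the schedule above spreads its retries over roughly two hours.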

I would not expect adding more retries to ease the operation of pScheduler, but it will certainly help it catch up with the volume of archival requests.

Interestingly enough, you initially had retries configured, but they were spaced mere seconds apart, which did the pScheduler operation more harm than good.

 

Regarding the separate RabbitMQ workers: I am not 100% sure, but the log message suggests it. I will leave that for a comment from Mark Feit.

As far as I am aware, there is no exposed parameter for the number of pScheduler archiver workers, but this has been tested extensively, from what I recall even with a reduced number of workers on a somewhat busy system. One of the important factors was to reduce the impact on system load and performance, keeping in mind that pScheduler runs on a measurement instrument, i.e. the pS toolkit server itself.

 

Don’t worry about the length of the thread or the time it takes. My slowness stems from sharing my effort across other priorities, not from the number of questions to answer.

 

Regards,

Ivan Garnizov

 

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

 

 

 

From: Phil Reese [mailto:preese@stanford.edu]
Sent: Wednesday, 30 October 2019 23:05
To: Garnizov, Ivan (RRZE) <>;
Subject: Re: AW: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

 

Hi Ivan,

I think this long-running shaggy dog story is coming to a close at last!  The grid has been stable, except for those two systems yesterday, for almost two days!

Now, if I might, I'd like to ask you for some clarification on a few points.

The forward scheduling of runs certainly explains why RabbitMQ archiving was still taking place from a few agents.  I also see what you mean about the 'success' log message and the archiving being two different processes, which can be independent of each other.

Now my questions.  You suggest RabbitMQ uses a different pool of resources, yet it was only when I got rid of the RabbitMQ archiving that things returned to normal.  It seems to me there is a shared worker pool, which I'd guess is on the agent, not the database system.  Do I have that wrong?

This comment-- "I notice you have removed the pS archival retries. I can see this also from the mesh configuration you shared. IMO this is not a good idea, despite the fact that you have some successful archival on restart." --has me scratching my head.  I didn't change anything in the default perfSONAR agent or MaDDash configs; the only change was removing the RabbitMQ archiving from the core .json file (http://srcf-ps.stanford.edu/config-file-sups2.json).  Looking at the Archiver page in the docs, I do see how to add retry options (note that the provided Examples and Skeleton .json files don't use the retry stanza).  Missing a data point, when we collect so many, doesn't bother me too much.  Would adding the retry stanza potentially make the lack of 'worker' processes worse?

Yes, I know we're making the Cassandra/esmond server work hard, but it should be able to handle it.  htop shows a load average of 2.12 / 1.18 / 0.67 (1, 5, 15 min), memory use is 13 GB of 64 GB, and the single-CPU Dell server has 14 cores / 28 threads.

We currently have 17 agents running owamp packet-loss tests to all the others every 5 minutes, plus the usual ping, throughput, and traceroute tests.  This is a lot of tests, we want to add another 7-10 agents, AND we'd like to pass all the data to RabbitMQ!
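Some back-of-the-envelope arithmetic on that volume (assuming one archival per test run): a full mesh of 17 agents is 17 × 16 = 272 directed owamp pairs, so one run per pair every 5 minutes already means about 54 archive writes per minute before the ping, throughput, and traceroute results are counted.  Growing to 24-27 agents would more than double the pair count (e.g. 26 × 25 = 650).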

Is there a variable which dictates the number of archiving workers a system can use?  What are the risks of bumping up that variable (assuming there is one)?

Thank you very much for your patience in going through my problem.  I know you respond to many questions each day, so your efforts, day after day, on my problem are very much appreciated.

Phil




On 10/30/19 3:24 AM, Garnizov, Ivan wrote:

Hello Phil,

 

Regarding the RabbitMQ presence:

Please note that pS task runs are scheduled 24h ahead with their configuration, meaning you should expect to see “old” run configurations still present.

 

Regarding http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr-afterreboot.txt  <-- log file contents after reboot (no debug)

Note that the archival of results shown there is not for the successful run in the output but for runs that happened earlier… meaning you still have a big queue of results to be processed, which exhausts the pScheduler archiver resources.  Proof of this, again: http://srcf-ps.stanford.edu/psched-log0-sups-rwc-srtr.txt  <-- log file contents from failed agent with debug started

Please also note that RabbitMQ has a different pool of resources.

 

 

Regarding http://srcf-ps.stanford.edu/sups-rwc-srtr-log0-afterreboot-success.json   <-- after reboot, archive is successful

I notice you have removed the pS archival retries. I can see this also from the mesh configuration you shared. IMO this is not a good idea, despite the fact that you have some successful archival on restart.

This only confirms to me that you are most likely seeing some exhaustion on the central Esmond archiver. I can imagine all of your MPs running with 15 workers each towards the Esmond archiver… or are there only 1 or 2 exceptions with these failures?

 

I would also expect that among the failing archivals you are having some successful ones, which should support the thesis of reaching limitations on the Esmond server side.

 

In any case, I would suggest bringing back the pScheduler archival retries, with several attempts spread over at least 2h (depending on your use case), but by no means intensive retries within the first 15 min. You can find the proper balance after multiple iterations, and it depends on your requirements for the project… perhaps results older than 10 min are of no use to you?
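To make that concrete, a purely illustrative retry-policy (the intervals are placeholders to adapt) with no attempts in the first 15 min and the rest spread over about 2h could be:

    "retry-policy": [
        { "attempts": 1, "wait": "PT15M" },
        { "attempts": 2, "wait": "PT30M" },
        { "attempts": 1, "wait": "PT45M" }
    ]

(15 + 30 + 30 + 45 minutes = 2h in total; adjust the waits to how long your results stay useful.)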

 

 



