
Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day


Chronological Thread 
  • From: "Garnizov, Ivan" <>
  • To: Phil Reese <>, "" <>
  • Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
  • Date: Fri, 25 Oct 2019 13:28:34 +0000

Hello Phil,

 

Thanks for the info.

It appears the archive settings in your mesh configuration are causing you trouble:

      "archiver_data": {
        "retry-policy": [
          {
            "attempts": 5,
            "wait": "PT1S"
          },
          {
            "attempts": 5,
            "wait": "PT3S"
          }
        ],

 

This says that pScheduler will try to submit the results 5 times at 1-second intervals, then another 5 times at 3-second intervals, and then give up.

This means every run starts 5 archiving attempts almost immediately. My expectation is that this quickly exhausts the archiver workers, which leads to the failures you are observing.

 

I would suggest spreading the archival attempts at least 5 minutes apart and letting them expire after one day, so that you minimize data loss in case the pS MA experiences a short outage.
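
As a rough sketch only (these particular attempt counts and intervals are illustrative, not values taken from your configuration, so adjust them to your needs), a retry policy along those lines might look like this:

        "retry-policy": [
          {
            "attempts": 6,
            "wait": "PT5M"
          },
          {
            "attempts": 23,
            "wait": "PT1H"
          }
        ],

This would retry 6 times at 5-minute intervals and then 23 more times at hourly intervals, so the attempts span roughly one day before pScheduler gives up on the result.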

 

I hope this helps.

 

 

Regards,

Ivan Garnizov

 

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

 

 

 

 

 

From: Phil Reese [mailto:]
Sent: Thursday, 24 October 2019 19:48
To: Garnizov, Ivan (RRZE) <>;
Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

 

Hi Ivan,

I've gathered some additional info that you've asked about.

Instead of copying the long curl output files, I've put them on a public web site for a short period of time for this debug process.  The URLs are included as part of the debug info below. (Firefox has a built-in JSON file viewer.)

The first set is from a failed agent, without debug enabled in pscheduler.log. The second is a set of lines from the pscheduler.log file while archiving was failing, with debug turned on.

It seems there might be something going on around the 'full slate of workers' lines, but I've not heard of or read anything about that.

Phil




Example of a few failed archiving attempts:
Line from /var/log/pscheduler/pscheduler.log file:
Oct 24 08:38:34 sups-mc journal: runner INFO     953284: Running https://sups-mc.stanford.edu/pscheduler/tasks/1503a972-9601-4569-bcf4-99237a29338c/runs/1f68ad26-6b21-43e9-9d02-4c4fc940ccab
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc1.json

Oct 24 08:34:56 sups-mc journal: runner INFO     952164: Posted result to https://sups-mc.stanford.edu/pscheduler/tasks/248b1c84-e4cc-4517-8d0c-9e1c42eaa446/runs/ef2a4016-e2c3-4919-b13c-4d3f060ea2b4
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc2.json

Debug info in log file, central archiving not working:
Oct 24 09:07:07 sups-mc journal: runner INFO     953390: Running https://sups-mc.stanford.edu/pscheduler/tasks/1503a972-9601-4569-bcf4-99237a29338c/runs/b5ec65ac-4600-4b5e-ab76-4e351018fd90
Oct 24 09:07:07 sups-mc journal: runner INFO     953390: With traceroute: trace --dest sups-wc.stanford.edu --source sups-mc.stanford.edu
Oct 24 09:07:07 sups-mc journal: runner INFO     953390: Run succeeded.
Oct 24 09:07:07 sups-mc journal: archiver DEBUG    Notifications: archiving_change
Oct 24 09:07:07 sups-mc journal: archiver DEBUG    Already have a full slate of workers.
Oct 24 09:07:07 sups-mc journal: archiver DEBUG    Waiting 15.0 for change or notification
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc3a.json


Restarted 'pscheduler-archiver'

Oct 24 09:59:21 sups-mc journal: runner INFO     952197: Posted result to https://sups-mc.stanford.edu/pscheduler/tasks/2b655cda-652c-4d2f-9ea7-20a64e251334/runs/9628ab76-0586-4207-841b-a002927fb92f
Short term public page for curl output:  http://srcf-ps.stanford.edu/sups-mc3-fix.json

Debug on, archiving working:
LOTS of output, ending as follows:
denominator': 600, 'numerator': 0}, 'event-type': 'packet-loss-rate'}, {'val': {'href': 'https://sups-mc.stanford.edu/pscheduler/tasks/92de8a33-74c0-4825-a8b8-9fdf4ff30bb1/runs/8dba5b88-b0cb-4de6-909b-ad21fbd026ca'}, 'event-type': 'pscheduler-run-href'}]}]}
Oct 24 10:03:46 sups-mc journal: archiver DEBUG    1875996: Returned JSON from archiver: {u'succeeded': True}
Oct 24 10:03:46 sups-mc journal: archiver DEBUG    1875996: Succeeded: 8dba5b88-b0cb-4de6-909b-ad21fbd026ca to esmond
Oct 24 10:03:46 sups-mc journal: archiver DEBUG    1875996: Thread finished
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc-4-debug.json

--------
Another failed agent's log output, catching the enabling of debug:

Oct 24 10:27:39 sups-moa-west journal: runner INFO     554831: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/d347205e-e47e-4629-b5d4-0822e6c537ad/runs/4b4c0ac5-6406-4678-9170-518dffc70c84
Oct 24 10:27:39 sups-moa-west journal: runner INFO     554816: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/68945f37-9b40-47e2-a04d-f048c0437b16/runs/0b396fac-ad0a-4099-91b8-45a7426a5bc2
Oct 24 10:27:48 sups-moa-west journal: archiver DEBUG    Debug started
Oct 24 10:27:54 sups-moa-west journal: runner INFO     554760: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/77573184-2467-4d19-a026-593b2a2a017f/runs/1f44f98e-7350-4918-9c67-d65bd67d5067
(Short term public page for curl output:    http://srcf-ps.stanford.edu/sups-moa-west-fail.json)
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Notifications: archiving_change
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Already have a full slate of workers.
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Pool esmond: Drained
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Pool rabbitmq: Drained
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Waiting 15.0 for change or notification

Oct 24 10:27:54 sups-moa-west journal: runner INFO     554795: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/9d4fcda0-6637-4a92-a3d9-f6d9e11f3d54/runs/c44dcfd4-f5d7-452a-bd41-739e581b5d9a
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Notifications: archiving_change
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Already have a full slate of workers.
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG    Waiting 15.0 for change or notification

This last stanza repeats, as listed, for all remaining tests in the log file.

----------

On 10/24/19 3:56 AM, Garnizov, Ivan wrote:

Hello Phil,

 

It would be great if you could share the output of a run with a failure to archive:

curl -k https://<pscheduler-run-address>

 

I presume there is more clarification in there than the mere 'false' statement.

 

You could also start some diagnostics for the pscheduler-archiver by running: pscheduler debug on archiver

and look for the extended logging in /var/log/pscheduler/pscheduler.log

 

 

Regards,

Ivan Garnizov

 

GEANT WP6T3: pS development team

GEANT WP7T1: pS deployments GN Operations

GEANT WP9T2: Software governance in GEANT

 

 

 

 

 

-----Original Message-----
From: [] On Behalf Of Phil Reese
Sent: Wednesday, 23 October 2019 21:01
To:
Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day

 

Hi,

 

I'm still having this annoying problem.  Any thoughts or suggestions for log files to look into?

 

Hosts seem to run for about 2 days and then fail with this error, which is corrected by restarting pscheduler-archiver (though sometimes more often than every 2 days).  With 17 agents, this error happens at least once a day.

 

I guess the good news is that the data seems to remain intact on the MaDDash page, once the agent's pscheduler-archiver is restarted.

 

Thanks,

Phil

 

 

 

On 10/18/19 11:28 AM, Phil Reese wrote:

> The problem is that at seemingly random times a host will stop
> reporting data to the MaDDash esmond database, from all tests. This
> shows as an orange line of boxes on each of the MaDDash grids.
> The guaranteed fix is to log into the host and 'systemctl restart
> pscheduler-archiver', this has ALWAYS solved the problem.

 

 

 



