perfsonar-user - Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
Subject: perfSONAR User Q&A and Other Discussion
- From: Phil Reese <>
- To: "Garnizov, Ivan" <>, "" <>
- Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
- Date: Thu, 24 Oct 2019 10:47:54 -0700
Hi Ivan,
I've gathered the additional info you asked about.
Instead of copying the long curl output files, I've put them on a public web site for a short period of time for this debug process. The URLs are included as part of the debug info below. (Firefox has a built-in JSON viewer.)
The first set is from a failed agent, taken without looking closely at pscheduler.log with debug enabled. The second is a set of lines from the pscheduler.log file while archiving was failing, with debug turned on.
Seems there might be something around the 'full slate of workers' lines but I've not heard of or read anything about that.
Phil
Example of a few failed archiving attempts:
Line from /var/log/pscheduler/pscheduler.log file:
Oct 24 08:38:34 sups-mc journal: runner INFO 953284: Running https://sups-mc.stanford.edu/pscheduler/tasks/1503a972-9601-4569-bcf4-99237a29338c/runs/1f68ad26-6b21-43e9-9d02-4c4fc940ccab
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc1.json
Oct 24 08:34:56 sups-mc journal: runner INFO 952164: Posted result to https://sups-mc.stanford.edu/pscheduler/tasks/248b1c84-e4cc-4517-8d0c-9e1c42eaa446/runs/ef2a4016-e2c3-4919-b13c-4d3f060ea2b4
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc2.json
Debug info in log file, central archiving not working:
Oct 24 09:07:07 sups-mc journal: runner INFO 953390: Running https://sups-mc.stanford.edu/pscheduler/tasks/1503a972-9601-4569-bcf4-99237a29338c/runs/b5ec65ac-4600-4b5e-ab76-4e351018fd90
Oct 24 09:07:07 sups-mc journal: runner INFO 953390: With traceroute: trace --dest sups-wc.stanford.edu --source sups-mc.stanford.edu
Oct 24 09:07:07 sups-mc journal: runner INFO 953390: Run succeeded.
Oct 24 09:07:07 sups-mc journal: archiver DEBUG Notifications: archiving_change
Oct 24 09:07:07 sups-mc journal: archiver DEBUG Already have a full slate of workers.
Oct 24 09:07:07 sups-mc journal: archiver DEBUG Waiting 15.0 for change or notification
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc3a.json
Restarted 'pscheduler-archiver'
Oct 24 09:59:21 sups-mc journal: runner INFO 952197: Posted result to https://sups-mc.stanford.edu/pscheduler/tasks/2b655cda-652c-4d2f-9ea7-20a64e251334/runs/9628ab76-0586-4207-841b-a002927fb92f
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc3-fix.json
Debug on, archiving working:
LOTS of output, ending as follows:
denominator': 600, 'numerator': 0}, 'event-type': 'packet-loss-rate'}, {'val': {'href': 'https://sups-mc.stanford.edu/pscheduler/tasks/92de8a33-74c0-4825-a8b8-9fdf4ff30bb1/runs/8dba5b88-b0cb-4de6-909b-ad21fbd026ca'}, 'event-type': 'pscheduler-run-href'}]}]}
Oct 24 10:03:46 sups-mc journal: archiver DEBUG 1875996: Returned JSON from archiver: {u'succeeded': True}
Oct 24 10:03:46 sups-mc journal: archiver DEBUG 1875996: Succeeded: 8dba5b88-b0cb-4de6-909b-ad21fbd026ca to esmond
Oct 24 10:03:46 sups-mc journal: archiver DEBUG 1875996: Thread finished
Short term public page for curl output: http://srcf-ps.stanford.edu/sups-mc-4-debug.json
--------
Another failed agent's log output, catching the enabling of debug:
Oct 24 10:27:39 sups-moa-west journal: runner INFO 554831: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/d347205e-e47e-4629-b5d4-0822e6c537ad/runs/4b4c0ac5-6406-4678-9170-518dffc70c84
Oct 24 10:27:39 sups-moa-west journal: runner INFO 554816: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/68945f37-9b40-47e2-a04d-f048c0437b16/runs/0b396fac-ad0a-4099-91b8-45a7426a5bc2
Oct 24 10:27:48 sups-moa-west journal: archiver DEBUG Debug started
Oct 24 10:27:54 sups-moa-west journal: runner INFO 554760: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/77573184-2467-4d19-a026-593b2a2a017f/runs/1f44f98e-7350-4918-9c67-d65bd67d5067
(Short term public page for curl output: http://srcf-ps.stanford.edu/sups-moa-west-fail.json)
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Notifications: archiving_change
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Already have a full slate of workers.
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Pool esmond: Drained
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Pool rabbitmq: Drained
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Waiting 15.0 for change or notification
Oct 24 10:27:54 sups-moa-west journal: runner INFO 554795: Posted result to https://sups-moa-west.stanford.edu/pscheduler/tasks/9d4fcda0-6637-4a92-a3d9-f6d9e11f3d54/runs/c44dcfd4-f5d7-452a-bd41-739e581b5d9a
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Notifications: archiving_change
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Already have a full slate of workers.
Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Waiting 15.0 for change or notification
This last stanza repeats, as listed, for all remaining tests in the log file.
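The repeating stanza above can be spotted mechanically. What follows is a minimal Python sketch, not part of pScheduler, that assumes the log line layout shown in this message (timestamp, host, "journal:", program, level, text). It counts the 'full slate of workers' messages logged by the archiver since its last 'Succeeded:' line. The helper name count_stalled_notices and the sample lines are illustrative assumptions; whether a given count really indicates a stuck archiver is a guess based only on the excerpts in this thread.

```python
import re

# Matches lines such as:
# "Oct 24 09:07:07 sups-mc journal: archiver DEBUG Already have a full slate of workers."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+journal:\s+"
    r"(?P<prog>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def count_stalled_notices(lines):
    """Count 'full slate' messages from the archiver since its last success.

    Hypothetical helper: a growing count with no intervening 'Succeeded:'
    line resembles the failing state described in this thread.
    """
    stalled = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("prog") != "archiver":
            continue  # skip runner lines and anything that does not parse
        msg = m.group("msg")
        if "Already have a full slate of workers" in msg:
            stalled += 1
        elif "Succeeded:" in msg:
            stalled = 0  # an archiving completed, so reset the counter
    return stalled


# Sample lines copied from the log excerpt in this message.
sample = [
    "Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Notifications: archiving_change",
    "Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Already have a full slate of workers.",
    "Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Waiting 15.0 for change or notification",
    "Oct 24 10:27:54 sups-moa-west journal: archiver DEBUG Already have a full slate of workers.",
]
print(count_stalled_notices(sample))  # prints 2
```

To try it on a real host, pass the lines of /var/log/pscheduler/pscheduler.log (with archiver debug turned on) instead of the sample list.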
----------
Hello Phil,
It would be great if you could share the output of a run that failed to archive:
curl -k https://<pscheduler-run-address>
I presume it contains more detail than the bare 'false' statement.
You could also start some diagnostics for the pscheduler-archiver by running: pscheduler debug on archiver
and then look for the extended logging in /var/log/pscheduler/pscheduler.log
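For anyone who prefers a script to the curl request, here is a rough Python equivalent. It is only a sketch using the standard library: the function names insecure_context and fetch_run are made up for illustration, and it assumes nothing about the fields in the returned JSON. Certificate checks are switched off to mirror curl's -k option, so it is suitable for debugging only.

```python
import json
import ssl
import urllib.request


def insecure_context():
    """Build a TLS context that skips certificate checks, like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_run(url, timeout=10):
    """Fetch a pScheduler run URL and return the decoded JSON document."""
    with urllib.request.urlopen(
        url, context=insecure_context(), timeout=timeout
    ) as resp:
        return json.load(resp)
```

Calling print(json.dumps(fetch_run(run_url), indent=2)) with run_url set to a run address taken from the "Posted result to" log lines should give readable output that can be pasted into a reply.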
Regards,
Ivan Garnizov
GEANT WP6T3: pS development team
GEANT WP7T1: pS deployments GN Operations
GEANT WP9T2: Software governance in GEANT
-----Original Message-----
From: [] On behalf of Phil Reese
Sent: Wednesday, 23 October 2019 21:01
To:
Subject: Re: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
Hi,
I'm still having this annoying problem. Any thoughts or suggestions for log files to look into?
Hosts seem to run for about 2 days and then fail with this error, which is corrected by restarting pscheduler-archiver (though sometimes it happens more often than every 2 days). With 17 agents, this error happens at least once a day.
I guess the good news is that the data seems to remain intact on the MaDDash page, once the agent's pscheduler-archiver is restarted.
Thanks,
Phil
On 10/18/19 11:28 AM, Phil Reese wrote:
> The problem is that at seemingly random times a host will stop
> reporting data to the MaDDash esmond database, from all tests. This
> shows as an orange line of boxes on each of the MaDDash grids.
>
> The guaranteed fix is to log into the host and 'systemctl restart
> pscheduler-archiver', this has ALWAYS solved the problem.
>