perfsonar-user - [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
- From: Phil Reese <>
- To: "" <>
- Subject: [perfsonar-user] Modest sized grid has agent failure to archive once or twice a day
- Date: Fri, 18 Oct 2019 11:28:52 -0700
Hi,
The project we've been working on has 17 agents, with another 10+ in the offing.
We are having an issue that we'd like to solve before deploying any more agents.
First of all, all agents are running CentOS 7.7 and perfSONAR 4.2.2, and have auto-update enabled.
We're using this grid to observe how traffic behaves within our campus's backbone paths, placing agents at the typical points where campus users' traffic connects to the backbone. The idea is that we can then see issues across the whole network at the same time as, or hopefully before, the community sees packet loss or throughput issues.
The implication of this is that we've lowered the interval of the loss test to 5 minutes (the default for this test is 30 minutes). Each agent runs 12 loss tests per hour across the 17x17 grid, plus all the other tests on their default schedules. The typical line count from running 'pscheduler schedule | wc -l' is ~1350 lines; each test takes up 5 lines, so that's about 270 tests on each of the 17 agents. The grid is busy. A pscheduler schedule .png file shows this busyness but doesn't suggest many, if any, blocked tests.
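The arithmetic above can be sanity-checked quickly. A minimal sketch (the 1350-line figure stands in for live 'pscheduler schedule | wc -l' output, and assumes each schedule entry occupies 5 lines as observed):

```shell
# Back-of-the-envelope check of the schedule load described above.
lines=1350                          # e.g. from: pscheduler schedule | wc -l
echo "tests in schedule: $((lines / 5))"
# A 5-minute loss interval means 12 runs/hour toward each of the other 16 agents.
echo "loss runs per hour per agent: $((16 * 12))"
```

This prints 270 scheduled tests and 192 loss runs per hour per agent, consistent with the "busy but not blocked" picture.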
The problem is that, at seemingly random times, a host will stop reporting data from all tests to the MaDDash esmond database. This shows as an orange line of boxes on each of the MaDDash grids.
The guaranteed fix is to log into the host and run 'systemctl restart pscheduler-archiver'; this has ALWAYS solved the problem.
After a while this gets old, of course. We want to know why it happens and how to avoid it going forward.
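Until the root cause is found, the manual restart could be automated. A hypothetical cron watchdog (an untested sketch, not a fix): it keys off the "Checking archiving... Failed" output that 'pscheduler troubleshoot' produces in this failure state, so the grep pattern may need adjusting for a given pScheduler version's exact wording.

```shell
#!/bin/sh
# Hypothetical watchdog (run from cron every ~15 min): restart the
# archiver whenever 'pscheduler troubleshoot' reports an archiving
# failure. The pattern matches "Checking archiving... Failed." as seen
# in the troubleshoot output quoted below.
if pscheduler troubleshoot 2>&1 | grep -q 'archiving.*Failed'; then
    systemctl restart pscheduler-archiver
    logger -t archiver-watchdog "pscheduler-archiver restarted"
fi
```

This papers over the symptom rather than explaining it, but it would at least keep the MaDDash rows from going orange overnight.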
Below is a dump of as much data as I could come up with. Happy to provide more info or logs as directed.
Thanks,
Phil
------
Checking the /var/log/pscheduler/pscheduler.log file shows that the tests are being successfully logged to the local agent's database but not to the central database. Looking at the http:// line from a test shows that as well: the tests are being run just fine, but 'archivings:0:archived:' and 'archivings:0:completed:' are both 'false'.
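Those flags can be pulled out of a run's JSON directly. A minimal sketch, assuming the run document fetched from the http:// run URL exposes the archivings[0].archived / archivings[0].completed fields quoted above (the inline sample below mirrors the failing state; in practice the JSON would come from something like `curl -sk "$RUNURL"`):

```shell
# Print the archiving state of each archiver attached to a run,
# reading the run JSON from stdin (here, a sample failing document).
python3 -c '
import json, sys
run = json.load(sys.stdin)
for a in run.get("archivings", []):
    print("archived:", a.get("archived"), "completed:", a.get("completed"))
' <<'EOF'
{"archivings": [{"archived": false, "completed": false}]}
EOF
```

On a healthy agent both flags would come back true; a stream of false/false pairs is a quick way to spot the stall without reading the full log.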
The /var/log/perfsonar files show no WARN or ERROR messages in any of the psconfig-pscheduler-agent-* log files.
Running 'pscheduler troubleshoot' on the failed agents shows all good results except for the last archiving step, where the result is:
Checking archiving... Failed.
Archiving never completed.

Moving to the central management host now:
Nothing new since 9/22 in /var/log/cassandra/cassandra.log
No WARN or ERROR messages in system.log
No entries in django.log since 10/3
esmond.log is empty
No files in crashlog/
In the /var/log/httpd directory, the point at which a stopped agent's entries cease seems clear from the timestamps:
10.127.57.170 - - [18/Oct/2019:07:26:56 -0700] "PUT /esmond/perfsonar/archive/43df6618f5b74669baca4ea5d8f9f4b5/? HTTP/1.1" 201 2
10.127.57.170 - - [18/Oct/2019:07:27:00 -0700] "PUT /esmond/perfsonar/archive/e0ad3c64fe814d099482ec24f4961ed8/? HTTP/1.1" 201 2 <<--- seems like the time the agent failed
10.127.57.170 - - [18/Oct/2019:07:32:09 -0700] "GET /psconfig/sups2.json HTTP/1.1" 200 3479 <<-- not sure what these suggest
10.127.57.170 - - [18/Oct/2019:08:32:24 -0700] "GET /psconfig/sups2.json HTTP/1.1" 200 3479
10.127.57.170 - - [18/Oct/2019:09:32:49 -0700] "GET /psconfig/sups2.json HTTP/1.1" 200 3479
10.127.57.170 - - [18/Oct/2019:10:33:04 -0700] "GET /psconfig/sups2.json HTTP/1.1" 200 3479
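Finding that cutoff point per agent can be scripted: the timestamp of the last esmond PUT from each client IP marks roughly when that agent's archiver went quiet. A small awk sketch, shown here against the excerpt above (against a live server it would read the Apache access log instead, whose exact filename varies by setup):

```shell
# Report the last successful esmond PUT per client IP; $1 is the client
# address and $4 the request timestamp in Apache's common log format.
awk '/PUT \/esmond/ { last[$1] = $4 } END { for (ip in last) print ip, last[ip] }' <<'EOF'
10.127.57.170 - - [18/Oct/2019:07:26:56 -0700] "PUT /esmond/perfsonar/archive/43df6618f5b74669baca4ea5d8f9f4b5/? HTTP/1.1" 201 2
10.127.57.170 - - [18/Oct/2019:07:27:00 -0700] "PUT /esmond/perfsonar/archive/e0ad3c64fe814d099482ec24f4961ed8/? HTTP/1.1" 201 2
10.127.57.170 - - [18/Oct/2019:07:32:09 -0700] "GET /psconfig/sups2.json HTTP/1.1" 200 3479
EOF
```

For the sample above this reports 07:27:00 as the last archive write from 10.127.57.170, matching the failure time noted in the annotation.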
/var/log/maddash/maddash-server.netlogger.log has "Unable to find any tests with data..." for each line from the failed agents.
(Just an editorial comment: I've yet to see any data in /var/log/perfsonar/servicewatcher_error.log on any agent or on the central server. Is that process really doing anything?)