perfsonar-user - [perfsonar-user] Maddash web server woes

Subject: perfSONAR User Q&A and Other Discussion

From: "Garnizov, Ivan (RRZE)" <>
To: Phil Reese <>, Andrew Lake <>, "" <>
Subject: [perfsonar-user] Maddash web server woes
Date: Wed, 8 Aug 2018 08:28:06 +0000
Hi Phil,

It appears to me that you have a MaDDash configuration issue that increases the load on Apache; the load accumulates over time and eventually crashes Apache.

Please start monitoring memory and CPU usage, both for the host as a whole and for Apache specifically, so that we can try to correlate resource usage with the crashes.
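
A minimal sketch of such monitoring in Python 3, assuming a Linux host where Apache runs under the command name `httpd` (as on CentOS) and where `ps` is the procps version. The function names are illustrative, not part of perfSONAR or MaDDash.

```python
import subprocess


def sum_rss_by_command(ps_output):
    """Sum resident memory (KiB) per command name.

    Expects lines of the form "<rss_kib> <command>", as printed by
    `ps -e -o rss= -o comm=`. Lines that do not fit are skipped.
    """
    totals = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kib, command = int(parts[0]), parts[1].strip()
        totals[command] = totals.get(command, 0) + rss_kib
    return totals


def sample_apache_rss_kib(command="httpd"):
    """Return the total resident memory (KiB) of all processes named `command`."""
    output = subprocess.check_output(
        ["ps", "-e", "-o", "rss=", "-o", "comm="],
        universal_newlines=True,
    )
    return sum_rss_by_command(output).get(command, 0)
```

Calling `sample_apache_rss_kib()` at a fixed interval (for example from cron) and recording the value with a timestamp would show whether Apache's memory use grows steadily before each crash.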

Numerous Nagios check failures result in numerous Apache requests, and if these time out, they put additional load on Apache.

If there are Nagios check failures, they should be visible in the GUI as well. Is that the case?

The simplest example would be that you have left the “example” MaDDash config lines in the maddash.yaml file.

Please check all of your dashboards and look for “no data” cells.

Regarding the events in “/var/log/httpd/error_log”:

Unfortunately, the normal operation of Cassandra and Esmond leads to these false positives in the Apache log. Please note that they occur regularly, on the hour.
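
One way to check that hourly pattern is to count the `cassandra_db` lines by minute of the hour. This is a minimal Python sketch, assuming the Apache 2.4 timestamp format seen in the quoted log excerpts; the function name and the default search string are illustrative.

```python
import re
from collections import Counter

# Captures the minute field of an Apache 2.4 error_log timestamp such as
# "[Tue Aug 07 08:27:11.967617 2018]".
TIMESTAMP_MINUTE = re.compile(r"^\[\w{3} \w{3} \d{2} \d{2}:(\d{2}):\d{2}")


def minute_histogram(lines, needle="cassandra_db"):
    """Count log lines containing `needle`, keyed by minute of the hour (0-59)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TIMESTAMP_MINUTE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

If the entries really are hourly, the counts should cluster around a single minute value; entries spread across many minutes would point to something else, such as repeated httpd restarts.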

Regards,

Ivan Garnizov

GEANT SA1T2: pS deployments GN Operations

GEANT SA2T3: pS development team

GEANT SA3T5: eduPERT team

Anniversary year 2018 - IT in motion

RRZE - the IT service provider of FAU

www.50-jahre.rrze.fau.de

From: Phil Reese [mailto:]
Sent: Tuesday, 7 August 2018 18:10
To: Garnizov, Ivan (RRZE) <>; Andrew Lake <>;
Subject: Re: AW: AW: [perfsonar-user] Maddash web server woes

Hi Ivan,

Thanks for being willing to work with me on this.

Below are some log outputs which might be of help.

Phil

First, this MA server is dedicated to just this single service. There are 13 PS testpoint systems feeding data to the server in a full-mesh config. There is a lot of DNS traffic, so I think it is keeping the server pretty busy, but it seems like it should work and not kill the httpd (Apache) process.

I made these updates to the Apache config, without much impact:
<IfModule prefork.c>
    StartServers        25
    MinSpareServers     40
    MaxSpareServers     70
    MaxRequestWorkers   640
    MaxConnectionsPerChild 7500
</IfModule>

From /var/log/httpd/error_log:

A varied assortment of these:
[Tue Aug 07 08:25:37.943206 2018] [mpm_worker:notice] [pid 8873:tid 139689130432640] AH00296: caught SIGWINCH, shutting down gracefully
[Tue Aug 07 08:27:08.334259 2018] [suexec:notice] [pid 28354:tid 140127999797376] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Aug 07 08:27:08.335219 2018] [ssl:warn] [pid 28354:tid 140127999797376] AH02292: Init: Name-based SSL virtual hosts only work for clients with TLS server name indication support (RFC 4366)
[Tue Aug 07 08:27:08.353946 2018] [auth_digest:notice] [pid 28354:tid 140127999797376] AH01757: generating secret for digest authentication ...
[Tue Aug 07 08:27:08.354430 2018] [lbmethod_heartbeat:notice] [pid 28354:tid 140127999797376] AH02282: No slotmem from mod_heartmonitor

Lots of these:
[Tue Aug 07 08:27:09.246540 2018] [:error] [pid 28358:tid 140127691790080] path= ['/usr/lib/esmond/src/dlnetsnmp/lib', '/usr/lib/esmond', '/usr/lib/esmond/esmond_client', '/usr/lib/esmond/esmond', '/usr/lib/esmond/lib/python2.7', '/usr/lib/esmond/lib/python2.7/site-packages', '/usr/lib64/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/lib/python2.7/site-packages']
[Tue Aug 07 08:27:09.294315 2018] [:error] [pid 28359:tid 140127691790080] path= ['/usr/lib/esmond/src/dlnetsnmp/lib', '/usr/lib/esmond', '/usr/lib/esmond/esmond_client', '/usr/lib/esmond/esmond', '/usr/lib/esmond/lib/python2.7', '/usr/lib/esmond/lib/python2.7/site-packages', '/usr/lib64/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/lib/python2.7/site-packages']

These started after an httpd restart:
[Tue Aug 07 08:27:11.967617 2018] [:error] [pid 28363:tid 140127683397376] cassandra_db [INFO] Checking/creating column families
[Tue Aug 07 08:27:11.968883 2018] [:error] [pid 28363:tid 140127683397376] cassandra_db [INFO] Schema check done
(lots more)

---
maddash-server.log details:
These come in and out:
INFO 2018-08-01 00:00:00,001 oldestAllowedTime=1532502000
ERROR 2018-08-01 09:38:25,385 Error scheduling job The Scheduler has been shutdown.
ERROR 2018-08-01 09:38:25,416 Error executing CheckSchedulerJob: The Scheduler has been shutdown.
INFO 2018-08-01 12:00:00,002 oldestAllowedTime=1532545200
INFO 2018-08-02 00:00:00,001 oldestAllowedTime=1532588400

The INFO lines here seem to suggest things are working, but the Nagios check errors are concerning:
INFO 2018-08-04 12:00:00,001 oldestAllowedTime=1532804400
INFO 2018-08-05 00:00:00,002 oldestAllowedTime=1532847600
ERROR 2018-08-05 11:51:47,803 Error running nagios check: null
ERROR 2018-08-05 11:55:44,599 Error running nagios check: null
ERROR 2018-08-05 11:55:44,603 Error running nagios check: null
ERROR 2018-08-05 11:55:44,604 Error running nagios check: null
ERROR 2018-08-05 11:55:44,617 Error running nagios check: null
ERROR 2018-08-05 11:55:44,619 Error running nagios check: null
ERROR 2018-08-05 11:55:44,620 Error running nagios check: null
ERROR 2018-08-05 11:55:44,621 Error running nagios check: null
ERROR 2018-08-05 11:55:44,622 Error running nagios check: null
ERROR 2018-08-05 11:55:44,623 Error running nagios check: null
ERROR 2018-08-05 11:55:44,624 Error running nagios check: null
ERROR 2018-08-05 11:55:44,624 Error running nagios check: null
ERROR 2018-08-05 11:55:44,624 Error running nagios check: null
ERROR 2018-08-05 11:55:44,624 Error running nagios check: null
ERROR 2018-08-05 11:55:44,627 Error running nagios check: null
ERROR 2018-08-05 11:55:44,628 Error running nagios check: null
ERROR 2018-08-05 11:55:44,629 Error running nagios check: null
ERROR 2018-08-05 11:55:44,602 Error running nagios check: null
ERROR 2018-08-05 11:55:44,602 Error running nagios check: null
ERROR 2018-08-05 11:55:44,632 Error running nagios check: null
INFO 2018-08-05 12:00:00,002 oldestAllowedTime=1532890800
INFO 2018-08-06 00:00:00,002 oldestAllowedTime=1532934000
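
To see how these errors are distributed over time, the log can be summarized per minute. This is a minimal Python sketch, assuming the maddash-server.log line format shown in the excerpt above; the function name is illustrative.

```python
import re
from collections import Counter

# Matches lines such as
# "ERROR 2018-08-05 11:55:44,599 Error running nagios check: null"
# and captures the timestamp truncated to the minute.
NAGIOS_ERROR = re.compile(
    r"^ERROR (\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} Error running nagios check"
)


def nagios_errors_per_minute(lines):
    """Count 'Error running nagios check' lines per 'YYYY-MM-DD HH:MM' minute."""
    counts = Counter()
    for line in lines:
        match = NAGIOS_ERROR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `nagios_errors_per_minute` and printing `most_common(10)` on the result lists the worst minutes, which can then be compared against the times at which httpd stopped responding.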

On 8/7/18 2:19 AM, Garnizov, Ivan (RRZE) wrote:

Hi Phil,

Actually, it is normal behavior on CentOS 7.5 for the Apache daemon to run with the -DFOREGROUND flag.

We need to understand where the problem stems from.

If Apache is opening so many connections, then your server is probably receiving many requests (hopefully not too many).

Please try restricting access to see whether that makes the system stable again.

Also, the Apache logs might contain references to issues related to the problem.

Please send the Apache logs from the server so that we can see whether there are issues related to MaDDash operation.

Best regards,

Ivan
