
perfsonar-user - Re: [perfsonar-user] Re: meshconfig-agent-tasks not scheduling tasks regularly

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell
  • To: Mark Feit
  • Cc: Larry Blunk
  • Subject: Re: [perfsonar-user] Re: meshconfig-agent-tasks not scheduling tasks regularly
  • Date: Thu, 19 Oct 2017 13:01:50 -0500

Mark and Larry,

    One of my hosts (ps-ku-bw) has failed to schedule tasks today.  This is one of my larger hosts and the MaxClients problem might have actually been the trigger that began the avalanche.  I've left the host broken in case Mark or one of the other developers wants information from it while it's in this failed state.

     At 9:47am yesterday, the httpd error log showed the following:

[root@ps-ku-bw crussell]# tail -f /var/log/httpd/error_log
[Wed Oct 18 06:22:07 2017] [warn] [client 139.162.108.53] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://164.113.32.57/toolkit/'
[Wed Oct 18 06:44:26 2017] [warn] [client 141.212.122.81] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://164.113.32.57/toolkit/'
[Wed Oct 18 08:41:56 2017] [warn] [client 54.174.92.112] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://ps-ku-bw.perfsonar.kanren.net/toolkit/'
[Wed Oct 18 08:44:26 2017] [warn] [client 107.170.201.175] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://164.113.32.145/toolkit/'
[Wed Oct 18 08:46:51 2017] [warn] [client 107.170.201.175] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://164.113.32.57/toolkit/'
[Wed Oct 18 09:11:21 2017] [error] [client 66.249.66.139] File does not exist: /var/www/html/robots.txt
[Wed Oct 18 09:28:02 2017] [error] [client 46.229.164.99] File does not exist: /var/www/html/robots.txt
[Wed Oct 18 09:32:52 2017] [warn] [client 155.94.88.58] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://ps-ku-bw.perfsonar.kanren.net/toolkit/'
[Wed Oct 18 09:41:38 2017] [warn] [client 155.94.88.58] incomplete redirection target of '/toolkit/' for URI '/' modified to 'http://ps-ku-bw.perfsonar.kanren.net/toolkit/'
[Wed Oct 18 09:47:29 2017] [error] server reached MaxClients setting, consider raising the MaxClients setting

Since then, nothing has been written to the httpd access log:
[root@ps-ku-bw crussell]# tail -f /var/log/httpd/access_log
::1 - - [18/Oct/2017:09:47:02 -0500] "PUT /esmond/perfsonar/archive/9bc084ac0a8349ec9b2e94488ca62716/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:03 -0500] "PUT /esmond/perfsonar/archive/ccc9b3f6b47b4ec5b63337c182dd2f97/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:04 -0500] "PUT /esmond/perfsonar/archive/347c05205385475d988d6e663501096e/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:04 -0500] "PUT /esmond/perfsonar/archive/347c05205385475d988d6e663501096e/ HTTP/1.1" 409 101 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:05 -0500] "PUT /esmond/perfsonar/archive/317550e6dae940bcb028c715220ec36c/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:07 -0500] "PUT /esmond/perfsonar/archive/54534a09177a472c8c2880c89322100e/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:08 -0500] "PUT /esmond/perfsonar/archive/693d654508ba4f209728da0de249fda6/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:16 -0500] "PUT /esmond/perfsonar/archive/aad5e250b92044a9be0582a1890acafb/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:31 -0500] "PUT /esmond/perfsonar/archive/0e594a3f088a422b9c2f253954e6be5a/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"
::1 - - [18/Oct/2017:09:47:36 -0500] "PUT /esmond/perfsonar/archive/2024f2829908405e9353db034ec54c2d/ HTTP/1.1" 201 2 "-" "python-requests/2.6.0 CPython/2.6.6 Linux/2.6.32-696.10.3.el6.x86_64"

The pScheduler log shows that the tests that were already scheduled are still running and are (I believe) archiving to the central archive, but not to the local one.

Oct 19 12:57:19 ps-ku-bw archiver WARNING  17050500: Failed to archive https://localhost/pscheduler/tasks/e67b5ce2-dcfc-47e0-9989-c94b4fa52481/runs/61e2a10e-2915-4cff-9bee-de5a16756baa to esmond: 400: Invalid JSON returned
Oct 19 12:57:20 ps-ku-bw archiver WARNING  17050432: Failed to archive https://localhost/pscheduler/tasks/51f1cb7d-bd30-472f-bd51-5eb0b0ae2350/runs/75ce65dd-0606-49fc-8dcb-23f381429815 to esmond: Archiver permanently abandoned registering test after 2 attempt(s): 400: Invalid JSON returned
Oct 19 12:57:20 ps-ku-bw archiver WARNING  17050432: Gave up archiving https://localhost/pscheduler/tasks/51f1cb7d-bd30-472f-bd51-5eb0b0ae2350/runs/75ce65dd-0606-49fc-8dcb-23f381429815 to esmond
Oct 19 12:57:21 ps-ku-bw archiver WARNING  17050502: Failed to archive https://localhost/pscheduler/tasks/0bea838f-9368-4250-a893-3d61a4bdc7cd/runs/6d4417a8-3f18-470d-860c-9b6bf23e9dc0 to esmond: 400: Invalid JSON returned

I say "I believe" they're reaching the central MA but not the local one because, as you may note, the API is unavailable if you try querying it for one of those URLs to see what happened (a problem with the httpd daemon?).

[root@ps-ku-bw crussell]# service httpd status
httpd (pid  16970) is running...
[root@ps-ku-bw crussell]# service pscheduler-scheduler status
scheduler (pid  17017) is running...
[root@ps-ku-bw crussell]# service pscheduler-archiver status
archiver (pid  17004) is running...
[root@ps-ku-bw crussell]# service pscheduler-ticker status
ticker (pid  16999) is running...
[root@ps-ku-bw crussell]# service cassandra status
cassandra (pid  1810) is running...

I'll leave the host alone for a few hours in case anyone wants me to gather more info.  


Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Tue, Oct 17, 2017 at 2:44 PM, Casey Russell wrote:
Mark and Larry,

     I have seen this occasionally on my lower-powered hosts (and maybe on others, although I've been watching the lower-powered hosts much more closely, so I'm much more likely to have noticed it there).

     I don't have a host where that error is active, but here you can see where one of my hosts encountered the error yesterday (this was just before I installed the 4.0.2 beta on it and rebooted it). 

[root@ps-washburn-bw crussell]# cat /var/log/httpd/error_log | grep Max
[Mon Oct 16 16:50:42 2017] [error] server reached MaxClients setting, consider raising the MaxClients setting
[root@ps-washburn-bw crussell]# 
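
To line these events up against other logs, it can help to pull the timestamps out as real dates rather than reading grep output. The following is only a sketch: it assumes the Apache 2.2 error-log line format shown above, and the function name `maxclients_events` is made up for illustration.

```python
import re
from datetime import datetime

# Matches Apache 2.2-style error-log lines such as:
# [Mon Oct 16 16:50:42 2017] [error] server reached MaxClients setting, ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[error\] server reached MaxClients setting"
)

def maxclients_events(lines):
    """Return one datetime per 'server reached MaxClients' line (hypothetical helper)."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # Timestamp layout assumed from the log excerpt: 'Mon Oct 16 16:50:42 2017'
            events.append(datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y"))
    return events

# Two lines copied from the error logs quoted in this thread.
SAMPLE = [
    "[Mon Oct 16 16:50:42 2017] [error] server reached MaxClients setting, "
    "consider raising the MaxClients setting",
    "[Wed Oct 18 09:11:21 2017] [error] [client 66.249.66.139] "
    "File does not exist: /var/www/html/robots.txt",
]

for when in maxclients_events(SAMPLE):
    print(when.isoformat(sep=" "))
```

On a real host the same function could be fed the open file object for /var/log/httpd/error_log instead of the sample list.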

You can see that (today at least) these hosts do have a lot of connections open (I'm sparing you the detailed output, although it's available if you want it), although that in and of itself is not necessarily a problem.

[root@ps-washburn-bw crussell]# netstat -tan | wc -l
1238
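(that's roughly 1238 TCP sockets; the count includes netstat's two header lines and sockets in every state, not only established connections)

(Another of my lower-powered hosts)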
[crussell@ps-esu-bw ~]$ netstat -tan | wc -l
1903

Out of curiosity, I checked to see how many of those were hitting Apache on TCP port 80:
[root@ps-washburn-bw crussell]# netstat -an | grep ':80' | wc -l
78

[crussell@ps-esu-bw ~]$ netstat -tan | grep ':80' | wc -l
66

It doesn't seem too out of whack, but today may be entirely unrepresentative of what it looked like while the "MaxClients" problem was occurring.  Again, I installed the 4.0.2 beta on both of these hosts yesterday and the problem hasn't recurred since, so today's netstat results may not reflect what an affected host looks like.
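
One caveat on the counts above: `grep ':80'` also matches ports such as 8080 and outbound connections to a remote port 80, so it can overstate what Apache is actually serving. A rough sketch of a stricter count, broken down by TCP state, might look like the following. It assumes the usual Linux `netstat -tan` column layout; the helper name and the sample rows (documentation-range addresses) are invented for illustration and are not output from these hosts.

```python
from collections import Counter

def states_for_local_port(netstat_lines, port=80):
    """Count TCP sockets whose LOCAL port equals `port`, grouped by state.

    Hypothetical helper; expects lines in Linux `netstat -tan` format.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip the two header lines and anything that is not a TCP socket row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        # The port is whatever follows the last colon ('192.0.2.10:80', ':::80').
        if local.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts

# Invented sample rows, NOT taken from the hosts discussed here.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 0.0.0.0:80                  0.0.0.0:*                   LISTEN
tcp        0      0 192.0.2.10:80               198.51.100.7:40112          ESTABLISHED
tcp        0      0 192.0.2.10:80               198.51.100.7:40113          TIME_WAIT
tcp        0      0 192.0.2.10:8080             198.51.100.9:51000          ESTABLISHED
tcp        0      0 192.0.2.10:45122            203.0.113.5:80              ESTABLISHED
""".splitlines()

# A plain grep ':80' would match 5 of the rows above; only 3 are local port 80.
print(dict(states_for_local_port(SAMPLE)))
```

A large share of sockets in one state (for example many stuck in CLOSE_WAIT) would say more about a hung Apache than the raw total does.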


Sincerely,
Casey Russell
Network Engineer
KanREN
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Tue, Oct 17, 2017 at 12:32 PM, Mark Feit wrote:

Larry Blunk writes:

 


Has anyone experienced Apache hitting the MaxClients limit and hanging? We've had this happen on several boxes
since upgrading to 4.0.1. We've had to restart Apache to get them functioning again. We've upped the
MaxClients limit on them, but it still occurs even after doubling the setting to 512. These are high-performance
boxes, so it doesn't seem like it should be a CPU issue.
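
One thing that may be worth ruling out when raising the limit (this is an assumption about the setup, not something stated in the thread): with Apache's prefork MPM, MaxClients cannot exceed ServerLimit, and ServerLimit defaults to 256 when it is not set, so a MaxClients of 512 on its own would be lowered back to 256. The sketch below checks a config snippet for that. The helper name is hypothetical, and the parsing is deliberately simple: it looks only at a prefork `<IfModule>` block if one exists and ignores Include files.

```python
import re

PREFORK_DEFAULT_SERVER_LIMIT = 256  # prefork default when ServerLimit is absent

PREFORK_BLOCK_RE = re.compile(
    r"<IfModule\s+(?:prefork\.c|mpm_prefork_module)\s*>(.*?)</IfModule>",
    re.IGNORECASE | re.DOTALL,
)
DIRECTIVE_RE = re.compile(r"^(MaxClients|ServerLimit)\s+(\d+)$", re.IGNORECASE)

def effective_max_clients(conf_text):
    """Return (requested, effective) MaxClients, or None if MaxClients is not set.

    Hypothetical helper: a simplified reading of an httpd.conf, not Apache's parser.
    """
    block = PREFORK_BLOCK_RE.search(conf_text)
    body = block.group(1) if block else conf_text
    values = {}
    for raw in body.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        match = DIRECTIVE_RE.match(line)
        if match:
            values[match.group(1).lower()] = int(match.group(2))
    if "maxclients" not in values:
        return None
    limit = values.get("serverlimit", PREFORK_DEFAULT_SERVER_LIMIT)
    return values["maxclients"], min(values["maxclients"], limit)

# Invented example config, not copied from any host in this thread.
EXAMPLE = """
<IfModule prefork.c>
StartServers       8
ServerLimit      256
MaxClients       512
</IfModule>
"""

requested, effective = effective_max_clients(EXAMPLE)
if effective < requested:
    print("MaxClients %d would be capped at %d by ServerLimit" % (requested, effective))
```

If the cap applies, Apache normally reports it in the error log at startup, so that log is a quicker confirmation than any script.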

 

There shouldn’t be a lot of connections to the HTTP server during normal operations; MeshConfig and task setup from remote nodes are the only things that should be connecting.  The internal parts of pScheduler poke the database directly.

 

If you encounter that again, I’d be interested to see what the process table and netstat say about what’s connected and from where, and whether there are old processes that connect and aren’t dying.
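
For the "from where" part, one possible way to summarize netstat output is to group the sockets on the web server's ports by remote address. This is a sketch under assumptions: Linux `netstat -tan` column layout, Apache listening on ports 80 and 443, and a made-up helper name and sample data (documentation-range addresses).

```python
from collections import Counter

def remote_hosts(netstat_lines, local_ports=(80, 443)):
    """Count non-listening TCP sockets on `local_ports`, grouped by remote address.

    Hypothetical helper; expects lines in Linux `netstat -tan` format.
    """
    wanted = {str(p) for p in local_ports}
    hosts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "LISTEN":
            continue
        if local.rsplit(":", 1)[-1] in wanted:
            # Drop the remote port, keep only the address.
            hosts[remote.rsplit(":", 1)[0]] += 1
    return hosts

# Invented sample rows, NOT taken from any host in this thread.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 0.0.0.0:443                 0.0.0.0:*                   LISTEN
tcp        0      0 127.0.0.1:443               127.0.0.1:50110             CLOSE_WAIT
tcp        0      0 127.0.0.1:443               127.0.0.1:50111             CLOSE_WAIT
tcp        0      0 192.0.2.10:443              198.51.100.7:40112          ESTABLISHED
tcp        0      0 192.0.2.10:80               198.51.100.7:40113          TIME_WAIT
tcp        0      0 192.0.2.10:5432             192.0.2.10:41000            ESTABLISHED
""".splitlines()

for host, count in remote_hosts(SAMPLE).most_common():
    print(count, host)
```

Running `netstat -tanp` as root adds a PID/Program name column, which helps tie local connections back to entries in the process table.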

 

--Mark

 





