
perfsonar-user - Re: [perfsonar-user] Re: meshconfig-agent-tasks not scheduling tasks regularly

Subject: perfSONAR User Q&A and Other Discussion

  • From: Andrew Lake <>
  • To: "" <>, Casey Russell <>
  • Subject: Re: [perfsonar-user] Re: meshconfig-agent-tasks not scheduling tasks regularly
  • Date: Fri, 13 Oct 2017 11:32:30 -0700

Hi Casey,

You might want to try the 4.0.2 beta on one or more of the problematic hosts. It has quite a few performance fixes, in particular for the CPU usage of archiving, which in turn leads to lots of other things breaking. On some of our lower-powered hosts and small nodes we have definitely seen limit timeouts in 4.0.1 like the ones you are seeing, and they went away once we put the beta on there. If you want to try the beta, run the following:

yum install perfSONAR-repo-staging
yum clean all
yum update

Once the final release is out, you may want to run “yum remove perfSONAR-repo-staging” if you don't want those hosts to automatically get future beta versions.

Thanks,
Andy


On October 13, 2017 at 1:21:14 PM, Casey Russell () wrote:

An additional piece of information: searching through old threads, I came across the "pscheduler validate-limits" command. On one of my larger hosts, that command appears to succeed pretty much all the time, but on my smaller hosts, it fails more often than not:

[crussell@ps-washburn-bw ~]$ pscheduler validate-limits
Failed to validate limit: Process took too long to run.
[crussell@ps-washburn-bw ~]$ pscheduler validate-limits
Limit configuration is valid.
[crussell@ps-washburn-bw ~]$ pscheduler validate-limits
Failed to validate limit: Process took too long to run.
[crussell@ps-washburn-bw ~]$ pscheduler validate-limits
Failed to validate limit: Process took too long to run.
[crussell@ps-washburn-bw ~]$ pscheduler validate-limits
Limit configuration is valid.

You can also see that this is a contributor to at least some of these tests not being posted (although I haven't yet captured the reason it fails on the larger host).  
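
For anyone who wants to put a number on how often validation fails on a given host, here is a minimal sketch. The function name success_rate is made up for illustration, and the sketch assumes that "pscheduler validate-limits" exits non-zero when it prints a failure message; that is an assumption and has not been checked against the pscheduler source.

```shell
#!/bin/sh
# Run a command repeatedly and report how many runs exited successfully.
# Usage: success_rate RUNS COMMAND [ARGS...]
# (success_rate is a hypothetical helper, not part of perfSONAR.)
success_rate() {
    runs=$1
    shift
    ok=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status is counted.
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok of $runs runs succeeded"
}

# Example, on a host with pscheduler installed (assumes a failed
# validation returns a non-zero exit status):
# success_rate 20 pscheduler validate-limits
```

Running the same count on a large and a small host should make the difference between them easier to compare than eyeballing repeated runs.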


I have a fairly long CIDR-LIST in that limits file (the file is identical on the two hosts). Does anyone know whether that limit processing is more likely to be memory intensive or processor intensive?  I'd have to look it up again, but I think I also saw a reference somewhere to a method for referencing an outside list of CIDRs.  Does anyone know if that's less intensive than a long CIDR-LIST statement in the limit file?


Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Fri, Oct 13, 2017 at 9:55 AM, Casey Russell <> wrote:
Group,

     I mentioned it some time back, when I thought it was a problem with my 4 lower-powered hosts running out of CPU, but I've been chasing it ever since and it's hitting my larger hosts as well.  Ever since I upgraded to 4.0 several months ago, my hosts have regularly stopped scheduling tests from the mesh.  My dashboard today shows a mess of hosts that failed to schedule tests last night; some of them are on their second (or later) consecutive day.

     I can't figure out whether this is a problem with the mesh config file or on the hosts, although since it's spread everywhere (even to a newly installed CentOS7 host), I'm leaning toward some problem in the mesh config file.

     I'm not sure what to give you that will help, so below you'll find some diagnostic commands from an affected host this morning that is only running bandwidth tests; none of the latency tests are scheduled.

Any ideas or help is appreciated.

Sincerely,
Casey Russell
Network Engineer
KanREN
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

Since the latency tests were never scheduled, I don't have anything from the API to show you. The mesh config file is at:  

[root@ps-ksu-bw crussell]# pscheduler schedule
2017-10-13T09:47:54-05:00 - 2017-10-13T09:48:23-05:00  (Pending)
throughput --duration PT20S --source ps-fhsu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')


2017-10-13T09:49:33-05:00 - 2017-10-13T09:49:52-05:00  (Pending)
throughput --bandwidth 920000000 --duration PT10S --source ps-esu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')


2017-10-13T09:52:08-05:00 - 2017-10-13T09:52:27-05:00  (Pending)
throughput --bandwidth 920000000 --duration PT10S --source ps-bryant-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')


2017-10-13T09:58:44-05:00 - 2017-10-13T09:59:03-05:00  (Pending)
throughput --bandwidth 920000000 --duration PT10S --source ps-bryant-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')


2017-10-13T10:07:36-05:00 - 2017-10-13T10:08:05-05:00  (Pending)
throughput --duration PT20S --source ps-ku-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')


2017-10-13T10:08:38-05:00 - 2017-10-13T10:08:57-05:00  (Pending)
throughput --bandwidth 920000000 --duration PT10S --source ps-ku-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')


2017-10-13T10:10:18-05:00 - 2017-10-13T10:10:47-05:00  (Pending)
throughput --duration PT20S --source ps-esu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')


2017-10-13T10:10:49-05:00 - 2017-10-13T10:11:08-05:00  (Pending)
throughput --bandwidth 920000000 --duration PT10S --source ps-esu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')


2017-10-13T10:16:39-05:00 - 2017-10-13T10:17:08-05:00  (Pending)
throughput --duration PT20S --source ps-fhsu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')


2017-10-13T10:36:46-05:00 - 2017-10-13T10:37:15-05:00  (Pending)
throughput --duration PT20S --source ps-esu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')



[root@ps-ksu-bw crussell]# service pscheduler-runner status
runner (pid  13073) is running...

[root@ps-ksu-bw crussell]# service pscheduler-ticker status
ticker (pid  13071) is running...

[root@ps-ksu-bw crussell]# service pscheduler-archiver status
archiver (pid  13078) is running...

[root@ps-ksu-bw crussell]# service pscheduler-server status
pscheduler-server: unrecognized service

[root@ps-ksu-bw crussell]# service pscheduler-scheduler status
scheduler (pid  13090) is running...

[root@ps-ksu-bw crussell]# ps -ax | grep pscheduler
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
 3448 pts/0    S+     0:00 grep pscheduler
 8236 ?        Ss     0:17 postgres: pscheduler pscheduler 127.0.0.1(41520) idle   
13071 ?        Sl     0:42 /usr/bin/python /usr/libexec/pscheduler/daemons/ticker --daemon --pid-file /var/run/pscheduler-ticker.pid --dsn @/etc/pscheduler/database/database-dsn
13073 ?        Sl    21:20 /usr/bin/python /usr/libexec/pscheduler/daemons/runner --daemon --pid-file /var/run/pscheduler-runner.pid --dsn @/etc/pscheduler/database/database-dsn
13075 ?        Ss     1:20 postgres: pscheduler pscheduler 127.0.0.1(48114) idle   
13076 ?        Ss     9:40 postgres: pscheduler pscheduler 127.0.0.1(48116) idle   
13078 ?        S     67:00 /usr/bin/python /usr/libexec/pscheduler/daemons/archiver --daemon --pid-file /var/run/pscheduler-archiver.pid --dsn @/etc/pscheduler/database/database-dsn
13079 ?        Ss   360:11 postgres: pscheduler pscheduler 127.0.0.1(48118) idle   
13081 ?        Ss     8:31 postgres: pscheduler pscheduler 127.0.0.1(48122) idle   
13083 ?        Ss     0:00 postgres: pscheduler pscheduler 127.0.0.1(48126) idle   
13090 ?        Sl    65:19 /usr/bin/python /usr/libexec/pscheduler/daemons/scheduler --daemon --pid-file /var/run/pscheduler-scheduler.pid --dsn @/etc/pscheduler/database/database-dsn
13108 ?        Ss   115:36 postgres: pscheduler pscheduler 127.0.0.1(48132) idle   
13114 ?        Ss     0:00 postgres: pscheduler pscheduler 127.0.0.1(48136) idle   
28737 ?        Ss     0:01 postgres: pscheduler pscheduler 127.0.0.1(55217) idle   
[root@ps-ksu-bw crussell]# 

[root@ps-ksu-bw crussell]# service perfsonar-meshconfig-agent
usage: /etc/init.d/perfsonar-meshconfig-agent (start|stop|restart|help)

start      - start perfSONAR MeshConfig Agent
stop       - stop perfSONAR MeshConfig Agent
restart    - restart perfSONAR MeshConfig Agent if running by sending a SIGHUP or start if 
             not running
status     - Indicates if the service is running
help       - this screen

[root@ps-ksu-bw crussell]# service perfsonar-meshconfig-agent restart
/etc/init.d/perfsonar-meshconfig-agent stop: perfSONAR MeshConfig Agent stopped
waiting...
/usr/lib/perfsonar/bin/perfsonar_meshconfig_agent --config=/etc/perfsonar/meshconfig-agent.conf --pidfile=/var/run/perfsonar-meshconfig-agent.pid --logger=/etc/perfsonar/meshconfig-agent-logger.conf --user=perfsonar --group=perfsonar
/etc/init.d/perfsonar-meshconfig-agent start: perfSONAR MeshConfig Agent started

[root@ps-ksu-bw crussell]# tail -n 50 /var/log/perfsonar/meshconfig-agent.log 
2017/10/12 20:10:55 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 3 new tasks, and deleted 0 old tasks
2017/10/12 21:10:10 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 1 new tasks, and deleted 0 old tasks
2017/10/13 03:10:37 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 2 new tasks, and deleted 0 old tasks
2017/10/13 04:10:40 (8826) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for deletion, skipping test throughput/iperf3(ps-ksu-bw.perfsonar.kanren.net->ps-fhsu-bw.perfsonar.kanren.net): 500 Internal Server Error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>
2017/10/13 07:11:39 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 5 new tasks, and deleted 0 old tasks
2017/10/13 09:20:23 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 97 new tasks, and deleted 0 old tasks
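
To see at a glance how many tasks the agent added across its runs, a log like the one above can be summarized with awk. This is only a sketch: summarize_mesh_log is a hypothetical helper name, and it assumes the log lines keep the "Added N new tasks, and deleted M old tasks" wording and the "WARN>" level marker shown in this excerpt.

```shell
#!/bin/sh
# Summarize meshconfig-agent log lines: count agent runs, total tasks
# added and deleted, and WARN-level entries.
# Usage: summarize_mesh_log [FILE...]   (reads stdin when no file is given)
# (summarize_mesh_log is a hypothetical helper, not part of perfSONAR.)
summarize_mesh_log() {
    awk '
        / Added [0-9]+ new tasks, and deleted [0-9]+ old tasks/ {
            # The number follows the word "Added" or "deleted".
            for (i = 1; i < NF; i++) {
                if ($i == "Added")   added   += $(i + 1)
                if ($i == "deleted") deleted += $(i + 1)
            }
            runs++
        }
        / WARN> / { warns++ }
        END {
            printf "runs=%d added=%d deleted=%d warnings=%d\n",
                   runs, added, deleted, warns
        }
    ' "$@"
}

# Example:
# summarize_mesh_log /var/log/perfsonar/meshconfig-agent.log
```

A run that adds far more tasks than the others, like the 97 added right after the restart above, stands out in the per-line log but a total like this makes it easier to compare hosts.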





