
perfsonar-user - Re: [perfsonar-user] meshconfig-agent-tasks not scheduling tasks regularly

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To: "Garnizov, Ivan (RRZE)" <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] meshconfig-agent-tasks not scheduling tasks regularly
  • Date: Mon, 16 Oct 2017 14:14:32 -0500

Ivan,

     I hope I've understood your questions correctly.  When a host gets "broken", it does not seem to matter whether the scheduling request (pscheduler) is initiated from the local host or a remote host; it will fail because the "broken" host seems to time out (usually the logfile says some variation of "process took too long to run").

    Sometimes it's because it failed to validate the limits file; sometimes there's no extra detail, just "host closed connection: process took too long to run" or something similar.




Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Mon, Oct 16, 2017 at 10:21 AM, Garnizov, Ivan (RRZE) <> wrote:

Hello Casey,

 

I am about to create an issue report about it. Could you please also share whether an impacted system can still negotiate tests on external requests, or whether it becomes totally isolated from the rest of the mesh, failing on both external and internal measurement requests?

 

Regards,

Ivan Garnizov

 

From: Casey Russell [mailto:]
Sent: Monday, 16 October 2017 16:24
To: Garnizov, Ivan (RRZE)
Cc:
Subject: Re: [perfsonar-user] meshconfig-agent-tasks not scheduling tasks regularly

 

Ivan,

 

     I'm running PS 4.0.1-1 across all the boxes, I believe.  7 of the testing hosts run CentOS 6 and one runs CentOS 7.  It doesn't seem to matter which OS is in play, but my 4 lower-powered hosts are somewhat more likely to see these failures than my bigger boxes.  They all share an identical limits.conf file for pscheduler.

 

     When it happens, there's a small chance that if I leave that host alone, it may go ahead and schedule the failed tests the next day (and they'll run for 24 hours) but more likely than not, they'll fail again.  I've left them alone for as long as a couple of weeks and they may have scheduled tests for 3-4 days during that time.  The only way I've discovered to clear the problem is to reboot the node.  This will bring that node back into line for some period of time (maybe a day or two, maybe a couple of weeks).

 

     All of the intervals, lifetime minimums, and task renewal fudge factor style timers in the meshconfig-agent.conf file still have "#" in front of them to comment them out, so they should be running their default values.  The only thing I've added or modified in that file on any of the hosts should be the <mesh> statements.
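One way to double-check that only the <mesh> statements are active is to print the lines of the file that are neither blank nor commented out. This is a minimal sketch, not part of perfSONAR: the function name `active_settings` is made up for illustration, and the path in the example comment is the one the agent is started with elsewhere in this thread.

```shell
#!/bin/sh
# active_settings FILE
# Print every line of FILE that is neither blank nor a "#" comment,
# i.e. the settings that actually override the built-in defaults.
# (Hypothetical helper name, for illustration only.)
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Example (path taken from the agent's command line in this thread):
#   active_settings /etc/perfsonar/meshconfig-agent.conf
```

If the output contains anything other than the <mesh> block, that host is not running purely on defaults. Running it on each host and comparing the results would also show whether the hosts really are configured identically.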


 

Sincerely,

Casey Russell

Network Engineer

KanREN

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

On Mon, Oct 16, 2017 at 2:23 AM, Garnizov, Ivan (RRZE) <> wrote:

Hello Casey,

 

Please share your version of the pS software.

Are you able to observe a pattern of the issue (timewise)?

Do the systems automatically recover the flow of measurements?

Or, what steps are required to recover the schedule?

Do you have any specifics in your meshconfig-agent.conf file or are you using the defaults?

More specifically have you adjusted the interval parameters in the conf file?

 

Regards,

Ivan

 

 

From: [mailto:] On Behalf Of Casey Russell
Sent: Friday, 13 October 2017 16:55
To:
Subject: [perfsonar-user] meshconfig-agent-tasks not scheduling tasks regularly

 

Group,

 

     I mentioned it some time back, when I thought it was a problem with my 4 lower-powered hosts running out of CPU, but I've been chasing it ever since and it's hitting my larger hosts as well.  Ever since I upgraded to 4.0 several months ago, my hosts have regularly stopped scheduling tests from the mesh.  My dashboard today shows a mess of hosts that failed to schedule tests last night; some of them are on their second (or later) consecutive day.

 

     I can't figure out whether this is a problem with the mesh config file or on the hosts, although since it's spread everywhere (even to a newly installed CentOS 7 host), I'm leaning toward some problem in the mesh config file.

 

     I'm not sure what to give you that will help, so below you'll find some diagnostic commands from an affected host this morning that is only running bandwidth tests; none of the latency tests were scheduled.

 

Any ideas or help is appreciated.

 

Sincerely,

Casey Russell

Network Engineer

KanREN

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

Since the latency tests were never scheduled, I don't have anything from the API to show you.  The mesh config file is at:  

 

[root@ps-ksu-bw crussell]# pscheduler schedule

2017-10-13T09:47:54-05:00 - 2017-10-13T09:48:23-05:00  (Pending)

throughput --duration PT20S --source ps-fhsu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')

 

 

2017-10-13T09:49:33-05:00 - 2017-10-13T09:49:52-05:00  (Pending)

throughput --bandwidth 920000000 --duration PT10S --source ps-esu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')

 

 

2017-10-13T09:52:08-05:00 - 2017-10-13T09:52:27-05:00  (Pending)

throughput --bandwidth 920000000 --duration PT10S --source ps-bryant-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')

 

 

2017-10-13T09:58:44-05:00 - 2017-10-13T09:59:03-05:00  (Pending)

throughput --bandwidth 920000000 --duration PT10S --source ps-bryant-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')

 

 

2017-10-13T10:07:36-05:00 - 2017-10-13T10:08:05-05:00  (Pending)

throughput --duration PT20S --source ps-ku-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')

 

 

2017-10-13T10:08:38-05:00 - 2017-10-13T10:08:57-05:00  (Pending)

throughput --bandwidth 920000000 --duration PT10S --source ps-ku-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')

 

 

2017-10-13T10:10:18-05:00 - 2017-10-13T10:10:47-05:00  (Pending)

throughput --duration PT20S --source ps-esu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')

 

 

2017-10-13T10:10:49-05:00 - 2017-10-13T10:11:08-05:00  (Pending)

throughput --bandwidth 920000000 --duration PT10S --source ps-esu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 --udp (Run with tool 'iperf3')

 

 

2017-10-13T10:16:39-05:00 - 2017-10-13T10:17:08-05:00  (Pending)

throughput --duration PT20S --source ps-fhsu-bw.perfsonar.kanren.net --ip-version 6 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')

 

 

2017-10-13T10:36:46-05:00 - 2017-10-13T10:37:15-05:00  (Pending)

throughput --duration PT20S --source ps-esu-bw.perfsonar.kanren.net --ip-version 4 --dest ps-ksu-bw.perfsonar.kanren.net --parallel 1 (Run with tool 'iperf3')
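To confirm at a glance that only throughput tests made it onto the schedule, the listing above can be tallied by test type. The sketch below assumes the output layout shown in this transcript (a time-range line followed by a task line whose first word is the test type); `count_test_types` is a made-up helper name, not a pscheduler command.

```shell
#!/bin/sh
# count_test_types
# Read `pscheduler schedule` text on stdin and print "<count> <test-type>"
# for each test type seen. Assumes the layout shown in this thread.
# (Hypothetical helper name, for illustration only.)
count_test_types() {
    awk '
        # Skip blank lines, the "start - end (State)" lines (which begin
        # with an ISO 8601 timestamp), and any URL lines.
        NF == 0 { next }
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { next }
        $1 ~ /^https?:/ { next }
        # What remains is a task description; its first word is the test type.
        { count[$1]++ }
        END { for (t in count) print count[t], t }
    '
}

# Example:
#   pscheduler schedule | count_test_types
```

On a healthy host in this mesh, the tally would be expected to include the latency test types as well as throughput.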

 

 

 

[root@ps-ksu-bw crussell]# service pscheduler-runner status

runner (pid  13073) is running...

 

[root@ps-ksu-bw crussell]# service pscheduler-ticker status

ticker (pid  13071) is running...

 

[root@ps-ksu-bw crussell]# service pscheduler-archiver status

archiver (pid  13078) is running...

 

[root@ps-ksu-bw crussell]# service pscheduler-server status

pscheduler-server: unrecognized service

 

[root@ps-ksu-bw crussell]# service pscheduler-scheduler status

scheduler (pid  13090) is running...

 

[root@ps-ksu-bw crussell]# ps -ax | grep pscheduler

Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ

 3448 pts/0    S+     0:00 grep pscheduler

 8236 ?        Ss     0:17 postgres: pscheduler pscheduler 127.0.0.1(41520) idle   

13071 ?        Sl     0:42 /usr/bin/python /usr/libexec/pscheduler/daemons/ticker --daemon --pid-file /var/run/pscheduler-ticker.pid --dsn @/etc/pscheduler/database/database-dsn

13073 ?        Sl    21:20 /usr/bin/python /usr/libexec/pscheduler/daemons/runner --daemon --pid-file /var/run/pscheduler-runner.pid --dsn @/etc/pscheduler/database/database-dsn

13075 ?        Ss     1:20 postgres: pscheduler pscheduler 127.0.0.1(48114) idle   

13076 ?        Ss     9:40 postgres: pscheduler pscheduler 127.0.0.1(48116) idle   

13078 ?        S     67:00 /usr/bin/python /usr/libexec/pscheduler/daemons/archiver --daemon --pid-file /var/run/pscheduler-archiver.pid --dsn @/etc/pscheduler/database/database-dsn

13079 ?        Ss   360:11 postgres: pscheduler pscheduler 127.0.0.1(48118) idle   

13081 ?        Ss     8:31 postgres: pscheduler pscheduler 127.0.0.1(48122) idle   

13083 ?        Ss     0:00 postgres: pscheduler pscheduler 127.0.0.1(48126) idle   

13090 ?        Sl    65:19 /usr/bin/python /usr/libexec/pscheduler/daemons/scheduler --daemon --pid-file /var/run/pscheduler-scheduler.pid --dsn @/etc/pscheduler/database/database-dsn

13108 ?        Ss   115:36 postgres: pscheduler pscheduler 127.0.0.1(48132) idle   

13114 ?        Ss     0:00 postgres: pscheduler pscheduler 127.0.0.1(48136) idle   

28737 ?        Ss     0:01 postgres: pscheduler pscheduler 127.0.0.1(55217) idle   

[root@ps-ksu-bw crussell]# 

 

[root@ps-ksu-bw crussell]# service perfsonar-meshconfig-agent

usage: /etc/init.d/perfsonar-meshconfig-agent (start|stop|restart|help)

 

start      - start perfSONAR MeshConfig Agent

stop       - stop perfSONAR MeshConfig Agent

restart    - restart perfSONAR MeshConfig Agent if running by sending a SIGHUP or start if 

             not running

status     - Indicates if the service is running

help       - this screen

 

[root@ps-ksu-bw crussell]# service perfsonar-meshconfig-agent restart

/etc/init.d/perfsonar-meshconfig-agent stop: perfSONAR MeshConfig Agent stopped

waiting...

/usr/lib/perfsonar/bin/perfsonar_meshconfig_agent --config=/etc/perfsonar/meshconfig-agent.conf --pidfile=/var/run/perfsonar-meshconfig-agent.pid --logger=/etc/perfsonar/meshconfig-agent-logger.conf --user=perfsonar --group=perfsonar

/etc/init.d/perfsonar-meshconfig-agent start: perfSONAR MeshConfig Agent started

 

[root@ps-ksu-bw crussell]# tail -n 50 /var/log/perfsonar/meshconfig-agent.log 

2017/10/12 20:10:55 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 3 new tasks, and deleted 0 old tasks

2017/10/12 21:10:10 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 1 new tasks, and deleted 0 old tasks

2017/10/13 03:10:37 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 2 new tasks, and deleted 0 old tasks

2017/10/13 04:10:40 (8826) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for deletion, skipping test throughput/iperf3(ps-ksu-bw.perfsonar.kanren.net->ps-fhsu-bw.perfsonar.kanren.net): 500 Internal Server Error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>

<title>500 Internal Server Error</title>

</head><body>

<h1>Internal Server Error</h1>

<p>The server encountered an internal error or

misconfiguration and was unable to complete

your request.</p>

<p>Please contact the server administrator at 

 root@localhost to inform them of the time this error occurred,

 and the actions you performed just before this error.</p>

<p>More information about this error may be available

in the server error log.</p>

</body></html>

2017/10/13 07:11:39 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 5 new tasks, and deleted 0 old tasks

2017/10/13 09:20:23 (8826) INFO> perfsonar_meshconfig_agent:438 main:: - Added 97 new tasks, and deleted 0 old tasks
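To look for a time-wise pattern in when tasks get (re)scheduled, the "Added N new tasks" summary lines can be pulled out of the log on their own. This is a rough sketch that assumes the log line format shown above; `task_additions` is a made-up helper name, not part of perfSONAR.

```shell
#!/bin/sh
# task_additions
# Read meshconfig-agent log lines on stdin and print
# "<date> <time> added=<N> deleted=<M>" for each task summary line.
# All other lines (warnings, HTML error bodies) are dropped.
# (Hypothetical helper name, for illustration only.)
task_additions() {
    sed -n 's/^\([0-9/]* [0-9:]*\) .*Added \([0-9][0-9]*\) new tasks, and deleted \([0-9][0-9]*\) old tasks.*/\1 added=\2 deleted=\3/p'
}

# Example (log path as used in the tail command above):
#   task_additions < /var/log/perfsonar/meshconfig-agent.log
```

Large gaps between timestamps, or a sudden jump such as the 97 tasks added right after the restart, stand out more clearly in this condensed form.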

 

 

 




