Skip to Content.
Sympa Menu

perfsonar-user - [perfsonar-user] perfSONAR-BUOY Services not restarting after nightly maintenance

Subject: perfSONAR User Q&A and Other Discussion

List archive

[perfsonar-user] perfSONAR-BUOY Services not restarting after nightly maintenance


Chronological Thread 
  • From: Will Lipe <>
  • To:
  • Subject: [perfsonar-user] perfSONAR-BUOY Services not restarting after nightly maintenance
  • Date: Thu, 23 May 2013 14:52:26 -0500
  • Authentication-results: sfpop-ironport03.merit.edu; dkim=neutral (message not signed) header.i=none

Hi All,

We're running a PerfSONAR v3.2.2 from ISO, and are currently having a problem with a few services failing to restart after some automatic process that runs during the night.

At around 1:08am, two of our services, "perfSONAR-BUOY Throughput" and "perfSONAR-BUOY One-Way Latency" (though lately "Traceroute Regular Testing" has suffered this fate as well) die and do not return to service until I restart them manually from the web interface (by navigating to the "Enabled Services" page and clicking the "Save" button to restart all services).  

This has been happening off and on for over a month, and does not occur every day, though it has been occurring every day lately.  Can anyone think of why this might be occurring?  Thanks for any ideas you might be able to provide.  Below are details and log excerpts for anyone interested.

-- 
Will Lipe
Systems Administrator
KanREN, Inc.
(785) 856-9800 x9812



Log excerpts and details:

It seems this event is triggered by a script that runs at 1:05 daily, "5 1 * * * root /opt/perfsonar_ps/toolkit/scripts/service_watcher --regular_restart &> /var/log/perfsonar/service_watcher_error.log"

These messages appear in /var/log/messages when this happens:

May  3 01:08:47 ps-data powcollector.pl[1637]: Exiting... Deleting sub-processes...
May  3 01:08:52 ps-data bwcollector.pl[1660]: Exiting... Deleting sub-processes...
May  3 01:08:52 ps-data bwcollector.pl[1660]: Dead

An example from /var/log/perfsonar/owamp_bwctl.log, when the service doesn't automatically restart.  Note the time delay:

May  3 01:08:47 ps-data powcollector.pl[1637]: Exiting... Deleting sub-processes...
May  3 01:08:52 ps-data bwcollector.pl[1660]: Exiting... Deleting sub-processes...
May  3 01:08:52 ps-data bwcollector.pl[1660]: Dead
May  3 10:49:35 ps-data powcollector.pl[15705]: Opened syslog
May  3 10:49:35 ps-data powmaster.pl[15715]: Opened syslog

Compared to a normal day:

May  2 01:08:47 ps-data powcollector.pl[15547]: Exiting... Deleting sub-processes...
May  2 01:08:48 ps-data powcollector.pl[15547]: handle_req:1525 exited: 0
May  2 01:08:48 ps-data powcollector.pl[15547]: handle_req:1526 exited: 0
May  2 01:08:48 ps-data powcollector.pl[15547]: handle_req:1529 exited: 0
May  2 01:08:48 ps-data powcollector.pl[15547]: handle_req:1536 exited: 0
May  2 01:08:48 ps-data powcollector.pl[15547]: Dead
May  2 01:08:53 ps-data powcollector.pl[1636]: Opened syslog
May  2 01:08:54 ps-data powmaster.pl[1646]: Opened syslog
May  2 01:08:54 ps-data powmaster.pl[1646]: No tests to be run by the nodes (DATA). at /opt/perfsonar_ps/perfsonarbuoy_ma/bin/powmaster.pl line 378.
May  2 01:08:55 ps-data bwcollector.pl[15602]: Exiting... Deleting sub-processes...
May  2 01:08:55 ps-data bwcollector.pl[15602]: Dead
May  2 01:09:00 ps-data bwcollector.pl[1659]: Opened syslog
May  2 01:09:01 ps-data bwmaster.pl[1669]: Opened syslog

Output of service_watcher.log on a normal day:

2013/05/02 01:05:02 (1259) INFO> service_watcher:177 main:: - Regular restart for mysql.
2013/05/02 01:05:14 (1259) INFO> service_watcher:177 main:: - Regular restart for hls.
2013/05/02 01:05:25 (1259) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_ma.
2013/05/02 01:08:30 (1259) INFO> service_watcher:177 main:: - Regular restart for traceroute_ma.
2013/05/02 01:08:37 (1259) INFO> service_watcher:177 main:: - Regular restart for traceroute_scheduler.
2013/05/02 01:08:46 (1259) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_owamp.
2013/05/02 01:08:54 (1259) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_bwctl.
2013/05/02 01:09:01 (1259) INFO> service_watcher:177 main:: - Regular restart for ls_registration_daemon.
2013/05/02 01:09:08 (1259) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/02 01:09:08 (1259) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/02 01:09:08 (1259) INFO> service_watcher:267 main::handle_reporting - service='hls' status='OK' report='Service is up'
2013/05/02 01:09:08 (1259) INFO> service_watcher:267 main::handle_reporting - service='mysql' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_ma' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='traceroute_ma' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='traceroute_scheduler' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/02 01:09:09 (1259) INFO> service_watcher:267 main::handle_reporting - service='ls_registration_daemon' status='OK' report='Service is up'
2013/05/02 01:09:10 (1259) INFO> service_watcher:267 main::handle_reporting - service='ls_cache_daemon' status='OK' report='Service is up'
2013/05/02 01:09:10 (1259) INFO> service_watcher:267 main::handle_reporting - service='config_daemon' status='OK' report='Service is up'

Output of service_watcher.log on a problem day:

2013/05/03 01:05:02 (2011) INFO> service_watcher:177 main:: - Regular restart for mysql.
2013/05/03 01:05:14 (2011) INFO> service_watcher:177 main:: - Regular restart for hls.
2013/05/03 01:05:25 (2011) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_ma.
2013/05/03 01:08:30 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service perfsonarbuoy_ma: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: read timeout
2013/05/03 01:08:30 (2011) INFO> service_watcher:177 main:: - Regular restart for traceroute_ma.
2013/05/03 01:08:35 (2011) ERROR> ConfigClient.pm:192 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::stopService - Problem stopping service traceroute_ma: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Server closed connection without sending any data back
2013/05/03 01:08:41 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service traceroute_ma: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:08:41 (2011) INFO> service_watcher:177 main:: - Regular restart for traceroute_scheduler.
2013/05/03 01:08:41 (2011) ERROR> ConfigClient.pm:192 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::stopService - Problem stopping service traceroute_scheduler: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:08:46 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service traceroute_scheduler: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:08:46 (2011) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_owamp.
2013/05/03 01:08:52 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service perfsonarbuoy_owamp: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:08:52 (2011) INFO> service_watcher:177 main:: - Regular restart for perfsonarbuoy_bwctl.
2013/05/03 01:08:57 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service perfsonarbuoy_bwctl: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:08:57 (2011) INFO> service_watcher:177 main:: - Regular restart for ls_registration_daemon.
2013/05/03 01:09:04 (2011) ERROR> ConfigClient.pm:192 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::stopService - Problem stopping service traceroute_ma: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:09:09 (2011) ERROR> ConfigClient.pm:158 perfSONAR_PS::NPToolkit::ConfigManager::ConfigClient::startService - Problem starting service traceroute_ma: RPC::XML::Client::simple_request: RPC::XML::Client::send_request: HTTP server error: Can't connect to localhost:9000 (connect: Connection refused)
2013/05/03 01:09:09 (2011) INFO> service_watcher:267 main::handle_reporting - service='traceroute_ma' status='Critical' report='Service traceroute_ma is down. Errors were reported while trying to start the service'
2013/05/03 01:10:54 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:10:54 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:10:54 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:10:54 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:10:54 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:11:07 (2011) INFO> service_watcher:267 main::handle_reporting - service='config_daemon' status='Critical' report='Service config_daemon is down.'
2013/05/03 01:12:27 (2011) INFO> service_watcher:267 main::handle_reporting - service='hls' status='OK' report='Service is up'
2013/05/03 01:12:27 (2011) INFO> service_watcher:267 main::handle_reporting - service='mysql' status='OK' report='Service is up'
2013/05/03 01:12:28 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_ma' status='OK' report='Service is up'
2013/05/03 01:12:28 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:28 (2011) INFO> service_watcher:267 main::handle_reporting - service='traceroute_scheduler' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='ls_registration_daemon' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='ls_cache_daemon' status='OK' report='Service is up'
2013/05/03 01:12:29 (2011) INFO> service_watcher:267 main::handle_reporting - service='config_daemon' status='OK' report='Service is up'


The service watcher script seems to think the services are still running during this time.  Every hourly report after that indicates that the services are up, however the webpage lists them as "Not Running", our network monitoring system sees them as unavailable (via tcp port), and there is no logging activity for them during this time.

An excerpt from service_watcher.log while the services are down:

2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='hls' status='OK' report='Service is up'
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='mysql' status='OK' report='Service is up'
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_ma' status='OK' report='Service is up'
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='traceroute_ma' status='OK' report='Service is up'
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='traceroute_scheduler' status='OK' report='Service is up'
2013/05/03 05:00:03 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:03 (8051) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_owamp' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) ERROR> RegularTesting.pm:248 perfSONAR_PS::NPToolkit::Config::RegularTesting::reset_state - Can only handle 'star' groups currently at /opt/perfsonar_ps/toolkit/scripts/../lib/perfSONAR_PS/NPToolkit/Config/RegularTesting.pm line 1787.
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='perfsonarbuoy_bwctl' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='ls_registration_daemon' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='ls_cache_daemon' status='OK' report='Service is up'
2013/05/03 05:00:04 (8051) INFO> service_watcher:267 main::handle_reporting - service='config_daemon' status='OK' report='Service is up'

They do not return to service until I restart the services via the "enabled services" webpage.

It seems like maybe the toolkit_config_daemon is dead during the restart time and thus is unable to start the services back up?  Not sure why this would be.  The /var/log/perfsonar/config_daemon.log file is empty.  No other services seem to be impacted by this problem (though this is perhaps because the config_daemon is also failing to stop them?). 


  • [perfsonar-user] perfSONAR-BUOY Services not restarting after nightly maintenance, Will Lipe, 05/23/2013

Archive powered by MHonArc 2.6.16.

Top of Page