perfsonar-user - Re: [perfsonar-user] Restarting eash service while debug

Subject: perfSONAR User Q&A and Other Discussion

From: Winnie Lacesso <>
To: Andrew Lake <>
Cc:
Subject: Re: [perfsonar-user] Restarting eash service while debug
Date: Wed, 4 Feb 2015 14:56:41 +0000 (GMT)

Good afternoon Gentlemen!

THANK YOU *immensely* kindly for your patient help!

> what exactly are you trying to fix?
My colleague upgraded lcgnetmon.phy.bris.ac.uk (Latency) & lcgnetmon02
(Bandwidth) from v3.3 to v3.4 on 17 Oct 2014. In late Nov we were ticketed
that lcgnetmon.phy.bris.ac.uk was not working. (NB Our site is low on
manpower & neither of us knows much about these hosts, so after the upgrade
we assumed they were ok.....)

So we tried to fix lcgnetmon. At that time
https://lcgnetmon.phy.bris.ac.uk/serviceTest/psGraph.cgi
was showing connections but gave errors like "Negative latency values
found in the reverse direction." & maddash was showing only "500 Can't
connect to lcgnetmon.phy.bris.ac.uk:80 (connect: no route to host)"

After some changes psGraph.cgi just cycled forever & forever. After some
more changes psGraph.cgi cycled a little bit then showed nothing.
(Sigh... progress?)
Over this latter time, the disk filled up with 9million tiny files in
/var/lib/perfsonar/regular_testing/ & available inodes went to 0%

*All* perfsonar services on lcgnetmon were then identified in rc3.d order &
shut down, the perfsonar crontabs were disabled, & it took 2 days to clean
up the 9million tiny files in /var/lib/perfsonar/regular_testing. PAIN but
Done.
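
A minimal sketch of how the inode situation described above could be watched before it reaches 0% free again. It uses only the Python standard library; the function names are made up for illustration and nothing here is part of perfSONAR.

```python
import os

def inode_usage_percent(path):
    """Return the percentage of inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report an inode total.
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files

def count_files(path):
    """Count non-directory entries below path, one directory at a time."""
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total
```

Calling inode_usage_percent("/var/lib/perfsonar/regular_testing") from cron and alerting above some threshold would be one way to use it; count_files on the same directory shows whether the tiny files are piling up again.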

> Are there particular services listed as not running on the main
> toolkit web page?
The webserver is not yet running. In the past it listed all looking well.
During that time, the disk filled up with 9million files from perfsonar
process or config badness. So the web page saying "all is well" was wrong.
Don't want to do that again. Please.

> are there particular graphs not loading?
Confirmed - when the webserver was last running, NO lcgnetmon graphs.

Now want to restart perfsonar services a few at a time & make sure they are
healthy. The order as found in rc3.d is

config_daemon
cassandra
oppd
simple_ls_bootstrap_client
ls_cache_daemon
ls_registration_daemon
owampd
traceroute_ma
traceroute_master
traceroute_ondemand_mp
traceroute_scheduler
mysqld
perfsonarbuoy_ma # told yesterday these are not needed for v3.4;
perfsonarbuoy_owp_collector # v3.3 -> v3.4 upgrade did not disable them.
perfsonarbuoy_owp_master # Now stopped & chkconfig'd off

PingER # told today old v3.3 service not disabled by v3.4 upgrade
regular_testing
httpd
fail2ban
configure_nic_parameters
generate_motd
psb_to_esmond
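
For reference, a sketch of how a start order like the one above can be read out of rc3.d. It assumes the usual SysV convention that start links are named "S" + two-digit priority + service name and that K links are kill links; the function names and the priorities in the usage note are hypothetical.

```python
import os
import re

# SysV start links look like "S85httpd": "S", two-digit priority, service name.
START_LINK = re.compile(r"^S(\d{2})(.+)$")

def start_order(link_names):
    """Return service names sorted by start priority, then by name."""
    parsed = []
    for name in link_names:
        m = START_LINK.match(name)
        if m:
            # K* (kill) links and anything else are skipped.
            parsed.append((int(m.group(1)), m.group(2)))
    return [service for _prio, service in sorted(parsed)]

def start_order_from_dir(rc_dir="/etc/rc3.d"):
    """Read the link names from rc_dir and return the start order."""
    return start_order(os.listdir(rc_dir))
```

For example, start_order(["S85httpd", "K15httpd", "S64mysqld", "S20cassandra"]) gives ["cassandra", "mysqld", "httpd"]; those priorities are invented, the real ones come from the init scripts.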

Since some are old v3.3 services that the upgrade should have disabled but
did not (maybe that's what caused the breakage that got us ticketed, & the
disk filling up), *which* of the above should NOT be chkconfig'd on for
v3.4? (It would be nice if the documentation listed that.)

> Just to reiterate, it would be much better to understand what triggered
> your digging through the old logs.

Don't want to dig thru old logs! Want to start a necessary perfsonar
process in the order it should be started, confirm healthy, move onto next.
Since none of the perfsonar services have a "status" option, the only way
to confirm seems to be to see if the process is running & check its logfile
for "I am running + healthy" type entries vs "I am sick / broken / dead".

Since the "broken/dead" entries were seen - hence the question. But, since
it is the same on the bandwidth box, & as you say "nothing to be concerned
about" - GOOD!
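
The logfile half of that check could be sketched roughly as below. The keywords are an assumption, not the real perfSONAR log format, and as noted above some "broken/dead" looking entries turn out to be harmless, so this only narrows down what to read by eye.

```python
import re

# Assumed keywords; real perfSONAR log formats may differ.
BAD = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b|died|failed", re.IGNORECASE)

def suspicious_lines(lines, limit=20):
    """Return up to `limit` of the most recent lines that match BAD."""
    if limit <= 0:
        return []
    hits = [line.rstrip("\n") for line in lines if BAD.search(line)]
    return hits[-limit:]

def check_log(path, limit=20):
    """Scan one logfile; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as fh:
        return suspicious_lines(fh, limit)
```

The idea would be to start one service, wait a little, run check_log on its logfile, and only move on to the next service if nothing new shows up.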

Now you say
> The old traceroute daemons are not used. Everything has been moved to the
> new MA
Not sure what "new MA" means. Should *all* the above traceroute services
be shut off & disabled?
Or by "new MA" do you mean that traceroute_ma (or one of them) is a v3.4
service that should be left enabled? Beg of you please be clear!

Same question for lcgnetmon02 = Bandwidth, which DOES appear to be working
but poorly, so it too may have some old v3.3 services left enabled causing
poor performance. Here are its services chkconfig'd on, in the order they
start in rc3.d; did the v3.3 -> v3.4 upgrade leave some enabled that should
be shut off & disabled?

config_daemon
cassandra
oppd
simple_ls_bootstrap_client
ls_cache_daemon
ls_registration_daemon
owampd
traceroute_ma
traceroute_master
traceroute_ondemand_mp
traceroute_scheduler
mysqld
perfsonarbuoy_bw_collector
perfsonarbuoy_bw_master
perfsonarbuoy_ma
regular_testing
httpd
fail2ban
configure_nic_parameters
generate_motd
psb_to_esmond

It does not have the pathological 0% inodes problem tho. :) Just poor
performance.
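
Once the list of obsolete services is known, the comparison itself is simple; a sketch follows. The set below holds only the names this thread has already reported as not needed on v3.4, it is not an official list, and the traceroute_* and perfsonarbuoy_bw_* services are deliberately left out because they are still an open question. The function names are hypothetical; the generated strings use the standard "service" and "chkconfig" commands.

```python
# Services this thread reports as no longer needed on v3.4.
# Not an official list; extend it only once a name is confirmed.
REPORTED_OBSOLETE = {
    "perfsonarbuoy_ma",
    "perfsonarbuoy_owp_collector",
    "perfsonarbuoy_owp_master",
    "PingER",
}

def split_enabled(enabled, obsolete=REPORTED_OBSOLETE):
    """Split enabled services into (keep, drop), preserving start order."""
    keep = [s for s in enabled if s not in obsolete]
    drop = [s for s in enabled if s in obsolete]
    return keep, drop

def disable_commands(services):
    """Build the stop/disable commands as strings; nothing is executed."""
    cmds = []
    for s in services:
        cmds.append("service %s stop" % s)
        cmds.append("chkconfig %s off" % s)
    return cmds
```

Run against the lcgnetmon02 list above, only perfsonarbuoy_ma would be flagged so far; the commands are printed rather than run so they can be reviewed first.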
