perfsonar-user - Re: [perfsonar-user] Restarting eash service while debug

Subject: perfSONAR User Q&A and Other Discussion

From: Andrew Lake <>
To:
Cc:
Subject: Re: [perfsonar-user] Restarting eash service while debug
Date: Wed, 4 Feb 2015 12:40:17 -0500

Hi Winnie,

Thanks for the context, it's extremely helpful. Just to summarize, you have
two hosts you are debugging:

1. The first is lcgnetmon and it recently had its disk fill up due to an
excessive number of regular_testing files. You disabled all the services on
that host, deleted the offending files, and now are trying to bring the host
back.

2. The second host you are debugging is lcgnetmon02. Its disk is not full;
it's just performing poorly, with high load leading to unresponsive web
pages, etc.

Let's start with #1. Everything you list can be chkconfig'ed on and your host
should be fine (under normal circumstances). If you really want to turn the
following off you can, but it is not required:

traceroute_ma
traceroute_master
traceroute_ondemand_mp
traceroute_scheduler
perfsonarbuoy_ma
perfsonarbuoy_owp_collector
perfsonarbuoy_owp_master
PingER
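
If you do decide to turn those off, a minimal sketch along these lines would
do it, assuming the service names above match the init scripts on your host.
The run and disable_legacy helpers and the DRY_RUN variable are names made up
for this illustration, not part of the Toolkit. By default it only prints the
commands so you can review them first.

```shell
#!/bin/sh
# Optional services listed above; copied from this message.
LEGACY_SERVICES="traceroute_ma traceroute_master traceroute_ondemand_mp
traceroute_scheduler perfsonarbuoy_ma perfsonarbuoy_owp_collector
perfsonarbuoy_owp_master PingER"

# Hypothetical helper: with DRY_RUN=1 (the default) just print the command,
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop each service now and keep it from starting at boot.
disable_legacy() {
    for svc in $LEGACY_SERVICES; do
        run /sbin/service "$svc" stop
        run /sbin/chkconfig "$svc" off
    done
}
```

Calling disable_legacy prints sixteen commands in dry-run mode. To apply them
for real, set DRY_RUN=0 and call it as root.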

I would also recommend a slightly different strategy than turning small
batches of stuff back on and looking at the logs. There are lots of
interactions between services and you are just going to create headaches for
yourself picking through logs and running into error messages that go away
when you enable the rest. Instead, turn everything back on (optionally
keeping the stuff above off) and keep your eye on things. It sounds like the
original culprit was somewhere between the regular_testing service and what
we call the measurement archive (MA) that runs under httpd. If the disk
starts filling you can turn off the regular_testing service and the filling
will stop.
No need to disable everything on the box. In the meantime, once you have
everything enabled again we can see if the web interface returns. If it does,
I can help you assess what caused the original problem. If the web interface
does not return after re-enabling everything else I can help with that too.
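
One way to "keep your eye on things" is to watch inode usage on the
filesystem that holds the regular_testing directory, since inodes are what
ran out last time. The following is only a sketch: the function names and the
80% threshold are assumptions chosen for illustration, and it only prints a
suggestion rather than stopping anything itself.

```shell
#!/bin/sh
# Directory named earlier in this thread; THRESHOLD is an arbitrary example.
DIR=/var/lib/perfsonar/regular_testing
THRESHOLD=80

# Pull the IUse% column (5th field of the data line) out of `df -Pi` output.
inode_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Turn a percentage into a one-line report.
report() {
    pct=$1
    case "$pct" in
        ''|*[!0-9]*) echo "could not read inode usage"; return 1 ;;
    esac
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "inode usage ${pct}%: consider /sbin/service regular_testing stop"
    else
        echo "inode usage ${pct}%: ok"
    fi
}

# Check the filesystem that holds DIR.
check_inodes() {
    report "$(df -Pi "$DIR" | inode_pct)"
}
```

Running check_inodes by hand every so often, or from cron, should show
whether the file count is climbing again well before the disk is unusable.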

For problem #2, what does the command 'top' output? Do things get better if
you run "/sbin/service regular_testing stop"? From information registered in
our lookup service, I see that your host has 1.89GB of memory (which I am
guessing is actually 2GB of memory), which is below the recommended system
requirements for a perfSONAR host of 4GB
(http://www.perfsonar.net/deploy/hardware-selection/). It doesn't mean it's
impossible to run a Toolkit with that much memory, but if you're running a
relatively large number of tests that could be problematic. Depending on what
top says, it may be worth adding more memory, keeping in mind that it looks
like you've done a 32-bit install so you'll be limited to 4GB of memory by
the OS.
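
If it is easier than pasting an interactive screen, something like the
following captures the same information in one go. It is a sketch under
assumptions: the function names are invented for this example, and it relies
on the standard Linux /proc/meminfo file and on top's batch mode (-b -n 1).

```shell
#!/bin/sh
# The 4GB recommendation from the hardware page linked above, in MB.
RECOMMENDED_MB=4096

# Convert the MemTotal line of /proc/meminfo (reported in kB) to whole MB.
mem_total_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }'
}

# Compare the reported total against the recommendation.
report_memory() {
    total_mb=$1
    if [ "$total_mb" -lt "$RECOMMENDED_MB" ]; then
        echo "${total_mb}MB total: below the ${RECOMMENDED_MB}MB recommendation"
    else
        echo "${total_mb}MB total: meets the ${RECOMMENDED_MB}MB recommendation"
    fi
}

# One-shot snapshot: the top of top's batch output, then the memory check.
snapshot() {
    top -b -n 1 | head -n 15
    report_memory "$(mem_total_mb < /proc/meminfo)"
}
```

Calling snapshot prints the load average, memory summary, and busiest
processes. The kernel reports slightly less than the installed memory, so
treat a total just under the recommendation as meeting it.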

Please let me know if I missed anything, but hopefully those prove to be
useful next steps.

Thanks,
Andy

On Feb 4, 2015, at 9:56 AM, Winnie Lacesso
<>
wrote:

> Good afternoon Gentlemen!
>
> THANK YOU *immensely* kindly for your patient help!
>
>> what exactly are you trying to fix?
> My colleague upgraded lcgnetmon.phy.bris.ac.uk (Latency) & lcgnetmon02
> (Bandwidth) from v3.3 to v3.4 on 17 Oct 2014. In late Nov we were ticketed
> that lcgnetmon.phy.bris.ac.uk was not working. (NB Our site is low on
> manpower & neither of us know much about them, so after upgrade we assumed
> they were ok.....)
>
> So try to fix lcgnetmon. At that time
> https://lcgnetmon.phy.bris.ac.uk/serviceTest/psGraph.cgi
> was showing connections but gave errors like "Negative latency values
> found in the reverse direction." & maddash was showing only "500 Can't
> connect to lcgnetmon.phy.bris.ac.uk:80 (connect: no route to host)"
>
> After some changes psGraph.cgi just cycled forever & forever. After some
> more changes psGraph.cgi cycled a little bit then showed nothing.
> (Sigh... progress?)
> Over this latter time, the disk filled up with 9million tiny files in
> /var/lib/perfsonar/regular_testing/ & available inodes went to 0%
>
> *All* perfsonar services on lcgnetmon were found in rc3.d order, shutdown,
> the perfsonar crontabs were disabled, & 2 days to clean up 9million tiny
> files in /var/lib/perfsonar/regular_testing. PAIN but Done.
>
>> Are there particular services listed as not running on the main
>> toolkit web page?
> The webserver is not yet running. In the past it listed all looking well.
> During that time, the disk filled up with 9million files from perfsonar
> process or config badness. So the web page saying "all is well" was not.
> Don't want to do that again. Please.
>
>> are there particular graphs not loading?
> Confirmed - when the webserver was last running, NO lcgnetmon graphs.
>
> Now want to restart perfsonar services few by few & make sure they are
> healthy. The order as found by rc3.d is
>
> config_daemon
> cassandra
> oppd
> simple_ls_bootstrap_client
> ls_cache_daemon
> ls_registration_daemon
> owampd
> traceroute_ma
> traceroute_master
> traceroute_ondemand_mp
> traceroute_scheduler
> mysqld
> perfsonarbuoy_ma # told yesterday these are not needed for v3.4;
> perfsonarbuoy_owp_collector # v3.3 -> v3.4 upgrade did not disable them.
> perfsonarbuoy_owp_master # Now stopped & chkconfig'd off
>
> PingER # told today old v3.3 service not disabled by v3.4 upgrade
> regular_testing
> httpd
> fail2ban
> configure_nic_parameters
> generate_motd
> psb_to_esmond
>
> Since some are old v3.3 services that the upgrade should have but did not
> disable (maybe that's what caused the broken = get ticketed, & filling up
> the disk), *what* is the list of above that should NOT be chkconfig'd on
> for v3.4? (It would be nice if the documentation listed that.)
>
>> Just to reiterate, it would be much better to understand what triggered
>> your digging through the old logs.
>
> Don't want to dig thru old logs! Want to start a necessary perfsonar
> process in the order it should be started, confirm healthy, move onto next.
> Since none of the perfsonar services have a "status" option the only
> confirm seems to be see if the process is running & check its logfile for
> "I am running + healthy" type entries vs "I am sick / broken / dead".
>
> Since the "broken/dead" were seen - hence the question. But, since same on
> bandwidth box, as you say "nothing to be concerned about" - GOOD!
>
> Now you say
>> The old traceroute daemons are not used. Everything has been moved to the
>> new MA
> Not sure what "new MA" means. Should *all* the above traceroute services
> be shut off & disabled?
> Or by "new MA" do you mean that traceroute_ma (or one of them) is a v3.4
> service that should be left enabled? Beg of you please be clear!
>
>
> Same question for lcgnetmon02 = Bandwidth, which DOES appear to be working
> but poorly, so it too may have some old v3.3 services left enabled causing
> poor performance. Here are its service chkconfig'd on in the order they
> start in rc3.d; did the v3.3 -> v3.4 leave some enabled that should be shut
> off & disabled ?
>
> config_daemon
> cassandra
> oppd
> simple_ls_bootstrap_client
> ls_cache_daemon
> ls_registration_daemon
> owampd
> traceroute_ma
> traceroute_master
> traceroute_ondemand_mp
> traceroute_scheduler
> mysqld
> perfsonarbuoy_bw_collector
> perfsonarbuoy_bw_master
> perfsonarbuoy_ma
> regular_testing
> httpd
> fail2ban
> configure_nic_parameters
> generate_motd
> psb_to_esmond
>
> It is not having pathological 0% inodes problem tho. :) Just poor
> performance.
