perfsonar-user - Re: [perfsonar-user] Cassandra runaway CPU
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To: Andrew Lake <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Cassandra runaway CPU
- Date: Tue, 5 Apr 2016 15:05:40 -0500
Group,
Here's what I've done since last week. I've taken the box offline for several maintenance windows and booted it from live CDs to run memory diagnostics, HD diagnostics, and CPU and chipset diagnostics (cpuburn to heat up the box and look for fan problems, etc.). I upgraded the BIOS, thinking maybe I'd get better (or different) S.M.A.R.T. info, and did a full badblocks block-level check of the drive. Then I did a RAID controller consistency check on the mirrored pair, because, meh, why not? :-) I even followed the directions in the FAQ to nuke the Esmond database and re-initialize it.
Casey Russell
Network Engineer
Kansas Research and Education Network
2029 Becker Drive, Suite 282
Lawrence, KS 66047
(785)856-9820 ext 9809
On Thu, Mar 31, 2016 at 9:17 AM, Casey Russell <> wrote:
Andy,
Thank you for giving me an avenue to chase down.
I haven't checked that. But it had crossed my mind. A little bit of my Cassandra reading hinted in that direction. I may have to schedule a maintenance window and see if I can do some offline diagnostics of that box.
Casey Russell
Network Engineer
Kansas Research and Education Network
2029 Becker Drive, Suite 282
Lawrence, KS 66047
On Thu, Mar 31, 2016 at 8:15 AM, Andrew Lake <> wrote:
Hi,
Have you checked for a failing disk or bad memory on the host in question? It could be something else, but I’ve seen similar before on our ESnet hosts when we have had hardware failures.
Thanks,
Andy
On March 30, 2016 at 6:07:34 PM, Casey Russell () wrote:
I've had a node for some time that has been acting strangely, and just in the last day or two I've had some time to dedicate to our PS gear, so I've dug into this to try to figure out what's going on. Now that I have, I don't know how to fix it.
I have 4 more or less identical testing nodes at different points in my network that are part of a Mesh Config. The 4 nodes are identical hardware purchased at the same time and (unless I messed something up) all installed with essentially identical images, except for their IP/Subnet/Host info etc. Quad core processors (x2 for a total of 8 cores), 8GB of RAM, local storage and 1G testing interfaces.
Three of them run perfectly all day long and loaf along with a load average of 0.03 or so. My oddball has been running way hotter than that for weeks with a load average of 1.3 or more. Top shows real quickly what the culprit is.
Normal systems show no single process using any substantial CPU (and cassandra rarely even appears on the list), while the system that's acting up shows something similar to the following all the time.
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
25058 cassandr  20   0 7632m 1.7g  17m S 111.5 22.6  17:21.62 java
I read up a little, and enabled the nodetool utility, and discovered that there is a compaction process that is never completing on this node. If I restart cassandra, it will kick off about 3 compactions; all of them complete but this one. It always stops at around 5.32 percent and never progresses any further (even after 2-3 hours).
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace               table   completed        total   unit   progress
        Compaction     esmond   rate_aggregations   140157665   2632220068   bytes      5.32%
Active compaction remaining time :        n/a
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace               table   completed        total   unit   progress
        Compaction     esmond   rate_aggregations   140157665   2632220068   bytes      5.32%
Active compaction remaining time :        n/a
Does anyone have any idea what happened to my esmond database here and how to get it back on track? I'll obviously be glad to provide log data if it'll be helpful.
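The stuck compaction described in this message can be confirmed by comparing two samples of `nodetool compactionstats` taken some time apart. What follows is only a minimal sketch, not part of the original thread. It assumes the column layout shown in this message (type, keyspace, table, completed, total, unit, progress, with keyspace and table as separate columns); other Cassandra versions may print different columns. The names `parse_compactionstats`, `stalled_tasks`, and `SAMPLE` are hypothetical helpers made up for illustration.

```python
from typing import List, NamedTuple


class CompactionTask(NamedTuple):
    keyspace: str
    table: str
    completed: int
    total: int


def parse_compactionstats(text: str) -> List[CompactionTask]:
    """Extract active compaction rows from `nodetool compactionstats` text.

    Assumes rows of the form:
    Compaction <keyspace> <table> <completed> <total> <unit> <progress>
    """
    tasks = []
    for line in text.splitlines():
        fields = line.split()
        # The header row starts with lowercase "compaction type", so it is skipped.
        if len(fields) == 7 and fields[0] == "Compaction":
            tasks.append(
                CompactionTask(fields[1], fields[2], int(fields[3]), int(fields[4]))
            )
    return tasks


def stalled_tasks(earlier: str, later: str) -> List[CompactionTask]:
    """Return unfinished tasks whose completed count did not advance."""
    before = {
        (t.keyspace, t.table, t.total): t.completed
        for t in parse_compactionstats(earlier)
    }
    return [
        t
        for t in parse_compactionstats(later)
        if before.get((t.keyspace, t.table, t.total)) == t.completed
        and t.completed < t.total
    ]


# Sample text copied from the output quoted in this message.
SAMPLE = """pending tasks: 1
compaction type keyspace table completed total unit progress
Compaction esmond rate_aggregations 140157665 2632220068 bytes 5.32%
Active compaction remaining time : n/a
"""

if __name__ == "__main__":
    # Passing the same sample twice mimics two identical readings.
    for task in stalled_tasks(SAMPLE, SAMPLE):
        pct = 100.0 * task.completed / task.total
        print(f"{task.keyspace}.{task.table} stuck at {pct:.2f}%")
```

In practice the two strings would come from running the command twice, several minutes apart, for example by capturing its standard output with `subprocess.run`. How long to wait before calling a compaction stalled is a judgment call that depends on table size and disk speed.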
Casey Russell
Network Engineer
Kansas Research and Education Network
2029 Becker Drive, Suite 282
Lawrence, KS 66047