I've had a node for some time that has been acting strangely,
and just in the last day or two I've had some time to dedicate to
our PS gear, so I've dug into this to try to figure out what's
going on. Now that I have, I don't know how to fix it.
I have 4 more or less identical testing nodes at different points
in my network that are part of a Mesh Config. The 4 nodes are
identical hardware purchased at the same time and (unless I messed
something up) all installed with essentially identical images,
except for their IP/subnet/host info etc. Each has two quad-core
processors (8 cores total), 8 GB of RAM, local storage, and 1G
testing interfaces.
Three of them run perfectly all day long and loaf along with a load
average of 0.03 or so. My oddball has been running way hotter
than that for weeks, with a load average of 1.3 or more. Top
shows very quickly what the culprit is.
A normal system shows no single process using any substantial CPU
(and cassandra rarely even appears on the list), while the system
that's acting up shows something similar to the following all the
time:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
25058 cassandr  20   0 7632m 1.7g  17m S 111.5 22.6  17:21.62 java
I read up a little, enabled the nodetool utility, and discovered
that there is a compaction process that never completes on this
node. If I restart cassandra, it kicks off about 3 compactions.
All of them complete except this one, which always stops at around
5.32 percent and never progresses any further (even after 2-3
hours). Two runs of compactionstats show identical numbers:
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace   table               completed   total        unit    progress
   Compaction        esmond     rate_aggregations   140157665   2632220068   bytes   5.32%
Active compaction remaining time : n/a
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace   table               completed   total        unit    progress
   Compaction        esmond     rate_aggregations   140157665   2632220068   bytes   5.32%
Active compaction remaining time : n/a
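In case it helps anyone reproduce the check, here is a minimal
sketch (my own, not part of the perfSONAR toolkit or Cassandra) that
compares two compactionstats snapshots and reports any compaction
whose completed byte count has not moved. It assumes each data row
has the columns type, keyspace, table, completed, total, unit and
progress, as in the header above, and that the compaction type is a
single word. The function names are hypothetical, and other
Cassandra versions may print a different layout.

```python
import re

# Assumed row layout: type keyspace table completed total unit progress%
ROW = re.compile(
    r"^\s*(?P<type>\S+)\s+(?P<keyspace>\S+)\s+(?P<table>\S+)\s+"
    r"(?P<completed>\d+)\s+(?P<total>\d+)\s+(?P<unit>\S+)\s+"
    r"(?P<pct>[\d.]+)%\s*$",
    re.MULTILINE,
)


def parse_compactions(text):
    """Map (keyspace, table) -> completed count for each data row."""
    return {
        (m["keyspace"], m["table"]): int(m["completed"])
        for m in ROW.finditer(text)
    }


def stalled(before, after):
    """Return compactions present in both snapshots with no progress."""
    first = parse_compactions(before)
    second = parse_compactions(after)
    return sorted(k for k in first if k in second and first[k] == second[k])
```

The two snapshots could be captured a few minutes apart with
subprocess.run(["nodetool", "compactionstats"], capture_output=True,
text=True).stdout and passed to stalled(); a non-empty result means
the listed compaction made no progress in between.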
Does anyone have any idea what happened to my esmond database here
and how to get it back on track? I'll obviously be glad to
provide log data if it'll be helpful.