perfsonar-user - Re: [perfsonar-user] Cassandra runaway CPU

Subject: perfSONAR User Q&A and Other Discussion

From: Casey Russell
To: Andrew Lake
Subject: Re: [perfsonar-user] Cassandra runaway CPU
Date: Thu, 31 Mar 2016 09:17:19 -0500

Andy,

I haven't checked that. But it had crossed my mind. A little bit of my Cassandra reading hinted in that direction. I may have to schedule a maintenance window and see if I can do some offline diagnostics of that box.

Thank you for giving me an avenue to chase down.

Casey Russell

Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

On Thu, Mar 31, 2016 at 8:15 AM, Andrew Lake wrote:

Hi,

Have you checked for a failing disk or bad memory on the host in question? It could be something else, but I’ve seen similar behavior before on our ESnet hosts when we have had hardware failures.

Thanks,
Andy

On March 30, 2016 at 6:07:34 PM, Casey Russell wrote:

I've had a node for some time that has been acting strangely, and just in the last day or two I've had some time to dedicate to our PS gear, so I've dug into this to try to figure out what's going on. Now that I have, I don't know how to fix it.

I have 4 more or less identical testing nodes at different points in my network that are part of a Mesh Config. The 4 nodes are identical hardware purchased at the same time and (unless I messed something up) all installed with essentially identical images, except for their IP/subnet/host info. Each has two quad-core processors (8 cores total), 8GB of RAM, local storage, and 1G testing interfaces.

Three of them run perfectly all day long and loaf along with a load average of 0.03 or so. My oddball has been running way hotter than that for weeks, with a load average of 1.3 or more. top shows very quickly what the culprit is.

A normal system shows no single process using any substantial CPU (and cassandra rarely even appears on the list), while the system that's acting up shows something similar to the following all the time.

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
25058 cassandr  20   0 7632m 1.7g  17m S 111.5 22.6 17:21.62 java

I read up a little, enabled the nodetool utility, and discovered that there is a compaction process that never completes on this node. If I restart cassandra, it kicks off about 3 compactions. All of them complete except this one, which always stops at around 5.32 percent and never progresses any further (even after 2-3 hours).

[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace               table   completed        total    unit   progress
        Compaction     esmond   rate_aggregations   140157665   2632220068   bytes      5.32%
Active compaction remaining time :        n/a
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
   compaction type   keyspace               table   completed        total    unit   progress
        Compaction     esmond   rate_aggregations   140157665   2632220068   bytes      5.32%
Active compaction remaining time :        n/a
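
For anyone who wants to check for the same symptom without eyeballing the output, here is a minimal Python sketch that compares two captures of `nodetool compactionstats` and reports compactions whose completed byte count did not advance. It assumes the column layout pasted above; the function names `parse_compactions` and `stalled` are made up for illustration and are not part of nodetool or perfSONAR. Everything between the compaction type and the byte counts is captured as a single name field, so it does not matter how the keyspace and table columns are spaced.

```python
import re

# One data row of `nodetool compactionstats`, assuming the layout pasted above:
# <type> <keyspace/table name(s)> <completed> <total> <unit> <progress>%
ROW = re.compile(
    r"^\s*(?P<type>\S+)\s+(?P<name>.+?)\s+"
    r"(?P<completed>\d+)\s+(?P<total>\d+)\s+(?P<unit>\S+)\s+"
    r"(?P<progress>[\d.]+)%\s*$"
)


def parse_compactions(text):
    """Return {(type, name): (completed, total)} for each data row found."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header, "pending tasks" and "remaining time" lines do not match
            name = " ".join(m.group("name").split())  # normalise inner spacing
            rows[(m.group("type"), name)] = (
                int(m.group("completed")),
                int(m.group("total")),
            )
    return rows


def stalled(before, after):
    """Unfinished compactions present in both samples that made no progress."""
    a, b = parse_compactions(before), parse_compactions(after)
    return [
        key
        for key in a
        if key in b and b[key][0] <= a[key][0] and b[key][0] < b[key][1]
    ]


# Sample row using the numbers from the message above.
SAMPLE = """pending tasks: 1
   compaction type   keyspace   table   completed   total   unit   progress
        Compaction   esmond   rate_aggregations   140157665   2632220068   bytes   5.32%
Active compaction remaining time :        n/a
"""

if __name__ == "__main__":
    for key in stalled(SAMPLE, SAMPLE):
        done, total = parse_compactions(SAMPLE)[key]
        print(key, "stuck at", round(100 * done / total, 2), "percent")
```

On a live host the two text samples could be captured some minutes apart with `subprocess.run(["nodetool", "compactionstats"], capture_output=True, text=True).stdout`. How long to wait between samples is a judgment call, since a large compaction can legitimately sit at the same percentage for a while.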

Does anyone have any idea what happened to my esmond database here and how to get it back on track? I'll obviously be glad to provide log data if it'll be helpful.

Casey Russell
Network Engineer

Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS 66047

(785)856-9820 ext 9809

[perfsonar-user] Cassandra runaway CPU, Casey Russell, 03/30/2016
- Re: [perfsonar-user] Cassandra runaway CPU, Andrew Lake, 03/31/2016
  - Re: [perfsonar-user] Cassandra runaway CPU, Casey Russell, 03/31/2016