
perfsonar-user - Re: [perfsonar-user] Cassandra runaway CPU

Subject: perfSONAR User Q&A and Other Discussion

  • From: Casey Russell <>
  • To: Andrew Lake <>
  • Cc: "" <>
  • Subject: Re: [perfsonar-user] Cassandra runaway CPU
  • Date: Thu, 31 Mar 2016 09:17:19 -0500

Andy,

I haven't checked that.  But it had crossed my mind.  A little bit of my Cassandra reading hinted in that direction.  I may have to schedule a maintenance window and see if I can do some offline diagnostics of that box.

Thank you for giving me an avenue to chase down. 

Casey Russell
Network Engineer
Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS  66047

(785)856-9820  ext 9809

On Thu, Mar 31, 2016 at 8:15 AM, Andrew Lake wrote:
Hi,

Have you checked for a failing disk or bad memory on the host in question? It could be something else, but I’ve seen similar behavior before on our ESnet hosts when we have had hardware failures.

Thanks,
Andy



On March 30, 2016 at 6:07:34 PM, Casey Russell wrote:

I've had a node for some time that has been acting strangely, and just in the last day or two I've had some time to dedicate to our PS gear, so I've dug into this to try to figure out what's going on.  Now that I have, I don't know how to fix it.

I have 4 more or less identical testing nodes at different points in my network that are part of a Mesh Config.  The 4 nodes are identical hardware purchased at the same time and (unless I messed something up) all installed with essentially identical images, except for their IP/Subnet/Host info etc.  Each has two quad-core processors (8 cores total), 8GB of RAM, local storage and 1G testing interfaces.

Three of them run perfectly all day long and loaf along with a load average of 0.03 or so.  My oddball has been running way hotter than that for weeks, with a load average of 1.3 or more.  Top shows real quickly what the culprit is.

A normal system shows no single process using any substantial CPU (and cassandra rarely even appears on the list), while the system that's acting up shows something similar to the following all the time.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  25058 cassandr  20   0 7632m 1.7g  17m S 111.5 22.6  17:21.62 java

I read up a little, enabled the nodetool utility, and discovered that there is a compaction process that never completes on this node.  If I restart cassandra, it will kick off about 3 compactions.  All of them complete except this one, which always stops at around 5.32 percent and never progresses any further (even after 2-3 hours).

[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
          compaction type        keyspace           table       completed           total      unit  progress
               Compaction          esmondrate_aggregations       140157665      2632220068     bytes     5.32%
Active compaction remaining time :        n/a
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
          compaction type        keyspace           table       completed           total      unit  progress
               Compaction          esmondrate_aggregations       140157665      2632220068     bytes     5.32%
Active compaction remaining time :        n/a
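
A stall like this (the same "completed" byte count in two consecutive samples) can be checked for automatically rather than by eye. The following is a minimal Python sketch, assuming the column layout shown in the output above. The helper names (parse_compactionstats, stalled, run_nodetool) are hypothetical and not part of nodetool or esmond. Note that the keyspace and table columns run together in this output, so the sketch treats everything between the compaction type and the first number as one "target" string instead of trying to split it.

```python
import re
import subprocess

# One data row: type, target (keyspace/table text), completed, total, unit, progress%.
# The target is matched loosely because the keyspace and table columns
# may be printed with or without a space between them.
ROW = re.compile(r"^\s*(\w+)\s+(\S.*?)\s+(\d+)\s+(\d+)\s+(\w+)\s+([\d.]+)%\s*$")


def parse_compactionstats(text):
    """Map (compaction type, target) -> completed count for each data row."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            ctype, target, completed, _total, _unit, _progress = m.groups()
            # Collapse internal runs of whitespace so keys compare cleanly.
            rows[(ctype, " ".join(target.split()))] = int(completed)
    return rows


def stalled(before, after):
    """Return compactions present in both samples whose completed count did not grow."""
    first = parse_compactionstats(before)
    second = parse_compactionstats(after)
    return sorted(k for k in first if k in second and second[k] <= first[k])


def run_nodetool():
    """Capture the text of `nodetool compactionstats` (assumes nodetool is on PATH)."""
    result = subprocess.run(
        ["nodetool", "compactionstats"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Taking one run_nodetool() sample, waiting several minutes, taking a second, and passing both to stalled() would flag the compaction above, since its completed count stays at 140157665 in both samples. How long to wait between samples is a judgment call; a large compaction on a slow disk can legitimately advance slowly.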


Does anyone have any idea what happened to my esmond database here and how to get it back on track?  I'll obviously be glad to provide log data if it'll be helpful.






Casey Russell
Network Engineer
Kansas Research and Education Network

2029 Becker Drive, Suite 282

Lawrence, KS  66047




