perfsonar-user - Re: [perfsonar-user] Cassandra runaway CPU
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To: "Garnizov, Ivan (RRZE)" <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Cassandra runaway CPU
- Date: Wed, 6 Apr 2016 09:35:50 -0500
INFO [CompactionExecutor:8] 2016-04-02 12:32:18,205 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:packet_loss_rate:b30f54e8df9549ceb8292278b782f05b:2015 (121215124 bytes) incrementally
INFO [CompactionExecutor:8] 2016-04-03 04:50:45,168 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:time_error_estimates:b30f54e8df9549ceb8292278b782f05b:2015 (123923983 bytes) incrementally
INFO [CompactionExecutor:8] 2016-04-03 22:06:38,417 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:packet_loss_rate:76b654c4279241f19898dcdb8cacdfb2:2015 (120871402 bytes) incrementally
in_memory_compaction_limit_in_mb" up from 64 to 256 and restarted cassandra. This time, using "nodetool compactionstats" I watched cassandra slowly chew through the entire table (took about 4 minutes) and then all compaction tasks ended and the processor load came back to normal.
I restarted Cassandra a couple more times and it never tried to re-compact that large row in esmond/rate_aggregations, so I set the value back to 64.
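For anyone checking their own logs for the same symptom, here is a minimal sketch that pulls the "Compacting large row" entries out of a Cassandra log and compares each row size against the in-memory compaction limit. The regular expression is written against the three log lines quoted in this message; the function name `large_rows` is hypothetical, and treating the limit as megabytes of 1024 * 1024 bytes is an assumption rather than something stated in the thread.

```python
import re

# Pattern written against the "Compacting large row" lines quoted in this thread.
LARGE_ROW = re.compile(
    r"Compacting large row (?P<row>\S+) \((?P<size>\d+) bytes\) incrementally"
)


def large_rows(log_lines, limit_mb=64):
    """Return (row_key, size_in_mb, over_limit) for each large-row log entry.

    limit_mb mirrors in_memory_compaction_limit_in_mb; 1 MB is assumed to be
    1024 * 1024 bytes here.
    """
    found = []
    for line in log_lines:
        match = LARGE_ROW.search(line)
        if match:
            size_mb = int(match.group("size")) / (1024 * 1024)
            found.append((match.group("row"), round(size_mb, 1), size_mb > limit_mb))
    return found


sample = [
    "INFO [CompactionExecutor:8] 2016-04-02 12:32:18,205 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:packet_loss_rate:b30f54e8df9549ceb8292278b782f05b:2015 (121215124 bytes) incrementally",
    "INFO [CompactionExecutor:8] 2016-04-03 04:50:45,168 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:time_error_estimates:b30f54e8df9549ceb8292278b782f05b:2015 (123923983 bytes) incrementally",
    "INFO [CompactionExecutor:8] 2016-04-03 22:06:38,417 CompactionController.java (line 192) Compacting large row esmond/rate_aggregations:ps:packet_loss_rate:76b654c4279241f19898dcdb8cacdfb2:2015 (120871402 bytes) incrementally",
]

for row, size_mb, over in large_rows(sample, limit_mb=64):
    print(f"{row}: {size_mb} MB, over 64 MB limit: {over}")
```

Under that assumption, all three rows are larger than the default limit of 64 and smaller than 256, which is consistent with the compaction finishing once the limit was raised.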
Thank you to everyone who offered advice and assistance.
2029 Becker Drive, Suite 282
Lawrence, KS 66047
Hi Casey,
I have no idea what the problem could be, but I suspect you are still some way from the real cause.
My suggestion is to activate debug-level logging on the system, restart the service, and provide the resulting details. If you do the same on a system without these symptoms, you/we will be able to compare the two.
The debug output should also show which parameters the Java application starts with.
Best regards,
Ivan
From: [mailto:] On Behalf Of Casey Russell
Sent: Tuesday, 5 April 2016 22:06
To: Andrew Lake
Cc:
Subject: Re: [perfsonar-user] Cassandra runaway CPU
Group,
Here's what I've done since last week. I've taken the box offline for several maintenance windows and booted it from liveCDs to run memory diagnostics, HD diagnostics, CPU and chipset diagnostics (cpuburn to heat up the box and look for fan problems etc). I upgraded the BIOS thinking maybe I'd get better (or different) S.M.A.R.T. info, and did a full badblocks block level check of the drive. Then I did a RAID controller consistency check on the mirrored pair, because. meh. why not? :-) I even followed the directions in the FAQ to nuke the Esmond database and re-initialize it.
However, after all that, I still have a system that consumes an entire CPU core as soon as cassandra starts up. I notice that nuking the Esmond database per the instructions in the FAQ had no impact on the size of the Cassandra data files. My hosts are in a mesh and use a Central MA. Is there any harm in just nuking the Cassandra database/datafiles on this host and starting fresh? I'm looking back through my mesh config file and the only thing that doesn't use the Central MA as the read/write host is PingER. It wouldn't break my heart to lose PingER data for this one host if that's all that's stored locally. If it's relatively safe, does anyone have a process or set of instructions for doing so?
Is there anything else I should be trying, or should I just consider re-installing this host since it's a mesh node anyway and very little data will be lost?
Casey Russell
Network Engineer
Kansas Research and Education Network
2029 Becker Drive, Suite 282
Lawrence, KS 66047
On Thu, Mar 31, 2016 at 9:17 AM, Casey Russell <> wrote:
Andy,
I haven't checked that, but it had crossed my mind. A little bit of my Cassandra reading hinted in that direction. I may have to schedule a maintenance window and see if I can do some offline diagnostics of that box. Thank you for giving me an avenue to chase down.
Casey Russell
On Thu, Mar 31, 2016 at 8:15 AM, Andrew Lake <> wrote:
Hi,
Have you checked for a failing disk or bad memory on the host in question? It could be something else, but I’ve seen similar behavior before on our ESnet hosts when we have had hardware failures.
Thanks,
Andy
On March 30, 2016 at 6:07:34 PM, Casey Russell () wrote:
I've had a node for some time that has been acting strangely, and just in the last day or two I've had some time to dedicate to our PS gear, so I've dug into this to try to figure out what's going on. Now that I have, I don't know how to fix it.
I have 4 more or less identical testing nodes at different points in my network that are part of a Mesh Config. The 4 nodes are identical hardware purchased at the same time and (unless I messed something up) all installed with essentially identical images, except for their IP/Subnet/Host info etc. Quad core processors (x2 for a total of 8 cores), 8GB of RAM, local storage and 1G testing interfaces.
Three of them run perfectly all day long and loaf along with a load average of 0.03 or so. My oddball has been running way hotter than that for weeks, with a load average of 1.3 or more. top shows very quickly what the culprit is.
A normal system shows no single process using any substantial CPU (and cassandra rarely even appears on the list), while the system that's acting up shows something similar to the following all the time.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25058 cassandr 20 0 7632m 1.7g 17m S 111.5 22.6 17:21.62 java
I read up a little, enabled the nodetool utility, and discovered that there is a compaction process that never completes on this node. If I restart cassandra, it kicks off about 3 compactions. All of them complete except this one, which always stops at around 5.32 percent and never progresses any further (even after 2-3 hours).
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
compaction type keyspace table completed total unit progress
Compaction esmond rate_aggregations 140157665 2632220068 bytes 5.32%
Active compaction remaining time : n/a
[crussell@ps-bryant-bw ~]$ nodetool compactionstats
pending tasks: 1
compaction type keyspace table completed total unit progress
Compaction esmond rate_aggregations 140157665 2632220068 bytes 5.32%
Active compaction remaining time : n/a
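The two identical samples above are what a stalled compaction looks like: the completed byte count does not move between runs. As a rough sketch, the comparison could be automated as below. The helper names `parse_compactionstats` and `stalled` are hypothetical, the parser is written only against the output format pasted in this message, and it only recognises rows whose type is "Compaction"; other nodetool versions may lay the columns out differently.

```python
import re

# Matches a data row such as:
#   Compaction esmond rate_aggregations 140157665 2632220068 bytes 5.32%
# Fields are anchored from the right so the keyspace/table text can vary.
ROW = re.compile(
    r"^\s*Compaction\s+(?P<target>.+?)\s+(?P<completed>\d+)\s+(?P<total>\d+)"
    r"\s+(?P<unit>\w+)\s+(?P<progress>[\d.]+)%\s*$"
)


def parse_compactionstats(text):
    """Return a list of (target, completed, total) from compactionstats output."""
    tasks = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            target = " ".join(match.group("target").split())
            tasks.append(
                (target, int(match.group("completed")), int(match.group("total")))
            )
    return tasks


def stalled(before, after):
    """Return targets whose completed count is unchanged between two samples."""
    earlier = {target: done for target, done, _ in parse_compactionstats(before)}
    return [
        target
        for target, done, _ in parse_compactionstats(after)
        if earlier.get(target) == done
    ]


sample = """pending tasks: 1
compaction type keyspace table completed total unit progress
Compaction esmond rate_aggregations 140157665 2632220068 bytes 5.32%
Active compaction remaining time : n/a
"""

for target, done, total in parse_compactionstats(sample):
    print(f"{target}: {done}/{total} bytes ({100 * done / total:.2f}%)")
print("stalled:", stalled(sample, sample))
```

In practice the two text samples would come from running nodetool compactionstats some minutes apart; a single unchanged pair is only a hint, since a large row can legitimately take a long time to compact.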
Does anyone have any idea what happened to my esmond database here and how to get it back on track? I'll obviously be glad to provide log data if it'll be helpful.
Casey Russell