perfsonar-user - Re: [perfsonar-user] Central MA database size snuck up on me
- From: Casey Russell <>
- To: Andrew Lake <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Central MA database size snuck up on me
- Date: Wed, 26 Jul 2017 15:51:34 -0500
Thanks Andrew,
Can you tell me where to find/verify that cron job? I suspect it was running, since this was a full toolkit install from the ISO; I think I just had too many checks running for the allotted storage space.
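In the meantime I'll poke around for it myself. My plan (just a guess on my part; I'm assuming the nightly cleanup is scheduled through the standard cron locations and calls ps_remove_data.py, which I haven't verified) is something like:

    # look for any esmond-related job in the usual cron locations
    grep -ril esmond /etc/cron.d /etc/cron.daily /etc/crontab 2>/dev/null

    # or search for the data-removal script specifically
    grep -ril ps_remove_data /etc/cron.d /etc/cron.daily /etc/crontab 2>/dev/null

    # then print whatever file turns up (cat <file>) to see the schedule and the exact command it runs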
On Wed, Jul 26, 2017 at 3:47 PM, Andrew Lake <> wrote:
Hi,

FWIW, a similar issue actually came up internally within the development team in just the past week or so. When you “delete” something in Cassandra it actually gets marked with a “tombstone” and doesn't really get deleted until a process called “compaction” happens. The kicker is that compaction itself requires temporary disk space, which is obviously problematic when you have already run out of disk space. We haven't found a good solution yet, unfortunately, short of freeing up disk space some other way and then deleting the data, or just wiping all the data with the following:

    rm -rf /var/lib/cassandra/saved_caches/esmond*
    rm -rf /var/lib/cassandra/data/esmond
    rm -rf /var/lib/cassandra/commitlog/*

Not ideal, for a number of reasons, obviously. Scouring the Internet, this is a pretty common problem people run into with Cassandra, without clear answers. On the toolkit side, we automatically set up a cron job to clean old data every night, so you are less likely to run into this, but when you install esmond as a standalone service that cron job is not set up. I think we'll need to review that going forward. Sorry not to have a better immediate solution at the moment, but if anyone has one, we're open to it.

Thanks,
Andy
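P.S. One generic Cassandra housekeeping step that can sometimes claw back a little headroom before you retry the delete, sketched here with standard nodetool commands (it only helps if old snapshots are actually sitting on disk, which I haven't verified on your node):

    # see whether any table snapshots are holding space (path layout assumed from your du output)
    find /var/lib/cassandra/data/esmond -type d -name snapshots -exec du -sh {} \;

    # if they are, drop all snapshots for the esmond keyspace
    nodetool clearsnapshot esmond

    # and keep an eye on pending/running compactions afterwards
    nodetool compactionstats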
On July 26, 2017 at 11:15:27 AM, Casey Russell () wrote:

Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA, and my disk utilization is now over 90%. I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
    Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Error: Retried 1 times. Last failure was timeout: timed out

    [root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
    47G     /var/lib/cassandra/data/esmond/raw_data
    4.0K    /var/lib/cassandra/data/esmond/stat_aggregations
    9.9G    /var/lib/cassandra/data/esmond/rate_aggregations
    13G     /var/lib/cassandra/data/esmond/base_rates
    69G     /var/lib/cassandra/data/esmond/

    [root@ps-dashboard esmond]# df -h
    Filesystem                     Size  Used  Avail  Use%  Mounted on
    /dev/mapper/VolGroup-lv_root    95G   82G   8.1G   91%  /
    tmpfs                          3.9G  4.0K   3.9G    1%  /dev/shm
    /dev/sda1                      477M   99M   353M   22%  /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization.
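For what it's worth, the way I watched the utilization swing was nothing fancy, just a loop like the following in a second terminal while the script ran (the paths match the du/df output above):

    # print overall filesystem usage and the esmond data size every 30 seconds
    while true; do
        date
        df -h /
        du -sh /var/lib/cassandra/data/esmond/
        sleep 30
    done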
It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data and then flushes it. During the process, disk utilization swings up and down between 91% and 100% until the disk finally fills and the timeout error occurs. At the end of the failed attempt, even if I restart Cassandra, disk utilization is back to approximately what it was before the run.

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

Before I take one of those approaches, does anyone else have other ideas or thoughts?
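For reference, if I do end up going the second route, the resize itself should be straightforward since the root filesystem sits on LVM (per the df output above). A rough sketch, assuming the new space shows up as a fresh virtual disk at /dev/sdb and the filesystem is ext4 (both assumptions on my part):

    # after shutting down the VM, adding the new virtual disk, and booting back up:
    pvcreate /dev/sdb                                     # initialize the new disk for LVM
    vgextend VolGroup /dev/sdb                            # add it to the existing volume group
    lvextend -l +100%FREE /dev/mapper/VolGroup-lv_root    # grow the root logical volume
    resize2fs /dev/mapper/VolGroup-lv_root                # grow the ext4 filesystem online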
- [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017