perfsonar-user - Re: [perfsonar-user] Central MA database size snuck up on me
- From: Casey Russell <>
- To: Andrew Lake <>
- Cc: "" <>
- Subject: Re: [perfsonar-user] Central MA database size snuck up on me
- Date: Wed, 26 Jul 2017 15:51:34 -0500
Thanks Andrew,
Can you tell me where to find/verify that cron job? I suspect it was running, since this was a full toolkit install from the ISO; I think I just had too many checks running for the allotted storage space.
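In the meantime I'll poke around for it myself. My plan (just a guess on my part; I'm assuming the nightly cleanup is scheduled through the standard cron locations and calls ps_remove_data.py, which I haven't verified) is something like:

    # look for any esmond-related job in the usual cron locations
    grep -ril esmond /etc/cron.d /etc/cron.daily /etc/crontab 2>/dev/null

    # or search for the data-removal script specifically
    grep -ril ps_remove_data /etc/cron.d /etc/cron.daily /etc/crontab 2>/dev/null

    # then print whatever file turns up (cat <file>) to see the schedule and the exact command it runs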
On Wed, Jul 26, 2017 at 3:47 PM, Andrew Lake <> wrote:
Hi,

FWIW, a similar issue actually came up internally within the development team in just the past week or so. When you “delete” something in Cassandra it actually gets marked with a “tombstone” and doesn't really get deleted until a process called “compaction” happens. The kicker is that compaction itself requires temporary disk space, which is obviously problematic when you have already run out of disk space. We haven't found a good solution yet, unfortunately, short of freeing up disk space some other way and then deleting the data, or just wiping all the data with the following:

    rm -rf /var/lib/cassandra/saved_caches/esmond*
    rm -rf /var/lib/cassandra/data/esmond
    rm -rf /var/lib/cassandra/commitlog/*

Not ideal, for a number of reasons, obviously. Scouring the Internet, this is a pretty common problem people run into with Cassandra, without clear answers. On the toolkit side, we automatically set up a cron job to clean old data every night, so you are less likely to run into this, but when you install esmond as a standalone service that cron job is not set up. I think we'll need to review that going forward. Sorry not to have a better immediate solution at the moment, but if anyone has one, we're open to it.

Thanks,
Andy
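P.S. One generic Cassandra housekeeping step that can sometimes claw back a little headroom before you retry the delete, sketched here with standard nodetool commands (it only helps if old snapshots are actually sitting on disk, which I haven't verified on your node):

    # see whether any table snapshots are holding space (path layout assumed from your du output)
    find /var/lib/cassandra/data/esmond -type d -name snapshots -exec du -sh {} \;

    # if they are, drop all snapshots for the esmond keyspace
    nodetool clearsnapshot esmond

    # and keep an eye on pending/running compactions afterwards
    nodetool compactionstats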
On July 26, 2017 at 11:15:27 AM, Casey Russell () wrote:

Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA, and my disk utilization is now over 90%. I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
    Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
    Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
    Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
    Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
    Error: Retried 1 times. Last failure was timeout: timed out

    [root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
    47G     /var/lib/cassandra/data/esmond/raw_data
    4.0K    /var/lib/cassandra/data/esmond/stat_aggregations
    9.9G    /var/lib/cassandra/data/esmond/rate_aggregations
    13G     /var/lib/cassandra/data/esmond/base_rates
    69G     /var/lib/cassandra/data/esmond/

    [root@ps-dashboard esmond]# df -h
    Filesystem                     Size  Used  Avail  Use%  Mounted on
    /dev/mapper/VolGroup-lv_root    95G   82G   8.1G   91%  /
    tmpfs                          3.9G  4.0K   3.9G    1%  /dev/shm
    /dev/sda1                      477M   99M   353M   22%  /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization.
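For what it's worth, the way I watched the utilization swing was nothing fancy, just a loop like the following in a second terminal while the script ran (the paths match the du/df output above):

    # print overall filesystem usage and the esmond data size every 30 seconds
    while true; do
        date
        df -h /
        du -sh /var/lib/cassandra/data/esmond/
        sleep 30
    done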
It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data and then flushes it. During the process, disk utilization swings up and down between 91% and 100% until the disk finally fills and the timeout error occurs. At the end of the failed attempt, even if I restart Cassandra, disk utilization is back to approximately what it was before the run.

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

Before I take one of those approaches, does anyone else have other ideas or thoughts?
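For reference, if I do end up going the second route, the resize itself should be straightforward since the root filesystem sits on LVM (per the df output above). A rough sketch, assuming the new space shows up as a fresh virtual disk at /dev/sdb and the filesystem is ext4 (both assumptions on my part):

    # after shutting down the VM, adding the new virtual disk, and booting back up:
    pvcreate /dev/sdb                                     # initialize the new disk for LVM
    vgextend VolGroup /dev/sdb                            # add it to the existing volume group
    lvextend -l +100%FREE /dev/mapper/VolGroup-lv_root    # grow the root logical volume
    resize2fs /dev/mapper/VolGroup-lv_root                # grow the ext4 filesystem online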
- [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017