
Re: [perfsonar-user] Central MA database size snuck up on me


  • From: Andrew Lake <>
  • To: "" <>, Casey Russell <>
  • Subject: Re: [perfsonar-user] Central MA database size snuck up on me
  • Date: Wed, 26 Jul 2017 13:47:37 -0700

Hi,

FWIW, a similar issue came up internally within the development team in just the past week or so. When you “delete” something in Cassandra, it only gets marked with a “tombstone” and isn't actually removed until a process called “compaction” runs. The kicker is that compaction itself requires temporary disk space, which is obviously a problem when disk space is exactly what you've run out of. We haven't found a good solution yet, unfortunately, short of freeing up disk space some other way and then deleting the data, or just doing the following to wipe all the data:

rm -rf /var/lib/cassandra/saved_caches/esmond*
rm -rf /var/lib/cassandra/data/esmond
rm -rf /var/lib/cassandra/commitlog/*
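
If you can free a little headroom some other way first, a major compaction after the deletes is what actually reclaims the tombstoned space. Something along these lines (untested here, and note compaction will only drop tombstones older than the table's gc_grace_seconds, which defaults to 10 days):

nodetool compact esmond     # force a major compaction of the esmond keyspace
nodetool clearsnapshot      # also drop any old snapshots quietly holding disk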

Not ideal for a number of reasons, obviously. Scouring the Internet, this looks like a pretty common problem people run into with Cassandra, without clear answers. On the toolkit side, we automatically set up a cron job to clean old data every night (roughly the entry sketched below), so you are less likely to run into this; but when you install esmond as a standalone service, that cron job is not set up. I think we’ll need to review that going forward. Sorry not to have a better immediate solution at the moment, but if anyone has one, we’re open to it.
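
For reference, the nightly cleanup amounts to a cron entry like the one below. I'm writing the script path, flag, and policy file from memory, so treat them as placeholders and check them against your own install:

# roughly what the toolkit sets up in /etc/cron.d; paths are approximate
0 2 * * * root python /usr/lib/esmond/util/ps_remove_data.py -c /etc/esmond/policy.json >> /var/log/esmond/cleanup.log 2>&1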

Thanks,
Andy

On July 26, 2017 at 11:15:27 AM, Casey Russell () wrote:

Group,

     I dug myself a hole, and I only see a couple of ways out now.  I wasn't watching the database size on my central MA, and my disk utilization is now over 90%.  I've tried running the ps_remove_data.py script several times, with several different variations on the config file, but it invariably ends some minutes or hours later with a timeout like this:

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
47G     /var/lib/cassandra/data/esmond/raw_data
4.0K    /var/lib/cassandra/data/esmond/stat_aggregations
9.9G    /var/lib/cassandra/data/esmond/rate_aggregations
13G     /var/lib/cassandra/data/esmond/base_rates
69G     /var/lib/cassandra/data/esmond/

[root@ps-dashboard esmond]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       95G   82G  8.1G  91% /
tmpfs                 3.9G  4.0K  3.9G   1% /dev/shm
/dev/sda1             477M   99M  353M  22% /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization.  It appears that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data and then flushes it.  During the process, disk utilization swings between 91% and 100% until the disk finally fills and the timeout error occurs.
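
For what it's worth, I'm watching the utilization with a trivial loop along these lines (-P keeps df from wrapping the long device name):

# print root filesystem utilization every 30s while the script runs
while sleep 30; do
    echo "$(date +%T) $(df -hP / | awk 'NR==2 {print $5}')"
done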

At the end of the failed attempt, even if I restart cassandra, the disk space utilization is approximately what it was before the failed run.  

So, without enough disk space for the ps_remove_data.py script to finish, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").  If I go the second route, with the LVM layout shown above, the resize would look roughly like the sketch below.
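
# assuming the new virtual disk shows up as /dev/sdb; the device name,
# and whether lvextend -r can resize this filesystem online, need checking
pvcreate /dev/sdb                                # prep the new disk for LVM
vgextend VolGroup /dev/sdb                       # add it to the volume group
lvextend -r -l +100%FREE /dev/VolGroup/lv_root   # grow the LV and filesystem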

Before I take one of those approaches, does anyone else have other ideas or thoughts?

Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


