
perfsonar-user - Re: [perfsonar-user] Central MA database size snuck up on me

Subject: perfSONAR User Q&A and Other Discussion

  • From: Andrew Lake
  • To: Casey Russell
  • Subject: Re: [perfsonar-user] Central MA database size snuck up on me
  • Date: Wed, 26 Jul 2017 13:58:09 -0700

The cronjob lives at /etc/cron.d/cron-clean_esmond_db.
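A quick way to confirm the job is actually installed is to look for active entries in that file. The following is only a minimal sketch: the path is the one given above, while the function name and the "skip blanks and comments" rule are illustrative assumptions, not part of the toolkit.

```python
from pathlib import Path

# Path quoted in the message above; pass another path to check a different file.
CRON_FILE = Path("/etc/cron.d/cron-clean_esmond_db")


def active_cron_lines(path=CRON_FILE):
    """Return the non-blank, non-comment lines of a cron file, or None if it is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    lines = []
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


if __name__ == "__main__":
    entries = active_cron_lines()
    if entries is None:
        print(f"{CRON_FILE} not found: the cleanup job is probably not installed")
    elif not entries:
        print(f"{CRON_FILE} exists but has no active entries")
    else:
        for entry in entries:
            print(entry)
```

An empty result or a missing file would suggest the nightly cleanup is not running; a populated file only shows that it is scheduled, not that it is keeping up with the incoming data.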



On July 26, 2017 at 4:51:38 PM, Casey Russell wrote:

Thanks Andrew, 

     Can you tell me where to find and verify that cron job? I suspect it was running, since this was a full toolkit install from the ISO; I think I just had too many checks running for the allotted storage space.


Sincerely,
Casey Russell
Network Engineer
KanREN
phone: 785-856-9809
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Wed, Jul 26, 2017 at 3:47 PM, Andrew Lake wrote:
Hi,

FWIW, a similar issue came up within the development team in just the past week or so. When you “delete” something in Cassandra, it is only marked with a “tombstone”; the data is not actually removed until a process called “compaction” runs. The kicker is that compaction itself requires temporary disk space, which is obviously a problem when you have already run out of disk space. Unfortunately we haven't found a good solution yet, short of freeing up disk space some other way and then deleting the data, or just doing the following to wipe all the data:

rm -rf /var/lib/cassandra/saved_caches/esmond*
rm -rf /var/lib/cassandra/data/esmond
rm -rf /var/lib/cassandra/commitlog/*

Not ideal for a number of reasons, obviously. Scouring the Internet, this looks like a pretty common problem that people run into with Cassandra, without clear answers. On the toolkit side, we automatically set up a cron job to clean old data every night, so you are less likely to run into this, but when you install esmond as a standalone service that cron job is not set up. I think we’ll need to review that going forward. Sorry not to have a better immediate solution at the moment, but if anyone has one we’re open to it.
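To make the compaction problem concrete, here is a rough sketch of a headroom check. It is not part of esmond or Cassandra; the helper names are made up for illustration, and the default `factor=1.0` (compaction may temporarily need about as much space as the table being compacted) is a conservative assumption rather than a figure from this thread.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (roughly what `du -s` measures)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable (e.g. removed mid-walk); skip it.
                continue
    return total


def compaction_headroom(table_size, free_space, factor=1.0):
    """Return free space minus the assumed temporary space; negative means a shortfall.

    Both arguments must use the same unit (bytes, GB, ...).
    factor is an ASSUMPTION about how much scratch space compaction needs
    relative to the table size; it is not taken from the thread.
    """
    return free_space - factor * table_size


if __name__ == "__main__":
    data_dir = "/var/lib/cassandra/data/esmond/raw_data"  # path from this thread
    if os.path.isdir(data_dir):
        free = shutil.disk_usage(data_dir).free
        print(compaction_headroom(dir_size_bytes(data_dir), free))
```

With the figures Casey reports further down the thread (47G in raw_data, 8.1G available), `compaction_headroom(47.0, 8.1)` comes out at roughly -38.9, that is, a shortfall of about 39G under this assumption.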

Thanks,
Andy





On July 26, 2017 at 11:15:27 AM, Casey Russell wrote:

Group,

     I dug myself a hole and I only see a couple of ways out now.  I wasn't watching the database size on my central MA and my disk utilization is now over 90%.  I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out
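Output like the above is easier to reason about once it is tallied. The sketch below is a hypothetical helper, not part of ps_remove_data.py; it assumes only the three line shapes visible in this log ("Deleted N rows for ...", "Query error for ...", and a final "Error: ...") and ignores everything else.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt above; other message shapes are ignored.
DELETED = re.compile(r"^Deleted (\d+) rows for (.*)$")
QUERY_ERROR = re.compile(r"^Query error for (.*)$")
FATAL = re.compile(r"^Error: (.*)$")


def parse_fields(text):
    """Turn 'k1=v1, k2=v2' into a dict."""
    return dict(item.split("=", 1) for item in text.split(", ") if "=" in item)


def summarize(lines):
    """Count deleted rows and query errors, and capture the last fatal error."""
    rows_deleted = 0
    query_errors = Counter()
    fatal = None
    for raw in lines:
        line = raw.strip()
        m = DELETED.match(line)
        if m:
            rows_deleted += int(m.group(1))
            continue
        m = QUERY_ERROR.match(line)
        if m:
            fields = parse_fields(m.group(1))
            key = (fields.get("event_type"), fields.get("summary_type"))
            query_errors[key] += 1
            continue
        m = FATAL.match(line)
        if m:
            fatal = m.group(1)
    return {"rows_deleted": rows_deleted, "query_errors": query_errors, "fatal": fatal}


if __name__ == "__main__":
    import sys
    print(summarize(sys.stdin))
```

Fed the excerpt above, this would report 78 deleted rows and 7 query errors before the timeout, all of the query errors on `summary_type=base` or `summary_type=statistics` with `summary_window=0`.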

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
47G     /var/lib/cassandra/data/esmond/raw_data
4.0K    /var/lib/cassandra/data/esmond/stat_aggregations
9.9G    /var/lib/cassandra/data/esmond/rate_aggregations
13G     /var/lib/cassandra/data/esmond/base_rates
69G     /var/lib/cassandra/data/esmond/

[root@ps-dashboard esmond]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       95G   82G  8.1G  91% /
tmpfs                 3.9G  4.0K  3.9G   1% /dev/shm
/dev/sda1             477M   99M  353M  22% /boot

At the time of the "timeout" as I watch, the disk reaches 100% utilization.  It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data, and flushes that data.  During the process the disk utilization varies up and down from 91% to 100% until it finally reaches full and the timeout error occurs.
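The swing between 91% and 100% can be recorded rather than watched by hand. This is a minimal sketch using only the Python standard library; the function names, the 5-second interval, and the sample count are arbitrary illustrative choices. The percentage is computed as used / (used + available), which is my understanding of how `df` derives its Use% column.

```python
import shutil
import time


def percent_used(path="/"):
    """Percent of the filesystem in use, computed as used / (used + available)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / (usage.used + usage.free)


def watch(path="/", interval=5.0, samples=12, sleep=time.sleep):
    """Sample utilization `samples` times and return (lowest, highest) percent used."""
    readings = []
    for i in range(samples):
        readings.append(percent_used(path))
        if i < samples - 1:
            sleep(interval)
    return min(readings), max(readings)


if __name__ == "__main__":
    low, high = watch("/", interval=5.0, samples=12)
    print(f"lowest {low:.1f}%  highest {high:.1f}%")
```

Running something like this in a second terminal while the delete script works would show how close the peak gets to 100% and roughly how much extra space the run needs.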

At the end of the failed attempt, even if I restart cassandra, the disk space utilization is approximately what it was before the failed run.  

So, without enough disk space to finish the ps_remove_data.py script, it appears I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

Before I take one of those approaches, does anyone else have other ideas or thoughts?

Sincerely,
Casey Russell
Network Engineer
KanREN
2029 Becker Drive, Suite 282
Lawrence, Kansas 66047



