perfsonar-user - Re: [perfsonar-user] Central MA database size snuck up on me

Subject: perfSONAR User Q&A and Other Discussion


From: Andrew Lake
To: Casey Russell
Subject: Re: [perfsonar-user] Central MA database size snuck up on me
Date: Wed, 26 Jul 2017 13:58:09 -0700

The cronjob lives at /etc/cron.d/cron-clean_esmond_db.
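
For anyone who wants to confirm the job is installed and see what it runs, here is a minimal Python sketch. It assumes only the path given above; the helper name `cron_entries` is made up for illustration and is not part of perfSONAR or esmond. It just lists the active (non-comment) lines of the file without interpreting them.

```python
from pathlib import Path

# Path quoted in the message above.
CRON_FILE = "/etc/cron.d/cron-clean_esmond_db"


def cron_entries(path=CRON_FILE):
    """Return the non-blank, non-comment lines of a cron file.

    Returns None when the file does not exist, i.e. the cleanup job
    is not installed. Hypothetical helper, for illustration only.
    """
    cron_path = Path(path)
    if not cron_path.is_file():
        return None
    stripped = (line.strip() for line in cron_path.read_text().splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


if __name__ == "__main__":
    entries = cron_entries()
    if entries is None:
        print(f"{CRON_FILE} not found: the nightly cleanup is not installed")
    else:
        for entry in entries:
            print(entry)
```

An empty list means the file exists but every line is commented out, which is worth distinguishing from the file being absent.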

On July 26, 2017 at 4:51:38 PM, Casey Russell wrote:

Thanks Andrew,

Can you tell me where to find/verify that cron job? I suspect it was running, since this was a full toolkit install from the ISO. I think I just had too many checks running for the allotted storage space.

Sincerely,

Casey Russell

Network Engineer

785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

On Wed, Jul 26, 2017 at 3:47 PM, Andrew Lake wrote:

Hi,

FWIW, a similar issue came up internally within the development team in just the past week or so. When you “delete” something in Cassandra, it is only marked with a “tombstone”; the data is not actually removed from disk until a process called “compaction” runs. The kicker is that compaction itself requires temporary disk space, which is obviously a problem when you have already run out. Unfortunately we haven't found a good solution yet, short of freeing up disk space some other way and then deleting the data, or just doing the following to wipe all the data:

rm -rf /var/lib/cassandra/saved_caches/esmond*

rm -rf /var/lib/cassandra/data/esmond

rm -rf /var/lib/cassandra/commitlog/*

Not ideal for a number of reasons, obviously. Scouring the Internet, this looks like a pretty common problem that people run into with Cassandra, without clear answers. On the toolkit side, we automatically set up a cron job to clean old data every night, so you are less likely to run into this, but when you install esmond as a standalone service that cron job is not set up. I think we’ll need to review that going forward. Sorry not to have a better immediate solution at the moment, but if anyone has one we’re open to it.
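
To make the tombstone/compaction point concrete, here is a toy Python model. It is not Cassandra's implementation and ignores timestamps, multiple tables, and everything else real; the class `ToyTable` and its "units" of disk space are invented for illustration. It only mimics the two behaviours described above: a delete appends a marker instead of freeing space, and compaction writes a merged copy before the old files can be dropped.

```python
class ToyTable:
    """Toy append-only store with tombstones (NOT Cassandra's real design)."""

    def __init__(self):
        # Each record is (key, size, is_tombstone); records are never edited.
        self.records = []

    def write(self, key, size):
        self.records.append((key, size, False))

    def delete(self, key):
        # A delete is itself a write: a small tombstone record (1 unit here).
        self.records.append((key, 1, True))

    def disk_used(self):
        return sum(size for _, size, _ in self.records)

    def compact(self, free_space):
        """Rewrite live rows into a new file, dropping deleted rows.

        The merged copy must fit in free space *before* the old data
        is removed, so a nearly full disk makes compaction fail.
        """
        deleted = {key for key, _, tomb in self.records if tomb}
        merged = [r for r in self.records if not r[2] and r[0] not in deleted]
        needed = sum(size for _, size, _ in merged)
        if needed > free_space:
            raise OSError(f"compaction needs {needed} units, only {free_space} free")
        self.records = merged
        return needed


table = ToyTable()
for i in range(10):
    table.write(f"row{i}", 10)
print(table.disk_used())   # 100
for i in range(5):
    table.delete(f"row{i}")
print(table.disk_used())   # 105: deleting made usage go up, not down
table.compact(free_space=50)
print(table.disk_used())   # 50: space is only reclaimed after compaction
```

With `free_space=40` instead of 50, `compact` raises and usage stays at 105, which is the same shape as the situation described in this thread.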

Thanks,

Andy

On July 26, 2017 at 11:15:27 AM, Casey Russell wrote:

Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA, and my disk utilization is now over 90%. I've tried running the ps_remove_data.py script several times with several different variations on the config, but it invariably ends some minutes or hours later with a timeout like this:

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509

Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Error: Retried 1 times. Last failure was timeout: timed out
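
Output like the above is easier to read in aggregate. The following Python sketch tallies deleted rows and query errors per event type and summary type. The regular expressions are written against the lines quoted above only; the function name `summarize` is hypothetical, and other versions of the script may format their output differently.

```python
import re
from collections import Counter

# Patterns follow the "Deleted ..." and "Query error ..." lines quoted above.
DELETED = re.compile(
    r"^Deleted (\d+) rows for .*event_type=([\w-]+), summary_type=(\w+)"
)
QUERY_ERROR = re.compile(
    r"^Query error for .*event_type=([\w-]+), summary_type=(\w+)"
)


def summarize(lines):
    """Return (deleted_rows, query_errors), keyed by (event_type, summary_type)."""
    deleted = Counter()
    errors = Counter()
    for raw in lines:
        line = raw.strip()
        match = DELETED.match(line)
        if match:
            deleted[(match.group(2), match.group(3))] += int(match.group(1))
            continue
        match = QUERY_ERROR.match(line)
        if match:
            errors[(match.group(1), match.group(2))] += 1
    return deleted, errors


sample = [
    "Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, "
    "event_type=histogram-rtt, summary_type=aggregation, summary_window=3600",
    "Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, "
    "event_type=histogram-rtt, summary_type=base, summary_window=0, "
    "begin_time=0, expire_time=1485529380",
    "Error: Retried 1 times. Last failure was timeout: timed out",
]
deleted, errors = summarize(sample)
print(dict(deleted))  # {('histogram-rtt', 'aggregation'): 24}
print(dict(errors))   # {('histogram-rtt', 'base'): 1}
```

In the excerpt above, every query error is on a `summary_window=0` series, while the deletes that succeed are mostly on the 3600-second summaries; a tally like this makes that pattern visible over a full run.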

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/

47G /var/lib/cassandra/data/esmond/raw_data

4.0K /var/lib/cassandra/data/esmond/stat_aggregations

9.9G /var/lib/cassandra/data/esmond/rate_aggregations

13G /var/lib/cassandra/data/esmond/base_rates

69G /var/lib/cassandra/data/esmond/

[root@ps-dashboard esmond]# df -h

Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root   95G   82G  8.1G  91% /
tmpfs                         3.9G  4.0K  3.9G   1% /dev/shm
/dev/sda1                     477M   99M  353M  22% /boot

As I watch, the disk reaches 100% utilization at the moment of the "timeout". It appears to me that while deleting rows, Cassandra/esmond writes chunks of temporary data to disk and then flushes them. During the run, disk utilization swings between 91% and 100% until the disk finally fills and the timeout error occurs.

At the end of the failed attempt, even if I restart cassandra, the disk space utilization is approximately what it was before the failed run.
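
The du and df figures above can be turned into a quick headroom check. This Python sketch assumes, as a pessimistic rule of thumb and not something established in this thread, that rewriting a table may temporarily need free space up to the table's own size. The sizes are copied from the output above (the 4.0K stat_aggregations directory is left out); the function names are hypothetical.

```python
import shutil

# Figures copied from the du/df output above, in GiB as printed by -h.
TABLE_SIZES_GIB = {
    "raw_data": 47,
    "rate_aggregations": 9.9,
    "base_rates": 13,
}
AVAILABLE_GIB = 8.1


def tables_lacking_headroom(table_sizes, available):
    """Return the tables whose on-disk size exceeds the free space.

    Assumption: a rewrite may need up to the table's own size in
    temporary space. Real requirements may be lower.
    """
    return sorted(name for name, size in table_sizes.items() if size > available)


def free_gib(path="/"):
    """Free space on the filesystem holding `path`, in GiB (live value)."""
    return shutil.disk_usage(path).free / 1024 ** 3


print(tables_lacking_headroom(TABLE_SIZES_GIB, AVAILABLE_GIB))
# ['base_rates', 'rate_aggregations', 'raw_data']
```

Under that assumption, none of the three large tables has enough room with 8.1G available, which is consistent with the disk filling up during each cleanup attempt.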

So, without enough disk space to finish the ps_remove_data.py script, it appears I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

Before I take one of those approaches, does anyone else have other ideas or thoughts?

Sincerely,

Casey Russell

Network Engineer

785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047

[perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
- Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
  - Re: [perfsonar-user] Central MA database size snuck up on me, Casey Russell, 07/26/2017
    - Re: [perfsonar-user] Central MA database size snuck up on me, Andrew Lake, 07/26/2017
