perfsonar-user - [perfsonar-user] Re: Central MA database size snuck up on me
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To: "" <>
- Subject: [perfsonar-user] Re: Central MA database size snuck up on me
- Date: Wed, 9 Aug 2017 16:55:49 -0500
Group,
I've now shut down the central MA and more than doubled the size of the disk (root volume), so it's no longer an issue of raw space. But ps_remove_data.py still won't run to completion. Somewhere between 10 and 30 minutes in (it seems to vary), it ends with something like the following:
Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Sending request to delete 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
Deleted 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out
None of these:
/var/log/esmond/esmond.log
/var/log/cassandra/cassandra.log
/var/log/httpd/access_log
/var/log/httpd/error_log
appear to contain anything interesting when the problem occurs. I've tried running a manual compaction with nodetool in case so many tombstones had accumulated that Cassandra or Java/HTTP was struggling to process them, but it made no difference. Does anyone have thoughts on anything else I should try (adjusting Cassandra config file settings, etc.) before I just delete this database and start it fresh?
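[Editor's note, not from the thread: if the failure is a server-side request timeout, the relevant knobs live in cassandra.yaml. The option names below are from Cassandra 2.x and the values are purely illustrative; verify both against your installed version before changing anything. Note also that the "timed out" in the script output may be a client-side socket timeout in the esmond tooling, in which case raising server-side limits alone won't help.]

```yaml
# /etc/cassandra/conf/cassandra.yaml (Cassandra 2.x option names; defaults
# shown in comments). Illustrative values only, not recommendations.
read_request_timeout_in_ms: 20000    # single-partition reads (default 5000)
range_request_timeout_in_ms: 30000   # range scans, used heavily by bulk deletes (default 10000)
write_request_timeout_in_ms: 10000   # writes/deletes (default 2000)
request_timeout_in_ms: 30000         # catch-all for other operations (default 10000)
```

Cassandra must be restarted for edits to cassandra.yaml to take effect.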
On Wed, Jul 26, 2017 at 10:15 AM, Casey Russell <> wrote:
Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA and my disk utilization is now over 90%. I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
47G   /var/lib/cassandra/data/esmond/raw_data
4.0K  /var/lib/cassandra/data/esmond/stat_aggregations
9.9G  /var/lib/cassandra/data/esmond/rate_aggregations
13G   /var/lib/cassandra/data/esmond/base_rates
69G   /var/lib/cassandra/data/esmond/

[root@ps-dashboard esmond]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root   95G   82G  8.1G  91% /
tmpfs                         3.9G  4.0K  3.9G   1% /dev/shm
/dev/sda1                     477M   99M  353M  22% /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization. It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data, then flushes that data. During the process the disk utilization varies up and down from 91% to 100% until it finally fills and the timeout error occurs. At the end of the failed attempt, even if I restart Cassandra, the disk space utilization is approximately what it was before the failed run.

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot"). Before I take one of those approaches, does anyone else have other ideas or thoughts?
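[Editor's note, not from the thread: the fill-then-flush pattern described above is consistent with Cassandra's compaction behavior, where new SSTables are written out before the old ones are removed, so a major compaction can transiently need roughly as much free space as the column family being compacted. A back-of-the-envelope sketch using the du/df figures from this message; the rule of thumb is an assumption, not an exact Cassandra accounting:]

```python
# Rough headroom check: can a column family be major-compacted in place?
# Sizes (GiB) are taken from the du/df output quoted above.
COLUMN_FAMILIES = {
    "raw_data": 47.0,
    "rate_aggregations": 9.9,
    "base_rates": 13.0,
}
FREE_GIB = 8.1  # df -h: 8.1G available on the root volume


def headroom_ok(cf_size_gib, free_gib):
    """Size-tiered compaction rewrites SSTables before deleting the old
    copies, so worst case it needs about the column family's size in
    extra free space (rule of thumb, not an exact figure)."""
    return free_gib >= cf_size_gib


for name, size in COLUMN_FAMILIES.items():
    verdict = "ok" if headroom_ok(size, FREE_GIB) else "too tight"
    print(f"{name}: {size}G to compact vs {FREE_GIB}G free -> {verdict}")
```

With only 8.1G free, all three sizable column families fail the check, which matches the observed behavior of the disk hitting 100% mid-run; it suggests the "allocate more space" option needs enough headroom for the largest column family, not just for the data being deleted.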
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
Archive powered by MHonArc 2.6.19.