perfsonar-user - [perfsonar-user] Re: Central MA database size snuck up on me
Subject: perfSONAR User Q&A and Other Discussion
- From: Casey Russell <>
- To: "" <>
- Subject: [perfsonar-user] Re: Central MA database size snuck up on me
- Date: Wed, 9 Aug 2017 16:55:49 -0500
Group,
I've now shut down the central MA and more than doubled the size of the disk (root volume), so it's no longer an issue of raw space. But ps_remove_data.py still won't run to completion. Somewhere between 10 and 30 minutes in (it seems to vary), it ends with something like the following:
Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Sending request to delete 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
Deleted 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out
None of these:
/var/log/esmond/esmond.log
/var/log/cassandra/cassandra.log
/var/log/httpd/access_log
/var/log/httpd/error_log
appear to contain anything interesting when the problem occurs. I've tried running a manual compaction with nodetool in case so many tombstones had accumulated that Cassandra or Java/HTTP was struggling to process them, but it made no difference. Does anyone have thoughts on anything else I should try (adjusting Cassandra config file settings, etc.) before I just delete this database and start it fresh?
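[Editor's note, not from the thread: if the failure is a server-side request timeout, the relevant knobs live in cassandra.yaml. The option names below are from Cassandra 2.x and the values are purely illustrative; verify both against your installed version before changing anything. Note also that the "timed out" in the script output may be a client-side socket timeout in the esmond tooling, in which case raising server-side limits alone won't help.]

```yaml
# /etc/cassandra/conf/cassandra.yaml (Cassandra 2.x option names; defaults
# shown in comments). Illustrative values only, not recommendations.
read_request_timeout_in_ms: 20000    # single-partition reads (default 5000)
range_request_timeout_in_ms: 30000   # range scans, used heavily by bulk deletes (default 10000)
write_request_timeout_in_ms: 10000   # writes/deletes (default 2000)
request_timeout_in_ms: 30000         # catch-all for other operations (default 10000)
```

Cassandra must be restarted for edits to cassandra.yaml to take effect.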
On Wed, Jul 26, 2017 at 10:15 AM, Casey Russell <> wrote:
Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA and my disk utilization is now over 90%. I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
Error: Retried 1 times. Last failure was timeout: timed out

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
47G   /var/lib/cassandra/data/esmond/raw_data
4.0K  /var/lib/cassandra/data/esmond/stat_aggregations
9.9G  /var/lib/cassandra/data/esmond/rate_aggregations
13G   /var/lib/cassandra/data/esmond/base_rates
69G   /var/lib/cassandra/data/esmond/

[root@ps-dashboard esmond]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root   95G   82G  8.1G  91% /
tmpfs                         3.9G  4.0K  3.9G   1% /dev/shm
/dev/sda1                     477M   99M  353M  22% /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization. It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data, then flushes that data. During the process the disk utilization varies up and down from 91% to 100% until it finally fills and the timeout error occurs. At the end of the failed attempt, even if I restart Cassandra, the disk space utilization is approximately what it was before the failed run.

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot"). Before I take one of those approaches, does anyone else have other ideas or thoughts?
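[Editor's note, not from the thread: the fill-then-flush pattern described above is consistent with Cassandra's compaction behavior, where new SSTables are written out before the old ones are removed, so a major compaction can transiently need roughly as much free space as the column family being compacted. A back-of-the-envelope sketch using the du/df figures from this message; the rule of thumb is an assumption, not an exact Cassandra accounting:]

```python
# Rough headroom check: can a column family be major-compacted in place?
# Sizes (GiB) are taken from the du/df output quoted above.
COLUMN_FAMILIES = {
    "raw_data": 47.0,
    "rate_aggregations": 9.9,
    "base_rates": 13.0,
}
FREE_GIB = 8.1  # df -h: 8.1G available on the root volume


def headroom_ok(cf_size_gib, free_gib):
    """Size-tiered compaction rewrites SSTables before deleting the old
    copies, so worst case it needs about the column family's size in
    extra free space (rule of thumb, not an exact figure)."""
    return free_gib >= cf_size_gib


for name, size in COLUMN_FAMILIES.items():
    verdict = "ok" if headroom_ok(size, FREE_GIB) else "too tight"
    print(f"{name}: {size}G to compact vs {FREE_GIB}G free -> {verdict}")
```

With only 8.1G free, all three sizable column families fail the check, which matches the observed behavior of the disk hitting 100% mid-run; it suggests the "allocate more space" option needs enough headroom for the largest column family, not just for the data being deleted.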
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
Archive powered by MHonArc 2.6.19.