perfsonar-user - [perfsonar-user] Re: Central MA database size snuck up on me
- From: Casey Russell <>
- To: "" <>
- Subject: [perfsonar-user] Re: Central MA database size snuck up on me
- Date: Wed, 9 Aug 2017 16:59:51 -0500
Sorry, I should have mentioned. Following Andrew's hint, I went back and looked, and because this is a toolkit install, the cron job was installed and should have been running periodically to keep the database from growing out of control. However, reviewing the log file at /var/log/perfsonar/clean_esmond_db.log shows that it has been failing with this same error, every time it tries to run, for as far back as my logs go.
It fooled me initially because the "Error: timed out" line doesn't appear as the last line of that log file; it's a few lines up from the bottom, so a cursory review of the log makes it look like the job ran cleanly.
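For anyone who hits the same thing, a quick way to catch a failure that's buried a few lines above the end of the log, rather than on the last line, is to grep for the error text instead of just eyeballing the tail. Something like this (a minimal sketch, assuming failed runs always log a line containing "Error"):

  # list any error lines with line numbers, most recent last
  grep -n "Error" /var/log/perfsonar/clean_esmond_db.log | tail -n 20

  # one-line yes/no check, easy to run from cron or a monitoring script
  grep -q "Error" /var/log/perfsonar/clean_esmond_db.log && echo "clean_esmond_db has logged failures"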
On Wed, Aug 9, 2017 at 4:55 PM, Casey Russell <> wrote:
Group,

I've now shut down the central MA and more than doubled the size of the disk (root volume), so it's no longer an issue of raw space. But ps_remove_data.py still won't run to completion. I get somewhere from 10 to 30 minutes in (it seems to vary) and it ends with something like the following:

  Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
  Sending request to delete 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
  Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
  Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
  Deleted 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0
  Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0
  Error: Retried 1 times. Last failure was timeout: timed out

None of these:

  /var/log/esmond/esmond.log
  /var/log/cassandra/cassandra.log
  /var/log/httpd/access_log
  /var/log/httpd/error_log

appear to contain anything interesting when the problem occurs. I've tried running a manual compaction with nodetool, in case there were simply so many tombstones hanging out there that it was causing Cassandra or Java/HTTP a problem in processing, but it didn't make any difference. Anyone have any thoughts on anything else I should try (adjusting Cassandra config file settings, etc.) before I just delete this database and start it fresh?
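For reference, this is roughly the kind of thing I mean by a manual compaction and tombstone check (just a sketch; the keyspace name esmond assumes a stock toolkit install, and exact nodetool sub-command names can vary between Cassandra versions):

  # per-column-family sizes and tombstone counts for the esmond keyspace
  nodetool cfstats esmond

  # whether compactions are currently running or backed up
  nodetool compactionstats

  # force a major compaction of the esmond keyspace to reclaim space from deletes
  nodetool compact esmond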
On Wed, Jul 26, 2017 at 10:15 AM, Casey Russell <> wrote:

Group,

I dug myself a hole and I only see a couple of ways out now. I wasn't watching the database size on my central MA, and my disk utilization is now over 90%. I've tried running the ps_remove_data.py script several times with several different variations on the config file, but it invariably ends some minutes or hours later with a timeout like this:

  Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
  Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380
  Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
  Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503
  Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
  Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505
  Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509
  Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
  Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
  Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0
  Error: Retried 1 times.
  Last failure was timeout: timed out

  [root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/
  47G   /var/lib/cassandra/data/esmond/raw_data
  4.0K  /var/lib/cassandra/data/esmond/stat_aggregations
  9.9G  /var/lib/cassandra/data/esmond/rate_aggregations
  13G   /var/lib/cassandra/data/esmond/base_rates
  69G   /var/lib/cassandra/data/esmond/

  [root@ps-dashboard esmond]# df -h
  Filesystem                    Size  Used Avail Use% Mounted on
  /dev/mapper/VolGroup-lv_root   95G   82G  8.1G  91% /
  tmpfs                         3.9G  4.0K  3.9G   1% /dev/shm
  /dev/sda1                     477M   99M  353M  22% /boot

At the time of the "timeout," as I watch, the disk reaches 100% utilization. It appears to me that during the deletion of rows, Cassandra/esmond uses chunks of disk space to store temporary data and then flushes it. During the process the disk utilization varies between 91% and 100% until the disk finally fills and the timeout error occurs. At the end of the failed attempt, even if I restart Cassandra, the disk space utilization is approximately what it was before the failed run.

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

Before I take one of those approaches, does anyone else have other ideas or thoughts?
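One thing that can keep space from coming back after deletes is old Cassandra snapshots, since compaction won't free SSTables that a snapshot still references. A rough way to check for and clear them, and to watch the disk while ps_remove_data.py runs (just a sketch, assuming the stock data path above and a Cassandra version whose nodetool has listsnapshots):

  # see whether any snapshots are holding on to old SSTables
  nodetool listsnapshots

  # drop all snapshots for the esmond keyspace to release that space
  nodetool clearsnapshot esmond

  # watch free space on the root volume while the cleanup script runs
  watch -n 30 'df -h / ; du -sh /var/lib/cassandra/data/esmond/*'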
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/28/2017
- Re: [perfsonar-user] Re: Central MA database size snuck up on me, Uhl, George D. (GSFC-423.0)[SGT INC], 08/28/2017
- [perfsonar-user] Re: Central MA database size snuck up on me, Casey Russell, 08/09/2017