
perfsonar-user - RE: [perfsonar-user] Re: Central MA database size snuck up on me

Subject: perfSONAR User Q&A and Other Discussion

List archive

RE: [perfsonar-user] Re: Central MA database size snuck up on me


  • From: "Garnizov, Ivan (RRZE)" <>
  • To: "Uhl, George D. (GSFC-423.0)[SGT INC]" <>, "Casey Russell" <>
  • Cc: "" <>, "Noss, Martyn J. (GSFC-420.0)[InuTeq, LLC]" <>
  • Subject: RE: [perfsonar-user] Re: Central MA database size snuck up on me
  • Date: Fri, 15 Sep 2017 07:47:48 +0000
  • Accept-language: en-GB, de-DE, en-US

Hi Casey, guys,

 

Here is what I have found about Cassandra DB and data cleanup.

 

Deletes in Cassandra

Cassandra uses a log-structured storage engine. Because of this, deletes do not remove the rows and columns immediately and in-place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the gc_grace_seconds per-table setting. Only then can a tombstone be permanently discarded by compaction.

This scheme allows for very fast deletes (and writes in general), but it's not free: aside from the obvious RAM/disk overhead of tombstones, you might have to pay a certain price when reading data back if you haven't modelled your data well.

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
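A toy sketch (plain Python for illustration, not actual Cassandra code) of the rule the quote describes: a tombstone only becomes eligible for removal by compaction once gc_grace_seconds has elapsed, so deleted data keeps occupying disk until then.

```python
# Illustrative sketch: why deleted rows keep occupying disk space until
# gc_grace_seconds has passed AND a compaction runs. Not esmond/Cassandra code.

GC_GRACE_SECONDS = 864000  # Cassandra's per-table default: 10 days

def tombstone_droppable(deleted_at, now, gc_grace=GC_GRACE_SECONDS):
    """A tombstone may be discarded by compaction only after gc_grace."""
    return now - deleted_at >= gc_grace

deleted_at = 0
# An hour after the delete, the tombstone must still be kept on disk...
print(tombstone_droppable(deleted_at, now=3600))                  # -> False
# ...but once the grace period has elapsed, a compaction can reclaim it.
print(tombstone_droppable(deleted_at, now=GC_GRACE_SECONDS + 1))  # -> True
```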

 

This tells me that even after certain time slots are deleted, it takes time for them to transition from tombstones to free space.

In addition to this, we are working towards improving the cleanup process, following Casey's report on the timeouts.

 

Regards,

Ivan Garnizov

 

GEANT SA1T2: pS deployments GN Operations

GEANT SA2T3: pS development team

GEANT SA3T5: eduPERT team

 

 

From: [mailto:] On Behalf Of Uhl, George D. (GSFC-423.0)[SGT INC]
Sent: Monday, 28 August 2017 18:44
To: Casey Russell
Cc: ; Noss, Martyn J. (GSFC-420.0)[InuTeq, LLC]
Subject: Re: [perfsonar-user] Re: Central MA database size snuck up on me

 

Hi Casey,

 

Thanks for following up!  I’m running a Central MA that I stitched together back in 2015.  I’m thinking that a lot of the more recent cassandra/esmond support scripts might not be as backwards compatible as I had hoped.  So in lieu of trying to remove old data, which eventually I will need to do, I went ahead and expanded the logical disk to buy some time until I can clean up the current MA and replace it with a new one.

 

Thanks,

George

 

From: Casey Russell <>
Date: Monday, August 28, 2017 at 9:46 AM
To: George Uhl <>
Cc: "" <>, "Noss, Martyn J. (GSFC-420.0)[InuTeq, LLC]" <>
Subject: Re: [perfsonar-user] Re: Central MA database size snuck up on me

 

George,

 

     I'm not sure if this is the cause of your error, but I do remember a couple of problems I encountered in the documentation for the script on this page:

 

 

 

First, if you haven't yet migrated to CentOS 7, you have to run the following commands PRIOR to running the ps_remove_data.py script:

 

(taken from that page)

 

cd /usr/lib/esmond

source /opt/rh/python27/enable

/opt/rh/python27/root/usr/bin/virtualenv --prompt="(esmond)" .

. bin/activate

python /usr/lib/esmond/util/ps_remove_data.py -c usr/lib/esmond/util/ps_remove_data.conf

 

Three things to note:

 

That third command really does end with a period, and it matters.  That fourth command really does begin with a period, and it matters as well.  :-)

 

If you've run this more than once, then you'll get an error in the output of the third command that reads: "New python executable in ./bin/python2  Not overwriting existing python script ./bin/python (you must use ./bin/python2)"  It appears to be harmless.

 

Finally, there's an error in the last command.  You'll have to add the leading "/" character in the file path at the end of the command.  So the corrected command will look like this.  

 

python /usr/lib/esmond/util/ps_remove_data.py -c /usr/lib/esmond/util/ps_remove_data.conf
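As a hedged aside, a small pre-flight check like the following (a hypothetical wrapper, not part of esmond) would catch a relative or missing config path, such as the dropped leading "/", before the script starts, rather than failing cryptically mid-run:

```python
# Hypothetical pre-flight check (not part of esmond): validate the -c config
# path before launching ps_remove_data.py, so a relative or nonexistent path
# fails loudly up front.
import os

def check_config_path(path):
    """Return the path if it is absolute and exists; raise otherwise."""
    if not os.path.isabs(path):
        raise ValueError(
            "config path %r is not absolute -- did you drop the leading '/'?" % path)
    if not os.path.isfile(path):
        raise FileNotFoundError("config file %r does not exist" % path)
    return path

# Demo: the documentation's typo (relative path) is rejected immediately.
try:
    check_config_path("usr/lib/esmond/util/ps_remove_data.conf")
except ValueError as exc:
    print(exc)
```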

 

Now maybe that will help you, maybe not.  But it might help others who are looking to run that script manually.


 

Sincerely,

Casey Russell

Network Engineer

KanREN

phone: 785-856-9809

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

On Mon, Aug 28, 2017 at 7:23 AM, Uhl, George D. (GSFC-423.0)[SGT INC] <> wrote:

 

I’ve had the same problem where I’ve run out of disk space to support my Central MA database but I’m getting an odd error and the script is failing:

 

# /usr/lib/esmond/util/ps_remove_data.py 

cassandra_db [INFO] Checking/creating column families

cassandra_db [INFO] Schema check done

cassandra_db [DEBUG] Opening ConnectionPool

cassandra_db [INFO] Connected to ['localhost:9160']

Error: [Errno 2] No such file or directory: 'p'

 

 

# du -h /var/lib/cassandra/data/esmond/

25G     /var/lib/cassandra/data/esmond/raw_data

7.6G    /var/lib/cassandra/data/esmond/base_rates

4.0K    /var/lib/cassandra/data/esmond/stat_aggregations

5.0G    /var/lib/cassandra/data/esmond/rate_aggregations

37G     /var/lib/cassandra/data/esmond/



# df -h

Filesystem            Size  Used Avail Use% Mounted on

/dev/mapper/vg_archive2-lv_root

                       50G   47G  385M 100% /

tmpfs                 3.9G  4.0K  3.9G   1% /dev/shm

/dev/sda1             477M  167M  285M  37% /boot

/dev/mapper/vg_archive2-lv_home

                       69G   22G   44G  33% /home

 

Thanks,

George

 

From: <> on behalf of Casey Russell <>
Date: Wednesday, August 9, 2017 at 5:59 PM
To: "" <>
Subject: [perfsonar-user] Re: Central MA database size snuck up on me

 

Sorry, I should have mentioned.  Following Andrew's hint, I went back and looked, and because this is a toolkit install, the cron job was installed and should have been running periodically to keep the database from growing out of control.  However, reviewing the logfile at /var/log/perfsonar/clean_esmond_db.log reveals that it has been failing with this error, each time it tries to run, for as far back as my logs go.

 

It had fooled me initially because the "Error: timed out" line doesn't appear as the last line in that log file; it's a few lines up from the bottom, so a cursory review of the logfile makes it look like the job ran cleanly.
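A quick sketch of the kind of check that avoids this trap: scan the whole log for error lines rather than glancing at the tail. (The log format here is assumed from the excerpts in this thread.)

```python
# Sketch: flag "Error:" lines anywhere in clean_esmond_db.log, since the
# failure line can sit a few lines above the end of the file and a quick
# look at the tail will miss it. Log format assumed from this thread.
def find_errors(lines):
    """Return (line_number, text) pairs for lines that report an error."""
    return [(i, line.rstrip()) for i, line in enumerate(lines, 1)
            if line.startswith("Error:")]

sample = [
    "Deleted 1 rows for metadata_key=189be3bd...\n",
    "Error: Retried 1 times. Last failure was timeout: timed out\n",
    "Sending request to delete 11 rows ...\n",  # the error is NOT the last line
]
print(find_errors(sample))  # -> one hit, on line 2
```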


 

Sincerely,

Casey Russell

Network Engineer

KanREN

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

On Wed, Aug 9, 2017 at 4:55 PM, Casey Russell <> wrote:

Group,

 

     I've now shut down the central MA and more than doubled the size of the disk (root volume), so it's no longer an issue of raw space.  But the ps_remove_data.py script still won't run to completion.  I get somewhere from 10 to 30 minutes in (it seems to vary) and it ends with something like the following:

 

Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600

Sending request to delete 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Deleted 1 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0

Deleted 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0

Sending request to delete 11 rows for metadata_key=189be3bd15fb482a91f2bfb9524c4473, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0

Error: Retried 1 times. Last failure was timeout: timed out

 

None of these:

/var/log/esmond/esmond.log

/var/log/cassandra/cassandra.log

/var/log/httpd/access_log

/var/log/httpd/error_log

 

appear to contain anything interesting when the problem occurs.  I've tried running a manual compaction with nodetool in case there were simply so many tombstones out there that they were causing Cassandra or Java/HTTP a problem in processing; it didn't make any difference.  Does anyone have any thoughts on anything else I should try (adjusting Cassandra config file settings, etc.) before I just delete this database and start it fresh?
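One mitigation pattern worth noting for runs like this (a generic sketch, not esmond's actual implementation): issue the deletes in small batches with per-batch retries, so a single slow batch times out on its own and is recorded, instead of a single timeout ending the whole cleanup.

```python
# Sketch (assumption: NOT esmond's real code): delete rows in small batches
# with retries and backoff, so one slow batch fails alone rather than
# aborting the entire cleanup run.
import time

def delete_in_batches(keys, delete_fn, batch_size=10, retries=3, backoff=1.0):
    """Call delete_fn on each batch; retry a timed-out batch before giving up."""
    failed = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        for attempt in range(retries):
            try:
                delete_fn(batch)
                break
            except TimeoutError:
                time.sleep(backoff * (attempt + 1))  # simple linear backoff
        else:
            failed.append(batch)  # record the failure and keep going
    return failed

# Demo with a fake delete function that times out once, then succeeds.
calls = {"n": 0}
def flaky_delete(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("timed out")

print(delete_in_batches(list(range(25)), flaky_delete,
                        batch_size=10, backoff=0.01))  # -> [] (no failed batches)
```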

 


 

Sincerely,

Casey Russell

Network Engineer

KanREN

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

On Wed, Jul 26, 2017 at 10:15 AM, Casey Russell <> wrote:

Group,

 

     I dug myself a hole and I only see a couple of ways out now.  I wasn't watching the database size on my central MA and my disk utilization is now over 90%.  I've tried using the ps_remove_data.py script several times with several different variations on the config script, but it will invariably end some minutes or hours later with a timeout like this:

 

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=aggregation, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=base, summary_window=0, begin_time=0, expire_time=1485529380

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=0, begin_time=0, expire_time=1485529380

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-rtt, summary_type=statistics, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=histogram-ttl-reverse, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-count-lost-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545381

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-duplicates-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545503

Sending request to delete 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600

Deleted 24 rows for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=aggregation, summary_window=3600

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-loss-rate-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545505

Query error for metadata_key=1bdb8f32fe9d4194828d134f37fb37b0, event_type=packet-reorders-bidir, summary_type=base, summary_window=0, begin_time=0, expire_time=1469545509

Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Deleted 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Sending request to delete 6 rows for metadata_key=1be4b626486c46be88776b3530819ce8, event_type=packet-count-lost, summary_type=base, summary_window=0

Error: Retried 1 times. Last failure was timeout: timed out

 

[root@ps-dashboard esmond]# du -h /var/lib/cassandra/data/esmond/

47G     /var/lib/cassandra/data/esmond/raw_data

4.0K    /var/lib/cassandra/data/esmond/stat_aggregations

9.9G    /var/lib/cassandra/data/esmond/rate_aggregations

13G     /var/lib/cassandra/data/esmond/base_rates

69G     /var/lib/cassandra/data/esmond/

 

[root@ps-dashboard esmond]# df -h

Filesystem            Size  Used Avail Use% Mounted on

/dev/mapper/VolGroup-lv_root

                       95G   82G  8.1G  91% /

tmpfs                 3.9G  4.0K  3.9G   1% /dev/shm

/dev/sda1             477M   99M  353M  22% /boot

 

At the time of the "timeout" as I watch, the disk reaches 100% utilization.  It appears to me that during the deletion of rows, Cassandra/Esmond uses chunks of disk space to store temporary data, and flushes that data.  During the process the disk utilization varies up and down from 91% to 100% until it finally reaches full and the timeout error occurs.

 

At the end of the failed attempt, even if I restart cassandra, the disk space utilization is approximately what it was before the failed run.  

 

So, without enough disk space to finish the ps_remove_data.py script, it would appear I have two options: delete all my data and start over with a clean database, or shut the machine down and allocate more space to it (it's a VM, but I can't add the space "hot").

 

Before I take one of those approaches, does anyone else have other ideas or thoughts?

 

Sincerely,

Casey Russell

Network Engineer

KanREN

2029 Becker Drive, Suite 282
Lawrence, Kansas 66047


 

 

 




Archive powered by MHonArc 2.6.19.
