
grouper-users - RE: [grouper-users] Maintaining Grouper database size

Subject: Grouper Users - Open Discussion List

  • From: "Hyzer, Chris" <>
  • To: "Black, Carey M." <>
  • Cc: Shilen Patel <>, David Langenberg <>, Gail H Lift <>, "" <>, Rory Larson <>
  • Subject: RE: [grouper-users] Maintaining Grouper database size
  • Date: Fri, 16 Feb 2018 17:52:33 +0000

Here's what I will do shortly.  Please read carefully and let me know ASAP about these two default choices, which I believe are useful and conservative.

 

############################################

## Audit entries with no logged-in user aren’t really all that useful, and the point in time data is still there, so removing them shouldn’t be a big deal.

## The default is to remove those that are more than 5 years old.

############################################

 

# number of days to retain db rows in grouper_audit_entry with no logged in user (loader, gsh, etc).  -1 is forever.  suggested is 365.  default is five years: 1825

loader.retain.db.audit_entry_no_logged_in_user.days=1825

 

############################################

## I think it’s OK to remove all audit entries over 10 years old, but I will default this to never, since even at Penn there aren’t that many records.

## These are audits for things people do on the UI or WS generally. 

############################################

 

# number of days to retain db rows in grouper_audit_entry.  -1 is forever.  suggested is -1 or ten years: 3650

loader.retain.db.audit_entry.days=-1
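
As a rough illustration of how a day-count setting like the two above could translate into a deletion cutoff, here is a minimal Python sketch. The helper name `retention_cutoff` is made up for this example and is not part of Grouper; it only demonstrates the "-1 is forever" convention described in the comments.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_cutoff(retain_days: int, now: datetime) -> Optional[datetime]:
    """Return the timestamp before which rows would be eligible for removal.

    Hypothetical helper, not Grouper code. A value of -1 means "keep forever",
    so there is no cutoff and None is returned.
    """
    if retain_days == -1:
        return None
    if retain_days < 0:
        raise ValueError("retain_days must be -1 or a non-negative number of days")
    return now - timedelta(days=retain_days)


now = datetime(2018, 2, 16, tzinfo=timezone.utc)
print(retention_cutoff(1825, now))  # 1825 days before `now`
print(retention_cutoff(-1, now))    # None: never delete
```

Note that 1825 days is a fixed day count, so the cutoff drifts by a day or so from an exact five-calendar-year boundary whenever leap years fall in the window.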

 

############################################

## After you delete an object in grouper, it is still in point in time.  So if you want to know who was in a group a year ago, you need this info

## However, I think after some time it’s OK to let it go.  So the default is 5 years.

############################################

 

# number of days to retain db rows for point in time deleted objects.  -1 is forever.  suggested is 365.  default is five years: 1825

loader.retain.db.point_in_time_deleted_object.days=1825

 

############################################

## This is optional.  You can set limits on deleted objects in point in time at the folder level.  So if you don’t need point in time data for deleted courses,

## you can get rid of it sooner…

############################################

 

# number of days to retain db rows for point in time deleted objects in a folder.  "courses" or "someLabel" are variables you make up in these examples

#loader.retain.db.point_in_time_deleted_objects_in_folder.courses.days=180

#loader.retain.db.point_in_time_deleted_objects_in_folder.courses.folderName=my:folder:for:courses

 

#loader.retain.db.point_in_time_deleted_objects_in_folder.someLabel.days=365

#loader.retain.db.point_in_time_deleted_objects_in_folder.someLabel.folderName=my:folder:for:whatever
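
To make the label convention concrete, here is a small Python sketch of how the paired `.days` / `.folderName` keys could be grouped by their made-up label. This is only an illustration of the naming scheme; the function `folder_retention` is hypothetical and says nothing about how the Grouper daemon actually reads its configuration.

```python
import re
from typing import Dict

# Key shape taken from the examples above; the label is any name you make up.
_FOLDER_KEY = re.compile(
    r"^loader\.retain\.db\.point_in_time_deleted_objects_in_folder"
    r"\.([^.]+)\.(days|folderName)$"
)


def folder_retention(props: Dict[str, str]) -> Dict[str, Dict[str, object]]:
    """Group per-label keys into {label: {"days": int, "folderName": str}}."""
    grouped: Dict[str, Dict[str, object]] = {}
    for key, value in props.items():
        match = _FOLDER_KEY.match(key)
        if not match:
            continue  # unrelated property, ignore it
        label, field = match.groups()
        entry = grouped.setdefault(label, {})
        entry[field] = int(value) if field == "days" else value
    # A label is only usable when both halves of the pair are present.
    incomplete = sorted(
        label for label, entry in grouped.items()
        if set(entry) != {"days", "folderName"}
    )
    if incomplete:
        raise ValueError(f"labels missing days or folderName: {incomplete}")
    return grouped


example = {
    "loader.retain.db.point_in_time_deleted_objects_in_folder.courses.days": "180",
    "loader.retain.db.point_in_time_deleted_objects_in_folder.courses.folderName":
        "my:folder:for:courses",
    "loader.retain.db.point_in_time_deleted_object.days": "1825",
}
print(folder_retention(example))
```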

 

############################################

## This is optional.  You can automatically obliterate folders in a parent folder once they reach a certain age, e.g. courses.

## So you could delete a term of courses that is 4 years old if you like.  Note: make sure the loader isn’t going to recreate them, or you will get churn…

############################################

 

# number of days after a subfolder is created that it will be obliterated (deleted) and point in time will be deleted too.

# "courses" or "anotherLabel" are variables you make up in these examples

#loader.retain.db.folder.courses.days=1825

#loader.retain.db.folder.courses.parentFolderName=my:folder:for:courses

 

#loader.retain.db.folder.anotherLabel.days=1825

#loader.retain.db.folder.anotherLabel.parentFolderName=my:folder:for:courses

 

 

From: Black, Carey M. [mailto:]
Sent: Wednesday, January 31, 2018 1:32 PM
To: Hyzer, Chris <>
Cc: Shilen Patel <>; David Langenberg <>; Gail H Lift <>; ; Rory Larson <>
Subject: RE: [grouper-users] Maintaining Grouper database size

 

In general, I think there is a clear need here for operational tools/processes to manage the DB data growth.

 

 

However, I also hate losing data. ( Delete is a form of “loss”. Hopefully a willful choice, but still a loss.)

Mostly because we lose the ability to ask a whole range of questions about “what really happened”? (While looking back instead of planning ahead. :) )

 

 

 

Maybe it would be better to have a model where this kind of audit data is moved from “Active” to “Archived” then off to “delete”?

                Maybe a shadow table(s) where the “Archived data” can be held just out of sight of the operation of the UI/WS, but still around for other reporting?

                                Your idea of a configurable schedule to define the duration of “Active” (days/weeks/months, with a move from “Active” to “Archived” on that schedule) and “Archived” (days/weeks/months/years) data sounds good.

                                Then add a later schedule to move from Archived to delete.
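
The Active / Archived / delete idea can be sketched in a few lines of Python. This is only a model of the proposal, not an existing Grouper feature; the function name and both duration parameters are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def lifecycle_stage(created: datetime, now: datetime,
                    active_days: int, archived_days: int) -> str:
    """Classify a record as "active", "archived", or "delete" by its age.

    Hypothetical model: a record stays active for `active_days`, is then
    archived for a further `archived_days`, and after that may be deleted.
    """
    age = now - created
    if age < timedelta(days=active_days):
        return "active"
    if age < timedelta(days=active_days + archived_days):
        return "archived"
    return "delete"


now = datetime(2018, 1, 31)
print(lifecycle_stage(datetime(2017, 12, 1), now, 365, 1460))  # active
print(lifecycle_stage(datetime(2015, 1, 1), now, 365, 1460))   # archived
print(lifecycle_stage(datetime(2010, 1, 1), now, 365, 1460))   # delete
```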

 

 

I also think there is the possibility for some to want to treat any membership change ( regardless of source [UI/WS/Loader/etc…]) as equally valuable, and others might see “non-human” process as less necessary to have in their active audit trail.

                So maybe the definition of that should be a separate config item? (AKA: “has a subject id”  vs “no subject id” for the change)

                Maybe even special groups that need more monitoring/carve outs for extra ( or reduced) retention too.

 

 

I also wonder whether some reports/summaries/monitoring should be done before the delete, to preserve some details/trends while still letting go of the volume of data?

                Maybe there are some groups where it would be nice to monitor the count of members once a day, month, etc. across the cycles of the academic/financial calendar?

                Maybe seeing spikes/dips in Loader loaded data by group/job?

                Maybe seeing growth/shrinking basis, ref, access control policy groups in the system over time?

                Etc…

 

So I think it may be harder than just “archive/delete every N days”. It might even be an opportunity to tag groups with attributes that signal what to do for each one (maybe with a system config default if not tagged). I'm thinking of something like Attestation, but for defining things like “ArchiveAfter”, “DeleteAfter”, “CollectStatsEvery”…
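
A per-group tag with a system-wide fallback could look roughly like the Python sketch below. The default values and the function name are invented for illustration; only the three attribute names come from the paragraph above.

```python
from typing import Dict, Mapping, Optional

# Hypothetical system-wide defaults, in days; None means "never".
SYSTEM_DEFAULTS: Dict[str, Optional[int]] = {
    "ArchiveAfter": 365,
    "DeleteAfter": 1825,
    "CollectStatsEvery": None,
}


def effective_policy(group_tags: Mapping[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Overlay the tags set on one group onto the system defaults."""
    unknown = sorted(set(group_tags) - set(SYSTEM_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown retention tags: {unknown}")
    return {**SYSTEM_DEFAULTS, **group_tags}


# An untagged group gets the defaults; a tagged group overrides only what it sets.
print(effective_policy({}))
print(effective_policy({"DeleteAfter": 3650, "CollectStatsEvery": 30}))
```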

 

--

Carey Matthew

 

From: [] On Behalf Of Rory Larson
Sent: Wednesday, January 31, 2018 1:04 PM
To: Hyzer, Chris <>
Cc: Shilen Patel <>; David Langenberg <>; Gail H Lift <>;
Subject: RE: [grouper-users] Maintaining Grouper database size

 

Agreed.  That would be a very nice feature.

 

Would time-based deletes be based on create-date or last-mod-date?  There seems to be a difference between these in the grouper_audit_entry table, though I'm not sure why a log record or point-in-time record would ever be modified.

 

Thanks,

Rory

 

 

From: Gail H Lift []
Sent: Wednesday, January 31, 2018 11:06 AM
To: David Langenberg <>
Cc: Hyzer, Chris <>; Rory Larson <>; Shilen Patel <>;
Subject: Re: [grouper-users] Maintaining Grouper database size

 

Sounds good here too. The configurable time intervals will make it easy to adjust to local needs.

 

On Wed, Jan 31, 2018 at 11:55 AM, David Langenberg <> wrote:

Sounds good to us.  We'd appreciate those maint jobs.

Dave

--
David Langenberg
Asst Director, Identity Management
The University of Chicago

On 1/31/18, 10:50 AM, " on behalf of Hyzer, Chris" < on behalf of > wrote:

    For audit records (this is the grouper_audit_entry_v):

    Penn currently has 11,685,969 records in grouper_audit_entry_v
    addGroupMembership and deleteGroupMembership account for 80% of the entries
    99% of entries have no logged_in_subject_id
    99% are loader, blank, grouperShell
    70k (less than 1%) are from the UI

    How about a daemon that:

    Deletes records older than a year (month?  Would be configurable) that have no logged in subject id (not tied to a user doing something)
    Deletes records older than a year (configurable) which are loader, blank, or grouperShell system
    There isn’t much left, but we could delete any record over 5 years if people want (we have 90k records older than 5 years which are grouperUI or grouperWS)
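
To get a feel for how many rows such rules would touch before deleting anything, the selection can be tried on a toy table. The Python sketch below uses an in-memory SQLite table named `audit_demo`, which is made up for this example; it borrows the `logged_in_subject_id` and `created_on` column names mentioned in this thread and assumes `created_on` holds epoch milliseconds. It covers the first and third rules above and only counts candidates. It is not a recommendation to delete from the real tables with direct SQL.

```python
import sqlite3

YEAR_MS = 365 * 24 * 60 * 60 * 1000  # one year of milliseconds (365-day year)


def count_candidates(conn: sqlite3.Connection, now_ms: int) -> int:
    """Count toy rows that are either system rows older than one year
    or any rows older than five years."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM audit_demo
        WHERE (logged_in_subject_id IS NULL AND created_on < ?)
           OR created_on < ?
        """,
        (now_ms - YEAR_MS, now_ms - 5 * YEAR_MS),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_demo ("
    "id INTEGER PRIMARY KEY, logged_in_subject_id TEXT, created_on INTEGER)"
)
now_ms = 10 * YEAR_MS  # arbitrary "current time" for the toy data
conn.executemany(
    "INSERT INTO audit_demo (logged_in_subject_id, created_on) VALUES (?, ?)",
    [
        (None, now_ms - 2 * YEAR_MS),       # system row, 2 years old: candidate
        (None, now_ms - YEAR_MS // 2),      # system row, 6 months old: kept
        ("jsmith", now_ms - 2 * YEAR_MS),   # user action, 2 years old: kept
        ("jsmith", now_ms - 6 * YEAR_MS),   # user action, 6 years old: candidate
    ],
)
print(count_candidates(conn, now_ms))  # 2
```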

    https://bugs.internet2.edu/jira/browse/GRP-1674


    #####################

    For PIT Penn has 10 million memberships in the PIT table.
    7.5 million (at least) are loader jobs
    5 million of the loader job memberships are older than 2 years

    How about a daemon that deletes all over 5 (configurable) years old, and loader data older than 2 (configurable) years?


    Thoughts?

    Thanks
    Chris

    -----Original Message-----
    From: Rory Larson [mailto:]
    Sent: Monday, January 29, 2018 4:22 PM
    To: Hyzer, Chris <>; Shilen Patel <>
    Cc:
    Subject: RE: [grouper-users] Maintaining Grouper database size

    Well, there is a "created_on" and a "last_updated" field, which are apparently Unix-type dates.  That would make it possible to delete everything in the table prior to a chosen date.  There's also "server_user_name", which gives the account that ran the transaction, e.g., myself vs. the grouperLoader.  That would let us delete only the ones that were run regularly under the grouperLoader, and keep the changes made by individual users.  I'm not sure how worried we should be about keeping that information.  On a 40 GB table, how long would a selected DML delete take?  And if we do it that way, we are still left with a huge table, because MySQL only marks records for deletion without actually deleting them.  In fact, wouldn't it just make the table bigger by adding all the deletion transactions?

    Thanks,
    Rory


    -----Original Message-----
    From: Hyzer, Chris [mailto:]
    Sent: Monday, January 29, 2018 2:45 PM
    To: Rory Larson <>; Shilen Patel <>
    Cc:
    Subject: RE: [grouper-users] Maintaining Grouper database size

    Is there a way to look at the data of the audit table and see which records to delete?   I think any loader jobs audits can be deleted (GrouperSystem is the entity?), but things users do through the UI for example should be kept so you know who did what.  Make sense?  Take a look and see if anything jumps out or we can look at our audit tables too...

    Thanks
    Chris

    -----Original Message-----
    From: Rory Larson [mailto:]
    Sent: Monday, January 29, 2018 3:42 PM
    To: Hyzer, Chris <>; Shilen Patel <>
    Cc:
    Subject: RE: [grouper-users] Maintaining Grouper database size

    Thanks, all, for the suggestions.  I think I'm hearing that I don't want to purge data from large point-in-time tables manually through SQL.  Instead, I should run an edu.internet2.middleware.grouper... command under gsh, such as:

        edu.internet2.middleware.grouper.pit.PITUtils.deleteInactiveRecords(new Date(), true); or
        edu.internet2.middleware.grouper.pit.PITUtils.deleteInactiveObjectsInStem("my:stem", true);

    This will eliminate inactive records or objects only, meaning that if the people are still around, they will still be taking up space with point-in-time data going back perhaps for many years, correct?  And additionally, we would still have to perform an OPTIMIZE TABLE on each PIT table to (possibly) realize any gains?

    I will plan to do this, but I'm unsure of how much space, if any, I'm likely to recover.

    Another suggestion was made off-line, that I should perform a TRUNCATE on the grouper_audit_entry table.  In fact, this table has grown to about 40 GB, and is apparently the biggest source of our problem.  I understand that this is simply a transaction log table, and that nothing else depends on it.  Truncation means that the table would essentially be dropped and then re-created, minus all the data.  That would be done in SQL.  Is there any reason not to do this?  If so, is there a better way to reduce its size?

    Thanks,
    Rory


    -----Original Message-----
    From: Hyzer, Chris [mailto:]
    Sent: Monday, January 29, 2018 1:03 PM
    To: Shilen Patel <>; Rory Larson <>
    Cc:
    Subject: RE: [grouper-users] Maintaining Grouper database size

    I'm worried about the future state of using direct SQL; something else may be needed in future that we would put in the GSH command. You should definitely use that.

    Thanks
    Chris

    -----Original Message-----
    From: [mailto:] On Behalf Of Shilen Patel
    Sent: Monday, January 29, 2018 1:34 PM
    To: Rory Larson <>
    Cc:
    Subject: Re: [grouper-users] Maintaining Grouper database size



    If you’re going to trim the audit based on time, I’d suggest using the gsh command since it’ll delete the records from the tables in the right order (taking into account foreign keys).  But as long as you delete them in the right order via sql directly, that should be fine as well.  Also, recently, I documented how much space was being taken up at Duke (we use Oracle).  https://spaces.internet2.edu/display/Grouper/Duke+Disk+Space+Usage   I was surprised to see so much of the space being used by indexes.  We could possibly look at reducing the number of indexes to help.



    - Shilen



    On 1/29/18, 9:13 AM, " on behalf of Rory Larson" < on behalf of > wrote:



        Hello,



        We are encountering an issue of our Grouper database growing to a size that threatens to use up the entire hard drive space available.  The database server is a physical machine with no room to expand, running a MySQL-type database (5.5.56-MariaDB MariaDB Server).



        The problem is not so much the size of the group data itself, but with point-in-time tables that record what that data was back to the beginning of the Grouper installation.  Some of these have grown to huge sizes, on the order of 20 GB or so.  A suggestion has been made that we could delete a lot of this prior to some reasonable point in time, especially groups or memberships that no longer exist.  This would be fine, but I'm wondering if this can be done from the database command line on a per-table basis, or whether there are dependencies that require doing this through gsh function calls?



        Also, because this is a MySQL-type database, shrinking the database physical size after deletion of unnecessary data is not trivial.  OPTIMIZE TABLE ... might work, or it might make a table bigger.  ALTER TABLE ... ROW FORMAT=COMPRESSED is promising as tried on our test database, but takes about four times as long as OPTIMIZE TABLE.  In either case, updating has to be shut down while running them, and they may require duplicating the table during the operation.  We are at 90% now, and can't afford any more big tables.  The ultimate MySQL solution seems to be to dump the database, delete the whole thing, and reload it, which would mean complete downtime for Grouper.



        Does anyone else using a MySQL-type database have a system for handling this problem and maintaining a reasonable database size?



        Thanks,

        Rory









 

--


Gail H Lift
MCommunity, IAM-IIA, ITS, University of Michigan



