
grouper-users - RE: [grouper-users] loader performance



  • From: Chris Hyzer <>
  • To: Jon Gorrono <>
  • Cc: Gagné Sébastien <>, "" <>
  • Subject: RE: [grouper-users] loader performance
  • Date: Wed, 17 Oct 2012 04:31:56 +0000
  • Accept-language: en-US

Grouper does have “hooks” which do something similar to what you ask, but there is nothing for reading memberships, since not all membership reads go through the API.  For instance, when you look at who the members of a group are, a group might be a member of another group, so those database joins are done without an API callback for each one.  So all the memberships in Grouper need to be pre-computed in the Grouper registry.  For course memberships, Penn supports the current term, next term, and previous term before that data gets purged from Grouper.  The data is still in the warehouse, though requests for that historical data are less common.  Also, if you want to keep the old data in Grouper, the course rosters for past terms probably aren’t still changing a year later, right?  If so, then you don’t need to keep reloading them…

 

Thanks,

Chris

 

From: Jon Gorrono [mailto:]
Sent: Wednesday, October 17, 2012 12:10 AM
To: Chris Hyzer
Cc: Gagné Sébastien;
Subject: Re: [grouper-users] loader performance

 

Sure....sorry. 

 

I was looking for some indication that someone had already cleared a path for intercepting membership requests in the API, with intermediate callouts to specific 'loader jobs' for the group(s) being queried.... So with respect to course memberships, for example, I might have daily crons with loader jobs for groups in the 'current' term(s)... and then intercept calls to membership methods in the API to do just-in-time updates for groups outside of that (term) range.

 

I wonder about this because it appears that a full sweep of all memberships across groups fed in from different systems would take too long to practically complete in a daily cron job.

 

Just poking around for future strategies :)

On Tue, Oct 16, 2012 at 5:52 PM, Chris Hyzer <> wrote:

 

> Are there any AOP cut-points already defined in the API so that I might respond to a direct group membership query by first doing a loader run for the group?

 

Can you explain this more clearly please?  Not really sure what you are looking for here…

 

Thanks,

Chris

 

 

 

On Tue, Oct 16, 2012 at 12:17 PM, Jon Gorrono <> wrote:

Ok... Thanks (both of you) for the pointers

 

I changed the view result limiter from 'rownum < 10001' to 'rownum < 200001' and I think I found a decent way to analyze the results... about 12,900 groups were created.
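For reference, the limiter is just a rownum predicate on the view the loader selects from (roughly this shape, with made-up view and table names):

create or replace view course_memberships_batch_v as
select group_name, subject_id, subject_source_id
  from course_memberships_all_v
 where rownum < 200001;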

 

The root of the stem for the loader-group is 'cms:' and the job I ran was the last job started yesterday... so I think this query makes sense:

 

select (et.jobEnd - st.jobStart) * 24 as runTime
from
  (select max(started_time) jobStart
     from grouper_loader_log
    where started_time > to_date('16-10-12', 'DD-MM-YY') - 1
      and job_name like '%cms:%'
      and job_name not like '%subjobFor%') st,
  (select max(ended_time) jobEnd
     from grouper_loader_log
    where started_time > to_date('16-10-12', 'DD-MM-YY') - 1
      and job_name like '%cms:%'
      and job_name like '%subjobFor%') et

RUNTIME
-------
9.06166666666666666666666666666666666667

 

Since I have about 3 million rows to go through, at roughly 9 hours per 200k batch that works out to around 135 hours, or 5 1/2 days.

 

I think we'll just plan to batch in sets of 200k for the initial provisioning (I am working in a dev environment), and perhaps look into a custom loader to do that.  And I'm anxious to see the resulting run times for a full update sweep.

 

Thanks again.

 

On Tue, Oct 16, 2012 at 6:51 AM, Chris Hyzer <> wrote:

You can query grouper_loader_log to get times of loader jobs.  To see the group list jobs, put a filter like this on:

 

JOB_NAME LIKE 'SQL_GROUP_LIST%'

 

 

Not sure if the counts are exactly accurate, but try it out and see.  It breaks out how much time it took to run the query and get the data, how much time it took to load the changes into Grouper, and the overall time.
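For example, something along these lines (I'm writing the millis_* column names from memory, so double-check them against your grouper_loader_log table):

select job_name, status, started_time, ended_time,
       millis_get_data,   -- time spent running the loader query and getting the data
       millis_load_data,  -- time spent loading the changes into Grouper
       millis,            -- overall time
       insert_count, update_count, delete_count, total_count
  from grouper_loader_log
 where job_name like 'SQL_GROUP_LIST%'
 order by started_time desc;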

 

Basically, if you are selecting the subjectId and sourceId in the query, then it will not resolve any subjects (on a non-first-run).  So it is just a matter of selecting the loader query, and selecting the membership list(s).
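In other words, a SQL_GROUP_LIST loader query shaped roughly like this (the roster view and source id are hypothetical) lets the loader skip per-subject searches on later runs:

select 'cms:' || r.course_key as group_name,
       r.student_id           as subject_id,
       'myJdbcSource'         as subject_source_id
  from course_roster_v r;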

 

Thanks,

Chris

 

 

From: Jon Gorrono [mailto:]
Sent: Tuesday, October 16, 2012 12:19 AM
To: Gagné Sébastien
Cc: Chris Hyzer;


Subject: Re: [grouper-users] loader performance

 

Thanks for all the input, Chris and Sebastien.

 

Regarding REST: I conflated Grouper Client with Grouper Shell... my bad.

 

Creating a custom loader is something I had not considered... good option.

 

Was also successful in getting some groups loaded with a limited set of memberships (10k), but analyzing the results is a little tricky with the spawned subjobs for each group.... is there a standard query to give an overall time? I truthfully haven't looked at the data for very long yet...

 

My first attempt at loading all course memberships going back to 2005 (which may be necessary for some systems) gave up the ghost after about 36 hours with no memberships created. Indexes on subject_id were in place (and on the constituents of the one join in the group loader view)... granted, there were about 3.3 million memberships... it could be that the view never returned any data...
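Next time I'll sanity-check the view and the join indexes separately before kicking off a job that size; something like this, with my own made-up names:

-- confirm the loader view actually returns rows (and roughly how many)
select count(*) from course_memberships_v;

-- index the join/filter columns feeding the view
create index roster_subject_id_idx on course_roster (subject_id);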

 

I am increasing the view return limit to 200k to see what happens.

 

 

On Fri, Oct 12, 2012 at 5:33 AM, Gagné Sébastien <> wrote:

I’ll chime in with my experience:

We did a custom loader job since we had more complex transformations to do. The Java classes use the Grouper API, and my subject source is LDAP.

 

Running a full import of 10,500 groups with 311,000 memberships (113,000 subjects) took 8 hours to complete. Running an update with no membership changes took only 2 hours (we check and edit group attributes).

 

The biggest performance improvement was caching all the subjects at once at startup in one query (it takes 45 seconds). Adding an uncached subject took ~250 ms, while adding a cached subject took ~50 ms.

 

I’m not sure exactly what Grouper does internally, but it has a non-negligible inherent overhead however you try to improve performance around it. To improve our loader job’s performance, I’m trying to reduce the number of calls to Grouper.

 

 

 

From: [mailto:] On Behalf Of Chris Hyzer
Sent: October 12, 2012 01:11
To: Jon Gorrono
Cc:
Subject: RE: [grouper-users] loader performance

 

I just started a loader job for Penn with 29k new members.  I made sure to have a SUBJECT_SOURCE_ID column in the loader query (this should speed things up), and to use SUBJECT_ID rather than SUBJECT_IDENTIFIER (which also speeds things up).  It is processing 300 records per minute, which is similar to the performance you are seeing (maybe a little worse).  However, a subsequent run of the already-loaded 30k-member group takes 2 minutes (it added/removed 100 members).

 

Btw, this means it takes about 200ms to add a member to a group (our registry has 5 million memberships), which is due to the queries required to make the change to the registry, plus auditing, change log, rules, composites, etc.  At some point we should profile this to see if it can get faster, though it seems acceptable.

 

Thanks,

Chris

 

From: Jon Gorrono
Sent: Thursday, October 11, 2012 6:45 PM
To: Chris Hyzer
Cc:
Subject: Re: [grouper-users] loader performance

 

Thanks for the response, Chris.

 

I eventually did start the process again and watched the grouper log and it was chugging... I pared the list of users down to active accounts (a little less than 99k) and it took 4 hours to complete.

 

You're right, indexes are probably the key... I added one to the id column of the table underlying the view, reran (with no membership changes), and it took 5 minutes. So that looks good.

 

I also wondered if there might be a significant lag in using gsh since it uses REST calls... e.g., might it be faster run as a cron job? I can't get the cron to start for some reason, so I haven't tested that theory yet.

 

Who is Shilen?

On Wed, Oct 10, 2012 at 7:48 PM, Chris Hyzer <> wrote:

Shilen always says to analyze your tables…  try that in the Grouper DB on all the Grouper tables, and I guess in your source as well.  Is it indexed on ID and IDENTIFIER?  I don’t think it should be slower just because it is a different DB than the Grouper one, unless it is on a different continent or something.
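On an Oracle registry that would be something like the sketch below (SQL*Plus syntax, and the source table name is a placeholder; adjust for your DB):

-- gather optimizer statistics on the Grouper schema
exec dbms_stats.gather_schema_stats('GROUPER');

-- make sure the subject source table/view is indexed on the lookup columns
create index person_id_idx         on person_source_table (id);
create index person_identifier_idx on person_source_table (identifier);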

 

The loader will commit as it goes, so it is partially done.  See how many users are in the group:

 

select count(*) from grouper_memberships_lw_v where group_name = 'test:testGroup' and list_name = 'members'

 

Start up that job again and, as it runs, check the progress with the count query.  The first run is always slow, since it is adding each membership; subsequent runs will be a lot faster since they only have the diffs to do, e.g. a few dozen or a few hundred memberships.

 

Thanks,

Chris

 

 

From: [mailto:] On Behalf Of Jon Gorrono
Sent: Wednesday, October 10, 2012 5:47 PM
To:
Subject: [grouper-users] loader performance

 

 

I am still down in the shallow end here :)

 

I am a little surprised at the time it is taking to load a group with the loader

 

I've created a simple view with subject_id and subject_source_id, defined the group in the UI, and created a loader job in the group attributes to select users from the view into the group.
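Concretely it is roughly this shape (object names are made up):

-- the view the loader selects from
create or replace view active_users_v as
select u.user_id as subject_id,
       'jdbc'    as subject_source_id
  from campus_users u
 where u.status = 'ACTIVE';

-- and the grouperLoaderQuery attribute on the group is just
-- select subject_id, subject_source_id from active_users_v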

 

There are about 130k users and in this case they are all being shoved into one group

 

The source for the Subject API lookups is a different view on the same remote machine as the view the loader uses to populate the group, and it uses the C3P0 JDBC connection provider.

 

The network is not likely the bottleneck, and the machines, both 'development-quality', are not top-notch but their mid-range performance is usually adequate for sane debug cycles etc.

 

The loader had been running for 40 minutes and had not yet finished when it was stopped (abruptly, heh) for scheduled patching.

 

I am guessing that the struggle it is having is with the remote source for the subjects...

 

So I guess my question is... is it really practical to have a remote subject source? If the answer is 'yes', then there are probably some better follow-up questions, but I am not sure what good ones might be right now.

 

Any comments, questions, suggestions are welcome.


 

--
Jon Gorrono
PGP Key: 0x5434509D - http://pgp.mit.edu:11371/pks/lookup?search=0x5434509D&op=index
http://middleware.ucdavis.edu



 

