
grouper-dev - Re: [grouper-dev] Performance of Group Searches

Subject: Grouper Developers Forum

  • From: "GW Brown, Information Systems and Computing" <>
  • To: Shilen Patel <>, Grouper Dev <>
  • Subject: Re: [grouper-dev] Performance of Group Searches
  • Date: Fri, 19 Oct 2007 10:02:26 +0100

--On 18 October 2007 17:24 -0400 Shilen Patel wrote:


I've been working on GRP-48 which involves improving the group search
performance in the Grouper API. I've found a few ways to make some huge
performance improvements, but before I get too far into code changes and
testing, I thought I would describe what I'm doing. This is primarily to
make sure I'm not breaking any design decisions that I may not be aware of.

So first here are the performance results. I'll use a specific search
example using Duke's test Grouper installation. We have 3120 "ECON"
courses somewhere within the stem duke:siss:courses. Note that these
results do not use the Grouper UI.

A search for ECON at the duke stem using a non-GrouperSystem session
currently takes 134 seconds. With code changes - 22 seconds. A search
for ECON at the duke stem using a GrouperSystem session currently takes
109 seconds. With code changes - 6 seconds.
A search for ECON at the root stem using a non-GrouperSystem session
currently takes 63 seconds. With code changes - 22 seconds.
A search for ECON at the root stem using a GrouperSystem session
currently takes 39 seconds. With code changes - 6 seconds.

After the modifications, in the cases where a non-GrouperSystem session
is created, about 75 percent of the time is actually spent on privilege
checking. I haven't yet looked for performance improvements in this
area. I've also noticed that the Grouper UI does some privilege
checking during group searches, but I don't understand why. Shouldn't
this already be taken care of in the API? Gary, can you comment on this?

The API checks that the user has VIEW privilege. Depending on the browse mode you are in when you search, I need to do further checks - ADMIN or UPDATE for Manage groups, OPTIN for Join groups. In principle we could extend the API interfaces to pass in the privileges so the API can do all the checks - this is effectively the approach we were trying for GRP-7.

I wonder whether the privilege checks should be done on the final result set after all the search filters have been resolved. Each search filter may return a lot of results, but ANDed searches may cause many to be discarded.
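
A minimal sketch of that idea, assuming each search filter yields a set of group names and the privilege check is an opaque, expensive predicate. The class name, method name, and `canView` parameter are hypothetical stand-ins and not part of the Grouper API:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not Grouper code.
final class DeferredPrivilegeCheck {

    /**
     * Intersects the results of all ANDed filters first, then runs the
     * privilege check only on the names that survived every filter.
     */
    static Set<String> search(List<Set<String>> filterResults, Predicate<String> canView) {
        if (filterResults.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Copy the first result so the caller's sets are never modified.
        Set<String> candidates = new LinkedHashSet<>(filterResults.get(0));
        for (Set<String> result : filterResults.subList(1, filterResults.size())) {
            candidates.retainAll(result);
        }
        Set<String> visible = new LinkedHashSet<>();
        for (String name : candidates) {
            if (canView.test(name)) { // one privilege check per surviving candidate
                visible.add(name);
            }
        }
        return visible;
    }
}
```

With two filters that each return three names but share only two, the predicate runs twice rather than six times. The saving grows with the number of results that the AND discards.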

So I've made 3 primary modifications to get the performance results
described above.

1. Using ehcache and adding a new cache type in grouper.ehcache.xml,
I've added caching of Member objects in GrouperAccessAdapter.

2. The next modification is related to scoping the results. To determine if a
group or a stem (X) is a child of another stem (Y), the API currently
does some recursive checks up the hierarchy of X to see if Y is found.
Instead I made a modification to just check the object names. If the
name of X starts with the name of Y, then X is a child of Y.

Could this be worked into the query itself rather than iterating through the results and doing comparisons?

In principle we could also filter results by privileges through the query - but then we would be breaking abstractions and making it difficult to create alternative privilege implementations.

3. To do the actual database search for the groups, the API currently
first gets a list of all group attribute ids by doing 1 query. For the
ECON example above, that would result in a list of 3120 group attribute
ids. Next, the API performs 3120 queries to retrieve all of the group
attribute data. Then there will be another 3120 queries to get the group
data. So that's 6241 queries. Furthermore, say sometime in the future
you want to call group.getName() on all of the 3120 groups, that will
result in 15,000 more queries. Anyways, so I reduced all that down to 1
query that takes about 5 seconds. I've set the group attributes as a
property of the group so that additional queries to get group attributes
are not needed. I did not use ehcache for this, although that might be
something to think about. Any thoughts on whether there will be problems
if group attributes are queried and saved ahead of time like this?

Some time will be spent reading data we don't always use, but the biggest impact may be on the number of objects created - however, I think that can be mitigated through appropriate tuning of garbage collection.
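
Returning to modification 2, the name-based scoping check can be sketched as follows. This is an illustration under stated assumptions, not the actual change: it assumes names use ":" as the level separator (as in duke:siss:courses) and that the root stem has an empty name. The class and method names are hypothetical. Note that a bare starts-with test would also match a sibling such as duke:sissy under duke:siss, so the sketch appends the separator before comparing.

```java
// Hypothetical illustration only; not Grouper code.
final class StemScope {

    // Assumption: ":" separates the levels of a hierarchical name.
    private static final String DELIM = ":";

    /**
     * Returns true if the group or stem named childName lies somewhere
     * beneath the stem named stemName, judged purely by name.
     */
    static boolean isUnderStem(String childName, String stemName) {
        if (stemName.isEmpty()) {
            // Assumption: an empty name denotes the root stem, which contains everything.
            return true;
        }
        // Appending the separator prevents "duke:sissy" matching "duke:siss"
        // and prevents a stem from counting as its own child.
        return childName.startsWith(stemName + DELIM);
    }
}
```

For example, `StemScope.isUnderStem("duke:siss:courses:ECON101", "duke:siss")` is true, while `StemScope.isUnderStem("duke:sissy:x", "duke:siss")` is false.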

So hopefully all this makes sense. Please let me know if you have any
thoughts or think what I'm doing is completely insane.

It does make sense. You have proven beyond doubt that making multiple queries on each item from a large result set does not scale, so we should be aiming to minimize that type of activity in the API (and UI).
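
To illustrate the single-query approach from modification 3, here is a rough sketch of how the rows of one joined query might be folded into groups with their attributes already attached, so that no per-group follow-up queries are needed. It assumes the query returns one row per (group, attribute) pair. The `Row` type, its fields, and the class name are hypothetical and do not reflect the real Grouper schema or API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Grouper code.
final class GroupAssembler {

    /** One row of an assumed joined result: group id, attribute name, attribute value. */
    static final class Row {
        final String groupId;
        final String attrName;
        final String attrValue;

        Row(String groupId, String attrName, String attrValue) {
            this.groupId = groupId;
            this.attrName = attrName;
            this.attrValue = attrValue;
        }
    }

    /**
     * Folds all rows from a single query into a map of
     * group id -> (attribute name -> attribute value).
     */
    static Map<String, Map<String, String>> assemble(List<Row> rows) {
        Map<String, Map<String, String>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            // Create the group's attribute map on first sight, then add the attribute.
            groups.computeIfAbsent(row.groupId, id -> new LinkedHashMap<>())
                  .put(row.attrName, row.attrValue);
        }
        return groups;
    }
}
```

A later lookup of a group's name then becomes an in-memory map access instead of a database round trip, at the cost of holding every attribute in memory whether or not it is used.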


-- Shilen

GW Brown, Information Systems and Computing
