
grouper-dev - Re: [grouper-dev] Hello from Duke

  • From: "GW Brown, Information Systems and Computing" <>
  • To: Tom Barton <>
  • Cc:
  • Subject: Re: [grouper-dev] Hello from Duke
  • Date: Sun, 08 Jul 2007 08:46:53 +0100



--On 07 July 2007 15:59 -0500 Tom Barton <> wrote:



>>> GW Brown, Information Systems and Computing wrote:
>>>> 7. The xml-import takes about 3-4 days for us and requires 4 GB of
>>>> memory allocated to the Java process.
>>>
>>> It appears that the current DOM-based xml import/export approach does
>>> not scale. We (ie, the grouper-dev community) should settle on an
>>> alternative. Other JAXB-supported xml processing modes? A gsh-based
>>> approach?
>>
>> I think we need to understand better where the problem lies, i.e. how
>> much time is spent parsing the XML and figuring out which API calls to
>> make vs making the API calls. I'm sure another approach will be called
>> for, but we should understand what we will gain by changing the current
>> one.
>
> Good point - I was leaping prematurely to a conclusion. Can you recommend
> a reasonable means to determine where the time is being spent?
XML parsing is done in XMLReader. It would be straightforward to add code there to track the time spent and log it, or you could write a short bit of Java that just loads the XML.
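As a sketch of the second option - a minimal standalone timer using only the JDK's built-in DOM parser, deliberately independent of XMLReader and the rest of the Grouper API (class name is illustrative):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Time a bare DOM parse of the export file, without any Grouper API calls,
// to see how much of the multi-day load is pure XML parsing.
public class XmlLoadTimer {

    public static Document timedParse(InputStream in) throws Exception {
        long start = System.currentTimeMillis();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        long elapsed = System.currentTimeMillis() - start;
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("DOM parse took " + elapsed
                + " ms; heap in use afterwards: " + usedMb + " MB");
        return doc;
    }

    public static void main(String[] args) throws Exception {
        timedParse(new FileInputStream(args[0]));
    }
}
```

Running that against the same export file Duke imports would put a hard number on the parsing share of the total time.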

The other timings are harder and would need a profiler, but you should be able to get an idea of how much time is spent in the XMLImporter and associated classes, and how much time is spent in other Grouper API classes.

In principle the time spent doing the standard Grouper API calls should be similar whatever approach we use. XML loading could almost certainly be shortened, but would probably require two passes - one to "flatten" the XML so it can be handled in a linear way, and a second to actually make the API calls.
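To sketch what the first pass could look like (names here are illustrative, not actual XMLImporter code): a SAX handler streams the file and emits one flat, ordered record per stem/group, so a second pass could replay those records against the API without ever holding a DOM tree in memory:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Pass 1 of a hypothetical two-pass import: stream the XML with SAX and
// record each stem/group as a flat record. Pass 2 (not shown) would walk
// the records in order, making the Grouper API calls.
public class FlatteningHandler extends DefaultHandler {

    public final List<String> records = new ArrayList<String>();
    private final StringBuilder path = new StringBuilder();

    public void startElement(String uri, String local, String qName,
            Attributes atts) {
        path.append('/').append(qName);
        if ("stem".equals(qName) || "group".equals(qName)) {
            // Record enough context to recreate the element later.
            records.add(qName + " " + atts.getValue("extension")
                    + " at " + path);
        }
    }

    public void endElement(String uri, String local, String qName) {
        path.setLength(path.length() - qName.length() - 1);
    }
}
```

Fed to SAXParserFactory.newInstance().newSAXParser().parse(...), memory use stays flat regardless of file size, which is the main point of the exercise.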

> And the amount of RAM needed to complete the operation is also an issue.
> Can you propose a way to observe how that is allocated?
Again a profiler would be needed. Something like YourKit allows you to connect to a running process and take a snapshot of objects in memory. It also tracks how much time is spent garbage collecting.

The import process is split into several steps:
this._processMetadata(this._getImmediateElement(root, "metadata"));
this._process( this._getImmediateElement(root, "data"), this.importRoot );
this._processPaths(e, stem);
this._processGroups(e, stem);
this._processMembershipLists();
this._processNamingPrivLists();
this._processAccessPrivLists();

You could try to snapshot at each of these stages. It may be possible to discard objects that are no longer needed, which in turn might reduce the memory requirements.
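A crude alternative to full profiler snapshots would be to log approximate heap usage between those steps - sketch only, HeapLogger is not an existing Grouper class:

```java
// Sketch: log approximate heap usage between import steps, e.g.
//   HeapLogger.log("after _processGroups");
// Not a substitute for a real profiler, but cheap to drop in.
public class HeapLogger {

    public static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only, but tends to give a steadier reading
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void log(String stage) {
        System.out.println("[" + stage + "] heap in use: "
                + usedHeapMb() + " MB");
    }
}
```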

I'm not sure how practical it would be to profile / snapshot Duke's full load - 4 GB of RAM can hold an awful lot of objects, which would give very large snapshot files that the profiler UI might have difficulty processing. However, starting with smaller files might let us build a graph of file size vs overall load time and memory requirements, which would give us something to compare against any new / changed import code.

NB it would be worth knowing whether Duke changed the default garbage collection options for the load. On a multiprocessor machine there are concurrent / parallel collection strategies which could give a lower memory footprint.
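For example, on a Sun JVM the alternative collectors can be selected along these lines (flag names vary between JVM vendors and versions, so treat these as illustrative, and the trailing "..." stands in for the usual classpath and import command):

```shell
# Illustrative GC flags for a Sun 1.5-era JVM; names differ across vendors/versions.
java -Xmx4g -verbose:gc -XX:+UseConcMarkSweepGC ...   # concurrent collector
java -Xmx4g -verbose:gc -XX:+UseParallelGC ...        # parallel young-gen collector
```

The -verbose:gc output alone would tell us how much of the 3-4 days is spent collecting.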

Tom



----------------------
GW Brown, Information Systems and Computing



