grouper-users - RE: [grouper-users] Grouper Loader - of another type

Subject: Grouper Users - Open Discussion List

From: "Black, Carey M." <>
To: "Gettes, Michael" <>
Cc: "" <>
Subject: RE: [grouper-users] Grouper Loader - of another type
Date: Thu, 14 Feb 2019 03:38:44 +0000

Michael,

REST (or SOAP) endpoints would be a nice feature for a loader job to be able to connect to. However, the inherent problem is that (unlike LDAP/SQL) the constructs for “selecting data” are not as simple as “a name” (attribute/column name).

What comes out of the REST/SOAP interface is a “TBD” document (JSON/XML formats are common, but not strictly correct all of the time either) that you need to be able to parse to “get the data you want” out. Simply put, it is not trivial to do that generically for *just any* REST target, especially since “REST” is very much subject to: “And thirdly, the code is more what you'd call "guidelines" than actual rules.”[1] (You at least need to invent an “extraction layer” and a “mapping layer” to get data from the external formats into an internal format.) Establishing those loader jobs would then require the implementer to implement (configure and/or write) those extraction and mapping layers.
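To make the two layers concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the JSON payload shape, the `courses`/`roster`/`netid` field names, and the `app:courses:` folder prefix are all hypothetical, and the output column names (`group_name`, `subject_id`) are only modelled on the kind of columns a loader query returns. A real REST target would need its own extraction code, which is exactly the problem described above.

```python
import json

# Hypothetical payload; a real REST target will have its own shape.
SAMPLE_RESPONSE = """
{"courses": [
  {"code": "MATH101", "roster": [{"netid": "alice"}, {"netid": "bob"}]},
  {"code": "CHEM200", "roster": [{"netid": "carol"}]}
]}
"""


def extract(document):
    """Extraction layer: pull (course code, netid) pairs out of the raw document."""
    payload = json.loads(document)
    for course in payload.get("courses", []):
        for person in course.get("roster", []):
            yield course["code"], person["netid"]


def to_loader_row(course_code, netid):
    """Mapping layer: translate source fields into loader-style columns.

    The folder prefix and column names are assumptions for this sketch.
    """
    return {
        "group_name": "app:courses:" + course_code.lower(),
        "subject_id": netid,
    }


def rows_from_response(document):
    """Run both layers and return a list of loader-style rows."""
    return [to_loader_row(code, netid) for code, netid in extract(document)]


if __name__ == "__main__":
    for row in rows_from_response(SAMPLE_RESPONSE):
        print(row)
```

Note that only `extract` knows about the source format and only `to_loader_row` knows about the target naming, so a second REST source would need a new extraction function but could reuse the same mapping idea.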

I cannot say that it is impossible. However, it would be “a lot” of work inside the Grouper project.

And since at the end you likely would need to code up some Java anyway… you might as well just “do it all yourself” and skip the fancy “generic” extraction/mapping layers/tools.

On the other hand…. ( If you are willing to load SQL staging tables….)

Personally, I am looking down a path to ingest REST-based data too. I have started down a path based on work done by another university. REF: https://spaces.at.internet2.edu/display/Grouper/Newcastle+University+Introduction (Thank you, by the way!)

Basically, the idea is to use an ETL (Extract, Transform, Load) tool to get the job done.

NOTE: There are many ETL tools to pick from. Talend is free, and it has its quirks. (REF: https://www.talend.com/products/data-integration/data-integration-open-studio/ ) FWIW: Talend talks to just about every data source you can think of. (See https://www.talendforge.org/components/index.php , Edition “Talend Open Studio for Data Integration”, click “Show All”, then do a find for what you are looking for. :) ) So I think this general approach should work for lots of sources.

I have been able to get as far as reading from a bespoke REST web service and loading the data into “shadow tables” in the Grouper DB. Then I was going to trigger a set of SQL loader jobs to pull the data into Grouper proper. :) (With a rather hacky way to fork gsh scripts to kick off the loader jobs at the end of the ETL work. :( )
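For readers who want to picture the “shadow table” step without an ETL tool in front of them, here is a minimal sketch in Python. It is not what Talend does internally; it only shows the idea of refreshing a staging table from already-extracted rows. The table name `stage_memberships`, its columns, and the use of SQLite as a stand-in for the Grouper DB are all assumptions made for this example.

```python
import sqlite3


def load_shadow_table(conn, rows):
    """Replace the contents of a staging ("shadow") table with a fresh extract.

    rows is an iterable of dicts with "group_name" and "subject_id" keys
    (hypothetical column names for this sketch).
    Returns the number of rows in the table after the refresh.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stage_memberships ("
        " group_name TEXT NOT NULL,"
        " subject_id TEXT NOT NULL,"
        " PRIMARY KEY (group_name, subject_id))"
    )
    # The delete and the inserts share one transaction, so a failed load
    # rolls back and leaves the previous contents in place.
    with conn:
        conn.execute("DELETE FROM stage_memberships")
        conn.executemany(
            "INSERT OR IGNORE INTO stage_memberships (group_name, subject_id)"
            " VALUES (:group_name, :subject_id)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM stage_memberships").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # stand-in for the real database
    extracted = [
        {"group_name": "app:courses:math101", "subject_id": "alice"},
        {"group_name": "app:courses:math101", "subject_id": "bob"},
    ]
    print(load_shadow_table(connection, extracted))
```

A SQL loader job could then select from a table like this one; how the loader job itself is defined and started is outside the scope of the sketch.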

Now, if someone were very enterprising, I could also see a possibility for the Grouper API to be called directly from inside Talend. (It is Java based, and has the ability to call “custom code” as a built-in feature.) While I quickly considered that idea, I also dismissed it.

The work to maintain that code may, over time, not be worth the effort. And given that the ETL model (at least in Talend) appears to be a “row-centric” model, you would likely be making more Grouper API calls than the Loader would need to make by ingesting an SQL table. (And frankly, using some SQL trickery before the loader job, and using an incremental loader job, would likely reduce the Grouper API calls even more dramatically.) However, starting the Loader job(s) without forking a process (to spin up a gsh script) would be nice. I just have not decided how hard it would be to get that level of integration into Talend. Frankly, having a Web Service call that could start a Loader process might be an ideal approach. (However, that Grouper Web Service would need some ACLs on it too. :) )

Picture loading multiple tables with 100K(s) of rows (via Talend).

Use some SQL to “diff” those rows with Grouper tables ( or a previous copy of the staging table ) to find the “nothing to do here” rows and drop them.

Then only process the “new” or “removed” rows in an SQL incremental loader process. ( with add / remove defined for each row that is left.)

That approach pushes the ETL and diff processing outside of Grouper and reduces the Grouper API work to a minimum. (Just adds and removes) And uses “standard, built in features” as much as possible too.
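The “diff” step described above can be sketched as follows. This is a minimal illustration in Python with SQLite standing in for the real database; the table names `stage_previous` and `stage_current`, their columns, and the `add`/`remove` action labels are hypothetical, and the exact row format an incremental loader job expects is not shown here.

```python
import sqlite3

# stage_previous: the last extract that was processed (hypothetical name).
# stage_current:  the fresh extract (hypothetical name).
# Rows present in both tables are the "nothing to do here" rows and are dropped.
DIFF_SQL = """
SELECT c.group_name, c.subject_id, 'add' AS action
  FROM stage_current c
  LEFT JOIN stage_previous p
    ON p.group_name = c.group_name AND p.subject_id = c.subject_id
 WHERE p.subject_id IS NULL
UNION ALL
SELECT p.group_name, p.subject_id, 'remove' AS action
  FROM stage_previous p
  LEFT JOIN stage_current c
    ON c.group_name = p.group_name AND c.subject_id = p.subject_id
 WHERE c.subject_id IS NULL
 ORDER BY 3, 1, 2
"""


def diff_memberships(conn):
    """Return only the changed rows as (group_name, subject_id, action) tuples."""
    return conn.execute(DIFF_SQL).fetchall()


def demo():
    """Build two tiny staging tables in memory and diff them."""
    conn = sqlite3.connect(":memory:")
    for table in ("stage_previous", "stage_current"):
        conn.execute(
            "CREATE TABLE " + table + " (group_name TEXT, subject_id TEXT)"
        )
    conn.executemany(
        "INSERT INTO stage_previous VALUES (?, ?)",
        [("app:courses:math101", "alice"), ("app:courses:math101", "bob")],
    )
    conn.executemany(
        "INSERT INTO stage_current VALUES (?, ?)",
        [("app:courses:math101", "alice"), ("app:courses:math101", "carol")],
    )
    return diff_memberships(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In the demo, the unchanged membership never reaches the output, so only the one addition and the one removal would be left for the incremental process to handle.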

True, there may be issues with the SQL diff process against the main Grouper tables. (Thus the use of shadow table(s).) I also considered offloading all of that DB work to a “read only replicated copy” of the DB. But that could lead to race conditions (unless the groups are strictly controlled by the above process). And it is likely more complicated than most users would want to get into. Yet, it should work efficiently and effectively, and scale well too.

My approach is still a work in progress. Others have already blazed this path too. ( YMMV, Y(Value)MV too. )

Likely not the answer you wanted, but I bet you can get this up and running in a few weeks with little to no code written. :) If not, let me know.

Carey Matthew

[1] Quote from Captain Barbossa from the Pirates of the Caribbean movie series. REF: https://en.wikiquote.org/wiki/Pirates_of_the_Caribbean:_The_Curse_of_the_Black_Pearl

From: <> On Behalf Of Gettes, Michael
Sent: Wednesday, February 13, 2019 8:45 PM
To: Richard Frovarp <>
Cc:
Subject: Re: [grouper-users] Grouper Loader - of another type

I’m wanting to manage many tens of thousands of groups. I believe that would mean many thousands of calls via web services, whereas a loader job would handle this all in bulk… the memberships, the privs, the names/descriptions all in one loader job. The scale, I believe, is best handled by the loader job.

I hope this helps.

/mrg

On Feb 13, 2019, at 8:09 PM, Richard Frovarp <> wrote:

Grouper newbie here. I would likely need something similar. Why not write intermediary code to use Grouper web services? That was my plan, so I'm curious as to what I'm missing.

From: <> on behalf of Gettes, Michael <>
Sent: Wednesday, February 13, 2019 4:36:55 PM
To:
Subject: [grouper-users] Grouper Loader - of another type

Hi all,

Currently, Grouper supports loader jobs for LDAP and SQL, plus an additional capability to inject messages to process changes related to an individual - a way of sparking loader jobs for one person instead of in bulk - at least this is my interpretation.

I have a need for loader jobs to be of an arbitrary nature - call a program, written in any language, which might do REST calls or whatever and return, in bulk, something similar to what the loader job now receives from SQL/LDAP. This way I can go against alternative sources without the need of staging the data into LDAP/SQL but get all the benefits and scale of a Grouper loader job.

Does anyone else see a need for this? Grouper dev dudes… (and dudettes)… have you considered this? I can only assume you have since you guys have thought of a great many things for Grouper.

Many thanks for your time and consideration especially if you choose to respond.

/mrg

