Re: [grouper-users] RE: Status Monitoring - Two Errors
  • From: "Gettes, Michael" <>
  • To: "Black, Carey M." <>
  • Cc: Chris Hyzer <>, Ryan Rumbaugh <>, "" <>
  • Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors
  • Date: Fri, 7 Sep 2018 19:25:06 +0000

Ok, I need some education then.  In the scenario you describe (and I agree with the value of what you describe), why wouldn’t it work to fire up the new loaders and, at some point, shoot the old ones?  If a loader was mid-job and you shoot it, the next scheduled time the job would re-run, and all is well.  Is it not acceptable to skip a cycle in the less frequent scenario you describe?

/mrg

On Sep 7, 2018, at 3:07 PM, Black, Carey M. <> wrote:

Chris,
 
I think we are not yet seeing the same picture. I am thinking about a use case more along the lines of “docker spin up / spin down” type cycles.
                Which means patches, upgrades, icon changes, days of the week, moving load between data centers, moving to or from the cloud, etc…
                Basically when the wheels are going “round and round”. :)
 
A Grouper shop could spin up a “new loader”.
                It would happily start processing jobs etc… (that are not already running on other loaders.)
Then go to the “old loader(s)” and say “Hey.. you have been replaced. Finish your work and die.”
 
I see no “gap in things running” in that process.
                Start a “new home” for the jobs to move to as they can. ( by schedule and/or run time for the job)
                Wait for them to finish, then exit.
 
 
( Yes, I think it is generally a bad idea to have long running jobs. But sometimes that is what it takes to do the job. Larger data sets take more time.)
 
--
Carey Matthew 
 
From: Hyzer, Chris <> 
Sent: Friday, September 7, 2018 2:24 PM
To: Black, Carey M. <>
Cc: Ryan Rumbaugh <>; ; Gettes, Michael <>
Subject: RE: [grouper-users] RE: Status Monitoring - Two Errors
 
First off, the loader process is also the grouper daemon; there’s more there than just the loader.  There are long-running daemon jobs and there are short-running daemon jobs.  I can’t imagine someone would want a quiesce where it takes a couple of hours to stop the loader and, in the meantime, no jobs run, including jobs that do change log temp to change log, send out messages, provision, etc.  Is this only for upgrades?  You want it to stop, do whatever you had to do, turn it back on quickly, and any jobs that didn’t finish, it should try again (and they will continue where they left off, not including the initial query/filter).  We have discussed this and have a Jira on it.
 
 
If you want a quiesce, with a timeout of a minute or 5 or whatever, then each daemon job type would, I think, need to check whether it is quiescing and return gracefully from where it is (since I assume it’s just a transaction-level thing, not the entire job).  I think the above Jira would be higher priority…
 
Anyway, if I’m off base, please correct me.
 
Thanks
Chris
 
From:  [] On Behalf Of Black, Carey M.
Sent: Friday, September 07, 2018 1:58 PM
To: Hyzer, Chris <>
Cc: Ryan Rumbaugh <>; ; Gettes, Michael <>
Subject: RE: [grouper-users] RE: Status Monitoring - Two Errors
 
Chris,
 
RE: “If we wait until work finishes, how do you define work, and will it ever really finish?”
 
The “loader” is a big topic: ( AKA: What does a Loader process do?)
                Background processes for grouper
                                Daily report
                                Rules engine
                                Attestation
                                PSP ( PSPNG?)
                                Find Bad Memberships
                                TIER Instrumentation
                Loader jobs ( pull data into grouper)
                                Ldap sources
                                RDBMS sources
                ChangeLogConsumers ( send data out of grouper )
                                Custom code and a host of “send data out of grouper” type of things
                Others?...?
 
                And then there are the conditions/interactions around running N loader processes too.
                                They internally make sure they are not running the same job on N loaders.
                                They “skip running processes” if they come due again.
                                So currently I don’t think it is possible to know which one of the loaders a job will “decide to run” on.
 
 
My thoughts about the loader “quiesce” mode would be to:
1)      No longer start any new jobs on that instance.
                                Essentially nullify all schedules, and do not check for changed schedules until after restart.
                                This would include all of the “internal jobs” like Daily reports, Rules engine, etc…
2)      Let the running jobs run till completion or “failed to complete” state.
3)      Then exit.
 
                This would allow a host to be “quiesced” and the workload to be “rolled off to other nodes” in a controlled way, without requiring “rework” or disrupting the current work and causing undesired delays for those jobs.
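                ( Untested, but something like this looks possible with the plain Quartz API. A minimal sketch, assuming direct access to the org.quartz.Scheduler, which may not match how the loader actually wires things up:
                “
                import org.quartz.Scheduler;
                import org.quartz.SchedulerException;

                public class LoaderQuiesce {
                    public static void quiesceAndExit(Scheduler scheduler) throws SchedulerException {
                        // 1) No longer start any new jobs: stop firing triggers (jobs already executing keep running).
                        scheduler.standby();
                        // 2) Let the running jobs run till completion: shutdown(true) blocks until executing jobs complete.
                        scheduler.shutdown(true);
                        // 3) Then exit.
                        System.exit(0);
                    }
                }
                “ )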
               
I am not sure how processes would “not finish”. Can you explain that part of your response?
 
 
 
 
However, maybe it would be helpful to take a single specific example and walk through it in detail? ( Basically a “Long running/big process” condition. )
 
I have “LDAP Loader jobs” ( mostly “LDAP_GROUPS_FROM_ATTRIBUTES”, but there are other styles of loader jobs too, including some SQL jobs ). Some of them can pull in large numbers of groups and/or members.
                In fact, I have had to break down a single ldap search condition into many narrower searches to reduce the size and number of the groups returned, so that the RAM/CPU load is manageable over time. Well, and so the job would actually finish.
                Just for the record, I have done things like dump millions of ldap objects with standard LDAP command line tools from this source, and it normally took between 30 min and about 2 hours depending on the complexity of the search and how well indexes support it. So the LDAP source can support the work. And the search that I am using is well indexed, so we should be on the low end of that range. ( Yet the loader job takes about 2 hours to complete, when it does not error out and fail. But that is a different topic…. )
 
 
                So as a “simple example” ( that I think most universities could relate to ), let us talk about the largest cohort that a University has: their Alumni.
                We try to provide some services to our Alumni, so the University needs to know who is an alum for authorization data to applications. For our current numbers we are talking about a single group on the order of 500K members. I have isolated that group loader job to just load that one group. And, well, it does not behave very well. It takes a lot of RAM when it runs, and I think I have even observed CPU spikes while it is running. So much so that I have disabled the job, and I am looking for a “better way” to deal with the large group “issue” that I see.
                ( I have not broken this “one group” down into “Load 26 sub groups” ( by first letter of last name ) with a parent group that has those sub groups as members. But I may need to go there…. I just don’t want to. :( )
                However, in fairness, grouper 2.4 moved to Ldaptive ( instead of vt-ldap ), and that may change this in some helpful ways. I still think this is a good example for many reasons. ( And no, this set does not just change at the end of terms. It is a continuous flow, with very large spikes of change at the end of term. Believe it or not, we even try to know when our Alumni change state to “deceased” as well, which accounts for most of the continuous membership changes for this group. )
                This job can take 2 hours to complete to “success”.
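                ( Re the 26 sub groups idea: if I did go there, I imagine it would just mean narrowing the loader’s ldap filter per letter, something like the following; the attribute names here are placeholders, not our real schema:
                “
                (&(affiliation=alumni)(sn=a*))
                (&(affiliation=alumni)(sn=b*))
                …
                (&(affiliation=alumni)(sn=z*))
                “ )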
 
 
So I will continue on this example.
Just talking about the run time of this one loader job:
                Obviously this loader job takes time to search the ( ldap ) source for 500k entries (members). ( And the data can be changing while the pull of data is going on too, but I leave that as a “source” issue to deal with. ) From previous experience I expect that to be about 20-40 minutes from “search” to “results”.
                So if that job is running and the loader job is killed, then a lot of work (time/CPU cycles) may be “lost”, and it will take time for the next loader to “start again” and get back to the “relative point” in the job that was killed.
 
                Questions about what happens when the loader job is abruptly stopped:
                                In the middle of the query(s)? How would you “pick up where you left off”? Maybe just start again?
                                While loading the results into the grouper staging table(?) How do you know it was done loading the data? Is there a “total count” recorded before the first record is loaded?
                                While converting the temp data into memberships? ( Maybe you could continue from here… maybe….)
                                Am I describing the internal process of the loader job poorly? <-- If so, then it could be that I just don’t understand the phases of the job well enough to see the features.
                                                Maybe there are “gates” that are recoverable points where the next loader could “pick up and keep going”?
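                ( To make that last question concrete, here is a purely hypothetical sketch of the kind of “gates” I am imagining; none of these phase or method names exist in grouper as far as I know:
                “
                public class CheckpointedLoaderJob {
                    // Purely hypothetical phases; the loader's real internals may differ.
                    enum Phase { NONE, QUERIED_SOURCE, LOADED_TEMP_TABLE, SYNCED_MEMBERSHIPS }

                    public void run(String jobName) {
                        // A successor loader reads the last gate that was passed, and resumes after it.
                        Phase last = readLastCompletedPhase(jobName);
                        if (last.ordinal() < Phase.QUERIED_SOURCE.ordinal()) {
                            querySource(jobName);                          // search the ldap/sql source (or just start again)
                            checkpoint(jobName, Phase.QUERIED_SOURCE);     // gate 1
                        }
                        if (last.ordinal() < Phase.LOADED_TEMP_TABLE.ordinal()) {
                            loadTempTable(jobName);                        // stage the results
                            checkpoint(jobName, Phase.LOADED_TEMP_TABLE);  // gate 2
                        }
                        if (last.ordinal() < Phase.SYNCED_MEMBERSHIPS.ordinal()) {
                            syncMemberships(jobName);                      // convert staged rows to memberships
                            checkpoint(jobName, Phase.SYNCED_MEMBERSHIPS); // gate 3
                        }
                    }

                    // Stubs standing in for real work and for persisting the gate (e.g. in a status table).
                    Phase readLastCompletedPhase(String jobName) { return Phase.NONE; }
                    void checkpoint(String jobName, Phase phase) { }
                    void querySource(String jobName) { }
                    void loadTempTable(String jobName) { }
                    void syncMemberships(String jobName) { }
                }
                “ )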
 
--
Carey Matthew 
 
From: Gettes, Michael <> 
Sent: Friday, September 7, 2018 10:02 AM
To: Chris Hyzer <>
Cc: Black, Carey M. <>; Ryan Rumbaugh <>; 
Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors
 
Well, that’s cool if we can restart midway.  BUT, if grouper is down for an hour or twelve, I don’t think I would want to restart.  Maybe it is configurable?  The default being something like: a restart within 20 minutes causes grouperus loaderus interruptus to be continued; longer than that and we continue with the normal schedule???
 
(It’s Friday.  I’m punchy).
 
/mrg

 

On Sep 7, 2018, at 9:38 AM, Hyzer, Chris <> wrote:
 
I don’t think it is bad to stop loader jobs abruptly, but I agree that when it starts again it should continue with in-progress jobs.  Right?  If we wait until work finishes, how do you define work, and will it ever really finish?  If it picks back up where it left off, it should be fine, since things are transactional and not marked as complete until complete…  thoughts?
 
Thanks
Chris
 
From:  [] On Behalf Of Gettes, Michael
Sent: Monday, August 27, 2018 12:00 PM
To: Black, Carey M. <>
Cc: Ryan Rumbaugh <>; 
Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors
 
I’ve always wanted a quiesce capability.  Something that lets all the current work complete but the current loader instance won’t start any new jobs.  This would be needed for all loader daemons or just specific ones so we can safely take instances down.  I have no idea if this is possible with Quartz and haven’t had a chance to look into it.
 
/mrg

 

On Aug 27, 2018, at 11:20 AM, Black, Carey M. <> wrote:
 
Ryan,
 
RE: “I had been restarting the API daemon” …  ( due to docker use )
                I have often wondered how the “shutdown process” works for the daemon. Is it “graceful” ( and lets all running jobs complete before shutdown) or does it just “pull the plug”? 
                                I think it just pulls the plug.
                                Which “leaves” running jobs as “in progress”(in the DB status table) and they refuse to immediately start when the loader restarts. Well, until the “in progress” record(s) get old enough that they are assumed to be dead. Then the jobs will no longer refuse to start.
 
                I say that to say this:
                                If the loader is restarted repeatedly, quickly, and/or often, you may be interrupting the running jobs and leaving them as “in progress” (in the DB), producing more delay before the jobs restart again. But it all depends on how fast/often those things are spinning up and down.
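                                ( If you want to check, those stuck records are visible in the DB. A minimal JDBC sketch; table and column names are from memory, so verify job_name / status / started_time and the status values against your schema:
                                “
                                import java.sql.Connection;
                                import java.sql.DriverManager;
                                import java.sql.PreparedStatement;
                                import java.sql.ResultSet;

                                public class InProgressJobs {
                                    public static void main(String[] args) throws Exception {
                                        // args: jdbcUrl user password (the JDBC driver must be on the classpath)
                                        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2]);
                                             PreparedStatement ps = conn.prepareStatement(
                                                 "select job_name, status, started_time from grouper_loader_log "
                                                 + "where status in ('STARTED', 'RUNNING') order by started_time desc");
                                             ResultSet rs = ps.executeQuery()) {
                                            while (rs.next()) {
                                                System.out.println(rs.getString("job_name") + "  "
                                                    + rs.getString("status") + "  " + rs.getTimestamp("started_time"));
                                            }
                                        }
                                    }
                                }
                                “ )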
 
                                However, maybe if you are always spinning up instances ( and let the old ones run for a bit ), you may be able to “wait till a good time” to turn them off.
                                Maybe you could cycle out the old instances gracefully by timing it with these settings?
                                “
                                ##################################
                                ## enabled / disabled cron
                                ##################################
                                
                                #quartz cron-like schedule for enabled/disabled daemon.  Note, this has nothing to do with the changelog
                                #leave blank to disable this, the default is 12:01am, 11:01am, 3:01pm every day: 0 1 0,11,15 * * ?
                                changeLog.enabledDisabled.quartz.cron = 0 1 0,11,15 * * ?
                                “
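                                ( If you want to line instance cycling up with that schedule, Quartz’s own CronExpression class can compute the upcoming fire times. A minimal sketch:
                                “
                                import java.util.Date;
                                import org.quartz.CronExpression;

                                public class NextFires {
                                    public static void main(String[] args) throws Exception {
                                        // the enabled/disabled daemon schedule from the block above
                                        CronExpression cron = new CronExpression("0 1 0,11,15 * * ?");
                                        Date next = new Date();
                                        // print the next three fire times (12:01am, 11:01am, 3:01pm local time)
                                        for (int i = 0; i < 3; i++) {
                                            next = cron.getNextValidTimeAfter(next);
                                            System.out.println(next);
                                        }
                                    }
                                }
                                “ )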
 
 
RE: how to schedule the “deprovisioningDaemon”
 
                Verify that your grouper-loader.base.properties has this block: ( or you can add it to your grouper-loader.properties )
                NOTE: it was added to the default base as of GRP-1623 ( which maps to grouper_v2_3_0_api_patch_107, and for the UI grouper_v2_3_0_ui_patch_44 ). You are likely past those patches… but just saying. :)
                “
                #####################################
                ## Deprovisioning Job
                #####################################
                otherJob.deprovisioningDaemon.class = edu.internet2.middleware.grouper.app.deprovisioning.GrouperDeprovisioningJob
                otherJob.deprovisioningDaemon.quartzCron = 0 0 2 * * ?
                “
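                Once that is in place, you can probably kick the job off once from gsh instead of waiting for the 2am cron, the same way you ran the change log job ( I have not verified the exact job name here, so double-check it ):
                “
                loaderRunOneJob("OTHER_JOB_deprovisioningDaemon")
                “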
 
HTH.
 
-- 
Carey Matthew 
 
From:  <> On Behalf Of Ryan Rumbaugh
Sent: Monday, August 27, 2018 10:12 AM
To: 
Subject: [grouper-users] RE: Status Monitoring - Two Errors
 
An update to this issue that may be helpful to others…
 
Before I left the office on Friday, I ran the gsh command loaderRunOneJob("CHANGE_LOG_changeLogTempToChangeLog"), and now the number of rows in the grouper_change_log_entry_temp table is zero! I tried running that before but really didn’t see much of anything happening. Maybe I was just too impatient.
 
Now when accessing grouper/status?diagnosticType=all, the only error is related to “OTHER_JOB_deprovisioningDaemon”. If anyone has any tips on how to get that kick-started, it would be greatly appreciated.
 
 
--
Ryan Rumbaugh
 
From:  <> On Behalf Of Ryan Rumbaugh
Sent: Friday, August 24, 2018 9:15 AM
To: 
Subject: [grouper-users] Status Monitoring - Two Errors
 
Good morning,
 
We would like to begin monitoring the status of grouper by using the diagnostic pages at grouper/status?diagnosticType=all, but before doing so I would like to take care of the two issues shown below.
 
Can anyone provide tips/suggestions on how to fix the two failures for CHANGE_LOG_changeLogTempToChangeLog and OTHER_JOB_deprovisioningDaemon?
 
We had a Java heap issue late last week, which I believe caused the grouper_change_log_entry_temp table to keep growing. It’s at 69,886 rows currently, while earlier this week it was at 50k. Thanks for any insight.
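( That row count is just a select count(*) from the grouper_change_log_entry_temp table. )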
 
 
 
2 errors in the diagnostic tasks:
 
DiagnosticLoaderJobTest, Loader job CHANGE_LOG_changeLogTempToChangeLog
 
DiagnosticLoaderJobTest, Loader job OTHER_JOB_deprovisioningDaemon
 
 
 
Error stack for: loader_CHANGE_LOG_changeLogTempToChangeLog
java.lang.RuntimeException: Cant find a success in job CHANGE_LOG_changeLogTempToChangeLog since: 2018/08/16 14:19:22.000, expecting one in the last 30 minutes
                at edu.internet2.middleware.grouper.j2ee.status.DiagnosticLoaderJobTest.doTask(DiagnosticLoaderJobTest.java:175)
                at edu.internet2.middleware.grouper.j2ee.status.DiagnosticTask.executeTask(DiagnosticTask.java:78)
                at edu.internet2.middleware.grouper.j2ee.status.GrouperStatusServlet.doGet(GrouperStatusServlet.java:180)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.owasp.csrfguard.CsrfGuardFilter.doFilter(CsrfGuardFilter.java:110)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
                at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
                at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:478)
                at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
                at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
                at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:624)
                at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
                at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:341)
                at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:478)
                at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
                at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:798)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1441)
                at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
                at java.lang.Thread.run(Thread.java:748)
 
 
Error stack for: loader_OTHER_JOB_deprovisioningDaemon
java.lang.RuntimeException: Cant find a success in job OTHER_JOB_deprovisioningDaemon, expecting one in the last 3120 minutes
                at edu.internet2.middleware.grouper.j2ee.status.DiagnosticLoaderJobTest.doTask(DiagnosticLoaderJobTest.java:173)
                at edu.internet2.middleware.grouper.j2ee.status.DiagnosticTask.executeTask(DiagnosticTask.java:78)
                at edu.internet2.middleware.grouper.j2ee.status.GrouperStatusServlet.doGet(GrouperStatusServlet.java:180)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.owasp.csrfguard.CsrfGuardFilter.doFilter(CsrfGuardFilter.java:110)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
                at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
                at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
                at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:478)
                at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
                at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
                at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:624)
                at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
                at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:341)
                at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:478)
                at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
                at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:798)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1441)
                at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
                at java.lang.Thread.run(Thread.java:748)
 
--
Ryan Rumbaugh



