grouper-users - Re: [grouper-users] RE: Status Monitoring - Two Errors

Subject: Grouper Users - Open Discussion List

From: "Gettes, Michael"
To: Chris Hyzer
Cc: "Black, Carey M.", Ryan Rumbaugh
Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors
Date: Fri, 7 Sep 2018 18:37:36 +0000

Given the capability to restart mid-job, which I didn’t know was possible and I believe is really slick, I think what Chris is proposing is a better way to go than quiesce - save the concern I echoed earlier today. This allows for quick shutdown and a quick return to service. That’s what it’s all about.

/mrg

On Sep 7, 2018, at 2:23 PM, Hyzer, Chris wrote:

First off, the loader process is also the Grouper daemon; there’s more there than just the loader. There are long-running daemon jobs and there are short-running daemon jobs. I can’t imagine someone would want a quiesce where it takes a couple of hours to stop the loader and, in the meantime, no jobs run, including the jobs that move change log temp to change log, send out messages, do provisioning, etc. Is this only for upgrades? You want it to stop, do whatever you had to do, and turn it back on quickly, and any jobs that didn’t finish should be tried again (and they will continue where they left off, not including the initial query/filter). We have discussed this and have a Jira on it:

https://bugs.internet2.edu/jira/browse/GRP-1671

If you want a quiesce with a timeout of a minute or 5 or whatever, then I would think each daemon job type would need to check whether the daemon is quiescing and return gracefully from wherever it is (since I assume it’s just a transaction-level thing, not the entire job). I think the above Jira would be higher priority…
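
The "check if quiescing and return gracefully" idea can be sketched in plain Java as below. This is only an illustration under assumptions: the class and method names (`QuiesceAwareJob`, `requestQuiesce`, `processUnits`) are hypothetical and are not Grouper APIs, and it assumes a job's work can be split into units that each commit on their own.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Grouper code. A job consults a shared flag
// between units of work (for example, between transactions).
final class QuiesceAwareJob {
    private final AtomicBoolean quiescing = new AtomicBoolean(false);

    // Called by whatever handles the shutdown request.
    void requestQuiesce() {
        quiescing.set(true);
    }

    // Runs the units in order and returns how many completed.
    // A unit that has started always finishes; the flag is only
    // checked before starting the next one.
    int processUnits(List<Runnable> units) {
        int completed = 0;
        for (Runnable unit : units) {
            if (quiescing.get()) {
                break; // return gracefully; the rest is left for a later run
            }
            unit.run();
            completed++;
        }
        return completed;
    }
}
```

With this shape, the time it takes to stop is bounded by the longest single unit rather than by the whole job.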

Anyway, if I’m off base, please correct me.

Thanks

Chris

From: Black, Carey M.
Sent: Friday, September 07, 2018 1:58 PM
To: Hyzer, Chris
Cc: Ryan Rumbaugh; Gettes, Michael
Subject: RE: [grouper-users] RE: Status Monitoring - Two Errors

Chris,

RE: “If we wait until work finishes, how do you define work, and will it ever really finish?”

The “loader” is a big topic: ( AKA: What does a Loader process do?)

                Background processes for grouper

                                Daily report

                                Rules engine

                                Attestation

                                PSP ( PSPNG?)

                                Find Bad Memberships

                                TIER Instrumentation

                Loader jobs ( pull data into grouper)

                                Ldap sources

                                RDBMS sources

                ChangeLogConsumers ( send data out of grouper )

                                Custom code and a host of “send data out of grouper” type of things

                Others?...?

                And then there are the conditions/interactions around running N loader processes too.

                                They internally make sure they are not running the same job on N loaders.

                                They “skip running processes” if they come due again.

                                So currently I don’t think it is possible to know in advance which one of the loaders a job will “decide to run” on.

My thoughts about the loader “quiesce” mode would be to:

1)      No longer start any new jobs on that instance.

                                Essentially nullify all schedules, and do not check for changed schedules until after restart.

                                This would include all of the “internal jobs” like Daily reports, Rules engine, etc…

2)      Let the running jobs run till completion or “failed to complete” state.

3)      Then exit.

                This would allow a host to be “quiesced” and “roll the workload off to other nodes” in a controlled way without requiring “rework” or disrupting the current work and causing undesired delays for those jobs.
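
As a rough sketch of the three steps above, the JDK's `ExecutorService` already follows this pattern: `shutdown()` refuses new work while letting accepted work finish, and `awaitTermination()` waits for it. The wrapper class `QuiesceThenExit` is hypothetical and says nothing about how the Grouper daemon or Quartz is actually wired.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "stop accepting, drain, then exit" using only the JDK.
final class QuiesceThenExit {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    // Returns false if this instance is quiescing and refused the job.
    boolean submit(Runnable job) {
        try {
            pool.execute(job);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Step 1: stop accepting jobs. Step 2: wait for accepted jobs to finish
    // (this includes jobs still waiting in the pool's queue).
    // Step 3, exiting the process, is left to the caller.
    // Returns true if everything finished within the timeout.
    boolean quiesce(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A `false` return from `quiesce` corresponds to the case where the timeout expired before the running jobs completed, which is where a caller would have to choose between waiting longer and stopping anyway.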



I am not sure how processes would “not finish”. Can you explain that part of your response?

However, maybe it would be helpful to take a single specific example and walk through it in detail? ( Basically a “Long running/big process” condition. )

I have “LDAP Loader jobs” ( mostly “LDAP_GROUPS_FROM_ATTRIBUTES”, but there are other styles of loader jobs too including some SQL jobs.) Some of them can pull in “large numbers of groups and/or members”.

                In fact, I have had to “break down a single ldap search condition” into many narrower searches to reduce the size and number of the data ( groups ) returned so that the RAM/CPU load is manageable across time. Well, and so the job would actually finish.

                Just for the record, I have done things like dump millions of LDAP objects from this source with standard LDAP command-line tools, and it normally took between 30 minutes and about 2 hours, depending on the complexity of the search and how well the indexes support it. So the LDAP source can support the work. And the search that I am using is well indexed, so we should be on the low end of that range. (Yet the loader job takes about 2 hours to complete, when it does not error out and fail. But that is a different topic….)

                So as a “simple example” ( that I think most universities could relate to ) let us talk about the largest cohort that a University has. Their Alumni.

                We try to provide some services to our Alumni. So the University needs to know who is an Alum, as authorization data for applications. For our current numbers we are talking about a single group on the order of 500K members. I have isolated that group loader job to just load that one group. And it, well, does not behave very well. It takes a lot of RAM when it runs, and I think I have even observed CPU spikes while it is running. So much so that I have disabled the job and I am looking for a “better way” to deal with the large group “issue” that I see. ( I did not break this “one group” down to “Load 26 sub groups” ( by first letter of their last name ) and then have a group that has those sub groups as members. But I may need to go there…. I just don’t want to. :( ) However, in fairness, Grouper 2.4 moves to Ldaptive ( instead of vt-ldap ) and that may change this in some helpful ways. However, I still think this is a good example for many reasons. ( And no, this set does not just change at the end of terms. It is a continuous flow, with very large spikes of change at the end of term. Believe it or not, we even try to know when our Alumni change state to “deceased” as well, which accounts for most of the continuous membership changes for this group. ) This job can take 2 hours to complete to “success”.
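
For illustration only, the “26 sub groups by first letter” split mentioned above could be driven by generating one narrower LDAP filter per letter. The base filter and the `sn` attribute used in the usage note are assumptions, not the filters used at any particular site, and the class name `LetterFilters` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: builds narrower LDAP search filters from one broad filter.
final class LetterFilters {
    // Returns one filter per letter a-z, each ANDed with the base filter.
    // Names that begin with anything other than a-z would need their own filter.
    static List<String> byFirstLetter(String baseFilter, String attribute) {
        List<String> filters = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            filters.add("(&" + baseFilter + "(" + attribute + "=" + c + "*))");
        }
        return filters;
    }
}
```

For example, `LetterFilters.byFirstLetter("(objectClass=person)", "sn")` yields 26 filters, the first being `(&(objectClass=person)(sn=a*))`. Each could back its own loader job, with a parent group holding the 26 sub groups as members.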

So I will continue on this example.

Just talking about the run time of this one loader job:

                Obviously this loader job takes time to search the ( ldap ) source for 500k entries (members). ( And the data can be changing while the “pull of data” is going on too. But I leave that as a “source” issue to deal with.) From previous experience I expect that to be about 20-40 minutes from “search” to “results”.

                So if that job is running and the loader job is killed, then a lot of work(time/cpu cycles) may be “lost”. And it will take time for the next loader to “start again” and get back to the “relative point” in the job that was killed.

                Questions about what happens when the loader job is abruptly stopped:

                                In the middle of the query(s)? How would you “pick up where you left off”? Maybe just start again?

                                While loading the results into the grouper staging table(?) How do you know it was done loading the data? Is there a “total count” recorded before the first record is loaded?

                                While converting the temp data into memberships? ( Maybe you could continue from here… maybe….)

                                Am I describing the internal process of the loader job poorly? <-- If so, then it could be that I just don’t understand the phases of the job well enough to see the features.

                                                Maybe there are “gates” that are recoverable points where the next loader could “pick up and keep going”?
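
To make the “gates” question concrete, here is a minimal sketch of what recoverable checkpoints between phases could look like. It is purely hypothetical: the phase names mirror the questions above and are assumptions, not the loader's real internal phases, and a real job would have to persist the completed set in the database rather than in memory.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch only; not a description of how the Grouper loader works.
final class GatedJob {
    enum Phase { QUERY_SOURCE, STAGE_RESULTS, APPLY_MEMBERSHIPS }

    // Phases recorded as complete (the "gates" that have been passed).
    private final Set<Phase> completed = EnumSet.noneOf(Phase.class);

    void markComplete(Phase phase) {
        completed.add(phase);
    }

    // The first phase, in declaration order, not yet recorded as complete,
    // or null when the whole job is done.
    Phase nextPhase() {
        for (Phase phase : Phase.values()) {
            if (!completed.contains(phase)) {
                return phase;
            }
        }
        return null;
    }
}
```

Under this model, a job killed in the middle of a phase restarts that phase from its beginning, and only phases that were fully recorded are skipped.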

--

Carey Matthew

From: Gettes, Michael
Sent: Friday, September 7, 2018 10:02 AM
To: Chris Hyzer
Cc: Black, Carey M.; Ryan Rumbaugh
Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors

Well, that’s cool if we can restart midway. BUT, if grouper is down for an hour or twelve, I don’t think I would want to restart. Maybe it is configurable? The default being something like a restart within 20 minutes causes grouperus loaderus interruptus to be continued. Longer than that and we continue with the normal schedule???
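
The configurable window suggested above could be sketched like this. The names (`ResumePolicy`, `decide`) are hypothetical, and the 20-minute value in the usage note comes from the suggestion itself, not from any existing Grouper setting.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a "resume only after a short outage" policy.
final class ResumePolicy {
    enum Action { RESUME_INTERRUPTED_JOB, WAIT_FOR_NORMAL_SCHEDULE }

    private final Duration resumeWindow;

    ResumePolicy(Duration resumeWindow) {
        this.resumeWindow = resumeWindow;
    }

    // Resume only if the daemon came back within the window after the
    // job was interrupted; otherwise fall back to the normal schedule.
    Action decide(Instant interruptedAt, Instant restartedAt) {
        Duration downtime = Duration.between(interruptedAt, restartedAt);
        return downtime.compareTo(resumeWindow) <= 0
                ? Action.RESUME_INTERRUPTED_JOB
                : Action.WAIT_FOR_NORMAL_SCHEDULE;
    }
}
```

With `new ResumePolicy(Duration.ofMinutes(20))`, a restart 5 minutes after the interruption resumes the job, while a restart 12 hours later waits for the normal schedule.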

(It’s Friday. I’m punchy).

/mrg

On Sep 7, 2018, at 9:38 AM, Hyzer, Chris wrote:

I don’t think it is bad to stop loader jobs abruptly, but I agree that when it starts again it should continue with in progress jobs. Right? If we wait until work finishes, how do you define work, and will it ever really finish? If it picks back up where it left off, it should be fine since things are transactional and not marked as complete until complete… thoughts?

Thanks

Chris

From: Gettes, Michael
Sent: Monday, August 27, 2018 12:00 PM
To: Black, Carey M.
Cc: Ryan Rumbaugh
Subject: Re: [grouper-users] RE: Status Monitoring - Two Errors

I’ve always wanted a quiesce capability. Something that lets all the current work complete but the current loader instance won’t start any new jobs. This would be needed for all loader daemons or just specific ones so we can safely take instances down. I have no idea if this is possible with Quartz and haven’t had a chance to look into it.

/mrg

On Aug 27, 2018, at 11:20 AM, Black, Carey M. wrote:

Ryan,

RE: “I had been restarting the API daemon” … ( due to docker use )

                I have often wondered how the “shutdown process” works for the daemon. Is it “graceful” ( and lets all running jobs complete before shutdown) or does it just “pull the plug”?

                                I think it just pulls the plug.

                                Which “leaves” running jobs as “in progress”(in the DB status table) and they refuse to immediately start when the loader restarts. Well, until the “in progress” record(s) get old enough that they are assumed to be dead. Then the jobs will no longer refuse to start.
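
The behavior described above (a job refuses to start while an “in progress” record exists, until that record is old enough to be assumed dead) amounts to a check like the following. This is a sketch of the described behavior only: the names are hypothetical and the threshold is a parameter because the actual value Grouper uses is not stated here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; not Grouper's actual code or threshold.
final class InProgressCheck {
    // A job may start if there is no in-progress record, or if the record
    // is older than the age at which a run is assumed to be dead.
    static boolean mayStart(Instant inProgressSince, Instant now, Duration assumedDeadAfter) {
        if (inProgressSince == null) {
            return true;
        }
        return Duration.between(inProgressSince, now).compareTo(assumedDeadAfter) > 0;
    }
}
```

This is why frequent restarts can add delay: each interrupted run leaves a fresh record that blocks the job until the record ages out.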

                I say that to say this:

                                If the loader is restarted repeatedly, quickly, and/or often, you may be interrupting the running jobs and leaving them as “in progress” (in the DB) and producing more delay on the jobs re-starting again. But it all depends on how fast/often those things are spinning up and down.

                                However, maybe if you are always spinning up new instances (and letting the old ones run for a bit), you may be able to “wait till a good time” to turn them off.

                                Maybe if you cycle out the old instances gracefully by timing it with these settings?

                                “

                                ##################################
                                ## enabled / disabled cron
                                ##################################

                                #quartz cron-like schedule for enabled/disabled daemon. Note, this has nothing to do with the changelog
                                #leave blank to disable this, the default is 12:01am, 11:01am, 3:01pm every day: 0 1 0,11,15 * * ?
                                changeLog.enabledDisabled.quartz.cron = 0 1 0,11,15 * * ?

                                “

RE: how to schedule the “deprovisioningDaemon”

                Verify that your grouper-loader.base.properties has this block: ( or you can add it to your grouper-loader.properties )

                NOTE: it was added to the default base as of GRP-1623. ( which maps to grouper_v2_3_0_api_patch_107 ( and for the UI grouper_v2_3_0_ui_patch_44 ) ) You likely are past those patches… but just saying. :)

                “

                #####################################
                ## Deprovisioning Job
                #####################################
                otherJob.deprovisioningDaemon.class = edu.internet2.middleware.grouper.app.deprovisioning.GrouperDeprovisioningJob
                otherJob.deprovisioningDaemon.quartzCron = 0 0 2 * * ?

                “

HTH.

--

Carey Matthew

From: Ryan Rumbaugh
Sent: Monday, August 27, 2018 10:12 AM
To:
Subject: [grouper-users] RE: Status Monitoring - Two Errors

An update to this issue that may be helpful to others…

Before I left the office on Friday I ran the gsh command loaderRunOneJob("CHANGE_LOG_changeLogTempToChangeLog") and now the number of rows in the grouper_change_log_entry_temp table is zero! I tried running that before, but really didn’t see much of anything happening. Maybe I was just too impatient.

Now when accessing grouper/status?diagnosticType=all the only error is related to “OTHER_JOB_deprovisioningDaemon”. If anyone has any tips on how to get that kick-started, it would be greatly appreciated.

--

Ryan Rumbaugh

From: Ryan Rumbaugh
Sent: Friday, August 24, 2018 9:15 AM
To:
Subject: [grouper-users] Status Monitoring - Two Errors

Good morning,

We would like to begin monitoring the status of grouper by using the diagnostic pages at grouper/status?diagnosticType=all, but before doing so I would like to take care of the two issues shown below.

Can anyone provide tips/suggestions on how to fix the two failures for CHANGE_LOG_changeLogTempToChangeLog and OTHER_JOB_deprovisioningDaemon?

We had a Java heap issue late last week which I believe caused the “grouper_change_log_entry_temp” table to keep growing. It’s at 69,886 rows currently while earlier this week it was at 50k. Thanks for any insight.

2 errors in the diagnostic tasks:

DiagnosticLoaderJobTest, Loader job CHANGE_LOG_changeLogTempToChangeLog

DiagnosticLoaderJobTest, Loader job OTHER_JOB_deprovisioningDaemon

Error stack for: loader_CHANGE_LOG_changeLogTempToChangeLog

java.lang.RuntimeException: Cant find a success in job CHANGE_LOG_changeLogTempToChangeLog since: 2018/08/16 14:19:22.000, expecting one in the last 30 minutes
    at edu.internet2.middleware.grouper.j2ee.status.DiagnosticLoaderJobTest.doTask(DiagnosticLoaderJobTest.java:175)
    at edu.internet2.middleware.grouper.j2ee.status.DiagnosticTask.executeTask(DiagnosticTask.java:78)
    at edu.internet2.middleware.grouper.j2ee.status.GrouperStatusServlet.doGet(GrouperStatusServlet.java:180)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.owasp.csrfguard.CsrfGuardFilter.doFilter(CsrfGuardFilter.java:110)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:478)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:624)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:341)
    at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:478)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:798)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1441)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

Error stack for: loader_OTHER_JOB_deprovisioningDaemon

java.lang.RuntimeException: Cant find a success in job OTHER_JOB_deprovisioningDaemon, expecting one in the last 3120 minutes
    at edu.internet2.middleware.grouper.j2ee.status.DiagnosticLoaderJobTest.doTask(DiagnosticLoaderJobTest.java:173)
    at edu.internet2.middleware.grouper.j2ee.status.DiagnosticTask.executeTask(DiagnosticTask.java:78)
    at edu.internet2.middleware.grouper.j2ee.status.GrouperStatusServlet.doGet(GrouperStatusServlet.java:180)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.owasp.csrfguard.CsrfGuardFilter.doFilter(CsrfGuardFilter.java:110)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:478)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:624)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:341)
    at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:478)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:798)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1441)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
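
Both messages above describe the same kind of test: each job is expected to have a recorded success within a per-job window (30 minutes for the change log job, 3120 minutes for the deprovisioning daemon). The sketch below illustrates that kind of check only. It is not the `DiagnosticLoaderJobTest` source, and the names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical sketch of a "recent success within a window" health check.
final class RecentSuccessCheck {
    // lastSuccess is null when no successful run has been recorded at all,
    // as appears to be the case for the second message (it has no "since").
    static boolean healthy(LocalDateTime lastSuccess, LocalDateTime now, Duration window) {
        if (lastSuccess == null) {
            return false;
        }
        return !lastSuccess.isBefore(now.minus(window));
    }
}
```

Applied to the first message, a last success at 2018/08/16 14:19:22 checked on the morning of 2018/08/24 falls far outside a 30-minute window, so the check fails until the job succeeds again.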

--

Ryan Rumbaugh
