
grouper-users - Re: [cifer-prov] Re: [grouper-users] monitoring a message bus ....

Subject: Grouper Users - Open Discussion List

  • From: "Bee-lindgren, Bert A" <>
  • To: Patrick Radtke <>
  • Cc: Keith Hazelton <>, Grouper-Users <>, , Steven Carmody <>
  • Subject: Re: [cifer-prov] Re: [grouper-users] monitoring a message bus ....
  • Date: Thu, 18 Apr 2013 10:07:42 -0400 (EDT)

Georgia Tech's IAM group uses queues implemented in both Rabbit and a
database. The database one is more mature and has lots of monitoring and
stats.

Here's what we've seen:

BUS/QUEUES:
Focus on the queues/bus as a monitoring safety net, and look specifically for
stuck and/or erroring queues.
The most accurate monitoring we've implemented, with zero false positives, is
the following:
Look for queues that have held data for several minutes with zero
completions. We do this 24x7 and generally don't wake people up
unnecessarily. (knocking on wood)
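
As a rough illustration of that check (a minimal sketch, not our actual
implementation), assume something samples each queue once a minute and
records how many messages are waiting and how many completed since the last
sample. The QueueSample type, the stuck_queues function, and the
five-sample window are all hypothetical names and values chosen for the
example:

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    name: str
    pending: int    # messages waiting when the sample was taken (assumed metric)
    completed: int  # completions since the previous sample (assumed metric)


def stuck_queues(samples_by_queue, min_samples=5):
    """Return names of queues that held data in each of the last
    `min_samples` samples while completing nothing."""
    stuck = []
    for name, samples in samples_by_queue.items():
        window = samples[-min_samples:]
        if len(window) < min_samples:
            continue  # not enough history yet to call the queue stuck
        if all(s.pending > 0 and s.completed == 0 for s in window):
            stuck.append(name)
    return sorted(stuck)
```

An empty queue, or a queue with even one completion in the window, is never
flagged, which is why a check of this shape avoids the false positives that
backlog size alone produces.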

We've looked at queue backlogs as well, but there are too many
false-positives for alerting. Of course, if there's a special integration,
then you can alert on its backlogs.

We have a database table that describes the queues and their respective
thresholds and 24x7 vs 16x7 vs no-alert characteristics.
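
A table along those lines could be sketched as follows. This is only an
illustration using an in-memory SQLite database: the table name, column
names, and the 06:00-22:00 window used for "16x7" are assumptions made for
the example, not our real schema.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: one row per queue, with its stuck threshold and coverage.
SCHEMA = """
CREATE TABLE queue_monitor (
    queue_name    TEXT PRIMARY KEY,
    stuck_minutes INTEGER NOT NULL,
    coverage      TEXT NOT NULL CHECK (coverage IN ('24x7', '16x7', 'none'))
)
"""


def pages_now(coverage, now, day_start=6, day_end=22):
    """Decide whether a queue with this coverage may page someone at `now`.
    The 06:00-22:00 window for 16x7 is an assumed value."""
    if coverage == "24x7":
        return True
    if coverage == "16x7":
        return day_start <= now.hour < day_end
    return False  # 'none': record the problem but never alert


def queues_to_page(conn, stuck_minutes_by_queue, now):
    """Return queues stuck for at least their threshold and alertable now."""
    rows = conn.execute(
        "SELECT queue_name, stuck_minutes, coverage FROM queue_monitor"
    ).fetchall()
    return sorted(
        name
        for name, threshold, coverage in rows
        if stuck_minutes_by_queue.get(name, 0) >= threshold
        and pages_now(coverage, now)
    )
```

Keeping thresholds and coverage in data rather than in code means a new
integration only needs a new row.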


The downsides of Bus/Queue monitoring that we have seen are:
1) Delay... The thresholds tend to be 15-120 minutes. That might be too long
a delay before knowing a key integration is broken.
2) Triage/Diagnostics... The bus tells you something is broken, but doesn't
describe the core problem. Who does the NOC call? What does the callee look
at to find the problem?


PROCESSES:
We also alert on missing queue-reading processes, stale log files, special
log messages, unresponsive hosts, etc... However, these checks cover such a
large surface area that they're not as consistently monitored.

The good news is that you can implement fast notifications (easily
1-10 minutes) as well as provide diagnostic clues.
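
For example, a stale-log-file check can be very small. The sketch below is
an assumption about how such a check might look, not a description of our
tooling; is_stale and its parameters are hypothetical names:

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if the file is missing or was last modified more than
    `max_age_seconds` ago. `now` can be passed in to make testing easier."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable log is treated as stale
    return (now - mtime) > max_age_seconds
```

Run every minute with a limit of a few minutes, a check like this alerts far
faster than the queue thresholds, and the path in the alert tells the callee
where to start looking.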


So, in summary, I'd suggest:
1) Implement across-the-board monitoring focused on the queuing system.
2) Implement other monitoring for diagnostic/triage info and for where you
need fast alerts.




Bert Bee-Lindgren, Identity Management & Middleware
IT/EIS :: Georgia Tech ::
811 Marietta, Across from Richard Tanner (Cube 230 on Fridays)
W: 877-237-8251 :: SMS: 402-237-8251 :: AIM: BertBeeLindgren
https://mail.gatech.edu/home/bl17?fmt=freebusy (my availability)

----- Original Message -----
> From: "Patrick Radtke" <>
> To: "Steven Carmody" <>
> Cc: "Keith Hazelton" <>, "Grouper-Users" <>,
> Sent: Tuesday, April 9, 2013 7:40:52 PM
> Subject: Re: [cifer-prov] Re: [grouper-users] monitoring a message bus ....
>
> Hi Steven,
>
> We use an ESB for communicating person and account changes between
> interested departments. There are countless ways for things to fail.
>
> A large portion of these failures can be detected by monitoring the
> ESB. The group running our ESB monitors, at the ESB, the number of
> unprocessed messages in the durable queue/topic for each listener and
> the time since a given listener last connected. Alerts are sent if
> they are over certain thresholds. This is useful in detecting:
> - listeners that aren't running
> - listeners that are running but where the JMS connection is hung
> - listeners that are thrashing on a 'bad' message (e.g. the listener
> has an exception processing the message before it acknowledges the
> message, and continually re-receives the same message)
> - listeners that are so badly written they deadlock themselves
>
> The caveat for this setup is that you need enough message volume for
> the thresholds to be reached if a listener is broken.
>
> We haven't managed to set our thresholds to the right sweet spot, but
> I do believe monitoring at the ESB level can detect a wide range of
> listener issues.
>
> -Patrick
>
>
>
> On 4/9/13 10:16 AM, Keith Hazelton wrote:
> > I'm going to cross-post this to CIFER-Provisioning and Integration.
> >
> > Curious to know specifics: Are you using an ESB? Just a JMS queue?
> > Which products? What do the message recipients do, since most
> > target systems don't come knowing how to do this?
> >
> > --Keith
> > ______________________
> > On 2013-04-09, at 12:03 , Steven Carmody wrote:
> >
> >> Hi,
> >>
> >> I know my question is only tangentially related to Grouper, but at
> >> least there's a link, even if it's weak. Thanks for your patience
> >> with this question!
> >>
> >> Brown replicates group memberships from Grouper to several
> >> different target systems: ldap, Google, and our LMS. We expect to
> >> add other targets over time. When changes occur in Grouper, a msg
> >> is placed on a msg bus. A listener picks up that msg, and has a
> >> set of rules telling it the one or more targets that the msg
> >> should be forwarded to.
> >>
> >> As we become more and more reliant on this infrastructure, we're
> >> asking ourselves what we should monitor with respect to the bus.
> >> We're keenly interested in the experience of other sites with
> >> respect to what sorts of problems they've encountered with a bus,
> >> and what sort of monitoring we should implement.
> >>
> >> Is it enough to just make sure that the bus is delivering msgs
> >> (i.e. have a separate Q used by the monitoring software)? Or do we
> >> need to build monitoring into all the Listeners, to make sure
> >> that they are all still processing msgs? Or other approaches?
> >>
> >> Thanks in advance for sharing your experience and suggestions!
> >
>
>


