grouper-users - Re: [grouper-users] LDAP timeouts after Java upgrade

Subject: Grouper Users - Open Discussion List

  • From: Baron Fujimoto <>
  • To: "Oulman,James F" <>
  • Cc: "" <>
  • Subject: Re: [grouper-users] LDAP timeouts after Java upgrade
  • Date: Wed, 20 May 2020 15:50:15 -1000

On Tue, May 19, 2020 at 03:52:30AM +0000, Oulman,James F wrote:
My understanding is that the default idle timeout on the F5 tcp profile is 300
seconds[1]. On our Shibboleth IdPs, we run the LDAP validation at 60s, and that
resolved the dropped-connection errors in our logs.

1. https://support.f5.com/csp/article/K13004262

On 5/18/20, 9:55 PM, " on behalf of Baron
Fujimoto" < on behalf of >
wrote:

On Tue, May 05, 2020 at 05:22:17PM -1000, Baron Fujimoto wrote:
>On Tue, Apr 28, 2020 at 10:14:55AM +0100, Robert Bradley wrote:
>>
>>On 28/04/2020 02:27, Baron Fujimoto wrote:
>>>We're running Grouper 2.2.2 with LDAP (389DS) as a subject source.
>>>We were previously using Java 1.8.0_212 successfully. However, I
>>>recently upgraded the instance to use the current version of Java
>>>(251), and after doing so noticed that while it initially appears
>>>to work as expected, the LDAP connections eventually begin to time
>>>out with the following error:
>>>
>>>javax.naming.NamingException: LDAP response read timed out,
>>>timeout used:-1ms
>>>
>>>The timeouts start to occur after ~20 minutes. Netstat shows no
>>>open connections to our LDAP at that point.
>>>
>>>The grouper host is actually a node in a cluster behind a load
>>>balancer, but our lb admins can't find any relevant ~20 minute
>>>timeout value there.
>>>
>>>I've empirically determined that this appears to happen with a
>>>version of Java 8 higher than 221 (i.e. 231, 241, 251). I didn't
>>>see anything in the JDK release notes for 231 that appears to be
>>>relevant.
>>>
>>><https://www.oracle.com/technetwork/java/javase/8u-relnotes-2225394.html>
>>>
>>>
>>One thought is that it could be a similar JNDI bug to that described
>>in https://wiki.shibboleth.net/confluence/display/IDP30/LDAPonJava%3E8
>>and https://issues.shibboleth.net/jira/browse/IDP-1441 . The problem
>>with that is that the fix probably involves upgrading Grouper to get
>>the ldaptive library instead of vt-ldap, and then configuring it to
>>use the UnboundID library instead of JNDI. I doubt that's a practical
>>option in the short term unless vt-ldap has a similar setting you can
>>try.
>
>FWIW, I also saw similar behavior when I attempted the same upgrade
>in one of our CAS deployments, though the timeout errors happen rather
>quickly there, so it doesn't appear to be Grouper-specific. AFAIK, the
>version of CAS involved is using an ldaptive library, so that suggests
>that the problem may not lie with the vt-ldap library. Unfortunately my
>searches don't turn up any evidence that this has been a problem for
>others, so I'm kind of at a loss now. :/

We're still wrestling with this, but have uncovered a few more details in
case they provide any new insight into the problem.

1) Our LDAP is actually a cluster behind an F5 load balancer. If we point
CAS or Grouper at a non-load-balanced LDAP host, we do not see the timeout
problem. It appears that both JDK 8u231+ *and* LDAP behind the load balancer
are necessary conditions to trigger the timeout error.

So clearly it's some interaction between the two, possibly some half-closed
or zombie connection from the load balancer that the upgraded Java is
not dealing with properly.

2) We've empirically determined that if we shorten the default value for
the LDAP pool validation from 600s to, say, 60s for CAS then this also
mitigates the timeout problem. The shortened pool validation period seems to
be sufficient to function as some sort of keepalive.

I tried something similar for Grouper, setting the validateTimerPeriod to
a very short value with the following in our ldap.properties:

edu.vt.middleware.ldap.pool.validatePeriodically = true
edu.vt.middleware.ldap.pool.validateTimerPeriod = 60

It seems to have the same mitigating effect for the Grouper UI, but not
for the Grouper WS as far as I can tell.
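
To make the keepalive idea concrete, here is a minimal sketch in Python. It is
not vt-ldap or ldaptive code; FakeConnection and max_idle_gap are hypothetical
names used only for illustration. It models an otherwise idle pool and shows
that periodic validation caps how long any connection goes without traffic at
the validation period. The sketch treats the period as a plain number of
seconds; the unit vt-ldap expects for validateTimerPeriod is not stated in this
thread.

```python
from dataclasses import dataclass


@dataclass
class FakeConnection:
    """Stand-in for a pooled LDAP connection (hypothetical, not a real API)."""
    last_traffic: float = 0.0  # time of the last packet sent on this connection

    def validate(self, now: float) -> None:
        # A real pool would issue a cheap LDAP operation here; for this
        # sketch we only record that traffic happened at time `now`.
        self.last_traffic = now


def max_idle_gap(validate_period: float, duration: float, pool_size: int = 3) -> float:
    """Run periodic validation on an otherwise idle pool and return the
    longest stretch any connection went without traffic."""
    pool = [FakeConnection() for _ in range(pool_size)]
    longest = 0.0
    now = 0.0
    while now + validate_period <= duration:
        now += validate_period
        for conn in pool:
            longest = max(longest, now - conn.last_traffic)
            conn.validate(now)
    return longest


print(max_idle_gap(60, 3600))   # longest idle gap with a 60 s period
print(max_idle_gap(600, 3600))  # longest idle gap with a 600 s period
```

In this model the longest idle gap equals the validation period, which is why a
shorter period can act as a keepalive for connections that pass through
something with an idle timeout.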

[bringing response back to bottom since that's been the general MO for this
thread]

Our F5 admins have provided the following additional information regarding the
configuration of our load balancer fronting the LDAP cluster:

F5 is configured with nPath (aka DSR, Direct Server Return)

Packet paths (where client=CAS, Server=LDAP):
- Client -> F5 -> Server
- Return packet is Server -> Client
- bypasses F5, so F5 doesn't see full 3-way TCP handshake (SYN, SYN-ACK, ACK)
- F5 sees SYN, ACK from client to server, but not SYN-ACK from server to client (sent via DSR)
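
The packet paths above can be sketched as a small Python model. This is only an
illustration of the description given by the F5 admins, not anything taken from
the F5 configuration; HANDSHAKE and seen_by_load_balancer are hypothetical
names.

```python
# Each handshake packet as (name, sender, receiver).
HANDSHAKE = [
    ("SYN", "client", "server"),
    ("SYN-ACK", "server", "client"),
    ("ACK", "client", "server"),
]


def seen_by_load_balancer(packets):
    """Under nPath/DSR, only client-to-server packets pass through the F5;
    server-to-client replies go straight back to the client."""
    return [name for name, sender, _receiver in packets if sender == "client"]


print(seen_by_load_balancer(HANDSHAKE))  # ['SYN', 'ACK']
```

Because the load balancer only ever sees one direction of the conversation in
this model, it has no view of the server's side of the connection state, which
fits with it relying on an idle timeout instead.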

F5 relies on idle timeout to server's Virtual IP (VIP)
- Timeout value = 51 seconds

The idle timeout was selected based on recommendations in the following
reference:
- https://techdocs.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm-implementations-12-1-0/4.html
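
As a rough check of the numbers mentioned in this thread, the following Python
sketch compares a pool validation period against a load balancer idle timeout.
It assumes a deliberately simple model in which an idle flow is dropped once no
client packet has been seen for the idle timeout; stays_below_idle_timeout is a
hypothetical helper, not part of any library.

```python
def stays_below_idle_timeout(validate_period_s: float, idle_timeout_s: float) -> bool:
    """True if an otherwise idle connection is touched by validation more
    often than the load balancer's idle timeout expires (simplified model)."""
    return validate_period_s < idle_timeout_s


# (validation period, idle timeout) pairs taken from values quoted in the thread.
for period, timeout in [(60, 300), (600, 300), (60, 51)]:
    print(period, timeout, stays_below_idle_timeout(period, timeout))
```

Under this model a 60s validation period fits within a 300s idle timeout, while
600s does not. A 60s period does not fit within the 51s timeout reported above,
even though 60s did mitigate the problem for CAS, so the model is probably
missing part of the picture.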

I also now have a support case open with Oracle, but no real movement there
yet.

--
UH Information Technology Services : Identity & Access Mgmt, Middleware
minutas cantorum, minutas balorum, minutas carboratum desendus pantorum


