perfsonar-user - [perfsonar-user] Fwd: Re: Problems with perfSonar instance at UKI-SOUTHGRID-CAM-HEP

Subject: perfSONAR User Q&A and Other Discussion

  • From: John Hill <>
  • To:
  • Subject: [perfsonar-user] Fwd: Re: Problems with perfSonar instance at UKI-SOUTHGRID-CAM-HEP
  • Date: Fri, 11 Jan 2019 10:47:51 +0000

Hello,
Marian Babik suggested that I forward this email exchange. I have rebooted the node recently, which does not seem to have improved things. This is an old server running both latency and throughput tests; however, this has not been a problem before.

Thanks,

John Hill

-------- Forwarded Message --------
Subject: Re: Problems with perfSonar instance at UKI-SOUTHGRID-CAM-HEP
Date: Fri, 11 Jan 2019 09:11:31 +0000
From: Marian Babik
<>
To: John Hill
<>
CC: wlcg-perfsonar-support (WLCG perfSONAR support mailing list) <>

Hi John,
I think it’s worth reporting this to the developers, could you please forward the mail to "" <> ? Mark Feit should be able to help debug if there is a real issue with performance after 4.1.5 (but other nodes on 4.1.5 that we monitor look good, so maybe it’s something specific).

I noticed that the toolkit web page shows only 35 entries, but the auto-URL for the host shows a lot more hosts to be tested, as the node was added to the ATLAS and LHCb meshes; I am unsure when exactly this happened (http://psconfig.opensciencegrid.org/pub/auto/serv04.hep.phy.cam.ac.uk - note it’s JSON, and you'll need to scroll to the end to see which meshes the node participates in). The ATLAS mesh alone has 66 hosts (though other meshes probably don’t add additional tests, as there is quite some overlap), so it’s possible that the node is simply too busy trying to test all this and the upgrade somehow made it worse (if so, the load should be mainly on I/O, as the tests are continuous and cause a lot of writes to the DB).
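
For a quick overview of a large JSON document like the one behind the auto-URL, it may help to print how many entries each top-level section holds rather than scrolling through it. The following is only a rough Python sketch: the function name `section_sizes` is made up for illustration, and the key names in the sample are hypothetical, not taken from the actual file.

```python
import json
from typing import Dict


def section_sizes(document: str) -> Dict[str, int]:
    """Map each top-level key of a JSON object to the number of entries it holds."""
    data = json.loads(document)
    sizes = {}
    for key, value in data.items():
        # Objects and arrays report their length; scalars count as one entry.
        sizes[key] = len(value) if isinstance(value, (dict, list)) else 1
    return sizes


# Hypothetical sample; the real document's keys may differ.
SAMPLE = '{"addresses": {"host-a": {}, "host-b": {}}, "tasks": {"t1": {}}}'
print(section_sizes(SAMPLE))
```

To use it on the real document, save the auto-URL output to a file and pass the file's contents to `section_sizes`.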

Let me try to remove the host from the ATLAS and LHCb meshes and see if that improves the situation. Rebooting the node should help speed up the recovery.

Thanks,
Marian



On Jan 10, 2019, at 4:26 PM, John Hill
<>
wrote:

Hello,
A few days before Christmas, our perfSonar instance started to misbehave.
The symptoms are many more processes than usual, with a much higher load
average than normal, and a failure to publish to MadDash. I can't see
anything obviously wrong, but then I know very little about the inner
workings of perfSonar. The local campus network people claim that they
haven't changed anything.
The problem showed up about 24 hours after the host was updated to 4.1.5 -
is this new version more resource hungry? I see quite a few errors in
/var/log/pscheduler/pscheduler.log of the type

Failed to post run for result: Database connection pool exhausted. Unable to
get connection after 60 attempts.

though there are also a lot of successful postings of results.
This host does both latency and throughput tests, but that has not been an
issue in the past. Any ideas on how to debug this would be welcome.
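
One way to gauge how often the pool error occurs is to count the matching lines in the log. The following is a minimal Python sketch, assuming each error is written as a single log line containing the phrase quoted above. The function names are illustrative and not part of pScheduler.

```python
from typing import Iterable

# Phrase taken from the error quoted in the message above.
POOL_ERROR = "Database connection pool exhausted"


def count_pool_errors(lines: Iterable[str]) -> int:
    """Count log lines that report an exhausted database connection pool."""
    return sum(1 for line in lines if POOL_ERROR in line)


def count_pool_errors_in_file(path: str) -> int:
    """Count pool-exhaustion lines in a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return count_pool_errors(handle)
```

For example, `count_pool_errors_in_file("/var/log/pscheduler/pscheduler.log")` would report the total for the log mentioned above. Running it before and after a change such as removing the host from a mesh gives a rough indication of whether the change helped.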

Thanks,

John Hill



