Re: [perfsonar-user] Fwd: Re: Problems with perfSonar instance at UKI-SOUTHGRID-CAM-HEP


  • From: Mark Feit <>
  • To: John Hill <>, "" <>
  • Cc: Marian Babik <>
  • Subject: Re: [perfsonar-user] Fwd: Re: Problems with perfSonar instance at UKI-SOUTHGRID-CAM-HEP
  • Date: Fri, 11 Jan 2019 15:30:54 +0000

Marian Babik and John Hill write:

I noticed that the toolkit web page shows only 35 entries, but the
auto-URL for the host shows a lot more hosts to be tested as the node
was added to the ATLAS and LHCb meshes - unsure when exactly this
happened...

This would be a good thing to run down, since a change like that will put
significantly more load on the system. ATLAS and LHCb are running the
largest meshes I know of and are often where we find the hairy edges of
what perfSONAR can do.

> On Jan 10, 2019, at 4:26 PM, John Hill <> wrote:
>
> The problem showed up about 24 hours after the host was updated to
> 4.1.5 - is this new version more resource hungry?

It shouldn't be. The last release to introduce anything new was 4.1, and
as pScheduler goes, that was almost a non-event. Everything since has been
bugfixes and very minor improvements.

> I see quite a few errors in /var/log/pscheduler/pscheduler.log of the type
>
> Failed to post run for result: Database connection pool exhausted.
> Unable to get connection after 60 attempts.

Some under-the-hood insight: One of the internal parts of pScheduler is
called the runner, which is responsible for overseeing the execution of
measurements and storing the results. It's multithreaded and maintains a
pool of connections to the PostgreSQL database that the threads can
acquire, use, and return as needed. The size of that pool depends on the
maximum number of connections available on the database server. We
currently have that set at 500, and the runner takes half, leaving the
rest for other programs that use the database. The pool will wait up to a
minute for a connection to become available, which makes it more resilient
when demand spikes enough to exhaust it.
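
For illustration, the acquire-with-retry behavior works roughly like the
sketch below. This isn't pScheduler's actual source; it's a minimal Python
sketch that assumes psycopg2 and a placeholder DSN, with the pool size and
attempt count taken from the numbers above.

    # Rough sketch of acquiring a pooled connection with retries.
    # Not pScheduler's real code; the DSN is a placeholder.
    import time
    import psycopg2
    from psycopg2 import pool

    MAX_ATTEMPTS = 60   # roughly a minute of waiting, one attempt per second
    POOL_SIZE = 250     # about half of a 500-connection server limit

    db_pool = pool.ThreadedConnectionPool(
        minconn=1, maxconn=POOL_SIZE,
        dsn="dbname=pscheduler")          # placeholder DSN

    def get_connection():
        """Wait up to MAX_ATTEMPTS seconds for a free connection."""
        for _ in range(MAX_ATTEMPTS):
            try:
                return db_pool.getconn()
            except pool.PoolError:
                time.sleep(1)             # pool exhausted; wait and retry
        raise RuntimeError("Database connection pool exhausted. "
                           "Unable to get connection after %d attempts."
                           % MAX_ATTEMPTS)

    # Typical use: acquire, do the work, and always return the connection.
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
    finally:
        db_pool.putconn(conn)

The important point is that a slow database makes every acquisition in
that loop wait longer, and the retries only postpone the error you're
seeing in the log.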

A large-enough workload can make exhaustion of the pool a regular event.
Most measurements don't contribute too much because they're either relatively
infrequent (RTT, trace) or self-regulating (throughput, which doesn't run
more than one at a time). Because it produces a continuous stream of
results, each streaming latency task causes its corresponding thread in the
runner to be almost always in possession of a connection from the pool. As
Marian pointed out, the process of getting measurements stored is I/O-bound,
so it's entirely possible that the database isn't going fast enough to
prevent pool exhaustion. Raising the connection limit might help, but the
other possibility is that a system problem is reducing I/O throughput enough
that threads hold onto connections longer than they usually would, exhausting
the pool.
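
If you want to see how close the server is getting to its limit before
raising it, a quick check like the one below will show it. This is just a
sketch; the DSN is a placeholder for whatever lets you reach the
pScheduler database.

    # Sketch of a quick check of connection usage against max_connections.
    # The DSN below is a placeholder; substitute real credentials.
    import psycopg2

    conn = psycopg2.connect(dsn="dbname=pscheduler")   # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW max_connections")
            max_conn = int(cur.fetchone()[0])
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            in_use = cur.fetchone()[0]
        print("connections in use: %d of %d" % (in_use, max_conn))
    finally:
        conn.close()

If that count sits near the limit most of the time, raising
max_connections in postgresql.conf is the knob to turn; if it only spikes
when the disk is struggling, fixing the I/O problem is the better bet.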

The lowest-hanging fruit would be to check the system's general health,
especially if it's an old machine. Make sure the kernel isn't complaining
about I/O retries or memory problems. Check that the swap space isn't in
heavy use. If the disk is RAIDed, make sure the controller isn't spending a
lot of time reconstructing data because of a failed disk that wasn't replaced.
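
If it helps, here's a rough sketch of the swap check in Python; the
threshold is arbitrary, and dmesg plus your RAID controller's own tools
cover the other two checks.

    # Rough sketch: is swap under heavy use?  Parses /proc/meminfo on Linux.
    def swap_usage():
        """Return (used_kB, total_kB) from /proc/meminfo."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])   # values are in kB
        total = info["SwapTotal"]
        return total - info["SwapFree"], total

    used, total = swap_usage()
    if total and used / total > 0.5:                # arbitrary "heavy" threshold
        print("Swap is under heavy use: %d of %d kB" % (used, total))
    else:
        print("Swap looks OK: %d of %d kB used" % (used, total))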

Hope that helps.

--Mark




