perfsonar-user - [perfsonar-user] Problems abound

Subject: perfSONAR User Q&A and Other Discussion

  • From: Phil Reese <>
  • To: "" <>
  • Subject: [perfsonar-user] Problems abound
  • Date: Mon, 5 Apr 2021 10:57:15 -0700

Hi,

A colleague and I were adding a RabbitMQ archiver to a stable, though large, JSON file.  In the past, the addition worked and data passed to RabbitMQ, but over time it bogged down the MaDDash grid, where full horizontal grid lines would turn orange.  RabbitMQ was stopped and the original JSON file was re-published and pushed out. After a few hours things returned to normal.

With a few more revs of PS under our belts and the main grid running very stably for months, we tried to add the RabbitMQ archiver again.  The same thing happened: data was going to RabbitMQ, but over time horizontal grid lines would turn orange.  OK, not ready for use yet, so we re-published the long-working JSON file, confirmed the 23 edge nodes got the update, and waited.

We decided to abort the experiment when 9 lines went orange. However, once the original JSON was put back and a night had passed, even more horizontal lines had turned orange!

I tried to be patient and wait it out, but now, 36 hours after reverting to the known-working JSON, only 8 of 23 grid lines are working.

Do I just need to be more patient?

I have done some debugging research.

On the orange hosts, I've run 'pscheduler troubleshoot'.  On every one of them it gets to the very end but fails at the archive step.  All the green-line hosts pass the troubleshoot test fully!

The first result seemed to suggest the MaDDash system wasn't accepting job runs, but the second result says the archiving process is working.  Other dashboards reporting to the same MaDDash host and esmond DB have all continued to work, so it doesn't seem like a MaDDash host archive issue.

Looked at logs on several orange nodes and don't see 'ERROR's.  Did see this in psconfig-pscheduler-agent.log:

2021/04/05 09:21:03 WARN pid=2125 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=231 guid=11377194-9629-11EB-BF65-39029479FCEA msg=Problem adding test throughput(sups-yyy.stanford.edu->sups-xxx-east.stanford.edu), continuing with rest of config: Inactivity timeout

But these warnings are interspersed with log entries like these:

2021/04/05 10:08:27 INFO pid=2125 prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=239 guid=7B77CA56-9631-11EB-BF65-39029479FCEA msg=Added 0 new tasks, and deleted 1 old tasks

2021/04/05 10:08:27 INFO pid=2125 prog=main:: line=178 guid=7B77CA56-9631-11EB-BF65-39029479FCEA msg=Agent completed running
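
To compare orange and green nodes, it might help to tally these entries rather than eyeball them. Below is a minimal sketch, assuming every line in psconfig-pscheduler-agent.log follows the same layout as the three entries quoted above (date, time, level, key=value fields, then msg=...). The function name `summarize` and both regular expressions are my own illustration and not part of perfSONAR.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the log lines quoted above:
# "<date> <time> <LEVEL> pid=... prog=... line=... guid=... msg=<free text>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) .*?guid=(?P<guid>\S+) msg=(?P<msg>.*)$"
)

# Matches "Problem adding test <type>(<src>-><dst>) ...: <reason>".
PROBLEM_RE = re.compile(
    r"Problem adding test (?P<test>\w+)"
    r"\((?P<src>[^)]+?)->(?P<dst>[^)]+)\).*: (?P<reason>.+)$"
)


def summarize(lines):
    """Count entries per log level and per failed test addition.

    Hypothetical helper: lines that do not fit the assumed layout are skipped.
    """
    levels = Counter()
    failures = Counter()
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue
        levels[entry.group("level")] += 1
        problem = PROBLEM_RE.search(entry.group("msg"))
        if problem:
            key = (
                problem.group("test"),
                problem.group("src"),
                problem.group("dst"),
                problem.group("reason"),
            )
            failures[key] += 1
    return levels, failures


# The three entries quoted in this message, used as sample input.
SAMPLE = [
    "2021/04/05 09:21:03 WARN pid=2125 "
    "prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=231 "
    "guid=11377194-9629-11EB-BF65-39029479FCEA msg=Problem adding test "
    "throughput(sups-yyy.stanford.edu->sups-xxx-east.stanford.edu), "
    "continuing with rest of config: Inactivity timeout",
    "2021/04/05 10:08:27 INFO pid=2125 "
    "prog=perfSONAR_PS::PSConfig::PScheduler::Agent::_run_end line=239 "
    "guid=7B77CA56-9631-11EB-BF65-39029479FCEA "
    "msg=Added 0 new tasks, and deleted 1 old tasks",
    "2021/04/05 10:08:27 INFO pid=2125 prog=main:: line=178 "
    "guid=7B77CA56-9631-11EB-BF65-39029479FCEA msg=Agent completed running",
]

levels, failures = summarize(SAMPLE)
print(dict(levels))
for (test, src, dst, reason), count in failures.most_common():
    print(f"{count}x {test} {src} -> {dst}: {reason}")
```

To run it against a real node, pass an open file handle for that node's agent log to `summarize` in place of `SAMPLE`. Keying failures on the (test, source, destination, reason) tuple should make it easy to see whether the "Inactivity timeout" warnings cluster on particular test pairs or are spread across all of them.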

'nethogs' on the MaDDash host shows about equal send/receive traffic to and from the edge nodes, no more traffic to the RabbitMQ pathway and only some 20KB/sec of traffic total.

Happy to find other places to look!

Is there a way to dump the current schedule to allow a new JSON file to populate without having to wait for the old scheduler to clear out?

Thanks,
Phil



