
perfsonar-user - RE: [perfsonar-user] perfsonar server load high

Subject: perfSONAR User Q&A and Other Discussion

  • From: Zhi-Wei Lu <>
  • To: Andrew Lake <>, "" <>
  • Subject: RE: [perfsonar-user] perfsonar server load high
  • Date: Tue, 19 Dec 2017 17:29:03 +0000

Hi Andrew,

 

After I posted the email, the server load came down about one hour after the reboot. Right after the reboot, the load stayed high, in the 30+ range. The system load is currently in the 1-2 range. As a consequence, the throughput tests are working properly (with the numbers that I expect).

We have about 125 powstream processes. I tried to trim old, obsolete entries from the configuration, but the web page for that does not respond.

Do you know what daemons I should restart?

 

Thank you.

 

Zhi-Wei Lu

IET-CR-Network Operations Center

University of California, Davis

(530) 752-0155

 

From: [mailto:] On Behalf Of Andrew Lake
Sent: Tuesday, December 19, 2017 7:14 AM
To: Zhi-Wei Lu <>;
Subject: Re: [perfsonar-user] perfsonar server load high

 

Hi,

 

Most of those errors look like a side effect of having a loaded host, not the cause. Are you running a large number of powstream tests on this host? If you do a "ps auxw | grep powstream | wc -l", what do you get? The number should be roughly 4x the number of powstream tests your host is running, since each test leads to 4 processes. If it's significantly higher than that, you may have some other issue.
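A minimal sketch of that check, assuming the Linux `ps auxw` column layout (command in the eleventh column) and that the processes appear under the command name `powstream`. The function names are hypothetical, and the 4-processes-per-test ratio is the one quoted above. Unlike the plain `grep` pipeline, it does not count the `grep` process itself.

```python
import os
import subprocess

PROCESSES_PER_TEST = 4  # ratio quoted in the message above


def count_powstream(ps_output):
    """Count powstream processes in the text output of `ps auxw`."""
    count = 0
    for line in ps_output.splitlines():
        # Assumed layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10].split()
        # Match on the executable name only, so "grep powstream" is ignored.
        if command and os.path.basename(command[0]) == "powstream":
            count += 1
    return count


def estimated_tests(process_count):
    """Rough number of powstream tests implied by a process count."""
    return process_count / PROCESSES_PER_TEST


def current_ps_output():
    """Return the output of `ps auxw` on the local host."""
    result = subprocess.run(
        ["ps", "auxw"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the host itself, `estimated_tests(count_powstream(current_ps_output()))` gives the rough test count to compare against the configured number of tests.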

 

The clean_esmond_db.log error looks like some type of cassandra issue. You might try looking in the logs under /var/log/cassandra for more information, and it might also be worth checking whether you have a cassandra process running with "ps auxw | grep java". Sometimes cassandra can have problems, which in turn can cause httpd to have issues and spawn too many processes.
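Besides looking for the java process, one could also test whether anything accepts connections on the port named in the clean_esmond_db.log error (localhost:9160). The following is only a sketch using the Python standard library; the function name and the 2-second timeout are assumptions, and an open port does not prove that cassandra is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example call for the port from the error message:
#   port_is_open("localhost", 9160)
```

A `False` result here would be consistent with the "Could not connect to localhost:9160" error; the logs under /var/log/cassandra remain the place to look for the reason.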

 

Thanks,

Andy

 

 

 

On December 18, 2017 at 6:48:24 PM, Zhi-Wei Lu () wrote:

While at today’s perfclub meeting, I noticed that our server to CENIC had a terrible throughput issue. I then noticed that our server had a load as high as "30+". Since there were a few recent perfsonar-related package updates, I rebooted the server; once the system came back, it had high load right away. I wonder if anyone has seen a similar problem. In the owamp_bwctl log, I saw entries such as these:

 

 

Dec 18 15:25:04 melange owampd[20058]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:25:06 melange bwctld[21040]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:07 melange bwctld[21142]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:09 melange bwctld[21185]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:20 melange bwctld[21390]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:31 melange owampd[20963]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:25:31 melange owampd[20963]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:25:31 melange owampd[20950]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:25:31 melange owampd[20950]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:08 melange owampd[21585]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:08 melange owampd[21585]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:08 melange owampd[21583]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:08 melange owampd[21583]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:35 melange owampd[22207]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:35 melange owampd[22207]: FILE=owampd.c, LINE=806, Control session terminated abnormally...
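To see which messages dominate a log like this, the entries can be tallied per daemon and message. This is a rough sketch that assumes every line follows the format shown above (syslog timestamp, host, daemon[pid], then FILE=, LINE= and the message); the pattern and the function name are illustrative only.

```python
import re
from collections import Counter

# Pattern for lines such as:
# Dec 18 15:25:04 melange owampd[20058]: FILE=owampd.c, LINE=806, <message>
LOG_LINE = re.compile(
    r"^\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \S+ (?P<daemon>\w+)\[\d+\]: "
    r"FILE=(?P<file>[^,]+), LINE=(?P<line>\d+), (?P<message>.*)$"
)


def tally(lines):
    """Count log entries per (daemon, message) pair; skip unparsable lines."""
    counts = Counter()
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match:
            counts[(match["daemon"], match["message"])] += 1
    return counts
```

Calling `tally(open(path))` on the log file and then `.most_common()` on the result lists the most frequent messages first; `path` stands for wherever the owamp_bwctl log lives on the host.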

In meshconfig-agent.log, there are many warnings as well:

 

2017/12/18 15:19:00 (10386) WARN> perfsonar_meshconfig_agent:145 main::__ANON__ - Warned: Use of uninitialized value $address in exists at /usr/lib/perfsonar/bin/../lib/perfSONAR_PS/RegularTesting/Tests/BwctlBase.pm line 384.

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test trace(128.120.80.74->nautilus.sr.unh.edu), continuing with rest of config: 500 INTERNAL SERVER ERROR: Unable to determine participants: Process took too long to run.

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->tc1-teng8-2.net.ohio-state.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on tc1-teng8-2.net.ohio-state.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(tc1-teng8-2.net.ohio-state.edu->128.120.80.74): 400 BAD REQUEST: Can't find pScheduler or BWCTL on tc1-teng8-2.net.ohio-state.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->b06sr1-vlan254.tele.iastate.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on b06sr1-vlan254.tele.iastate.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(b06sr1-vlan254.tele.iastate.edu->128.120.80.74): 400 BAD REQUEST: Can't find pScheduler or BWCTL on b06sr1-vlan254.tele.iastate.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test rtt(2607:f810:330:1ffe::f->perfsonar-011.net.berkeley.edu): 400 BAD REQUEST: Neither the source nor destination is running pScheduler.

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->nautilus.sr.unh.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on nautilus.sr.unh.edu
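The "Can't find pScheduler or BWCTL on ..." warnings name the remote hosts that could not be reached for scheduling. A small sketch that collects those host names from the log, assuming the message wording stays exactly as shown above (the function name is hypothetical):

```python
import re

NO_PSCHEDULER = re.compile(r"Can't find pScheduler or BWCTL on (?P<host>\S+)")


def hosts_without_pscheduler(lines):
    """Return the sorted, de-duplicated hosts named in such warnings."""
    hosts = set()
    for line in lines:
        match = NO_PSCHEDULER.search(line)
        if match:
            hosts.add(match["host"])
    return sorted(hosts)
```

The resulting list is only a starting point: a host may be temporarily down rather than obsolete, so each entry would need checking before it is removed from the configuration.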

 

There were also errors in clean_esmond_db.log:

query error for metadata_key=bdd71f21372749cf90d63c6544bda3df, event_type=time-error-estimates, summary_type=base, summary_window=0, begin_time=1476263098, end_time=1476349498, error=An attempt was made to connect to each of the servers twice, but none of the attempts succeeded. The last failure was TTransportException: Could not connect to localhost:9160

Error connecting to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)]

        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)

        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:268)

        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:151)

        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:121)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1276)

Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)]

        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:142)

        at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:204)

        at javax.naming.InitialContext.lookup(InitialContext.java:415)

        at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1928)

        at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1895)

        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)

        ... 4 more

Caused by: java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)

        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:631)

        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)

        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:338)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:112)

 

Please let me know if you know a solution to this problem.  Thank you.

 

Zhi-Wei Lu

IET-CR-Network Operations Center

University of California, Davis

(530) 752-0155

 



