
perfsonar-user - Re: [perfsonar-user] perfsonar server load high

Subject: perfSONAR User Q&A and Other Discussion

  • From: Andrew Lake
  • To: Zhi-Wei Lu
  • Subject: Re: [perfsonar-user] perfsonar server load high
  • Date: Tue, 19 Dec 2017 07:14:14 -0800

Hi,

Most of those errors look like a side effect of having a loaded host, not the cause. Are you running a large number of powstream tests on this host? If you run "ps auxw | grep powstream | wc -l", what do you get? The number should be roughly 4x the number of powstream tests your host is running, since each test leads to 4 processes. If it's significantly higher than that, you may have some other issue.
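As a rough sketch of that check, the snippet below counts powstream lines in ps output and compares the count with 4x the number of configured tests. The function names and the "slack" factor (how far above 4x counts as "significantly higher", default 2) are illustrative assumptions, not part of perfSONAR.

```shell
#!/bin/sh
# count_powstream: count lines on stdin that mention powstream.
# Reads `ps auxw`-style output on stdin so it can be tried without a live host.
# The [p] bracket keeps a grep process in the listing from matching itself.
count_powstream() {
    grep -c '[p]owstream' || true
}

# expected_powstream TESTS: rule of thumb from above, 4 processes per test.
expected_powstream() {
    echo $(( $1 * 4 ))
}

# check_powstream TESTS [SLACK]: read ps output on stdin and print a verdict.
# SLACK is a hypothetical tolerance (default 2): "high" means more than
# SLACK times the expected count.
check_powstream() {
    tests=$1
    slack=${2:-2}
    actual=$(count_powstream)
    expected=$(expected_powstream "$tests")
    if [ "$actual" -gt $(( expected * slack )) ]; then
        echo "high: $actual powstream processes, expected about $expected"
    else
        echo "ok: $actual powstream processes, expected about $expected"
    fi
}
```

On the host this would be run as "ps auxw | check_powstream 10", where 10 stands in for however many powstream tests are actually configured.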

The clean_esmond_db.log error looks like some type of cassandra issue. You might try looking in the logs under /var/log/cassandra for more information. It is also worth checking whether a cassandra process is running with "ps auxw | grep java". Sometimes cassandra has problems that in turn cause httpd to have issues and spawn too many processes.
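A minimal sketch of those two cassandra checks follows. The function names are made up for illustration, and the ERROR/Exception pattern and the *.log glob are assumptions about what the files under /var/log/cassandra look like; adjust them to what is actually there.

```shell
#!/bin/sh
# cassandra_running: read `ps auxw`-style output on stdin; succeed if some
# line mentions both java and cassandra. Neither grep process matches itself,
# since each one's own command line contains only one of the two words.
cassandra_running() {
    grep -i 'java' | grep -qi 'cassandra'
}

# recent_errors DIR [N]: print the last N (default 20) lines containing
# ERROR or Exception from the *.log files in DIR.
# The pattern and the glob are assumptions, not a documented log format.
recent_errors() {
    dir=$1
    n=${2:-20}
    cat "$dir"/*.log 2>/dev/null | grep -E 'ERROR|Exception' | tail -n "$n"
}
```

On the host this might be used as "ps auxw | cassandra_running || echo 'no cassandra process found'" followed by "recent_errors /var/log/cassandra". If no process is found, that would be consistent with the "Could not connect to localhost:9160" failure in the quoted log.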

Thanks,
Andy



On December 18, 2017 at 6:48:24 PM, Zhi-Wei Lu wrote:

While at today’s perfclub meeting, I noticed that our server to CENIC had a terrible throughput issue. I then noticed that the server's load was as high as 30+. There had been a few recent perfSONAR-related package updates. I rebooted the server, and once the system came back, it had high load right away. I wonder if anyone has seen a similar problem. In the owamp_bwctl log, I saw entries such as these:

 

 

Dec 18 15:25:04 melange owampd[20058]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:25:06 melange bwctld[21040]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:07 melange bwctld[21142]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:09 melange bwctld[21185]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:20 melange bwctld[21390]: FILE=sapi.c, LINE=391, BWLControlAccept(): Unable to read ClientGreeting message

Dec 18 15:25:31 melange owampd[20963]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:25:31 melange owampd[20963]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:25:31 melange owampd[20950]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:25:31 melange owampd[20950]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:08 melange owampd[21585]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:08 melange owampd[21585]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:08 melange owampd[21583]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:08 melange owampd[21583]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

Dec 18 15:26:35 melange owampd[22207]: FILE=protocol.c, LINE=1900, _OWPWriteStopSessions: called in wrong state.

Dec 18 15:26:35 melange owampd[22207]: FILE=owampd.c, LINE=806, Control session terminated abnormally...

There are many warnings in meshconfig-agent.log as well:

 

2017/12/18 15:19:00 (10386) WARN> perfsonar_meshconfig_agent:145 main::__ANON__ - Warned: Use of uninitialized value $address in exists at /usr/lib/perfsonar/bin/../lib/perfSONAR_PS/RegularTesting/Tests/BwctlBase.pm line 384.

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test trace(128.120.80.74->nautilus.sr.unh.edu), continuing with rest of config: 500 INTERNAL SERVER ERROR: Unable to determine participants: Process took too long to run.

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->tc1-teng8-2.net.ohio-state.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on tc1-teng8-2.net.ohio-state.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(tc1-teng8-2.net.ohio-state.edu->128.120.80.74): 400 BAD REQUEST: Can't find pScheduler or BWCTL on tc1-teng8-2.net.ohio-state.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->b06sr1-vlan254.tele.iastate.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on b06sr1-vlan254.tele.iastate.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(b06sr1-vlan254.tele.iastate.edu->128.120.80.74): 400 BAD REQUEST: Can't find pScheduler or BWCTL on b06sr1-vlan254.tele.iastate.edu

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test rtt(2607:f810:330:1ffe::f->perfsonar-011.net.berkeley.edu): 400 BAD REQUEST: Neither the source nor destination is running pScheduler.

 

2017/12/18 15:25:24 (10386) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for creation, skipping test throughput(128.120.80.74->nautilus.sr.unh.edu): 400 BAD REQUEST: Can't find pScheduler or BWCTL on nautilus.sr.unh.edu

 

There were also errors in clean_esmond_db.log:

query error for metadata_key=bdd71f21372749cf90d63c6544bda3df, event_type=time-error-estimates, summary_type=base, summary_window=0, begin_time=1476263098, end_time=1476349498, error=An attempt was made to connect to each of the servers twice, but none of the attempts succeeded. The last failure was TTransportException: Could not connect to localhost:9160

Error connecting to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)]

        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)

        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:268)

        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:151)

        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:121)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1276)

Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)]

        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:142)

        at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:204)

        at javax.naming.InitialContext.lookup(InitialContext.java:415)

        at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1928)

        at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1895)

        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)

        ... 4 more

Caused by: java.rmi.ConnectIOException: Exception creating connection to: 127.0.0.1; nested exception is:

        java.net.SocketException: Network is unreachable (connect failed)

        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:631)

        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)

        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:338)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:112)

 

Please let me know if you know of a solution to this problem.  Thank you.

 

Zhi-Wei Lu

IET-CR-Network Operations Center

University of California, Davis

(530) 752-0155

 



