perfsonar-user - Re: [perfsonar-user] Some questions about a slide

Subject: perfSONAR User Q&A and Other Discussion

From: Eli Dart <>
To: "Wu, Xiaoban" <>
Cc: "" <>
Subject: Re: [perfsonar-user] Some questions about a slide
Date: Fri, 28 Aug 2015 09:14:21 -0700

Hi Xiaoban,

See inline, below...

On Fri, Aug 28, 2015 at 8:40 AM, Wu, Xiaoban <> wrote:

Dear Eli,

Thanks so much for your reply. But I still have some questions.

First, in the graph at http://fasterdata.es.net/network-tuning/tcp-issues-explained/packet-loss/ , we can see that there is a steep drop from 0ms to 10ms for every plot except the pink one, but it says that "TCP can still perform OK over such distances". Is this a contradiction? If not, could you please explain why in more detail? Thanks so much.

This all depends on your definition of OK. For a 10G DTN, single-stream performance of 5Gbps is typically fine, especially since modern data transfer tools (e.g. Globus) use multiple parallel streams. So, from a practical production DTN perspective, users might (depending on the environment) be happy with that performance level for short-distance transfers. However, there would still be a problem in the network that could cripple long-distance transfers. There is nothing hard and fast here. The take-home message is that, when there is loss, TCP tends to perform much better over short distances than it does over long distances. This has implications for the methods we should be using for fault isolation.

Second, it says that "This makes sense because in ESnet we monitor our own network internally and have a high degree of confidence that it is functioning correctly." What if, unluckily, the problem is actually in ESnet rather than in the other networks? (If everybody is confident in their own network, then we may not know where to start looking for the problem.)

If the problem is within ESnet, it will show up in the testing. If it does, then I would first investigate within ESnet.

Then it says "we immediately have something to investigate within ESnet", I want to know in detail how you could find the problem within such a small distance like 30ms as in the graph on page 20, since "TCP can still perform OK over such distances".

I would first look at the results of our automated testing, since we collect regular test data for tracking performance trends - see http://ps-dashboard.es.net/.

For localizing the issue, I would pick an anchor tester that was far from the place I expected the fault to be and use the same methodology we have been discussing. If I wasn't able to get anything conclusive from that, I'd pick a different anchor tester and start again.

Third, it says "try starting at the other end of the path". What if, unfortunately, the other end also has a problem? In that case, how could we solve this issue? Thanks so much.

If there are multiple problems it is much harder. The thing to do in that case is look outside the path that you're currently working on, and pick an anchor tester that is elsewhere.

If there are a ton of problems all stacked up on the same path, the technique we have been discussing is not so useful (this technique is good for rapid fault isolation when most networks in the path are functioning correctly). If there are a large number of problems, you need to start doing other things (e.g. burst testing using nuttcp as described here: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/nuttcp/).

Eli

All the best,

Xiaoban

From: Eli Dart <>
Sent: Thursday, August 27, 2015 6:44 PM
To: Wu, Xiaoban
Cc:
Subject: Re: [perfsonar-user] Some questions about a slide

Hi Xiaoban,

Segment by segment testing has the potential of showing high performance even when there are problems. To see why, take a look at the TCP loss graph here:

http://fasterdata.es.net/network-tuning/tcp-issues-explained/packet-loss/

When the latency is low, performance can still be good even in the presence of packet loss.
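As a rough illustration of why latency matters so much when there is loss, here is a minimal sketch based on the widely cited Mathis et al. approximation for loss-limited TCP throughput, roughly MSS / (RTT * sqrt(p)). The formula is not taken from this thread, the constant factor of order one is omitted, and the MSS and loss rate below are assumed values chosen only for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on loss-limited TCP throughput, in bits per second.

    Uses the Mathis et al. approximation with the order-one constant omitted:
        throughput <= MSS / (RTT * sqrt(loss_rate))
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))


# Assumed, illustrative values (not measurements from this thread).
MSS_BYTES = 1460
LOSS_RATE = 1e-5

for rtt_ms in (1, 10, 50, 100):
    bps = mathis_throughput_bps(MSS_BYTES, rtt_ms / 1000, LOSS_RATE)
    print(f"{rtt_ms:>4} ms RTT: about {bps / 1e9:.2f} Gbit/s")
```

With the loss rate held fixed, the bound is inversely proportional to RTT, so the same loss that is tolerable on a short segment can be crippling on a long path.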

Note that by "segment by segment testing" we mean something like the following. Label each perfSONAR host in the path, in order, 1, 2, 3, 4, 5, ... N. Then test from 1-2, from 2-3, from 3-4, and so on. This method is flawed because the latency of each test is limited to the length of a single segment. It is possible that some segments are long and others are short, but in most networks the segments will be short enough that the latency will be 10 milliseconds or less. Depending on the amount of loss, TCP can still perform OK over such distances, which makes it difficult to identify the problem segment using this testing methodology.

So, I find it's better to pick a test host as an anchor and gradually increase the length of the test.

Looking at the diagram on slide 20, and picking the ESnet perfSONAR host closest to the national laboratory (dark green) as an anchor point, we might test first to the ESnet perfSONAR host closest to the handoff to Internet2. This makes sense because in ESnet we monitor our own network internally and have a high degree of confidence that it is functioning correctly. If performance between the two ESnet testers is not what it should be, then we immediately have something to investigate within ESnet.

We would then test to the closest Internet2 perfSONAR host on the other side of the ESnet/Internet2 peering. If that's good, then test to the far Internet2 perfSONAR host (the one closest to the peering between Internet2 and the purple regional network).

Then, test to the perfSONAR host in the regional network just beyond the peering with Internet2. And so on.
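To make the difference between the two test plans concrete, here is a small sketch that lists the tests each method would run and the latency of each test. The cumulative RTT values and the function names are hypothetical, made up for illustration; they are not measurements from the slide.

```python
from typing import List, Tuple

# Hypothetical cumulative RTT (ms) from the anchor host (index 0) to each
# perfSONAR host along the path, in path order.
CUM_RTT_MS = [0, 9, 11, 20, 22, 30]


def segment_by_segment(cum_rtt_ms: List[int]) -> List[Tuple[int, int, int]]:
    """Tests 0-1, 1-2, 2-3, ...: each test only spans one segment."""
    return [
        (i, i + 1, cum_rtt_ms[i + 1] - cum_rtt_ms[i])
        for i in range(len(cum_rtt_ms) - 1)
    ]


def anchored_from_first(cum_rtt_ms: List[int]) -> List[Tuple[int, int, int]]:
    """Tests 0-1, 0-2, 0-3, ...: each test is longer than the previous one."""
    return [(0, i, cum_rtt_ms[i]) for i in range(1, len(cum_rtt_ms))]


for name, plan in (
    ("segment by segment", segment_by_segment(CUM_RTT_MS)),
    ("anchored", anchored_from_first(CUM_RTT_MS)),
):
    print(name)
    for src, dst, rtt in plan:
        print(f"  host {src} -> host {dst}: {rtt} ms")
```

In this made-up path no segment-by-segment test exceeds 10 ms, while the anchored tests grow to the full path latency, which is where loss becomes obvious in TCP throughput.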

If performance goes down significantly as a function of distance as you test progressively further, you may have a problem close to your anchor tester. If that's the case, try starting at the other end of the path.

The way you know the long path is clean is that a TCP BWCTL test performs well. If there is loss, it will perform poorly. In the real-world testing from which these diagrams were derived, the difference between the performance of the clean test and the dirty test was dramatic - it's not a subtle thing.
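One way to read a series of anchored test results is sketched below: walk outward from the anchor and stop at the first test that performs poorly. The throughput numbers, the threshold, and the function name are assumptions for illustration only. A simple threshold is plausible here because the clean and dirty results differ dramatically, but the right cutoff depends on your hosts and network.

```python
from typing import Sequence


def longest_clean_index(throughputs_gbps: Sequence[float], threshold_gbps: float) -> int:
    """Return the index of the farthest test that was still clean.

    throughputs_gbps[i] is the TCP throughput measured from the anchor to the
    i-th progressively more distant test host. Returns -1 if even the first
    test is below the threshold, which suggests a problem close to the anchor
    (in which case, try anchoring at the other end of the path).
    """
    last_clean = -1
    for i, gbps in enumerate(throughputs_gbps):
        if gbps < threshold_gbps:
            break
        last_clean = i
    return last_clean


# Hypothetical results for five anchored tests, nearest host first.
results = [9.4, 9.3, 9.1, 0.4, 0.3]
idx = longest_clean_index(results, threshold_gbps=5.0)
if idx == -1:
    print("Poor performance from the first test: suspect a fault near the anchor.")
elif idx == len(results) - 1:
    print("All tests clean along this path.")
else:
    print(f"Clean through test {idx}; investigate between hosts {idx} and {idx + 1}.")
```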

Does this help?

Thanks,

Eli

On Thu, Aug 27, 2015 at 2:16 PM, Wu, Xiaoban <> wrote:

Dear All,

I have been reading this slide http://www.perfsonar.net/media/cms_page_media/3271/20150701-perfSONAR-7-Debugging_Strategies-v1.pdf , but now I am struggling with some questions.

On page 10 and page 17, it says we need to find the longest clean path and that segment-to-segment testing is not helpful. I want to know if anybody has done tests to verify that segment-to-segment testing is not helpful. If so, could you please provide the details of how you implemented the test? Thanks so much for your help.

On the other hand, page 20 shows a long clean path. I want to know how to make sure that it is clean, so we can trust it. Moreover, how do we find such a long clean path? Since segment-to-segment testing is not helpful, gradually increasing the path length may also deceive us. If you have done any similar test, could you please also provide some details? Thanks so much for your help.

All the best,

Xiaoban

--

Eli Dart, Network Engineer NOC: (510) 486-7600

ESnet Office of the CTO (AS293) (800) 333-7638

Lawrence Berkeley National Laboratory

PGP Key fingerprint = C970 F8D3 CFDD 8FFF 5486 343A 2D31 4478 5F82 B2B3
