perfsonar-user - Re: [perfsonar-user] Current status of PS in Amazon Cloud services
- From: Dan Pritts <>
- To: Casey Russell <>
- Cc: Mark Feit <>
- Subject: Re: [perfsonar-user] Current status of PS in Amazon Cloud services
- Date: Wed, 3 Apr 2019 17:06:33 -0400
Picking up this recent thread...
My group here at UMich moved production workloads to AWS in early 2018. A week and a half ago we started seeing performance issues between campus and AWS us-east-1. I did some low-bandwidth UDP iperfs and saw 0.3%-0.5% packet loss on the path. UM networking engaged the Internet2 NOC, who found and fixed a problem with an Internet2 link to AWS in Ashburn.
Mark mentioned below that the way around NAT for perfSONAR is to get public IPs. Unfortunately, it's not that simple in AWS. Even when you have a public address, your Linux instance is given an RFC1918 address and AWS does 1:1 NAT for you. It's weird, but for almost any application this works just fine. (With IPv6, they've done away with this, and you can get a plain old routed address.)
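To see the 1:1 NAT concretely, here's a quick sketch run on the instance itself (the paths are AWS's instance-metadata service endpoints; under IMDSv2 you'd first have to fetch a session token, and the addresses shown are placeholders):

```shell
# The interface carries only the RFC1918 address, e.g. 10.x.x.x or 172.16-31.x.x:
ip -4 addr show

# The metadata service reports both sides of the 1:1 mapping:
curl -s http://169.254.169.254/latest/meta-data/local-ipv4    # the RFC1918 address
curl -s http://169.254.169.254/latest/meta-data/public-ipv4   # the NATted public address
```

The instance never sees the public address on any interface, which is exactly what trips up tools that advertise their local IP to the far end.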
While gathering information for the NOC, I brought up the latest perfSONAR on an AWS instance and on a system here in Ann Arbor. I could source tests from AWS, but sourcing from Ann Arbor didn't work - pscheduler never started up iperf3. Manual iperf3 runs worked fine, of course, and once I got both ends up on IPv6, tests worked in both directions.
Having tools like perfSONAR available and working is key when moving to the cloud. The network has to be nearly perfect, and without the ability to monitor it, that seems like a long shot.
Regarding the uncertainty that comes with the cloud - you are absolutely right. That makes tools even more useful.
So let me put in a feature request - make the perfSONAR tools work with 1:1 NAT as AWS uses it. Manual configuration would be OK - a quick little entry somewhere that says "my public IP is x.y.z.1". Sorry if this already exists and I missed it; it wasn't for lack of searching.
Hi to anyone who remembers me, hope you are all well.
danno
Casey Russell wrote on 2/22/19 2:32 PM:
Actually, allow me to adjust my wording on that just a bit. For this smaller member, who uses Cloud Connect for day-to-day student and administrative software, that ability for KR to do this testing "would be nice".
Lawrence, Kansas 66047
Casey Russell writes:
The last time I saw the question addressed was two years ago, in 2017, when it was acknowledged that there were a host of problems trying to deploy a PS node in the Amazon cloud. It appears the problems mostly surrounded NAT issues. Has there been any progress in the intervening time? Is anyone deploying a set of PS tools in Amazon Cloud services to monitor their new Cloud Connect connectivity or something similar? If so, do you have a roadmap or template for doing so?
Things are much the same.
NAT isn’t difficult to deal with for tests like TCP throughput, where you can do reverse testing to dodge the translation. It’s a lot stickier for UDP throughput, latency, and RTT, where you can do tests from behind the NAT to the outside, but anything you do the other way won’t traverse it, and there isn’t necessarily a server available to play the “other end” role. We’ve discussed things like UDP hole punching, but the number of variables would make it difficult to support. The only other way around it is to have public IPs.
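As a sketch of the reverse-testing approach (hostname is a placeholder; this assumes a recent pscheduler where the throughput test accepts a reverse option), the host behind the NAT initiates both tasks, so no inbound connection ever has to cross the translation:

```shell
# Run from the host behind the NAT; outside.example.net is a placeholder.
# Forward direction: we send traffic to the outside server.
pscheduler task throughput --dest outside.example.net

# Reverse: the control connection still originates here, but the
# throughput traffic flows from outside.example.net back to us.
pscheduler task throughput --dest outside.example.net --reverse
```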
The harder problem to overcome is the uncertainty that comes from deploying infrastructure in what’s essentially a geographically- and topologically-diverse black box. An AWS instance in us-east-1 will materialize as one VM on one machine in one rack in one aisle in one of almost three dozen buildings in two counties in Northern Virginia. (I live near Ashburn and drive past Amazon’s data centers often enough to have a good feel for how large and spread out their presence here is. It’s quite something.) Making end-to-end measurements meaningful is difficult in an environment where there’s no control over the location of one end, but that lack of control is what makes commodity computing cheap enough to be attractive. I’ve run into researchers who install perfSONAR in a container on the same instances where their applications run and treat any measurements they get as valid for only that instance. If I remember correctly, their institution was getting a large-enough break on network traffic that the cost of outbound throughput testing wasn’t an issue.
This has been mentioned before but is worth repeating: Check the agreements you have with cloud providers before sharing the results of performance measurements involving their systems and networks. I don’t know if it’s still true, but at least one provider forbade that.
Putting on my Internet2 hat for a moment: We have internal-use-only perfSONAR nodes in the vicinity of some, but not all, of our cloud provider handoffs. There have been discussions about adding more at those points but no solid action. If you’re a member and consider this important to your operations, your feedback helps drive what we do.
HTH.
--Mark
-- To unsubscribe from this list: https://lists.internet2.edu/sympa/signoff/perfsonar-user
ICPSR Computing & Network Services
University of Michigan