wg-irr - [WG-IRR] Fwd: May 9th RADb Service Outage Report

Subject: Registry Working Group

  • From: David Farmer <>
  • To: "" <>
  • Subject: [WG-IRR] Fwd: May 9th RADb Service Outage Report
  • Date: Wed, 19 May 2021 08:02:10 -0500

This is related to the issues discussed on the Slack channel a little over a week ago, on May 10th.

---------- Forwarded message ---------
From: Merit RADb <>
Date: Wed, May 19, 2021 at 07:27
Subject: May 9th RADb Service Outage Report
To: <>


 
RADb Service Outage Report
 
RADb Community:

We appreciate your patience while we took the time to investigate and understand the RADb service outage that started on May 9th, 2021 and ended on May 10th, 2021. We’d like to share the details of the incident with you.

Problem Description: On Sunday, May 9th, at 12:53PM ET, the RADb Support Team noticed that systems hosted in our primary data center were not working or responding as expected. At this point, the RADb website became unavailable. During emergency restoration of these systems, our data center Core router lost its ability to forward IPv4 traffic for a little over an hour, affecting traffic during that period.

Reason for Outage: The problem originally presented itself as a routing instance issue. When engineers attempted a failover from the primary routing engine to the secondary engine, the data center Core router lost the ability to forward IPv4 traffic. Merit engineers corrected this issue while working with the equipment vendor, and route forwarding was re-established at 21:05PM ET. After IPv4 forwarding was restored, the team continued work on restoring applications. An optical connection was found to be failing on the uplink from one of our distribution nodes to the Core router. Merit hosts four distribution nodes at this location, but a static route was in place that sent traffic through the failing interface. Engineers removed the static route and then balanced the traffic across the remaining distribution nodes, which restored full connectivity to systems, including RADb, by Monday, May 10th at 09:43AM ET.

Resolution Steps/Process: Merit engineers have an open case with the equipment vendor support team for a root cause analysis focused on the routing engine failover and the fallout from that suggested step. Case analysis is still ongoing and may take additional time for discovery and testing. Engineers isolated a distribution node from the pool and balanced traffic across the remaining nodes in order to restore functionality at the location. On May 13th, between 00:00 - 03:30AM ET, engineers worked at our primary data center to replace a failed optic and cable. With the primary link between the distribution node and the Core router restored, additional testing allowed the link to be returned to normal operation.

Preventative Measures Going Forward: To protect RADb uptime, additional dynamic routing was added to the distribution nodes to prevent a recurrence of the incident. Network automation will be used to improve the configuration management of the systems involved. Disaster recovery / business continuity options at alternative data center locations are being evaluated.

We welcome any feedback or questions that you may have - please contact . We also welcome your participation in our customer survey, should you wish to provide an evaluation of the ongoing value that you receive from RADb: https://www.radb.net/survey

We sincerely apologize for the impact that this outage may have caused for you and your organization, and appreciate your ongoing input as we continue to improve RADb.

Thank you,

The RADb Team

 
 
                                                           
Merit Network, Inc.  |  880 Technology Drive Suite B, Ann Arbor, MI 48108-8963
Copyright © 2021    Merit Network, Inc. - All rights reserved.
--
===============================================
David Farmer              
Networking & Telecommunication Services
Office of Information Technology
University of Minnesota  
2218 University Ave SE        Phone: 612-626-0815
Minneapolis, MN 55414-3029   Cell: 612-812-9952
===============================================

