Deep Dive
26 July 2022
Triple Whammy of Loss of Wavelength, Routing Control Plane Crash, and IGP Issues
Background
I have written many times about our network platform, with the peering and transit edge based around the open source VyOS project running on commodity hardware and our use of network automation to build repeatable and reliable configurations. I have also written a few times about our network architecture, with the most relevant piece to this RFO being the topology of our UK ring.
Between 08:49:30 and 08:49:45 UTC (09:49 UK local time) our network monitoring detected the loss of light received at Williams House (Equinix MA1) in Manchester for the wavelength service via Bradford to AQL DC2 in Leeds. This set off a cascade of events which required significant manual intervention before stabilising at 09:39:15 UTC. Later in the day the affected wavelength service was restored at 16:27:30 UTC.
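The detection itself came via our SNMP alerting, but as a rough illustration of what is actually being measured, the receive light level on a Linux-based router can also be read by hand from the transceiver's digital optical monitoring (DOM) data; the interface name here is purely illustrative:

```
# Read the transceiver's DOM data on an illustrative interface; a collapse
# in "Receiver signal average optical power" is the signature of loss of
# light on the wavelength.
ethtool -m eth2 | grep -i 'power'
```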
Incident Response
All times are presented in UTC (add one hour for UK local time), and are derived from the incident record on our status page.
08:49 — Loss of link at the Manchester end of the Manchester-Leeds path; SNMP alert delivered via the out-of-band network and notified to network engineers.
08:51 — A Faelix engineer verified that no light was being received from the Bradford regeneration site at the Manchester end of the Manchester-Leeds path, and contacted the fibre owner.
08:52 — By now customers have begun calling. The link loss has been detected by AS41495’s interior gateway protocol (OSPF), but the topology change has not been programmed into the forwarding plane correctly.
08:56 — The fibre owner's senior engineer confirms they have raised a fibre fault with the field engineering team.
08:57 — Rather than the flurry of initial alerts subsiding, hundreds of notifications now deluge the network engineering team, while customers call them directly asking for updates.
09:10 — The router `bly` in Williams House has attracted a significant amount of traffic towards it (due to OSPF costs), and according to its RIB (routing information base) it looks as though it will send that traffic via the backup link towards AQL DC2. However, its FIB (forwarding information base) has not been updated correctly, and so the traffic is being dropped into the path that failed — also known as being "blackholed".
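For readers less familiar with the distinction, a sketch of how this kind of divergence shows up on a VyOS/FRRouting router is to compare what the routing daemons have selected against what has actually been installed for forwarding. The prefix and address below are documentation examples, not routes from the incident:

```
# FRR's view of its RIB: the route the control plane has selected
vtysh -c "show ip route 192.0.2.0/24"

# zebra's view of what it believes has been installed into the FIB
vtysh -c "show ip fib 192.0.2.0/24"

# What the Linux kernel will actually do with a packet to that destination
ip route get 192.0.2.1
```

When the first two disagree with the third, as happened on `bly`, traffic is forwarded according to stale kernel state regardless of what the routing protocols have computed.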
09:18 — After numerous attempts to reconfigure `bly` it becomes clear that its control plane has locked up. Engineers issue an immediate reboot. The decision is made to allow `bly` to boot VyOS 1.3.1, as this RIB/FIB/blackholing issue seems reminiscent of a bug in FRRouting that we experienced about two years earlier.
09:25 — The alternative north-south path between Reynolds House and Telehouse West had been "downprefed" the night before, following issues other customers of the same provider had reported, and ahead of the provider's planned maintenance affecting their 100G core links in Manchester, scheduled for 21:00-23:59 UTC on 2022-07-26 (later the same day). Unfortunately it had not been noticed that the OSPF costs set for that link were much higher than those used by another router in Equinix MA1 connected to the backup path towards AQL DC2. That router was never intended to carry AS41495's full production traffic between Manchester and London. This hampered the intended quick switchover to the MA2-THW link.
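For context, "downpreffing" a link in our network simply means raising its OSPF cost so that traffic prefers other paths while the link remains available as a last resort. In FRR's configuration syntax (which VyOS drives underneath), that looks roughly like the following; the interface name and value are illustrative, and the mistake in this incident was that the raised cost was not checked against the costs already configured on the intended backup path:

```
! Illustrative FRR interface stanza: a "downpreffed" link is given a
! deliberately high OSPF cost. For the intended failover to work, the
! backup path's total cost must still be lower than this value.
interface eth3
 ip ospf cost 5000
```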
09:33 — `bly` finishes booting into VyOS 1.3.1 and starts establishing OSPF adjacencies, and iBGP and eBGP sessions.
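Confirming that the rebooted control plane is healthy again comes down to standard VyOS operational-mode checks along these lines (generic commands rather than a transcript from the incident):

```
# OSPF adjacencies should return to the Full state
show ip ospf neighbor

# iBGP and eBGP sessions should return to Established
show ip bgp summary
```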
09:39 — External routing from peering and transit is now stable, and customer traffic is flowing once again. However, some customers with private cloud connectivity within our hosting network are still seeing issues. Additionally, some customers whose routers had been migrated to AQL DC2, following a disk failure in Manchester a few days earlier, were still expecting their default gateways to be "long-lined" from Manchester: our automation tools had not programmed the new nexthop IPs into the hosting routers at AQL DC2.
10:15 — Almost all of the affected hosting customers (private networks or unmigrated default gateways) have been resolved. This involved moving the default gateways and creating pseudowire tunnels between infrastructure, presenting a layer-2 private network to the affected customers but routing it within our backbone at layer-3 for loop-avoidance. It also involved bringing two new routers into service at AQL DC2, `coudreau` and `korsakov`. These routers were installed in May 2022, but we were waiting until our plans for exiting MA2 were decided before bringing them into service (which would have been carried out during planned maintenances related to that move). However, to give us more options during the resolution of this incident, when we restarted `bly` we also began preparing to bring `coudreau` into service rather than having east-coast connectivity between Manchester and London merely cut through Leeds at layer-2.
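The pseudowires themselves are nothing exotic: a rough sketch of the idea, expressed here in plain Linux iproute2 terms rather than our actual VyOS configuration, is a static L2TPv3 session between two routers whose resulting Ethernet interface is bridged towards the customer. The addresses and IDs below are examples only:

```
# Illustrative static L2TPv3 pseudowire between two routers, carrying a
# customer layer-2 segment across the routed (layer-3) backbone.
ip l2tp add tunnel tunnel_id 10 peer_tunnel_id 10 encap udp \
    local 192.0.2.1 remote 198.51.100.1 udp_sport 5000 udp_dport 5000
ip l2tp add session tunnel_id 10 session_id 1 peer_session_id 1

# The resulting l2tpeth0 interface is then bridged with the
# customer-facing VLAN to present the private layer-2 network.
ip link set l2tpeth0 up
```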
12:21 — The usable MTU on the link from AQL DC2 (Leeds) to Telehouse North (London) is observed to be significantly smaller than what the circuit is meant to provide, and slightly smaller than the configured value. We adjust this manually at both ends to keep our BGP sessions stable.
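Confirming a shrunken path MTU and working around it is straightforward; a sketch of the kind of check and change involved, with illustrative sizes, addresses, and interface names:

```
# Probe the usable MTU across the link: 1472 bytes of ICMP payload plus
# 28 bytes of headers exercises a 1500-byte MTU; shrink until it passes.
ping -M do -s 1472 -c 3 203.0.113.2

# Then, in VyOS configuration mode, pin the interface MTU at both ends
# to what the circuit actually carries.
set interfaces ethernet eth1 mtu 1500
commit
```

Pinning the configured MTU at or below what the circuit really carries stops large packets, such as full-sized BGP updates, from being silently dropped mid-path.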
Problem Management
10:03 — Fibre engineers had already run OTDR tests from the Bradford end and found no issues. They have now arrived at the Manchester end and are performing further tests.
10:32 — The fault is now believed to be with the DWDM amplification unit at the Manchester end of the fibre. The engineer on-site has observed the EDFA system there showing a red "fail" light, usually indicative of a hardware issue within the amplifier's laser.
11:32 — We are told that the failed DWDM amplifier has been removed and the engineers on-site are beginning to test the replacement before bringing it into service. Our plan is that once we’ve got the all-clear from them then we will begin our own tests.
From this point we are in contact with the provider's engineer on site at Williams House, and are assisting remotely, drawing on our experience of configuring these DWDM amplifiers. Here is a schematic diagram of the operation of one of these erbium-doped fibre amplifiers (EDFA).
The replacement unit had been configured identically to the old unit, but its status showed that the amplifier was not activating. This was clear from its indication that line 1 transmit was at a light level of -60dBm (unmeasurably low). We worked with the engineer on-site, checking, reconfiguring, and adjusting parameters, but the amplifier steadfastly refused to activate its laser. Together we followed the vendor's "turn up" documentation thoroughly, both of us checking every item as we went through. And still it reported that the laser status was off.
15:24 — On a whim we suggested that the engineer on site swap the attenuator (in place in the dispersion compensation loop) for a new one. At this point the EDFA burst into life: line 1 transmit was at a nominal level, line 2 activated, and the various DWDM wavelengths established link. We were relieved, but frustrated that this was non-obvious. Had line 1 been transmitting and line 2 not receiving light, that would have indicated a fault with the dispersion compensation/attenuation loop. But because the unit indicated that line 1 was not transmitting, we wasted time believing that the laser was not activating because the wrong light level thresholds had been configured.
15:45 — Our own tests were complete and we began bringing the link back into production.
19:40 — We saw the link drop out briefly a couple of times.
20:08 — The provider’s engineer has confirmed that there are no field works being carried out on this fibre segment, so these drops are unexpected.
21:00 — The provider's engineer brings forward a plan to adjust thresholds on the EDFA in response to a few more short drops in the light level received in Manchester.
23:00 — The provider’s engineer has left site, having adjusted some thresholds. We continue to monitor the situation, with the Manchester-Leeds path in backup mode.
08:00 (the following morning) — We confirm the additional threshold changes with the provider's network engineer, continue to observe network stability, and continue writing this post-incident review.
Conclusions
- `bly` did not reroute traffic correctly when the wavelength dropped.
- In spite of being engineered to route around failures, an unrelated software bug hampered correct network convergence.
- We had seen this happen once before, in December 2020/January 2021, but believed that the software `bly` was subsequently running had fixed this issue.
- Our automation tools had not adjusted the network topology following server migrations.
- We need to fix those automation tools and ensure they work correctly during evacuations like the one we carried out a few days before this incident after a disk failure in Manchester.
- A confluence of issues, both shortly before this incident and during it, made this especially difficult to resolve.
- Having one core link fail is something to be prepared for. Having it fail when another core link is “downprefed” pending other provider maintenance is unlucky. But having the cost applied to the backup link be incorrect is our mistake.
Positives
- We have built an extensive out-of-band network using third party connections in each of our datacentres as part of our management plane. This made accessing our core infrastructure much easier, especially the part of it that was malfunctioning and blackholing traffic in its data plane.
Remediations
We have identified a number of issues during this incident which will be addressed in a mix of business-as-usual and planned maintenances:
- upgrade `bly` — carried out during incident
- reimplement some private cloud network layer-2 segments to be tunneled over layer-3 routing — carried out during incident
- reimplement all private cloud network segments onto the underlay network — will be carried out during upcoming works building out to MA5
- review of OSPF link costs and procedure for link “costing out” or “down-pref” to be updated in light of those findings — priority work
- ensure our automation tools reprovision default gateways at the time of hosted server/VPS migrations — priority work
- upgrade all routers which may still be running a version of FRRouting which we suspect has a bug that contributed to this incident (see the sketch after this list) — planned works
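As a starting point for that last item, an audit along these lines (the hostnames are placeholders, not our real routers) identifies which FRRouting build each device is actually running:

```
# Illustrative audit loop: print the FRRouting version on each router.
for router in rtr1.example.net rtr2.example.net; do
    printf '%s: ' "$router"
    ssh "$router" "vtysh -c 'show version'" | head -n 1
done
```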
Closing Remarks
I would like to apologise to our customers both for how our network didn't react to the physical fault as it should have, and for how long it took us to remedy some of the dedicated private services which needed manual intervention. We'll be applying service level credits as appropriate in line with our service level agreement and/or customers' contracts. And, of course, we'll be implementing the learning and follow-on actions identified during our post-incident review.