Network at Faelix

Posted on 01 Oct 2015 by Marek 

Large Internet exchanges, such as LINX, run two separate networks built on equipment from different switch vendors. Some would argue that this adds complexity, and complexity can be the enemy of reliability. But when you look at a network the size and importance of LINX, and understand some of the historic faults that have occurred, you can start to understand the rationale.

Ever since Faelix became a multi-homed network on the Internet, we have used open source software for our routing. Our consultancy work often involves working with different technologies — we support several other ISPs' networks — and so our previous network architecture used a mix of Linux boxes running Quagga and Bird. We wanted to keep this diversity in this year's network refresh.

Once Bitten, Twice Shy

Five years ago, when we started using a fairly early version of Bird, we stumbled across a bug in its BGP implementation. Bird was segfaulting. Our debugging suggested that this was caused by some BGP community data that was part of a Cogent full transit feed. Some ISPs would suggest that segfaulting is a perfectly reasonable reaction to a full routing table from AS174. But joking aside, this would have been fairly catastrophic for our customers at the time: our provider edge would have stopped announcing our prefixes into the DFZ, and traffic would no longer have flowed. Thankfully we were also running Quagga, a completely separate implementation of BGP, and the Quagga boxes continued to route traffic in and out of our network while we debugged Bird and upgraded it to a version without this behaviour.
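We never established exactly which part of the community data upset Bird, so the following is not a reconstruction of that bug. It is only a minimal sketch, in Python, of the kind of defensive check a BGP implementation needs when it reads a COMMUNITIES attribute, which RFC 1997 defines as a list of four-octet values. The function name parse_communities is made up for this illustration and does not come from Bird or Quagga.

```python
import struct
from typing import List, Tuple


def parse_communities(value: bytes) -> List[Tuple[int, int]]:
    """Decode the value of a BGP COMMUNITIES attribute (RFC 1997).

    Each community is four octets, conventionally shown as two
    16-bit numbers, "ASN:value". Hypothetical helper for illustration.
    """
    # Reject truncated or padded data up front instead of reading
    # past the end of the buffer.
    if len(value) % 4 != 0:
        raise ValueError(
            "COMMUNITIES length %d is not a multiple of 4" % len(value)
        )
    # "!HH" means network byte order, two unsigned 16-bit integers.
    return [
        struct.unpack_from("!HH", value, offset)
        for offset in range(0, len(value), 4)
    ]


if __name__ == "__main__":
    # 0x00AE0015 is 174:21; 0xFFFFFF01 is the well-known NO_EXPORT.
    for asn, val in parse_communities(bytes.fromhex("00ae0015ffffff01")):
        print("%d:%d" % (asn, val))
```

A parser written in C has no exception to fall back on, which is why a missed length check there tends to end in a segfault rather than a logged error.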

Fast forward to 2014 and our network consultancy has helped a few other autonomous systems to change and upgrade their networks. One of those projects, with another ISP in Manchester, evaluated MikroTik's Cloud Core Routers and found them to be a very cost-effective and energy-efficient way of routing large quantities of IPv4 and IPv6 traffic. Under the hood, MikroTik's RouterOS seems to use Quagga as its BGP implementation. We replaced four Quagga and Bird routers with two CCRs as our main packet pushers, and left two Bird boxes as backup paths. After months of lab work and testing, and a switch-over in May, we thought little more of our multi-vendor setup.

Until July.

There was a bug in RouterOS on 64-bit platforms, such as the Tile processor in the CCRs. That bug manifested itself at exactly 23:59:60 UTC on June 30th, the leap second, for certain configurations of NTP servers. A minute after the leap second we received automatic alerts of high packet loss. One of our CCRs had crashed! Traffic routed itself around the faulty router a short while later, via the other CCR and the Bird boxes. After a quick trip to Reynolds House to attempt diagnostics, the stuck CCR was power-cycled and running happily once more.
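We don't know what the RouterOS bug actually was, and the sketch below makes no attempt to reproduce it. It only illustrates, using Python's standard library and the 2015 leap second (inserted at the end of 30 June, UTC), why 23:59:60 is such an awkward moment for software: POSIX-style time has no slot for it, and some time types refuse to represent it at all. The helper names are our own inventions for this example.

```python
import calendar
import time
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def posix_seconds(utc_label: str) -> int:
    """Convert a UTC wall-clock label to POSIX seconds.

    POSIX time assumes every day has exactly 86400 seconds, so it
    has no distinct value for a leap second.
    """
    # time.strptime accepts a seconds field of 60; calendar.timegm
    # then does plain arithmetic on the fields.
    return calendar.timegm(time.strptime(utc_label, FMT))


def datetime_accepts_second_60() -> bool:
    """Report whether datetime can hold the leap second itself."""
    try:
        datetime(2015, 6, 30, 23, 59, 60)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for label in (
        "2015-06-30 23:59:59",
        "2015-06-30 23:59:60",  # the leap second
        "2015-07-01 00:00:00",
    ):
        print(label, posix_seconds(label))
    print("datetime accepts :60?", datetime_accepts_second_60())
```

The leap second and the following midnight map to the same POSIX value, so any code that assumes timestamps are unique or strictly increasing across that boundary is making an assumption the clock does not honour.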

The following morning, we checked the MikroTik forums and found that our crash wasn't a coincidence. In some cases, ISPs had hundreds of routers crash simultaneously, and despite a "watchdog", all required a site visit or remote power management to reboot. We were one of the few ISPs using MikroTik CCRs at their network edge that hadn't suffered a major outage.

We appreciate the level of support that large vendors can give to their high-profile customers: engineers to debug faults, developers to pore over router crash dumps, and a large pool of knowledgeable and experienced staff. No platform, hardware or software is ever perfect, and a vendor support agreement is good business sense and reassurance for customers.

Our decision right at the start of running our network was to establish and maintain our niche working with smaller companies and using Open Source Software. One of the important freedoms that the OSS movement gives is freedom of choice: pick whichever software works best for you, pick which to help improve... or, in our case, pick more than one. Picking two different platforms is not easy: there are interoperability tests to do internally, there isn't always 100% overlap of features, and there are twice as many bugs. But even though this kind of diversity requires more knowledge and careful configuration management, our decision to run more than one routing implementation has been vindicated. Twice!