Deep Dive

by Marek

RPKI and BGP Routing Security

Yesterday, the global content delivery network Cloudflare launched the website isbgpsafeyet.com, and it has landed some very mixed responses. Some end users are confused or worried by the pronouncement that their ISP “does not implement BGP safely” and that this leaves them “vulnerable to malicious route hijacks”.

Other eloquent voices explain what BGP route hijacks are, and how RPKI can mitigate some of them — so we’re not going to cover that here. In this article we want to share some of our operational experience.

Background

At FAELIX we deployed RPKI in the late summer of 2019, when we upgraded our BGP edge network to new equipment and a different network operating system. Several things motivated that project.

As part of our upgrade effort we evaluated hardware and software options and settled upon VyOS, as it met many of our needs out of the box and we could develop the things that we needed or wanted on top of its platform. One of the most exciting aspects of VyOS was its use of FRR (“free range routing”), a collaborative effort forked from the venerable Quagga routing project, with contributions from giants of networking: ISC, Cumulus, VMware, and many others. We were already familiar with Quagga, having built our network on it for our BGP and OSPF implementations back in 2007, so the modernised FRR was a familiar networking friend for us.

Before we deployed VyOS we built the “Halophile Router”, or hphr, which automated and templated the configuration for each router using SaltStack. As part of building each router’s configuration it would apply IRR-based filters and drop RPKI-invalid prefixes when learning routes from peers, customers, and our upstream transit providers. This gave us the best routing security that we had ever had, and the ability to rebuild the filters automatically and reproducibly. We were rightly proud of our efforts in getting this deployed, as it increased our security posture by helping protect both our network and the networks of our customers from some types of BGP route hijack.
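For illustration, that kind of configuration can be sketched in VyOS 1.2’s CLI roughly as below. This is not our production configuration: the RTR cache address, route-map name, AS number, and neighbour address are all placeholders, and the exact syntax can vary between VyOS releases.

set protocols rpki cache 192.0.2.10 port 3323
set protocols rpki cache 192.0.2.10 preference 1
set policy route-map EBGP-IN rule 10 action deny
set policy route-map EBGP-IN rule 10 match rpki invalid
set policy route-map EBGP-IN rule 20 action permit
set protocols bgp 64500 neighbor 198.51.100.1 address-family ipv4-unicast route-map import EBGP-IN

Rule 10 rejects any prefix whose RPKI validation state is invalid; everything else falls through to rule 20, where in reality the IRR-based prefix filters would apply rather than a blanket permit.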

The Incident of 2020-04-17

On the morning of 17th April, our automated network monitoring alerted our network engineers to a problem. Some traffic was being blackholed within our network; that is to say, it was not reaching its intended destination. Some of our customers were affected, depending on how their traffic was routed within our network (e.g. which site they are in, and which edge router their customer BGP sessions terminate on), and by the time they got in touch to report the problem we were already fixing it.

What had happened was that the bgpd process, which speaks BGP to other routers in our network and in other providers’ networks, had crashed. The logs on the router suggested it had accessed a region of memory it was not entitled to, causing a “segmentation fault”, which is usually treated as a reason for that process to be terminated with some prejudice. The result was that all the BGP sessions from that router dropped, and our network’s routing gradually reconverged. However, while that recomputation of routing paths was taking place, some data was still being sent towards the router which had just crashed; it no longer knew where to send that traffic on, so it dropped it as “destination unreachable”.

There is an unfortunate bug in VyOS, T1894, which means that when the bgpd process is automatically restarted following a crash it does not resume with its previous configuration: it resumes with no configuration at all. Frustratingly, the only fix is to reboot the router (at least until we or another contributor to the project fixes this), and that is exactly what our engineers did.
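The symptom of that bug is straightforward to confirm from the router’s shell, assuming vtysh access to FRR: after the automatic restart, bgpd is alive but knows nothing.

vtysh -c "show running-config"    # bgpd is up, but its BGP configuration is gone
vtysh -c "show ip bgp summary"    # no neighbours, no sessions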

The crashed router rebooted, performed its hardware tests, booted the operating system, started configuring its network interfaces, and then began establishing BGP sessions once more. And then two minutes later the bgpd process crashed again. This time we caught a copy of the detailed logs which might give us clues as to where the problem lay:

Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f43747983d7]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f4374798833]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x712e5) [0x7f43747b92e5]
Apr 17 08:22:42 aebi bgpd[1238]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f43735c2890]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x5576bbe42415]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f436fa1d042]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f43747c6b00]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f43747965c8]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/frr/bgpd(main+0x2ff) [0x5576bbdecb4f]
Apr 17 08:22:42 aebi bgpd[1238]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4373229b45]
Apr 17 08:22:42 aebi bgpd[1238]: /usr/lib/frr/bgpd(+0x3cb6c) [0x5576bbdeeb6c]
Apr 17 08:22:42 aebi bgpd[1238]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:509; aborting...

Unfortunately this second crash also caused another “flap” within our network, which again did not go unnoticed. However, we now had some knowledge which might help us mitigate the problem in the short term and reinstate routing via that router: the backtrace pointed into FRR’s RPKI module (the bgpd_rpki.so frame, calling into bgp_table_range_lookup). We took the decision to upgrade the version of VyOS on that router from 1.2.4 to 1.2.5, and also to disable that router’s connections to the RPKI-to-router (RTR) service powered by Routinator on our route-servers. This would stop that router from rejecting RPKI-invalid routes, but if it worked it would also bring our Manchester data-centres back to normal operation rather than running “at risk” with one router down.
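That mitigation amounts to removing the RTR cache configuration so that bgpd stops evaluating RPKI validation state altogether; something along these lines, again with a placeholder cache address:

configure
delete protocols rpki cache 192.0.2.10
commit
save
exit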

The timing of this RPKI-related crash was, of course, the same day that Cloudflare launched a high-visibility campaign about RPKI filtering. What an odd coincidence!

The Investigation

The following day, on 18th April, we began looking in earnest at what had caused the crash. We identified:

  • the lines of code in FRR responsible for the segfault which caused the previous day’s incident

  • that a subsequent commit to the FRR source had fixed it

  • that this commit was included in FRR 7.2.1 and later versions

  • and that a more recent version of FRR is used in VyOS 1.2.5 (which we had installed the day before)

We were reasonably confident that VyOS 1.2.5 would be stable to deploy into production as an RPKI-checking BGP router. But we needed to be sure, so we tested this in our network lab by running VyOS versions 1.2.4 and 1.2.5 with identical configurations. As expected, our lab 1.2.4 crashed in exactly the same way that we had observed in production, after a few minutes of uptime. But 1.2.5 remained solid for many minutes. We informed the VyOS development team of our findings, and they closed T1874 (“FRR crashing triggered by RPKI”).
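For anyone wanting to run a similar comparison, the lab checks are unglamorous: load an identical configuration into both versions, confirm which FRR build each image actually ships, and watch whether bgpd stays up once routes are flowing and being validated. A rough sketch from the shell (standard procps and FRR vtysh invocations; output formats vary by version):

vtysh -c "show version"            # which FRR build this VyOS image ships
ps -C bgpd -o pid,etime,cmd        # bgpd's uptime: has it survived past the crash window?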

Our Take on Deploying RPKI

Our RPKI deployment was fairly mature when this happened: we had automated configuration, multiple RTR servers deployed by orchestration tooling, and we had been running in production for months. Somewhere along the way some circumstances changed and upset our router running VyOS 1.2.4. We don’t have the exact details, but possible factors include which BGP routes we were learning, what data was being served over RTR, the exact order in which routes were learned, and race conditions or preconditions which we did not have time to investigate as part of this incident.
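One operational lesson we would pass on is to make the RPKI state observable. FRR can report which RTR cache a router is synchronised with and what validated ROA data it currently holds, which is particularly useful when multiple caches are deployed for redundancy (commands as per FRR’s RPKI documentation; output varies by version):

vtysh -c "show rpki cache-connection"    # which cache is the active RTR session?
vtysh -c "show rpki prefix-table"        # the validated ROA data bgpd is using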

Operators may now feel under pressure to perform RPKI validation at their BGP edge. Many have already begun investigating the challenges, planning their projects, testing their tooling, or deploying lab or production equipment. All of us rely on third-party vendors to properly support and carefully develop the code that runs these networking protocols. While the technology has existed for many years, it is clear to us that everybody (vendors, operators, carriers, customers, and providers’ own staff) is going to be learning together. As it currently stands, many networks’ operational experience of RPKI is not yet mature, and that will make many a network engineer pause before rolling it out on their network.

In some ways our outage may vindicate the decision of those engineers who choose to hesitate. They may consider that we were “early adopters” and that our bad experience is a form of QA of the technology and its implementation. As network engineers we strive to build systems and structures which function to a high standard. New features bring new code paths which have not been walked for as long, or with as much data; unless formal correctness tools or automated testing can be applied, bugs will only ever be shaken out over time. But as engineers we also use scientific principles to guide our work, and that includes learning from when things go wrong, and sharing that knowledge and experience to better our field as a whole. That, after all, is the classic definition of engineering:

The creative application of scientific principles to design or develop structures, machines, […] or works […]; or to construct or operate the same with full cognizance of their design; or to forecast their behavior under specific operating conditions; all as respects an intended function, economics of operation and safety to life and property.