A Verizon error resulted in a “cascading catastrophic failure” that triggered widespread internet outages on Monday, affecting Cloudflare, Amazon, Facebook, and others.
Website accelerator firm Cloudflare described the event as a “small heart attack” that left websites that rely on it, such as voice and text chat app Discord, unreachable from many parts of the internet for about two hours. And, according to Cloudflare, it was completely avoidable.
The outage was due to a so-called route leak from a Verizon customer. In Cloudflare’s case, this meant traffic that normally goes through Verizon and Level 3 Communications to reach Cloudflare instead went through a metal maker called Allegheny Technologies, Pennsylvania-based ISP DQE, and Cogent Communications onwards to Cloudflare.
Allegheny and DQE’s networks weren’t up to the task of handling such a massive spike in traffic.
At the heart of yesterday’s outage was the Border Gateway Protocol, which networks use to share information about what routes to take. Cloudflare says DQE incorrectly announced routes from its network to its customer, Allegheny.
That routing information was passed on to Verizon, which “proceeded to tell the entire internet about these ‘better’ routes,” explained Cloudflare engineer Tom Strickx.
“The leak should have stopped at Verizon. However, against numerous best practices outlined below, Verizon’s lack of filtering turned this into a major incident that affected many internet services such as Amazon, Linode, and Cloudflare.”
The incident caused Cloudflare to lose about 15 percent of its traffic, according to Strickx.
Andree Toonk, founder of Cisco-owned BGPmon, estimated that around 2,400 networks and 20,000 IP addresses were affected by the incident.
Networking firm ThousandEyes said users in the US, Canada and UK experienced “severe packet loss” when trying to reach apps that depended on Cloudflare. It noted that the route leak introduced routes that were more specific than the legitimate, less-specific ones Cloudflare normally uses.
The firm believes DQE was the original source of this outage due to its use of BGP route-optimization software, which some engineers think should never be used because of its potential to cause incidents like yesterday’s.
“When a more-specific prefix is advertised to the internet, its route is preferred to the less specific prefix. Advertising a more-specific route for a third party’s network is generally a no-no. In fact, intentional advertising of a more-specific prefix is a BGP hijacking method – how criminals attempt to siphon traffic away from legitimate service hosts for cyber-security exploits,” explained Alex Henthorn-Iwane of ThousandEyes.
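The preference Henthorn-Iwane describes is known as longest-prefix matching. The Python sketch below is only an illustration of that rule, not a model of any router involved in the incident: the `longest_prefix_match` helper, the path labels, and the prefixes (taken from the 203.0.113.0/24 documentation range) are all hypothetical.

```python
import ipaddress


def longest_prefix_match(routes, destination):
    """Return the (prefix, path) pair with the most specific matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, path) for net, path in routes.items() if addr in net]
    if not matches:
        return None
    # The longest prefix length wins, regardless of which route is legitimate.
    return max(matches, key=lambda item: item[0].prefixlen)


# Hypothetical table: one legitimate, less-specific route.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "legitimate path",
}
before = longest_prefix_match(routes, "203.0.113.10")

# A leaked more-specific route covering half of the same address block.
routes[ipaddress.ip_network("203.0.113.0/25")] = "leaked path"
after = longest_prefix_match(routes, "203.0.113.10")

print(before[1])  # legitimate path
print(after[1])   # leaked path
```

In this sketch, the same destination switches to the leaked path as soon as the /25 appears, even though the legitimate /24 is still in the table. Addresses outside the /25, such as 203.0.113.200, continue to follow the legitimate path.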
Job Snijders, an engineer at NTT Communications, is a harsh critic of BGP optimizers, arguing the products need to be destroyed.
“It is extremely irresponsible behavior to use software that generates fake BGP more-specifics for the purpose of traffic engineering. You simply cannot expect that those more-specifics will never escape into the global DFZ,” he wrote in 2017.
Snijders said yesterday’s incident was a “cascading catastrophic failure both in process and technologies” and again called on networks to “turn off your ‘BGP optimizers’”.