Speculation Swirls Around Facebook Outage


By: Mary Jander


Facebook (Nasdaq: FB) is still reeling from an outage that brought its main social platform, plus Instagram, Oculus, and WhatsApp, to a halt for over six hours on Monday, October 4. And the tech world is buzzing with speculation.

Most observers agree that the cause of Facebook’s outage was a snafu involving the Border Gateway Protocol (BGP), which works with the Domain Name System (DNS) to route traffic to specific IP addresses on the Internet. A misconfigured router caused errors that rippled throughout Facebook’s network.
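The failure mode can be sketched with a toy model. This is illustrative only, not a real BGP implementation (real BGP speakers exchange UPDATE messages over TCP sessions with peers, and the prefix and next-hop names below are made up):

```python
# Toy sketch of BGP route announcement and withdrawal.

class ToyRouter:
    """Holds a routing table mapping IP prefixes to next hops."""

    def __init__(self):
        self.routes = {}

    def receive_announcement(self, prefix, next_hop):
        # A peer advertises: "you can reach this prefix through me."
        self.routes[prefix] = next_hop

    def receive_withdrawal(self, prefix):
        # A peer withdraws the route; the prefix becomes unreachable.
        self.routes.pop(prefix, None)

    def next_hop(self, prefix):
        return self.routes.get(prefix)  # None means "no route"


peer = ToyRouter()
peer.receive_announcement("129.134.0.0/17", "facebook-edge")
print(peer.next_hop("129.134.0.0/17"))  # facebook-edge

# The faulty maintenance command, in effect, triggered withdrawals:
peer.receive_withdrawal("129.134.0.0/17")
print(peer.next_hop("129.134.0.0/17"))  # None: traffic has nowhere to go
```

Once routes to Facebook’s address space were withdrawn, the rest of the Internet also lost the path to Facebook’s authoritative DNS servers, so name lookups failed — the first symptom outside observers saw.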

This scenario was confirmed by Santosh Janardhan, Facebook’s VP of infrastructure, in an update this afternoon:

“The data traffic between … computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And … our engineers often need to take part of the backbone offline for maintenance….
“This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
“This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.”

There was no evidence of malicious activity, and no user data was exposed, Janardhan said.

But Really, What Happened?

Despite Facebook’s explanation of the outage, plenty of rumor and innuendo have swirled about its actual cause. Puzzlement centers on exactly why the router was misconfigured, and on how Facebook could have avoided the problem if it really was an inadvertent mistake.

“In simpler terms, sometime this morning Facebook took away the map telling the world’s computers how to find its various online properties,” wrote Brian Krebs in a blog post citing several experts.

Krebs also cites sources who said the outage may have been prolonged because IT workers could not get into the physical buildings on Facebook’s campus: the outage had taken down the internal systems that validate their access credentials.

This scenario is backed up by Santosh Janardhan’s update. “[T]hese facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.”

BGP to Blame?

Several experts blame BGP for yesterday’s outage. After all, it was BGP, not the DNS servers, that initiated the failure. “This wasn't a DNS issue itself, but failing DNS was the first symptom we'd seen of a larger Facebook outage,” wrote Cloudflare (NYSE: NET) experts Celso Martinho (engineering director) and Tom Strickx (edge network technical lead) in a post on Cloudflare’s site.

Others say BGP is an outdated protocol whose time has passed. “We were using BGPv4 in 1996, and we’re still using the same version now today, twenty-five years later, with literally no significant improvement or development to the protocol,” wrote Bill Woodcock, executive director of Packet Clearing House, a non-governmental, nonprofit organization that establishes Internet exchange points and manages security for the Internet’s Domain Name System. He also cited attrition among BGP engineering experts.

BGP was also blamed for the CenturyLink outage a year ago that crashed a range of big sites worldwide that rely on its services. In that snafu, a BGP extension called Flowspec was misconfigured. Flowspec lets routers distribute traffic-filtering rules to their peers, typically to mitigate distributed denial of service (DDoS) attacks and other security woes.
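The filtering idea behind Flowspec can be sketched in a few lines. This is a toy model, not the protocol itself (real Flowspec rules are carried between routers as BGP routing information, and the addresses below are made up):

```python
# Toy sketch of Flowspec-style traffic filtering.

def matches(rule, packet):
    """True if every field named in the rule matches the packet."""
    return all(packet.get(field) == value for field, value in rule.items())

def apply_flowspec(rules, packet):
    """Return the action of the first matching rule, else forward."""
    for rule, action in rules:
        if matches(rule, packet):
            return action
    return "forward"

# A rule mitigating a UDP flood aimed at a DNS server (example address):
rules = [({"dst": "192.0.2.53", "proto": "udp", "dst_port": 53}, "discard")]

attack = {"dst": "192.0.2.53", "proto": "udp", "dst_port": 53}
normal = {"dst": "192.0.2.80", "proto": "tcp", "dst_port": 443}

print(apply_flowspec(rules, attack))  # discard
print(apply_flowspec(rules, normal))  # forward
```

A misconfigured rule — say, one that matches far too broadly — gets propagated to peers and silently discards legitimate traffic, which is roughly the shape of the CenturyLink incident.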

Of course, no one has missed the odd coincidence that Facebook’s systems crashed hours after whistleblower Frances Haugen, a former employee, took Facebook to task on national television for putting profits ahead of user safety. The fallout from that interview on CBS’s “60 Minutes” continues to goad politicians worldwide into a chorus of Facebook criticism, even as Haugen testifies to a U.S. Senate subcommittee today.

Whether there was any foul play behind the outage may never be clear. “We don’t know how or why the outages persist at Facebook and its other properties,” wrote Brian Krebs, “but the changes had to have come from inside the company, as Facebook manages those records internally. Whether the changes were made maliciously or by accident is anyone’s guess at this point.”

Could This Outage Have Been Avoided?

All of this prompts questions about how this outage could have been avoided. Why, for instance, were all of Facebook’s systems running in-house? Why were they so inaccessible in a crisis? Why was the audit tool buggy? Why was a single engineer fiddling with global backbone capacity?

At the very least, the Facebook crash should serve as a wake-up call – and not just for Facebook. Outages like CenturyLink’s have demonstrated that many providers’ systems operate along similar lines, even in this age of zero trust and cyberwarfare.

As Santosh Janardhan wrote today: “Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one.”