Facebook Disconnected Itself from the Internet and the World Didn't Stop - a Rundown of What Happened
We're still here, aren't we?
Yesterday, the few social networks that remain outside of Facebook’s grasp were buzzing with a sudden influx of temporary users and a flood of messages about the fact that Facebook, Instagram and WhatsApp, three of the biggest social networks in the world, were out of service and nowhere to be found.
What happened? Let’s find out.
The TL;DR explanation
Long story short, they accidentally released a new version of some internet routing rules that determine how networks communicate with each other.
This update essentially disconnected them from the internet, so every time someone typed “facebook.com” or “instagram.com” into their browser, or their mobile app tried to reach the servers, the internet answered with a loud and clear “I have no idea what that is, make sure you wrote the address correctly and try again”.
The slightly longer explanation
There is no “official” version of why the outage happened. The truth might only be known to a few within the organization. What the rest of us know comes from the evidence other network-related companies are publishing. So let’s talk about that.
As it turns out, someone (or something) messed with a core communication protocol that underpins the way the internet works, and it left most of Facebook’s relevant servers unreachable.
Let me put it this way: the internet is made up of tens of thousands of smaller networks, usually controlled by ISPs and big companies, such as (you guessed it) Facebook. These smaller networks (known as ASes, or Autonomous Systems) pass information to one another: they route the data inside the mega-network that is the internet.
— But Fernando, how can they do that? — You ask. Fantastic question!
ASes know about each other through a protocol called BGP (Border Gateway Protocol, but let’s be honest, knowing what BGP stands for doesn’t really clear things up). Each AS controls a subset of public IPs, and through BGP it learns how to reach the other subsets.
As you probably know, when you issue a request for a website, that request doesn’t go directly from your computer to the server where the website is located. There are many nodes between the source of that request and the destination and each one needs to understand where to redirect the request to make sure it reaches the target destination.
Look at the above diagram and imagine that each cluster of circles is an AS. You’re sitting on the green circle, sending a request targeted at the red one (facebook.com). Through BGP, each AS knows where to send the data next (the red path). Each hop the data takes is chosen based on a lot of variables, such as timing, cost (that’s right, ASes normally charge each other for traffic), the number of hops remaining, and others.
How does each AS know about the others? They constantly update each other about changes. The protocol lets an AS tell the others about the prefixes (i.e., groups of IPs) that it now accepts or no longer accepts. And here is where the problem begins.
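To make the idea of hop-by-hop path selection a bit more concrete, here is a minimal Python sketch. It assumes a made-up graph of networks (the names `AS_GREEN`, `AS_RED` and friends are hypothetical, as is the `shortest_as_path` helper) and simply picks the route with the fewest AS hops using a breadth-first search. Real BGP weighs many more factors, such as local policy and cost, so treat this as an illustration of the idea and not as how routers actually decide.

```python
from collections import deque

# Hypothetical AS-level graph: each key is an AS, each value lists its neighbors.
AS_LINKS = {
    "AS_GREEN": ["AS_A", "AS_B"],
    "AS_A": ["AS_GREEN", "AS_C"],
    "AS_B": ["AS_GREEN", "AS_C", "AS_D"],
    "AS_C": ["AS_A", "AS_B", "AS_RED"],
    "AS_D": ["AS_B"],
    "AS_RED": ["AS_C"],
}


def shortest_as_path(links, source, target):
    """Return the path with the fewest AS hops, or None if the target is unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == target:
            return path
        for neighbor in links.get(current, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Assumption: hop count is the only criterion in this toy model.
print(shortest_as_path(AS_LINKS, "AS_GREEN", "AS_RED"))
```

Asking for a destination that no AS in the graph knows about returns `None`, which is roughly the situation the rest of this story is about.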
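If you want to see what “announcing” and “withdrawing” a prefix could look like, here is a small Python sketch. The `RoutingTable` class, the AS names and the prefixes are all hypothetical (the IP ranges are the ones reserved for documentation), and this is a toy model of a single router’s view, not real BGP. It only shows the core idea: an announced prefix can be looked up, and a withdrawn one simply stops matching.

```python
import ipaddress


class RoutingTable:
    """Toy table mapping announced prefixes to the AS that announced them."""

    def __init__(self):
        self.routes = {}  # ip_network -> name of the announcing AS

    def announce(self, prefix, origin_as):
        """Record that origin_as accepts traffic for this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget the prefix; withdrawing an unknown prefix is a no-op."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Longest-prefix match: the most specific covering prefix wins."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None  # nobody claims this address anymore
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("203.0.113.0/24", "AS_SOCIAL")  # hypothetical owner
table.announce("198.51.100.0/24", "AS_OTHER")  # hypothetical owner

print(table.lookup("203.0.113.10"))  # route known
table.withdraw("203.0.113.0/24")
print(table.lookup("203.0.113.10"))  # route gone
print(table.lookup("198.51.100.7"))  # everyone else is unaffected
```

Notice that the withdrawal only affects the prefix that was withdrawn; lookups for the other network keep working.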
What did Facebook do?
Reading other, more technical sources, it looks like Facebook, for some reason, started broadcasting updates to other ASes that essentially made its own networks unreachable.
They were taking the list of IPs they were responsible for off the table. “We no longer want them,” they said. Of course, this was a mistake, but computers don’t know about mistakes; they just know about protocols and commands.
Look at the above chart. That’s the moment Facebook started broadcasting to the world the withdrawal of their IP prefixes. This is information taken from Cloudflare’s very technical explanation of what happened. If you’re into networking, I’d recommend hopping over there for more details.
If we were to look again at our fake internet graph from before, it would now look like this:
The rest of the internet was working fine, but now a lot of requests didn’t know where to go. A route was broken, if you will, but the world didn’t end.
Why did this happen?
Alright, even if Facebook is not sharing the details of WHAT happened (their official statement is a bit bland, to say the least), thanks to the nature of how the internet works, we’re able to understand it. The outside view of other companies, such as Cloudflare in this instance, gives us enough detail to get that part.
However, the WHY remains a mystery.
There are theories, though.
For instance, a few days ago a whistleblower (a former Facebook manager) told CBS that the company had contributed to the U.S. Capitol invasion of Jan. 6 by removing the safeguards against misinformation after Joe Biden defeated Donald Trump in the elections. And a few days later, this happens? Coincidence?
I mean, I’m not saying this was a targeted attack, but if you think about it, why did Facebook take so long (roughly six hours) to resolve the issue? It looks like they were locked out of their own systems. Digitally by the problem and physically as well by… what?
This tech reporter claims to have received tips about Facebook employees locked out of their buildings due to their security badges no longer working:
Other news outlets such as The Verge claim to have received information about Facebook employees having been physically sent to a location in California to solve the issue.
Was this an attack? Did they lose private data to hackers? Was there an intern involved in merging a typo in the BGP rules? No one knows, and I don’t think we ever will.
Services are up and running though, so we can go back to enjoying life as we know it!
Granted, this problem ended up affecting DNS as well. Since Facebook’s nameservers were no longer reachable, DNS resolvers around the world kept returning errors for its domains, making sure that no one could possibly find them.
Because again, that’s how the internet works: it’s a group of smaller groups, all talking with each other, sharing information and, most importantly, intrinsically trusting that information.
We can see how a “simple mistake,” like what happened yesterday to Mark’s playground, ended up as a problem that affected the world.
Mind you, this time the world was only cut off from a few social networks, but what if this intrinsic trust left us without other, more vital services?
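Here is a rough Python sketch of why unreachable nameservers turn into DNS errors. Everything in it is hypothetical: the domain, the IPs (taken from the documentation ranges), the `resolve` helper and the set of “reachable” addresses that stands in for the routing layer. It is a heavily simplified model of a resolver and makes no real network calls.

```python
# Hypothetical data: which nameserver is authoritative for a domain,
# and what that nameserver would answer if we could reach it.
AUTHORITATIVE_NS = {"example-social.test": "203.0.113.53"}
ZONE_DATA = {"203.0.113.53": {"example-social.test": "203.0.113.10"}}


def resolve(domain, reachable_ips):
    """Return (status, answer) the way a simplified resolver might."""
    ns_ip = AUTHORITATIVE_NS.get(domain)
    if ns_ip is None:
        # Simplification: no known nameserver means the name does not exist.
        return ("NXDOMAIN", None)
    if ns_ip not in reachable_ips:
        # The nameserver exists, but there is no route to it.
        return ("SERVFAIL", None)
    return ("NOERROR", ZONE_DATA[ns_ip][domain])


# Before the withdrawal: the nameserver's IP is routable.
print(resolve("example-social.test", {"203.0.113.53"}))

# After the withdrawal: same domain, same nameserver, but no route to it.
print(resolve("example-social.test", set()))
```

The domain’s records never changed in this model. The only thing that changed is that nobody could reach the server that holds them.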