Why Twitter Didn’t Go Down: From a Real Twitter SRE
Twitter supposedly lost around 80% of its workforce. Whatever the real number is, there are whole teams without any engineers on them now. Yet the website goes on and the tweets keep coming. This left a lot of people wondering what exactly all those engineers were doing, and made it seem like it was all just bloat. I’d like to explain my little corner of Twitter (though it wasn’t so little) and some of the work that went on that kept this thing running.
Background and History
For five years I was a Site Reliability Engineer (SRE) at Twitter. For four of those years I was the sole SRE for the Cache team. There were a few before me, and a bunch of people on the wider team came and went while I worked there. But for four years I was the one responsible for automation, reliability and operations in the team. I designed and implemented most of the tools that are keeping it running, so I think I’m qualified to talk about it. (There might be only one or two other people who are.)
A cache can be used to make things faster or to take requests off something that is more expensive to run. If you have a server that takes 1 second to respond, but it’s the same response every time, you can store that response in a cache server where it can be served in milliseconds. Or, if you have a cluster of servers where serving 1000 requests a second might cost $1000, you can instead store the responses in a cache and serve them from there. Then you would have a small cluster for $100 and a cheap and large cache cluster of servers maybe for another $100. The numbers are just examples to illustrate the point.
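To make that concrete, here is a minimal sketch of a cache sitting in front of an expensive backend. It is purely illustrative: the class name, the `slow_backend` function and the key format are made up for this example and have nothing to do with Twitter's actual cache software.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Tiny in-memory cache in front of an expensive backend call."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self._backend = backend
        self._store: Dict[str, str] = {}
        self.backend_calls = 0  # how often we had to pay the expensive cost

    def get(self, key: str) -> str:
        # Cache hit: answer from memory without touching the backend.
        if key in self._store:
            return self._store[key]
        # Cache miss: ask the backend once and remember the answer.
        self.backend_calls += 1
        value = self._backend(key)
        self._store[key] = value
        return value


def slow_backend(key: str) -> str:
    # Hypothetical stand-in for the server that takes about a second to respond.
    return f"response for {key}"


cache = ReadThroughCache(slow_backend)
for _ in range(1000):
    cache.get("timeline:42")

print(cache.backend_calls)  # 1
```

A thousand requests reach the cache, but only the first one reaches the expensive server. A real cache also needs expiry and invalidation, which this sketch leaves out.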
The caches took on most of the traffic the site saw. Tweets, all of the timelines, direct messages, advertisements, authentication: all were served out of the Cache team’s servers. If something went wrong with Cache, you as a user would know; the problems would be visible.
When I joined the team, the first project I had was to swap old machines that were being retired for new machines. There were no tools or automation to do this; I was given a spreadsheet with server names. I am happy to say operations on that team are not like that anymore!
How the caches keep running
The first big point that is keeping the caches running is that they are run as Aurora jobs on Mesos. Aurora finds servers for applications to run on; Mesos aggregates all the servers together so Aurora knows about them. Aurora will also keep applications running after they are started. If we say a cache cluster needs 100 servers, it will do its best to keep 100 running. If a server completely breaks for some reason, Mesos will detect this and remove the server from its aggregated pool. Aurora will then be informed that there are only 99 caches running and know it needs to find a new server from Mesos to run on. It will automatically find one and bring the total back to 100. No person needs to get involved.
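The "keep 100 running" behavior is a reconciliation loop: compare what you want with what you have and fill the gap. Here is a rough sketch of one pass of that idea. The function and host names are hypothetical, and this is not how Aurora or Mesos are actually implemented; it only mirrors the behavior described above.

```python
from typing import List, Set


def reconcile(desired: int, running_hosts: List[str], healthy_pool: Set[str]) -> List[str]:
    """Return the hosts that should run cache instances after one pass."""
    # Instances on hosts that dropped out of the healthy pool are gone.
    alive = [host for host in running_hosts if host in healthy_pool]
    # Healthy hosts that are not already running an instance.
    spare = sorted(healthy_pool - set(alive))
    # Schedule replacements until we are back at the desired count.
    missing = max(0, desired - len(alive))
    return alive + spare[:missing]


pool = {"host-1", "host-2", "host-3", "host-4", "host-5"}
running = ["host-1", "host-2", "host-3"]

# host-2 breaks and is removed from the pool of known-good servers.
pool.discard("host-2")

running = reconcile(desired=3, running_hosts=running, healthy_pool=pool)
print(running)  # ['host-1', 'host-3', 'host-4']
```

If the pool runs out of spare hosts the result is simply shorter than `desired`, which matches the "it will do its best" wording: the next pass tries again.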
In a data center, servers are put into things called racks. Servers on racks are connected to other servers on racks through a device called a switch. From here there is a whole complex system of connections from switches to more switches and routers, and eventually out to the internet. A rack can hold somewhere between 20 and 30 servers. A rack can fail: the switch can break or maybe a power supply dies, and this takes down every server on the rack at once. One more nice thing Aurora and Mesos do for us is ensure that not too many applications will be put on a single rack. So a whole rack can suddenly go down safely, and Aurora and Mesos will find new servers to be homes for the applications that were running there.
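A per-rack limit is easy to picture in code. The sketch below is an assumption-laden toy, not Aurora's scheduler: the `place` function, the host names and the `max_per_rack` parameter are all invented for illustration. It picks hosts for a job while refusing to put more than a fixed number of instances on any one rack.

```python
from collections import Counter
from typing import Dict, List


def place(instances: int, host_to_rack: Dict[str, str], max_per_rack: int) -> List[str]:
    """Pick hosts for `instances` copies, at most `max_per_rack` per rack."""
    per_rack: Counter = Counter()
    chosen: List[str] = []
    for host in sorted(host_to_rack):
        if len(chosen) == instances:
            break
        rack = host_to_rack[host]
        # Skip hosts on racks that already hold their share of this job.
        if per_rack[rack] < max_per_rack:
            chosen.append(host)
            per_rack[rack] += 1
    if len(chosen) < instances:
        raise RuntimeError("not enough hosts on distinct racks")
    return chosen


hosts = {
    "a1": "rack-a", "a2": "rack-a", "a3": "rack-a",
    "b1": "rack-b", "b2": "rack-b",
    "c1": "rack-c",
}

print(place(4, hosts, max_per_rack=2))  # ['a1', 'a2', 'b1', 'b2']
```

With this placement, losing any single rack takes out at most two of the four instances, no matter how many free servers that rack had.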
That spreadsheet mentioned before was also tracking how many servers were on each rack, and the spreadsheet writer tried to make sure there weren’t too many. Now, when we provision new servers to make them live, we have tools that track all of this. Those tools make sure the team doesn’t have too many physical servers on a rack and that everything is distributed in a way that won’t cause problems if there are failures.
Mesos doesn’t detect every server failure, unfortunately, so we have extra monitoring for hardware issues. We look for things like bad disks and faulty memory. Some of these won’t take down a whole server but might just make it run slow. We have a dashboard of alerts that gets scanned for broken servers. If one is detected to be broken, we automatically create a repair task for someone in the data center to go look at it.
One more piece of important software that the team has is a service that tracks the uptime of cache clusters. If too many servers have been seen as down in a short period of time, new tasks that require taking down a cache will be rejected until it is safe. This is how we avoid accidentally taking down entire cache clusters and overwhelming the services that are protected by them. We have stops in place for things like too many servers going down too quickly, too many being out for repair at one time, or Aurora not being able to find new servers to place old jobs on. To create a repair task for a server detected as broken, we first check with that service whether it is even safe to remove jobs from the server; then, once the server is empty, it is marked safe for a data center technician to work on. When the technician in the data center marks the server as fixed, we again have tools that look for this and automatically activate the server so it can run jobs. The only human needed is the person in the data center actually fixing it. (Are they still there though?)
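The three stops described here can be sketched as a small gate that every "take this cache down" request has to pass. The class, its thresholds and its method names are hypothetical; the real service is not public and certainly does more than this.

```python
from collections import deque
from typing import Deque


class DrainGate:
    """Decides whether it is safe to take one more cache server down."""

    def __init__(self, max_recent_down: int, window_seconds: float, max_in_repair: int) -> None:
        self.max_recent_down = max_recent_down
        self.window_seconds = window_seconds
        self.max_in_repair = max_in_repair
        self.in_repair = 0  # servers currently handed to technicians
        self._down_events: Deque[float] = deque()  # timestamps of observed downs

    def record_down(self, now: float) -> None:
        self._down_events.append(now)

    def _recent_down(self, now: float) -> int:
        # Forget events that have aged out of the window.
        while self._down_events and now - self._down_events[0] > self.window_seconds:
            self._down_events.popleft()
        return len(self._down_events)

    def may_drain(self, now: float, replacements_available: bool) -> bool:
        if self._recent_down(now) >= self.max_recent_down:
            return False  # too many went down too quickly
        if self.in_repair >= self.max_in_repair:
            return False  # too many already out for repair
        if not replacements_available:
            return False  # the scheduler has nowhere to move the jobs
        return True


gate = DrainGate(max_recent_down=2, window_seconds=600, max_in_repair=1)
gate.record_down(now=10)
gate.record_down(now=20)

print(gate.may_drain(now=30, replacements_available=True))   # False
print(gate.may_drain(now=700, replacements_available=True))  # True
```

Time is passed in explicitly so the behavior is easy to reason about: at `now=30` two servers went down in the last ten minutes, so the request is rejected; by `now=700` both events have aged out and the drain is allowed.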
Repeated application issues were also fixed. We had bugs where new cache servers wouldn’t be added back (a race condition on start up) or where it sometimes took up to 10 minutes to add a server back (O(n^n) logic). Since we weren’t bogged down by manual tasks, thanks to all of this automation work, we could develop a culture in the team where we could go and fix these while keeping projects on track. We had other automatic fixes too. For example, if an application metric like latency was an outlier for some task, we automatically restarted that task so an engineer wouldn’t get paged. The team would maybe get one page a week, almost never critical. We frequently had on-call rotations where no one got paged.
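As an illustration of the outlier restart idea, here is one simple way to flag tasks whose latency is far from the rest of the cluster. The rule used here (more than three times the median) and all the names are my assumptions for the example; it is not the rule the team's tooling actually used.

```python
from statistics import median
from typing import Dict, List


def latency_outliers(latency_ms: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Return tasks whose latency exceeds `factor` times the cluster median."""
    typical = median(latency_ms.values())
    return sorted(task for task, ms in latency_ms.items() if ms > factor * typical)


# Hypothetical per-task latencies for one cache cluster, in milliseconds.
observed = {"cache-0": 1.1, "cache-1": 0.9, "cache-2": 1.0, "cache-3": 9.5}

for task in latency_outliers(observed):
    # A real system would ask the scheduler to restart the task here.
    print(f"restart {task}")
```

Comparing against the median instead of the mean keeps one very slow task from dragging the baseline up and hiding itself.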
Capacity planning was also one of the more important reasons why the site hasn’t gone down. Twitter has two data centers running, and each one can handle the entire site being failed over into it. Every important service can be run out of a single data center. The total capacity available at any time is actually 200%. This is only for disaster scenarios; most of the time both data centers are serving traffic, so data centers are at most 50% utilized. Even this would be busy in practice. When people calculate their capacity needs, they figure out what is needed for one data center serving all traffic, then normally add headroom on top of that! There is a ton of server headroom available for extra traffic as long as nothing needs to be failed over. An entire data center failing is pretty rare; it only happened once in my five years there.
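The arithmetic behind "at most 50% utilized" is worth spelling out. In this sketch every number (peak traffic, per-server throughput, 20% headroom) is invented for illustration and does not reflect Twitter's real figures.

```python
import math


def servers_per_datacenter(peak_rps: int, rps_per_server: int, headroom_percent: int) -> int:
    """Size ONE data center to carry all traffic, plus headroom."""
    return math.ceil(peak_rps * (100 + headroom_percent) / (100 * rps_per_server))


def utilization(peak_rps: int, rps_per_server: int, servers_per_dc: int, datacenters_serving: int) -> float:
    """Fraction of serving capacity in use at peak."""
    return peak_rps / (datacenters_serving * servers_per_dc * rps_per_server)


PEAK_RPS = 1_000_000      # hypothetical site-wide peak
RPS_PER_SERVER = 10_000   # hypothetical per-server throughput

per_dc = servers_per_datacenter(PEAK_RPS, RPS_PER_SERVER, headroom_percent=20)
print(per_dc)  # 120

# Normal operation: both data centers share the traffic.
print(f"{utilization(PEAK_RPS, RPS_PER_SERVER, per_dc, 2):.0%}")  # 42%

# Disaster: everything failed over into one data center.
print(f"{utilization(PEAK_RPS, RPS_PER_SERVER, per_dc, 1):.0%}")  # 83%
```

Because each data center is sized for the whole site plus headroom, normal peak load lands under 50%, and even a full failover leaves the surviving data center below 100%.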
We also kept cache clusters separate. We didn’t have multi-tenant clusters that served everything and relied on application-level isolation. This helps because if one cluster is having an issue, the blast radius is limited to that cluster and maybe some co-located servers on the same machine. Again, Aurora helps here by keeping caches distributed so there won’t be a lot affected, and eventually monitoring will catch up and fix them.
What would you say ya do here?
Well, I did everything above! I did talk to the customers (teams that used cache as a service). After things were automated, I automated more. I also worked on interesting performance issues, experimented with technology that might make things better and drove some large cost savings projects. I did capacity planning and determined how many servers to order. I was pretty busy. Despite what some think, I wasn’t just collecting paychecks while playing video games and smoking weed all day.
That’s how the caches that are serving Twitter requests are staying up and running. This is only a part of what day to day operations are like. It was a lot of work over the years to get to this point. This is a moment to step back and appreciate that this thing is still actually working!
Well for now at least, I’m sure there’s some bugs lurking somewhere...