This is a lightning talk I gave at the fantastic Mailchimp Chats meetup on May 21st, 2019. The theme of the evening was “failure and possible redemption in software companies.” I chose a topic I’d already been thinking about: what we expect from design documents, and how much they can tell us about weaknesses in our systems. I don’t mean massive requirements specs, just the kind of RFCs that many of us write to explain what we want to do. What can we learn about failure from reading these documents?
I don’t really talk about it in here, but I think design documents catch more organisational problems than technical ones. They let us understand what everyone else is doing and why. They set our expectations for supported use cases. The technical weaknesses we catch are often along the lines of “you’re depending on a system whose team provides only best-effort support” or “that service is about to be deprecated” or “are you sure we have enough capacity?”, all of which have as much to do with humans as with technology.
But design documents can also tell us how the system is intended to behave, and that means we can try to guess how it will fail. Somewhat. A bit :-) That’s what this talk is about.
I find distributed systems fascinating. I've sometimes said that one of my hobbies is predicting how systems will fail. But can you really?
I was really excited when Sarah and Richard told me this was a meetup about failure, because I love software failure. I’ve been a systems administrator and a site reliability engineer, and I've worked on systems engineering projects for my whole career. I have seen many things fail and I have caused many things to fail.
If you were at SRECon this year, you probably saw Laura Nolan's fantastic keynote, called What Breaks Our Systems: A Taxonomy of Black Swans, where she talked about the kinds of events that trigger problems you don't know you have, in ways that you can't predict, until they blow up on you.
Her taxonomy was: hitting limits, hidden or otherwise; slowness that spreads; thundering herds; automation getting out of control; cyberattacks; and dependency problems. And she's got great advice for how to think about each of them. It's a wonderful talk.
These black swan events that she mentions are, by definition, unpredictable. When something big like this happens though, we can find ourselves asking "shouldn’t we have seen this coming?"
Could we have seen this coming?
I mean, we do a bunch of systems design review, architecture review, code review, unit tests, integration tests, regression tests, smoke tests, seventeen other types of tests. We invest in observability at every layer of the stack. We do chaos engineering. We put metrics on our metrics. There is so much data! Couldn't we have caught this before it happened, if we'd just known where to look?
If we go back to the original design documents, for example, shouldn't we be able to see the narrative foreshadowing of this event, some set of facts that the camera might have paused meaningfully at as it panned over the system?
It's a good question. And I have strong opinions on it which are...
That, you know, yeah, we can find some potential failures by staring really hard at design documents.
But also that no, that's only for the easiest cases and mostly you can't.
But also that there's a way in which the answer is actually maybe, yes you totally can. Sort of.
As you can tell, I feel strongly about this. It's all happening here this evening.
Design documents (also known as RFCs) are documents we put together when building a new system or making a major change. They say, approximately, "here is what I'm going to do and what the end result will look like and why".
I really enjoy reading design documents. It's one of the few occasions where I get dedicated time to sit down and stare at a system and think about the whole thing, even though it's not currently broken. Even better than not broken, it doesn’t even exist! It’s the platonic ideal of a system!
I have a whole process for design documents. I put dedicated time in my calendar and I print the thing out and take a pen, I make some tea, and go hide in a phone room to read it.
I usually book an hour. If you send me a document to review and it's clear and I can understand it in a lot less than an hour, I'll feel generally more favourable towards you. (The, uh, inverse of that is also true.)
Then I try to build a model in my brain of the system in the design doc. I really try to understand exactly what problem the system is solving, what context there is, why it needs to exist at all.
And then once I understand what the system is trying to achieve, I try to load the whole thing up into my brain.
I want to build a really solid mental model of the system in the design.
What new components are being added? What existing components are being used? What's talking to what? Where does the data sit, and in what direction does it flow? What are the hidden dependencies? How much will it die if DNS goes away? What's scaling horizontally vs vertically? What's leader-elected? How many milliseconds can anything be from anything else?
At this point I'm not even reviewing, I'm just building a model in my brain. Sometimes this is easy and sometimes it really isn't. Design documents with pictures are superior to design documents without pictures. They make it much easier to turn the document into brain data structures.
If it's a topic I don't know much about, maybe I'm reading Wikipedia at the same time to try to understand the parts I don’t understand.
I don't know if this is a weird way to do it. I have had so few conversations with other people about how they review design documents. Isn't that weird? We should talk about this!
And then once I feel like I have a good model of the system in my brain, I try to break it. It's like a kind of weird half-assed static analysis for designs!
I mentally try to kill every component, one at a time: what happens if this thing dies. Does that cause chaos?
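If the design doc gives you a dependency graph, you can even half-mechanise that sweep. Here's a toy sketch (the component names, and the rule that every dependency is a hard one, are both invented for illustration) that kills each component in turn and checks whether the user-facing entry point survives:

```python
# A toy single-failure sweep over a design's dependency graph.
# Components and edges are made up; swap in the ones from the doc.
DEPS = {
    "web": {"api"},
    "api": {"db", "cache", "auth"},
    "auth": {"db"},
    "cache": set(),
    "db": set(),
}

def survives(entry, dead):
    """True if `entry` can still do its job with component `dead` down.

    Treats every dependency as a hard dependency, which a real
    review should immediately question.
    """
    if entry == dead:
        return False
    return all(survives(dep, dead) for dep in DEPS.get(entry, set()))

# Kill each component in turn and see whether "web" stays up.
for component in DEPS:
    status = "ok" if survives("web", component) else "DOWN"
    print(f"kill {component}: web is {status}")
```

Real reviews would distinguish hard from soft dependencies; this version flags the cache as a single point of failure, which is exactly the kind of question worth asking the author.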
I add slowness in various places and try to imagine what will happen. Will this thing blow out its number of connections or its ram and fall over? Will things retry and stampede and take our dependencies down? (Do our dependency owners know we're coming?)
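The standard defence against that retry stampede is to cap and jitter the retry delays, so clients that failed together don't all come back together. A minimal sketch of full-jitter exponential backoff (the parameter values are made up; tune them for your system):

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=5):
    """Yield full-jitter exponential backoff delays, in seconds.

    Each retry waits a random time between 0 and min(cap, base * 2**n),
    so a crowd of clients that failed at the same moment won't all
    retry at the same moment too.
    """
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

# One client's retry schedule after a failure:
for delay in backoff_delays():
    print(f"sleep {delay:.2f}s, then retry")
```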
I look at the safety systems -- the load balancing, the fallback plan, any automated recovery -- and I mentally toggle a switch so that it goes berserk, and I check whether it can cause problems that are worse than the things we're setting out to prevent. Resiliency robots sometimes go wild and eat your face. It is known.
I imagine the people who have access to this system and play a puzzle game: how could they accidentally or intentionally take it down? What commands could they type that would break the system?
I think all of this is part of a thorough review; I take this pretty seriously in the same way that I'll mentally try to break code when I'm reviewing it.
If I find weaknesses or other reviewers do, the author can either improve the design, or note in the document that the thing is a risk but an acceptable risk. Design document templates should have "Risks" and "Alternatives considered" sections to spell out the things we've already thought about and decided not to care about.
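As a sketch, those sections might look something like this (the specific risks are invented; the point is that each one is named and explicitly accepted or mitigated):

```markdown
## Risks
- The recovery automation can itself take the service down if its
  health signal goes bad. Accepted; mitigated by a rate limit on
  automated actions and an alert when it triggers.
- Our capacity estimate assumes current traffic plus 50%. If growth
  is faster, we re-shard. Accepted.

## Alternatives considered
- Extending the existing batch pipeline: rejected, because it can't
  meet the latency target.
```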
Calling out risks is often enough to change how we think about the design. For example, if we accept that there's a possibility that our robots will eat our faces, then even if we decide that's an acceptable risk, we might monitor them differently, or make them a little less powerful.
But lots of times I can't see anything that will obviously break. Very likely the system designer has run through this same sort of checklist already. I mean, I suspect they're not thinking of it as a weird puzzle simulation game with robots who want to destroy us, but they're probably thinking about how to build a resilient system.
Lots of times I can't find a problem. But that doesn't mean I don't think it will break.
It means I do think it will break but I don't personally know how it will break.
It turns out that complex systems are complex.
As Ellen Ullman pointed out, we build our new computer systems on top of our old computer systems. We don't clear the ground and start again. I love this quote: "over time, without a plan, on top of ruins". It's so evocative.
Every new computer system is a layer on top of a ton of other computer systems: whether it's built on the lowest level hardware or decades of a company's cruft or the incomprehensibly massive distributed systems that make up the public clouds.
We try to get a handle on the complexity by building software abstractions, and that works a bit, but …
as Joel Spolsky told us in 2002, all of our abstractions are leaky and so they don't really simplify our lives as much as they were meant to.
In theory everything’s a nice clean abstraction, but in practice weird network bugs and kernel settings can still eat your face in 2019. If you haven't accounted for the fact that a kernel setting has flipped to power-saving mode, or that every machine on your network polls for operating system patches at 3:48pm, or that someone hardcoded a GCP region name into a health check in a rarely used Ansible playbook three years ago and Google's about to turn that region down, then all bets are off. There are too many variables.
When we build mental models on top of the software abstractions, we can't ignore the underlying details. They can affect our lives.
Even if we perfectly interpreted the words that the author of the design document wrote, which we probably didn't, our mental model of the system is still wrong because we don't have a good model of the systems under it.
Even if you do think you have a model of every single thing, you can be surprised.
My favourite post-mortem of all time is the one Fran Garcia wrote about Hosted Graphite being taken down by an AWS outage. The deal, of course, was that they didn't use AWS, so they were quite surprised to be affected by an AWS outage. But a lot of their users do use AWS, and when it had connectivity issues, those users' connections became unusually slow all at once; usually-short-lived connections stayed open until they hit a connection limit in the load balancer and prevented anyone else from connecting. Amazing.
Good luck predicting that one!
If we're using mental models to catch problems, we will probably find and catch some things, but we won't catch all the things and we definitely won't catch the subtle things. And that's fine. It would be a waste of time and energy to try to have our models include everything in the universe.
But what we can do is assume the subtle things are there somewhere! Maybe you don't know what your hidden disasters are and you can't find them, but you can assume that they're there. Like rats in a subway station.
Some failure is ok. To quote the great Liz Fong-Jones, "We don't need every blade of grass on a lawn to be green". Once you make peace with constant failure, what actually matters is how quickly we detect and recover from failures. (And also how small and isolated we can make each failure be, but that's a whole other talk.)
If we agree that, even if we don't know WHAT is going to fail, we know something is going to fail, then we can prepare for unexpected things. Because they're expected unexpected things. Uh. Grammar gets difficult here.
A huge part of any outage is the "debugging" stage, where a human who just got paged is staring at whatever information they have available, trying to understand what's happening. And also trying to understand what should be happening. Every minute of that stage extends the outage. So we can give ourselves a head start by making it as easy as possible for people to understand what's going on. We can make sure those future-people can quickly build good mental models.
Unfortunately the classic way we build mental models is to be paged by things.
It's actually illegal to get this far into a talk about system failures without mentioning John Allspaw, so I'll mention that he says incidents direct attention to where our mental models need recalibration. Every time there's an outage, it's an opportunity to improve everyone's understanding of the system.
You never learn as much as when something's broken. Being paged while you're on call is a gift. It's a weird gift from someone who doesn't know you well and who you maybe regret inviting to the party, but a gift all the same.
But ideally you want everyone to already have the knowledge before the big one comes. You can update your mental models in peacetime by running wheel-of-misfortune exercises, which are incident simulations where you roleplay an outage. You can get some of the same learning from chaos engineering exercises.
But you also get some when you're reviewing a system design document. It can be a good opportunity to update your understanding of the state of the world. Design docs hopefully include information about the existing systems that are being used.
As you build the mental model of the system, you can watch out for how difficult it is to understand. I've seen people review docs and say they're ok with them and then later, when you ask a question, say "oh, I didn't understand that bit but it seemed fine."
We're not great at saying "I don't understand" and I think we need to get great at it.
Because when we say "this is hard to follow", we're not just saying "I don't understand this", we're saying "this will be a liability during an outage".
If the design is very complicated, people aren't going to be able to understand what's going on, and your outages are going to be longer. Maybe this is an acceptable risk, maybe you don't need to fix it, but it's a risk worth writing down. If your complicated design is going to eat your face, maybe you monitor it differently. Maybe you write a very short document with not many words and one big picture and link to it from every alert.
As Cindy Sridharan says in her article, "Effective Mental Models for Code and Systems", understandability should be our highest priority.
And if you don't understand the system while sitting in a quiet room with a cup of tea reviewing a document that is intended to describe the system...
...think how it's going to be for someone who's woken up at 2am by a pager.
Simulate the late night on call experience. Get design documents reviewed by people who don't know the system and people who will be willing to admit what they don’t understand.
The system will fail for one reason or another, so we should predict that it will happen and prepare for it. And that's all I have, thank you.
What Breaks Our Systems: A Taxonomy of Black Swans, by Laura Nolan. Also there’s an accompanying article.
(Warning: autoplaying video) Liz Fong-Jones’s talk “Why production is complex, and how to detangle and understand it”
Effective Mental Models for Code and Systems, by Cindy Sridharan
The Law of Leaky Abstractions, Joel Spolsky
Recalibrating Mental Models Through Design of Chaos Experiments, by John Allspaw in InfoQ eMag: Chaos Engineering
(Warning: terrifyingly ad-heavy page) The dumbing-down of programming, by Ellen Ullman.