While thinking a lot about dependencies and disaster plans, I started noticing the fire escapes that are a staple of New York City buildings. The result was a talk called The History of Fire Escapes.
Reference material: http://noidea.dog/fires
So, I'm a New Yorker. I wasn’t born here -- I’m an immigrant -- but one of the many things I love about New York City is that you move here, and it’s immediately your city. The number one criterion for being a New Yorker is wanting to be a New Yorker. It's a welcoming place. So good morning to my fellow New Yorkers, wherever you're originally from, and, if you've travelled to be here, welcome to New York. We're glad to have you.
I work in Site Reliability and I'm especially interested in what happens when things fail, the contingency plans we use to recover when something breaks. And last year I was thinking about that a lot and walking around the city and I started really noticing that New York is *covered* in fire escapes. They’re a contingency plan too. They’re for incident response. You don’t use them until all of your regular methods of getting out of the building have failed.
So I started reading about fire escapes.
Before I say more about that, let’s talk content. This talk is about disaster prevention and disaster recovery in software, by looking at parallels in building fires. This will include stories of some of the worst fires in the history of New York City.
We'll be looking at the reasons fires started, the stuff that helped them spread and how people died. There are also some pictures of buildings on fire. Nothing lurid, but there are pictures.
If you have raw feelings related to recent fires, this could be rough.
If you'd be more comfortable skipping this one, you should do that with my blessing. While you're packing up, I'll even tell you what I'm going to say, so you don't miss anything:
Here's my thesis:
Fire escapes are a hacky bit of afterthought tacked on to the outside of a building after the building is finished. If you're using fire escapes, it's worth making them as good as possible, but you’ll prevent more fires if you build better buildings.
Similarly, incident response is often a hacky bit of afterthought tacked on long after software is released. Again, great incident response can help you recover faster than if you don’t have it but… you’ll prevent more outages if you build better software.
Finally, buildings have an extremely detailed fire code, but we don't really have an extremely detailed systems engineering code for software, and I think we should have.
Now I'm going to say the same thing but take 35 minutes.
Fire escapes were really only built in New York City for a hundred years. They weren't common until the 1860s, and in the 1960s they stopped being allowed on new construction.
There's some debate now about whether we should start removing them in places where the building has been upgraded, or whether they should be preserved as part of the city's history.
I think at least some of them should be preserved. Look how beautiful that is!
And here's another lovely one. They made an effort to have it match the style of the building, not feel like a separate thing tacked on at the end. And I think that's key.
But most of the time, the people adding the fire escape didn't think of it as part of the building. As this quote says, fire escapes were haphazardly attached to the most elaborately designed facades. The facade of the building was architecture but the fire escape was law.
It was an external contingency plan, not part of the main structure. And I think that's part of why fire escapes ended up not being successful.
But I'm jumping to the end. Let's look at the evolution of New York City's fire code.
By the way, my great fear now is that there’s a building historian in the room who will listen to this and be like “Nope, that is really not what happened.” Please forgive any errors, building historian! If I made mistakes, I would love it if you would come tell me at the end!
On to the history. We’re skipping the great fire of 1776, and jumping straight to 1835 and the Financial district.
This was a commercial, not residential area, and as a result the number of fatalities was comparatively low -- two people -- I mean, still, two too many, but this is mostly remembered as a fire that cost a LOT of money. Almost 700 buildings were destroyed. The city had 26 fire insurance companies. This fire put 23 of them out of business.
The fire was caused by a burst gas pipe in a maze of wooden warehouses. Wood burns easily so there were no failure domains: the fire spread very quickly. Inside two hours it covered 17 city blocks, most of the financial district.
The city's water supplies were low and the typical contingency plan was to pull water from the rivers, but it was a freezing night in December and first the firefighters had to cut through ice.
At the time it was also common to use gunpowder to level buildings and stop the fire spreading. But they had used up all their gunpowder on a fire two days earlier. That fire involved the entire fire department of 1500 people, and they were still exhausted. Still, they fought the fire for 15 hours until marines from the Brooklyn Navy Yard arrived with more gunpowder and blew up some buildings along Wall Street to make a barrier.
As a result of the fire, the city stopped using volunteer firefighters and moved to a professional force with better equipment.
And they built the Croton Dam and Aqueduct. It was built because of the fire, but a reliable water source is good for lots of reasons!
But more importantly, as well as better incident response, they took the opportunity to make a more resilient city. The fire spread fast because the buildings were made of wood. They rebuilt with stone and brick.
And this paid off, ten years later, when there was another enormous fire. The great fire of 1845 was very bad -- thirty people died -- but it didn’t spread as far or as fast, because it slowed down when it hit those new brick buildings.
Let’s jump forward 25 years and talk about tenements. Tenements were extremely dense, extremely terrible housing. I'd read about tenements but hadn't realised the scale of them. In the 1860s, nearly 500 thousand people -- more than half the city -- lived in tenements.
The population of New York City doubled every decade between 1800 and 1880. Maybe you've seen this with teams and software systems: when you grow rapidly, you can build some culture problems and some technical debt. This was certainly the case here. Landlords made more accommodation by splitting big rooms into many smaller ones, mostly with no light or ventilation. These were really awful places to live. They were crime riddled, filthy and filled with disease. Every report about them mentioned that they were fire traps.
In 1860, two tenement fires happened back to back.
The first one, on Elm Street, started in a bakery on the ground floor of a large residential building. Terrible place for a bakery, but that's where it was. The baker was storing a lot of hay and wood shavings, and when they burned they made dense smoke, killing some of the people who lived on the higher floors before the fire even got up there.
The wooden stairway quickly burned away, trapping people on the top floors. Firefighters arrived with ladders, but the ladders only went to the fourth floor and this was a six storey building. At least 10 people died.
A month later four houses burned on West 45th Street. These houses had roof hatches called scuttles, which should have let people escape across the roofs, but they were all missing their ladders so nobody could get up there. Another ten people died.
These escape plans -- the ladders and scuttles and the roof -- had worked fine for a previous iteration of shorter NYC buildings, but they hadn't been updated for the new shape of the city.
Just like with the water and the gunpowder, there was a plan in place for a fire disaster. And just like them, the plan only worked in the most optimistic circumstance.
We see that all the time. Backups that will work if we lose the database in a very specific way. Failover plans that only work if we have two weeks notice of the failover and the old data center doesn't lose power.
The city immediately passed a law to make the tenements more robust against fire. They even put an injunction on new tenement construction until the law was passed. Now houses for more than eight families (kind of specific) had to have fire-proof stairs either inside or outside the building.
What’s frustrating about this is that four years earlier a commission had reported that, if there was a fire, tenants on the 6th and 7th floors of tenements had basically zero chance of survival. They recommended fire proof stairs. But nothing happened until a bunch of people died.
Seven years later, the Draft Riots (which are a whole separate awful thing in which a whole bunch of people died) led to another law: the Tenement House act. This act had good goals but it was extremely unsuccessful.
Buildings had to have a fire escape, but they didn't have to make anyone safer! So landlords put up fire escapes that couldn’t hold the number of people in the house, or that weren’t well attached to the walls or that were just a rusty ladder. And what even was a fire escape? Well, it wasn't well defined.
Let's take a diversion and look at some fire escape patents.
As we look at them, you might want to think of disaster recovery plans you have known and loved.
This is a ladder with a counterweight. Imagine climbing down from the 7th floor of your building on one of these. With your six children. In the rain. In a dress that went to your ankles.
This is a parachute that rolls up very small. The idea was that you'd carry it with you everywhere in case you were in any tall building fire situations.
According to this patent, and I quote: "A person desiring to escape seizes one member of the cord, rope, or chain, as shown in Fig. 1, and forthwith jumps out of the window. [...]"
Like, I am looking at this thing and do not feel like I could forthwith jump out of anything.
Anna Gonnelly's fire escape was a bridge that you could sling from your roof to another building. It had side rails, so it was only moderately terrifying.
This one is just fantastically ludicrous. But good if you want to fight supervillain crime?
All of these patents were granted, btw.
You might think that this is just a parachute helmet. It is not. It is a parachute helmet and a pair of very bouncy shoes.
Finally, I've read this patent three times and I'm fairly convinced that the guy invented a rope. It's the most silicon valley invention of 1882.
Though, let's be clear, rope was a popular kind of fire escape. In fact, it was the state of the art for hotels.
I don't mean a ladder made of rope, I mean literally a rope. Every hotel room had to have a rope and that was the only fire escape. Even at the time, people found that pretty terrible.
This is part of a snarky cartoon from a magazine called Puck, published in 1887, of a whole lot of people trying to use the ropes.
Like most of those other patents, it's designed for the easiest case: someone with upper body strength and agility who isn't wearing a skirt or carrying a child. If your disaster plan only works for the easiest case, it's not a good plan.
I want to emphasise here that a rope is better than nothing. In fact, probably every one of these fire escapes, even mister parachute hat, is better than nothing. But these escape plans are not where I would put my efforts if I wanted to have fewer people die in fires. But this is what the law focused on.
Anyway! The Tenement House Act.
Even with fire escapes, tenements were still terrible. They were badly constructed, overcrowded, and -- I find this amazing -- it was perfectly legal to store lots of combustible materials in them.
One other thing the tenement act said, was that every room now had to have a window. And just like “what even is a fire escape” it didn’t define “what even is a window”. So the landlords cut holes in interior walls between rooms and called them "interior windows".
A decade later, the law said sigh, ok, exterior windows. So landlords started constructing buildings with air shafts, little narrow gaps between buildings. Now, picture it, you have no indoor plumbing and the bathroom is down six flights of stairs and now you have an air shaft. You can imagine how that goes. One article I read described the air shaft as “festering tubes of disease”. Very poetic!
And many of the fire escapes just led down to these air shafts and there was no way out from there.
By 1871, iron fire escapes were becoming common and of course people were using them as extra space. You still see that now -- they're used for bikes and gardening and barbecues and cat runs. All of that has been illegal since 1871. Because it makes the fire escape very hard to use in a fire!
A later law said that every fire escape had to have a cast-iron sign saying that you could be fined for obstructing your fire escape. And it was fair, because usable fire escapes are better than unusable ones.
But, again, it was still perfectly legal to run your explosive business out of a tenement basement and tons of residential fires started because of deep frying crullers. And anyway, the regulations were mostly not enforced, so people didn't pay much attention.
Moving on. It’s 1876 at the Brooklyn Theater on Cadman Plaza.
The final act of the play was about to start and the stage manager noticed a very tiny fire on the left of the stage.
It was typical to keep buckets of water next to the stage, but there weren't any. There was a fire hose, but too much scenery was piled beside the stage and he couldn't get to it. There's those encumbrances again.
The stage manager asked a couple of carpenters to put the fire out by beating it with poles. This didn't work and actually spread some sparks, setting fire to the loft.
The actors -- laudably -- wanted to avoid a panic, so they announced that the fire was part of the show, and that people shouldn't freak out, but once the audience realised, they stampeded. And they had trouble getting out. We have a real stampeding herd problem here: there was only one stairway down from the cheap seats at the top, and everyone was trying to use it at once. It filled with smoke. There were no fire escapes and some exits were locked to keep out gate crashers, so people couldn't get out that way.
278 people died. At the time, it was the worst theater fire in US history. It's now the third worst because we really don't learn.
The jury blamed the theater owners for not obeying a bunch of existing fire laws, and new laws were written, including widening exits and not storing stuff on the stage. In 1882, the building code said that theatres had to have automatic sprinklers: it's the first type of building in the city to require sprinklers. The first automated response.
What I find remarkable is that this fire happened nine years after regulation said that tenements had to have safe exits, but those laws didn't carry over to theatres, or to other types of buildings like: hotels, schools, factories, ships, offices. I'm going to spare you most of the horror stories, but we'll look at factories in a minute, after….
...we get proper no-kidding tenement regulation at last! And we even do it without a bunch of people dying! Thank you Jacob Riis!
In 1890, this guy called Jacob Riis published a book about tenement life called How the Other Half Lives and did a lecture tour on it. And up until now the upper and middle class people of New York City had sort of known the tenements were awful, but for the first time ever, there were photographs. It was harder to ignore. Well, it was probably part empathy, part fear of smallpox coming out of there but, whatever, over the next decade, people started to care.
I was really reassured when I read this, because until then it had been all “there was a horrific fire and we added a very specific law and then there was a different horrific fire and we added a different very specific law”. And it was mostly like that! But this Tenement House Act came from someone saying “wow, look how much this sucks” in a compelling way. And that gives me hope!
Anyway, the next couple of Tenement House Acts included having to have actual windows, not air shafts, and fire escapes couldn't be ladders any more: they had to have open balconies and stairs and be properly attached to the wall. Even better: your neighbours can no longer boil oil in the basement! Hurray! And all new construction has to have interior fire partitions. Failure domains!
We're finally looking at stopping fires from starting and spreading, not just escaping from them. And, best of all, it’s all actually going to be enforced. Welcome to the 20th century!
But, oh yeah, it still sucks in factories.
The triangle shirtwaist is the famous one, but the Newark factory fire a few months earlier is a textbook disaster waiting to happen so I wanted to talk about it.
This building had two fire escapes -- look at the size of this building! One of them was a really heavy ladder that needed to be lifted into place. Another emergency plan that only worked for people with good upper body strength. In the fire, the young women who worked in this factory weren't able to lift down the ladder. So.. only one fire escape.
The building was shared by a couple of paper box companies, a nightgown factory and a lamp manufacturer. It had previously been used by machine companies and the floors were soaked in oil.
A fire started in the lamp factory. There was no fire alarm, and the bottom three floors had evacuated before they realised that 116 people up on the 4th didn't know there was a fire.
This building had had ten fires in ten years and the buildings department had condemned this factory three times, but the factory owners basically ignored them and kept running. All of that was expensive for insurance and they didn't want another fire on their record, so they delayed calling in the firefighters, even though the firehouse was just across the street.
The firehouse had a policy of reprimanding their firefighters for false alarms -- no blameless post-mortems here! -- so before raising a general alarm, they sent a couple of guys over with a fire extinguisher, delaying the real response even more.
The only door up to the 4th floor was kept locked, which was against the law. The windows wouldn't open and the victims had to break glass with their hands. The window sills were four feet off the ground and the platform up to them broke under the weight of people trying to get out.
And the victims had never been in a fire drill and they had no idea what to do. They, quite reasonably, freaked out.
25 people died, 32 more were very badly injured.
I feel like I could spend an hour just talking about this fire. There's so much to learn from it.
When officials investigated, they said the root cause was not the floors soaked in oil, or the delay in calling firefighters, or the locked door, or the lack of a fire alarm or the unusable fire escapes. It was that "the victims merely succumbed to panic".
The way humans react to a disaster can definitely make the situation worse -- remember those carpenters with sticks in the theater -- but that is in no way their fault. Humans will act in human ways. If your systems can't handle that, and you haven't invested a lot of time in training the humans to act in some other way, your systems are crap.
So what happened? Nothing. The jury didn't convict, though at least one juror later said he regretted it. New Yorkers did look a bit at their factories and say "huh, I wonder if we should care about that"..., but nothing changed. Is it because it happened ten miles away instead of on the island of Manhattan? No idea. The New York Fire Chief said "This city may have a fire as deadly as the one in Newark at any time".
Four months later…
This building was considered fireproof. They had done it right. They built a good building. But it was packed with garments hanging so tightly together that the building might as well have been made out of cloth.
The building should have had three fire escapes; it had one and that collapsed under the weight of people escaping. Fire fighters came but the fire ladders and the water could only get to the 6th floor and the city had gotten taller again: the factory was on the 7th to 9th.
One exit was locked; the guy with the key escaped without unlocking it.
And the employers already knew about the problems. Employees had organised a strike the previous year to protest the working conditions, and they'd been fired. The building had had a recent warning notice from the department of sanitary control, but they hadn't fixed their violations.
The fire department developed a stronger water pump and a longer ladder, so they could reach taller buildings.
But more importantly, building conditions took a big step forwards. There were 60 new laws over the next three years. Again, everyone knew factories were bad. But, again, the law didn't change until a bunch of people died ON THE ISLAND OF MANHATTAN.
Sprinklers started to be required in factories. (But only factories over seven stories tall. Very specific again.)
A professional organisation, the American Society of Safety Engineers (which still exists), was founded.
And at last, people started to look at fire escapes differently. After the disaster, a report called them "a pitiful delusion" and "a type of exit condemned by the experience of many fires".
The report called out a lot of reasons fire escapes are terrible:
The platforms are too small
People put stuff on them
They don't get a lot of maintenance
Snow and ice makes them slippy and dangerous
But most importantly:
They never, ever get tested.
Fire escapes were known to collapse during times of intense use. But they pretty much have one time of intense use. If they're going to collapse, it's going to be during a fire.
So what do we do?
We have a couple of options here. We can add more regulations around fire escapes: you have to maintain them, you have to try them out every year! There actually was a law about regularly painting your fire escape. To prevent slipping you have to build a textured floor into the fire escape and leave a pair of shoes with good grips on the top of each one… Or we could step back and ask whether we're optimising for the wrong thing.
In 1923, the New York Times had an article praising fireproof interior walls: "For six years there has been no loss of life by fire in the 200 buildings so treated."
It blows my mind that a group of 206 buildings having no fire deaths in six years was considered newsworthy.
In 1929 those fireproof walls became code: all new buildings over 75 feet in height had to have them, and also had to have two fully enclosed staircases! Failure domains are part of the code at last!
The idea of building better buildings gained traction and in 1968 fire escapes stopped being allowed at all. The code still says "Fire escapes shall not be permitted on new construction".
The 1968 code also required sprinklers for hotels and high-rise office buildings, but not nightclubs or residential buildings.
In 1975, seven people died in a nightclub, so sprinklers were required for nightclubs.
In 1998 there were two bad residential fires, and now you have to have sprinklers for residences with four or more units.
And I'm sure this story is not over and the code will be expanded many more times in response to very specific things in which a bunch of people die. [Edit, in 2019 there’s a new law that you need to have a lock on the dials on your stove if a child under six lives in the house. Very specific response to a horrific recent fire.]
Btw, there's no retrofitting of existing buildings. Most of the laws only apply to new buildings and existing buildings get better as they're renovated. So buildings in NYC comply with the safety standard of whenever they were renovated last. Think about that, wherever you sleep tonight.
So that was 150 years of fire codes. For decades we considered it inevitable that fires would start and spread, and we optimised for escaping from them. And we definitely got good at responding to massive fire disasters. But slowly we made progress on other, more important parts of the fire life cycle. Which I'm going to describe in four stages:
1. We prevented sparks. A certain amount of sparks are ok! We need to cook food and have birthday candles. But by becoming more deliberate about when we make sparks, we made it harder for the fire to start at all. We moved bakeries out of residential buildings, began doing wiring inspections, did public safety campaigns about cooking and smoking.
2. We worked on detection and immediate amateur response: smoke alarms, fire blankets, fire extinguishers, and more public safety campaigns. And we introduced sprinklers.
3. We introduced failure domains, to keep the fire to one small part of the building or city. We started using materials that were hard to ignite so the fire would spread slowly. And we did fire drills, to move humans quickly and safely away from the danger area and to prevent the kind of panic that makes things worse.
And only then, 4, emergency response. We also got better at responding to massive fires. The New York Fire Department is *very good*.
But step 4, this is our last resort and we should try not to rely on our last resort. We gained more from stopping the fire from getting to this point.
And, if you missed my extremely subtle metaphor here, it's the same for software.
The most important reliability work is making problems stop before they get to that fourth stage.
This means that reliability is everyone's problem. Everyone who's writing code or designing systems should have reliability in mind.
Yeah, some people have a site reliability team. Just as we have people who specialise in UI or security, both of which we should all care about, we can have people who specialise in reliability and advocate for it. But, while SREs may occasionally act as firefighters, the more important part of their job is to be the fire safety engineers, handing out smoke alarms, legislating fire partitions, pointing out buildings that are made of wood, advocating for the removal of clutter, educating everyone.
The part of their job which is being last resort firefighters? That skillset should be used rarely. You don't want the FDNY running into your kitchen every time you burn toast. If you're calling them in, it's a sign that something's gone horribly wrong. But it's still very common to have firefighters reacting to every software problem.
There's a really nice tradition in the ops and SRE communities, where if a site is down, people send #hugops on twitter to the people working on it. I want to particularly call out Baron Schwartz sending hugops in advance to people running mail servers on GDPR day :-D
I love #hugops. I send #hugops. But one thing you'll notice if you follow the hashtag is that… a lot of things break and nobody is really surprised.
We're at the stage of software evolution where we expect software to fail. We need to build better buildings in software too.
And that means we think about those same four stages.
Just like with buildings, a certain amount of sparks are fine for us too! We need to make changes. Maybe something gets overloaded or a user does something we didn't plan for. Many of us use the concept of error budgets: depending on how close we are to missing our SLAs, we make more or fewer changes.
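Here's a minimal sketch of how an error budget might drive that decision. It assumes the target is an availability fraction measured over requests. The function names and the 25% threshold are made up for illustration and don't come from any particular tool.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


def change_policy(remaining: float) -> str:
    """Map remaining budget to a change pace. The thresholds are assumptions."""
    if remaining <= 0:
        return "freeze"        # budget spent: reliability work only
    if remaining < 0.25:
        return "slow down"     # nearly spent: fewer, smaller changes
    return "ship normally"
```

With a 99.9% target over a million requests, roughly 1,000 failures are allowed. At 900 failures about a tenth of the budget is left, so this particular policy would say to slow down.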
We can reduce our sparks:
We can think about how users use our tools and provide clean, safe, validated interfaces that are hard to get wrong. We can restrict their access to functionality or data they don't need. A stove igniter is a better tool than a box of matches.
The fire department recommends that you don't operate a stove while drunk or sleepy, and the same goes for root prompts or code merges. Many outages are caused by changes, so we can make them deliberately and carefully, with design review, code review and change management.
We can make it a standard to inspect our systems, looking for regressions, looking for what has bitrotted or become overloaded. A thorough test suite is like a wiring inspection that runs on every deploy.
And we can do chaos engineering: continually testing the system's resilience against chaotic events.
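To make the "validated interfaces that are hard to get wrong" idea a bit more concrete, here's a rough sketch of a stove igniter rather than a box of matches: a restart tool that checks its input before touching anything. The function, the host names and the limit of five hosts per change are all hypothetical.

```python
MAX_HOSTS_PER_CHANGE = 5  # assumed blast-radius limit, purely illustrative


def plan_restart(hosts: list[str], known_hosts: set[str]) -> list[str]:
    """Validate a restart request and return the hosts to restart, sorted."""
    requested = sorted(set(hosts))  # drop accidental duplicates
    if not requested:
        raise ValueError("no hosts given")
    unknown = [h for h in requested if h not in known_hosts]
    if unknown:
        raise ValueError(f"unknown hosts: {unknown}")
    if len(requested) > MAX_HOSTS_PER_CHANGE:
        raise ValueError(
            f"{len(requested)} hosts requested, limit is {MAX_HOSTS_PER_CHANGE}"
        )
    return requested
```

The point is that a typo or an over-broad request fails loudly before anything restarts, instead of quietly doing something you didn't mean.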
But, ok, sometimes, inevitably, things go wrong. We have an opportunity to put this fire out while it's tiny.
Humans can react quickest if the right fire extinguishers are available. Provide a one-click rollback for all your changes. Use canaries: push the change to one instance before you push it to all of them. And launch with feature flags to push out new features in a way that makes it very fast to turn them off if you need to.
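As one possible shape for a feature flag, here's a small sketch that rolls a feature out to a percentage of users and can be switched off by setting that percentage to zero. The flag store, the flag name and the checkout function are invented for the example. A real system would keep the flags somewhere you can change without a deploy.

```python
import hashlib

# Hypothetical flag store: flag name -> percentage of users who get the feature.
FLAGS = {"new_checkout": 5}


def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically put each user in one of 100 buckets for this flag."""
    percent = FLAGS.get(flag, 0)  # unknown flags are off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def checkout(user_id: str) -> str:
    """Pick the code path for this user based on the flag."""
    return "new flow" if is_enabled("new_checkout", user_id) else "old flow"
```

Setting `FLAGS["new_checkout"] = 0` is the fire extinguisher here: every user goes back to the old flow on their next request, with no rollback needed.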
Alerts need a fine balance, as everyone knows who’s ever had an over-enthusiastic smoke alarm in their kitchen. An occasional false alarm is ok, but having humans continuously react to small problems can burn them out. It's using up your gunpowder on small fires and not having enough left for the big ones! So aim to keep your false alarms low.
But even better, don't get humans involved at all for small things. Add automatic recovery. If a machine dies, it should automatically be replaced. If a backend goes missing, we should be able to coast for a while. Health checking and load balancing should move traffic from an unhealthy region to a healthy one.
Maybe you want to let humans know, but the message they should get is "everything is under control but you might want to look at this when you get a chance". Not "WELCOME TO 3AM! A MACHINE DID A THING".
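Here's a minimal sketch of that combination: move traffic away from an unhealthy region automatically, and only produce a calm, low-urgency note for the humans. The region names, the `health` mapping and the `NoHealthyRegion` exception are assumptions for the example. In a real setup the health data would come from your load balancer or health checker.

```python
from typing import Optional


class NoHealthyRegion(Exception):
    """Raised when nothing can serve traffic. This one is worth a page."""


def route(
    preferred: str, fallbacks: list[str], health: dict[str, bool]
) -> tuple[str, Optional[str]]:
    """Return (region to use, low-urgency note or None if nothing changed)."""
    if health.get(preferred, False):
        return preferred, None
    for region in fallbacks:
        if health.get(region, False):
            note = (
                f"{preferred} is unhealthy; traffic moved to {region}. "
                "Everything is under control, take a look when you get a chance."
            )
            return region, note
    raise NoHealthyRegion(f"{preferred} and all fallbacks are unhealthy")
```

Only the last case, where nothing healthy is left, needs to wake anybody up.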
Stage 3: Ok, there's a fire, it's happening. Now we want to not let it get on anything it's not already on.
Failure domains split our systems up so that only one part should be affected by any given outage. And if the problem's going to move as components get overloaded, we want that to be slow enough that we can control it, not an immediate cascade. And we have our own version of moving bakeries out of residential buildings: we can isolate risky customers on their own replicas or shards.
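A rough sketch of the bakery idea, assuming customers are spread across a few shared shards by a stable hash and that you keep an explicit list of risky customers who get a shard to themselves. The shard names and the customer name are hypothetical.

```python
import zlib

SHARED_SHARDS = ["shard-0", "shard-1", "shard-2"]
# Hypothetical: customers whose traffic is risky enough to get their own shard.
ISOLATED = {"noisy-corp": "shard-noisy-corp"}


def shard_for(customer_id: str) -> str:
    """Return the shard that serves this customer."""
    if customer_id in ISOLATED:
        return ISOLATED[customer_id]
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(customer_id.encode()) % len(SHARED_SHARDS)
    return SHARED_SHARDS[index]
```

If the isolated customer sets their shard on fire, the damage stays on that shard, and a fire on one shared shard only reaches the customers who hash to it.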
Just like we make it incredibly common to hear a smoke alarm and find our way outside, make it so that a disaster is never a surprise. Humans will panic the first time they hit a situation that's outside their comfort zone. At intervals, tell people you're doing a controlled outage, and take a system offline.
You know the phenomenon where you're fixing something and you hit a bunch of unintuitive commands, or out of date documentation, and it ends up taking you much longer to do something simple? Or you even end up breaking something else? These traps are a basement full of straw, or a fire hose with cluttered scenery on top of it. It's making it very, very hard for you to move around safely as you try to fix the real problem. Push back on technical debt and clutter.
Fatigue is an encumbrance too. You're way more likely to make a mistake if you're exhausted. Set rules about how long a person should deal with an incident before their on call shift is over and someone else needs to swap in. Enforce those rules.
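One way to make a rule like that enforceable is to have tooling check it, so nobody has to be the person who says "go to bed". This is a tiny sketch. The four-hour limit is an assumption, so pick whatever your team agrees on.

```python
from datetime import datetime, timedelta

MAX_STINT = timedelta(hours=4)  # assumed limit, not a recommendation


def needs_handoff(
    started_at: datetime, now: datetime, limit: timedelta = MAX_STINT
) -> bool:
    """True once a responder has been on the incident for at least `limit`."""
    return now - started_at >= limit
```

An incident bot could call this every few minutes and ask for a swap once it returns True.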
And sometimes we will still get to stage 4, fighting a massive outage. But we should aim to not get here often. Firefighting is not good for your SLAs and it's also not great for the health of the humans involved.
Ideally we'll get to a point where our firefighters mostly train using controlled outages, like many real fire departments do. But we're not there yet.
Many of us are still fixing unreliable software by focusing on this fourth stage, with human response and escape routes...
...and that means we're building tenements. Foul air is coming in through the air shafts, and it's not somewhere humans should live. Reliability can't be added after the building is finished. It needs to be built in. Failure needs to be built in.
Building better buildings makes a huge difference.
In 2016, 48 people died in fires in New York City. This is still a lot of people! But 2016 was the lowest number since they started recording a hundred years ago, even though the population of the city continues to grow.
That Bronx fire in December that killed 12 people was the deadliest in 25 years. How did we get from the fire traps of the 1800s to here?
Well, this helped. This is the New York City fire code. It has 444 pages and costs $140, which I know because I really wanted to bring one in here today and dramatically wave it at everyone. The guy at the library was really confused about why I'd want a physical copy. He was like "Look, do you have access to the internet?"
And fire safety is also mentioned plenty in the city building code, the city construction code, the state building code, the National Fire Protection Association electrical code and I’m sure plenty of other dense legislation. Don’t ask me what's in each of these. There’s a lot of code, that’s all I’m saying.
But we don't have a fire code for software. We have a bunch of O’Reilly books and they're great. But nothing makes us adhere to our best practices, or prioritises one set of rules over the others. Why don't we have a fire code yet?
It has been proposed from time to time!
I found this report from 1986 called "Software: a vital key to UK competitiveness", which had a whole appendix on safety critical software. It starts with “No computer software failure has killed or injured a large number of people. It is just conceivable that such a tragedy could occur.”
The Advisory Council predicted a time when it wouldn’t be possible to recover from software failure by just switching off the computer and doing the thing manually -- this was written in 1986, remember. We're there now. They wanted certification: you would only be able to operate a life-critical computer system if you had a license and a Certified Software Engineer to sign off on it -- and they would be personally liable! -- and a bunch of other stuff, and you'd have to get re-certified every five years.
They also proposed what’s basically on call shifts, disaster recovery practice drills, and post-mortems, including post-mortems for near misses. A lot of this feels prescient and we ended up doing it, but we never required certification.
Jon pointed out that, while we might think of computing as a new field, it's the same age as a bunch of others. Software, aviation, power, emergency medicine all took a big jump forward after World War 2. But our industry is significantly less mature than any of the others.
Is that because the stakes are lower? It's at least part of the reason. Mostly, the stakes have been lower. Software mostly hasn't had the ability to cause massive disasters.
Researching this talk, I read a ton about deaths from software -- it really was a cheerful time creating this talk -- and found surprisingly few. Most of the news about software and deaths was about how software is IMPROVING things. By making processes repeatable and precise, we're saving lives.
But we have had some famously dangerous software bugs.
The Therac-25 radiation therapy machine had a concurrent programming bug that made it occasionally give its patients radiation doses that were hundreds of times greater than they should have been. Three people died.
In college I remember studying the London Ambulance dispatch failure. A new software system was deployed that hadn't been load tested, and it had a memory leak. It couldn't keep track of where the ambulances were, which led to them arriving hours late. 46 people died who might have been ok if the ambulance had arrived on time.
And some near misses. Like, I haven't heard of any actual negative outcomes from the OCR bug that went around in 2013, but you can see how you might end up with numbers in prescriptions or structural engineering documents being catastrophically wrong.
And the news is full of software concerns in vehicles, self-driving or otherwise.
But none of those has been our Triangle fire. So far software has been able to kill people one or a few at a time. We haven’t had the wide-scale disasters that have shocked other industries into growing up.
Aviation regulations came from a bunch of people dying. Mining regulations came from a bunch of people dying. Professional engineering organisations came from a bunch of people dying. To quote my new favourite 1910s journalist, Inis Weed, "It took a Titanic disaster to improve the safety of vessels. It took a Newark Fire and a Triangle fire to bring New York State's fire legislation to its present inefficiency".
The use of software for life-critical systems grows every year. And every day we send #hugops on Twitter to the people working on the latest massive software outage. At some point these will overlap. Hope is not a strategy.
Are we ready for this kind of responsibility?
We, all of us here, are people who are responsible for software. The world will need a lot of software over the next few decades. Some people in this room will run life-critical systems. We are 1890s landlords looking at a whole lot of new opportunity. We know there's money to be made from cutting all of the corners, but we have a choice. I don't want us to wait for a disaster...
...to decide not to build tenements.
Remember, some regulations didn't come from fires! Some came from a lot of people deciding to care about the same thing at the same time.
We can decide now what good systems look like. We can create professional standards and industry safety codes, and create and opt in to a professional organisation to keep ourselves honest. And then, like the fire code, we can keep revising and improving it until huge software outages are rare and shocking.
The entire industry should learn from every major outage. No secrets.
Before I finish: if you're in New York, the FDNY and the Red Cross have a shared campaign to give people free smoke alarms and free batteries. They'll even come and install them for you. If you don't have a smoke alarm, please search for #GetAlarmedNYC and fill in their form. http://fw.to/Kzv1G4f
(Two SREs live in my apartment, so we already have two redundant meshes of networked alarms from different manufacturers and also a few standalone alarms.)
This slide lists a few references that I found especially useful or interesting while writing this talk. That first one contains a list of all the others, so hit up http://noidea.dog/fires if you want a lot of links to read more about fires and fire escapes.
If you have comments on the talk, or questions or you're a building historian who is willing to tell me what I got wrong, you can find me at @whereistanya on Twitter or firstname.lastname@example.org.