SRECon was in Brooklyn this year for the first time, and it sold out fast. Watch and learn, industry. Brooklyn is where it’s at!
This was my first time on the program committee, and I enjoyed that a lot. I’ve been doing a personal retrospective on what we ended up picking, re-reading my reviews of abstracts and comparing them to the talks I saw. Being on a program committee gives you a fascinating look at the inside of a conference and a chance to very slightly influence what the industry is talking about. It was a lot of work, but worth the time and I’d do it again.
I’ve noted in other years how ideas have suddenly appeared across a lot of talks at a conference at once. I remember the year at LISA where everyone suddenly believed in centralized config management (newer people in the industry: you would be horrified at how we used to do it), the year we got to the acceptance phase of the IPv6 stages of grief, the year containerization happened for real, and so on.
This year’s emergent topic was Safety-II, the idea (from medical research) that we should study the everyday performance of complex systems, not just the failures. Safety-II emphasises human flexibility and innovation as part of a resilient system, discarding the idea that the squishy meat component of the machine is the one most likely to break.
I read the “From Safety-I to Safety-II” paper this weekend. It’s both interesting and approachable and there’s a ton we can learn from it. I would note though that the paper says that Safety-II comes in addition to Safety-I; the old approach is still fine for a lot of situations.
By contrast, the several talks that mentioned Safety-II really concentrated on discarding existing practices, in some cases taking extreme positions: don’t do post-mortem action items, don’t use error budgets, don’t dig into failures, don’t measure availability. Without concrete alternatives, it felt a little unsatisfying, and I think it landed badly with an SRE crowd. The backchannel conversations were skeptical to say the least.
My read is that this has been covered in more depth in chaos engineering conferences, so the ideas were assumed to stand alone, the nuance assumed to be implied. And, as Tiarnan pointed out afterwards, “‘Actually it's more complicated than that’ doesn't make for an inspirational slide.”
Now that I understand the topic better, I’m wishing that we’d had a deep and nuanced treatment of it at SRECon. I’d suggest that near-future SRECons and LISAs and so on take a step back and tell this story from the start (perhaps inviting an outside expert), as well as emphasising practical ways to apply the philosophy to running software in production. We shouldn’t just talk about which parts of established SRE doctrine we need to discard (though we should talk about that!), but we should bring suggestions for what to do instead. The paper, for example, proposes scheduling time to study systems that are working normally. I’m not sure what this means for us (what form should the study take?) and I’m looking forward to reading more about that.
Other observations: SRECon has always emphasised the human side of reliability and I appreciated the focus on learning and teaching, psychological safety and avoiding burnout. Other highlights included Denise Yu’s gorgeous sketchnotes (see some of them inline (with permission!)), having lunch with the lovely Etsy delegation, and many good hallway track conversations. Yelp’s on call game got some well-deserved love on Twitter; I think it was referenced in Chie Shu, Dorothy Jung, and Wenting Wang’s workshop “What I Wish I Knew before Going On-call”, which I didn’t see but heard good things about.
As always my greatest difficulty with SRECon is that a hundred people I know and like are in one place for three days, so I don’t get time to catch up with everyone I want to. A pretty good problem to have, all things considered :-)
I had other stuff I needed to do this week and missed some sessions (including all day Tuesday) but here’s what I saw and liked. If I misrepresented anything, or if you know extra livetweets or slides I could link, please let me know (mail email@example.com or leave a comment here) and I'll fix it.
What Breaks Our Systems: A Taxonomy of Black Swans, Laura Nolan, Slack.
Black swans are outlier events that are hard to predict and very severe. A black swan event can blow your error budget for a decade. Laura introduced six types of black swan outages, illustrating each with true stories of industry outages.
Unexpected physical or logical limits, like the filesystem limit that took down Instapaper or the day Sentry ran out of Postgres transaction IDs. We can defend against them with load and capacity testing, and by adding monitoring for limits we know about.
Spreading slowness, e.g., the amazing Hosted Graphite outage where AWS connectivity issues took them down even though they don't use AWS. Failing fast and limiting retries can add resilience.
Thundering herds and coordinated demand, like CircleCI’s surge after a GitHub outage ended. We should plan for this kind of traffic and test it.
The kind of automation interactions which make complex systems even more complex. Google sending its CDN to be disk-erased is a famous example. There should be constraints to limit what the automation can do, and it should be easy to disable it completely. [ed: Christina Schulman and Etienne Perot had a great talk about this at SRECon Americas 2018!]
Cyberattacks, like the malware that disabled the Maersk shipping company and cost billions in its effect on global shipping. Restricting how much systems trust each other, like Google’s BeyondCorp, can reduce the blast radius here.
Finally, dependencies can cause outages, as Trello found out when their servers refused to start without a component they didn’t actually need. Circular dependencies are particularly bad as they can lock you out of your systems. Laura recommended dependency management and layering (and I will hubristically link to my own work on this topic :-)) as well as making sure that communication during incidents won’t depend on services the incident might take down.
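To make the "fail fast and limit retries" and thundering herd defences above a bit more concrete, here is a minimal Python sketch of a bounded retry loop with capped, jittered backoff. It is my own illustration rather than anything shown in the talk: the function name `call_with_retries` and all of its parameters and defaults are hypothetical.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying a bounded number of times.

    All names and defaults here are illustrative assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt == max_attempts - 1:
                break  # retry budget spent: give up instead of piling on load
            # Exponential backoff with a cap, scaled by a random factor so
            # that many clients recovering at once do not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
    raise last_error
```

The hard cap on attempts is what stops a slow dependency from being hammered by its callers, and the jitter spreads out the surge when a dependency comes back.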
Other ways we can defend against black swan events include using formal incident management, practicing how we communicate during outages, and being alert to human psychology: we recover faster if oncallers can easily get help and escalate, and if they’re getting enough rest.
I’d read the slides for this talk before, and seen the rave reviews for it, so it was excellent to see it firsthand and discover that it’s as good as promised. I love how it respectfully examines other people’s outages and pulls out broad lessons we can all learn from.
Livetweets: https://twitter.com/lizthegrey/status/1110163951441059840 , https://twitter.com/msuriar/status/1110163767936061440
Laura recommends: Release It! by Michael T. Nygard.
Complexity: The Crucial Ingredient in Your Kitchen. Casey Rosenthal, Verica.io.
Why is software all about disruption? Why are we drawn to burning things to the ground to change them? Because we know at some gut level that the current system is wrong.
A complex system is one where no single human can understand how all the parts fit together, or how a change made in one place will affect another. Even if every human involved in building a system makes sensible local decisions, you can still end up with an unreliable system.
A common approach to reliability is "add redundancy". But the Challenger disaster showed us that the redundancy (plus some normalisation of deviance) just made people feel more comfortable taking risks. The redundancy contributed to the disaster. A study on squirrels showed that groups of squirrels who crossed more streets were actually less likely to be hit by cars. Avoiding risk backfires. Exposure to risk is necessary for learning to remediate risk.
Another approach is to aim for simplicity. But complexity is inevitable. Accidental complexity keeps accumulating. While you can stop what you're doing and clean it up, there's no sustainable mechanism to keep it under control. And essential complexity grows with the requirements of the problem that needs to be solved. For example, high availability means extra components. We have to learn to navigate complexity.
Safety researcher Jens Rasmussen wrote that systems are subject to competing priorities: operations must be profitable, workers must be safe, and workers must have manageable workloads. We have an intuitive sense of economics and workloads, but our outages show that we don't have an intuitive sense of safety limits. Chaos engineering experiments can help us find those limits and teach us about the weaknesses in our systems.
We have four configurable "Economic Pillars of Complexity": states, relationships, environment and reversibility. We should limit the number of possible states and configurations, and prefer reversible decisions.
With this many factors and this much change, we need to be able to improvise. It’s the “kitchen model of organisation”: in a well run kitchen, everyone operates independently, but they’re working towards a shared goal. Everyone should know where they’re going and how much they can improvise.
Takeaways: embrace and navigate complexity, provide opportunities for teams to practice working together, optimise for reversibility, communicate the safety margin. I also took away that more talks should be illustrated with pictures of squirrels driving cars. Get on that, everyone.
Shoutout to the person in the audience who asked why Casey had an infinity gauntlet on his arm. “Why don’t you have one?” was the profound response. (We LOLed.)
Livetweets: https://twitter.com/msuriar/status/1110173241006530562, https://twitter.com/lizthegrey/status/1110171987152945152
Case Study: Implementing SLOs for a New Service. Arnaud Lawson, Squarespace.
Service Level Objectives are the performance and reliability targets for a service over time. Service Level Indicators are the metrics we use to decide whether we’re meeting our SLO. Arnaud presented a case study of choosing SLIs and SLOs for his service, Ceph Object Store.
Determine which SLI types would capture user experience. Ceph is a request-driven HTTP Server with a storage backend. The HTTP server needs availability and latency SLIs. The storage backend needs durability SLIs.
Define the SLIs. Be clear about what you’re measuring. For example, the availability SLI is “the percentage of HTTP requests that don’t fail.”
Choose how to measure them. They collected SLIs from load balancer logs, added instrumentation to client programs, and deployed probers to measure user actions.
Collect SLIs for a few weeks to get a performance baseline and use this to set SLOs. They collected success metrics and latency metrics for four weeks and based their SLO on them. A sample availability SLO: “99.9% of requests will complete successfully over 4 weeks”. A sample latency SLO: “90% of requests will complete successfully in < 300ms over 4 weeks.”
Infer error budgets. This is the amount of headroom there is above an SLO. For example, 99.9% availability means that 0.1% of requests can fail.
Publish the SLOs. Arnaud wrote documentation about the types of SLIs being measured, and what the SLOs were, including a rationale for why they were chosen.
SLIs inform decisions for prioritisation, capacity planning, etc, and can also identify service issues. SLOs also help users make decisions about whether the service fits their use case. Collect SLIs that matter to users and never strive for 100% availability.
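As a rough sketch of the arithmetic behind the steps above, the Python below turns raw request counts into an availability SLI and an error budget, and computes a latency SLI against the 300ms threshold. The function names and the request numbers in the usage note are made up for illustration; they are not Squarespace's implementation.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, sli) for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed = total_requests * (1 - slo_target)
    sli = (total_requests - failed_requests) / total_requests
    return allowed, allowed - failed_requests, sli


def latency_sli(latencies_ms, threshold_ms=300):
    """Fraction of requests that completed faster than threshold_ms."""
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)
```

With a 99.9% target and a hypothetical 1,000,000 requests in the window, about 1,000 requests may fail; if 400 already have, roughly 600 failures of budget remain.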
Arnaud is my colleague so I’m biased, but I thought this was a great treatment of the subject, with actionable steps to take. I also appreciated the message that the ultimate goal of SLOs is to make your users happy. It’s easy to forget that when we’re deep in the implementation mines.
Fixing On-Call When Nobody Thinks It's (Too) Broken. Tony Lykke, Hudson River Trading
When Tony started at Hudson River Trading, they received hundreds of pages every week. He tried to convince coworkers that they didn’t have to live like this, but received everything from “it’s not realistic to reduce noise unless you’re Google” to “look, it’s better than it used to be”. Here’s the nine step plan that worked for him:
1. understand your audience. Respect that people here consider alerts to mean “things are working”. Accept that nobody else believes there’s a problem.
2. understand the problem. Create visualizations to show the number of pages. He discovered there was always a spike in alerts caused by data consolidation just after the markets closed.
3. understand the system. They’re running Nagios 3, last released in 2012, with 200k lines of (often stale) configs. It’s deployed manually (and tediously) via rsync.
4. devise a game plan. It doesn’t have to be comprehensive, but you need some project management and some good communication. Focus on low risk/high impact changes first.
5. get permission. Or forgiveness, if you have that kind of position in your organisation. Use the data you collected to convince people. Know how much time you can spend on this before getting in trouble. You will inevitably break something, so communicate a lot and in particular make sure oncallers are aware.
6. lay the groundwork. Make it trivial to add/change configs. This is not an immediate impact on the thing you want to change, but it’s necessary. He replaced 15k duplicate lines with templates. He added automated build and deploy.
7. choose the lowest hanging fruit. Visualizations help choose high impact changes, but also look for low risk ones. For them, the first step was silencing a set of alerts which were always ignored. This should have felt like a huge win but people were anxious so…
8. communicate more! Silence made people uncomfortable: they worried that the monitoring stack was broken. He redirected alerts to a slack channel to mitigate silence anxiety.
9. goto 7. Find another low hanging fruit and go again!
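For step 2, a visualization can start from something as small as counting pages per hour of day. This is a hedged Python sketch of that idea, assuming you can export page timestamps as ISO-8601 strings; the function names are hypothetical and this is not Tony's tooling.

```python
from collections import Counter
from datetime import datetime


def pages_per_hour(timestamps):
    """Count pages by hour of day from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def noisiest_hours(timestamps, top=3):
    """Return the busiest hours as (hour, page_count) pairs, busiest first."""
    return pages_per_hour(timestamps).most_common(top)
```

A recurring spike, like the one just after the markets closed, would show up as one hour dominating the output.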
This was a funny and relatable story full of good advice. Kudos to Tony for pushing through everyone’s complete lack of belief in his project and making it work.
Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value. Aaron Wieczorek, United States Digital Service
The US Digital Service started in 2014 when President Obama called in engineers and designers to fix healthcare.gov. They’re a small agency (fewer than 200 people), and agile enough to step in while a crisis is still happening. A recent example is restoring service to https://airnow.gov during the California wildfires. But they often find out about these outages from Twitter or the Wall Street Journal.
They wanted to deploy monitoring to get ahead of the problem. But there are 26,049 .gov and .mil services, so monitoring’s not trivial.
They created a minimal product using Python scripts, then expanded it using existing software: Prometheus, HAProxy, Alertmanager, InfluxDB, Grafana, Prometheus Blackbox Exporter. Alerts went to a Slack channel.
Tuning Prometheus is harder when you have more than 20k endpoints. Expensive queries took down Prometheus. AWS contacted them to say they were scraping too aggressively from EC2. They moved to an AWS Lambda solution.
Now charts show service uptime and allow them to react quickly.
Proactive monitoring allows immediate incident response and allows you to train teams.
Dashboards with this many endpoints are hard: you can crash your browser. Instead alerting should guide you to more specific dashboards.
Tuning Prometheus with this many services involves a lot of guessing.
Be nice to the services you're trying to help; don’t accidentally be an abusive user.
USDS stories are always a mix of “wow” and “haha oh god” and this was no exception :-) Thank you for the work you do, USDS folks!
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way. Michael Kehoe and Todd Palino, LinkedIn.
A code yellow is a way of formally reacting to major problems. Michael and Todd shared two case studies of teams unable to make progress because they were buried under reactive operational work. In both cases, the right solution was to step back, assess the situation and ask for help. They declared a code yellow and assigned extra people to achieve clear goals on the project.
To run a successful code yellow, you first need to admit there’s a problem and measure it. Write a problem statement. Set clear exit criteria and concrete success criteria.
Pull in whoever is needed: at least a project manager and some dedicated engineers. For one of the case studies this meant taking five staff engineers off other projects. Plan for what work will be done, prioritising things that will reduce toil and burnout. Most importantly, communicate well. You need people to understand why this project has such high priority, and to feel like a partner in it.
I’ve seen code yellows used effectively at el Goog, and I’m glad other folks use this model too. This was a solid overview of a useful tool for saving teams who are in ops overload.
Creating a Code Review Culture. Johnathan Turner, Squarespace.
Code review ensures code quality for your organisation and also provides an opportunity to teach. John proposed some practices that contribute to a strong code review culture.
For organisations, communicate the culture and establish a community of experts, but be conscious of making space for new experts to develop. Train code reviewers to do a good job. A large part of setting the culture is for code authors and code reviewers to communicate respect for each other.
As authors, this means respecting the reviewer’s time and giving them a lot of context, the why as well as the what. Point out any tradeoffs you’re making, and think about how manageable the PR is, both in size and in content.
As a code reviewer, justify your critiques and engage with the author as an equal: assume they have reasons for the choices they made, and ask questions rather than just instructing them to change things.
John suggests reviewing in passes, where each pass is a theme.
You might begin by sizing up the “shape” of the PR, understanding what it’s trying to do, whether the change is necessary and whether this PR achieves it. Then you might review for readability — though hopefully style “nits” will have been eradicated by a linter or formatter. You might look at language gotchas, unnecessary use of esoteric language features, poor naming and bad spelling propagated by helpful IDEs. You should think about whether this code will be safe in production.
Watch for code that introduces new patterns, as they’ll inevitably be copied and pasted when other developers want to do the same thing.
I haven’t seen code review discussed at SRECon before, and I think this was a valuable topic. A bunch of people referred back to this presentation in other talks throughout the conference so it seems like it struck a chord. [Caveat: John’s also my coworker — Squarespace represented at SRECon this year! — so I have bias here too but seriously this was a great talk.]
SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager. Jen Wohlner, Fastly
Product managers focus on what and why, as opposed to program managers, who focus on when and how. (This is a neat definition.)
SREs already do some product management, by making proofs of concept, writing plans, and so on. But we can do more. Here are five steps towards thinking like a product manager:
Know your users and talk to them. Surveys are ok, but a higher touch approach is better. Try 30 minute meetings where you interview up to two people at a time. Interview TLs and managers separately from ICs to get different perspectives. Share the results broadly.
Ask non-leading questions. Don't assume people use your tool like you intended or that they like your product. Don’t ask "do you have problems with X" but instead "Have you done X? If so, walk me through how."
Do prototyping sprints. Limited time forces you to scope your project tightly, so focus on one thing for a week and choose core users to focus on. It’s good for team bonding too.
Take the time to plan project work even if you’re under water. Make the goals user-centric not tech-centric: not "roll out X", but "give users a pilot and get feedback". The roadmap needs to be owned by the team, not a manager or PM. Update it weekly or biweekly, and be willing to change priorities. Do a quarterly retrospective on the roadmapping process itself.
The roadmap spreadsheet is just an internal tool. For people outside the team, you have to package the information nicely [ed: yes, this!], e.g., by sending a bi-weekly email.
User needs and pain points change, so follow up with your users regularly.
One of my current hobbyhorses is that Infrastructure organisations should take more of a product approach — be clear about what features are available and what’s on the roadmap; understand user needs; solve the whole problem not just the technology part — and I was very happy to see this talk.
Optimizing for Learning. Logan McDonald, BuzzFeed
Expert intuition is not magic, it’s achievable. Logan was going on call for the first time but didn’t have a lot of practical experience as an SRE. She needed to decide what to learn from the vast fields of available knowledge.
Problem solving is easier with constraints. If you need to design a room, you’ll first want to discard most of the infinite possibilities by asking questions: “What do I need to fit in this room?” She used Mikey Dickerson’s hierarchy of service reliability and concluded that monitoring comes first; she asked to help build out the new monitoring system.
Robert Gagne, an educational psychologist, identified a hierarchy of learning, increasing in complexity from signal learning to rule learning and problem solving. Logan discovered that rule learning was key in incidents. She asked coworkers to vocalise their thoughts as they debugged. Beginners can only succeed if information is transparent.
Reading a textbook doesn’t embed long-term knowledge. We need “low stakes testing” to exercise retrieval (for example, trying to think of a solution before looking it up). [ed: wheel of misfortune exercises are great for this!] For maximum effect, we should interleave the information recall using Leitner boxes and use memory palace visualizations to make information memorable by tying a bizarre idea to something familiar.
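For anyone who hasn't met Leitner boxes: cards you answer correctly move to a box that is reviewed less often, and cards you miss go back to the first box. Here is a minimal Python sketch of that rule. The five boxes and the "box n is reviewed every 2^(n-1) sessions" schedule are my own assumptions for illustration; real schedules vary.

```python
def next_box(current_box, answered_correctly, highest_box=5):
    """Promote a card one box on a correct answer; send it back to box 1 on a miss."""
    if answered_correctly:
        return min(current_box + 1, highest_box)
    return 1


def due_boxes(session_number, highest_box=5):
    """List the boxes to review in a given session (sessions count from 1).

    Assumed schedule: box n comes up every 2 ** (n - 1) sessions, so box 1
    is reviewed every session and higher boxes progressively less often.
    """
    return [box for box in range(1, highest_box + 1)
            if session_number % (2 ** (box - 1)) == 0]
```

The effect is that the material you find hardest gets the most frequent low stakes testing.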
Good mental models are the basis of expert intuition and mental models come faster from observable systems. We need to evaluate our systems for how long it takes new people to come up to speed. After incidents, we should center the experience of the junior people: let them ask the questions.
Institutional knowledge is important so we need humans to tell each other things and to feel safe asking. When our mental health suffers, so does our ability to make strong ties between information. We need to emphasise psychological safety.
This was a great talk packed with both interesting theory and immediately applicable practical ideas for building a learning organisation.
Livetweets: https://twitter.com/msuriar/status/1110889187799191552 , https://twitter.com/tammybutow/status/1110887715032260609
Follow: @_loganmcdonald (and @emilywithcurls for the great images!)
Zero to SRE. Kim Schlesinger, ReactiveOps.
Your company can convert a junior engineer to a mid-level engineer in one year, insists Kim, getting our attention immediately. As Steve Kinney says, being able to take junior engineers and turn them into badasses is a competitive advantage.
You need to do work up front: don’t hire junior engineers until your culture is ready for them. Make it normal to spend time on learning (at ReactiveOps, they track 20% learning time in their project management tools) and create a culture where it’s easy to ask questions. Put teams on projects, not just individuals. [ed: yes! This changes everything!]
Give your juniors clear, measurable expectations. An engineering levelling doc helps. Make it clear how long people should try to solve problems themselves before asking for help. Create personalised technical learning plans, which include completing real projects with real deadlines. Develop systems of accountability, but provide lots of help and regular check ins. Focus on developing muscle memory first; abstract concepts come later when the knowledge has something to attach to.
Give junior engineers a team on day one, and be explicit about when you’ll start expecting them to be “productive” and sending PRs. Hire more than one and let each junior person help the next junior person. Teaching is a good form of learning and a great confidence booster.
A year in, compare against your levelling doc and promote them if they’ve gone up a level.
This was an insightful and practical talk about how to make junior engineers successful. I loved the emphasis on making people feel safe and included, and in particular the idea of putting learning time into project plans. That’s a good way to live.
Livetweets: https://twitter.com/tammybutow/status/1110898227090542592 https://twitter.com/Ana_M_Medina/status/1110897107911753731
One on One SRE. Amy Tobey, GitHub.
On Amy’s second day in her first job, someone asked her to log in to a server and kill a process. She used killall not realising how aggressive it is on Solaris, and took down the server. Twenty years later, she’s still thinking about this. [ed: I love these moments that change how you work forever. Even if they don’t feel so good at the time.]
After incidents, many of us run post-mortem meetings, but Amy’s team found that they were stressful for the team who owned the service. Rather than questions, they were often hit with comments and further trauma.
GitHub moved to 1:1 incident debriefs, with documentation that’s only shared with incident investigators and the people being interviewed. They try to set an environment where people feel comfortable being vulnerable. They ask:
broad general questions: “What was your role in the incident?" “What surprised you?” “What did you learn from this incident?"
questions that uncover burnout, like "How long did you work on the incident", "Were you able to get the support you needed?", “Did you practice self-care during this process?". This last reinforces that everyone should take rests, drink water, order food at the company expense, etc. It helps set the culture.
questions like “Do you feel that the incident was preventable?”, that let people get upset and shouty if they need to.
And they ask “Can you think of anyone else we should talk to?". For GitHub’s big outage last year, they ended up interviewing 34 people.
The way to build influence in an org is to start scheduling 1:1s with interesting people. Establish relationships and learn what people are working on. An individual contributor can influence availability at company-wide scale by building a personal network and having a lot of empathy.
I’m torn between wanting to try this 1:1 model and not wanting to spend this long on a single outage. But I can definitely see that it would take some of the pressure off the team afterwards. Maybe this is one to be used just for the very big outages. It’s an interesting idea anyway and one I’ll be thinking about a lot.
Amy Recommends: The Body Keeps The Score by Bessel van der Kolk, M.D.
Fault Tree Analysis Applied to Apache Kafka. Andrey Falko, Lyft.
Defining SLOs can feel like weather forecasting: if you get it wrong, people won’t be pleased with you. Andrey walked us through some worked examples of Fault Tree Analysis, a mechanism for evaluating failure probabilities invented at Boeing in the sixties.
FTA draws a system as a series of symbols, using events and logic gates. You could model a RAID0 array (striped, not mirrored, if you forget your RAID levels) with an OR gate, or RAID1 (mirrored) with an AND gate.
If disks fail with 4% probability every year, and you have two striped disks, the probability of losing at least one of them (and causing a complete outage) can be modelled as P(A or B) = P(A) + P(B) - (P(A) * P(B)). By comparison, P(A and B) = P(A) * P(B), which is much smaller. For a three disk array, RAID0 gives you roughly a 12% chance of failure, while RAID1 is less than 0.01%.
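The gate arithmetic is easy to check for yourself. This is a small Python sketch of OR and AND gates over independent failure probabilities; independence is an assumption (it is what the product formula above implies), and the function names are mine rather than anything from the talk.

```python
def or_gate(probabilities):
    """P(at least one input event occurs), assuming independent events."""
    none_fail = 1.0
    for p in probabilities:
        none_fail *= (1 - p)
    return 1 - none_fail


def and_gate(probabilities):
    """P(all input events occur), assuming independent events."""
    all_fail = 1.0
    for p in probabilities:
        all_fail *= p
    return all_fail


disk = 0.04  # annual failure probability per disk, from the example above
print(f"2 disks striped:  {or_gate([disk] * 2):.2%}")    # 7.84%
print(f"2 disks mirrored: {and_gate([disk] * 2):.2%}")   # 0.16%
print(f"3 disks striped:  {or_gate([disk] * 3):.2%}")    # 11.53%
print(f"3 disks mirrored: {and_gate([disk] * 3):.4%}")   # 0.0064%
```

For two events, `or_gate` gives the same answer as P(A) + P(B) - P(A) * P(B); computing it as one minus the chance that nothing fails just extends more easily to three or more inputs.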
If you’re running a Kafka message queue, you want replication and redundancy to minimize data loss, but there’s no point in over-spending on more replicas than you need. We can draw an FTA for Kafka’s Broker and Zookeeper components, measuring the probability of disk failure, network partitions, other hardware failures and OS failures. Having two brokers gets us 99% availability. Going to three gets us 99.95%. We can quantify the cost per nine, and see whether it’s worth it, or whether it would be better to change other inputs: e.g., switching to SSDs means a lower chance of hard disk failure, and can get us to four nines for the same number of replicas.
I’d seen some talks about FTA before, but this was the first time I’ve seen a practical talk about how it can be useful. I enjoyed this a lot.
Strategies to Edit Production Data. Julie Qiu, Google.
Having unfettered access to a database prompt sets you up to make mistakes. Julie laid out five stages, building on each other, to take you from running raw update commands to using a safe script runner service.
I’ve seen and blogged this talk before so I’m not going into detail here, but this is a great talk and I recommend it.
Livetweets: https://twitter.com/Ana_M_Medina/status/1110931540551315458 https://twitter.com/wiredferret/status/1110931536663273474
Extending the Error Budget Model to Security and Feature Freshness. Jim Thomson and David Laing, Pivotal.
Error budgets to measure availability are pretty common in the SRE world now. But availability isn’t everything: security is also important, and users always want new features. Can we use a mechanism like error budgets for security and feature freshness?
The vulnerability budget: we use an SLI of the number of days since a dependency we’re running was patched. If we have a 30 day SLO, we know patches get applied at least that often. 30 days might sound high, but it would have saved Equifax: their breach was on May 13th and the vulnerability in Apache Struts had been disclosed on March 7th. Pivotal use a 30d SLO, but actually patch every week.
The legacy budget: libraries need to be new enough that they get support, but not so bleeding edge that nobody is comfortable using them. Unlike other SLOs, they have both an upper and lower bound. Their SLO says that they upgrade every 90 days, but always to a stable and supported version.
These SLOs allow teams to demonstrate their value: it’s easier to show a service inside SLO than enumerate the security breaches you didn’t have, for example.
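The vulnerability budget is simple enough to sketch in a few lines of Python. The function names below are hypothetical and this is not Pivotal's tooling. For the Equifax comparison I'm assuming the year 2017 (the text above gives only the month and day), and I'm treating the disclosure date as the moment the dependency stopped being up to date.

```python
from datetime import date


def patch_age_days(last_patched, today):
    """SLI: number of days since the dependency was last patched."""
    return (today - last_patched).days


def within_vulnerability_slo(last_patched, today, slo_days=30):
    """True while the patch age is still inside the SLO window."""
    return patch_age_days(last_patched, today) <= slo_days


# Equifax example: Struts vulnerability disclosed March 7th, breach on May 13th.
exposure = patch_age_days(date(2017, 3, 7), date(2017, 5, 13))
print(exposure)  # 67
print(within_vulnerability_slo(date(2017, 3, 7), date(2017, 5, 13)))  # False
```

Sixty-seven days is more than twice the 30 day budget, which is the point the speakers were making.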
I hadn’t seen this way of quantifying the freshness of libraries before, and I would guess a model like this is good at encouraging people who are reticent about updates.
You Don't Have to Love Your Job. Leslie Carr, Quip.
"Choose a job you love and you'll never work a day in your life." Leslie says this can only be said by someone who's never had a job. She took us on a historical tour from 1500s Europe to Elon Musk, showing how the Calvinist idea of predestination led to people’s morality being measured by how successful they were at work. The puritans and pilgrims brought that to the US and that’s where our “work ethic” comes from.
Should you love your job?
Elon Musk brags about working 120 hours a week and tweets that nobody has changed the world in 40 hours a week. That doesn't give you much time to sleep. But if Tesla stock goes up $1, Elon Musk makes $33 million. That’s quite an incentive to convince his employees to work more. For comparison, Darwin worked three hours a day and changed the world pretty well.
Occasional spikes are ok, but continuous sustained 80 hour weeks shouldn't be. Working 60 hours or more actually makes you less productive overall than people who work 40 hours a week. It also leads to increased health problems and divorce rates.
Love inspires heroics, and the SRE community has worked hard to rid ourselves of heroics and personal sacrifice. Site reliability needs to be sustainable and businesses need to play a longer game. Don’t burn people out.
Since Leslie started talking to people about this talk, a bunch of them have told her "I was thinking of leaving tech because I don't love it." But they like it well enough and are good at it. Tech pays well and has good perks. “What percentage of LaCroix's revenue is from tech?”
You don't have to love your job. Just be friends with it.
Mindfulness in SRE: Monitoring and Alerting for One's Self. Tommy Lutz, Google.
Tommy found himself in a single-engine plane that had lost its engine, dropping fast with six minutes to impact. In life we have physical threats like this and we have threats to our ego, and in both cases our bodies react the same way: adrenaline spike; blood redirected to muscles; ready to fight. Talking on a stage makes our bodies react as if we’re in danger.
We make a lot of micro-decisions: automatic, fast, but low-accuracy conclusions. The inputs are physical state, sensory input and habits, run through threat assessment. Mindfulness lets us slow down and inspect these decisions to evaluate if there’s a real threat. By paying attention to the mind, body and surroundings with an attitude of kindness and curiosity, we can realise what our reactions are. We can go from reacting defensively, to realising what happened after the fact, to intercepting and modifying our behaviour in real time.
We can use it during incidents and during disagreements. Breathe, observe, decide and act, to build safety instead of defensiveness. Mindfulness can be cultivated with deliberate practice. Tommy got us to stop for a moment and focus on our breathing and, honestly, it was pretty great.
Resilience Engineering Mythbusting. Will Gallego, Fastly.
I was fried by the end of day three, and my notes became very rough. My summaries of the last two keynotes are heavily supplemented by other people’s livetweets. Livetweeters, you are heroes. 😍
Will presented some of the basics of resilience engineering, a discipline that comes from nuclear power safety, transportation, and medicine. An organisation is resilient if it can adapt. We can understand our system, monitor it, and therefore see what's coming and modify it before something goes wrong.
There are some ways that people might misunderstand resilience engineering:
use the word “resilience” where they mean “robustness” or “reliability”. Software can be robust but resilience requires humans to be part of the system for evaluation and decision-making.
believe that complexity is inherently bad, whereas it’s sometimes needed for functionality. We need to learn to cope with necessary complexity.
believe that chaos engineering finds bugs, whereas really it develops intuition.
always follow best practices, even though sometimes they’re wrong and a human should make that call.
insist that retrospectives have follow-up items, when they’re really for learning and building expertise.
look for root causes and ask why, instead of looking at all the contributing factors.
measure reliability by looking at past performance. Sometimes there are latent problems like Meltdown and Spectre that just haven’t happened yet. Ask experts “what scares you about the system” instead of measuring its current reliability.
believe error budgets control risk.
believe resilience engineering is representative of human factors when it’s kind of a monoculture.
This was a provocative talk and not all of it resonated with me, but we go to conferences to have our ideas challenged, right? It was one of the several talks that prompted me to go read more about resilience engineering (see the “Safety-II” section at the start of this blog post) and I consider that a win.
Livetweets: https://twitter.com/wiredferret/status/1111002519373791232 https://twitter.com/msuriar/status/1111002565234311170
Will recommends: The ETTO Principle - Erik Hollnagel, How Complex Systems Fail - Richard Cook, The Field Guide to Understanding 'Human Error' - Sidney Dekker
Why Are Distributed Systems So Hard? Denise Yu, Pivotal.
A long time ago, all business apps talked to one database. But storage and retrieval needs evolved. For a while people scaled vertically (bigger machines!), but cloud services and VMs allow us to scale horizontally, shard and replicate, as well as putting data closer to our users.
Distributed computing is hard to reason about. In the nineties, people at Sun Microsystems wrote up the Fallacies of Distributed Computing. The first fallacy is that the network is reliable, and we shouldn’t assume it is. So how can we know that communication happened? The Byzantine Generals problem demonstrates the difficulty of reaching a decision when the participants can’t trust each other.
In 2000, Eric Brewer brought us the CAP theorem. This is often framed as “consistency, availability, partition tolerance: choose two”, but you can’t not choose partition tolerance because you can’t assume the network is reliable. Partitions will happen. So you get to choose C or A.
C is actually for… Linearizability (hahaha), a narrow version of consistency. It means that changes have to be applied consistently to all replicas: if any replica has a value, every replica should have the value. A is for availability, though it’s hard to distinguish between “unavailable” and “just slow”. During a partition, you can choose to have availability (allowing reads and writes on both sides), but since the two sides can’t talk to each other, you lose linearizability. Or you can choose consistency and guarantee that both sides are the same by not writing any more, at the loss of availability.
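Here’s a toy sketch of that trade-off, using the simplified definition above (every replica should hold the same value). It’s my own illustration rather than anything from Denise’s slides: `TwoReplicaStore`, its `mode` flag and the `demo` helper are hypothetical, and a real system has far more moving parts.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None


class TwoReplicaStore:
    """Two replicas that either stay in sync (CP) or stay writable (AP)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.left = Replica("left")
        self.right = Replica("right")

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            self.left.value = value
            self.right.value = value
            return
        if self.mode == "CP":
            # Choose consistency: refuse the write, giving up availability.
            raise RuntimeError("unavailable: writes refused during partition")
        # Choose availability: accept the write locally, so replicas can diverge.
        replica.value = value

    def replicas_agree(self):
        return self.left.value == self.right.value


def demo(mode):
    store = TwoReplicaStore(mode)
    store.write(store.left, "v1")
    store.partitioned = True
    try:
        store.write(store.left, "v2")
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, store.replicas_agree()


print(demo("AP"))  # (True, False): write accepted, replicas disagree
print(demo("CP"))  # (False, True): write refused, replicas still agree
```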
Distributed systems are hard for a bunch of other reasons, including imperfect hardware, imperfect software and multi-tenant systems without user isolation.
We have some mitigation strategies: automatic leader election for keeping things writable, consensus algorithms like Paxos and Raft.
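Consensus algorithms like Raft and Paxos generally rely on a majority quorum, which is why only one side of a partition can keep electing leaders and accepting writes. A minimal sketch of that arithmetic (the function names are mine, not from any particular library):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    """A side of a partition can elect a leader only if it holds a majority."""
    return reachable_nodes >= majority(cluster_size)


# A five-node cluster split 3/2 by a partition:
print(majority(5))             # 3
print(can_elect_leader(3, 5))  # True: this side stays writable
print(can_elect_leader(2, 5))  # False: this side has to wait
```

Because two disjoint groups can’t both hold a majority, at most one side of a partition can stay writable.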
Distributed systems of humans are hard too. We all have different mental models and it’s hard to share state and compare them. We need an environment that’s optimised for learning. During incident analysis, we need to look for unintuitive design, alert fatigue, unspoken assumptions, normalization of deviance, and other aspects affecting the humans in the system. [ed: and all of the ways in which humans just misunderstand each other while debugging!] We need to understand and design for the whole system, including the human parts.
This was freaking great and Denise’s slides are the most beautiful slides.
Book: http://leanpub.com/achildrensatozofcontinuousdelivery, Denise Yu and Steve Smith
Livetweets: https://twitter.com/lizthegrey/status/1111010808354533376, https://twitter.com/msuriar/status/1111010760635940864
Denise recommends: Designing Data-Intensive Applications, and this Nickolas Means talk, [warning: autoplaying video!] "Who destroyed Three Mile Island?"