02 Sep 2017 : Conference Report: SRECon EMEA 2017

Between travel, meetings and other commitments, I didn’t get to see as many SRECon EMEA talks as I would have liked to this year. But I liked what I saw! These summaries are assembled from handwritten notes and may contain lies or nonsense. If I’m misrepresenting anyone, please let me know and I’ll fix it.

I’m including links to the abstract for each talk; most of those will include links to videos once they’re available in a few weeks.

Here are the talks I saw:

Deploying Changes to Production in the Age of the Microservice, Samantha Schaevitz, Google

(program, livetweets by msuriar)

Sam identified four kinds of changes:

  • client change, pulled by the user. If the change goes badly, it might take O(days) to roll back.
  • new server binary version. Can be rolled back in O(hours).
  • static config change such as a flag flip. O(hours).
  • dynamic config change, such as watching a file for changes. O(minutes).

What if you never changed anything? It’s very stable, but your product managers won’t be happy, and neither will developers who don’t get to deploy their code. To quote John A Shedd, “A ship in harbor is safe, but that’s not what ships are built for”. And not all changes are under your control: infrastructure, OS, etc, can change underneath you.

Release to increasingly risk-averse bands of users, e.g., dev team testers, then internal testers, trusted external testers, free tier customers, and finally paying customers.

A breakage has costs, e.g., immediately lost revenue, lost user trust, contract violations, increased infrastructure costs, and the opportunity cost of your SREs and devs spending their time recovering from the breakage. And breakages come in many forms, from automatic retries that users don’t notice, to latency increases, to serving 500s. A new UI that users hate is a breakage, as is an unexpected increase in resource usage.

The optimal rollout is staged, progressive, revertable, automatic and well understood. Make sure everyone on the team knows how to roll back, not just the expert. And make it happen without an operator having to make decisions. “This one feels low risk; I’ll skip the 50% step…” shouldn’t be an option.
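As a minimal sketch of "staged, progressive, revertable, automatic", the loop below walks every stage in order and rolls back by itself on the first failed health check. The stage fractions and the `deploy`, `healthy` and `rollback` callables are hypothetical; the talk didn't prescribe any of them. Note that there is deliberately no way for an operator to skip a stage.

```python
from typing import Callable, Sequence

# Hypothetical stage fractions; the talk did not prescribe specific numbers.
STAGES: Sequence[float] = (0.01, 0.10, 0.50, 1.00)


def run_rollout(
    deploy: Callable[[float], None],
    healthy: Callable[[float], bool],
    rollback: Callable[[], None],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Walk every stage in order; roll back automatically on the first bad check.

    Returns True if the rollout reached the final stage, False if it was reverted.
    """
    for fraction in stages:
        deploy(fraction)           # push the change to this fraction of users
        if not healthy(fraction):  # automated check, no operator judgement call
            rollback()
            return False
    return True


if __name__ == "__main__":
    log = []
    ok = run_rollout(
        deploy=lambda f: log.append(f"deploy {f:.0%}"),
        healthy=lambda f: f < 0.50,  # pretend the 50% stage reveals a problem
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```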

To detect badness early:

  • have good test coverage, including unit and integration tests
  • continuously build and test
  • A/B experiment for everything
  • automate to limit toil and error
  • deploy often to reduce the change surface
  • make rollbacks easy, and versions backwards compatible

Consistent naming is more important than what the names are, e.g., there should be no confusion about what’s happening when you’re deploying to qa. If different services use different names, your rollout tools end up with awful regex matching for stages. (Oh my god it’s so true.)

Application Automation with Habitat. Mandi Walls, Chef Software

(program, slides, github, livetweets by msuriar)

Running applications in diverse environments gets complicated quickly. We have different hardware, OSes, libraries, configs and start/stop mechanisms. Reducing complexity often really means moving it around – into the config management system, for example, with big mucky switch statements, like “if it’s on debian, set this flag”. (I’ve seen this sort of thing and I agree that it’s amazingly gross.) We want platform agnosticism.

Habitat provides a consistent way to build and run applications by allowing repeatable, hermetic builds with signed packages.

Apps are built in Habitat Studio, a busybox cleanroom. You roll up a set of binaries and libraries that you know will play well together. Most are built from source, but you might include some off the shelf commercial binaries. Config travels with the app, but some decisions, e.g., config parameters, can be parameterised with handlebarsjs syntax and left until runtime.

Plan files, written in bash, live with the application and define the build. The output is a runnable tarball called a hart.

There’s a public depot with team namespacing. You can’t run this on premises yet, but that’s coming.

Mandi did a live demo, which is always risky, but it worked. (Congratulations :-) ) She showed the habitat supervisors having a leader election, and how their HTTP endpoints expose what’s running and who’s the current leader.

I’m always appreciative of people who are working on complexity reduction, and this was relevant to my interest in disaster recovery: a well-tested hermetic environment is much easier to restore than dependency chaos. Habitat seems interesting and I’ll go read more about it.

Case Study: Lessons Learned from Our First Worldwide Outage. Yoav Cohen, Imperva Incapsula

(program) Incapsula run thousands of servers in 35 PoPs, including http/s proxies, Behemoth servers that mitigate DDoS attacks and management consoles for customers to use for configuration.

Their customer config database changes constantly and needs to be distributed to all servers. Their SLA says that changes will be live globally in 60s. They achieve this by maintaining a local repo on every server, updated from the leader repo. Then the running applications on each server read from the local repo.

This worked great until it caused their first global outage. Customers write their configs in freeform text, and one customer accidentally put a double quote in a bad place. It passed the validation checker, whoops, and got to the local repos, where it caused every one of the proxies to go into a crash loop. (Parsers are hard :-( ). Millions of websites became unreachable.

This should have been self-healing: each server was supposed to maintain a list of sites that were being processed immediately before a crash, so they’d know which configs to avoid after a config of death. Why didn’t that safety mechanism work? Because it was commented out! Why didn’t testing catch that? Because there weren’t any tests for it. Humans had to analyse core dumps and remove the bad config manually; the service was down globally for 30-40 minutes.

In the postmortem, they identified two kinds of possible crash, with mitigations for both:

  • crashing while loading rules, immediately after the new config is pushed. They added a sandbox repo to canary the new change before the real repos. If it crashed, the new config wouldn’t get pushed out.

  • crashing while evaluating rules. This might not happen for hours or days until a request comes in that matches that part of the config. When that happens, they will go back to the last known good config: they have 5m, 1h and 24h snapshots. But (I noticed with my “fallback plans are scary” hat on), they’ve only had to use the 5m snapshot once, and the others not at all.
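The first mitigation, combined with the "last known good" fallback, might look roughly like the sketch below. It is an illustration only, not Incapsula's implementation: `json.loads` stands in for their rule parser, and a try/except stands in for the sandbox. A real crash would take the whole process down, so a production canary would have to load the candidate in a separate process or host.

```python
import json
from typing import Callable


def canary_then_publish(
    candidate: str,
    last_known_good: str,
    load: Callable[[str], object] = json.loads,
) -> str:
    """Return the config that should be distributed to the real repos.

    The candidate is first loaded in a sandbox (here, just a try/except around
    a stand-in parser). If the loader fails, the previous snapshot is kept.
    """
    try:
        load(candidate)          # sandbox step: does the loader survive it?
    except Exception:
        return last_known_good   # keep serving the previous snapshot
    return candidate


if __name__ == "__main__":
    good = '{"site": "example.com"}'
    bad = '{"site": "example.com}'  # a double quote in a bad place
    print(canary_then_publish(bad, last_known_good=good))
```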

I enjoyed this talk. It was a good story, well told, and it’s the best when people are willing to share the things that went wrong and let us all learn from them. I can imagine a lot of people who were in the room making better global config decisions from hearing this talk.

When Trouble Comes to Town. Michael Gorven, Facebook

(program) When a critical service is down, it’s easy to become paralysed with fear. Having a set of steps to follow helps us react in a more useful way.

  1. Don’t panic. You shouldn’t be alone: resolving incidents should be a team effort. You shouldn’t be at risk of being fired either for causing or failing to debug the alert. (If either of those things is not true, consider working somewhere else.) And the outage is – almost certainly! – not the end of the world.
  2. Assess the impact. What’s broken and how broken is it? Dashboards should make this easy.
  3. Communicate. Inform the company. Coordinate the response. They use Facebook groups for this. Advertise where the realtime discussion is happening. Nominate an incident manager to coordinate the response and make sure the right people are working on it. If escalation is required, escalate early.
  4. Find out what changed. Was it a code push, a config change, an instance started or stopped? It’s better to have a central change database where all of the systems publish events, rather than looking at ten different places. Work out the exact start time: user impact can lag the start of the actual problem.
  5. Mitigate. Can you restore some functionality? Drain traffic? Limit the blast radius? If there’s a way you want to mitigate but can’t, file a bug to add that control after the incident.
  6. Identify the root cause. But this is much less important than mitigating: revert the change or reduce the damage before you start on the root cause. If you have many people involved, you can do this in parallel.
  7. Resolution. Confirm that the user impact is actually gone. Maybe you fixed the wrong thing! Maybe the problem moved. Communicate internally that you believe it’s fixed, because people may still have problems that they think you’re working on.
  8. Cleanup. While it’s still fresh in your mind, record the details, write the incident report, document what happened. Include a timeline, root cause and followups. A followup template could look like
    • detection: did we have the right alerts?
    • escalation: how well did it go?
    • recovery: could we have mitigated faster?
    • prevention: how do we stop it happening again?

This is a great set of steps, and I think it’d make an excellent infographic to help reassure oncallers.

Incident Management and Chatops at Shopify. Daniella Niyonkuru, Shopify

(program) Being on call causes stress and anxiety:

  • I might forget I’m on call
  • my phone might be on silent
  • I might forget to update the status of the outage
  • I might not know who to escalate to
  • too much context switching might make me unable to focus on the problem

Chatops is conversation-driven communication, where you customise a chatbot to make your life easier. Shopify use a slack chatbot called Spy, based on Lita. They’ve integrated it with Pagerduty and Github. Some sample commands:

  • spy page imoc “there is some problem” : page someone
  • spy incident me “incident name” : open an incident with me as incident commander
  • spy incident tldr
  • spy incident tell :team “some message”

These commands happen right in the slack channel that everyone’s looking at, so people have scrollback and can follow the narrative of the outage. This helps new people onboard too. You get to spend more time in a single chat window, meaning fewer context switches and less likelihood of distraction. It reduces the cognitive overhead of too many open browser tabs.

Spy sends reminders and makes sure you don’t drop important status updates, and gives you several warnings in advance to make sure you know you’re going on call. Best of all, it will let your team know if you’ve been handling an outage for more than an hour, and ask for someone else to step in and prevent fatigue. (This should be standard procedure for on call rotations everywhere, but I’ve heard of almost no other groups that do it.)
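To make the idea concrete, here is a small sketch of two pieces a bot like this needs: splitting a chat line into a command plus arguments, and deciding when someone has been on an incident for too long. This is not Spy's code (Spy is built on Lita, which is Ruby). The function names are made up, the only number taken from the talk is the one-hour limit, and the parser assumes straight double quotes around the message.

```python
import shlex
from datetime import datetime, timedelta

FATIGUE_LIMIT = timedelta(hours=1)  # threshold from the talk; the rest is made up


def parse_command(line: str) -> tuple[str, list[str]]:
    """Split 'spy incident me "db is down"' into ('incident', ['me', 'db is down'])."""
    words = shlex.split(line)  # honours straight double quotes around the message
    if len(words) < 2 or words[0] != "spy":
        raise ValueError(f"not a spy command: {line!r}")
    return words[1], words[2:]


def needs_relief(started_at: datetime, now: datetime) -> bool:
    """True once the same person has been handling the incident for over an hour."""
    return now - started_at > FATIGUE_LIMIT


if __name__ == "__main__":
    print(parse_command('spy incident me "checkout is slow"'))
    start = datetime(2017, 9, 1, 10, 0)
    print(needs_relief(start, start + timedelta(minutes=75)))
```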

Every talk I’ve ever seen from Shopify has been fresh and interesting and this one continued that trend. A stressed oncaller makes poor decisions (and it’s just not a fun way to live), so I love that they’re consciously thinking about how to make oncall better for the humans involved. And Chatops seems nifty.

Lightning Talks

  • 6 Ways a Culture of Communication Strengthens Your Team’s Resiliency. Jaime Woo, Shopify

    (livetweets by msuriar). If you’ve ever listened to a bad wedding speech, Jaime told us, you know how important good communication is. Hah! Here are six reasons communication is good for your organisation:

    1. Better cross-pollination of ideas. Shopify brings everyone together 4-5 times per year for events and internal hackathons.
    2. Make every mistake unique. They write blog posts and magazine-style internal articles about mistakes, and have a slack channel where it’s safe to ask for help. They want to get as much value out of every mistake as possible. (I love this.)
    3. Have a growth mindset. Make sure everyone can feel safe admitting mistakes. Being quiet about them doesn’t help anyone. So, when you make a mistake, tell your team, then the company, and maybe even conferences.
    4. Aid a greater sense of direction: where the company is going and why. New people are joining all the time, so it’s not enough to give that message once; it has to be continuous.
    5. Underline their strengths. Telling the story of the company makes it easier for everyone to see who they are and what they’re good at.
    6. Increased connection to what they’re working on. It makes people feel like they’re part of a greater thing that matters.
  • Dynamic Documentation in 5 minutes. Daniel Lawrence, LinkedIn

    (github, livetweets by msuriar) Humans are lazy and forget things, we can’t type fast and we can only do one thing at a time. Daniel had complex systems that weren’t all the same, and wanted to make it easy for people to fix problems. There are tons of parameter options, but he went with moustache style, partly because it’s fun to say moustache style. Hah :-) They can add parameters to their wiki that get automatically filled in based on the query string. They can show or hide sections of the doc, e.g., if the query string says this is java, add some GC information.

  • Resource management and isolation, the non-shiny way. Luiz Viana, Demonware

    (livetweets by msuriar) Containers solve many problems, but they’re not always the right answer. They’re good when you have immutable images and you’re constantly scaling up and down, but an ssl 0day can mean recreating thousands of containers just to update a library. People are using them for resource isolation, when the linux kernel has tons of resource isolation features already available. Use cgroups, under /sys/fs/cgroup. It’s currently not that user-friendly, but cgroups v2 will make it easier to use. You can isolate services and workloads and protect against bad neighbours. And there are no extra dependencies; it’s already in the kernel. Containers are good for what they’re good for, but take time and research and see if they’re the right tool for you.

  • Collecting metrics with Snap - the open telemetry framework. Guy Fighel, SignifAI

    (slides, github, livetweets by msuriar) Telemetry involves a lot of tools, a lot of formats, a lot of metrics with different collection needs. What should we collect, and how can we be smarter about it? Snap offers easy scheduling, scaling and dynamic control. It uses an open plugin model: over 100 plugins are available and it’s easy to add more. And plugins can swap in without restarting, which is pretty cool. It has a flexible three part workflow:

    1. collection. From the OS, applications, etc.
    2. processing. Adding context, anomaly detection, statistics, filtering, etc.
    3. publishing. To dashboards, logging, alerts, etc.
  • Decentralized Data. Jason Koppe, Indeed

    (livetweets by msuriar) Indeed’s sysadmin team created a data transfer system that is used to propagate data around the world. But it had problems. Replicated corruption caused site-wide outages that needed to be manually resolved. They needed to run some jobs with custom heap sizes to avoid cascading OOMs. And the system was a bottleneck: anyone who wanted to add a new artifact to the system needed time from the sysadmin team. They created a new system called Reliable Artifact Distribution (RAD!). Five improvements:

    1. Resuming from last known good data after a crash
    2. Atomic filesystem ops to avoid corruption
    3. A canary for data updates
    4. Using bittorrent to route around failures.
    5. A self-service interface so that developers could declare producers and consumers in code without needing to come to the sysadmin team.
  • Live Failover, Emil Stolarsky, Shopify

    (livetweets by msuriar) Shopify runs active/passive sites and failovers have to be intentionally triggered. This used to involve a bunch of engineers at the low-traffic time of day working through a checklist, but they’ve recently integrated it into their chatops. Demo time! Emil failed over as we watched (it was really cool) and showed a photograph of a coworker’s three year old doing the same thing. “So simple a three year old can do it” is a pretty great failover model.
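Going back to the Decentralized Data talk: "atomic filesystem ops to avoid corruption" is the easiest of the RAD improvements to sketch. The usual pattern is to write to a temporary file in the same directory and then rename it over the destination, so readers see either the old file or the new one and never a half-written one. This is the generic technique, not Indeed's code, and `atomic_write` is a made-up name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the partial temp file; the destination is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```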

Distributed Systems, Like It or Not. Theo Schlossnagle, Circonus

(program) The “distributed systems facts of life talk”. Nice :-) We have progressed from simple to complicated systems:

  • 1990: single user, single system
  • 1995: distributed users, single system
  • 2000: single user, distributed system
  • 2005: distributed users, distributed system

A lot of the tooling we need is no longer technical tooling, it’s brain tooling.

Can we avoid distributed systems? No, they’re everywhere. For example, GPUs mean the iPhone has eight different processors with different clock speeds. So, since there’s no escaping distributed systems, what do we need to think about?

  1. Time. With one clock, you know what time it is. With multiple clocks, you never do. It takes 1ns for the light to go from a clock to your eyes, so by the time you know the time, your CPU can have done 3 instructions. Threaded race conditions are bad enough when you’re in the same box, but clocks are separate devices. Some day we’ll look back at this time and say “Do you remember when distributed systems didn’t have a single clock? That must have been hard!” If we had better time, everyone with a computer could use Spanner. Clocks aren’t for timing things, they’re for knowing what happened in what order. A recent ACM Turing award was for Lamport timestamps, which give a guaranteed ordering of events.

  2. Causal thinking. Debugging distributed systems means that every single system state is potentially relevant to the outage.

  3. Byzantine failures. Pathological timing failures.

  4. Consensus. Paxos is hard; Raft is much easier. Virtual synchrony is an alternative approach. (It’s also called “ordered, reliable multicast”, says wikipedia.)
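The Lamport timestamps mentioned under "Time" are small enough to sketch. Each process keeps a counter instead of reading a wall clock: it increments the counter on every local event or send, and on receipt it jumps past the sender's timestamp. That guarantees that if one event causally precedes another, its timestamp is smaller. Getting a total order out of it also needs a tie-breaker, such as a process id, which this sketch leaves out. The class and method names are mine, not from the talk.

```python
class LamportClock:
    """Logical clock: counts events instead of reading wall-clock time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return self.time


if __name__ == "__main__":
    a, b = LamportClock(), LamportClock()
    sent = a.tick()              # a sends a message stamped 1
    b.tick()
    b.tick()                     # b does unrelated work, now at 2
    received = b.receive(sent)   # max(2, 1) + 1 = 3
    print(sent, received)        # the send is ordered before the receive
```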

Theo showed a picture of a complex system. “This is probably the recommended architecture for a blog these days”. Hahaha, amazing. This whole talk was funny and entertaining.

Distributed systems give us modular development, language domains, security domains and higher availability. But in return, many distributed systems situations appear to make no sense: every individual component claims to be fine, but there’s still an outage. Pathology is the diagnosis of issues after the event.

Avoiding and Breaking Out of Capacity Prison. Jake Welch, Microsoft

(program) Why manage capacity? To improve the customer experience (e.g., full disks cause latency) and reduce operational toil caused by needing to be reactive. We need to identify which physical resources we care about and what their limits are. Monitoring is important, with alerts and forecasting. Start with simple models. Linear forecasting is better than nothing.
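"Linear forecasting is better than nothing" can be as simple as fitting a straight line to recent usage samples and seeing when it crosses the limit. The sketch below does that with ordinary least squares. The function names, the sample data and the units (days, TB) are assumptions for illustration and don't come from the talk.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days from the last sample until the fitted line crosses capacity.

    Returns None when usage is flat or shrinking, since the line never crosses.
    """
    slope, intercept = linear_fit(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


if __name__ == "__main__":
    # Made-up samples: disk usage in TB on days 0, 7, 14 and 21.
    print(days_until_full([0, 7, 14, 21], [40.0, 42.0, 44.0, 46.0], capacity_tb=60.0))
```

An alert on "fewer than N days until full" is then a forecast-based alert rather than a threshold on current usage.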

They use a 5y forecast for datacenters, 2y for servers. You need to be prepared for spikes in demand (e.g., during a feature launch) and for supply chain disruption. For example, flooding in Thailand greatly increased costs and affected the supply chain for two quarters. Make sure you can survive six months where it’s impossible to get new hardware. Don’t wait until it happens to find out.

They have various “levers”, such as shifting load to other regions, which are implemented as playbooks.

They use vector bin packing: efficient predictable placement of many objects. Jake walked us through the model, but it went so far over my head that I can’t say anything about it. Sorry, forecasting is definitely not my area :-/
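For a rough idea of what vector bin packing means, here is a first-fit-decreasing sketch. It is a common textbook heuristic and definitely not the model Jake presented. Each item and each bin has several dimensions, and an item only fits if every dimension fits. The two dimensions used in the example (cores and GB of RAM) and all the numbers are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def fits(load: Sequence[float], item: Sequence[float], capacity: Sequence[float]) -> bool:
    """An item fits only if every dimension stays within capacity."""
    return all(l + i <= c for l, i, c in zip(load, item, capacity))


def first_fit_decreasing(items: List[Vector], capacity: Vector) -> List[List[Vector]]:
    """Place each item in the first bin with room, opening a new bin if none has.

    Items are sorted by their largest capacity-normalised dimension, biggest first.
    """
    for item in items:
        if not fits([0.0] * len(capacity), item, capacity):
            raise ValueError(f"item {item} exceeds the capacity of an empty bin")
    ordered = sorted(
        items,
        key=lambda it: max(d / c for d, c in zip(it, capacity)),
        reverse=True,
    )
    bins: List[List[Vector]] = []
    loads: List[List[float]] = []
    for item in ordered:
        for contents, load in zip(bins, loads):
            if fits(load, item, capacity):
                contents.append(item)
                for k, d in enumerate(item):
                    load[k] += d
                break
        else:  # no existing bin had room in every dimension
            bins.append([item])
            loads.append(list(item))
    return bins


if __name__ == "__main__":
    # Hypothetical workloads as (cores, GB of RAM) on 16-core, 64 GB servers.
    workloads = [(8.0, 16.0), (8.0, 16.0), (4.0, 48.0), (2.0, 8.0), (4.0, 8.0)]
    for i, contents in enumerate(first_fit_decreasing(workloads, (16.0, 64.0))):
        print(i, contents)
```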

In summary, identify the physical and logical limits that affect you and make sure you’re monitoring them.

Run Less Software; Use Less Bits. Rich Archbold, Intercom

(program) Intercom is a successful Irish startup, but they face several threats:

  • other companies who might copy their business. Low interest rates and the ease of spinning up software using frameworks and everything-as-a-service means that any competitor with junior developers could quickly catch up; the first to market advantage is gone.
  • if one of the “big four” (Amazon, Apple, Google, Microsoft; I hadn’t heard this expression before) gets into the same market, Intercom could quickly find themselves irrelevant.

Is it paranoid to think like that? Rich gave several examples of market leaders who suddenly weren’t: Slack beat Hipchat; Amazon registered a trademark similar to Blue Apron’s and Blue Apron’s stock took an immediate hit.

The solution is to be more agile than your competitors, so you can move and react faster. That means:

  • using standard technology. Use a small, opinionated set of technology that your company can become expert in. Rich showed us the ten technologies they use at Intercom.
  • outsourcing. 70% of most companies’ energy is spent on “undifferentiated heavy lifting”. Intercom has gotten that down to 40% and are trying to push it lower. Rich quoted Peter Drucker: “There is surely nothing quite so useless as doing with great efficiency what should not be done at all.”
  • creating enduring competitive advantage. And this includes hiring a diverse set of people. Look for problem solvers, not technologists.

Rich used violent, warlike metaphors throughout and it was clear that he sees this as a real existential fight for survival. This was an interesting talk for me, and made me realise that I’d like to see more talks on how startups operate, or at least how they make their technology decisions.

Service with an Angry Smile: Passive-Aggressive Behavior in SRE. Lauri Apple, Zalando

(program, livetweets by lizthegrey)

I only came in for the end of this, but the last few minutes gave me a major takeaway from the conference: your team should have a “definition of ‘done’”. I don’t know if my interpretation of those words is even what Lauri intended, but it felt extremely profound and useful, while at the same time immediately seeming obvious, which is basically the definition of genius.

A lot of people have said afterwards that this was an excellent talk, and I regret missing it. It’s one of the ones I hope to catch at another conference.

The Cult(Ure) of Strength. Emily Gorcenski, Simple

(program, livetweets by lizthegrey)

Strength culture is the emphasis on sacrifice as a virtue and over-valuing exceptionalism: we look to heroes to set our everyday standards. As a transgender woman, Emily has often been told “You’re so brave” and “You’re so strong”, simply for existing. But this is empty praise. We ask marginalised people to be badasses… and unfairly require it of them. “Adi is so brave to be the only black person on campus.” “Kelly has to walk up two flights of stairs with her prosthetics. What an inspiration.” But Kelly and Adi don’t want to be brave or inspirations: they want you to install an elevator and hire more black people. And they want you to put your voice to fixing that, instead of praising their badassery. We shouldn’t just accept that everyday acts are hard.

Strength is a tax that we pay with emotional labour. When we require bravery, we demand free work.

Farida Bedwei is a founder and software engineer from Ghana making waves with cutting-edge mobile payments technology. But articles about her focus on the fact that she has cerebral palsy. The articles even say things like “She hasn’t let her disability affect her success”: the sacrifice is always expected to be part of the story.

In the tech industry’s push to disrupt society, we have managed to instead completely replicate its antipatterns. We assign strength to things that aren’t actually relevant – all of those rockstar job reqs – and over-emphasise and normalise extraordinary sacrifice. If your process requires sacrifice, your process sucks.

On call culture has traditionally been full of this sort of martyrdom: we even call our anecdotes ‘war stories’. But that’s toxic and dishonest. We’re not warriors.

Emily contrasted with pictures from Charlottesville and showed some real life-or-death scenarios where ordinary people were forced to step up and be heroic. There are definitely times when it’s necessary to be a hero and it’s ok to ask for extraordinary acts of sacrifice then. But it shouldn’t be required for a regular tech job.

When we elevate people to rockstars, we make it impossible for the people who can’t put in the free labour, so organise to be paid for on call and for learning in your free time. Even if you don’t need that, other people do.

My initial reaction to this last part was “but I like learning in my free time… wait, I get it now!”. Because Emily is right: not everyone has free time and resources to do that. And if learning outside working hours is a required part of the job, we’re limiting the number and types of people who get to do the job. Mind. Blown. So we should think about the assumptions we make about what extra unpaid hours tech people need to do.

This talk made me think a lot, and I recommend it. (As did a lot of people: check out the number of retweets on @nocoot’s tweet about it!)

Have You Tried Turning It off and Turning It on Again? Tanya Reilly, Google

(program, livetweets by msuriar, livetweets by lizthegrey)

Ok, this is my own talk so I can’t review it, but it was such an honour to be livetweeted by Murali and Liz and I’m hubristically linking those here. This also explains why I’ve got no notes on the plenary that immediately followed: I was talking with folks outside the room and decompressing, and I didn’t come back into the room until it was already over :-( Some day I’ll be all “finished, nbd” after speaking and just sit down and watch what’s on afterwards, but haha not yet :-)

Persistent SRE Antipatterns: Pitfalls On the Road to Creating a Successful SRE Program Like Netflix and Google. Jonah Horowitz, Stripe, and Blake Bisset.

(program, livetweets by msuriar, livetweets by lizthegrey)

Last talk! Jonah and Blake set up a fancy tea service on stage, echoing the call from Niall at the start of the conference to de-emphasise alcohol culture in SRE. This talk felt a lot like a podcast: it apparently started as a conversation in a bar and they continued it on stage. This made it fun to listen to, but the stream of consciousness made it hard to recap. Some things I noted though:

  • you don’t need a NOC any more
  • alerts should be based on user experience, and should be actionable
  • don’t rely on humans to absorb the vast corpus of your data and decide what’s important or not
  • if you get paged at night, just do remediation at night. Root cause analysis should happen in working hours.
  • almost nobody needs a <5m response SLA
  • don’t have any servers that need manual processes, or use config management as a facade over heterogeneous services
  • if a machine can do it, don’t use humans. Automated response is good. File bugs to fix the things.
  • don’t burn out your team
  • reduce the need for SREs; we should scale sub-linearly
  • base SLOs on business needs. You don’t need all the nines.
  • don’t try to exceed your SLO.
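The arithmetic behind "you don’t need all the nines" is worth seeing once: every extra nine cuts the allowed downtime by a factor of ten. A small sketch, where the 30-day window is my assumption and not something from the talk:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO permits within the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{slo:.3%} -> {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

At 99.9% that is about 43 minutes per 30 days; at 99.999% it is under half a minute, which is less time than most humans need to even acknowledge a page.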

In conclusion

I only got to two days of the conference, and I had some other commitments for part of those days, so I missed out on some really good stuff. Some other talks I heard people get particularly excited about:

Some repeating themes

  • hero culture is bad. Stop doing that already. (Yes!)
  • self-healing architecture; don’t spend humans on what a machine can do. (Yes!)
  • don’t make design decisions just because it’s in the Google book; do what’s right for your org. (I’m surprised that this needed to be said, but I heard it said a bunch of times, so apparently it does.)

lizthegrey, msuriar and quinnypig livetweeted many of the sessions. This was something I hadn’t realised I needed in my life but now I want it for every conference.