Conference Report: SRECon Americas 2017 Day 2
Here’s what I saw on day two of SRECon! I wrote about day one over here. These summaries come from a mix of handwritten notes and things I remembered; if I got something wrong, please let me know and I’ll fix it.
Observability in the Cambrian Stack Era. Charity Majors, Honeycomb
Monitoring is a solved problem, Charity told us, but debugging is not. In the age of distributed systems, we shouldn’t think about individual nodes: we need to observe our systems across barriers and boundaries. We need tools like strace for systems, like changing TCP headers in flight, like running gdb throughout the network.
But instead we keep making more dashboards! Dashboards are great at understanding the last outage, or showing us our KPIs, but they’re not good for debugging. The future must be explorable.
Logging suffers from some of the same problems as dashboards: you need to know in advance what information you want. At least structure your data: write out a json blob instead of a string. The old model used to be “be as chaotic as you want on the client side and we’ll make sense of it on the server”. No more.
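In Python, that "write a JSON blob instead of a string" advice might look like this minimal sketch (the field names are my own illustration, not anything Honeycomb prescribes):

```python
import json
import time

def log_event(**fields):
    """Emit one structured event as a JSON blob rather than a free-form string."""
    event = {"timestamp": time.time(), **fields}
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line

# Instead of: print("user 42 checkout failed after 3 retries")
log_event(user_id=42, action="checkout", outcome="failed", retries=3)
```

The server side can then query and aggregate on any field, rather than regex-parsing strings after the fact.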
The first wave of devops was “hey, ops, learn to code”. The second wave needs to be “hey, developers, learn to ops”. In the future, we spend less time writing code, and more time understanding it.
The Road to Chaos. Nora Jones, Jet.com
Chaos engineering is building confidence in a system’s ability to withstand turbulence. Like getting a flu shot, Nora said. She was hired to bring chaos engineering to Jet.com.
Their system included over 900 microservices (!), following the single-responsibility principle. Most of these were only about 300 lines of code. This meant scalable and independent releases, but it made troubleshooting difficult.
Enter the armies of chaos. It wasn’t immediately popular, heh. She introduced chaos in stages.
Level 1: Graceful restarts and socialisation. Tell people what’s coming. Although the word “chaos” sounds cool to SREs (it really does), it might be scary to everyone else. She let teams send pull requests to opt their service out, but opt-outs only lasted one sprint. They wanted everyone to get used to and resilient to chaos.
Chaos ran during working hours in the QA environment. Cute superhero-themed Slack bots announced when each type of chaos was starting.
Level 2: Can we cause a cascading failure? Yes, but not the one they were aiming for. Whoops! They brought down their QA environment for a week. Nora walked through the code for us (in F#, which was new and fascinating to me).
Level 3: Targeted chaos. A lot of their outages were related to Kafka and some other infrastructure. They gathered ideas for infrastructure-specific chaos to inject: deleting topics, partially deleting topics, dropping packets etc.
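A toy sketch of what targeted chaos injection might look like (the action names are my guesses based on the talk, not Jet.com’s actual F# tooling):

```python
import random

# Hypothetical infrastructure-specific chaos actions; names are illustrative.
CHAOS_ACTIONS = {
    "delete_topic": lambda target: f"deleted Kafka topic on {target}",
    "drop_packets": lambda target: f"dropping 10% of packets to {target}",
    "restart_node": lambda target: f"restarted node {target}",
}

def inject_chaos(target, action=None, rng=random):
    """Run one chaos action against a target, chosen at random by default."""
    name = action or rng.choice(sorted(CHAOS_ACTIONS))
    return name, CHAOS_ACTIONS[name](target)
```

Announcing which action fired (as their Slack bots did) is what makes the resulting breakage debuggable rather than mysterious.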
Level 4: Gamification. Not for all cultures, Nora emphasised. Decide whether your teams enjoy competition. Hers did, so they introduced games where teams would get points for surviving chaos, points for adding new types of chaos, etc. Dashboards around the open-plan office showed the leaderboard. People got really into it.
Level 5: Other uses for chaos tools, like automatically shutting down machines over the weekend if they weren’t in use, and scaling down services.
The mark of success for a tool is whether it affects the development lifecycle. This did. Developers started running chaos testing locally before their code went to QA. They had better alerting and monitoring and were more prepared for outages. A new team was formed to log telemetry events and latency.
One theme of this talk was how the culture had to move from “Your chaos broke our system” to “The brokenness was already there, and we uncovered it at a convenient time.” It sounded like that took a while but was ultimately successful.
Postmortem Action Items: Plan the Work and Work the Plan. Sue Lueder and Betsy Beyer, Google
Sue ran a survey at Google and found that the #1 problem by far that people had with postmortems was action-item follow-up. She and Betsy laid out some antipatterns:
Unbalanced Action Items. Not striking a healthy balance between incremental “bandaid” fixes and complete rearchitectures.
Fixing symptoms instead of root causes. Instead of “why didn’t this system fail gracefully” ask “why isn’t failing gracefully a part of our design process?”
Humans as the root cause. Humans are bad at repetitive tasks, so we need to remove their ability to introduce errors. Don’t expect them to have read their mail, for example: find another way of letting them know when they’re using a system that’s in planned maintenance.
Not thinking beyond prevention. Consider the whole timeline of the incident: how long it took to detect, the scale of the rollout to fix it, etc. Improve detection, diagnosis and triage, not just root causes.
For successful postmortems…
Prioritise and assign the work. Understand who is doing the action item and when.
Have senior leaders pay attention to the open bugs. That makes them more likely to get closed. Every postmortem should have a P0 or P1 bug. As Ben Treynor, founder of Google’s SRE team, says, a postmortem without action items is indistinguishable from no postmortem.
Review postmortems like we review code or designs. Are the action items realistic? Will they prevent the same incident from happening again? Are they added to the project plan? What can be automated?
Report on open action item counts by priority. If you see a deviation growing between the number opened and the number closed, you’re building technical debt.
I particularly liked that last one. I hadn’t mentally framed open bugs as “technical debt” before, but of course that’s exactly what they are: they’re problems we know about that haven’t been fixed.
Sue, Betsy and John Lunney have an article on this topic in the Spring 2017 issue of ;login: magazine.
Building Real Time Infrastructure at Facebook. Jeff Barber and Shie Erlich, Facebook
When you ‘like’ a post, other people immediately see that you did. When someone’s replying to you, you see the “someone is typing” message in real time. Facebook does that with an ephemeral pubsub store. Each device – every phone, for example – subscribes to pubsub topics. When an event happens, the notifications are immediately sent.
The team had a goal to understand and improve the reliability of their service at the same time as scaling it to “a billion or bust”. Although their service ostensibly had four nines, it didn’t “feel” reliable. They needed better measurement.
So what is reliability? It’s a function:
(What we accomplished) / (What we should have accomplished)
In their case, this was how many topic updates were delivered out of the number that were sent. Looking at it this way uncovered some flaws in their architecture: the metrics for the different parts of that equation all came from different components, and it was too easy for a thundering herd to take out components. They rearchitected, adding state to make it easier to route specific topics. The new architecture is easier to reason about, and now they can comfortably handle a billion deliveries. Previously a new hot video would cause enough load that someone’s pager would go off. Now they don’t notice.
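That reliability function is trivial to write down (the delivery numbers below are illustrative, not Facebook’s):

```python
def reliability(delivered: int, sent: int) -> float:
    """What we accomplished divided by what we should have accomplished."""
    if sent == 0:
        return 1.0  # nothing owed, nothing missed
    return delivered / sent

# e.g. 9,993 topic updates delivered out of 10,000 sent
assert reliability(9_993, 10_000) == 0.9993
```

The hard part, as the talk made clear, isn’t the division; it’s getting both numerator and denominator measured consistently by the same component.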
The lesson is to take a step back and evaluate the whole system, rather than optimising individual parts.
Killing Our Darlings: How to Deprecate Systems. Daniele Sluijters, Spotify
This was one of my favourite talks. Daniele advised us to deprecate systems we’d like to get rid of – create a kill list – but do it with respect for the people who are still holding on to the service by their fingernails.
Write a schedule: people need to know when the service will stop working, and when you’ll start doing weird things to push them out.
Talk to people a lot. Use personal networks, talk at all-hands meetings. Assume that nobody will read emails. No matter how much communication you do, people will complain that you should have communicated more, but try anyway.
Make migration as easy as possible. Provide alternatives. This often means making code changes for your users. You want this deprecation; they don’t. The long tail is where the heavy lifting happens. You can’t just tell people the deadline and then pull the rug out from under them; you’ll have to work to move them.
There will be unknown complications, unexpected use cases, and transitive dependencies you haven’t considered. You can use planned outages to shake out the unknown unknowns, but be nice and turn the service back on if you’re breaking someone.
Soft-deprecate first: keep the service around for a while but lock off access, e.g., with a firewall. Watch access logs and alert on unexpected attempts to use the system.
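A minimal sketch of that access-log watching, assuming a hypothetical "client_id path status" log format:

```python
def unexpected_clients(access_log_lines, allowed=frozenset({"healthcheck"})):
    """Return the client ids that are still trying to use the locked-off service."""
    seen = set()
    for line in access_log_lines:
        client = line.split()[0]  # assumes "client_id path status" format
        if client not in allowed:
            seen.add(client)
    return seen

logs = ["healthcheck /ping 200", "billing-svc /v1/report 403"]
assert unexpected_clients(logs) == {"billing-svc"}
```

Each name the scan turns up is a conversation to have before the final shutdown, not a surprise afterwards.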
Finally, really deprecate. Back up the data and confirm you can restore it. Deprecate. Celebrate. Do a retrospective, but also do cake and champagne.
When building a service that will some day be deprecated (which I guess is all of them), keep track of your customers and use cases. Make sure you can limit their access, e.g., by requiring an API token. This means new customers have to have some exchange with you to start using the service, and you’ll know who they are. Log everything.
And don’t be too reliable. Don’t exceed your SLA. Make sure your users can handle your errors by making sure they occasionally get errors.
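A sketch of that deliberate unreliability (the error rate and names here are hypothetical):

```python
import random

def handle_request(request, error_rate=0.001, rng=random):
    """Serve a request, but fail a small fraction of the time on purpose,
    so callers keep exercising their error-handling paths."""
    if rng.random() < error_rate:
        raise RuntimeError("injected failure: callers must tolerate errors")
    return {"status": "ok", "echo": request}
```

The idea is to stay comfortably within your SLA while still denying clients the illusion that errors never happen.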
Engineering Reliable Mobile Applications. Kristine Chen, Google
Mobile is important. A lot of the world primarily accesses the web on their phones, and mobile search traffic has surpassed desktop traffic.
Most SRE work is on the server side: we have years of experience in how to monitor, canary, deploy and roll back on servers. But if you support mobile apps, you’re primarily working on clients. This raises some challenges:
- you don’t control your users’ devices. Mobile users can reject the update, or not have enough data or storage to update.
- users will have a variety of OSes and hardware. Some of it is going to be very old.
- client monitoring is far from real time. If the app is talking to your server, request and response logs can give you some indications, but if your update crashes the app, you’ll never know. You can’t burn your users’ battery or data. And privacy is very important: the user can opt out of sending monitoring data back, and you have to follow privacy policies.
- when a bad release goes to a server, you can roll back. With a client, you push a new version and hope people update.
How do they deal with these challenges? One way is to move as much of the app as possible to the server side, and control new features with flags.
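A minimal sketch of server-controlled feature flags (the flag names and values are hypothetical):

```python
# Flag values come from the server, so a risky feature can be switched off
# without waiting for users to install a new app version.
FLAGS = {"new_checkout_flow": False, "dark_mode": True}  # fetched from backend

def feature_enabled(name: str) -> bool:
    return FLAGS.get(name, False)  # unknown flags default to off

def render_checkout() -> str:
    if feature_enabled("new_checkout_flow"):
        return "new checkout"
    return "old checkout"
```

Defaulting unknown flags to off matters on mobile: an old client that has never heard of a flag should behave as if the feature doesn’t exist.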
Client-side monitoring can be triggered by events, but logs are usually batched to save resources. And again, you are constrained by privacy. It may be minutes before you get monitoring information. Emulators help, but you can’t get all possible devices.
Releases have other difficulties. Latency might look really good at the start of a rollout, because early adopters often have better devices and networks. Staged rollouts are affected by selection bias: the people who opt-in are not representative users. After beta-testing, you can canary to some small percentage of users, before pushing to all users, but you still only get whoever is willing to upgrade. Currently Kristine’s team is pushing new releases every 3-8 weeks, but they want to make it faster.
This was one of the most fascinating talks for me. I hadn’t realised how different it would be to be an SRE for phone apps, and the talk was a whole lot of “oh wow, I guess that’s true!” moments. It was fun to hear someone talking about something so relatable – it’s still SREing – and at the same time so different to my experience.
No Haunted Graveyards. John Truscott Reese, Computer Scientist. Google.
Haunted Graveyards are parts of the system that we’re afraid to talk about. The race condition you usually win, so you’re ok with sometimes running the tool twice. Sharp edges on APIs because you don’t know who’s using them. jtr gave an example of files called ZOMG_DO_NOT_CREATE_[zonename], intended to block creation of new zones that reused the names of old zones that had been turned down “using black magic” and might come back to life.
In SRE, we avoid risk, but having parts of the system you’re afraid to touch is more of a risk. Use your error budget and fix the things. But the flip side of that is that we shouldn’t delete things just because we don’t understand them. That’s Chesterton’s Fence: don’t remove anything until you understand its purpose.
I’d heard jtr’s ‘haunted graveyards’ concept before (and have been using it in talks), but it was still interesting to hear it spelled out like this. I hadn’t heard about Chesterton’s Fence before. It reminds me of an old adage that any recommendation to ‘just’ do something demonstrates that the speaker doesn’t fully understand the problem.
Measuring Reliability through VALET Metrics. Raja Selvaraj, Home Depot.
Measure reliability to quantify the health of your service, to express quality, and to give you data to improve on. VALET is Home Depot’s framework for reliability. The five metrics they use are Volume, Availability, Latency, Errors and Ticket load.
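As a sketch, a per-service VALET snapshot might just be a record of those five numbers (the field semantics are my reading of the acronym, not Home Depot’s definitions):

```python
from dataclasses import dataclass

@dataclass
class ValetSnapshot:
    volume: int            # requests handled in the window
    availability: float    # fraction of successful requests
    latency_p99_ms: float  # 99th-percentile latency
    errors: int            # failed requests
    tickets: int           # operational ticket load

snapshot = ValetSnapshot(volume=10_000, availability=0.999,
                         latency_p99_ms=250.0, errors=10, tickets=2)
```

Tracking the same five numbers for every service gives you an apples-to-apples view of health across the fleet.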
Lessons Learned from Transforming SEs into SRE at Microsoft Azure. Cezar Alevatto Guimaraes, Microsoft
Cezar wanted to change his SE team to an SRE team: he liked the model of SREs having a mix of “software engineering” and “service engineering” skills. But you can’t just change a team name and expect people to work differently. They made three distinct roles: SRE Manager, Tech Lead and Engineer.
SRE Manager: helps the team develop skills. Pair your team up so they can learn from each other. It’s ok if they’re at different levels – we’re all at different levels on different skills. It’s also ok if they decide that SRE isn’t for them. We should celebrate when someone finds something they like doing more.
Tech lead: Also helps the team develop skills. Make space for teaching. Manage expectations. Create the culture, e.g., test driven development, continuous integration, how on call works. “Define the North Star for the team”
Engineer: Different people learn in different ways: online, books, etc. Learn however you like learning. It’s important that the team knows they need to spend effort to learn, and that they are ok with their job changing, for example, introducing pair programming.
No Engineers Necessary. Lei Lopez, Shopify.
Shopify use a deploy tool called shipit. Deploying used to have two manual steps. The developer would merge code, which would kick off build and continuous integration. Then they’d click again to deploy. This led to “deploy logjam” at popular times of day. If there was an issue, it was hard to say which commit caused it. Also, sometimes changes didn’t get shipped because the owner forgot to click the button.
So they made a robot to deploy. This has had an impact on culture. The developers are more careful to ship production-ready code: once you hit ‘merge’ you’ve effectively deployed.
They built a bot to tell people when their commit will deploy. If something goes wrong, they lock deploys. They’ve also added one-click rollbacks. But being blocked by locked deploys is frustrating, so they added a merge queue. They’re working on speed, removing flaky tests, and automated rollback.
SR(securit)E. Tom Schmidt, IBM
IBM wanted to integrate SRE concepts into security requirements. A security engineer needs security training and experience, and needs to be able to solve common problems using code. Sysadmins are less expensive, but it’s worth the extra cost to hire security people.
Security shouldn’t be a checkbox; compliance requirements are ongoing. A lot of this work can be toil. Mandate that components send relevant well-formatted logs. Aim for maintainable velocity. Use automation and common solutions as your default approach. You want consistency, efficiency and (tempered) acceleration.
How Three Changes Led to Big Increases in Oncall Health. Dale Neufeld, Shopify
Incident response and on call is a path to burnout and poor health. Shopify used to have 1000 servers, 5-10 deploys a day, 175 engineers, a single developer on call rotation, and 12 people in an ops team supporting it all. That led to heavy toil. There was no time for improvements. The breadth of knowledge required meant that the experts for any given system would be paged whether they were on call or not. New infrastructure was dumped on the ops team.
They made three changes:
- they fixed the production-to-product ratio by creating a parallel structure to the product organisation, with four production engineering teams.
- they moved the developers to service-specific oncall schedules.
- they recognised the burden of on call and mandated that an on call schedule had to have at least six people. The oncaller would always get the following Friday off. They added “empathetic bots”: if someone has been working on an incident for more than an hour, the bot will ask for someone to help them hand it off.
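The empathetic bot’s check could be as simple as this sketch (the threshold and wording are hypothetical):

```python
HANDOFF_AFTER_SECONDS = 60 * 60  # an hour on one incident is long enough

def handoff_prompt(started_at: float, now: float, responder: str):
    """Return the bot's nudge once a responder has been on an incident for
    more than an hour, else None."""
    if now - started_at > HANDOFF_AFTER_SECONDS:
        return (f"{responder} has been working this incident for over an hour; "
                "can someone offer to take it over?")
    return None
```

The point isn’t the timer; it’s that the bot asks so the tired responder doesn’t have to.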
The results were good. The team reported feeling healthier.
Heroism is a problem in our culture and people don’t speak up, so explicitly ask people if on call is working for them.
Reliability When Everything Is a Platform: Why You Need to SRE Your Customers. Dave Rensin, Google.
A platform is a system with an API; an application is a system with a UI. But as soon as you start refactoring your code into shared services, your apps are talking to other apps. They’re users, and they’re also platforms.
Every application has an API. If you think your application doesn’t, you’re mistaken. It might be an unofficial one, a bot or a scraper, but third party apps will find a way.
- The most important feature is reliability
- Users, not monitoring, determine reliability.
- Well-engineered software can run at three 9s. Well engineered ops can run at four 9s. For five 9s, you need a well-engineered business.
Dave asserted that a platform’s customers can only get four 9s by luck if they don’t have shared operations with the platform. 99.99% uptime means you can be down for 4.32 minutes per month. You can’t file and resolve a support ticket in four minutes. So, you have to SRE your customers.
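The downtime arithmetic is easy to check:

```python
def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Allowed downtime per month at a given availability target."""
    return days * 24 * 60 * (1 - availability)

# 99.99% uptime leaves about 4.32 minutes of downtime in a 30-day month
assert round(downtime_minutes_per_month(0.9999), 2) == 4.32
```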
This happens in multiple stages.
Stage 1: Do Application Readiness Reviews on the components that rely on your platform. Ask “what reliability are you getting now?” and then ask for proof. Most people are measuring the wrong things and have SLOs that aren’t tied to business objectives.
Stage 2: Build shared monitoring and alerting. It needs to be a common source of truth: you don’t get secret information that your customers don’t have. This helps eliminate blame and also gives you a fantastic black box probing network: your customers’ experience is the best monitoring you can have. Ideally the shared monitoring system will do the paging, ticketing and even rollbacks.
Stage 3: Practice operational rigor between the teams using joint postmortems. You, the platform provider, should be the incident commander. Outages should usually end up with actions for both parties: how does each side become more resilient to what the other did?
Stage 4: Joint on call. You’re not deploying your customer’s binaries, but you’re joining their war room and debugging. Run disaster tests and Wheel of Misfortune. Do joint projects, create joint open-source tooling. Train together so that you can work together in a crisis.
Google started doing ‘Customer Reliability Engineering’ last July. They select SREs “who have the genetic mutation of enjoying talking to humans”. At least 50% of the role is software development. They conduct Application Readiness Reviews, do design reviews, and build the shared monitoring.
Dave emphasised that although “Professional Services” is a good thing, this is not that. The main difference is that Customer Reliability Engineering is free. You have to not charge for any of it, so you’re able to come in with strong opinions.
This was a fun, engaging talk and a great way to end the conference.
The best thing from the hallway track was the story of one company that uses “human load balancers”. It’s literally one dude who watches the load and decides when to shunt it from one server to another. I promised not to name names.