Conference report: SRECon Americas Day 2

Day two of SRECon Americas 2018! (My notes on Day one are over here.)

Today started with opening comments from our fantastic co-chairs, Betsy Beyer and Kurt Andersen. Thanks for a great conference, Kurt and Betsy! This year SRECon has over 700 attendees from 23 countries. SRECon Asia/Australia is in Singapore in June and SRECon Europe is in Dusseldorf in August. The SRECon Europe CFP closes next Tuesday, so get on that!

On to the talks! Here's some notes I took. As always, I don't promise that I'm capturing the most important points from any of the speakers, but here's some things I found most interesting. If I misrepresented anything or made mistakes, please let me know (mail or leave a comment here) and I'll fix it.

If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There, Nicole Forsgren and Jez Humble, DevOps Research and Assessment (DORA)

Our industry wants to go fast and do things better, but we often don't have a good underlying measurement for where we're going. Maturity models aren't useful. It's like levels in World of Warcraft: for the longest time, you strive to get to the top, level 60, but then the world changes and WoW goes to 110 levels now. "Getting to level 'dope' is no longer good enough". Hah. We reach the highest level on the maturity model but then customers expect new things that we're not doing. Maturity models point us to a destination. We need a direction.

We can measure the performance of engineering teams in four ways, two about velocity, two about stability:

  • how long it takes to push changes
  • release frequency
  • how long to restore service after an outage
  • how often you need to roll back

High performing companies can release on demand, can push a change in under an hour, recover in under an hour and have a change failure rate of less than 15%. 
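(ed: here's a rough sketch of how you might compute those four numbers from a list of deploy records. The `Deploy` record and its field names are my own invention for illustration, not anything from the talk, and I'm using medians purely as an assumption about how to summarise.)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical deploy record; the field names are assumptions."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it need a rollback or fix?
    restored_at: Optional[datetime] = None   # when service recovered, if it failed


def delivery_metrics(deploys, period_days):
    """Summarise two velocity and two stability measures for a non-empty list."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deploys) / period_days,
        "median_time_to_restore": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deploys),
    }
```

The point is less the arithmetic than that all four come from data most teams already have in their deploy and incident logs.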

Firms and non-profits with high performing IT are more likely to exceed objectives and be successful. We can improve our metrics by investing in them, not just money, but time, attention, focus. We can improve our software delivery process through lean management and continuous delivery. More frequent releases and more streamlined processes increase job satisfaction.  You can change your culture by changing your practices. 

Gerald Weinberg said "Quality is value to some person". It's how well you did something the first time. Measure the proportion of time people are doing new work, vs repeated/unplanned work.

There are some common mistakes in measuring quality:

  • measuring output vs outcomes. Lines of code is a terrible metric. Some companies measure lines of code as an asset on the balance sheet so deleting code is destroying company assets. (A simultaneous *gasp* and laughter from the audience).  Look at group success, not individual success. Don't incentivise people to game the system. 
  • using velocity, e.g., story points for measuring productivity. You can't compare teams on points or again people will game the system, and you're discouraging code health and collaboration.
  • aiming for full utilisation of people's time. Queueing theory tells us that as we approach full utilisation, our lead times approach infinity. If you're committed to more than one project at the same time, you're more utilised, but less productive.
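
(ed: to put numbers on the utilisation point, here's a tiny sketch using the textbook M/M/1 queue, where mean time in the system is the service time divided by one minus utilisation. The choice of M/M/1 is my assumption; the speakers didn't name a specific model.)

```python
def lead_time_multiplier(utilisation: float) -> float:
    """Mean time in an M/M/1 queue, as a multiple of the mean service time."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return 1.0 / (1.0 - utilisation)


for rho in (0.5, 0.8, 0.9, 0.99):
    multiple = lead_time_multiplier(rho)
    print(f"{rho:.0%} utilised: lead time is about {multiple:.0f}x the work itself")
```

Going from 90% to 99% busy makes the wait roughly ten times longer, which is the "approach infinity" behaviour in miniature.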

The Westrum model for culture is predictive of outcomes in high risk areas like aviation and nuclear power. It defines pathological, bureaucratic and generative cultures. You can't just say "how is your culture?", you have to ask agree/disagree questions like "New ideas are welcomed", "Information is actively sought". In Project Aristotle, Google found that high psychological safety -- the ability to trust and be vulnerable with your team -- was the highest predictor of team performance.

In a complex, adaptive system, failure is inevitable. There's never any one root cause, always a bunch of contributing factors.  Human error is only the starting point of a blameless post-mortem. Ask "how could that person have had better information and better tools". Shoutout to Ryn Daniels' post, "On failure and resilience", about getting Etsy's "three-armed sweater award" for causing a huge outage, and to Kripa Krishnan on why Google does DiRT.

Nicole and Jez have a new book called Accelerate: the Science of DevOps.

Book: Accelerate: the Science of DevOps.
Follow: @nicolefv, @jezhumble

Security and SRE: Natural Force Multipliers, Cory Scott, LinkedIn

At LinkedIn, the head of security reports up to SRE, but SRE and Security are aligned more than organisationally. LinkedIn have a 'hierarchy of needs' for engineering, starting with the site having to be up and secure. 

Cory shared an Archilochus quote: "the fox knows many things but the hedgehog knows one big thing". The hedgehog specialises in one area of technology. That's good: we need our hedgehogs! But SREs and security folks are more like the fox, taking in information from a lot of sources, agile and able to adapt to change. (Cue a fox theme for the rest of the slides and it is *adorable*.)

SRE's processes have evolved faster than security's. Compliance initiatives and security alerts from vendor "magic boxes" are taking all of the energy out of the security team. Can we apply modern SRE principles to security?

SRE has a different hierarchy of needs than general engineering. Everything's built on monitoring and incident response. (ed: I believe this hierarchy was originally created by Mikey Dickerson.)

How can security keep up in a world where new products can be invented in the morning and deployed in the afternoon, and testing in production is becoming normal? We need to monitor telemetry coming out of applications, use error budgets, and invest in self-healing and automatic remediation, like automatically rolling back changes that introduce security problems. We need to remove human processes and inject engineering discipline and architecture reviews. We need safe and reliable test environments and good test datasets.

Start with a known good state, a good baseline. Introduce asset management and change control and constantly validate changes that have been made. Encourage a strong partnership between security and SRE. 

Lessons for security people: Your data pipeline is your security lifeblood. Have a human in the loop only as a last resort. Security solutions must be scalable and default-on.

Lessons for SRE: Remove single points of security failure, just like you do for availability. Where will things fail open? Where could a single user error expose an entire system? Assume that an attacker can be anywhere in your system or dataflow or management plane. Don't use the 'candy' security model (crunchy exterior, delicious gooey middle). Capture meaningful security telemetry.

We have an amazing force multiplier when security and SRE work together.

Follow: @cory_scott


What It Really Means to Be an Effective Engineer. Edmond Lau, Co Leadership.

Edmond left Google a few years ago, and joined a startup that was using Ruby, a language he didn't know. He (quite reasonably!) assumed he'd have time to learn, but on the first day, the CEO said "I need you to create this feature." "Can I have more time to prepare?" "No. Sink or swim." He worked 70-80 hour weeks and lived in constant fear of pushing something bad. It was stressful. But he told himself that pushing through the pain was good for him.

In 2013, the Quora job ad said "You should be ready to make this startup the primary focus of your life". Silicon Valley had a pervasive message of 100 hour weeks as the only path to success. Edmond worked so much his wife never saw him, but he thought of himself as effective. But then, on vacation, he got a text message asking to help with a system that only he knew, and he realised that no matter how hard he worked, there was no escape, even on vacation. And things he worked really hard on often didn't matter, or got cancelled, or ended up not being used.

He realised that effort doesn't equal impact and it was better to focus on the impact per hour spent.  Effective engineers focus on high-leverage activities.

He spent two years interviewing people in Silicon Valley to understand impact. Mike Krieger, who was running Instagram with a tiny number of engineers, noted that it was easier to run fewer things. "Effective engineers reduce their operational burden. Do the simple thing first."

Other observations: Effective engineers invest in iteration speed and prioritise effectively. Even thinking every day about the three things you can do each day will improve your effectiveness. Effective engineers validate their assumptions and don't waste time going down the wrong path.

He self-published "The Effective Engineer" and felt confident that he understood how to be effective. But then the head of marketing exploded at him one day: "I'm exhausted. Everything with you feels like a negotiation. We're supposed to be on the same team". Edmond replied, "I have this framework. It's called leverage. I do the simplest thing first and then I punt on things that don't seem important". The head of marketing said, "Edmond, you're really great at punting things".

He'd spent so much time focused on high-leverage activities, but he'd missed something vital: you have to be effective at working with other people. Effective engineers build good infrastructure for relationships. 

So he went on a new quest, to discover what builds good relationship infrastructure. He concluded that we're limited by the belief that trust can only be built slowly. Trust can be built in moments, and in 30 minute conversations. He does workshops where people say they build more trust in a five-minute conversation than they have in working together for a year. He uses tools to frame conversations about how people can "design an alliance" together, and be explicit about what (maybe wrong) assumptions they hold about each other. These are scary conversations but they build trust fast. He was able to use this model to repair the relationship with the head of marketing, and collaborate with him on a major project.

In conclusion, effective engineers work hard and get things done, but also focus on high-leverage activities and build infrastructure for their relationships. And they always grow and learn.

Website: http://coleadership.com/srecon
Book: The Effective Engineer
Follow: @edmondlau


SparkPost: The Day the DNS Died, Jeremy Blosser, SparkPost

SparkPost sends 30% of the world's non-spam email, 15B messages/month. This means a lot of DNS. "But DNS is easy, right?" Their DNS infrastructure went through various iterations as it hit various scaling limits, including dabbling with Amazon's own VPC resolver, running bigger instances, and a bunch of stuff I missed because it went past faster than I could take notes. AWS's automated DDoS protection caused a minor outage; Amazon support didn't believe 40 Mb/s could genuinely be DNS.

Then one day in May 2017, the oncaller got paged. Some performance tests had been running, so they thought it was a side-effect of the test, but it was real. DNS was down.

Sending mail was impacted. ("Sorry if you didn't get your bank statement that day". Haha.). Monitoring kept working, which was good, but that was the only good news. Lots of control plane services -- admin logins, VPN, LDAP and debugging tools -- stopped functioning because they had unexpected DNS dependencies. Debugging was difficult. They sent some of the traffic to Amazon VPC resolvers, which were immediately overrun. They added capacity, but the new instances immediately had the same symptoms. 

The workaround involved editing /etc/resolv.conf on every single node. That had a bunch of limitations: resolv.conf can only have 3 entries for nameservers, it's always read top to bottom and, worst of all, nginx and many other apps only read it at startup. If you need to change it, you need to restart the apps. LDAP being down meant it took hours to get onto some of the boxes to manually edit and restart. There was a seven hour outage.   

The cause turned out to be undocumented connection limits in AWS. (ed: but the cause is always the least interesting part. The circular dependencies are the real story here, imo.). They designed a new DNS with better isolation.

Lessons learned include that cloud providers may have hidden limits, especially if your use case is not typical, and that support apps should be isolated and protected from each other.

Follow: @sparkpost
No livetweeters in the room, unfortunately! (Please let me know if I missed one for this or other talks.)


Stable and Accurate Health-Checking of Horizontally-Scaled Services. Lorenzo Saino, Fastly

Fastly use load balancers in their PoPs to send requests to healthy HTTP proxy instances.  In a PoP, space and power are at a premium. They don't use hardware load balancers, and capacity within a PoP is limited: it's much easier to shed load to neighbouring PoPs than to rack extra servers.

We don't want to remove healthy services, especially when we're under high load. Most health checking looks at one host in isolation. A daemon on the host can check if the job is running with reasonable resource usage, but if all of the machines are under heavy load, we don't want to start removing machines and cause a cascading failure. Better to look across the whole cluster and only remove a host if its behaviour is very different from the other hosts in the cluster. 

They stream metrics from the instances and aggregate the signal into a logically single component which decides on the health state of each machine. They collect metrics of error rate and response time and send them through a 3-stage filter: 

1) denoising. We're interested in persistent changes, not transient spikes. One technique is to use a moving average: define a window of samples and compute the mean. Larger windows mean a more stable signal, but it's slower to react, so it's a tradeoff. 

2) anomaly detection. Compare the various instances and identify misbehaving instances. Each host is represented by two variables, error rate and response time, which we can plot on a graph. Here, simple thresholding is one of the techniques that tell us if one dot deviates from other dots. We can compute mean and standard deviation for points and choose thresholds for when it's a problem.

3) hysteresis filter. If a host keeps fluctuating between healthy and unhealthy, we don't want to take it in and out of production. Sharp hysteresis gives us thresholds for when things can move in and out, and therefore stabilises the output.
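
(ed: here's my rough sketch of what a three-stage filter like this could look like. This is not Fastly's code: the class names, the z-score comparison against the cluster and the two-threshold hysteresis are all my assumptions about one simple way to do each stage, and it only looks at a single metric.)

```python
from collections import deque
from statistics import mean, pstdev


class MovingAverage:
    """Stage 1, denoising: mean of the last `window` samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def update(self, value):
        self.samples.append(value)
        return mean(self.samples)


def anomaly_scores(smoothed):
    """Stage 2, anomaly detection: how many standard deviations each host
    sits from the cluster mean. `smoothed` maps host name to a metric."""
    values = list(smoothed.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {host: 0.0 for host in smoothed}
    return {host: (value - mu) / sigma for host, value in smoothed.items()}


class Hysteresis:
    """Stage 3: go unhealthy above `high`, recover only below `low`."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.healthy = True

    def update(self, score):
        if self.healthy and score > self.high:
            self.healthy = False
        elif not self.healthy and score < self.low:
            self.healthy = True
        return self.healthy
```

Because the score is relative to the rest of the cluster, a fleet that is uniformly slow under load produces low scores everywhere and nothing gets pulled, which is the behaviour the talk was after. The gap between `low` and `high` is what stops a borderline host from flapping.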

The filter is still a SPoF. We could replicate it and use distributed consensus, but better to create several instances of it, one per host and feed them all the same input. If the filter runs on the hosts being checked, each filter can make the decision for its own host.

Nice slides and animations for this talk and I liked it a lot.

Paper: "Balancing on the edge: transport affinity without network state", J. Taveira Araújo, L. Saino, L. Buytenhek, R. Landa, to appear in USENIX NSDI 2018.
Follow: @lorenzosaino

Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer? Rob Hirschfeld, RackN

Configuration is fragile because we're talking about mutating a system. Infrastructure as code means building everything in place. Every one of our systems has to be configured and managed and that creates a dependency graph. We can lock things down, but we inevitably have to patch our systems.

Immutable infrastructure is another way of saying "pre-configured systems". Traditional deployment models do configuration after deployment, but it's better if we can do it beforehand. Immutability is a DevOps pattern. Shift configuration to the left of our pipeline; move it from the production to build stage.

Patching means we have to maintain root access, assume system state, manage dependency graphs, etc. We want to do a deployment with no hidden operations. Where possible, use the "create/delete" pattern: destroy the deployment or instance and build a new one, rather than patching something in place.

There are a bunch of ways to do immutable deployments, e.g., baseline with added configuration, or live boot with configuration. Rob prefers to deploy an image, with config baked in. Every machine needs its own identity, so you can't get away with *no* machine-specific config, but you can greatly decrease it. And you still have to do the configuration on a live system, but now you just need to do it once to create the image, not repeat the process for every node in production.

Image-based deployment is faster, safer and more scalable, and rolling back using images means you can be very sure about the state. 

Live demo time! We saw a complete system reboot, net boot, installation in three minutes. Cool.

Questions for the room: How many people can do no-touch installs like this? Almost nobody. How many people have a reliable repeatable way to install a machine? Less than half the room. Yikes.

Follow: @zehicle

Working with Third Parties Shouldn't Suck. Jonathan Mercereau, traffiq corp.

Can you be an SRE and rely on third party apps like CDNs, telemetry or messaging? Yes. But you should monitor your third party apps. During a recent major DNS outage, 89% of companies had a service interruption. We depend on third party companies. We need to treat them as an extension of our stack, not as some ancillary tool.

Build vs Buy vs Adopt (Open Source) is a common conversation right now. We have to ask ourselves what problem we're trying to solve, whether it's a core competency, and what costs are involved. 


Weigh the risks and benefits. Try out the vendor. Ask them for information and look at how long they take to respond and how thorough and direct they are. Do a trial of only the features you'll need and see how much support you get and how much hand-holding you need. 

If a vendor is in your critical path and affects your users' experience (e.g., DNS or a CDN), treat it like your own service in your own stack. Measure their performance. Use both Synthetic Monitoring and RUM (ed: synthetic is representative test traffic, RUM is looking at what experience your real users had.) If you're logging, prune the data and think about compliance issues. Plan for failure. There will be another DNS outage or a CDN outage, so have a disaster plan. Hold your vendors accountable for their SLAs. Ask them for post-mortems. Insist on good communication.



Leveraging Multiple Regions to Improve Site Reliability: Lessons Learned from Jet.com. Andrew Duch, Jet.com/Walmart Labs

Jet had a major and embarrassing outage and decided to invest in a multi-region cloud setup. They wanted the ability to survive the loss of a region, faster MTTR and the ability to do planned maintenance without downtime. 

They asked every team to do a multi-datacenter plan, but everyone wanted to replicate *everything*: caches, datastore, queues. First lesson: don't do that. Only replicate the sources of truth and have all other datastores (e.g., caches) be projections of that data.

Not every system needs to be active/active. It takes time to implement, adds complexity and costs more. Active/passive is easier, but has much more failover complexity. We can do a failover exercise and be surprised to find that some service is not there. Do a cost/benefit analysis for each service and let the SLO drive the decisions.

If you're active/active, a third datacenter reduces your costs. With two, N+1 means you need one extra datacenter: 100% extra capacity. With three, that one extra datacenter is only 50% extra capacity.
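
(ed: the arithmetic generalises. A quick sketch, assuming equally sized datacenters and that you only need to survive the loss of one of them; the function name is mine.)

```python
def n_plus_one_overhead(total_datacenters: int) -> float:
    """Extra capacity, as a fraction of what's needed to serve, when one of
    `total_datacenters` equally sized sites is the N+1 spare."""
    if total_datacenters < 2:
        raise ValueError("N+1 needs at least two datacenters")
    serving = total_datacenters - 1
    return 1 / serving


for total in (2, 3, 4):
    print(f"{total} datacenters: {n_plus_one_overhead(total):.0%} extra capacity")
```

The overhead keeps shrinking as you add sites, though each extra site brings its own operational cost, so this is only one input to the decision.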

Practice failovers. Jet do them every week and also have storage layer disaster exercises every month. It's not enough to run a failover exercise; also break the network links so you expose hidden or forgotten dependencies. (ed: Amen!)

And failover automation needs to scale. They made a custom tool that uses state machines to fully automate failover.


Lessons Learned from Five Years of Multi-Cloud at PagerDuty. Arup Chakrabarti, PagerDuty

Multi-cloud means having active or passive infrastructure in multiple cloud providers at the same time. It means running the *same* services in both clouds, not some in each. Procurement managers think it's going to be great because in theory you can pit the cloud providers against each other for pricing, but in practice, most places are too small for the providers to care.

In 2012, there was a general perception that the cloud was unreliable. PagerDuty was using AWS, with failover between regions for high availability. When AWS had a major outage in July 2012, they (and lots of people) were affected.

The state of the art for reliability was having distinct regions and running active/active. They started breaking the major product into separate services, using quorum based services like Zookeeper and favouring durability over performance: they were willing to write slower, but not to drop anything. They found that the code was too complicated to write if the datacenters were more than 50ms from each other. That meant that widely separated AWS regions wouldn't work for them: us-west-1 and us-east-1 were 75ms apart. So instead they used AWS us-west-1 and us-west-2, and Azure Fresno, all within 20ms of each other.

This setup had some good and some bad points. As hoped, it protected them from problems at either vendor: they didn't notice a bunch of Cloud outages. Teams quickly began building for reliability, and chaos engineering was easy to add. But, with different pricing and products on each cloud, sizing instances was hard.

Since each cloud had different features, they had to stop using their cloud hosted services and run everything themselves on Ubuntu images. The portability was good, but they run a DevOps shop where each team owns their own stack, and this made for a big stack. They have 50 services over ten teams and each team had to know how to run load balancers, DBs, apps, everything. They needed a lot of internal training to build deep technical knowledge. 

Avoiding the cloud services didn't make multi-cloud as easy as they hoped. They still ended up with over a hundred Chef code paths that specifically checked whether they were on an AWS or Azure VM. This was exacerbated by the lack of control over the network. One provider had a two minute NAT timeout and the other didn't. AWS had a leaky BGP route and routed everything via Japan for a while. The public internet is prone to outages, e.g., someone in California was cutting fiberoptic cables in 2015. The more distributed your infrastructure, the more you're affected by the network.

So, should you build multi-cloud? It depends. Don't do it without executive support and engineering staff who can go deep in their tech stack and make it work. It's not going to be easy. And don't do it unless you need it. Your customers won't care. Make sure you have some good reason to do it.

This was fascinating and one of my favourite talks.



Help Protect Your Data Centers with Safety Constraints. Christina Schulman and Etienne Perot, Google

Christina and Etienne have seen datacenter automation systems go spectacularly wrong. Their lessons about adding safety constraints to make those systems better can apply to automation at any scale.

Google uses automation to manage machine operations like repairs, installs and decommissions. If you ask it to do something stupid, it will do so very efficiently. This has caused educational experiences, like when the CDN (which serves non-video static content from the edge of the network) accidentally had all of its machines sent to be disk-erased at once. It was supposed to be just one rack, but an engineer's query accidentally matched all of the machines. This caused slow user requests, internal network congestion and two days of manual cleanup.

Everyone learned. "It will never happen again". Until a few months later when something similar happened and all of the load balancers in a datacenter were sent to be decommissioned at once. Luckily, the LBs weren't smart enough to realise they were decommissioned and just kept on serving. (Whew).

Every team has different definitions of safe, but we all agree that production should keep running. These outages had different causes, but some common patterns: inadequate limiting, code rot and changing data, complex interdependent systems, and unsafe rollouts. 

Enter SRSly. This was a mechanism to mitigate risk and bake it into automation. It added complexity -- it's an extra node in the graph! -- but it's a node that will prevent outages.

They needed to enumerate all of the production workflows, e.g., machine upgrades, storage drains, migrating VMs, pushing datacenter-wide configs, shutting down racks. SLOs provided inspiration for how to constrain the workflows. For example, with an SLO that says 99% of machines must be available, you might stop planned maintenance when 0.8% are unavailable so you'll prevent going over 1%.

They added an API that returned whether the rollout was safe or not and why. It allowed five types of constraints:

  • Rate limits: allow N things per period per bucket, e.g., only 1% of load balancers may be decommissioned per hour.
  • Concurrency limits: allow at most N concurrent things, like no more than 5% of servers may reboot at once.
  • Sanity/policy checks: only allow X if Y is true, e.g., don't reboot the machine if the service is still running.
  • Service-specific health checks: e.g., don't touch Search instances if the Search oncaller got paged recently.
  • Automatic braking: notice if an upgrade is breaking things and stop it.
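
(ed: to make the first two constraint types concrete, here's a toy sketch of a "safe or not, and why" check. It is emphatically not SRSly's API: every class and function name here is hypothetical, and a real system would need persistence, sharding and the other three constraint types.)

```python
import time
from collections import deque


class RateLimit:
    """Allow at most `limit` changes to start per `period_s` seconds."""

    def __init__(self, limit, period_s, clock=time.monotonic):
        self.limit, self.period_s, self.clock = limit, period_s, clock
        self.started = deque()  # start times of recent changes

    def check(self, machine):
        now = self.clock()
        while self.started and now - self.started[0] >= self.period_s:
            self.started.popleft()  # forget changes outside the window
        if len(self.started) >= self.limit:
            return False, f"rate limit of {self.limit} per {self.period_s}s reached"
        return True, "ok"

    def record(self, machine):
        self.started.append(self.clock())


class ConcurrencyLimit:
    """Allow at most `max_unavailable` machines to be out at once."""

    def __init__(self, max_unavailable):
        self.max_unavailable = max_unavailable
        self.unavailable = set()

    def check(self, machine):
        if len(self.unavailable) >= self.max_unavailable:
            return False, f"{len(self.unavailable)} machines already unavailable"
        return True, "ok"

    def record(self, machine):
        self.unavailable.add(machine)

    def release(self, machine):
        self.unavailable.discard(machine)  # call when the machine is back


def request_change(machine, constraints):
    """Return (safe, reason); only record the change if every check passes."""
    for constraint in constraints:
        safe, reason = constraint.check(machine)
        if not safe:
            return False, reason
    for constraint in constraints:
        constraint.record(machine)
    return True, "ok"
```

With a fleet of 1000 machines, the "stop at 0.8%" example from the SLO discussion would be `ConcurrencyLimit(max_unavailable=8)`. Returning the reason alongside the verdict matters: an automation system that just says "no" is much harder to debug.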

But now SRSly is critical! To make sure it's safe, they use regression tests, internal sanity checks and big red buttons everywhere. Best of all, they sharded SRSly between entities to reduce the blast radius.

To make sure clients will run the checks and respect the constraints, SRSly gives out short term certificates that a client needs to present to make the change. 

It's possible to override the constraints for special occasions like faster kernel upgrades to fix a vulnerability, or stopping all changes during public demos. 

An audience question gave us the best line of the day. Q: "How do you convince stakeholders that this is important?" Christina: "Having a massive user-visible outage is very inspiring". Hahahaha.

Follow: @schulman

How Not to Go Boom: Lessons for SREs from Oil Refineries. Emil Stolarsky, Shopify

Waffle House has developed practices to ride out storms. It's good enough at staying open during them that FEMA uses a metric called the Waffle House index. The tech industry does not have a monopoly on reliability.

Oil refineries and chemical plants are complex systems with many components. If one component explodes, we don't want it to propagate to other components and have them also explode. Refineries have isolation and load shedding systems, just like we do.

Trevor Kletz, a pioneering chemical safety engineer, gave us a good way of thinking about risk: "If you think safety is expensive, try having an accident". Chemical engineers need to do a lot of quantitative risk assessment to understand the cost of an explosion and the risk to human life.

They use a model called Fault Tree Analysis. A graph shows how components are connected, with each node connected to others using boolean operators. AND operator: if all the components leading into this node fail, this one will too. OR operator: any one of them failing can cause a failure. We can add the percentage likelihood of failure to each node, then we have real numbers for the chance of failure, and we can understand the impact of changes.
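
(ed: a minimal sketch of how the numbers combine, assuming the input failures are independent of each other. That independence assumption is mine and is a big one; correlated failures are exactly what bites in practice. The component names and probabilities are made up.)

```python
from math import prod


def and_gate(*probabilities):
    """The node fails only if every input fails (independent inputs)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """The node fails if any input fails (independent inputs)."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical tree: cooling is lost if both pumps fail OR the power fails.
both_pumps_fail = and_gate(0.01, 0.01)
cooling_lost = or_gate(both_pumps_fail, 0.001)
print(f"P(cooling lost) = {cooling_lost:.6f}")
```

In this made-up tree the power supply dominates the result, so adding a third pump would barely move the top-level number. That's the kind of "impact of changes" question the model lets you answer.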

In tech, we learn from failure, but we don't effectively share what we learned. The Bhopal disaster (ed: the world's worst industrial disaster; the numbers are staggering) led to the creation of the Center for Chemical Process Safety and the Chemical Safety and Hazard Investigation Board. Now every incident must be investigated and learned from. They don't just circulate dry texts, they create videos to make it as easy as possible to understand and learn.

Another example: steam boilers used to explode a lot, as frequently as one every 8 hours. Now they don't, because we have the 1915 ASME Boiler Code. What can we do for software that has the same impact?

I'm a big fan of having our immature industry start operating like real engineers (shameless self promotion: I'm talking about this tomorrow), and I liked this talk a lot. 

Follow: @EmilStolarsky


Aaand that was the day! I was wiped by 6pm and skipped the lightning talks, but judging by the twitter and slack conversations, it sounds like they were very fun.

A few other stray observations from the day:

  • I didn't make it to James Meickle's talk, "Beyond Burnout: Mental Health and Neurodiversity in Engineering", but people were talking about it a ton afterwards and it sounds like it was gold. I'll make time to watch the video.
  • People are talking about dependencies! I spent a lot of the last few years working on a project to prevent unexpected dependencies (and particularly dependency cycles in control planes) and it's very exciting for me to see that this is suddenly a thing lots of people care about.
  • Nobody believes in rollbacks. You can't get that old state and environment back: something is different. Very hermetic configurations are possible, but you're still a different person in a different time staring at a different deployment. Make peace with it.
  • Cloud services have lots of configured limits and you might not know you've hit them, or that they even exist, until you're trying to debug something that's acting very strangely.

And I managed to get in another walk outside in the sun today (10,538 steps for the day says my fitbit) so Project: Do Not Become A Conference-Zombie is going well!

One day left. See you at 8:30am :-)

(Day 1 notes, day 3 notes)