Conference Report: DevOps Days New York 2018
My first DevOps Days! Unfortunately, I could only make it to the morning sessions on both days, but I still saw some great talks. Here are some notes I took.
Devops - Almost 10 years - What A Strange Long Trip It's Been (Keynote), John Willis, SJ Technologies
This was a rapid journey through the history of DevOps, most of which I didn't know. John used a beautiful timeline graphic, created by @MindsEyeCCF at a previous conference. (She made drawings for every talk at DevOpsDays New York and they are *spectacular*). John shared a series of names, quotes, moments, ideas, showing how DevOps has evolved and what a huge number of people were involved along the way. There's no way I caught even half of the names, but here's some stuff that jumped out:
- Going to OSCon for the first time, seeing a presentation about Puppet that changed everything. "Ten minutes later my life is changed. This was the right way to do this."
- John Allspaw's Velocity 2009 presentation about DevOps. "We do ten deploys a day at Flickr".
- The first DevOpsDays, in Ghent. 40 energetic people, excited about a new way to do things. He did a DevOps Cafe podcast about it and talked about what amazed him: people were sharing information. Sharing used to be less common. Banks wouldn't tell you how they did things; Google certainly didn't. But DevOps meant sharing.
- Andrew Shafer's metaphor, the Wall of Confusion, to help explain the gap between developers and operations.
- John Allspaw again, reading How Complex Systems Fail and applying it to what we do.
- 2013: Docker appears. Containers will change everything!
- Dave Zwieback: Devops is really Inclusivity, Complexity [management], Empathy.
- Jennifer Davis and Bridget Kromhout becoming organisers of DevOpsDays.
- First codes of conduct. Now all DevOps Days conferences have this.
- "Empathy is the essence of devops", Jeff Susna.
- Effective Devops by Jennifer Davis and Ryn Daniels. He thought that it would compete with his own book, the DevOps Handbook, but it had a different perspective and is a great book.
- He wrote a course, Introduction to DevOps: Transforming and Improving Operations
- The Phoenix Project put a stake in the ground.
- Recognising burnout.
- The Westrum model to measure culture, moving to a generative culture (blameless and high-performing).
- Ben Rockwood, "The Devops Transformation" at LISA 2011.
- Deming and Lean.
- Diane Vaughan: the Normalisation of Deviance. The rules can say don't allow tailgaters, but it won't work if people think it's rude not to hold the door. You need a culture where people say thank you for checking their badge.
This was fun, and gave me some starting points for further reading.
Livetweets: https://twitter.com/lizthegrey/status/953998958740819968, https://twitter.com/crayzeigh/status/953995800220233735
DevOps Is More About Customer Feedback And Quick Learning Than Culture, Process And Tools, Kishore Jalleda, Yahoo
Devops is effective because it helps develop empathy and mindfulness and makes you a well-rounded engineer. Mindfulness is being present, asking why something matters and how it fits into the big picture.
Many descriptions of DevOps fail to draw a complete picture. The real question is how it's benefiting your customers.
1) Customers don't care about 5 9s reliability, they care about 5 9s customer service. The device they're using to access your service doesn't offer five nines; they don't expect perfection. Gather data about how much uptime you actually need. Use error budgets. Nines get expensive fast. Don't over-engineer your app for reliability.
Uptime can be critical, e.g., live sports events and TurboTax in April. There are no second chances with big events and massive launches. Understand what you need.
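As a back-of-the-envelope illustration of why "nines get expensive fast", the downtime budget for a given availability target is simple arithmetic (a minimal sketch of my own, not from the talk):

```python
# Annual downtime budget for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    """e.g. nines=3 means a 99.9% uptime target."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Each extra nine shrinks the budget tenfold: three nines allows about 525 minutes of downtime a year, five nines barely 5. That's the intuition behind error budgets: pick the target the data says you need, then spend the remaining budget rather than over-engineering.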
2) Customer feedback is more important than velocity. Do this exercise every time you ship: why am I doing this? How does this fit into the big picture? What does success look like? Afterwards: were my success criteria met? What can I do to make this better? Customers don't care how often you ship.
Engineers need to be invested in the code and the launch. Live like a customer to understand your customer.
3) It's easy to make a complex system but hard to run it. Scaling is no longer hard. The cloud makes it easy to scale to hundreds of millions of users with only a handful of engineers.
Business logic is no longer a strategic differentiator; infrastructure and operations engineering add the value. Dev and Ops interdependence builds a stronger sense of team and makes people willing to be held accountable.
4) Responding to customers fast gives you an advantage. Read "how customer service can turn angry customers into loyal ones"
Distancing software engineers from customers is an anti-pattern. If you're saying "I want my SWEs on features, not alerts, incidents, tests, user feedback", you're doing it wrong. Remove unnecessary layers between the alerts and the developers writing the code.
Kishore has an article on the same topic.
Livetweets: https://twitter.com/lizthegrey/status/954005997038776320, https://twitter.com/crayzeigh/status/954008192534700034
Cloud, Containers, Kubernetes, Bridget Kromhout, Microsoft
Bridget joined Microsoft to work on Linux and they gave her a Mac. Hah. Times have changed. In the 90s, open source vs closed source was a battle line. Open source won and now Microsoft is one of the biggest contributors to open source.
Conway's Law says that we design systems that mirror our organisation structure. But our organisations are different now. You don't usually send information all the way up the org chart and let it come back down; you ask the person you know in the other team. Our complex systems are built out of humans.
When building systems, choose the tech that's right for you; don't just use it because it's shiny. Yes, you can talk your org into letting you do a thing that will look good on your resume and let you get a new job, but when you leave, it'll be hard to run.
Containers are super exciting, but not actually new. There's lots of prior art: chroot, freebsd jails, solaris zones. But Docker made them accessible and user-friendly.
And now we have Kubernetes. Who's played with Kubernetes? A third of the room (of an estimated 150-200 people?). Who's using it in production? 7 people. Orchestrating containers has typically involved a lot of janky bash. Kubernetes lets you avoid "super bespoke artisanal hand-whittled orchestration". (I LOLed).
Adding Kubernetes doesn't make services easy to use. It makes them more scalable and more robust, but there are tradeoffs. There are lots of necessary pieces. You end up with a very complex ecosystem.
We sometimes say "computers are easy, people are hard". It's catchy but not quite true. Massive distributed systems are very difficult, it's just that people are even more difficult. You'll need to invest in retraining and there will be resistance and discomfort.
And microservices are harder to debug.
Day 1 of using a new thing may be amazing, but at some point you will have to redeploy, scale out, patch, and handle your busiest time of year: it needs to stay usable through all of that. Think about Day 2 operations. Getting the thing running is just the start.
Some Kubernetes resources: The New Stack article, Kelsey Hightower's "kubernetes the hard way" git repo, ivanfioravanti's fork of that to "Kubernetes The Hard Way on Azure".
All three major public cloud providers operate kubernetes as a service. Try it out, play with it, explore, experiment.
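If you do try it out, the first thing you'll typically write is a manifest like this minimal Deployment (names and replica count are illustrative, not from the talk), which you'd apply with `kubectl apply -f` against any of those hosted clusters:

```yaml
# A minimal Deployment: three replicas of a stock nginx image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web          # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: nginx:1.25
        ports:
        - containerPort: 80
```

Day 1 really is this easy; the Day 2 questions (patching that image, scaling past three replicas, surviving your busiest week) are where the real work lives.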
Livetweets: https://twitter.com/lizthegrey/status/954012689482899456, https://twitter.com/wiredferret/status/954012858064490497
Resolving Outages Faster With Better Debugging Strategies, Liz Fong-Jones and Adam Mckaig
Liz has seen a lot of monitoring in her ten years at Google. Adam has been at Google for 18 months and has found techniques for debugging that he wishes he'd had in previous jobs. They are here to share!
Something breaks. We get paged; now what? Prioritise mitigating and root causing as fast as possible, in that order. It's like the OODA loop: observe, orient, decide and act. Here's the debugging feedback loop:
- go out of SLO
- alerts fire
- reduce the blast radius and do quick mitigation
- formulate a hypothesis
- test it
- develop a solution
- test the solution
- meet SLO again
But sometimes your hypothesis was wrong! Here are three techniques to reduce the time taken for hypothesis formulation and testing. They'll demo in Panopticon, an internal Google tool, but it should be possible to add these techniques to any tooling.
Work through an example: Adam gets paged for too many slow queries. What can we discover about these queries? Are they all on the same subset of replicas, or distributed evenly? Is the problem a particular config or a particular user?
Technique 1) Layer peeling (layers like onions!). Filter out as much data as possible.
First, drill down to the zone. Establish the blast radius. We got paged about us-west-1, but how are the other regions? They seem ok.
Some data will be precomputed. A metric doesn't exist in isolation; it's evaluated behind the scenes. Precomputations are good because (1) you get the data faster and (2) they're an abstraction: you can stop thinking at this boundary if you want to.
Drill down and see that just one zone is the problem, us-west-1b. Consider draining the zone. What else can we find out about us-west-1b? Look at the percentiles by heatmap. Now we see that only the higher percentiles are affected. Zoom in again! Now that the cardinality is low, we can plot all of the remaining tasks. We can see that it's one task. Kill that replica!
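The layer-peeling drill-down above can be sketched in a few lines. This is my own toy illustration (the records and field names are invented; the real tool is Panopticon), but the shape is the same: filter to the slow queries, then group by successively finer dimensions until the cardinality is small enough to see the culprit:

```python
# Layer peeling: repeatedly filter and group the slow-query data to
# shrink the search space. All records and field names are invented.
from collections import Counter

queries = [
    {"zone": "us-west-1a", "task": 3, "latency_ms": 40},
    {"zone": "us-west-1b", "task": 7, "latency_ms": 900},
    {"zone": "us-west-1b", "task": 7, "latency_ms": 850},
    {"zone": "us-west-1b", "task": 2, "latency_ms": 45},
]

SLOW_MS = 500

# Layer 1: which zones contain slow queries? (establish the blast radius)
slow = [q for q in queries if q["latency_ms"] > SLOW_MS]
by_zone = Counter(q["zone"] for q in slow)

# Layer 2: cardinality is low now, so count every remaining task.
by_task = Counter(q["task"] for q in slow)

print(by_zone)   # all the slow queries are in one zone...
print(by_task)   # ...and all on one task: kill that replica
```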
But unfortunately not all problems can be solved that way, so...
Technique 2) Dynamic Data Joins. Match up one metric with another.
Say we suspect a new kernel version is responsible for latency problems. We collect metrics about latency and metrics about the kernel version. We group those to count by version. We can augment the latency metric with the version data, group it, and see that one kernel is causing more latency.
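The kernel example amounts to joining two metric streams on a shared key and regrouping. A minimal sketch with invented data (host names, versions, and latencies are all made up for illustration):

```python
# Dynamic data join: augment a latency metric with a kernel-version
# metric keyed on the same host, then group by version. Data invented.
from collections import defaultdict
from statistics import mean

latency_by_host = {"h1": 40, "h2": 45, "h3": 900, "h4": 870}
kernel_by_host = {"h1": "5.4", "h2": "5.4", "h3": "5.9", "h4": "5.9"}

# Join on host, group by kernel version.
latency_by_kernel = defaultdict(list)
for host, ms in latency_by_host.items():
    latency_by_kernel[kernel_by_host[host]].append(ms)

for version, samples in sorted(latency_by_kernel.items()):
    print(version, mean(samples))   # one version stands out as slow
```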
Technique 3) Exemplars. Look at specific traces.
Exemplars let us go straight to traces from some sample queries that had problems. It gives immediate information and links to a Dapper trace (Google's distributed tracing system; Zipkin is based on its white paper), which gives us a much faster hypothesis.
Testing hypotheses means we need dynamic query building and evaluation. Every dashboard has a cost in compute/storage/cognitive overload and it only tells you about the last outage. The ability to quickly drill down into data is key to fast debugging. Honeycomb are the closest to doing this but they hope it will become a common paradigm that everyone uses.
Faster debugging means we get to spend our error budgets elsewhere.
Panopticon and exemplars are extremely cool and have the best developers and I'm delighted to see them being demoed outside Google :-D
Livetweets: https://twitter.com/crayzeigh/status/954027688469114880, https://twitter.com/wiredferret/status/954027598316736512
Accelerate DevOps Adoption With A Dojo, Manish Patel, Verizon
Verizon calls their immersive learning programs "dojos". They have three in the US and two more internationally. The idea is to accelerate learning by combining agile principles and designing as a team. Engineers get better at understanding the customer and the products they're building.
Before beginning, they have a four hour session with the team to understand what they're doing and why, how it benefits the customer and what their learning goals are. One team, a network group, wanted to reduce install time for equipment from 60 days to 7-14 days. They start with "chartering", a paper exercise. They put the results up on the wall so everyone is continually reminded of what they're working towards.
Each effort in the dojo is called a challenge and lasts 6 weeks. This team brought in business partners, network engineers, field ops, IT, QA. They use physical story boards and agile with 2.5 day sprints. Having this short cycle helped them realise they had downstream dependencies on another application and had to fly in one extra person.
The first sprints are spent on testing. They do retrospectives on what's going well and badly. The teams choose names and they celebrate with cakes.
Having teams collaborate has an acceleration effect. Email is slow. Having people from across all of the functions in one room means communication happens quickly. It changes the perspective of what they're doing, and whether it's even the right direction. One team, just from the paper exercise, discovered that the thing they were making was not what the user wanted. That was enough to cancel the project before a ton of time was wasted on it.
Livetweets: https://twitter.com/lizthegrey/status/954032686917738497, https://twitter.com/crayzeigh/status/954033045547438080
The History Of Fire Escapes: How To Fail, by me
I need to move to a Mac already. Drama presenting every time. But it went well and people appeared to not hate my central thesis, which is always a relief :-) If you want to read more about how New York City's fire code evolved, I've listed my references at http://noidea.dog/fires. And I came home with this incredible @MindsEyeCCF visual <3
Livetweets: https://twitter.com/lizthegrey/status/954340904478957568, https://twitter.com/wiredferret/status/954356033312493568, https://twitter.com/crayzeigh/status/954356248182484992
Choose Your Own Deployment: Interactive Feature Flag Adventure. Heidi Waterhouse, LaunchDarkly
This was immediately after my talk and I still haven't learned how to come back down quickly to earth after speaking so I missed a lot of this and am filling in gaps from livetweets. Thanks, livetweeters!
Heidi did this talk as a choose your own adventure game, which was a lot of fun. We followed Toggle, a space explorer, on a case study.
Something has gone wrong. How fast can you turn something off in production? "2 hours!" says someone. "How about 200 milliseconds? You can do that with feature flags.".
Dogfooding is for developers, but internal testing can be a larger group. Deploying the feature is a separate step from activating it. At the start you want to only turn it on for people who won't fire you. You can use feature flags to give different experiences to different users, and leave on old compatibility for customers who have old infrastructure.
Nobody wants to use software. (I think this is a great insight). They want to catch a Pokemon. They want to live indoors and eat food. They want to get stuff done. Make sure they can get stuff done with your software.
Make it accessible. We are only temporarily able-bodied. List all of the accessibility options and let users choose what they need.
Be careful with security and compliance: user ids might be PII you should handle carefully. Not everyone should be allowed to change the feature flags, and you should know who did. You need role-based access control and auditing. Auditing helps with troubleshooting too.
Feature flags let you offer feature tiers to some customers. You don't need to push separate binaries for some situations; everything is in the main binary but enabled or disabled per-customer. You can make gradual changes, develop more quickly, and fix things before most users have seen them.
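The "200 milliseconds, not 2 hours" trick and the gradual rollout both come down to the same pattern. Here's a toy sketch of my own (this is not LaunchDarkly's actual API; the flag names, tester list, and percentages are all invented): deploy the code everywhere, but activate it per-user with a stable percentage rollout plus an allow-list for internal testers.

```python
# Toy feature-flag check (not a real LaunchDarkly API). Deployment and
# activation are separate: flipping a number here changes behaviour in
# production without a redeploy. All names/values are illustrative.
import hashlib

INTERNAL_TESTERS = {"alice", "bob"}        # people who won't fire you
ROLLOUT_PERCENT = {"new-checkout": 10}     # flag -> % of users enabled

def bucket(user_id: str, flag: str) -> int:
    """Stable 0-99 bucket, so a user keeps the same experience."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: str) -> bool:
    if user_id in INTERNAL_TESTERS:
        return True
    return bucket(user_id, flag) < ROLLOUT_PERCENT.get(flag, 0)
```

Turning the feature off for everyone is just setting `ROLLOUT_PERCENT["new-checkout"] = 0`; turning it on for a tier of customers is a flag keyed on the customer instead of a percentage.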
Livetweets: https://twitter.com/lizthegrey/status/954366357411426304, https://twitter.com/bridgetkromhout/status/954371227195494400, https://twitter.com/crayzeigh/status/954366848002379778
Moneyball And The Science Of Building Great DevOps Teams. Peter Varol
Moneyball is about baseball, but it's also about building great teams.
The Oakland As were the lowest budget team but found advantage through better analytics. They discovered that baseball scouts were hiring for things that didn't translate into better performance: fielding wasn't important; pitching was, but getting on base was most correlated with winning.
Data wins out over expert judgement: we're biased by our preconceived notions about what constitutes success. We maintain beliefs about devops that might be wrong. We like to work with people who are like ourselves. We want to work on products we like to use. We focus on the wrong things and do too many activities by rote. We work too hard and become fatigued.
In "Thinking Fast and Slow", Daniel Kahneman lays out system1/system2 thinking and proves that the theory of the rational investor is wrong. System 1 keeps us functioning, but it's gullible and biased. System 2 is about deliberate, thoughtful decisions, but it's often lazy and difficult to engage.
Cognitive biases include answering questions based on being primed by previous questions, the halo effect of wanting our first impressions to be correct, basing decisions on easy heuristics rather than accurate data, and failing to account for regression to the mean.
We can learn from this. Generalists make the best DevOps engineers. Automation standardises our processes, but we need to step back and ask whether it's working for us.
Assume that we will estimate badly. We know that most projects fail or are late but we still believe ours won't. We want to assign credit or blame and believe some teams are good or bad but know that there will always be variation. Keep people sharp by not having a stressful environment.
Minimise errors in judgement. Recognise and reduce bias and groupthink.
Livetweets: https://twitter.com/lizthegrey/status/954374320280817664, https://twitter.com/wiredferret/status/954374419106942977
Strategies To Edit Production Data. Julie Qiu, Spring
Editing data directly with SQL is scary. Even if you normally get your commands reviewed, you can have a situation where it's urgent and nobody is around; it's easy to make mistakes, like leaving out the WHERE clause in a SQL UPDATE.
We all have disaster stories where we've made mistakes, not because we're bad engineers, but because mistakes happen. Having access to make any edit you want sets you up for failure.
Here are 5 strategies for safer editing.
1) Raw SQL, with reviews. The reviewer needs to approve the query.
This is easy to implement and has an audit trail of reviews. It tells us the types of queries that are being run, but also why we ran them, so we know what kinds of tools we should prioritise. The process encourages people to be more careful and teaches them how to do it right.
Raw SQL edits probably will never go away, but at least have a process. That's also a little more painful than just doing the thing, so people are encouraged to make tools.
But it's still possible to make mistakes, e.g., when copying and pasting. And the audit trail is being maintained manually and at will, and it's difficult to run long and complex logic.
2) Local scripts.
Start by writing the script and convert the sql logic to code. Use dry-run flags to preview the results before you commit them to production. You can reuse logic and common code, and write more complex queries, with access to your coding language's features.
But it's still easy to make mistakes, logs are only available locally, and you can have problems like your machine crashing while the script is running, or long scripts timing out.
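A minimal sketch of the dry-run pattern (the table, column names, and values are invented for illustration): preview exactly which rows would change, and only write when explicitly asked to commit.

```python
# Dry-run editing pattern (illustrative table/column names): preview
# the affected rows, and only write when explicitly asked to commit.
import sqlite3

def cancel_stuck_orders(conn: sqlite3.Connection, commit: bool = False) -> list:
    """Return the ids that would change; apply the UPDATE only if commit."""
    rows = [r[0] for r in
            conn.execute("SELECT id FROM orders WHERE status = 'stuck'")]
    print(f"would update {len(rows)} row(s): {rows}")
    if commit:
        conn.execute(
            "UPDATE orders SET status = 'cancelled' WHERE status = 'stuck'")
        conn.commit()
    return rows

# Demo against an in-memory database standing in for production.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "stuck"), (2, "shipped"), (3, "stuck")])

cancel_stuck_orders(conn, commit=False)   # dry run: nothing changes
cancel_stuck_orders(conn, commit=True)    # now the edit is applied
```

The same preview-then-commit shape carries through the later strategies; what changes is where the script runs and who signs off.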
3) Run on an existing server.
Write the script, copy it up to the server (maybe compiled in advance), then ssh in and run the script under screen or tmux.
Now you can run long scripts on existing infrastructure, with shared logs. But scripts can affect resources on your server, and it's not that user friendly: you need to ssh in and run screen. There's a lot of room for error. There's also no persistent audit trail: logs will get lost once the session ends.
4) Task runner.
Write the script and have it reviewed, but now it's triggered from a user interface, which sets up the virtual environment. Now we have persistent audit logs, enforced code review, and scripts run from a user interface.
But managing credentials is hard and there's not a clear separation of dev/production environments. And nothing makes sure we pass in the right arguments. Inputs aren't verified.
5) Script runner service
Allow fewer options; choose from a drop-down rather than typing in command line arguments. We can now parallelize and scale, preview the results and customise it as we like.
Which strategy to use depends on what your team needs. Over time, investing more effort is worth the cost.
Livetweets: https://twitter.com/lizthegrey/status/954385439682650112, https://twitter.com/wiredferret/status/954387967669399553
Moving fast at scale, Randy Shoup, Stitch Fix
We know faster is better, but we sometimes ask whether we should do something "fast or right?" We shouldn't need to choose between speed and stability. The DevOps Handbook says that high performing orgs deploy more often, recover from failure faster, and have a lower failure rate. The high performing orgs have speed AND stability; the things that let us go fast also make us stable.
Four aspects of moving fast.
* Organising for speed.
Conway's law says you ship your org structure. So if we want a modular system, we need a modular organisation. Small independent teams lead to flexible, composable infrastructure. Larger interdependent teams lead to larger systems.
This is not just a descriptive law, it's a normative law: we engineer the software system we need by engineering the organisation. So, small service teams: the "two pizza" teams. Typically 4-6 people, and all disciplines need to be represented for the team to function. Product teams, not project teams: they need a deep understanding of business problems.
Teams should grow through "cellular mitosis". As the team grows, split it in two. Ideally 80% of project work should fall within a team boundary.
* What to build and what not to build
The 2003 book Lean Software Development by Mary Poppendieck and Tom Poppendieck says that "Building the wrong thing is the biggest waste in software development". What problem are you trying to solve? Maybe it can be solved without technology!
Redefine the problem: change the process! Do the new process manually for a while, see if it works, and understand it very well before you begin to automate it.
In 2018 we don't need to build physical infrastructure, we use the cloud. Open source is usually better than commercial alternatives. This means that 2018 is a rough time to sell software, but a good time to sell services.
Iterate to a solution through experimentation. State your hypothesis: what metrics do you expect to move and why? Run a real A/B test, and log and measure user and system behaviour. This will let you understand why your experiment succeeded or failed.
* Prioritisation: when to build things.
Scarce resources mean we need to prioritise and make tradeoffs. We'd like this to be driven by return on investment.
Do fewer things, but get more done. Build one great thing instead of two half-finished things. Put people in pairs on the highest-priority thing, instead of one person on each thing. You deliver the full value of the first thing earlier. (This is something I'd never thought of before, but it immediately seems obvious, which I think means it's genius.)
* How to build
Quality and reliability are priority 0 features. If the site isn't up it doesn't matter how pretty it is. Developers should be responsible for features, quality, performance and reliability.
Use test-driven development. It gives you better code, and the courage to refactor: you can change things with confidence. Tests are executable documentation for how the code is supposed to work.
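A tiny illustration of the tests-as-executable-documentation idea (the function and its contract are invented for the example): the test states how the code is supposed to behave, and keeps stating it through every refactor.

```python
# Test-first sketch: the test below is executable documentation for
# slugify(), and stays true through any refactor. Example is invented.
import re

def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# The test, written before the implementation, describing the contract:
def test_slugify():
    assert slugify("Moving Fast at Scale!") == "moving-fast-at-scale"
    assert slugify("") == ""

test_slugify()
```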
You might hear "We don't have time to do it right". Ask "Do you have time to do it twice?".
Livetweets: https://twitter.com/lizthegrey/status/954397209168052225, https://twitter.com/wiredferret/status/954395763634855941
And that was my first DevOps Days! It felt like a friendly, inclusive, human-centered conference, and I loved the real-time transcripts of what the speakers were saying. I had a good time and was sorry I couldn't stay for the afternoon sessions and Open Spaces. I hope to be back next year.