Nine takeaways from the DevOps report

I like the Accelerate State of DevOps Report because it brings us actual statistics on what’s making software engineering organisations successful. Here are nine things I’m thinking about after reading this year’s report.


We’re measuring things!  Image: unsplash


As I've talked about before, software scares the crap out of me. Although we have a ton of best practices, we don't have a "fire code" that tells us what's the current "right" way to do anything. A lot of it depends on who you ask, which speakers you listen to, which books or blogs you read, and whether you subscribe more to the subculture of Site Reliability or Chaos Engineering or Resilience Engineering or something else. But most of the time we're just talking about what has worked for us. Knowledge-sharing is definitely useful but it's also anecdotal.

That's why I love the Accelerate State of DevOps Report, an annual publication that does statistical analysis on the practices that are making engineering organisations successful. The DevOps Research and Assessment group (better known as DORA) have been doing this research for six years now, with data from 31,000 tech professionals, skewing senior: 48% have more than 16 years of experience.

This year's report is 82 pages long. Honestly it's super interesting and I recommend you read the whole thing, but I'm going to write up a few things I’m still thinking about after having read it a few weeks ago.

Disclaimer: If anything here sounds wrong or weird, assume it's my error, not the report’s. I've tried to put my own opinionating in brackets [like this] anywhere it might be ambiguous. If I misrepresented anything from the report, drop me a line at heytanyafixyourblog@noidea.dog and I'll fix it. On the other hand, if you want to argue about the statistics, I'm absolutely not the right person to contact. You can read about the methodology on pages 77 and 78 of the report. (Consider not bugging the authors either though. Maybe go do something else that you would find fun instead? It's a great time of year for riding bikes.)

1. We're getting better at delivering software.

DORA cluster companies into "elite", "high", "medium" and "low" performers, where the "elite" organisations in the report are twice as likely as "low" performers to achieve their goals. They want to let teams benchmark themselves against the industry and understand how they can get better.

The key metric they analyse is Software Delivery and Operational Performance (SDO), a mash up [pretty sure that's the right statistical term :grin:] of the lead time for code changes, deployment frequency, percentage of changes that fail, time to restore after a failure, and a general measure of site availability. They showed that this metric is good at predicting whether an organisation is a high performer: it's a better predictor than software delivery performance or availability alone.
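To make those measures a bit more concrete, here is a rough Python sketch of how a team might compute the four delivery numbers from its own deployment history. The `Deployment` record, its field names and the sample data are all made up for illustration; the report does not prescribe a data model or these exact formulas, and the availability part of SDO is left out here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape, not from the report)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure needing a fix?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deploys: list, window_days: int) -> dict:
    """Summarise lead time, deploy frequency, failure rate and restore time."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample: five deployments over a five-day window, one of which failed.
start = datetime(2019, 9, 2, 9, 0)
deploys = [
    Deployment(start + timedelta(days=i), start + timedelta(days=i, hours=i + 1))
    for i in range(5)
]
deploys[2].failed = True
deploys[2].restored_at = deploys[2].deployed_at + timedelta(minutes=30)

metrics = delivery_metrics(deploys, window_days=5)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Medians are used here rather than means so that one unusually slow change doesn't dominate the summary; that is a choice made for this sketch, not something taken from the report.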

They also measured productivity, which is the ability to get complex, time-consuming tasks finished without being distracted. Juggling different projects or different types of work was [surprisingly to me!] not really different between low and high performers. If you get better at SDO and have higher productivity you'll probably be a more successful org, so it's worth digging into anything that's stopping you from doing those things.

The report finds that there are many more elite orgs this year than last year. The industry is getting measurably better. And it mostly doesn't matter what industry you're in -- web service companies aren't better than finance, and so on -- with one exception: retail is better at delivering software than everyone else. With slim margins, they need to be nimble or they die.

2. Reacting quickly is a superpower.

People used to think there were tradeoffs between velocity and availability. You might think moving more slowly leads to higher quality software, which means higher availability, but the research shows the opposite is true. Elite teams deploy more often (multiple times a day), get changes out more quickly, and (I think this is key!) can restore service most quickly. They move really fast! But they still have a change failure rate of less than 15%.

[It makes sense: if you find a bug inside a few seconds, fix it immediately, and then spend a week deploying the fix, that's a lot of minutes of really pointless downtime.]
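The arithmetic behind that is simple enough to write down. In this small Python sketch, the time users are affected is treated as the sum of time to detect, time to write the fix, and time to deploy it. The durations are invented for illustration and the function name is hypothetical; none of this comes from the report.

```python
from datetime import timedelta


def user_visible_downtime(detect: timedelta, fix: timedelta, deploy: timedelta) -> timedelta:
    """Time users are affected: detecting the bug, writing the fix, shipping the fix."""
    return detect + fix + deploy


# Same bug, same fix; only the deployment pipeline differs (made-up durations).
detect = timedelta(seconds=30)
fix = timedelta(minutes=10)

slow = user_visible_downtime(detect, fix, deploy=timedelta(weeks=1))
fast = user_visible_downtime(detect, fix, deploy=timedelta(minutes=15))

print(f"week-long deploy: {slow}")
print(f"15-minute deploy: {fast}")
print(f"downtime caused purely by slow deployment: {slow - fast}")
```

With these numbers, almost all of the slow team's downtime is spent waiting on the deployment rather than on finding or fixing the bug.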

3. More effective organisations are not just using Cloud, but using it well.

A lot of people felt frustrated because they were using Cloud but not getting all of the magical benefits that cloud computing promised. It turns out that using the Cloud does predict faster software delivery performance and higher availability, but only if you use it right.

50% of respondents said they were using public cloud and 27% said "hybrid cloud". But people don't really agree on what hybrid cloud is, or even really what "using the Cloud" means. [I saw a LISA talk in 2009 where Rich Wolski from Eucalyptus Software said that nobody was sure what Cloud was, but everyone was sure that Amazon was it, so they were copying Amazon's APIs and that meant they were doing Cloud too. (I'm not mocking at all; I think this is legit.) But now it's 2019 and… has that changed?]

Everyone uses pictures of clouds to illustrate the cloud and I am no exception.  Image: unsplash


Since there's so much confusion, DORA worked with the five essential characteristics of cloud computing as defined by the National Institute of Standards and Technology (NIST), and asked people whether they're specifically doing those. They are:

  • on-demand self-service: you can provision computing resources as needed without human interaction with a service provider.

  • broad network access: capabilities are available over the network and accessed through standard mechanisms. (I didn't understand this but https://www.techopedia.com/definition/28785/broad-network-access says it means you can access resources from outside the company's network.)

  • resource pooling: resources are shared in a multi-tenant model, and dynamically assigned when they're needed. Consumers don't know exactly where their resources are.

  • rapid elasticity: you can rapidly scale up and down on demand.

  • measured service: systems control, optimize and report resource usage per service.

Only 29% of the respondents who said they were "using cloud" agreed that they did all five of these things. The "elite" teams were 24x more likely than the low performers to say that they did. It matters how you do cloud.

Cloud usage doesn't always save people money, because it's easy to accidentally spend too much and get it wrong. [I can vouch for this: I accidentally spent $50 on Kubernetes the Hard Way and I consider that a fair price for the lesson in turning things off when I'm done with them. Also I did this.]

But respondents who said yes to all 5 cloud characteristics were 2.6 times more likely to be able to accurately estimate their costs of operating software, twice as likely to be able to identify their most expensive applications and 1.65x as likely to stay under budget.

4. Psychological safety improves productivity. Productivity reduces burnout.

DORA’s research shows that if you optimise for information flow, trust, innovation, and risk-sharing, you allow team members to take calculated and moderate risks, speak up, and be more creative. Blameless post-mortems support growth and learning from failure. The analysis found that a culture of psychological safety predicts software delivery performance, organisational performance and productivity.

Productivity is often thought of as the company achieving its goals, but the report finds that it's also good for the individual. It increases "work recovery", the ability to deal with work stress and disconnect from work outside work hours. Employees at low performing organisations were twice as likely to report feeling burned out. Burnout is a combination of exhaustion, cynicism and inefficacy at work. [It's also now recognised by the WHO as an "occupational phenomenon". I really recommend this talk on it by Dr. Aneika Simmons and Anjuan Simmons.]

5. Tooling is a good investment. Let people choose their own tools.

Automation is a sound investment: it frees up engineers from manual work to do higher value activities. Elite performers automate and integrate tools more frequently into their toolchains. But they're thoughtful about which software is strategic and which is just utility. They use commercial off-the-shelf software for the "utility" problems and save their resources for strategic software development efforts. [Intercom have written about this at length.]

People are more productive when they can choose the right tool for the job.  Image: unsplash


When people can choose their own tools, they choose the ones that make them most productive and have better software delivery performance. [This was contrary to my gut feeling that it's better to standardise on a set of tools. I still suspect that it's worth the effort to supply a bunch of well-integrated standard tools, but let people use their own if they want.]

The report says that it's worth investing in automated test suites, deployment automation, monitoring, wide scale refactoring tools, dependency management and good internal and external search. Use version control, including for config and scripting.

Tooling should be easy to use even if it's designed for power users / advanced technologists. Usability matters.

6. Code maintainability and paying down tech debt are worth it.

Engineers need to be able to build good mental models and understand changes. We should architect for flexible, extensible and visible systems, and emphasise code maintainability, loosely coupled architecture and monitoring. Developers need to be able to change code maintained by other teams, find examples in the code base and reuse code. It needs to be easy to add, upgrade and migrate to new versions of dependencies without breaking code. Technical debt slows down productivity.

Loosely coupled architecture means teams can independently test, deploy and change their systems on demand without depending on other teams, and this leads to higher performance. You can decouple large domains with bounded contexts and APIs, or service oriented architecture, though avoid premature decomposition of new systems, or overly fine-grained services.

We should refactor as part of daily work. Respondents with high technical debt (e.g., known bugs, insufficient test coverage, dead code/artifacts not cleaned up, incomplete migrations, obsolete technology, and outdated docs) were 1.6 times less productive. High performers were 1.4 times more likely to have low technical debt.

[My coworker Jon has a great post about when technical debt is the smart choice.]

7. It needs to be easy to make changes.

Some people claim formal change approval leads to more stability. Others argue that streamlined change approval gives faster feedback and better outcomes. Who's right? DORA found that requiring the approval of an external body (e.g., a change advisory board or senior manager) had a negative impact on delivery performance. Orgs who did that were 2.6 times more likely to be low performers.

Heavyweight approvals lead to more failures because the slower process makes people release larger batches when they do have an opportunity to get changes out. This means higher levels of risk and higher change fail rates.
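One way to see why bigger batches carry more risk is a back-of-the-envelope model. If each change independently has probability p of causing a failure, a release containing n changes goes wrong with probability 1 - (1 - p)^n. The independence assumption and the 5% figure below are simplifications of my own for illustration, not numbers from the report.

```python
def batch_failure_probability(per_change_failure: float, batch_size: int) -> float:
    """Chance that at least one change in a batch fails, assuming independent changes."""
    if not 0.0 <= per_change_failure <= 1.0:
        raise ValueError("per_change_failure must be between 0 and 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return 1.0 - (1.0 - per_change_failure) ** batch_size


# Illustrative only: a 5% chance that any single change causes a failure.
for size in (1, 5, 20):
    risk = batch_failure_probability(0.05, size)
    print(f"batch of {size:>2} changes: {risk:.0%} chance the release fails")
```

A large batch is also harder to debug when it does fail, because any of its changes could be the culprit; the model above doesn't capture that part.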

They suggested a lighter weight approach: require every change to be approved by someone else on the team as part of code review. You can use automated thresholds to bound changes, e.g., not allowing compute costs to rise over a certain threshold even if the change is peer reviewed.
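As a rough illustration of combining peer review with an automated threshold, here is a minimal Python sketch of a deploy gate. The `ChangeRequest` shape, the `can_deploy` function and the cost figures are hypothetical; a real pipeline would pull approvals and cost estimates from whatever review and billing tools it already uses.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed change (hypothetical shape, for illustration only)."""
    author: str
    approvers: set = field(default_factory=set)  # names of people who approved
    projected_monthly_cost: float = 0.0          # estimated compute cost after the change


def can_deploy(change: ChangeRequest, cost_ceiling: float):
    """Return (allowed, reasons). Peer approval is required but is not sufficient."""
    reasons = []
    # Authors approving their own change doesn't count as peer review.
    peer_approvers = change.approvers - {change.author}
    if not peer_approvers:
        reasons.append("needs approval from a teammate other than the author")
    # The automated bound applies even when the change has been peer reviewed.
    if change.projected_monthly_cost > cost_ceiling:
        reasons.append(
            f"projected cost {change.projected_monthly_cost:.2f} exceeds ceiling {cost_ceiling:.2f}"
        )
    return (not reasons, reasons)


reviewed_but_pricey = ChangeRequest("ana", {"ben"}, projected_monthly_cost=900.0)
allowed, reasons = can_deploy(reviewed_but_pricey, cost_ceiling=500.0)
print(allowed, reasons)
```

The point of the design is that neither check replaces the other: a teammate catches the things a threshold can't express, and the threshold catches the things a reviewer might wave through.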

Instead of change approval boards, do this kind of peer review based approval and automation to detect and prevent bad changes. Then continuous testing, continuous integration, comprehensive monitoring and observability let us correct any errors quickly.

Change management was consistently one of the biggest constraints DORA found in their work with big organisations, and respondents with a clear change process were 1.8x more likely to be elite performers.

Having a clearly understood process for making changes reduces burnout too.

8. Community structures are better than groups of experts.

Community structures make change happen.  Image: unsplash


The report looked at how organisations spread cultural change, like Agile or DevOps methods. They found that high performing orgs create formal and informal community structures: communities of practice, grassroots groups pulling together resources, small groups doing proofs of concepts, etc.

Low performers preferred training centers (sometimes called DOJOs) where you take people out of their normal routines to learn new tools or tech or culture and then put them back. They also do Centers of Excellence, having all the expertise in a consulting group. These approaches can cause siloing, bottlenecks and isolated experience. Also they remove the experts from doing their regular work, which slows everything down.

9. Transforming to a DevOps model needs a mix of org-level and team-level work.

Despite this report being about adopting DevOps, I still struggled with defining it here. I've mostly been in companies that considered the DevOps way to be the normal way so, when someone describes it, it's hard for me to pick out the parts that are the prescriptive philosophy and the parts that are just descriptive of deploying software anywhere. But when the report’s authors talk about transformation towards DevOps, I think they mean:

  • being able to deploy on demand

  • making lots of low risk changes instead of few big-bang changes

  • assuming things will break and planning for that

  • using config as code (i.e., repeatable!), not clicking on things or running arbitrary commands

  • automating everything

  • measuring everything

and

  • if you have "dev people" and "ops people", having them talk to each other, share information and be nice to each other. [And maybe be on the same team? And likely be the same people? I don't think the DevOps philosophy has an opinion on that, but I'm never sure.]

When companies buy into all of this and decide that they want to follow DevOps models, deploy quickly, and all the rest of it, it isn't easy. Organisations can suffer from "death by initiative": they try to do too much and don’t put enough resources on any one thing. The report says they’re more successful if they choose a few things that are holding the org back, put time, money, and executive sponsorship behind them, then iterate.

Organisations should identify high-level short term and long term outcomes, but they shouldn’t be super prescriptive about how to get there: teams need freedom to decide how they’re getting there and adapt to whatever changes happen along the way. There are org-level changes that can help though, e.g., creating central "force multiplier" solutions like providing a central platform for CI or removing architectural roadblocks.

Conclusion

DORA reckon that DevOps is going to become the standard way of doing software development. That's already true in the parts of the industry I work in, and I think it's cool that the philosophy is spreading.

I'd love to see this kind of analysis for other aspects of the industry, or for this report to roll in a bunch more things that are becoming best practice and feel like they're working. I’m already looking forward to next year’s report.

Again, read the whole thing here. And if you know any of Dr Nicole Forsgren, Dr Dustin Smith, Jez Humble or Jessie Frazelle, thank them for the work :-)