I enjoyed Infrastructure Now 2018, O’Reilly’s report on current and upcoming hot topics in infrastructure.
As you would expect, there is a lot of emphasis on cloud computing. The report points out that there’s a gap between the dream of operating at a higher level of abstraction and the reality: the various off-the-shelf components still need a ton of glue code to interoperate. True Infrastructure as a Service is some distance away, but we are getting closer. Complexity will become a focus, either to reduce it or to build new mechanisms for surviving it.
Serverless is a theme that runs through the report. I haven’t spent a lot of time on serverless architectures or Function as a Service, but I’ve certainly been dazzled by easy access to magical-seeming functionality supplied by other people. Machine learning, data processing and other FaaS capabilities will change the landscape of how applications are built. One of the contributors, Julia Grace from Slack, points out that this will be a game-changer for a lot of companies without deep cloud expertise.
Many of the difficulties discussed feel very familiar to me: balancing an opinionated set of standards against supporting existing use cases; being drawn to powerful technologies but not wanting to get too tightly bound to any one vendor; the vast management complexity of Kubernetes and containerization. It’s interesting to read that others are going through the same struggles.
Some other highlights: Microsoft’s Brendan Burns predicts that distributed systems will stop requiring expert skills, but also that we’re going to be stuck with “a spaghetti ball of interconnected microservices”. Mike Roberts from Symphonia cautions that we’ll need improved cluster management, discovery and deployment. (I think dependency management will also be key.) Nick Rockwell points out that once we’ve solved our stability problems, our focus shifts to cost optimization, which we’re not yet well equipped to monitor.
The line between the Operations and Development functions has been blurry for a while, and several of the contributors believe that’s going to continue, with much decreased demand for specialist “operations” roles. Terren Peterson notes that we should be comfortable with lower velocity from generalists, since they’re spending 20-30% of their time doing work that used to belong to a dedicated role, and Camille Fournier warns that on-call load can cause burnout in software engineers. That’s certainly true, though I’d note that it also affects dedicated ops people, even those who appear to be drawn to the fires. We need to be careful about our alert load, and even more so as we embrace complex distributed systems which, as Julia says, have a more complicated answer to “Is the system up?”.
There’s lots more in there. The report solidified a bunch of things I’ve been thinking about and made me think about some ideas that I hadn’t considered. I’ve summarised some of the ideas below, but it’s well worth the time to grab a copy and read the whole thing.
(With apologies for any errors; drop me a line if I misrepresented anything and I’ll fix it.)
tools must be accessible to non-experts, compatible with existing systems and cost effective
the next ten years will be all about reducing complexity
abstraction will change what SRE and operations means; specialised roles will decrease
legacy systems will continue to require time and effort
infrastructure as a utility is gaining traction
What is the biggest infra challenge you're facing now?
(Brendan Burns, Microsoft) APIs that are general, simple and empowering. Balancing forcing people to do the right thing vs meeting them where they are.
(Julia Grace, Slack) Hyper growth: balancing immediate problems against being ready for future scaling issues. Building/Buying to enter complex markets. Hiring.
(Nirmal Mehta, Booz Allen Hamilton) Meeting regulatory standards. Hiring. Moving legacy systems to modern paradigms like DevOps or containerization.
(Alan Ning, USDS/Department of Veterans Affairs) Security and governance while trying to onboard huge numbers of engineers.
(Terren Peterson, Capital One) Staffing and training to take advantage of innovations in tech.
(Mike Roberts, Symphonia) Dealing with new platforms and paradigms that are embraced by teams but not yet mature enough to be able to run easily.
(Nick Rockwell, New York Times) The many decisions involved in using cloud providers. Keeping cloud costs under control.
(Casey Rosenthal, Backplane.io) The balance between investing in centralized solutions vs keeping decisions reversible.
(Christopher Wright, Glossier) Unnecessary operational complexity. Failure to fully turn down things that are deprecated.
What will change next in underlying infrastructure?
(Brendan Burns, Microsoft) Tooling and environments that make it much easier to build distributed systems without being an expert.
(Camille Fournier, Two Sigma) Serverless tools and processes (e.g., monitoring and debugging serverless apps) will start to mature.
(Julia Grace, Slack) Corporations are becoming more open to the cloud: typically file sharing first, then internal workloads. Containers shift us up the stack. Kubernetes is winning, but it's complicated and migrations are costly. Cloud lets us stop thinking about infrastructure but we need to architect very differently. Adding API layers on old infrastructure may cause scaling issues as new systems add demands on old systems.
(Nirmal Mehta, Booz Allen Hamilton) More containerization and prepackaged DevOps and CI/CD pipelines. More abstraction: first serverless, then assembling business applications from drag-and-drop components.
(Alan Ning, USDS/Department of Veterans Affairs) Agile, iterative paradigms, though government's having trouble embracing them.
(Terren Peterson, Capital One) Standardization and reduction in complexity. It will take years to move off legacy technologies; supporting them alongside new systems will slow organizations down.
(Mike Roberts, Symphonia) Infrastructure as fully managed services, infrastructure as a commodity. More serverless offerings, e.g., for data, machine learning, analysis.
(Nick Rockwell, New York Times) Serverless will change how we build apps. Managing infrastructure will get simpler and operations as a function will blur even more with engineering.
(Casey Rosenthal, Backplane.io) More cloud. Serverless for data science and AI. Easier data storage. Network speeds are increasing faster than storage speeds, so data locality becomes easier. Developer user experience will improve; developers will have fewer operational concerns.
(Christopher Wright, Glossier) Distributed architecture is still difficult, e.g., distributed tracing isn’t solved yet. Google’s work here will be interesting.
How will DevOps and SRE roles change?
(Brendan Burns, Microsoft) The DevOps/SRE mantra was always that you should automate yourself out of a job. CD and self-healing infrastructure are finally freeing SREs from maintenance and letting them focus on making applications reliable.
(Camille Fournier, Two Sigma) Developers doing ops work is here to stay, though developers who are more interested in infrastructure often end up with the "DevOps" responsibility. On-call work can cause burnout, so you need to invest in making the support load manageable. It's hard to know where the job of infrastructure engineers ends and that of reliability engineers begins; skeptical that the SRE model applies outside certain companies.
(Julia Grace, Slack) Operations and infrastructure have converged. Systems are too complicated for a single engineer to keep a model in their head: we need cooperation between teams of experts. Every job is now software engineering: you need to know the fundamentals of data structures, design patterns and networking. We’re moving from horizontal low-level teams to vertical teams with clear ownership over services.
(Nirmal Mehta, Booz Allen Hamilton) We focus on technology change but cultural change is harder. Need to change from delivering an application to delivering a capability, doing whatever needs to happen to deliver the full package. Our mental models need to change for cloud.
(Alan Ning, USDS/Department of Veterans Affairs) Ideally the line between developer and operations will blur.
(Terren Peterson, Capital One) We’re rebalancing between dedicated job functions and generalist engineers. Developers do self-service building and deploying. Specialist roles will diminish; we'll need to change our expectations of the velocity of generalists since they now spend 20-30% of their time doing work that used to be done by specialists.
(Mike Roberts, Symphonia) Infrastructure is now as dynamic as code and will get more so; we need to rethink operations for a constantly changing environment. Developers need to understand production. Ops folks need to help others run software instead of running it themselves.
(Nick Rockwell, New York Times) It’s good that lower-value activities are being commodified and automated. Operations becomes a specialised form of software engineering, focused not on reliability and scalability but on developer productivity.
(Casey Rosenthal, Backplane.io) DevOps, SRE and Chaos Engineering are pragmatic solutions to the unavoidable complexity of distributed systems. We will adjust to dealing with more complexity and learn to navigate it rather than trying to eliminate it.
What legacy infrastructure do you think you'll be stuck with?
(Brendan Burns, Microsoft) A spaghetti ball of interconnected microservices whose connections nobody understands. Especially true in serverless/FaaS because the code's so separated.
(Camille Fournier, Two Sigma) We'll have physical datacenters for a long time. Major legacy updates only happen when there's a pressing business need and they always take longer and cost more than people anticipate.
(Julia Grace, Slack) Tens of thousands of external developers rely on Slack APIs so we have to keep them working indefinitely. We know how to run our monolith at scale and will likely have it for a long time.
(Nirmal Mehta, Booz Allen Hamilton) Legacy technology never really dies. There's still plenty of COBOL out there. The IT industry has a massive sunk-cost fallacy.
(Alan Ning, USDS/Department of Veterans Affairs) We only migrate projects to the cloud if they have enough users to make it worthwhile. Everything else stays on-premises until its end of life.
(Terren Peterson, Capital One) It's not cost effective to replace mature technology stacks that don't require maintenance. We'll replace the ones with heavy operational maintenance and expensive licensing models. Surprisingly, it could mean that COBOL systems stay around while newer systems get replaced. Everything is legacy as soon as it hits production.
(Mike Roberts, Symphonia) Large Java and Rails apps with shared databases will be around for a while. We have to decide whether to try to evolve existing codebases or rewrite them.
(Nick Rockwell, New York Times) We (the NYT) won't be stuck with any legacy infrastructure. Migrations are worth it.
(Casey Rosenthal, Backplane.io) The industry will be stuck maintaining container platforms that aren't worth the effort. They're interesting technologies that make the operators feel powerful, but almost none of them optimises job scheduling and the modest improvements in developer experience are cancelled out by the cost of running the platform.
(Christopher Wright, Glossier) Glossier is a startup without much tech debt yet, but we're already making decisions based on needing to deal with a monolithic Rails app we can't split up.
What is the impact of infrastructure abstraction moving up the stack?
(Brendan Burns, Microsoft) Hopefully building abstractions closer to the abstractions developers think about: a container is more like an application than a VM is.
(Camille Fournier, Two Sigma) We haven't moved as far up the stack as we think: yes, we need fewer engineers on hardware and power, but we still need to know storage performance, cloud offerings, etc. We still write a ton of glue code to make the pieces fit together; we can't stop understanding the lower levels yet.
(Julia Grace, Slack) In the short term, things are more complicated: when there are many services, it's harder to judge whether the system is up. We have access to fantastic abstractions but to debug them we still need to know how everything works.
(Nirmal Mehta, Booz Allen Hamilton) At an Azure Logic Apps training I used drag-and-drop functionality to build a language translation and sentiment analysis application in 20 minutes; eight years ago it would have taken a year. As the variety of features and services increases, we'll build more apps this way. Full data pipelines as a service mean we don't even have to manage databases any more. It will change how IT budgets are oriented; we'll buy these services with some consulting and IT will assemble them.
(Alan Ning, USDS/Department of Veterans Affairs) Moving from containers on your own infrastructure to SaaS has huge policy and governance implications. As we move higher up the stack, the security boundary expands.
(Terren Peterson, Capital One) Modern architectures will focus on orchestration layers and microservices; applications will need well-defined components and coordination of data flow. The rate at which we want to change components of our systems will accelerate as service providers give us innovations we want to use.
(Mike Roberts, Symphonia) We get a lot more freedom to quickly deploy new components. But the flip side of that is that we end up managing a huge number of components and interconnections between them.
(Nick Rockwell, New York Times) Much less ops work. The work shifts to developer productivity, especially safe deployment. As we solve stability problems, we have new cost optimization problems which change how we monitor.
(Casey Rosenthal, Backplane.io) Serverless will mean engineers can focus only on business logic and not care about their stack.
(Christopher Wright, Glossier) We'll rely more on AWS and GCP and be conservative about technologies they don't support. The model of services with clean boundaries gives us a new problem: service ownership.
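Julia Grace's point that it's harder to judge whether the system is up when there are many services can be made concrete. Here's a minimal sketch in Python of one way to collapse per-service health into a single answer. None of this comes from the report: the service names, the `critical` flag and the rule that only a failed critical service counts as "down" are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass(frozen=True)
class ServiceHealth:
    name: str
    healthy: bool
    critical: bool  # assumption: every service is tagged as critical or not


def overall_status(services: list[ServiceHealth]) -> Status:
    """Collapse many per-service checks into one answer.

    Assumed rule: any unhealthy critical service means DOWN, any other
    unhealthy service means DEGRADED, otherwise UP.
    """
    if any(s.critical and not s.healthy for s in services):
        return Status.DOWN
    if any(not s.healthy for s in services):
        return Status.DEGRADED
    return Status.UP


if __name__ == "__main__":
    # Hypothetical fleet: the names are made up for this example.
    fleet = [
        ServiceHealth("auth", healthy=True, critical=True),
        ServiceHealth("search", healthy=False, critical=False),
    ]
    print(overall_status(fleet).value)
```

Even this toy version needs a third state between "up" and "down", which is roughly the complication being described: the answer depends on which services you decide matter.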
Where will the most advances in infrastructure occur over the next 12 months?
(Brendan Burns, Microsoft) Hopefully, developer productivity and empowering novice developers to build distributed systems.
(Camille Fournier, Two Sigma) Machine learning and large scale data processing. Investment and migration to Kubernetes, despite its sharp edges. Spark (for analytics and data) recently released Kubernetes support.
(Julia Grace, Slack) Serverless computing means access to the cloud with less specialized knowledge. More tooling around managing large numbers of services. Public cloud will mature and be more competitive in price and features, especially in AI/machine learning and other niches.
(Nirmal Mehta, Booz Allen Hamilton) More maturity in the container ecosystem. More language support in serverless. Containers for data pipelining/batch jobs. More GPU support. More adoption of cloud-centric databases and maybe improvements for high-availability Postgres and MySQL on cloud.
(Alan Ning, USDS/Department of Veterans Affairs) More FaaS adoption, e.g., classic cron jobs moving to AWS Lambda or Azure Functions.
(Terren Peterson, Capital One) More adoption of serverless. With an increase in generalists, simpler and more usable infrastructure will win. Nicholas Carr's 2008 book, The Big Switch, predicted cloud acting like utilities, similar to the electrification of industry.
(Mike Roberts, Symphonia) Improved container management. Kubernetes complexity should become as opaque to the average engineer as Linux kernel architecture is; most people won't need to know. We'll need to see improved offerings for automated cluster management, discovery and deployment.
(Nick Rockwell, New York Times) Stripped down containers will become the abstraction for deploying to serverless infrastructure. Managed databases become mainstream. Development toolchains, deployment and monitoring will need major work but will pay off.
(Casey Rosenthal, Backplane.io) The big tech companies will advance the infrastructure and a handful of startups will bring the lessons learned to the rest of the industry.
(Christopher Wright, Glossier) We'll see more advances in Kubernetes and maybe changes in CI/CD. There are new and exciting applications for FaaS.
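To illustrate Alan Ning's example of classic cron jobs moving to AWS Lambda, here's a rough sketch of what such a job might look like as a Python function. The `handler(event, context)` shape is the usual entry point for a Python Lambda, and scheduled invocations pass an event with an ISO 8601 `time` field. Everything else is an assumption for illustration: the session-cleanup job, the `purge_expired` helper and the in-memory `SESSIONS` store are hypothetical, and the schedule itself would be configured outside the code.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for real storage: session id -> expiry time.
SESSIONS = {}


def purge_expired(sessions, now):
    """Hypothetical job body: delete sessions whose expiry has passed."""
    expired = sorted(key for key, expiry in sessions.items() if expiry <= now)
    for key in expired:
        del sessions[key]
    return expired


def handler(event, context):
    """Entry point invoked on a schedule instead of by a crontab line."""
    raw_time = event.get("time")
    if raw_time:
        # Scheduled events carry the trigger time, e.g. "2018-06-01T03:00:00Z".
        now = datetime.fromisoformat(raw_time.replace("Z", "+00:00"))
    else:
        # Fallback for manual invocations that carry no timestamp.
        now = datetime.now(timezone.utc)
    removed = purge_expired(SESSIONS, now)
    return {"ran_at": now.isoformat(), "removed": removed}
```

The job logic is unchanged from what a cron script would run; what moves to the provider is the machine that the crontab used to live on.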