Conference Report: SRECon Americas 2017 Day 1
Although it's existed for three years, SRECon (Americas) 2017 was my first SRECon. I’ve been to LISA a bunch of times and wondered how SRECon would compare. Mostly, I liked it about the same (i.e., very much). The majority of the talks felt like they would have fit at either conference, though there were a couple of deep architectural discussions that I might not have seen at LISA.
I enjoyed the emphasis on chaos engineering and intentionally breaking things. And (like I said about LISA a few months ago) I like how much our industry is waking up to ‘humaning’ being a difficult and important skill.
I took a lot of notes during the sessions. Here are my summaries. Let me know if I misrepresented anything and I'll fix it. Day two is here.
So You Want To Be A Wizard. Julia Evans, Stripe.
It's b0rk! Julia writes with contagious enthusiasm about the joy of learning stuff. She chooses a new topic -- operating systems, or networks, or SSL -- and dives in, then creates blog posts and comics to make it easy for other people to understand it too. I was really looking forward to this talk and I was not disappointed!
The first obstacle to being a wizard is that computers aren't magic, and there are no incantations: we have to learn hard things. And learning is fun so that's good news, but how do we learn? Here are some ways:
Understand your systems. You can still get things done without understanding, but if you know what's happening, you can innovate. Chip away a bit of knowledge at a time. She learns by taking on projects, e.g., building a TCP stack from scratch (in Python!) or writing a keyboard driver. The experiments don't have to be good, or even work, but you have to learn something. She also recommended reading things that are a little too hard. When you don't understand something, dig into it.
Ask questions. Asking a really good question is like telepathically moving information from someone else's brain to yours. To ask a good question:
do some research first, so you aren't asking about the basics
ask someone with just a bit more knowledge than you, not the expert. It's less load on the expert and your colleague will learn more from teaching you.
state what you already know. It helps organise your thoughts and reveals misunderstandings. It also makes sure you get an answer that's not too easy or too hard.
ask yes/no questions; they're easier to answer briefly.
if someone just fixed something, ask them how they did it while it's fresh in their mind.
ask questions in public, especially if you're senior.
That last one is key, I think. You can do a lot for your team culture by admitting the things you don't know.
Read the code. Even difficult codebases, even in languages you don't know. It can be easier than you expect to go look at the code and see what's really happening.
Debug like a wizard. To be better at debugging:
remember that the bug is happening for a logical reason. They're just computers. (This often doesn't feel true with distributed systems, but I guess it is :-) )
persevere, but if it's going to take weeks, make sure it's a thing worth spending weeks on.
train your intuitions. It helps if you have a feeling for what should be happening. Try http://computers-are-fast.github.io to help hone your instincts.
know your toolkit: tcpdump, perf, etc. Check out her debugging tools zine!
Eventually, you'll learn to like debugging, and move from "oh no, a bug!" to "I'm about to learn something."
Write down a design. Use design docs, project briefs, etc. They help you catch misconceptions early and get good ideas. (Like asking good questions, I guess!) Consider writing an announcement email before you start making something. This makes sure you've thought out why the thing's important, how it impacts other teams, and how you can be sure it's working.
Know why you're doing it. Either be sure that you know why the thing's important, or go do something else. When you wonder "Should I do X or Y", the answer often depends on "Why are you doing it?".
I really enjoyed this talk. Julia's an enthusiastic, dynamic speaker and she makes it feel safe and exciting to not already know things. She left us with two takeaways: ask questions in public, and read something that's too hard for you. Ok, I will!
Keep Calm and Carry On. Scaling Your Org with Microservices. Charity Majors, Honeycomb, and Bridget Kromhout, Pivotal
"How many of you are 'micro'ing some 'services'?" Bridget asked. A lot of people weren't sure if they were, and that seemed to be what the speakers expected. "What even is a microservice?" was their opener. Our industry is still figuring that out.
Usually we mean independently deployable, small, modular systems. Maybe there's a distinction between mono-repo vs multiple repos, maybe not. Whatever it means, the hard part of microservices, like the hard part of tech in general, is humans.
So, should we use microservices? It depends. They should be a consequence of your needs, not a goal in themselves. You should use the easiest architecture possible: if you can run on the LAMP stack, do that. 'Choose boring technology' still applies. If it's not core to your business, don't innovate. Avoid "resume-driven development".
Every service must be owned by a human, but don't have human SPOFs. People need to be able to go on vacation or to conferences without being called. Do chaos monkey for people. Nobody should be so busy that they don't have time to write state to shared storage. One form of state sharing: code descriptions should say why, not what.
If you split teams, aim for a 'startup' feeling, and a sense of ownership, and don't babysit people. Nobody should feel like 'we maintain a slice of a slice of someone else's thing'. You can't ask people to care about something unless they have power and responsibility.
What's management's role in this? To keep everyone on the same mission. They should repeat the mission until everyone is bored of hearing about it. Their secondary role is information routing, load balancing and health checking. And management should never be a promotion: it's a different job. You need to provide an equally rich and fulfilling career path for ICs.
Understand your communications channels, including the implicit ones like who gets promoted, and the unofficial ones like gossip and happy hours.
Does the SWE team have to be on call? Maybe! Don't suddenly force your dev team to own the on call for a noisy service -- they'll quit! But having them part of the rotation creates a virtuous cycle. The amount of time spent running the system is much higher than the time writing it, and nobody should be named a senior engineer if they don't take the operability of their system seriously. Ops is everyone's responsibility. "Designated Ops not Dedicated Ops", as @beerops tells us.
Observability is the rock on which everything is constructed. Democratize observability: trust each other with information; don't silo it because 'security'. Expose debugging information, monitoring, metrics, instrumentation.
And observe people too. Know whether people are ok. You have a responsibility to your team's wellbeing whether you're a manager or not. Yes, you can and should talk to people. Majoring in CS, you can end up thinking that you won't spend time on human problems, but that's a bait and switch. It's humans all the way down.
How Do Your Packets Flow. Leslie Carr.
Leslie opened by joking that we're all in the cloud now, so don't need to know how our networks work, right? Laughter from the audience. Nope, we still care. She gave a whirlwind tour of how our packets get where they're going.
Physical. Servers aggregate to rack switches, then to an aggregation layer that used to be built from expensive switches, but is now often a Clos network (which we can think of as a Redundant Array of Inexpensive Switches). These days, servers can have 10Gb, 25Gb or even 100Gb network cards.
Lasers can let us put a lot of bandwidth into a cable. Fiber optic cables are suuuuper fragile and tiny, but are enclosed in protective layers to make them robust. A coating called cladding also reflects the light back into the cable.
The cable would still not carry as much data as we need, but we have a tech called "wavelength division multiplexing", which is a fancy term for "a prism". It lets us split the signal into up to 128 different colours, which gives us 26.6Tb on each fiber strand. 100 of those strands can be bundled into a cable.
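As a back-of-the-envelope check on those figures, here is the arithmetic in Python. The three input numbers are the ones quoted in the talk; the variable names and the derived per-wavelength and per-cable values are my own working, not something Leslie presented.

```python
# Figures as quoted in the talk; everything below them is derived arithmetic.
WAVELENGTHS_PER_STRAND = 128
TBIT_PER_STRAND = 26.6
STRANDS_PER_CABLE = 100

# Capacity carried by each individual colour (wavelength), in Gb.
gbit_per_wavelength = TBIT_PER_STRAND * 1000 / WAVELENGTHS_PER_STRAND

# Capacity of a cable with every strand fully lit, in Tb.
tbit_per_cable = TBIT_PER_STRAND * STRANDS_PER_CABLE

print(f"{gbit_per_wavelength:.0f} Gb per wavelength")  # 208 Gb per wavelength
print(f"{tbit_per_cable:.0f} Tb per cable")            # 2660 Tb per cable
```

So each colour carries roughly 208Gb, and a fully populated cable would carry about 2.66Pb.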
Cables can be aerial, underground or trans-oceanic, and each has natural predators. Hunters shoot at birds sitting on aerial cables. Backhoes dig through underground cables. Anchors and sharks break through underwater cables.
Logical. A packet is about 1500 bytes. Most people only need to care about its destination address. A routing table moves packets from place to place. Historically, there were lots of differences between routers and switches, but less so now: switches are cheap and have a small routing table, while routers are expensive and have a large one. There are over 650k entries in the internet's IPv4 routing table.
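To make the routing table idea concrete, here is a minimal sketch of a lookup in Python. It assumes the standard longest-prefix-match rule used for IP forwarding, which the talk summary doesn't spell out. The prefixes and next-hop names are made up for illustration, and a real router uses specialised data structures and hardware rather than a linear scan.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop), using documentation-range addresses.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "transit"),
    (ipaddress.ip_network("198.51.100.0/24"), "peer-a"),
    (ipaddress.ip_network("198.51.100.128/25"), "peer-b"),
]

def next_hop(destination):
    """Return the next hop of the most specific prefix that contains destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The longest prefix (largest prefixlen) is the most specific route.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop

print(next_hop("198.51.100.200"))  # peer-b
print(next_hop("198.51.100.5"))    # peer-a
print(next_hop("203.0.113.9"))     # transit
```

The default route (0.0.0.0/0) matches everything, so it only wins when nothing more specific does.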
Connections. Transit is like ISPs for ISPs. It costs a lot and can mean suboptimal routes, so people do peering. Private peering is where two networks link to each other. But switch ports are expensive and it would mean a lot of messy cabling to maintain a full mesh of everyone linked to everyone, so it's typical to use an internet exchange. These connections typically happen in a Meet-Me room, a designated place in a colo facility for interconnections to happen.
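The full-mesh problem is easy to quantify. This sketch is my own simplification: it assumes one link per pair of networks for private peering, and one link per network to a shared exchange.

```python
def full_mesh_links(networks):
    """Private peering links needed for every network to connect directly to every other."""
    return networks * (networks - 1) // 2

def exchange_links(networks):
    """Links needed when each network connects once to a shared internet exchange."""
    return networks

for n in (5, 50, 500):
    print(f"{n} networks: {full_mesh_links(n)} mesh links vs {exchange_links(n)} exchange links")
```

For 500 networks that is 124,750 private links versus 500 exchange connections, which is why the mesh stops being practical quickly.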
This was an engaging and well-explained broad overview that didn't assume any prior knowledge. I really enjoy this kind of talk; it lets everyone in the room fill in the gaps in their knowledge, while solidifying what they already think they know. Leslie said that BGP would have been another 25 minute talk on its own, so she didn't cover it. That's a definite gap for me, so I hope she does that talk another time.
From Combat to Code: How Programming Is Helping Veterans Find Purpose. Jerome Hardaway, Vets Who Code
200K veterans return home every year. That number is huge enough that I assumed I'd transcribed badly, but no, that's the number. 9% of adults in the US are veterans. They face obstacles in finding jobs: lack of support, prejudice, and the stereotype that they're a "broken hero" or a risky hire.
At the same time, the tech industry complains about its skills gap, and has a reputation for lack of inclusion and empathy.
Jerome suggested that we can solve these two problems at once. Vets are a good fit for coding jobs: they have discipline, pay attention to details, learn quickly and are great critical thinkers. Tech often involves gathering context and making decisions without perfect information, which is familiar territory to military folks.
People in his program use a mix of traditional college courses, bootcamps and online resources. Some of them are doing all of those at once, which must be exhausting.
Jerome asked us to help by:
- asking our HR teams to integrate with military communities.
- asking recruiting teams to make contact with local bases and become part of TAPS programming: that's "Transition Assistance Programming", and it often doesn't know which are the right skills to emphasise. The Office of Public Affairs is a good way to make contact.
- offering mentoring to teach vets skills that are in demand. For example, Site Reliability is a great match for people who are experts in situational awareness and responding to incidents, but until recently Jerome hadn't known that the job existed.
Tracking Service Infrastructure at Scale. John Arthorne, Shopify
Shopify think of software like a race car: you could claim that the car and the driver win the race, but in reality it depends on a lot of other teams. In software, that includes your development environment, continuous integration, telemetry, bug tracking, logs, load testing, etc. Every service needs all of these, and the more services you have, the harder it is to stay on top of all of them. Especially if, like Shopify, you're pushing a new service every week.
They used to use "spreadsheet defined infrastructure" for tracking service health. They needed a better solution.
Their project had three stages:
Ownership for all services and apps. The collective ownership model was too prone to the bystander effect when something broke. It's also better to have a defined owner with a stake in the long-term decisions about the service.
They added a metadata file to each github repo, which included ownership. Changing the owner needs a pull request, so ownership becomes more deliberate.
Measurement. They made dashboards for each service, displaying and linking to all its relevant information: logs, monitoring, runtime, etc. They defined a simple model of service tiers, with specific needs at each tier:
| Tier | Needs |
| :--- | :--- |
| Experiments | Owner, security, resolve outages |
| Useful | Single owner, deploy automation, CI, standard dev setup, uptime monitoring, log retention, backups, SSL |
| Important | On call, alerts, metrics instrumentation, dedicated DB, load tested, rolling deploy without downtime |
| Critical | Playbooks, SLO, resiliency patterns, DC failover, scheduled load tests, security review |
Automation. Finally, they automated filing bugs for missing services, and added one-click automation to fill some gaps, for example "click here to enable uptime monitoring."
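Here is a rough sketch of how a tier model like the one above could drive that kind of automation. This is not Shopify's implementation: the identifiers are a simplified subset of the needs listed for each tier, the shape of the service record is invented, and I'm assuming that each tier inherits the needs of the tiers below it, which the talk didn't state explicitly.

```python
# Tier names follow the tier model above; the needs are a simplified subset.
TIERS = ["experiments", "useful", "important", "critical"]
NEEDS = {
    "experiments": {"owner", "security", "resolve_outages"},
    "useful": {"deploy_automation", "ci", "uptime_monitoring", "backups"},
    "important": {"on_call", "alerts", "load_tested"},
    "critical": {"playbooks", "slo", "dc_failover"},
}

def required(tier):
    """Needs for a tier, assuming (my assumption) it inherits those of lower tiers."""
    needs = set()
    for name in TIERS[: TIERS.index(tier) + 1]:
        needs |= NEEDS[name]
    return needs

def gaps(service):
    """Sorted list of needs the service has not yet met for its declared tier."""
    return sorted(required(service["tier"]) - set(service["has"]))

# Hypothetical service record, e.g. built from a repo's metadata file.
checkout = {
    "name": "checkout",
    "owner": "payments-team",
    "tier": "useful",
    "has": ["owner", "security", "resolve_outages", "ci", "backups"],
}

for gap in gaps(checkout):
    print(f"file a bug: {checkout['name']} is missing {gap}")
```

Each reported gap is a candidate for either an automatically filed bug or a one-click fix.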
John concluded that infrastructure investment is a tradeoff. More is not necessarily better, but always measure your progress, to make sure you're working on the right thing.
Tune Your Way to Savings! Sandy Strong and Brian Martin, Twitter
Sandy and Brian talked us through how Twitter selects and serves an advertisement: a selection of candidate ads is sent through two filters, first a cheap one, to winnow the options as much as possible, then a CPU-intensive one. The ads that survive the filters go through a couple of auctions and one is chosen to serve.
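The cheap-filter-first pattern is worth a small illustration. This sketch is mine, not Twitter's logic: the fields, the filter criteria, and the single highest-bid "auction" are all invented stand-ins.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Ad:
    name: str
    bid: float
    targets_region: bool  # hypothetical field that is cheap to check
    relevance: float      # stands in for an expensive model score

def cheap_filter(ad: Ad) -> bool:
    return ad.targets_region

def expensive_filter(ad: Ad) -> bool:
    # In a real system this would be the CPU-intensive step.
    return ad.relevance >= 0.5

def select_ad(candidates: List[Ad]) -> Optional[Ad]:
    # Run the cheap filter first so the costly one sees as few ads as possible.
    survivors = [ad for ad in candidates if cheap_filter(ad)]
    survivors = [ad for ad in survivors if expensive_filter(ad)]
    # Stand-in for the auctions: the highest bid wins, or None if nothing survives.
    return max(survivors, key=lambda ad: ad.bid, default=None)

candidates = [
    Ad("a", 2.0, True, 0.9),
    Ad("b", 5.0, False, 0.9),  # dropped by the cheap filter
    Ad("c", 3.0, True, 0.2),   # dropped by the expensive filter
    Ad("d", 1.0, True, 0.7),
]
print(select_ad(candidates).name)  # a
```

Ordering the filters this way means the expensive check runs on three ads here instead of four; at ad-serving scale that difference is the point.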
The ad server was a significant part of their infrastructure cost, so they chartered a project to reduce its resource footprint. Even a small improvement could mean large savings. They built a new team with a mix of skillsets: SRE, software engineers and hardware engineers.
They identified key metrics: the 'quality factor' of the advertisement, the serving latency, and the revenue generated per thousand queries. They also defined a success rate: 99.9% of ads had to be served at sufficient quality. They built a controller to maintain that success rate, while adjusting other factors.
The ad server had run on Twitter's shared Mesos infrastructure, with CPU throttling enabled to make sure it was a good neighbour. The first step was to build an isolated, dedicated instance of Mesos. This allowed them to run without throttling, to see what the ad server could do with more resources.
They ran over 60 experiments including:
- different container shape. They tried shrinking the containers and sharding wider, and also running fewer, taller instances.
- system changes. Different schedulers. Using Hugepages, i.e., pagesizes of up to 1GB instead of the default 4K. NUMA/CPU pinning.
- Hardware changes. Enabling turboboost.
They were able to save 6% of CPU. This was down from 17% in the lab, but was still significant. They also decreased the load on their downstream services.
Their advice on running experiments:
- change only one thing at a time
- be realistic about how long it will take; expect to re-run some experiments
- identify the metrics you're trying to move before you start running the experiments. Stay focused and don't get sidetracked by microoptimisations.
- resist confirmation bias.
Twitter has a blog post about their work: Resilient ad serving at Twitter-scale by Sridhar Iyer.
Every Day Is Monday in Operations. Benjamin Purgason, LinkedIn
Benjamin's team has a bunch of axioms that they keep returning to.
Every day is Monday in Operations. Services never sleep; they're constantly changing. We have to keep finding the most impactful, painful problems and fixing them. Bugs that would be unlikely in single-node systems are always going to appear at scale.
He told a story of his team fixing a bug that had existed for years. It took 750 person-hours to fix, between six people over 2 weeks. (That averages at 12h per person per day, which is not a good way to live, but I think part of the point was that this was how LinkedIn used to be, and they're not any more.)
Their outage was caused by a change they hadn't known about. Don't forget that you can be affected by changes that are outside your immediate environment. Peer review changes just like you review code.
Site up. Uptime is the most important thing. In the middle of a firefight, it's difficult to retain perspective and see the bigger picture, so ask for help. And plan ahead. Benjamin told a story of someone who was asked how he was going to scale a system he'd designed. "I plan to retire before it becomes a problem". Which is amusing, but the guy actually did retire. The team were left with a system that couldn't scale and had to be replaced.
What gets measured gets fixed. Measure MTTR and MTTD.
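As a sketch of what measuring these could look like: definitions vary between organisations, so this assumes MTTD is the mean time from an incident starting to it being detected, and MTTR is the mean time from it starting to it being resolved. The incident log is made up.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (started, detected, resolved).
INCIDENTS = [
    (datetime(2017, 3, 1, 9, 0), datetime(2017, 3, 1, 9, 10), datetime(2017, 3, 1, 10, 0)),
    (datetime(2017, 3, 5, 14, 0), datetime(2017, 3, 5, 14, 20), datetime(2017, 3, 5, 14, 50)),
]

def mean(deltas):
    """Average a non-empty collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)

# Assumed definitions: both are measured from the start of the incident.
mttd = mean(detected - started for started, detected, _ in INCIDENTS)
mttr = mean(resolved - started for started, _, resolved in INCIDENTS)

print("MTTD:", mttd)  # MTTD: 0:15:00
print("MTTR:", mttr)  # MTTR: 0:55:00
```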
You are only as good as your lieutenants. Let your leads lead. Don't get in their way, even if you'd do it better. Create a culture of empowerment.
Don't assume anything. If you don't understand, ask, even if it seems like a simple, painfully obvious question. Even obvious things get missed.
LinkedIn has a series of blog posts on this topic at https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason
And that was day one!
I've written about day two over here.