Conference report: SRECon Americas Day 3

Day three of SRECon! (Day two is over here. Day one is over here.) Did anyone else discover that little path beside the creek a few minutes' walk from the hotel? It was lovely! Operation: Not A Conference Zombie was definitely helped by the flowers and geese and birdsong. Good work, Santa Clara! Of course, then I made the rookie mistake of taking a red-eye flight home, so I'm still a zombie after all. ¯\_(ツ)_/¯

On to the talks! I had meetings in the middle of the day that I couldn't reschedule, so I missed a few sessions :-( Same caveats as yesterday: if I made any mistakes or misrepresented anyone, or if I missed livetweets for some of these, mail me at heytanyafixyourblog@noidea.dog. Or comment here.

Containerization War Stories. Ruth Grace Wong and Rodrigo Menezes, Pinterest

Config management is hard and people are bad at it. Pinterest found that master-based Puppet was a bottleneck and didn't have enough safeguards. They were three years (and counting!) into a migration to masterless Puppet, fighting canaries and autoscaling, when they decided that configuration management was just the wrong tool for the job. They started using Docker instead.

Docker offered a consistent interface. Containers run the same way in dev as in prod, and it provided a good impetus to standardise how they set up services and sidecars.

Pinterest isn't doing container orchestration yet; they have just one service per VM. They tried Mesos and then decided Kubernetes works better for them, and unfortunately ended up having to maintain both, at least for a while. 

Containers are good because you get packaged immutable dependencies. Failures now happen at build time rather than at deploy or runtime. But it came with a trade-off: they used Docker bridge networking and found a 20% increase in latency. 

There were some other problems along the way. Zookeeper was locking up every few hours with nothing useful in the logs. The JSON log writer was blocking and when it couldn't write, it stopped the service. Now they use a ring buffer, but that overwrites old messages when the buffer is full, so they have to be careful.
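If you haven't hit that trade-off before, here's roughly what it looks like. This is a minimal Python sketch of a ring-buffer log handler, as an illustration only -- it's not Pinterest's setup or Docker's logging code. Writes never block, but once the buffer fills, the oldest messages are silently dropped:

```python
import collections
import logging


class RingBufferHandler(logging.Handler):
    """Keep only the most recent `capacity` log records in memory."""

    def __init__(self, capacity=1000):
        super().__init__()
        # A deque with maxlen silently discards the oldest entry when full,
        # which is exactly the "have to be careful" part mentioned above.
        self.records = collections.deque(maxlen=capacity)

    def emit(self, record):
        # Never blocks on a slow or broken writer, unlike a blocking JSON logger.
        self.records.append(self.format(record))


logger = logging.getLogger("demo")
logger.addHandler(RingBufferHandler(capacity=3))
for i in range(5):
    logger.warning("message %d", i)  # only messages 2-4 survive in the buffer
```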

Cron has a security feature that doesn't allow it to run crontabs with more than one hardlink, but OverlayFS uses hard links to save space, so cron jobs were quietly failing.

(ed: I went down a rabbithole this evening trying to understand the cron security issue here and my best guess is this: you can create a hard link to a file even if you don't have write access to it. So, if there's a program that runs as root and writes a file called /path/to/my/file, and you have access to edit that file, you can delete it and hard link it to /var/spool/cron/crontabs/someuser instead. That means you can trick the program into overwriting someuser's crontab even though you don't have access to edit it. If someone has seen an example of this, or if I have it wrong, leave a comment here or drop me a line at thathardlinkthing@noidea.dog and I'll update this.
PS: say *why* and not just *what* in your changes, everyone! Future people will be wondering what you were thinking.)
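For the curious, the check itself is just a link count on the crontab file. Here's a hypothetical Python sketch of my understanding of it -- this is not cron's actual source, and the default path is only illustrative:

```python
import os
import sys


def crontab_link_count_ok(path):
    # st_nlink is how many directory entries (hard links) point at this inode.
    # cron treats anything other than exactly 1 as suspicious, because a second
    # link could mean someone aimed another program's output at this crontab.
    # OverlayFS's space-saving hard links trip the same check by accident.
    return os.stat(path).st_nlink == 1


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/crontab"
    status = "looks fine" if crontab_link_count_ok(path) else "has extra hard links"
    print(f"{path}: {status}")
```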

Crashes were leaving zombie processes, so automatic recovery wasn't working. They set --init on the Docker daemon to have zombies reaped. And the OOM killer occasionally killed the Docker daemon.
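In case --init is new to you: it runs a tiny init process as PID 1 inside the container, and that process's main job is reaping. The sketch below shows the general idea in Python; it's an illustration of zombie reaping, not tini's or Docker's actual code:

```python
import os
import signal


def reap_children(signum, frame):
    """Collect every exited child so it doesn't linger as a zombie."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none have exited yet
        # Reaching here means the kernel has released that child's entry.


# A real PID 1 (like the init Docker injects) installs this early, then
# execs or supervises the actual workload.
signal.signal(signal.SIGCHLD, reap_children)
```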

Other things that got flagged as "Docker issues" weren't really about Docker. Sometimes people were making multiple changes at once: moving to Docker and changing the OS and upgrading the JVM, and it wasn't clear which of those had caused the problem. They encouraged making one change at a time!

Converting Puppet configs to Docker configs was slow and tedious, so they built automation to do it. Provisioning time went from 30m to 5m. They migrated 25% of their infrastructure in a year.
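I don't know what Pinterest's converter actually looked like, but the flavour of the job is something like this toy sketch, which pulls package resources out of a hypothetical Puppet manifest and emits the equivalent Dockerfile line:

```python
import re

# Hypothetical Puppet manifest snippet -- not Pinterest's real config.
MANIFEST = """
package { 'nginx':  ensure => installed }
package { 'redis':  ensure => installed }
"""

# Grab the names from Puppet `package` resources...
packages = re.findall(r"package\s*\{\s*'([^']+)'", MANIFEST)

# ...and emit a Dockerfile instruction that installs the same things.
print("RUN apt-get update && apt-get install -y " + " ".join(packages))
```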

Containers are (relatively) easy, but migrations are hard. Focus on immutable deployments and auto-recovery, migrate one thing at a time, and automate everything.

I enjoyed this retrospective on the migration. It illustrated how many jagged edges our tooling still has, even for something as widely used as Docker. 

Abstract: https://www.usenix.org/conference/srecon18americas/presentation/wong
Livetweets: https://twitter.com/lizthegrey/status/979380457069363200
Follow: @ruthgracewong

 

Resolving Outages Faster with Better Debugging Strategies. Liz Fong-Jones and Adam Mckaig, Google


Liz has been at Google for ten years and worked on tons of teams. Adam came to Google recently, was blown away by the debugging tooling, and made the talk he wishes he'd had years ago. They introduced three techniques that Google's observability tooling uses to shorten the OODA loop and make debugging faster.

In the interests of getting this post out today, I'm going to be lazy and link to my notes on it from when I saw the same talk at DevOps days. This talk is so good I watched it twice!

Liz and Adam observed that adding a new dashboard is a form of technical debt. I hadn't looked at it like that before! They're working on getting these techniques integrated with Grafana and Honeycomb.

(ed: Panopticon is one of the things I miss most from Google. Combined with ubiquitous horizontal monitoring of tasks, it means you can dip in and slice and dice and correlate multiple streams. It's incredibly powerful and I hope Liz and Adam are successful at getting these kinds of features in other observability tools.)

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/fong-jones
Livetweets: https://twitter.com/xleem/status/979391287341142016
Follow: @lizthegrey, @adam7mck

 

Breaking in a New Job as an SRE. Amy Tobey, Tenable

Amy noted that most job descriptions are not accurate, so it's worth being clear about what job you're signing up for before you join a new team. And know what you want: salary, stock, relocation costs, what kind of commute you'll have, how much you {have to, get to} travel, what kind of computer you'll use and whether you can expense your phone bill.  

Own your own onboarding as much as possible. Changing jobs is a good opportunity to learn new skills and tools: if your new team mostly uses tmux, try it out, even if you're a lifelong screen user. 

In your first week, start exploring code and documentation, and take a ton of notes. Make friends with the people in IT. Generate a new ssh key and new passwords; never reuse them between jobs. 

Some people like to deploy on the first day, but certainly by week two you should be pushing yourself to run playbooks, deploy code, make whatever real changes will help you learn the system. Find a project that's in your wheelhouse and do some small thing. Send a few PRs or update some documents. Shadow oncall and interrupts. And start building relationships.

A month in, you'll know whether you want the job. (ed: +5 insightful). Write a document with your perspective on how it's going, good and bad. Right now, you have 'beginner's eyes'. You can see things that people who've been there a long time won't, and that later you won't be able to see. The document can be just for you, or you can share it with your management.

Use active listening to help you bootstrap faster. This is also your time to set your own pace. Understanding how you work is how you'll take care of your own health.

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/tobey
Livetweets: https://twitter.com/msuriar/status/979401072052600832
Follow: @MissAmyTobey

 

Architecting a Technical Post Mortem. Will Gallego, Etsy

Will opens every post-mortem meeting with two questions: "Who has never taken part in a post-mortem meeting before?" "Why do we do post-mortems?"

Blame is a barrier to getting deep insights from an incident but, ironically, talking about 'blameless' post-mortems can itself be a barrier, because we're all terrified of saying something that might point at someone. (ed: yes! I have seen this!) Better to be 'blame-aware'. We can say what happened, even if that means naming names! There's just no guilt attached to it.

Will's definition of a post-mortem meeting is "the application of a learning culture through shared discussion of our beliefs on what transpired over an agreed-upon limited number of events". Not a catchy definition, but there's a lot of truth in there:

learning culture: We're here to learn more about our software, and learning doesn't come automatically. The post-mortem meeting leader has to consciously take the meeting there.

 "As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly." -- http://stella.report

"As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly." -- http://stella.report

shared discussion of our beliefs: no one person's knowledge is going to be enough. Questions for the room: "Who feels like they understand every line of code in their company?" (Laughter) "Who feels like they understand every microservice?" Still nobody. "What about every aspect of the stack you own?" Two people were willing to claim that level of knowledge. 

an agreed-upon limited number of events: we have finite time for digging into every incident. We need to choose where to spend our time.

A post-mortem facilitator shouldn't work solo. There should be a note-taker and maybe a co-facilitator and ideally everyone who was an actor in the event. It's not possible to force sharing, and this has to be about sharing; you can't force anyone to be there. The meetings should be open invite.

The facilitator doesn't need to be an expert in the topic under discussion. In fact, it's good if they don't know anything about it, because they'll feel comfortable asking questions that the people who are "supposed" to know are afraid to ask.

Before the meeting, gather a timeline by reviewing chats and logs and talking to the people involved. Try to schedule the interview process within two days of the incident or memories will be fuzzy, and schedule the meeting within two weeks, ideally one. That's how long you have before people stop caring and move on to the next thing.

Timebox the meeting. The first five minutes are a good time to clarify why we're here and introduce the actors, and it gives a little time for stragglers to arrive. Then 30-40m of telling the timeline. The lead shouldn't do this; the actors need to tell their own story. Highlight and dig into the inflection points, where the incident changed in some way. Leave 10m for followup Q&A, and remediation, if needed.

The facilitator's job is to get people to open up: what were their assumptions and how did they change? Was acting the right decision? How did they know when to act? Were the documents useful? The facilitator's job is not to answer questions, but to get knowledgeable people to say things out loud that they think are common knowledge. (ed: yes!)

Avoid counterfactuals --  the actions that should have happened --  and be aware of strong emotions. Notice who is (or feels) under attack. People who are scared don't open up, so talk to them before the meeting and make them feel safe.  If there's anger, note it and give people a chance to de-escalate. Remember that everyone is trying to do a good job. Empathy is a powerful tool in engineering.

Anyone who says "failure is not an option" doesn't understand systems. We adapt to failures and remind ourselves that it could always have been worse. 

This was a great session! In the hallway-track conversation afterwards, Tonmoy Ghosh mentioned a technique that they use at eBay: they schedule a director-level followup meeting a month after the post-mortem meeting to revisit the action items. In that meeting, some bugs just get closed -- in hindsight, they turn out not to be useful -- but the promise of a future discussion makes sure that the most important remediation gets attention. I hadn't heard of people doing this before, and it sounds like a good technique to make sure actions don't get dropped, and bugs don't stick around forever as technical debt.


Links: 
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/gallego
Livetweets: https://twitter.com/qkate/status/979464087162601472
Follow: @wcgallego

 

Your System Has Recovered from an Incident, but Have Your Developers? Jaime Woo

How many people in the room have gone through a process to make sure their services have recovered? Most people. How many have made sure their operators have recovered? Very few. Maybe that's because people are harder than computers and we focus on the systems we understand. Let's look at the people.

Jaime did a straw poll of 40 engineers: "After an incident, how stressed are you?" 42.5% said they were 'stressed' or 'very stressed' after an average incident. They reported worse ability to sleep, concentrate and be social, and also worse mood. A third of people said they couldn't enjoy things as much.

Btw, the graph on that slide showed raw numbers (out of 40), not percentages. So > 50% for the first bar, not 20%.

Jaime compared us to other groups of people with stressful jobs: doctors, stand-up comedians and olympians.

Doctors face life and death situations and success often means being perfect. In coping with medical error, medical practitioners are considered secondary victims. In studies, physicians report sleeping difficulties (42%) and anxiety about future errors (61%). 82% reported that peer support helped them feel better. Another question in the straw poll: how often do coworkers reach out to see how you're doing? Most people said never/rarely. Nobody said 'always' :-(

Comedians voluntarily stand up on stage under bright lights and tell jokes to crowds that might be drunk or hostile. Jaime asked comedian friends how they do it, and what happens when they're failing on stage. During the incident, they acknowledge to themselves that it's going badly, and sometimes even acknowledge it to the crowd. They treat it as a learning opportunity and they don't aim for perfect success: a "batting average" of one bad show in ten isn't bad. They strive to understand what happened and try to mentally get back to a better place. That doesn't come for free; it's a conscious attempt to recover.

Olympians train their whole lives for an opportunity that may be over in minutes, and they perform on a global stage. Researchers have worked with them on "self-compassion": they're asked to journal understanding, kindness and concern for themselves like they would for a friend in the same situation. They record their 'state rumination', using questions like "Did you find it hard to stop thinking about the problem afterwards?" or "When thinking about the problem, did your thoughts dwell on negative issues?" The group who did the exercises had a statistically significant drop in state rumination.

All of these groups need intentional recovery. It's not innate. You have to put in the work.

Recovering from incident response is not about being tough. If it affects doctors, comedians and olympians, it affects us.

This talk was amazing and important and I hope our industry starts to talk about this more.

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/woo
Livetweets: https://twitter.com/msuriar/status/979475872146010112
Follow: @jaimewoo
 

The History of Fire Escapes. Tanya Reilly, Squarespace

It went ok. Kind of a depressing topic though, and I'm determined to write a talk about baby sea otters or something next year. In the meantime, if you want to read more about how New York City's fire code evolved, or about major software disasters, I've listed my references at http://noidea.dog/fires

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/reilly
Livetweets: https://twitter.com/msuriar/status/979490623127404546 , https://twitter.com/lizthegrey/status/979491714510589952 
Slides: https://www.slideshare.net/TanyaReilly/the-history-of-fire-escapes
Follow: @whereistanya

 

Leaping from Mainframes to AWS: Technology Time Travel in the Government. Andy Brody and James Punteney, U.S. Digital Service

The USDS came out of the healthcare.gov fiasco. They're a mix of tech and non-tech folks serving tours of duty, working with various groups like veterans and Medicaid recipients. (ed: as someone on the path to citizenship, I can't overstate how much they improved the process over what it was a couple of years ago. It's night and day. I love the USDS.)

Everyone asks them what it's like working for the Government in the current administration. Their day to day hasn't changed very much, and they have the same mission as before.

Serving the American people is important but it comes with a downside: lots of bureaucracy. "It's illegal to use cloud services", someone will claim, and they'll have to spend time finding the law that person is referring to and demonstrating that it doesn't actually say that. 

Government tech can feel like a timewarp. "1999 was a great year in technology!" (Hahaha). There's lots of technical debt, a spaghetti of code, a single-point-of-failure enterprise service bus. It took six months to spin up a single VM on the government private cloud. Can we move from there to microservices on serverless or whatever the buzzwords of the day are? Not easily!

There are three hard things in government: hiring, firing and buying. 

Hiring:  Salaries are capped and there are no stock options. High performers can't get rewarded, so low performers are extra demoralising.

Firing: it's hard to measure performance. Users are locked in; they don't have alternatives to the software they use. The top 4000 positions in the government are political appointees, but we don't want the others -- the career civil servants -- to be fired for their political beliefs when the administration changes. So firing someone is justifiably difficult.

Buying: we need to avoid nepotism and corruption, so buying involves complicated contracts. The same group of people buy furniture, battleships and software. (!) 

Government launches are usually planned years in advance, with hundreds of people and mountains of paperwork. Everything is waterfall method: the agencies assume you can spec the whole thing in advance and get the software in a few years.

Not at the USDS! An example launch was Trusted Traveler and login.gov. The Government has thousands of separate login systems, including one for healthcare.gov that cost $100M and caused 70% of the first year's downtime. login.gov was a collaboration between USDS and 18F to provide a single source of identity. It's on public cloud and it's open source! You can send a PR to the Government login system. (The whole room had a simultaneous reaction of "...WHAT?!". It was great.)

The US Customs and Border Protection screens 390 million travellers a year. Their old Global Entry site was so bad that people would pay other people to sign up for it for them. The USDS worked with CBP for nine months to build a new site, using modern approaches like CI/CD and an ELK stack.

They wanted a phased launch, but the Government likes big-bang launches. They showed us a graph of the new site hitting 30k qps in the first hour or so... until they discovered a bad vulnerability and needed to turn it off. "We got the site down inside 10m, which is a government record."

They fixed it and turned the site back on later that day. All was good until "I was woken by PagerDuty, my favourite alarm clock". A traffic spike had caused autoscaling problems and they hadn't set a minimum count of instances. As soon as the systems started failing, the health checks decided everything was unhealthy and terminated them all. 
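The fix boils down to "never let the group scale to zero." Here's a minimal sketch, assuming AWS Auto Scaling driven from boto3 -- the talk didn't say which APIs or tooling they actually used, and the group name is made up:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="trusted-traveler-web",  # hypothetical group name
    MinSize=2,          # health checks can never terminate the group to zero
    DesiredCapacity=4,
    MaxSize=20,
)
```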

Next they had some problems with being throttled by third-party APIs, and then some "Heisenbugs", the kind of bugs that seem to disappear as soon as you try to isolate them. But they figured it out.
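For the throttling half, the standard defence is to retry with exponential backoff and jitter. A generic Python sketch, not the USDS code -- the exception type here is hypothetical, and in practice you'd key off HTTP 429s from whatever client you're using:

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for 'the third party told us to slow down'."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # Wait longer after each failure, with jitter so retries spread out.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```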

They were unhappy with how the launch had gone, but the agencies were delighted: it was the most successful launch the Government had ever done. The outages even helped, because they were able to demonstrate why they needed the monitoring and other systems they'd been asking for.

The USDS has a playbook of 13 best practices, which are online at https://playbook.cio.gov. Most of them are about process, e.g., having one leader, not a committee. The Government tends to fire people who preside over failures, so it's a big change to build a culture where failure is celebrated as a learning experience. 

Trusted Traveler had 50 pushes in its first month. There have been half a million applications completed using it, and the team is now addicted to real-time metrics and dashboards. And USAJOBS (how you apply for a Government job) just started using login.gov.

This is just the beginning of what Government can do with tech that works. Sign up for a tour of duty at https://www.usds.gov/join.html. (And thanks again for the better citizenship process, USDS <3).

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/brody
Livetweets: https://twitter.com/lizthegrey/status/979502552218664960
Follow: @alberge

Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work! Thomas Limoncelli, Stack Overflow, Inc.

Stack Overflow's 2016 April Fools' joke was a Tamagotchi-like egg. They'd load tested it but hadn't considered the effect on the network: they accidentally DoSed themselves. But they calmly took the joke offline, fixed the bug, redeployed the code and relaunched. Were they lucky? No, it was part of their plan. They had a design for operational excellence in April Fools jokes.

It is traditional for a talk author to include a slide with their credentials, so you know why you should trust them on the talk topics. Tom's is that he co-published a book of all the joke RFCs there have been, with added commentary. It makes him about $1 in royalties a year.

To be funny, an April Fools prank needs to be both topical and absurdist. And it should be a surprise. For example, when Google had just bought a facial recognition company (so facial recognition was topical), they put a moustache and glasses on all of the photographs in the internal corporate directory.

Stack Overflow had a good one about two-factor authentication called Dance Dance Authentication (YouTube video here; it autoplays, so maybe don't click it if you're reading this while pretending to pay attention in a meeting).

What makes an April Fools joke not funny? Hurting people, punching down, or inside jokes.

Best practices for jokes:

Use feature flags. Flip a flag to enable the prank (there's a small sketch of the pattern after this list).

Load test. "We all load test, right? Lower your hands. Load testing is the flossing of IT". (Hahaha.)

Dark launches. Deploy the feature in the browser but don't display anything, like Facebook did for six months before they launched Messenger, or Google did with IPv6.

Involve all the silos. Tell marketing, PR, sales, eng, launch control, executives. (ed: and legal!)

Do a retrospective afterwards. Was it a success? What did you learn? 
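To tie the feature-flag and dark-launch items together, here's a minimal sketch of the pattern: the prank code ships dark behind a flag, and launch (or rollback) is a single flag flip. This is an illustration, not Stack Overflow's code; the flag store is just a dict, where a real system would use a config service.

```python
FLAGS = {"april_fools_egg": False}  # shipped dark, off by default


def render_homepage(user):
    page = [f"normal homepage for {user}"]
    if FLAGS.get("april_fools_egg"):
        # The prank only renders once the flag is flipped on.
        page.append("tamagotchi egg widget")
    return page


# Launch day: flip the flag on. If the prank DoSes you, flip it back off --
# no redeploy needed.
FLAGS["april_fools_egg"] = True
print(render_homepage(user="alice"))
```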

You can be a bit lazy and use other people's resources, e.g., if your joke is a video on YouTube, you don't need to capacity plan for it unless you're YouTube SRE.

A great joke would be for the golang team, who are used to fielding complaints about lack of generics, etc, to announce "Go 3.0 is here and it includes all this stuff everyone always asks for" and then the link takes you to the Java download page. (ed: He's not wrong. That is comedy gold.)

Also, psych: this was really a talk about deploying anything, disguised as a talk about April Fools pranks. (The talk was on March 29th, so pretty close.)

That was a funny talk. Nice work, Tom :-)

Links:
Abstract: https://www.usenix.org/conference/srecon18americas/presentation/limoncelli
Livetweets: https://twitter.com/msuriar/status/979510555823161344 , https://twitter.com/lizthegrey/status/979510940327591937
Follow: @YesThatTom

And that was SRECon Americas 2018! The survey is at http://bit.ly/2FRYTAs so if you were there, tell the program committee what you liked and what you didn't!

 

Here are some other stray observations, in the form of opinions I didn't have before the conference but do have now. These are  suuuuper incredibly subjective and I wonder how much other people agree with them:

  • Multi-cloud sounds like it's still more trouble than it's worth, and we're going to need a better story for it. 
  • Docker has become a thing people expect SRE types to know like, say, the Linux command line, or the difference between TCP and UDP, or what a PR is. Like any of those, you can certainly get away with not knowing them, but people will not check whether you do; they'll just start talking. 
  •  It feels increasingly weird that Amazon isn't at SRECon or LISA. The GCE and Azure people are all chatting about what they're up to and there's this weird kind of uncomfortable space where AWS should be. 
  • The gap between what we (as in the whole industry) say our best practices are and what we're actually doing is huge. At least half the room said they couldn't do a repeatable build. Very many of us are muddling our way towards a microservices and/or cloud-based architecture, all following different paths. There's no standard/representative story; we all have to make our own journey. (I'm imagining us all as emperor penguins making an epic waddling trek across the ice and avoiding seals. And now you are too.)
  • Kubernetes won. All of the other orchestration people can go home. I have no idea how they managed to make us all excited about container orchestration, but it's been a long time since I've seen people so genuinely excited to play with a piece of infrastructure. I mean, I am too. I'm not judging.
  • We've been slowly divesting ourselves of heroism for the last few years, but we're really starting to frown on nighttime pages and long hours now. Good.
  • Red eye flights are stupid and what was I even thinking.

I'm not going to make it to SRECon in Dusseldorf or Singapore, but SRECon Americas 2019..... 

 

...is in Brooklyn! YES! Hope to see a lot of you there :-)
