Conference Report: LISA 2017
I went to LISA 2017! It was excellent! I vowed to write a shorter trip report this year, but there was so much good stuff so whatchagonnado /o\ Here’s the notes I took about what I saw and what I liked. They might not necessarily represent the most important parts of the talks; they’re just what made an impact with me. If I have any errors in here, please let me know and I’ll fix it. Here’s what I saw:
Never Events, Matt Provost, Yelp
“Never Events” are how the NHS refers to the most serious medical errors. These are failures that should have been wholly preventable: systematic barriers were in place to stop them, but nonetheless they happened. There’s a formal framework for managing and reporting them, and we can learn a lot from how they’re handled.
The NHS publishes a list of all Never Events every year; last year they had around 400 of them. The most common is “Wrong Site Surgery” – the correct surgery but on the wrong tooth or limb or eye – and “retained foreign object” – leaving something inside the body.
The number is remarkably low, given the huge number of procedures the NHS undertakes. Part of this is because of checklists: the surgeon, anaesthetist and nurse all have to speak out loud the patient’s name and the surgical site, and the patient has to verbally identify themselves and consent. After the procedure, they count the sharps and swabs to make sure nothing is left inside.
The retained object events are an example of making a change during an incident and forgetting to undo it. Checklists don’t help if it’s over a long period of time: think of the outages caused by not remembering to renew a domain name or SSL certificate, or blocking an IP during an outage and forgetting to put it back! Create tickets during the incident to clean up temporary fixes. Keep an eye out for it when writing the post-mortem… “we did this… did we remember to undo it?”
The majority of the Retained Foreign Object events were during childbirth, when there’s only one surgeon in the room: no peer-review! Similarly, when you’re on call, you can be on your own. If you’re doing something dangerous, get a second person online to review the command you’re typing.
Every Never Event is recorded, whether it actually causes injury or not. Matt compared this to the DROPS initiative, which focuses on preventing dropped objects in the oil and gas industry. They require reporting all drops, even if they didn’t cause an injury. Contrast that with the airline industry, which doesn’t need to report near-misses. (Yikes!) There was one of these at SFO recently: a plane came within 59 feet of the ground, in what would have been the worst aviation disaster in history if the pilots hadn’t course-corrected in time. But there’s no legal requirement to report that, and nobody did for 24 hours, by which time the cockpit voice recorder’s data had been erased. (More yikes!) The NHS has a deadline for each stage of reporting, root cause analysis, etc.
Nearly all of us have, at some point, committed to the wrong git branch or removed the wrong directory. We should report these, even if they didn’t cause an outage.
This was one of my favourite talks of the conference. A few people mentioned the DROPS thing later and each time I pulled out the scrap of paper where I’d written “RECORD ALL DROPS!!!”.
ChatOps at Shopify: Inviting Bots in Our Day-to-Day Operations. Daniella Niyonkuru, Shopify
I had this same conversation like four times:
someone: I don’t know what I think about this ChatOps phenomenon.
me: watch Daniella talking about it and you’ll start thinking it’s cool
ChatOps is “conversation-driven development”: it’s integrating chat channels (mostly Slack) with your infrastructure. Shopify use one called Spy, based on the Lita framework. It’s connected to Github, PagerDuty, Jira, etc, and it’s easy to expand it to do different things with just a few lines of code. Here’s some stuff it can do: show CDN traffic, return the Nginx status, profile a binary and return a flame graph, return a list of the versions of things that they’re running, etc.
And that’s just the reporting commands. It can also manage infrastructure and do failovers. It’s got a workflow for calling out to authentication in another tab.
Before Shopify wrote this automation, they used to have engineers come in during the stores’ quietest time and work through a long checklist. Now they’ve got 90 second failovers, and it’s all triggered from the Slack channel.
During incidents, Shopify uses a model called the Incident Funnel, and Spy rides along with them the whole way. It can be used to page someone and start the incident. It will suggest who else might need to be paged, e.g., telling the social media team that the incident should be announced externally. It reminds you to update the status page, and it will reach out to the incident commander and tell them if one person has been handling an incident for too long, to prevent on call fatigue.
Other folks in the channel can learn the commands by watching people type them, and Spy will suggest the correct syntax if the user’s getting it wrong.
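Here’s a minimal sketch of what “a few lines of code per command” and “suggest the correct syntax” could look like. It’s plain Python, not Lita (which is a Ruby framework) and not Shopify’s actual code: the `command` decorator, the `handle` function and the two command names are all made up for illustration, and the handlers just return placeholder strings.

```python
import difflib

# Hypothetical registry mapping chat command text to a handler function.
COMMANDS = {}


def command(name):
    """Register a handler function under a chat command name."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register


@command("nginx status")
def nginx_status():
    # A real bot would query the infrastructure here; this is a placeholder.
    return "nginx: (status would be fetched here)"


@command("cdn traffic")
def cdn_traffic():
    return "cdn: (traffic graph would be fetched here)"


def handle(message):
    """Run the matching command, or suggest the closest known one."""
    text = message.strip().lower()
    if text in COMMANDS:
        return COMMANDS[text]()
    close = difflib.get_close_matches(text, list(COMMANDS), n=1)
    if close:
        return f"Unknown command. Did you mean `{close[0]}`?"
    return "Unknown command. Try: " + ", ".join(sorted(COMMANDS))


print(handle("nginx status"))
print(handle("nginx staus"))  # typo: the bot suggests the right syntax
```

Adding a new command is one decorated function, which is roughly the property that makes this kind of bot easy to grow.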
Spy can also add people to git repos, deploy, lock deploys, communicate, and do a million other things. This seems like enough unrelated things that I asked about whether Spy has a dedicated product manager. No, but it has a fully staffed team.
Daniella concluded that Spy has improved their knowledge sharing and on-call focus, eliminated manual toil, improved onboarding and overall made incident handling smoother.
Distributed tracing: from theory to practice. Stella Cotton, Heroku
Microservices and polyglot architectures make it hard for us to trace requests by looking at logs. We might not have enough data for statistical analysis and we might not want to wait for the logs to be aggregated. Worse, asynchronous calls with delays may mean that we’ll see the effect before the cause. “You can’t tell a coherent macro story about your application by monitoring individual processes”. Humans are bad at guessing why the web requests just got slow. So we have distributed tracing!
Google published the Dapper paper in 2010. Zipkin was a Twitter hack week project to implement the paper.
Tracing requires a tracer inside every application, annotating individual requests. A trace is the end to end record of the request. Each ‘hop’ along the way is called a span. At each node, the span will have the same trace id, but a different parent id. The traces are then all aggregated in one place. Although the trace ids are propagated in the request headers, the reporting happens out of band and doesn’t block the thread.
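To make the trace id / parent id relationship concrete, here’s a minimal sketch in Python. It isn’t Zipkin’s API or Heroku’s code: the `Span` class, the helper functions and the `X-Trace-Id` / `X-Span-Id` header names are all hypothetical, and the out-of-band reporting step is left out entirely.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every hop of one request
    span_id: str              # unique to this hop
    parent_id: Optional[str]  # span_id of the caller, None at the root
    name: str


def new_id():
    """Generate a random 64-bit id as a hex string."""
    return secrets.token_hex(8)


def start_span(name, incoming_headers):
    """Continue the caller's trace if there is one, otherwise start a new trace."""
    trace_id = incoming_headers.get("X-Trace-Id") or new_id()
    parent_id = incoming_headers.get("X-Span-Id")
    return Span(trace_id, new_id(), parent_id, name)


def outgoing_headers(span):
    """Headers to attach to any request this service makes downstream."""
    return {"X-Trace-Id": span.trace_id, "X-Span-Id": span.span_id}


# Three hops of one request: frontend -> api -> db.
frontend = start_span("frontend", {})
api = start_span("api", outgoing_headers(frontend))
db = start_span("db", outgoing_headers(api))

for span in (frontend, api, db):
    print(span.name, span.trace_id, span.span_id, span.parent_id)
```

All three spans print the same trace id, and each one’s parent id is the span id of the hop before it, which is what lets the aggregator stitch them back into a tree.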
All of the data forms a directed acyclic graph, which can be modelled as a data structure, or displayed in Gantt charts, which are also called swimlane format (I hadn’t heard that before and I like it). It’s easy to see what blocks what and what’s happening in parallel. A wider gap between two spans may show that requests are queueing. Tracing solutions will usually sample requests rather than trace every one.
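The talk didn’t go into how the sampling decision gets made, so treat this as one plausible approach rather than what any particular tracer does: decide from the trace id alone, so that every service that sees the same request reaches the same answer without coordinating. The function name and the assumption that trace ids are hex strings are mine.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, deciding from the trace id alone.

    Assumes trace ids are hex strings. Because the decision depends only on
    the id, every hop of a request makes the same keep/drop choice.
    """
    bucket = int(trace_id, 16) % 10_000
    return bucket < rate * 10_000


print(should_sample("00ff", 0.05))  # bucket 255 is below 500, so it is kept
print(should_sample("00ff", 0.01))  # bucket 255 is not below 100, so it is dropped
```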
There are a bunch of off the shelf solutions like LightStep and TraceView. Zipkin is the most popular open source offering. Heroku use that: they got it up and running in a few months of part time work, but instrumenting every app in the company takes longer!
OpenTracing standardises instrumentation APIs, which makes it easier to switch between vendors if you decide to do that. Interoperability is still terrible. Language support is mixed, and a vendor that claims to support a language may still want you to write a ton of code yourself to make it work. Some of the solutions need a separate monitoring agent running on every host.
Jonathan Mace at Brown University has a survey of how 26 companies do end to end tracing.
Other stuff to consider: custom instrumentation is an easy place to accidentally leak sensitive data, so be careful. And bear in mind that there’s a lot of cognitive load in setting something like this up, and it’s only valuable if you get broad coverage. You may need to market it to a lot of teams to get everything instrumented.
Stories from the Trenches of Government Technology Raquel Romano, Engineering Lead, Digital Service at Veterans Affairs, and Matt Cutts, Acting Administrator, US Digital Service
Veterans’ compensation can take months or years to come through, and it’s tremendously complicated. The USDS is launching a new platform, vets.gov, to make it much easier. It’s co-designed with veterans to make sure it’s easy to use, and it’s built on modern tech: Rails, React, Github, Jenkins, PagerDuty, devops. Compare that to the old-school government tech they need to integrate with, where they often don’t get access to the source code. (But the code’s not lost, so that’s better than it could be).
Matt reminisced about his time in Google working on ads, where a wise mentor showed him that you validate the data that arrives at your function, even if you wrote the code that calls the function. That’s Postel’s law: be conservative in what you send, be liberal in what you accept.
Government systems often don’t obey Postel’s law. In the system that transfers data from the DoD to Veterans Affairs, doctors had the option of transferring pdf, tiff or jpg files. If they selected anything other than pdf, the record would be silently dropped. (Oh my god). So the veteran would drive, maybe many hours, to the VA and find that there was no record that they were entitled to medical treatment. 5% of veterans were denied benefits because of this and other bugs. One USDS person worked on it for 44 days and fixed the problems: 0% denied benefits and a faster data transfer too.
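As a tiny illustration of the difference between “silently dropped” and “validated at the boundary”, here’s a sketch in Python. It has nothing to do with the real DoD or VA systems: `receive_record`, `UnsupportedRecord` and the list of accepted extensions are invented for the example. The point is only that the receiver accepts every format the sender was offered, and fails loudly on anything else instead of losing the record.

```python
from pathlib import PurePath

# Hypothetical: every format the sending side lets a doctor choose.
ACCEPTED = {".pdf", ".tiff", ".tif", ".jpg", ".jpeg"}


class UnsupportedRecord(ValueError):
    """Raised so that a bad upload is visible instead of vanishing."""


def receive_record(filename: str) -> str:
    """Accept a record in any offered format, or reject it loudly."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ACCEPTED:
        raise UnsupportedRecord(
            f"{filename!r}: unsupported type {suffix or '(none)'}"
        )
    return f"stored {filename}"


print(receive_record("scan.TIFF"))
try:
    receive_record("notes.docx")
except UnsupportedRecord as err:
    print("rejected:", err)
```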
These are not efficient systems. Veterans have over 1000 toll free numbers they can call, and they need to remember 7 different usernames and passwords over many sites if they want to get benefits. One form, the 10-10EZ to apply for healthcare, could only be opened with an old version of Adobe Acrobat. Modern browsers would say “You need a later version of the PDF viewer”, with no way to know that it actually wanted the opposite. The USDS replaced it with a web form. Error rates dropped from 7% of applicants to <1%. Veterans started using the site: from less than 10% of all healthcare applications to more than 50%.
When you go into Government, you’re in the past. Things everyone else has been doing for ages are new and risky. The Government didn’t want to do bug bounties, but the USDS pushed for an event called Hack the Pentagon! Before Hack the Pentagon, they’d found 30 bugs in 1095 days, costing around $13k each. Afterwards, they found 138 in 24 days, costing $1.1k each. Success! They followed up with Hack the US Army – and found a vulnerability so severe that they considered shutting down the online US Army recruiting system and going back to pen and paper. Then Hack the Air Force and one 17 year old made $40k from finding bugs.
This was really cool. I think most of us have seen terrible Government or administrative sites at some point, and it made the brain feel good to see them transformed to be efficient and human-centered. The USDS is knocking down many, many low-hanging fruit and making people’s lives better; spending 6 to 12 months with them seems to be a great way to do something meaningful. (You need to be a US citizen and be willing to live in DC.)
Resiliency Testing with Toxiproxy. Jake Pittis, Shopify
There’s a legend at Shopify of the day the CEO turned a node off on purpose as a kind of manual chaos monkey, and how hard that was to debug. Reasoning about failure is complicated, and human intuitions are often wrong. Learning from incidents gives a feedback loop that makes systems more resilient. But waiting around for real outages isn’t so good, so Shopify introduced ‘gamedays’, artificially executing a known failure scenario under controlled circumstances.
As a multi-tenant platform, user isolation is a big deal, and a single customer having a flash sale (a short-lived promotion) needs to not break things for other stores. So, they wrote a simulator to pretend to be a flash sale mob. They ran it again and again, shipping fixes to slowly become more resilient to flash sales. Nobody notices the real ones now.
These kinds of failures need to be authentic, but this means they can have high production impact. And they wanted to make something their product folks could use – game days are really only friendly to infrastructure developers.
So they built Toxiproxy. It’s a binary daemon that exposes a HTTP API, and a thin client library to use the API.
They have a few hundred Toxiproxy tests, automating failures such as injecting latency, blackholing data and rejecting connections. They inject these failures from their automated test suite into the proxy and see how the application reacts. When they fix the root cause of an incident, they ship a Toxiproxy test to make sure the root cause stays fixed; the test gets run on every deployment to production. They also write proactive tests, looking for future fires: they create a resiliency matrix to test all of the intersections between components and reason about potential failures on that matrix.
Next steps are integrating Toxiproxy into every binary to make it very easy for people to use, and automating more gamedays, such as datacenter evacuations.
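Here’s a rough sketch of the resiliency matrix idea in Python. To be clear, this does not use Toxiproxy or its API: `FaultyProxy` is an in-process stand-in I made up, latency is simulated by comparing numbers rather than actually sleeping, and `render_storefront` is a hypothetical app where the cache is optional and the database isn’t. It just shows the shape of “inject every failure at every component and write down what happens”.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class FaultyProxy:
    """In-process stand-in for a network proxy sitting in front of one dependency."""
    name: str
    latency_ms: int = 0
    down: bool = False

    def call(self, fn, timeout_ms):
        if self.down:
            raise ConnectionError(f"{self.name}: connection rejected")
        if self.latency_ms > timeout_ms:
            # Simulated: a real test would let the real timeout fire.
            raise TimeoutError(f"{self.name}: timed out")
        return fn()


def render_storefront(cache, db):
    """Hypothetical app code: degrade without the cache, fail without the db."""
    try:
        recommendations = cache.call(lambda: ["hat", "scarf"], timeout_ms=50)
    except (ConnectionError, TimeoutError):
        recommendations = []  # the page still renders, just with less on it
    products = db.call(lambda: ["shoe"], timeout_ms=200)
    return {"products": products, "recommendations": recommendations}


# Failure modes to inject, expressed as FaultyProxy settings.
FAULTS = {"healthy": {}, "slow": {"latency_ms": 5000}, "down": {"down": True}}


def resiliency_matrix():
    """Try every combination of cache fault and db fault, recording the outcome."""
    matrix = {}
    for cache_fault, db_fault in product(FAULTS, repeat=2):
        cache = FaultyProxy("cache", **FAULTS[cache_fault])
        db = FaultyProxy("db", **FAULTS[db_fault])
        try:
            render_storefront(cache, db)
            matrix[(cache_fault, db_fault)] = "ok"
        except (ConnectionError, TimeoutError):
            matrix[(cache_fault, db_fault)] = "error"
    return matrix


for (cache_fault, db_fault), outcome in resiliency_matrix().items():
    print(f"cache={cache_fault:8} db={db_fault:8} -> {outcome}")
```

Each cell of the matrix is then either an assertion you keep in the test suite (“cache down must still be ok”) or a known gap to reason about.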
Now You See Me Too: Visual Tooling for Advanced System Analysis. Suchakrapani Sharma, ShiftLeft Inc.
From 40k year old cave paintings to CERN pictures of electrons, we’ve come a long way in visualisation. Suchakrapani showed us several gorgeous pictures of early bar charts, maps, timeseries and line charts.
But this is not a talk about data visualisation. (There was an audible “awwww” from the row I was in and only some of it was me. Seriously, these were lovely visualisations. I could happily do 45 minutes of that.)
We use visualisation to gain insights on problems. For example, a program is making system calls and we want to know why some are slow. We can instrument all of the calls with timestamps and observe them as they occur. We gather the data, characterise it, then visualise it.
Gathering Systems Data
Systems are full of places where we have observability. Kernel functions, perf counters, hypervisors, custom APIs, branches and calls, CPU pins, even EM waves. Tons of data available.
The Common Trace Format, CTF, is a flexible binary trace format which allows tracing kernel and userspace events. And there are lots of storage options, like OpenTSDB, Graphite, graphs, etc, depending on the format of the data.
Characterisation and Visualisation
Again, many options. Heatmaps are good at showing outliers. Uftrace can give you the function graph of a userspace application. Flame charts (flame charts are so beautiful!) with time on the horizontal axis. Flame graphs, visualising call stacks during execution. Timeline views, where individual sections show individual states. Critical flow views, following a process’s execution and showing how it relates to other processes.
Valgrind gives Callgraphs and Treemaps. Sunburst graphs (also beautiful!) can represent hierarchies and are good for showing depth, but can be misleading: it’s hard to tell how big each sector is.
Trace Compass can do tons of different types of traces and visualisations. And then Suchakrapani showed us a live demo of Trace Compass, some of which I didn’t understand but all of which I enjoyed. It was very cool.
Finally, think about the colours you use. Diverging, for heatmaps. Sequential, related but changing. Qualitative for different entities. (Sequential are super pleasant to look at, imo.) I didn’t get a photograph of this, so here’s a random other article that illustrates it well.
I enjoyed this talk a lot, though I honestly could have gone for about 20 minutes more of good data visualisations. Judging by the Slack channel, I wasn’t the only one :-)
Vax to K8s: Ticketmaster’s Transformation to Cloud Native Devops. Heather Osborn, Ticketmaster
Heather’s a senior director of systems engineering at Ticketmaster. She’s been there 20 years! In 2011, they began a project to transform Ticketmaster’s tech and move 200 projects into the cloud. Transforming a 40 year old company is like turning an aircraft carrier.
Ticketmaster has 484 million ticket transactions, more than 1B unique visits to the site. 60% of the traffic is mobile. Every Friday new tickets drop and it’s a self-inflicted DDoS: 70k fans want the same 30k Beyonce tickets! This traffic lasts an hour or even just a few minutes and then it quietens down again.
A VAX is a discontinued system from the seventies. They’ve virtualised it on Linux rather than spend their weekends shopping for VAX parts on the internet.
They had siloed teams. Ops did deployments, monitoring, alerts and escalations, and customer service. They watched graphs, waited for systems to fall over, manually throttled traffic that looked like robots. The dev team got features to market on time and threw systems over the wall to ops. Ops wanted stability and didn’t care about the product. It was a familiar story of slow innovation, animosity and burnout. Tickets going over the fence and back again ten times over the course of a deployment. This sucks: let’s do DevOps.
So, they gave the devs unfettered access to all of prod. “Ops died a little inside”. But in return the dev team would be responsible for their own releases and monitoring. And for the first time, it would be possible to page them. The ops team had never had any contact information for them before!
The developers got training to handle operational tasks. Initially there was some finger-pointing between teams and some reluctance to take on unfamiliar work. Until a catastrophe happened: there was a major DHCP failure and the dev and ops teams had to work together to recover from it. Without the DevOps training, it could have taken weeks to get back to normal, but they were back to selling tickets within a couple of hours.
Another attempt at DevOps and now everyone’s friends. The developer team appointed incident owners. Ops folks embedded with the dev teams to learn how to fix the real problems. They worked together on service catalogues and tooling, and the developers got involved in the Friday morning ticket releases. The time to repair was much faster and the teams worked together.
Going to DevOps got release cycles down from two weeks to one, but they wanted to get it down to a day. Their hardware costs were also too high: they needed to provision for the Friday mornings, but the hardware was idle the rest of the week. So they modernised the tech stack and moved to hybrid cloud.
They used a new tech maturity model, moved to Docker, AWS, Terraform, Kubernetes and Tectonic and got their deploys down to 60s.
AWS wasn’t suited for everything. Public Cloud is expensive for always-on systems. Other legacy systems would have needed substantial rearchitecting with better caching, failover and persistent storage, to make them able to run in the cloud. And they’re not moving the VAX emulator to AWS. (Aww). So some things are still on bare metal. There’s still tons of work left to do, but it’s fun. And they can’t stop selling tickets while they do it.
“Don’t You Know Who I Am?!” The Danger of Celebrity in Tech. Corey Quinn, Last Week in AWS
Corey displayed a slide with a bunch of logos from major tech companies, explaining that these logos will make people believe he speaks with authority. Hah. In our industry, merely being associated with them gives automatic credibility! But should it? At a previous conference, he saw a talk by Netflix saying that they give their developers root in production. Great, said the guy sitting beside him, I’ll start doing that too. But that guy worked for a bank.
That’s a cargo cult! You’re confusing cause and effect. Copying what Netflix does will not give you the reliability of Netflix, because you’re missing the larger context. Netflix hires very experienced developers and pays them unusually well; they can screen for good judgement. And they stream movies. The consequences of an outage are much lower than for a bank.
Another story from a previous conference: a new speaker gave his first talk at a major conference. He described some new system he’d worked on, and gave an engaging talk. But the first question (not actually a question) was an engineer who said “That’s not how we do it at Google.” (Ugh). The list of things that are not questions includes: a) calling bullshit on the entire premise of the talk, b) telling a pointless story, c) boasting about your resume. This condescending engineer hit all three.
These celebrity companies are not even doing mission critical stuff. If you’re the department of energy, sure, you get to talk about how important your reliability is. Maybe less so if you’re an internet company. (This whole section and actually this whole talk was really, really funny, but I’m not going to do it justice so you should go watch this video.)
If you don’t work for these companies, their solutions are probably not right for you. Don’t cargo-cult what they do. For example, Netflix’s Simian Army randomly kills AWS instances and has been around and publicised for years, but when a region of S3 went down earlier this year, a lot of the sites on the internet broke. Clearly, we’re not removing our single points of failure. If you’re running something people’s lives depend on, sure you should probably rearchitect your service to avoid black swan events. But most of us, no.
And if you do work for big companies, recognise the weight that your words carry. Don’t punch down. Getting up to talk is scary, and if the first response is very negative, it’s incredibly hard to get up and do it again. Ask yourself: is the speaker wrong, or just wrong for your context? And either way, put yourself in the other person’s shoes. If you see the speaker is stressed, ask a softball question that isn’t condescending. Be kind.
This was amazing and you should watch the video.
Plenary Panel: Scaling Talent: Attracting and Retaining a Diverse Workforce
Moderator: Tameika Reed, Founder of WomenInLinux
Panelists: Derek Arnold; Amy Nguyen, Stripe; Qianna Patterson, QP Advisors; Wayne Sutton, Co-Founder, CTO, Change Catalyst; Derek Watford, Founder of High Point Gamer
It’s very hard to blog a panel! I took a lot of notes, attributed to each speaker, but I’m going to munge it all together here rather than accidentally misquote anyone. Tameika opened by telling us all to be comfortable with being uncomfortable. This actually ended up not being very uncomfortable. I feel like we could have handled being forced to introspect a bit more! But this was an engaging panel, and a packed room despite being the last session of the day (it gave me hope that this many people wanted to attend), so I think it was still very valuable.
Q: Are internships the only way into the tech world?
Schools, and particularly schools in low-income communities, aren’t providing the skills we need for the tech world. Education doesn’t work for everyone. Teachers don’t hit their stride until year five, but the average teacher doesn’t spend more than three years in the classroom.
A lot of us are self-taught. Even in a traditional CS program, your social group may determine how much attention you get, what you learn, who mentors and supports you. A book called Whistling Vivaldi says that Asian and White students work in groups to study. Black students were working alone, without anyone to help catch errors and get them through frustration. Working alone, you don’t see such good results, although you spend just as many hours studying. It’s a bad cycle and it’s lonely.
Tech companies should fund meetups, build networks.
Q: Why haven’t major corps or business leaders worked more proactively to engage with educational systems?
Internship programs can work, if you’re going to state and community colleges and recruiting there. Get kids applying in high school and stay with them. Companies should do this more. But they engage with the same five schools over and over. At Grace Hopper, if you didn’t go to Stanford, MIT, CMU, nobody’s looking at your resume or inviting you to the nighttime events. We celebrate Silicon Valley, but they value pedigree over actual skills and ability.
Q: How can we train someone to be an SRE coming out of high school?
Everyone’s talking about kids needing to learn to code, but who can teach them? Can a math teacher get an extra cert? Do we think we can convince people with CS degrees to get a teaching credential and earn $60k per year? We’ve been grappling with this for a while. And we need to think about people with transferrable skills: it doesn’t all need to be CS degrees.
If someone’s bad at math, we don’t let them do STEM. But there are a ton of other roles in tech. It’s not just coding. It’s computational thinking and problem solving.
We also need to not gate on guidance counsellors. They don’t know what a DevOps engineer does. We need to give kids videos to show them what jobs exist and what skills they’ll need. We need to see promise in students and encourage that promise. There are a lot of bad guidance counsellors and they have biases. If you’re Asian, people will assume you’re good at math. You can fail over and over again and still get tons of support to study STEM. Guidance counsellors encourage the students who they think look like engineers.
We should teach logic, computational thinking, personal finance.
Q: How do we stop bad behaviour? What do we do when the system we’re in allows harassment? How do we build better systems and safer communities?
If we had the answer, we’d be rich. Black women are pushed out of the industry; nobody’s trying to retain them. They’re pushed out through sexism, harassment, or hitting the glass ceiling. One story: an investor talking to two business partners, a black woman and a white man, who had been working together for a decade. The investor asked “where do you see yourself in three years?” The man assumed that he’d be growing the business and the woman would “hopefully be married with children”. If he’s willing to say it, how many times has he thought it? And how many times have other people dismissed her business and tech skills because she’s a black woman but they haven’t said it?
The fact that this happens to very senior tech women tells us something about how much it’s happening to people lower on the career ladder. It’s cumulative and it causes stress that pushes people out of tech.
We’re in a troubling time as humans. What we’re seeing in the movie industry right now is good: it’s becoming celebrated to call out bad players. We’re making progress: all conferences now have a code of conduct. Speaking out is risky; it’s easier for people with more privilege.
Q: Should you build a network before you speak out so you have a landing pad?
The tech industry is all about who you know. Sending your resume in doesn’t work; you need to know someone in the company. So make sure you leave this conference having gotten to know some people who are different from you. Offer to help people where you can. Be willing to connect with people who don’t look like you.
Q: How do you find your voice inside tech?
Speaking at conferences and doing podcasts. Knowing other people. Having friends on the internet who can sanity check whether you’re over-sensitive and whether that person was actually trying to be a jerk.
Start with self-awareness. You need to know who you are, how you think, what you’re good at and how to tell your story. Build a community based on your interests. Empower other people. Be a shield for them. Doing that is also helping yourself.
Part of finding your voice is telling the people who don’t want you to use your voice to shut up. Even if that’s you. Even if that’s your boss who always ignores your ideas.
Be aware of impostor syndrome, anxiety and depression. Keep a narrative in your mind about why you’re doing what you’re doing. If you’re under-represented, you have to do that with an extra weight that other people don’t have to. You have to have more confidence in every aspect of your life.
If you’re being pushed out, skill up before you leave. Learn for free. Take a skill, study people, get what you can.
Q: How do you move into a role of leadership?
Men just take leadership, they don’t wait for someone to give it. Women ask permission. But there’s no direct translation, like “Study this subject for this long and now you’re a leader”. We don’t have clear pathways. As leaders, we should always be training other leaders. Your responsibility is to clear pathways for other people to take your job. Continuity is good for business. Good leaders have a succession plan.
The No Asshole Rule is useful. It says you don’t need to give up your soul to be a leader.
Communicate what you need. Find allies. If there’s not a professional development track in your company, find colleagues and together push for one. Talk about leadership development. Ask for executive coaching as part of your perks. Just ask for what you need. It doesn’t hurt you for people to know you’re serious about being the best you can be.
Q: How can men be proactive in tech, and are they already proactive? Men have daughters, mothers…
Women are humans! Men should not need to have a daughter, an auntie, a grandma to treat women with respect, to not harass women. You shouldn’t need that to be a good human. We should see men and women do joint talks at conferences, on topics they wouldn’t usually talk about. Be uncomfortable. Expand your network and talk about things you would never talk about.
Genius crosses zip codes, opportunity does not. These conversations need to happen.
Don’t pretend we’re all the same or we all have the same opportunities. If you know something’s happening in someone’s community, don’t avoid it for fear of making it weird. Find the time to show that you care about people by asking how they are.
Question from the audience: how do you avoid being pushed into management if you’re a woman?
I didn’t take notes for the audience questions, but after some confusion and back and forth everyone got on the same wavelength: this doesn’t happen to Black women. /o\ We should be pushing more Black women towards leadership positions. The common (annoying) trope of women being sidelined into people roles when they want to be senior engineers is real, but for Black women there are problems to solve before we even have that problem.
Question from the audience: as a black man, should I try to learn how to speak to white people?
A lot of people in the audience laughed at this, which actually felt kind of shitty because it was a serious question. The panel agreed that code switching is a thing, but said that mostly you should be allowed to be yourself and communicate in the way that makes you comfortable. Which I agree that you should, but I think it’s probably not as easy as that :-/
This was a great panel. I wished it had been two hours long, because we were just getting into the mildly uncomfortable topics and I think that would have been great. But I guess that would be a long time to be sitting on stage. Anyway, the panellists kicked ass and I hope LISA has this conversation again next year.
Managing SSH Access without Managing SSH Keys. Niall Sheridan, Intercom
According to a report from NIST, most organisations don’t know how many ssh keys they have or who has access, and most don’t do key rotation.
How did we get here? We moved from telnet and ftp to ssh and we collectively decided keys were better than passwords. And keys do have good points: they’re just files, they’re easy to move around. But a lot of time has passed and we never got good at managing keys. Losing keys can be a company-ending event, but somehow companies still have better policies around managing passwords than keys. We require complexity, expiry dates, two factor. Keys don’t do any of that out of the box.
When an employee gives you a public key to put on a server, you don’t know anything about it: you don’t know how old it is, whether it has a good passphrase, whether it’s stored on a USB key or copied into Dropbox or even checked into GitHub. And keys are high-value targets for theft. When Sony was hacked, many of the files taken were key files. Some malware focuses on stealing keys.
Intercom used to have their authorized_keys file be part of their machine image. When someone joined or left the company or updated the keys, they’d deploy a new image. It didn’t scale, and they couldn’t revoke keys quickly.
So they moved to SSH Certificates. SSH certificates contain a public key and are signed with a signing key. They’re not new – they’ve been in OpenSSH for seven years – but they’re not widely used. SSH certificates can contain extra metadata, including expiry dates and named users with restricted capabilities.
They built their own CA, called Cashier. When an engineer wants to sign in, they run the cashier command line tool, which opens a browser pointing to the CA. The CA redirects to Google for OAuth, then provides a token. The engineer pastes that token into the cashier tool, and cashier generates a new ssh keypair and sends the public key and the token back to the CA. The CA signs it, the client loads the certificate and the private key into the ssh agent, and the user can now ssh to the production machine.
This is all audited and logged, and it’s very easy to quickly add and remove access. Users generate new certificates every day, so only the signing key needs to be managed and rotated. Since engineers don’t generate keys until they need them, they’ve gone from hundreds of keys to around 30.
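To make the certificate idea concrete for myself, here’s a toy sketch in Python. It is not Cashier’s code and it is not the OpenSSH certificate format: the class and function names (ToyCert, sign_cert, verify_cert, issue_daily_cert) are made up for illustration, and an HMAC stands in for the CA’s real asymmetric signature so it runs with only the standard library. It just shows why putting an expiry and a list of allowed logins inside a signed blob means the server only has to trust one signing key.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict

# Hypothetical CA secret. A real SSH CA uses an asymmetric signing key;
# HMAC is only a stdlib stand-in so this sketch runs anywhere.
CA_SECRET = b"toy-ca-signing-key"


@dataclass(frozen=True)
class ToyCert:
    public_key: str    # the user's freshly generated public key
    key_id: str        # who asked for it, useful for the audit log
    principals: tuple  # login names the certificate is valid for
    valid_after: int   # Unix timestamps bounding the lifetime
    valid_before: int


def _payload(cert: ToyCert) -> bytes:
    # Canonical serialisation so signer and verifier hash the same bytes.
    return json.dumps(asdict(cert), sort_keys=True).encode()


def sign_cert(cert: ToyCert) -> str:
    return hmac.new(CA_SECRET, _payload(cert), hashlib.sha256).hexdigest()


def verify_cert(cert: ToyCert, signature: str, login: str, now: int) -> bool:
    """Accept only if the CA signed it, it is inside its validity window,
    and the requested login is one of the listed principals."""
    good_sig = hmac.compare_digest(sign_cert(cert), signature)
    in_window = cert.valid_after <= now < cert.valid_before
    return good_sig and in_window and login in cert.principals


def issue_daily_cert(public_key: str, key_id: str, principals, now: int):
    # One-day lifetime, matching the "new certificate every day" habit above.
    cert = ToyCert(public_key, key_id, tuple(principals), now, now + 24 * 3600)
    return cert, sign_cert(cert)
```

With this, a certificate issued for the login "deploy" verifies a minute later, fails for "root", and fails two days later without anyone having to remove a key from a server.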
This was an engaging and educational talk. A topic like this has high potential to be dry and dull, but Niall kept it very interesting. And this is a cool solution to a common and long-running problem.
Where’s the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom! David Blank-Edelman.
David promised us an analogy that wouldn’t work perfectly, but that was likely to spark some ideas. I think it worked pretty well! Also, we got to see a video of a building imploding, so I think everyone was happy.
We know how to build and how to integrate, but we’re not so good at destroying or decommissioning things. What can we learn from the people who do demolition for a living?
Demolition’s different from construction: we think of it as easier, even though we may have much less knowledge of the structure, materials and dependencies than we do during construction. It’s a process of reverse engineering to take something apart: we’re stripping off layers in the reverse order we added them.
We don’t consciously aim to build things that are easy to destroy. It’s easier if you have a deconstruction plan, just like a design plan.
What makes it easier: transparency, regularity, simplicity, limited number of components, smaller numbers of large things rather than a large number of small things, easily separable materials, simple regular layout, layers, common standard shapes and connections, removable fasteners, salvaged materials (because then you know they’re salvageable!).
David referenced a book called How Buildings Learn: What Happens After They’re Built, which described a building as a set of six “shearing layers of change” in constant friction: the house’s site, its structure, the ‘skin’ it’s enclosed in, its plumbing and other services, its space plan/layout and the ‘stuff’ inside it. Each of these has a different life span.
When we design for disassembly, we need to identify the most critical connections between these different layers.
Look at your system, service, software like a demolition person might.
Debugging at Scale Using Elastic and Machine Learning. Mohit Suley, Microsoft
Mohit started off on the old story of Grace Hopper finding the first bug, and drew our attention to the timestamps on the log. It looks like it took them about 20 minutes to dig through all those relays and find it. Good debugging!
Developers and SREs treat debugging in a different way. There’s higher pressure once something’s running in production, so SREs want it to be faster. But debugging is hard for two reasons:
- who am I when I’m debugging? If it’s fun, I feel like Sherlock Holmes. But often we treat it as if we’re explorers, finding interesting things. We should be more like ER doctors, solving the problem as quickly as possible to move to the next patient.
- we have broken tools, or tools that don’t scale.
Our traditional tools have been low-level: a debugger, a packet analyser, a memory profiler, logs. We need to get higher level.
1) user-visible error messages. If they contain a code or id, they’re useful for engineers too. They can tell us where to start looking.
2) distributed tracing is the new debugger. Adding it takes a long time but it works and it’s worth doing.
3) machine learning. Log-relevant tokenization takes heterogeneous logs, detects key/value pairs and extracts them, e.g., extracting sentences and keys like ‘latency’ and ‘time’ from logs. Negative Phrase detection finds phrases like “not resolved”, “too long” and “unable to reach” and extracts them. Clustering aggregates similar things to make it easier to work with them.
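Here’s a rough Python sketch of those three ideas, mostly so I remember what they mean. There’s no machine learning in it and it isn’t what Mohit’s team runs: the phrase list, the key=value regex and the function names are all my own assumptions, and the “clustering” is just grouping lines that look the same once the values are masked out.

```python
import re
from collections import defaultdict

# Hypothetical phrase list; a real system would learn or curate these.
NEGATIVE_PHRASES = ("not resolved", "too long", "unable to reach", "timed out")

# Assumes logs write pairs as key=value with no spaces inside the value.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")


def extract_pairs(line: str) -> dict:
    """Pull key=value tokens such as latency=120ms out of a free-form line."""
    return dict(KV_PATTERN.findall(line))


def negative_phrases(line: str) -> list:
    """Return the known negative phrases that appear in the line."""
    lowered = line.lower()
    return [p for p in NEGATIVE_PHRASES if p in lowered]


def template(line: str) -> str:
    """Crude clustering key: mask values and numbers so similar lines collapse."""
    masked = KV_PATTERN.sub(r"\1=<*>", line)
    return re.sub(r"\d+", "<n>", masked)


def cluster(lines) -> dict:
    """Group lines that share the same masked template."""
    groups = defaultdict(list)
    for line in lines:
        groups[template(line)].append(line)
    return dict(groups)
```

Two requests that differ only in latency end up in one group, which is the point: you look at a handful of templates instead of a million lines.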
Think about anomaly detection horizontally for all of your systems. Twitter’s R library is a good place to start.
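If you want to play with the idea before reaching for a library, a median-based outlier check is about the smallest thing that works. This is a minimal sketch under my own assumptions, not what Twitter’s library does (as I understand it, that one handles seasonality, which this ignores entirely). The function name and the 3.5 threshold are just illustrative choices.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return the indices of points whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are far
    less distorted by the outliers themselves than mean and standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the points equal the median, so there is nothing
        # to scale by. This toy version just gives up.
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]
```

The same function can be pointed at latency, error counts or queue depth for every service, which is roughly what “horizontally” means here.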
Automatically crunch all of this debugging data. Automatically react to incidents, e.g., moving machines out of rotation. Don’t wake up a human engineer unless the machine learning can’t fix it.
Privacy is important. Logs have a lot of information and debugging needs to take privacy seriously. Be especially careful when sending data to 3rd party analysis and hosting services.
Machine learning isn’t just a buzzword any more, it works for debugging.
Keep debugging fun. Keep your inner Sherlock Holmes alive.
Closing Plenary: System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields. Jon Kuroda, University of California, Berkeley
Jon submitted this as a 45 minute talk and the organisers liked it so much they turned it into the final 90 minute plenary. And that was a very good call!
He opened on four stories of aviation disasters with very different outcomes. Everyone walked away when US Airways 1549 ditched in the Hudson River. All lives were lost when Air France 447’s sensor iced over and caused the humans to make the wrong decisions.
Why were these such different outcomes? We might like to blame pilot error, but that’s never the real root cause. Good design doesn’t force pilots to react perfectly. In truth there were many factors, including luck. And some of those look familiar: lack of fault tolerance, poor incident response, confirmation bias, monitoring, correlated failures, alarm fatigue, poor UI. And, in some good cases, well-practiced teams who worked well together.
This sounds a lot like our world. We deal with UIs and commands that don’t do what we expect, systems we don’t know well, people who we don’t know how to work with. We have complex interactions and often a low-diversity team. Unlike aviators, we don’t have as much formal training, and we work on a much wider variety of platforms.
Computing and systems administration came out of World War Two and the WRNS. We think of ourselves as a young industry, but we started at the same time as air travel, nuclear power, emergency medical services and electronics. Why is our industry less mature? Because we haven’t needed as much reliability as the other industries do. The stakes have been lower. Until now.
Technology is now in a position to ruin people’s lives. We’re honestly talking about using machine learning for real-time air traffic control applications. We have self-driving cars and health applications running in the cloud. 911 services are handled by VOIP. We’re becoming life-safety critical faster than we think. We should learn from other industries.
Humans are terrible. We’re bad at stress management and pattern matching. We get fatigued, distracted. Our skills are poorly maintained. We have cognitive biases. We can’t deal with alarm saturation, multitasking, repetition or being bored. And we aren’t good at self-monitoring! We don’t know when we’re performing worse.
Is it all bad news? No, we’re doing some good stuff:
We got post-mortems from medicine and the military: they’re blameless and actionable. That’s happened in the last five years. We still redact too much though. Other industries have collective reports of everything that went wrong in the field. There’s an anonymous aviation safety reporting system. The American Alpine Club has an annual report of climbing accidents. The FAA has a lessons learned site. The NTSB and the Chemical Safety Board have investigations. It’s hard to do this without a 3rd party external organisation.
We’ve become better at testing, code review and code coverage, though we’re still bad at knowing what to test. We often don’t test states we don’t think the code can get into. We test maxed out behaviour but not normal behaviour. We don’t always test tiny changes.
Checklists. Doctors resisted them for a decade or more but now they’re standard. It’s scripting for human behaviour. Airline checklists have checkpoints for restarting at if they’re interrupted.
What should we do better?
The sterile cockpit rule. We should accept that we perform worse if distracted and have only essential staff in the room or channel during critical changes.
We tell ourselves that it’s ok to miss occasional steps or not hold up to the standard. We need to not normalise deviating from the rules. It catches up with us eventually.
We don’t do communication training. We should. It should be part of our culture. Korean Air had a streak of accidents in the 90s. They instituted training on how to pass information and ask questions while respecting the hierarchy. They’ve had no fatal accidents since 1999.
UI interactions. The Three Mile Island accident was caused by a misleading monitoring system. We have outages caused by typos and bad UIs on tools.
Telemetry. Our logs are mostly unstructured and you need a human to parse them. We need a better way of assembling the story of an incident.
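One small step in that direction is logging structured events in the first place. Here’s a sketch using Python’s standard logging module. The JsonFormatter class, the "fields" convention for passing extra data and the timeline helper are all hypothetical names of mine, not something from the talk; the idea is only that one JSON object per line lets a program, rather than a tired human, pull out the events for one incident in order.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so machines can parse it."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # 'fields' is a made-up convention: callers pass
        # extra={"fields": {...}} when they log.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def timeline(json_lines, incident_id):
    """Assemble the story of one incident: its events, in timestamp order."""
    events = [json.loads(line) for line in json_lines]
    mine = [e for e in events if e.get("incident") == incident_id]
    return sorted(mine, key=lambda e: e["ts"])
```

You’d attach the formatter to a handler with handler.setFormatter(JsonFormatter()) and log with something like log.info("blocked IP", extra={"fields": {"incident": "INC-1"}}). It also makes the “did we remember to undo it?” question much easier to answer after the fact.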
Training. We need the equivalent of sandboxed flight simulators. In pair training, the first officer does most of the flying and the captain watches and only takes over if needed. We need to formally train in this way.
Regulation and licensing add a lot of baggage and overhead. But it would be good to self-regulate now, rather than waiting for the major disaster that causes government regulation of our industry. We should have a third party independent organisation like the NTSB.
I’ve been thinking about this pretty much every hour since Jon’s talk. What’s our Quebec Bridge collapse going to be? Can we get ahead of it and regulate now?
Some of the stuff I didn’t see but would have liked to
Here are a few other talks I heard tons of people enthusing about and especially regret missing (some because I had another talk in the same slot, some because I got in late, one because it was just before my talk and I was freaking out^W^Wcalmly getting into the zone). Catch the videos if you can (you probably can’t for the tutorial :-( ) when they’re up at https://www.usenix.org/conference/lisa17/conference-program
- the opening plenaries on Security in Automation by Jamesha Fisher and Leigh Honeywell.
- the How to Get Out of Your Own Way tutorial by Jessica Hilt and Allison Flick
- Working with DBAs in a DevOps World by Silvia Botros
- The 7 Deadly Sins of Documentation, by Chastity Blackwell
LISA’s still a good conference! Nearly every slot had at least one talk I really wanted to go to, and some had several. This year had an especially diverse lineup, which was fantastic. Going to LISA in 2010 and 2011, I could have told you, from memory, a list of which women and PoC were speaking and when. This year, I had to skip a bunch of talks from non-white-dude speakers I knew were excellent. This is such a good problem to have. This is also the first conference I’ve been at that passed whatever the race equivalent of the Bechdel test is. Multiple Black engineers having hallway track conversations about tech like it’s no big deal… well, it shouldn’t be a big deal, but it’s still too rare at SRE conferences. We have a long way to go as an industry, but this is a small step in the right direction. And that panel was excellent. Thank you, LISA organisers.
As an institutionalised Googler, conferences are great for catching up with what the rest of the world is doing. It feels like suddenly (ok, not really suddenly) everyone wants Kubernetes, even if not everyone knows why they want it. Distributed tracing has become a no-brainer, which is nice. The zeitgeist still seems to be that we’re publicly insulting Docker and secretly using it for everything. Jenkins is widely used but not widely loved. And the Cloudy future continues to be glorious. I went to a Kubernetes tutorial and, when the wifi was too crappy to download the binaries, I spun up a couple of GCE VMs and worked from there. It took a couple of minutes, including deliberating over which version of Ubuntu I wanted. I love it.
LISA’s over Halloween again next year, and I can’t miss trick or treating with the kiddo for two years in a row. Probably see y’all in 2019 though, or maybe at an SRECon in the meantime.