21 Dec 2016 : Conference Report: LISA 2016

I went to the Large Installation Systems Administration conference this year. It’s a great conference, full of interesting people thinking about systems administration, devops, operations, site reliability, production engineering and all of the other things that boil down to “make services keep running in a low-drama fashion”. I recommend it.

Here are some quick notes on what I saw and what was good. They’re not necessarily the most exciting parts of each talk, just some things I noted. If you want more, most of the sessions will have videos, mp3s and slides up at https://www.usenix.org/conference/lisa16/conference-program.

Modern Cryptography Concepts: Hype or Hope. Radia Perlman, Dell EMC Corporation

Radia Perlman invented the spanning-tree protocol and a bunch of other stuff. She’s a charismatic speaker, and one of those people who is clearly the smartest person in the room but doesn’t make you feel bad about it: she did a great job of making the crypto accessible. That said, I still wanted a pause button every few minutes so I could think through what she’d just said.

This was a whirlwind tour through cryptography concepts that appear (with frustrating misinformation) in the media. News reports say things like “We’re going to use blockchain for securing the IoT”, which makes no sense. Radia showed us nifty techniques for sharing a secret and flipping a coin over the phone. She talked about homomorphic encryption, which allows computations to be carried out on encrypted data without decrypting it. This sounds very exciting, but it’s 5-6 orders of magnitude slower than computing on plaintext, so we should not hold our breath.
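
The coin flip over the phone stuck with me. The usual trick is a commit-and-reveal scheme, and it’s short enough to sketch in Python – this is my own illustration of the general idea, not Radia’s exact construction:

    # Commit-reveal coin flip between Alice and Bob (toy sketch).
    import hashlib
    import secrets

    # Alice picks her bit and a random nonce, then sends only the hash (her commitment).
    alice_bit = secrets.randbits(1)
    alice_nonce = secrets.token_hex(16)
    commitment = hashlib.sha256(f"{alice_bit}:{alice_nonce}".encode()).hexdigest()

    # Bob, having seen only the commitment, announces his own bit in the clear.
    bob_bit = secrets.randbits(1)

    # Alice reveals her bit and nonce; Bob checks they match the commitment,
    # so she can't change her answer after hearing his.
    assert hashlib.sha256(f"{alice_bit}:{alice_nonce}".encode()).hexdigest() == commitment

    # The toss is the XOR of the two bits: neither party can bias it alone.
    print("heads" if alice_bit ^ bob_bit else "tails")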

Then she gave us a walkthrough of how bitcoin works, and why it’s not sustainable: she reckons it’s fragile, wasteful of electricity and expensive in network bandwidth, and eventually transaction fees will have to be very high. The trust assumptions – that honest miners will always outcompute dishonest miners – are not necessarily true in the age of state-sponsored hacking. As of December 2015, 71% of bitcoins have been mined. The reward currently barely covers the cost of electricity (though if you’re stealing the electricity in the first place, you might not mind that).

[slides]

Strategic Storytelling. Jessica Hilt, University of California, San Diego

Storytelling is what Jessica does – she teaches people how to tell a story and refine it for the stage – and she’s great at it. This was a fun and funny session. People who tell the best stories get their ideas adopted faster than people who just bring facts: we like to think that we’re data-driven, but storytelling activates parts of the brain that facts don’t, and the listener is more likely to turn a story into their own ideas and experience. For example, people will give more money when the ask comes with a story, and they’ll work harder if they feel like they understand their company’s strategy.

Consider the truth, the audience, the narrative. Don’t tell a Pollyanna story full of unrealistic joy: you need to do the Good, the Bad and the Ugly. Admitting failure gives you credibility. Tell stories from unexpected angles: not just the manager’s or PM’s point of view, but junior staff, and of course the client. And don’t tell people the moral of the story: if you told the story right, they’ll take away the moral themselves. “Don’t tell people you’re the king of the north: let them come to that conclusion.”

Set the stage, dramatic conflict, resolution. The conflict is the change you’re trying to make, the unknown territory, the what-if. Resolution is the call to action. Resolution includes why it’s worth taking the risk, why failure is an option – don’t elide that – but why it’s ok. But don’t say it’s going to be easy, because the first time it’s not, you’ve lost them.

How to Succeed in Ops without Really Trying. Chastity Blackwell, Yelp

This was immediately after my own talk, and I was sitting back in my seat being all “OMG I just spoke at LISA!!!” and looking at my 72 twitter notifications. Which is to say that I didn’t hear a word of the first 10 minutes of this. I regret that, because by all accounts this was a great talk. Luckily for me, the slides are available and I can fill in what I missed.

Chastity spoke about how people who used to feel comfortable in sysadmin jobs can feel behind the curve, and unable to participate in this new shiny devops/no-ops/sre/production engineer world. But most of these ‘new’ jobs are a new name for what good sysadmins have been doing for years: config management, incident response, troubleshooting the whole stack, scripting, monitoring, performance and capacity management. She talked about some myths, starting with “everyone is looking for a coder, I don’t code”. But scripting is code, even if sysadmins self-deprecate and don’t think of it as “real programming”.

Engineers/Opsen at small sites shouldn’t feel that they need to be at Google or Facebook to be good at what they do. The majority of tech employees work at small shops and they have advantages in making greater impact and having more exposure to different parts of the stack. Conversely, we shouldn’t be discouraged from applying to those places because ‘famous’ people work there and we don’t think we measure up. You don’t need to be a visionary or genius; just be good enough at being you and get things done. The real rockstar employees are the people who are doing good work and trying to help other people do good work too.

Be a “T-shaped” engineer: find an area where you want to focus, and it’s ok not to know everything. You don’t need to chase the new hotness: it’s ok, good even, to use “boring” tech. Protect yourself from burnout, learn the business as well as the tech, and keep in touch with people. Your network matters.

[slides]

Passing the Console: Fostering the Next Generation of Ops Professionals, Alice Goldfuss, New Relic

Alice is always funny and wise, and this was a highly anticipated talk. She talked about paths to operational jobs: unlike software engineering, people don’t usually get into ops on purpose; they’re doing something else and drift in. She argued that we should be more deliberate about bringing good people into ops roles: your bank, your healthcare, your grocery store are all online now, and we want good people behind the scenes making sure they all keep running.

What should we teach? Not tools! The tools change too fast. A new ops person really just needs Python and some Linux to be effective. Teach them the ability to pick up other skills as they need them. It’s tempting to teach monitoring systems, databases and so on, but it’s better to let them fill their own toolbox as they go along.

Instead, teach culture. Ops has traditionally been a “no” culture, and has used dark humour and crankiness to survive. Keep the gallows humour, but don’t be crusty. Teach new people to be curious, to build defensively, to serve their users. Teach them to be a life preserver, not a stop sign. Teach them “yes, but” instead of “no, because”.

We should look for potential ops folks in Support, IT and Bootcamps. Anywhere there are people who are curious, build strong systems and serve their users. People who tinker. People who like puzzles and games. People who are organised and fastidious. These are not always places we look now. opsschool.org can be a good place to help people learn, and we should all contribute to it.

[slides]

SRE: It’s People All The Way Down, Courtney Eckhardt and Lex Neva, Heroku

Courtney and Lex talked about the traditional ops org where the dev team throws code over the wall and ops needs to deploy it. In that world, ops says “no” a lot. There’s a knowledge gap. But if the ops team refuses to deploy it in whatever standard way, the devs will put it somewhere awful instead. (There was a funny picture of code being deployed on a toaster.) So don’t say “no”, say “yes, if…”. It was funny to hear this immediately after Alice telling us to say “yes, but”. We are all about qualified yesses!

The end state of pager fatigue is burnout, and a burned out oncall staff is a reliability (and attrition) risk. At Heroku they use a buddy system and limit the number of hours any one person spends on an incident: after four hours you’re more likely to compound the error than fix it. We often hear advice to not read mail out of hours, but Courtney and Lex went further: they advised against “spectating” incidents you’re not responding to. I hadn’t thought about that as a specific cause of fatigue; it’s a good point.

They talked about good and bad retrospectives, using the sinking of the Evergreen Point floating bridge as a model. They recommended using a template to make sure nothing gets left out. Finally they emphasised that “human error” is never a root cause, and “try harder” is not a remediation: the environment allows or prevents errors. Change the environment.

[slides]

Tutorial: Writing (Micro)Services with Flask. Chris St. Pierre, Cisco Systems, Inc.

Flask is a framework that makes it easy to create RESTful services quickly. After a day and a half of listening to talks, it was a nice change of pace to sit down and write easy code for a couple of hours. Chris provided skeleton code, and we filled it in to serve the output of ‘uptime’ and ‘iostat’ on our laptops over a RESTful API. Nice examples, and fun to make it work. And it was indeed fast to get something up and running.
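
To give a flavour of how little code it takes, here’s a minimal sketch along the same lines – not Chris’s actual skeleton, just an illustrative Flask app serving ‘uptime’:

    # Minimal Flask sketch in the spirit of the tutorial (not Chris's skeleton code):
    # serve the output of the local 'uptime' command over a tiny REST API.
    import subprocess
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/uptime")
    def uptime():
        # Run 'uptime' and return its output as JSON.
        output = subprocess.check_output(["uptime"], text=True).strip()
        return jsonify({"uptime": output})

    if __name__ == "__main__":
        app.run(port=5000)

Run it and ‘curl http://localhost:5000/uptime’ hands you the result back as JSON.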

Anyway I’m a full stack web developer now. Hire me for your startup.

No User Left Behind: Making Sure Customers Reach Your Service. Mohit Suley, Bing

One day, Bing stopped working from one place in Brazil. Users shrugged and changed to a different search engine (I wonder which one, heh), but an engineer visiting the area noticed, debugged and got people to fix it. It was a proxy problem at the ISP. It wasn’t Bing that was broken, but that’s not relevant: some users never came back. How do we know whether there are other very local problems like that?

Mohit distinguished between availability and reachability. It doesn’t matter if your site’s up if users can’t get to it. From the user’s point of view, there’s no difference between broken DNS, a broken ISP and a broken website.

Measuring reachability is hard. External testing isn’t good enough: the false-positive rate is high and the coverage is low. If they’re going to page on it, they need a low false-negative rate and an extremely low false-positive rate, and the tradeoff for avoiding false positives is that it takes longer to detect problems.

They made a model, with some gamification that I didn’t quite get. There was a monster and they attacked the monster with things. But the model made sense on its own. They fed it several kinds of signals as a proxy for user experience:

  • Direct: using external monitoring services like Catchpoint and Dynatrace
  • Indirect: using alternate path telemetry
  • Inferential: noticing changes in traffic patterns as seen inside the data center
  • Correlational: sentiment analysis on social networks
  • Causal: failures exposed by BGP

They used different weights for different kinds of signals, with a calendrical model for understanding when traffic was expected to be lower. They successfully tested the model on various simulations, then tried it in real life. It worked: they found real issues, including multiple DNS hijacks across ISPs in the US and elsewhere, fibre cuts, and BGP hijacks. They got surprisingly (to me) granular data: a government agency took half a day off and they noticed the drop in traffic.
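
I don’t know the internals of their model, but the general shape – several signal types, each weighted differently, rolled up into a score you can alert on – is easy to caricature. Here’s a toy sketch; the signal names, weights and threshold are all invented for illustration:

    # Toy sketch of weighting several reachability signals into one score.
    # Names, weights and threshold are invented; the real model is far more sophisticated.
    SIGNAL_WEIGHTS = {
        "external_monitoring": 0.30,  # direct: Catchpoint/Dynatrace-style probes
        "alternate_path": 0.20,       # indirect: telemetry over an alternate path
        "traffic_anomaly": 0.25,      # inferential: traffic vs. calendar expectation
        "social_sentiment": 0.10,     # correlational: complaints on social networks
        "bgp_events": 0.15,           # causal: BGP withdrawals and hijacks
    }

    def reachability_score(signals):
        """Combine per-signal health values (0.0 = broken, 1.0 = healthy)."""
        return sum(SIGNAL_WEIGHTS[name] * value for name, value in signals.items())

    signals = {
        "external_monitoring": 1.0,
        "alternate_path": 0.9,
        "traffic_anomaly": 0.3,   # traffic well below what the calendar predicts
        "social_sentiment": 0.8,
        "bgp_events": 1.0,
    }

    if reachability_score(signals) < 0.8:
        print("possible reachability problem in this region")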

[slides]

From BOFH to Just Another Person in the Standup, Surviving the Move to DevOps. Jamie Riedesel, HelloSign

Kudos to Jamie for opening by telling us that she was fired from a job “because I was perceived to be an asshole”. That got our attention! She compared ‘traditional’ and ‘startup’ ops cultures, and showed how they might not be compatible. What makes an effective team is all the team members feeling safe, and that manifests differently in different places.

At a bad ops culture job (the “helldesk”), you’re used to fighting malware, phishing, unauthorized VPNs, people who think they know more than you, willful ignorance, and people trying to work against your team. It’s a siege mentality, and your “team of awesome” is what keeps you sane. You defend your team and say no to ideas that you think would impose on them. When you have an idea and another team rejects it (presumably for the same reasons you reject theirs), the typical psychological-safety survival strategies are intentional deinvestment, compartmentalisation, and bonding through negativity. Self-deprecating/gallows humour can make a team strong, but the humour can also be cruel, especially when directed towards users. (I didn’t notice until Jamie pointed it out: I haven’t heard the term “luser” in years. Well done, us.)

When you move to a startup, you’re no longer your team’s last line of defense. Agile is improv. Everyone’s working together to solve the same problem. A flat “No” isn’t collaborative. When your own idea is shot down, the psychological safety mechanisms become an impediment. Deinvesting and compartmentalising stop you from helping to create the thing. Being continually negative about other people is being an asshole.

Jamie had advice for managers dealing with people coming from BOFH to agile cultures: signpost your office culture. People may be actively trying to adapt and just need guidance. Don’t be vague about it. Talk directly to the person and say “that specific thing you are doing is not acceptable”. And watch out for people who retreat after losing a technical decision; you might be glad that they’ve stopped complaining, but they may just have stopped caring and stopped being collaborative.

[slides]

Keynote: Identifying Emergent Behaviors in Complex Systems. Jane Adams, Two Sigma

Jane has a TEDx talk about the theory of emergence, the appearance of complex behaviours in previously simple systems. She told us about ants: 50 ants on a table wander pointlessly until they die. 500,000 ants start a colony. There are no extra moving parts – it’s still ants – but they start acting with more complexity. The whole is greater than the sum of its parts.

We have a lot to learn from ants and their simple, distributed, scalable system. They have been doing TCP for a lot longer than we have! They have flow control and congestion handling. Jane showed us a video of a starling murmuration. Look at that thing! Audible sounds of awe from the audience. Starlings achieve shockingly fast consensus and move like a fluid. Each one integrates signals from its seven nearest neighbours. That maximises robustness.
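
The seven-nearest-neighbours rule is simple enough to play with yourself. Here’s a toy heading-consensus sketch (my own illustration, not anything from the talk): each bird repeatedly re-aims at the average heading of its seven closest flockmates, and the flock aligns.

    # Toy sketch of the seven-nearest-neighbours rule: each bird steers towards the
    # average heading of its seven closest flockmates.
    import math
    import random

    N, K = 200, 7
    positions = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(N)]
    headings = [random.uniform(0, 2 * math.pi) for _ in range(N)]

    def nearest(i):
        """Indices of bird i's K nearest neighbours."""
        by_distance = sorted(range(N), key=lambda j: math.dist(positions[i], positions[j]))
        return by_distance[1:K + 1]  # skip the bird itself

    for step in range(50):
        new_headings = []
        for i in range(N):
            nbrs = nearest(i)
            # Average the neighbours' headings as unit vectors to avoid wrap-around issues.
            x = sum(math.cos(headings[j]) for j in nbrs)
            y = sum(math.sin(headings[j]) for j in nbrs)
            new_headings.append(math.atan2(y, x))
        headings = new_headings

    # Alignment near 1.0 means the flock has settled on a common heading.
    order = math.hypot(sum(map(math.cos, headings)), sum(map(math.sin, headings))) / N
    print(f"alignment after 50 steps: {order:.2f}")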

Plants and fish also have exciting consensus algorithms, and there are lots and lots of other biological analogs to tech stuff, including slime molds and network routing.

Overall

In previous years at LISA, I’ve noticed distinct themes: we’re all talking about virtualisation, or containers, or the Cloud, or whatever. This year, what really jumped out was the focus on humans. The ability to empathise, predict human behaviour, allow human error, teach humans, work with humans. To some extent this is influenced by the talks I chose to go to (I’m less drawn to “advanced usage of syslog” type talks than I might once have been), but it certainly felt that there were more thoughtful talks on ‘humaning’ than I’ve ever seen before. I think this is a good development and a sign of the increased maturity of the industry.

Docker had some sort of promotion that made people continually DoS the #lisa16 twitter stream with “I’m at the Docker booth”. Grrrr. Don’t do that, friends.

No talks on IPv6 this year! Does that mean we’re done?
