Riddle Me This Job Listing


In Ancient Greek mythology, the Sphinx guarded the entrance to the city of Thebes, challenging would-be entrants with a riddle and eating anyone who failed to answer correctly. Tourism apparently was not a major industry in Thebes, because no one could solve the Sphinx’s riddle, and it apparently did not occur to the Thebans that they should maybe build a second entrance to the city.

One day, Oedipus, fresh off murdering his father, the (now former) king of Thebes, met up with the Sphinx. She sprang her riddle on him and… he answered it correctly. The Sphinx, apparently unable to come up with a new riddle or any other skill to feed herself, then threw herself off a cliff. Oedipus entered Thebes where, as a reward for single-handedly rescuing the city’s tourism industry, he was made king and married off to the queen, AKA his mother.

Most ancient sources did not specify the riddle, although the one now commonly associated with the story goes like so: What walks on four feet in the morning, two in the afternoon and three at night?

Answer: An SRE crawling out of bed to get their laptop after getting paged at 3AM, somehow managing to walk upright most of the day, then crawling back into bed at night using only one hand because the other is carrying their laptop.


I’ve been on both sides of SRE job postings and I’m actively in the middle of a job search now, so I’ve got a lot of opinions about what SRE job descriptions should and should not look like. The list below comes from trying to encourage a diverse pool of applicants, to respect both my time and that of the potential candidate or employer, and to not be a total asshole to job seekers.

Karen’s List of Best Practices for SRE Job Postings

1. State the Opening’s Level

I see a lot of listings for “Site Reliability Engineer” with no indication of level. Is it entry-level? No, they seem to want some experience, although they often don’t give concrete ranges. While in some rare cases, generally at startups, none of the engineering listings may have specified levels, I will usually see this on a site where the product dev levels are spelled out in the listing title. I suspect, in many of these cases, the eng manager comes from a product dev background and either has no idea how to level SREs or possibly even thinks any amount of experience after 2-3 years is irrelevant.

Whatever the reason, please do one of the following:

  • Explicitly state a level in the listing so the candidate doesn’t waste their time clicking to find out you want someone junior.
  • If you’re open to multiple levels, say that: e.g. “Sr or Staff SRE opening”
  • If you have multiple openings at different levels, say that
  • If you frankly have no idea what level SRE you need, say that, unless you think someone may hoodwink you. (I never would, though.)

Also consider that SREs can come from a range of different engineering roles. If you’re ok with hiring a very experienced developer who wants to become an SRE but may need some training, state as much in the description (and be honest beforehand whether you have the people to teach them and that you really will give them the support they need).

Of course, putting a potential salary range serves as another way to hint to people about whether or not the role is a level-match, and it could also provide secondary signals if you do not get any resumes from people who are actually qualified.

2. Keep the “Must-Have” Specific Technology List Short

And by short, I mean pretty much non-existent.

Look, a smart, motivated engineer can learn whatever thingie they need. Seriously, they weren’t born knowing how to PXE boot a goddamn bare-metal Linux server. They can learn. You want to make sure they are willing and able to learn your shit, not that they come in knowing it, because, guess what? Your shit will change over time anyways. That’s what technology does.

You probably do want to make sure they have had to deal with “a” database, “a” monitoring system, “a” build system, but it really shouldn’t need to be “my specific DB/dashboard/pipeline.” Now, if you really, really need someone who knows, for example, Kafka inside and out because [stuff happened] and you need someone to hit the ground running NOW, say that. That’s ok; that’s good, even. But if you need someone who is already an expert with running a dozen+ specific pieces of software, especially if they’re complex, you probably need to get real, be ready to shell out a lot of money, or take a hard look at how you’ve chosen your stack’s tech. Also maybe try to figure out how to retain the engineers who knew this stuff in the first place, because apparently losing that knowledge creates… issues.

And who cares if a candidate has “worked with” Docker for 3 years? What if all they’d done that entire time was run docker pull and docker run? I guess you could list specific tasks they have performed around Docker, but now we’re back to that thing about not making a laundry list of required experiences.

Here’s another important reason to keep this list short: you’re selecting against diversity, particularly against female candidates, who tend to be much less likely to Dunning-Kruger themselves into thinking they have a depth of knowledge which they do not actually possess.

So just stop with these goddamn everything-but-the-kitchen-sink “requirements” lists, people. I’m going to start shaming you on Twitter, for real. (Yeah, I’m sure you’re quaking at the thought of my 3 followers seeing that.)

3. Talk About Your Tech Stack

While you don’t want to list every (or any) component of your entire tech stack and platform infrastructure as “must-haves,” you should still let the job seeker know what you use. They may think running SQL Server on Linux is the most fun ever, or they may see that you still monitor everything with Nagios and you aren’t moving off, which is a major deal-killer to them, as it would be to any sane person. And adding specific tech will also make searches by keyword easier, so candidates with or interested in matching platform skills may even be able to find you more easily.

Again, just make it clear that these specific services are not requirements or even nice-to-haves. “Hey, this is some of the stuff we use, in case you’re interested! If you don’t know these, though, that’s ok, too! We’ll figure out if you can learn them when we interview you because that’s something we should be trying to figure out in the interview anyway, right? Right?”

4. Explain What Your SREs Actually Do

I should maybe put this one at the top. Do you think an SRE sits downstream of product engineers and just has to somehow make their pile of code work? Or do your SREs (or your vision of SRE, if this is your first) have real standing, make decisions for the wider engineering team about how to measure and improve service stability and performance, and work directly with the other teams at all stages of product lifecycle? Is site reliability (uptime, happy customers, etc.) a first-class citizen or just necessary overhead? And if your answer is the latter, maybe, um, spend some time thinking about that.

5. Be Real

Maybe forgo the standard lists of what you’re looking for and talk instead about what the org actually needs. “We want to improve the productivity of product engineers by reducing the number/duration of production incidents that eat their time.” “We are building out our next-gen platform and want a collaborative SRE to help build resilience into the design and implementation.” Etc.

6. Be Honest

This one should go without saying, but I know I’m not the only person who took a job that bore no resemblance other than job title to what was published and then discussed with the recruiter, management, and interviewers. People hired under deliberately or accidentally false pretenses will probably leave fast and now you’re back at step 0 in filling the role, and how much fun was that for anyone the last time?

7. Write for Diverse Candidates

I’m not an expert on exactly how to do this, although not having a white male engineer write the listing would be a good starting place. There’s plenty of information on the Internet with advice on how to approach inclusive listings, though. And while it should go without saying, if the hiring manager and the recruiter and everyone else touching the listing are all part of the majority tech demographic, find someone who does not look and sound like them to check the description for tone and other red flags. And if all your candidates still look like your hiring manager and existing team, fix it. A number of consultants and services exist who specialize in helping recruit for diversity.

Just make sure you actively address inclusion and equity so you can keep those employees once you actually find and hire them. That’s another story, though.


Obviously, all these points are optional. And they only partially address a worse problem, namely that most organizations seem to have job listings that could be interchangeable, give or take a few weird combinations of specific required hands-on experience, when the reality of the role in each org varies greatly. That lack of awareness or honesty is also something that seriously needs to change because I can’t imagine whom it actually benefits in the long run.

But that’s also a topic for another post.

WTFM

You’re asleep when you suddenly get woken up by a loud noise that won’t stop. You’re groggy and have no idea what’s going on. The pressure starts to make you panic because you can’t figure out what to do to make the noise stop. You feel like you’re in a (barely) waking nightmare.

You’re the on-call, you just got paged in the middle of the night, and you just started at the company so what documentation you can find seems like riddles more than instructions for what to do to diagnose and fix failures.

If you have never been on-call and this doesn’t sound like a big deal to you, then go find an engineering on-call team and volunteer to take a rotation. Honestly, if your work contributes in any way to systems or services that people can get paged for in the middle of the night, you should be on call regularly, because if everything generally works fine, it’s no big deal, and if stuff doesn’t, the organization or company clearly needs more motivation to fix underlying issues, both technical and cultural. But that’s a topic for another blog post, or twelve.

The Why

Internal engineering documentation in most companies tends to be one of three things: out-of-date, incomprehensible, or non-existent. This situation creates a great deal of frustration and wasted time for new team members and it makes the loss of senior team members who carry all of this domain knowledge in their heads that much more painful.

You can and should improve the quality and coverage of your internal docs. Clear documentation can be a force multiplier. The time that it takes to create it will be paid back many times over when people can find reliable, straightforward information covering how to perform a specific task or how some service works.

To get to that point, though, you have to create a culture where both writing and reading the documentation are standard expected behaviors. A lot of people would rather just ask someone else, but interrupt cultures are the death of efficient engineers. One way to minimize that behavior is to, well, write useful documentation. It’s much easier to foster an environment where RTFM is the de facto behavior if there’s a readable manual.

You also need to dismantle the tendency some engineers have to use docs as a way to “show off” how smart they are or how much they know. “Maybe next time don’t document your ego.” In fact, the best docs are written from a place of empathy, trying to remember what it was like when you, the writer, did not know this stuff either, and how you would have wanted it explained to you.

However, most engineers loathe writing documentation more than almost anything else. Being willing and able to write clear docs is an undervalued skill, and like most skills, developing it and honing it requires practice and feedback, just like writing code does. Teams need to incentivize the writing of docs and make it as simple as possible.

If your docs are stored as code in a git repo, that’s great, but maybe don’t put them in your code’s monorepo, which takes an hour to go through CI and testing. Yes, your build+test should really not take an hour, and you should go address that immediately, but for now, let’s acknowledge reality. That kind of friction may not sound like much, but those pull requests for documentation changes will get forgotten as they do not fix a broken build and as people move on to the other dozen things jockeying for attention.

Organizing docs well generally proves as big a challenge as getting them written in the first place, but it’s just as critical. Google Docs is a terrible choice for documentation, because between inconsistent sharing permissions and a surprisingly useless search function, you can’t find anything. I don’t know what its intended use was, but it is completely non-functional as a knowledge base. Ideally you would want a true content management system (CMS), but whether you opt for that or for more of a wiki, make sure your solution has a good search function that allows advanced string queries, offers easily navigable topic hierarchies, and saves your change history. No, I don’t have any suggestions, because I’m not even sure it exists.

You also need to recognize that, broadly speaking, you will have a few major categories of documentation, including design docs and instructional docs. Design docs are not a replacement for instructional documentation. They are blueprints for building, not for using. Also, teams rarely update the design doc of a project, whether the spec changes during the initial implementation or is modified later.

I would also treat post-mortem write-ups as internal documentation that should live in the same CMS. Whether or not your team thinks it addresses the root causes of incidents, going through discussions of previous issues often turns up relevant information when troubleshooting.

Tips for Writing

So what should these magically useful instructional docs look like?

Basically, you need your team to write out domain knowledge in a way that does not assume domain knowledge. You also need to use a format and simple writing style which scan well and help users who are just trying to get shit done.

  1. Use lists often, which force engineers to organize tasks sequentially and usually write less verbiage, because who writes three-paragraph list items?
  2. Establish some minimum but realistic expectations for what your most junior engineers will know.
  3. Do not write for yourself. You know this stuff already. Write for someone who does not know it. (This may seem obvious, but so many eng docs read like someone is just dumping their knowledge, which generally does not make for consumable documentation.)
  4. Do not omit steps or assume they’re implied.
  5. Avoid using terms like “simply” or “this is easy.” They don’t add readability and to a learning engineer, they can be frustrating.
  6. Link everything linkable. Mention dashboards or a monitoring site? Link it. A git repo? Link it. A tool? Link the download or installation instructions. Log service? Link it.
  7. If you send someone to a log or other monitoring service, tell them what search terms to use to narrow focus, or better yet, make sure the link includes the query.
  8. Make CLI commands copy/pasteable. Also make sure you’re not relying on some personal shell alias when you copy/paste your history into the doc.
  9. Show the expected output of a command. Also note how long the user should expect it to take to finish. I need to know if I should wait on it or not, and if it’s taking longer than expected, that may mean something’s broken.
  10. Spell out where services run, whether they’re specific hostnames, specific Kubernetes clusters, or some cloud platform or service, and how to get access if needed.
  11. Add a tl;dr page section for basic usage and keep it visually separate from longer explanations. They can share the same page.
  12. Create a page index for longer pages.
  13. Find someone to test your docs, preferably someone who does not normally work with the tools.
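
For tip 7, here’s roughly what “make sure the link includes the query” could look like. This is a minimal sketch, not anyone’s real API: the logs.example.com URL, the “q” and “range” parameter names, and the log_search_link helper are all made up for illustration, so swap in whatever your log service actually expects.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute your own log service's URL.
LOG_SEARCH_URL = "https://logs.example.com/search"


def log_search_link(query: str, time_range: str = "1h") -> str:
    """Return a link that opens the log service with the query pre-filled.

    The parameter names ("q" and "range") are assumptions for illustration.
    """
    params = {"q": query, "range": time_range}
    # urlencode escapes spaces and colons so the link survives copy/paste.
    return f"{LOG_SEARCH_URL}?{urlencode(params)}"


# Example call (the search terms are invented):
#   log_search_link("service:checkout status:500")
```

Generate the link once and paste it into the doc, rather than telling a half-awake on-call to “search the logs for 500s.”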

Let’s look at some examples.

Bad: Check out the main git repo
Good: git clone git@github.com:example/repo.git

Bad: Install bazel
Good: Install bazel version 2.2.0 (we’re behind)

Bad: ssh to the bastion host
Good: ssh bastion.example.com

Bad: Check for failed pods
Good: kubectl get pods --all-namespaces | grep -E 'Error|CrashLoopBackOff'

Bad: Ok, now you want to run the build. It will probably do this thing where it will spit out an error but you should ignore errors unless make exits with an error. Also, if it takes a really long time, something is probably wrong.
Good: make build
  – Ignore errors about missing timestamps.
  – Build should finish in <5 minutes.
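
The “should finish in <5 minutes” note in the make build example can also be turned into something executable. Here’s a minimal sketch, assuming Python 3.9+ is on the box; run_with_budget is a hypothetical helper, and the 300-second figure in the comment just restates the note above.

```python
import subprocess
import sys
import time


def run_with_budget(cmd: list[str], expected_seconds: float) -> tuple[int, float, bool]:
    """Run cmd; return (exit code, elapsed seconds, whether it ran over budget)."""
    start = time.monotonic()
    # The command runs without a shell, so personal aliases can't sneak in.
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    over_budget = elapsed > expected_seconds
    if over_budget:
        print(
            f"WARNING: {' '.join(cmd)} took {elapsed:.0f}s; "
            f"the doc says it should finish in under {expected_seconds:.0f}s.",
            file=sys.stderr,
        )
    return result.returncode, elapsed, over_budget


# Example call for the build step above (300s = the documented 5 minutes):
#   code, elapsed, over = run_with_budget(["make", "build"], expected_seconds=300)
```

The exit code still decides whether the build failed, same as the doc says; the warning only tells the reader that something may be broken before they have waited twenty minutes to find out.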

tl;dr

  • Encourage a culture where engineers read and write docs.
  • Write clear docs so engineers will actually read them.
  • Writing readable docs takes practice and support.
  • Find a sane, searchable, sustainable way to organize your docs.
  • Useful documentation becomes a force multiplier, saving more time than invested and empowering new team members.

Roots of Knowledge

I have a small collection of succulents, the only kind of plant I have managed to keep alive. Succulents do not come from a single family of related plants; instead the name describes plants that have adapted usually to arid ecosystems by storing water in their leaves, stems, or roots.

Like pretty much all plants, succulents can have one of two major root systems: a fibrous root system or a taproot. A taproot has a single, usually thick root, usually with smaller roots growing from the taproot. When you eat a carrot, you eat that plant’s taproot. Lithops, the “living stone” succulents, also have persistent taproots.

My juvenile lithops, one removed to show the taproot. The lithops on the left is currently shedding its old leaf pair, something healthy lithops do annually. (“Lithops” is both the singular and plural form of the plant’s name.) This dude needs its summer watering.

Most common succulents have branching root systems, though, including Haworthias like this H. retusa pup receiving water therapy. (In a rather surprising twist, while overwatering is the most surefire way to kill most succulents, submerging the roots of some plants that have suffered root damage, a practice called “water therapy,” can help them grow new roots and rebound.)

In fact, some of these plants can grow new roots even when it appears their entire root system has died, like my H. cymbiformis. I thought it was a goner after a mealy bug infestation, because all the roots had dried up and broken off. After just a couple weeks of water therapy, though, it has new roots.

Haworthia retusa (left) and H. cymbiformis (right). The latter had lost all its roots after a mealy bug infestation but after a couple weeks of water therapy, it is growing new roots (white protrusions) and seems to be bouncing back.

Mealy bugs strike true fear in the hearts of succulent growers. They are fluffy little fuckers which like to hide among, and consume, the roots and stems of succulents. Their favorites, like mine, seem to be haworthias. The only succulents I’ve lost were to mealy bugs. (Ok, I did have a lithops which likely succumbed to its injuries from a kitten.) The little assholes can usually be detected by the wispy spiderweb-like strands they deposit on the leaves. I’m not going to post a picture of them because I hate them so very, very much.

So while plants with branching roots can sometimes apparently regrow their entire root system, plants with taproots cannot. If the taproot dies or breaks off, the plant is a goner, although some plants with persistent taproots, like dandelions, can regrow the plant from an intact taproot. Haworthias can be propagated by cutting off smaller rosettes in a clump as long as they get some part of the root system. This practice not only does not work with succulents with taproots, but it also would kill the plant.


You could probably apply the taproot vs fibrous roots analogy to a number of aspects of software engineering, including monolith architecture vs microservices. But this blog claims to be at least tangentially about devops, so let’s talk about breaking down some silos.

I’m not a botanist or an evolutionary biologist, but, at least as an amateur succulent grower, I’d say the plants with a fibrous root system seem to have a clear advantage over those with taproots. Those plants with fibrous roots can often regrow roots after serious injury or damage, and they often allow the plants to be divided into new, smaller clumps. Branching roots can grow off in new directions if they hit an obstruction; taproots tend to grow downward. The root and therefore the plant can suffer and die if they hit a hard obstruction.

Engineering teams in an organization can also take on characteristics of a taproot or a fibrous root system. Teams with a taproot approach may have domain knowledge of one area of the software or architecture, but their connections to other teams relating to shared or connected expertise tend to be more tenuous and shallower.

Visualization of team knowledge with little overlap or connection

A more fibrous structure would ideally still have specialized concentrations of knowledge in each team, but the knowledge distribution from team to team would be more of a continuum rather than a set of bounded, unshared proficiencies.

Visualization of teams with large amounts of shared or interconnected knowledge.

The overlapping model does not mean to say that everyone needs to be an expert in every other team’s wheelhouse. Specialization should not disappear. But it does need to be collaborative and open. And more importantly, it needs to operate with the understanding that these teams and their areas of responsibility do not exist in vacuums or silos.

Take this example. You may have a team that owns the cloud infrastructure platform, another team which handles database reliability, and a backend product engineering team.

Obviously, the cloud infra team and the database reliability team need a solid subset of shared information and knowledge. While the infra team could just throw some virtual machines and storage at the DB team, effectively managing database reliability and performance is impossible without understanding some key characteristics and behaviors of the underlying infrastructure. Meanwhile, the cloud infra team would have to understand best practices for the database engine’s infrastructure requirements.

The backend team would also need some understanding of cloud services on which their own services depend, such as infrastructure security, performance, and availability. They may need to have some understanding of the actual dollar costs of certain design and implementation choices which can affect the cloud provider bill. The cloud infra team needs to understand the backend software’s capacity and performance needs and its dependencies and usage of different cloud services.

The backend and database teams also need to have shared knowledge. The product engineers need to understand the performance characteristics and best practices of the database engine to create efficient schema and queries as well as to avoid anti-patterns that can adversely impact application performance or database stability. The database reliability engineers need to understand the backend services’ requirements for stateful storage well enough to make both architectural and more low-level recommendations, as well as to add resilience to the database infrastructure and scale it as needed.

Many engineering organizations have teams which, either intentionally or through lack of intention, do not integrate as well as they should with respect to knowledge. This lack of cohesion can lead to wasted work, incompatible solutions to multi-team projects and problems, and general frustration and mistrust all around.

While some individuals in these teams may make their own attempts to bridge the jargon chasms, learning requires an investment of time and effort. When so many engineering organizations are unwilling or unable to tap on the brakes, thus disrupting their constant state of building broken stuff but doing it fast, this shared technical understanding can be difficult to acquire.

One way to start down the path of being a branched-root rather than taproot org would be to have engineers spend guest sprints or rotations on other teams. These exchanges create some temporary drag as the visiting engineer ramps up and host members help, but actually doing the job is one of the ways to understand both the technology and the experience of that host team.

Another less targeted but good introductory method would be to have regular presentations from different teams to try to educate each other and create the space for conversations.

Spreading technical knowledge across teams does more than help engineers on every team make more informed decisions. It helps develop empathy between teams. “Oh, so that’s what they’re doing every day.” And again, teams don’t need to become experts in the specialties of neighboring teams. They do need to learn enough to have useful conversations and, perhaps most importantly, to learn what questions they need to ask when faced with an issue or decision.

Branched-root teams and organizations grow their contextual understanding of their own domain by learning about neighboring teams and their concerns. This strengthens individual engineers because knowledge is power, and it strengthens the organization because teams can talk to each other and collaborate intelligently to create more resilient and efficient systems.
