You’re asleep when you suddenly get woken up by a loud noise that won’t stop. You’re groggy and have no idea what’s going on. The pressure starts to make you panic because you can’t figure out what to do to make the noise stop. You feel like you’re in a (barely) waking nightmare.
You’re the on-call, you just got paged in the middle of the night, and you just started at the company so what documentation you can find seems like riddles more than instructions for what to do to diagnose and fix failures.
If you have never been on-call and this doesn’t sound like a big deal to you, then go find an engineering on-call team and volunteer to take a rotation. Honestly, if your work contributes in any way to systems or services that people can get paged for in the middle of the night, you should be on call regularly, because if everything generally works fine, it’s no big deal, and if stuff doesn’t, the organization or company clearly needs more motivation to fix underlying issues, both technical and cultural. But that’s a topic for another blog post, or twelve.
Internal engineering documentation in most companies tends to be one of three things: out-of-date, incomprehensible, or non-existent. This situation creates a great deal of frustration and wasted time for new team members and it makes the loss of senior team members who carry all of this domain knowledge in their heads that much more painful.
You can and should improve the quality and coverage of your internal docs. Clear documentation can be a force multiplier. The time that it takes to create it will be paid back many times over when people can find reliable, straightforward information covering how to perform a specific task or how some service works.
To get to that point, though, you have to create a culture where both writing and reading the documentation are standard expected behaviors. A lot of people would rather just ask someone else, but interrupt cultures are the death of efficient engineers. One way to minimize that behavior is to, well, write useful documentation. It’s much easier to foster an environment where RTFM is the de facto behavior if there’s a readable manual.
You also need to dismantle the tendency some engineers have to use docs as a way to “show off” how smart they are or how much they know. “Maybe next time don’t document your ego.” In fact, the best docs are written from a place of empathy, trying to remember what it was like when you, the writer, did not know this stuff either, and how you would have wanted it explained to you.
However, most engineers loathe writing documentation more than almost anything else. Being willing and able to write clear docs is an undervalued skill, and like most skills, developing it and honing it requires practice and feedback, just like writing code does. Teams need to incentivize the writing of docs and make it as simple as possible.
If your docs are stored as code in a git repo, that’s great, but maybe don’t put them in your code’s monorepo, which takes an hour to go through CI and testing. Yes, your build+test should really not take an hour, and you should go address that immediately, but for now, let’s acknowledge reality. That kind of friction may not sound like much, but those pull requests for documentation changes will get forgotten as they do not fix a broken build and as people move on to the other dozen things jockeying for attention.
Organizing docs well generally proves as big a challenge as getting them written in the first place, but it’s just as critical. Google Docs is a terrible choice for documentation, because between inconsistent sharing permissions and a surprisingly useless search function, you can’t find anything. I don’t know what its intended use was, but it is completely non-functional as a knowledge base. Ideally you would want a true content management system (CMS), but whether you opt for more of a wiki, make sure your solution has a good search function that allows advanced string queries, that offers easily-navigable topic hierarchies, and which saves your change history. No, I don’t have any suggestions, because I’m not even sure it exists.
You also need to recognize that, broadly speaking, you will have a few major categories of documentation, including design docs and instructional docs. First, you need to recognize that design docs are not a replacement for instructional documentation. They are blueprints for building, not for using. Also, teams rarely tend to update the design doc of a project, whether the spec changes during the initial implementation or is modified later.
I would also treat post-mortem write-ups as internal documentation that should live in the same CMS. Whether or not your team thinks it addresses the root causes of incidents, going through discussions of previous issues often turns up relevant information when troubleshooting.
Tips for Writing
So what should these magically useful instructional docs look like?
Basically, you need your team to write out domain knowledge in a way that does not assume domain knowledge. You also need to use a format and simple writing style which scan well and help users who are just trying to get shit done.
- Use lists often, which force engineers to organize tasks sequentially and usually write less verbiage because who writes three-paragraph list items.
- Establish some minimum but realistic expectations for what your most junior engineers will know.
- Do not write for yourself. You know this stuff already. Write for someone who does not know it. (This may seem obvious, but so many eng docs read like someone is just dumping their knowledge, which generally does not make for consumable documentation.)
- Do not omit steps or assume they’re implied.
- Avoid using terms like “simply” or “this is easy.” They don’t add readability and to a learning engineer, they can be frustrating.
- Link everything linkable. Mention dashboards or a monitoring site? Link it. A git repo? Link it. A tool? Link the download or installation instructions. Log service? Link it.
- If you send someone to a log or other monitoring service, tell them what search terms to use to narrow focus, or better yet, make sure the link includes the query.
- Make CLI commands copy/pasteable. Also make sure you’re not relying on some personal shell alias when you copy/paste your history into the doc.
- Show the expected output of a command. Also note how long the user should expect it to take to finish. I need to know if I should wait on it or not, and if it’s taking longer than expected, that may mean something’s broken.
- Spell out where services run, whether they’re specific hostnames, specific Kubernetes clusters, or some cloud platform or service, and how to get access if needed.
- Add a tl;dr page section for basic usage and keep it visually separate from longer explanations. They can share the same page.
- Create a page index for longer pages.
- Find someone to test your docs, preferably someone who does not normally work with the tools.
Let’s look at some examples.
|Check out the main git repo|
|Install bazel||Install bazel version 2.2.0 (we’re behind)|
|ssh to the bastion host|
|Check for failed pods|
|Ok, now you want to run the build. It will probably do this thing where it will spit out an error but you should ignore errors unless make exits with an error. Also, if it takes a really long time, something is probably wrong.||– |
– Ignore errors about missing timestamps.
– Build should finish in <5 minutes.
- Encourage a culture where engineers read and write docs.
- Write clear docs so engineers will actually read them.
- Writing readable docs takes practice and support.
- Find a sane, searchable, sustainable way to organize your docs.
- Useful documentation becomes a force multiplier, saving more time than invested and empowering new team members.