Choose Your Own Architecture

I remember reading Choose Your Own Adventure books as a kid. These are like RPGs in book form. Yes, we still had actual RPGs (Zork ftw!) but I couldn’t exactly take the Apple ][e on car trips, so books.

I only had a few of the titles, so after going through them often enough to end up at the same endings a few times, I would start paging through the books to find endings I hadn’t reached through the intended manner of picking one of two or more decision options and turning to the corresponding page. In other words, I would try to backwards-engineer all the possible endings in the books. One book, though, had an ending that was simply unreachable. None of the decision pages had an option that led to that page. I did not know if that unattainable ending came about by accident or as an intentional private joke, but the whole thing felt unresolved to me. I don’t even remember which book it was or the storyline. I just remember I couldn’t reach that ending.
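For the programmers in the audience, hunting for that orphaned ending is a graph reachability problem. Here is a minimal sketch, assuming the book can be written down as a mapping from each page to the pages its choices point to. The `book` data and the `unreachable_pages` name are made up for illustration, not taken from any real title.

```python
from collections import deque


def unreachable_pages(choices, start):
    """Return the pages that no sequence of choices from `start` can reach."""
    seen = {start}
    queue = deque([start])
    # Breadth-first walk: follow every choice from every page we can reach.
    while queue:
        page = queue.popleft()
        for target in choices.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Every page mentioned anywhere, either as a source or as a target.
    all_pages = set(choices) | {t for targets in choices.values() for t in targets}
    return sorted(all_pages - seen)


# Hypothetical book: page -> pages its choices lead to (endings have no choices).
book = {
    1: [4, 7],
    4: [9, 12],
    7: [12],
    9: [],
    12: [],
    15: [],  # an ending that no decision page points to
}

print(unreachable_pages(book, start=1))  # prints [15]
```

Paging through by hand is the same walk, just slower and with more paper cuts.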

Let’s try this. You just founded a new company with a SaaS-based business model. Now you have to build the actual SaaS out from scratch. Where do you start?


Burst Water Mains

Scary story: commuting in LA traffic.

That’s it. If you’ve ever lived there, that’s enough.

Now imagine you’re on a major surface street during an evening commute. (“Surface street” == not a highway; “major” means it’s as wide as one. LA traffic has its own jargon, and when I moved to the Bay Area, I was surprised by how much of that jargon was local to SoCal specifically.) You’re stopped at a major intersection (where two major surface streets meet) and suddenly water starts gushing up into the intersection from the maintenance holes.

Yeah, you’re not getting home any time soon.

In the summer of 2009, when I was still living just outside LA proper, this scenario played out multiple times. The LA Department of Water and Power (DWP) recorded 101 water main breaks, more than double the preceding year. (The city’s water mains tend to run beneath major streets, hence the locations for the flooding.)

Some of the bursts just caused traffic disruptions, but some were large enough to cause real property damage to buildings. One blowout opened up a sinkhole that tried to eat a fire engine.

As the realization hit that the numbers were much higher than in previous years, the DWP and others began floating theories, including:

  • The most obvious reason: the age of the mains. Most of them were iron pipes dating back almost a full century. The DWP had already started planning replacement work.
  • Temperature variations, although the summer of 2009 was about average.
  • Increased seismic activity.
  • Statistical anomaly.

City engineers were puzzled, though, because the breaks were taking place all over the city. (They also seemed to take place at many different times of the day and night, but I can’t find a comprehensive list.)

However, 2009 was a bit different from previous years in one critical way. The DWP had instituted new lawn-watering rules for water conservation that went into effect at the beginning of the summer. Automatic sprinkler systems could only run on Mondays and Thursdays and only outside the hours of 9AM-4PM.

The new water rationing was not a major theory in the DWP, though, because many other cities in the area, including Long Beach, had similar rules but without the increased incidence of pipe blowouts.

The city created a panel of scientists and engineers to investigate. In the end, they found that water rationing was the key player here. While the age and condition of the pipes played a major role, the extreme changes in water pressure in the pipes between days with and without residential watering proved to be the tipping point. As a result of the findings, the city changed the water conservation policy to try to maintain more consistent flows.

(While we’re here, let’s note the irony of a policy designed to conserve water that led directly to conditions that sent millions of gallons of it flooding city streets.)

Production service incidents tend to follow similar patterns. There is never just one root cause. Instead, as in LA in the summer of 2009, pre-existing, less-than-ideal conditions often suddenly get pushed past their breaking point by a sometimes small change. I don’t know how much research was conducted before the city decided to institute the original conservation policy, or whether water pressure changes and their effects on the mains were even considered. If the DWP was not consulted, that’s another contributing factor. If they were consulted and, for whatever reason, did not anticipate issues given the known state of their infrastructure (remember, they were already planning on replacing century-old pipes, a process ongoing to this day), that would be a contributing factor to the incident as well.

One of my favorite examples of a (very, in this case) complex chain reaction of events colliding with less-than-ideal conditions comes from the write-up of the AWS S3 outage in us-east-1 in February 2017. Beyond its sheer size and length, the outage gave many engineering teams and users a Chaos Monkey-style look into which services had hard dependencies on that specific S3 region; one of these impacted services was AWS’s own status board. AWS had to use Twitter to supply updates to customers during the outage.

The media at the time kept writing stories with headlines like, “Typo Takes Down S3,” but that was not only a gross oversimplification, it was… well, maybe not wrong, but arbitrary scapegoating. Here are some headlines they could have used that are equally valid and yet, given their lack of scope, equally invalid:

  • “Maintenance Script That Did Not Check Input Parameters Takes Down S3”
  • “S3 Suffers Regional Outage Because AWS Stopped Testing for Regional Outages”
  • “S3 Collapses Under the Weight of Its Own Scale”

Honestly, someone should make a Mad Libs based on those headlines.
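That first alternate headline points at one contributing condition worth sketching: a maintenance tool that acts on whatever numbers it is handed. Below is a rough illustration of the kind of guard such a tool might apply before removing capacity. Everything here (the `plan_removal` name, its parameters, the 5% default cap) is hypothetical and is not AWS’s actual tooling; it only loosely echoes the kind of safeguards their write-up describes, such as not dropping a subsystem below its minimum required capacity.

```python
def plan_removal(active, requested, min_required, max_fraction=0.05):
    """Validate a capacity-removal request and return the remaining server count.

    All names and thresholds are illustrative assumptions, not a real API.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > active:
        raise ValueError(f"cannot remove {requested} of {active} servers")
    remaining = active - requested
    # Refuse to take the fleet below the capacity it needs to keep running.
    if remaining < min_required:
        raise ValueError(
            f"removal would leave {remaining} servers; minimum is {min_required}"
        )
    # Refuse large single steps, so a mistyped number fails loudly instead.
    if requested > active * max_fraction:
        raise ValueError(
            f"refusing to remove more than {max_fraction:.0%} of capacity at once"
        )
    return remaining


print(plan_removal(active=1000, requested=20, min_required=800))  # prints 980
```

A guard like this addresses one condition among many. It does nothing about the others, which is rather the point.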

At any rate, focusing on a chain of discrete or tightly-coupled events in a post-mortem makes less sense than focusing on the contributing conditions, at least if you genuinely want to prevent future issues. These conditions, especially where humans are involved (which is everywhere), are highly contextual and directly related to the organization’s culture. If the engineering teams involved have a culture of strong ownership and collaboration, your set of solutions, both technical and process-related, can and probably should be very different than if the team has a lack of discipline or a lack of footing in reality. And in the latter case, ideally (but probably not realistically), the existing culture should be a target of remediation.

Those century-old cast-iron water mains would have failed sooner or later. In fact, they still do on a regular basis. However, a well-intentioned policy change, meant to address a situation unrelated to the health of the infrastructure and which may or may not have taken that infrastructure’s decrepit, degraded condition into account, created some chaotic water main blowouts that summer. If you’re looking at a production incident and your “root cause” is singular, whether it’s just one event, one pre-existing condition, or a simple combination of the two, you’re not going to prevent anything in the future, except perhaps effective incident prevention itself.

(P.S. You should read John Allspaw’s more definitive writings on incident analysis.)

Fake It Until Ik Spreek Het

In the story “The Page Who Feigned to Know the Speech of Birds” from 1001 Nights, a servant overhears his rich boss telling his wife that she should spend the following day relaxing in their garden. That night, the servant sneaks into the garden, placing several items.

The next day, the servant accompanied the lady to the garden. As they walked around, a crow cawed out. The servant thanked the bird and told the lady the bird had said they could find food under a nearby tree, which she should eat. Since the lady was apparently not too bright, she took this to mean that the servant could understand the birds’ language. The next time the crow cawed, she asked the servant to translate. He replied that she could find some wine under another nearby tree. Drinking the wine that was, in fact, nearby, the lady became even more impressed with the servant. The third time the crow cawed, the servant thanked it and told the lady the bird had told him there were sweets under yet another tree, by which time she found the servant completely fascinating.

The next time the crow cawed, the servant threw a rock at it. The lady asked him why he would do that, and he replied…

Ok, this story gets a little adult here, so you can go read the ending on your own if so inclined.

I recently gave an online talk on zero trust architectures in Kubernetes for Cloud Native Day. The event was based out of Québec, and although I was told they didn’t require bilingual slides, I decided to try my hand at them anyway.

I am by no means fluent in French, although I took French all through high school, and in the past couple years, I’ve been practicing on Duolingo to refresh and update it. My skills are mostly along the lines of « Je peux probablement trouver mon hôtel, commander mon dîner et m’excuser pour mon terrible accent » (“I can probably find my hotel, order dinner, and apologize profusely for my horrible accent.”) (I’m also learning Dutch and Spanish on Duolingo, hence the wonky English/Dutch bilingual post title.)

Embedded slide deck from Trust No 8 / Ne faisez pas confiance à 8

Here are some tips if you ever happen to find yourself in the position of preparing bilingual slides for a technical talk when you are familiar with, but not fluent in, the second language.

  • Puns probably don’t translate. I gave my talk a (very bad) title before I realized I was going to try to make it at least a little bilingual-friendly.
  • Keep the slides simple. Really, most recommendations for technical slide content say you should limit the amount of text on slides. People should be listening, not reading. Tersely-worded slides make even more sense when balancing two languages. Avoid idioms or other non-literal phrasings that are unlikely to translate well.

    My slides feature a mix of my high-quality stick figure illustrations and some small groups of bullet points.
  • Have a native or fluent speaker check your translations. Hopefully you can find someone with a technical background, but if not, even just simple grammar and spelling checks are useful.

But how do you find the accepted translations for technical terms? Often, even in somewhat closely related languages, the accepted translation for a term may not be the literal translation. For example, the Dutch term for “peanut butter” is pindakaas, which literally translates to “peanut cheese.”

Finding accepted translations for uncommon technical jargon can require some digging. I was writing about zero trust networks, Kubernetes, and Istio, so I did a lot of googling and ended up using a mix of the following sites and methods:

  • While the Istio docs only come in English and Chinese, the official Kubernetes documentation comes in many translations, although not all pages are translated for all languages. Check the docs for the technology you’re covering to see if there are translations. You don’t even need to be able to comprehend everything, only to pick out the phrasings used for the concepts you want to cover.
  • If the official docs don’t cover your language, try finding hits on documentation sites of large, multi-national companies which may use or leverage the tech in question. One of my page hits for Istio was on the IBM Cloud site. They have a language pull-down menu in the page footer, so I switched to French and got some useful jargon translations there.
  • Modify your Google search settings to return pages in the language you need. This won’t be immediately useful unless you also disable English, because most page hits will likely be in English. However, once you start using the translated terms you’ve been able to find, you will start getting hits in the second language.
  • Once you start finding the key terms you need, you may want to double-check that they are the most commonly used by googling those and making sure you get a good number of legitimate hits back.
  • Reverso is not a technical site, but they have a huge database of examples in actual texts, so you may be able to find localized translations for some terms you need there.
  • And of course there’s Google Translate, but even for the most popular languages, its translations for all but the simplest phrases still feel unnatural if not plain wrong.

So, that’s it! Alors, c’est tout !
