Work · 2013–2014 · 🇿🇦Cape Town

Software Developer

AWS — EC2

EC2 API team out of Amazon’s Cape Town office — a team of about eight building the external API surface for what was, at the time, one of the largest distributed systems ever run.

  • Developed the EC2 API, Caching, and Tagging services across 10 global regions.
  • On-call rotation for >1000 production hosts — incident response and root-cause analysis on a team where mistakes were measurable in customer-visible ways.
  • Led deployment of the team’s services into the new Beijing EC2 region — flew to Seattle for the launch.
  • Implemented new EC2 IAM permissions; contributed to service observability and metrics.
  • Mentored interns and conducted interviews.

Debugging the load balancers’ mystery error codes

On call for the EC2 API, I kept seeing rare error codes from our load balancers that nobody on the team could explain. Reading through the load-balancer manuals, I worked out that they signalled that the remote end (here, an EC2 API client) had hung up the connection. I traced the hang-ups to API calls timing out at 30 seconds, the default time an EC2 API client waits for a response.

The cause turned out to be old EC2 tutorials, which often showed an unpaginated call to list every public AMI (Amazon Machine Image). As the catalogue grew, those calls came to return gigabytes of data, and because we never deprecated valid EC2 API calls, they still had to work. I worked around it by caching the public AMI list on our API hosts and streaming it to clients that made these very large requests, so the data arrived well before they’d give up and hang up the connection.

The lasting outcome was a process change, a new rule in our API review: never ship an unpaginated endpoint, even when we are certain the responses will stay small.
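
For illustration, the rule can be sketched as a token-based listing call. This is a minimal sketch, not the EC2 implementation: `list_page`, `list_all`, `max_results`, and `next_token` are hypothetical names, and the token here is just an offset encoded as a string.

```python
from typing import List, Optional, Tuple


def list_page(
    items: List[str], max_results: int, next_token: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Return one bounded page plus a token for the next page (None when done)."""
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    # Assumption: the token is simply the next start offset as a string.
    start = int(next_token) if next_token is not None else 0
    end = start + max_results
    page = items[start:end]
    token = str(end) if end < len(items) else None
    return page, token


def list_all(items: List[str], max_results: int) -> List[str]:
    """Client-side loop: follow tokens until the server reports no more pages."""
    collected: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(items, max_results, token)
        collected.extend(page)
        if token is None:
            return collected
```

However large the catalogue grows, each response stays bounded by `max_results`, so no single call can grow past a client’s timeout.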

The customer who used tags as a database

I noticed that one of our largest customers at the time had a worryingly low cache hit rate on the EC2 API. Digging in, I found they were using the EC2 tagging service as a generic key-value store: they tagged their EC2 machines with build progress, as percentages, and rapidly updated those tags while they deployed. Because tags feed into EC2’s IAM authorization scheme, changing any tag invalidated that customer’s cache key entirely, so almost every one of their calls missed the cache.
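
A toy model shows why the hit rate collapsed. This is a sketch under a stated assumption, namely that the cache key is derived from the account plus a snapshot of its tags; the real key layout is not described here, and `cache_key`, `ResponseCache`, and `hit_rate` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict, List


def cache_key(account_id: str, tags: Dict[str, str]) -> str:
    """Assumed key: account id plus a canonical dump of the account's tags."""
    tag_blob = json.dumps(tags, sort_keys=True)
    return hashlib.sha256(f"{account_id}|{tag_blob}".encode()).hexdigest()


class ResponseCache:
    """Tiny in-memory cache that counts hits and misses."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


def hit_rate(tag_snapshots: List[Dict[str, str]]) -> float:
    """Make one call per tag snapshot and return the fraction served from cache."""
    cache = ResponseCache()
    for tags in tag_snapshots:
        cache.get_or_compute(cache_key("acct-1", tags), lambda: "response")
    total = cache.hits + cache.misses
    return cache.hits / total if total else 0.0
```

With stable tags, every call after the first hits the cache. With a progress tag that changes between calls, every call produces a new key and misses.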

The fix was twofold: our customer-success team reached out to ask them to stop using the tagging service as a key-value store, and we put new limits on the rate at which customers could make EC2 tagging API calls.
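
One common way to cap a call rate is a token bucket. The sketch below assumes that technique purely for illustration; the actual limiter and its limits are not described here, and `TokenBucket`, `rate_per_sec`, and `burst` are hypothetical names.

```python
class TokenBucket:
    """Allow short bursts up to `burst` calls, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True and spend a token if the call at time `now` is permitted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the clock in as `now` keeps the sketch deterministic; a real service would read a monotonic clock and keep one bucket per customer.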

My first AWS-level SEV review

We had a service-level objective: a ceiling on the P99 duration of our API calls, measured over a month. For weeks, we had watched some calls creep over that limit. The standing mitigation was to find the offending host and pull it from the rotation (sometimes just restarting the API service was enough), which made the symptom go away without anyone understanding the actual root cause.
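
Finding the offending host comes down to computing P99 per host and comparing it with the ceiling. A minimal sketch, assuming latency samples arrive as `(host, milliseconds)` pairs and using the nearest-rank percentile definition; `p99` and `hosts_over_limit` are hypothetical names, not our tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile: the smallest value covering 99% of samples."""
    if not samples:
        raise ValueError("p99 needs at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def hosts_over_limit(
    records: Iterable[Tuple[str, float]], limit_ms: float
) -> List[str]:
    """Group latencies by host and return the hosts whose P99 exceeds the limit."""
    by_host: Dict[str, List[float]] = defaultdict(list)
    for host, latency_ms in records:
        by_host[host].append(latency_ms)
    return sorted(h for h, s in by_host.items() if p99(s) > limit_ms)
```

A report like this names the slow hosts but says nothing about why they are slow, which is how pulling them from rotation could keep masking the cause.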

I was on call when we finally breached the SLO. The problem had been recurring for weeks, but instead of reaching for the usual band-aid I took ownership: I root-caused it and took it to SEV review — my first at the AWS level. The slow calls turned out to be clustered in US-East-1 and on a shifting set of hosts serving our largest customer at the time, Netflix. US-East-1, as the oldest region, carried the most legacy hardware — and we’d missed pulling that oldest tier of hosts out of the load balancer’s rotation.

Removing the oldest generation of hosts from the load balancer fixed it, and a new team rule — never run more than two generations of hardware behind the same service — kept it fixed for good.