cascading failures

how one slow service takes down everything upstream.

it was a 200ms timeout. we thought that was fine.
  - postmortem, every major incident ever