We’re still wading through the aftermath of moving, and I don’t know where half my stuff is, so I’ll keep this short if only to preserve the few shreds of sanity I have left after the past few weeks. In case you hadn’t heard, T-Mobile had a massive outage on Monday when one of their leased fiber circuits failed in the southeastern part of the country. Normally, when you operate one of the largest communication networks in the country, you have backup circuits that automatically cover any outage caused by – oh, I don’t know – someone accidentally back-hoeing through your buried fiber. In T-Mobile’s case, the fail-over itself failed, causing a cascading network tsunami that crippled their network for the better part of 11 hours.
“I thought you were going to keep this short.”
While I’m sure many lessons (and colorful words) were learned that day by T-Mobile’s network team, the one I took to heart personally was this: fail-over systems for critical infrastructure are (relatively) easy to design but inherently risky to test, because the only way to truly validate a design is to force a failure. When building a bridge, engineers can use well-known formulas based on decades of research and data to calculate just how much weight various designs can support. The key difference between a physical bridge and a resilient data network is that a bridge design created decades (or even centuries) ago will still serve to cross a gap, whereas a network architecture can become obsolete within months thanks to the pace of technological change. All this to say: everyone wants and expects technology to be infallible, when in fact the pace of change guarantees the opposite. We should have our own fail-overs for when critical infrastructure fails, because with technology, the question of failure is not “if” but “when.”
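The dynamic is easy to sketch in code. Here's a toy model (all names are hypothetical, and this is obviously not anything T-Mobile actually runs): the backup branch only ever executes when the primary is down, so the only way to prove it works is a drill that deliberately forces the primary to fail – which is exactly the part that's risky to do on a live network.

```python
class Circuit:
    """A hypothetical network circuit with a simple health flag."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def check(self):
        # On a real network this would be an active probe
        # (keepalives, BFD, etc.), not an in-memory flag.
        return self.healthy


def select_route(primary, backup):
    """Return the circuit traffic should use, preferring the primary."""
    if primary.check():
        return primary
    if backup.check():
        return backup
    # The T-Mobile failure mode: the fail-over itself fails.
    raise RuntimeError("no healthy circuit: total outage")


if __name__ == "__main__":
    primary = Circuit("leased-fiber-SE")
    backup = Circuit("backup-path")

    # Normal operation: traffic rides the primary, and the backup
    # branch below never runs -- so it's never actually proven.
    assert select_route(primary, backup) is primary

    # Chaos-style drill: force the primary down to exercise fail-over.
    primary.healthy = False
    assert select_route(primary, backup) is backup

    # And the cascading case, where the backup is broken too.
    backup.healthy = False
    try:
        select_route(primary, backup)
    except RuntimeError:
        print("cascade: both circuits down")
```

Note that in the happy path the fail-over code is dead code; without the forced-failure drill, the first time it runs is during a real outage.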