Project to track follow-up tasks from the March 12th site outage.
Summary:
Previously, per T91773, the mc1014 server had been flaking out and dropping connectivity, so it was disabled to diagnose some network issues. It was flagged for service again, so I went to put it back into service with https://gerrit.wikimedia.org/r/#/c/196279/ and https://gerrit.wikimedia.org/r/#/c/196281/. Unfortunately, the changeset https://gerrit.wikimedia.org/r/#/c/196279/ had leading whitespace, which does not seem to be highlighted in red by default with the "Ignore whitespace:" setting. Jenkins gave me the +2 and the go-ahead, so I went to Tin, pulled, and issued sync-file wmf-config/session.php "enable mc1014". Shortly thereafter, users reported 503 errors on production sites. The change was reverted and things started returning to normal. A few bad responses were cached (seen in associated tickets), but overall the outage lasted under 5 minutes. The change should have been flagged as invalid before merge, or at least before sync.
Eventually, I did put mc1014 back into service successfully with https://gerrit.wikimedia.org/r/#/c/196302/
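
As a rough illustration of the kind of pre-merge or pre-sync check the summary asks for, here is a minimal Python sketch. It assumes the problematic whitespace sat before the opening `<?php` tag of the config file, which this report does not actually state. The function names `has_leading_output` and `check_files` are hypothetical and are not part of the existing deployment tooling.

```python
from pathlib import Path

# Assumption: a valid config file must begin with this tag, with nothing
# (not even a space or a newline) in front of it.
PHP_OPEN_TAG = b"<?php"


def has_leading_output(data: bytes) -> bool:
    """Return True if any bytes precede the opening <?php tag."""
    return not data.startswith(PHP_OPEN_TAG)


def check_files(paths):
    """Return the paths whose contents do not start with <?php."""
    return [str(p) for p in paths if has_leading_output(Path(p).read_bytes())]
```

A hook could call `check_files` on the files touched by a change and refuse to proceed if the returned list is not empty. Since the check only looks at the first bytes of each file, it would not catch other kinds of invalid PHP.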