I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.
@norootcause spoiler: it will turn out to be a process gap, as almost always, and people will waste a lot of time looking for an individual to blame
@norootcause one of my favorite thought exercises here is "let's say this was entirely due to one person's total failure to follow process, or even acting with malice: it's still a process gap if one single person's actions (or lack thereof) have the potential to cause this kind of outage."
It's always a process gap.
@darkuncle @norootcause
I am guessing that CrowdStrike doesn't do gradual rollout, which would seem to be a well-known best practice. (Of course, it costs extra to develop and use such a system.)
@PeterLudemann @norootcause post-mortem on this is going to be interesting
@norootcause Also DEI, Biden, etc.
@Bobsee @norootcause You forgot Bill Gates.
@rpetre True enough
@norootcause It's way too public and people need that simple answer. :(
@dtauvdiodr I wouldn’t be surprised if there was a congressional hearing!
@norootcause honestly, I'm really hoping we eventually learn what mitigations they put in place. It's all well and good for us to say you should have secondary access channels or automated rollbacks or whatever else. But that's a hard enough problem in userspace application code. How do you even do these things in the bootloader?