Context- someone on the birdsite is blaming #crowdstrike on DEI hiring
Here’s the thing folks. I’ve been coding 32 years. When something like this happens it’s an organizational failure. Yes, some human wrote a bad line. Someone can “git blame” and point to a human and it’s awful. But it’s the testing, the CI/CD, the A/B testing, the metered rollouts, an oh shit button to roll it back, the code coverage, the static analysis tools, the code reviews, the organizational health, and on and on 1/3
It’s always one line of code but it’s NEVER one person. Implying inclusion policies caused a bug is simplistic, reductive, and racist. Engineering is a team sport. Inclusion makes for good teams. Good engineering practices make for good software. Engineering practices failed to find a bug multiple times, regardless of the seniority of the human who checked that code in. Solving the larger systems-thinking SDLC problem matters more than the null pointer check. 2/3
This isn’t a “git gud C++ is hard” issue and it damn well isn’t a DEI one. 3/3
+1 for psychological safety in the workplace.
If you're the work experience student who doesn't understand how the organisation establishes confidence in their deployable entity before rollout, how comfortable are you asking someone senior? What if the answer doesn't make sense? Do you feel safe pushing the issue?
There may be a glaring gap in the organisation's way of working, and your perspective may mean you're the only one who sees it. Do you know you're safe to flag it?
@shanselman Yeah, this is more like, "did y'all even fucking test this thing on an actual workstation? Like, sure, it passed whatever unit tests, but why didn't y'all just install it on a test machine/vm and make sure it didn't crash windows somehow?"
@digitalCalibrator @shanselman it’s possible that they did just straight up push from someone’s laptop to the CDN, but it’s also possible that the system they had used in good faith for a long time and had caught a lot of problems had a latent swiss-cheesey flaw in it. I think this is much much more likely to be a systemic flaw than one of individual or small-group malpractice or incompetence
@shaver @digitalCalibrator @shanselman Or watch and see comments from actual employees talking about the under funding of the testing group.
@alison @digitalCalibrator @shanselman as someone who has had to decide funding for a testing group I think that they are all underfunded against a defect-detection rate of 100%, but also generally misused in terms of applying them to the most important risks rather than the things that are easiest to test. it is an especially easy trap to say “our quality is good, we don’t need our QA tool investment to be this high” and then…it’s not good any more
@shaver @alison @digitalCalibrator @shanselman The right way to roll out a potentially crippling change (e.g. any kernel or boot code) is to target a few selected customers first, and then if it all goes perfectly, push to the rest. So yeah, this was an organizational systemic problem. The responsibility falls entirely on management.
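For anyone who wants the "few selected customers first" idea in concrete form, here is a minimal sketch of a staged rollout. Everything in it is an assumption for illustration: the `staged_rollout` name, the `deploy` callback, and the 1% / 10% / 100% stages are hypothetical and say nothing about how CrowdStrike actually ships updates.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_percentages: Sequence[int] = (1, 10, 100),
) -> list[str]:
    """Deploy to growing slices of hosts and stop after the first failed stage.

    `deploy` is a hypothetical callback that returns True when a host came
    back healthy after the update. Returns the hosts that received it.
    """
    updated: list[str] = []
    for percentage in stage_percentages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percentage // 100)
        batch = list(hosts[len(updated):target])
        results = [deploy(host) for host in batch]
        updated.extend(batch)
        if not all(results):
            # Any unhealthy host halts the rollout before the next, larger stage.
            break
    return updated
```

With 100 hosts and an update that breaks every machine, only the single first-stage host is touched before the rollout stops.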
@digitalCalibrator @shanselman It's very likely it passed there, too.
@shanselman And the telemetry, don't forget the telemetry. This wouldn't have happened (well it would have, but on a far smaller scale) if a random subset of machines would randomly connect to their servers and acknowledge how many restarts they had in the last n minutes. They should have had a system in place to pause rollouts if too few machines were connecting or too many of them reported crashes, both would have helped here.
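A rough sketch of the kind of pause check described above, assuming updated machines send a small heartbeat with their recent restart count. The `Heartbeat` record, the `should_pause_rollout` function, and all thresholds are hypothetical, picked only to show the two signals (too few check-ins, too many crash reports).

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    host: str
    restarts_last_n_minutes: int


def should_pause_rollout(
    updated_hosts: int,
    heartbeats: list[Heartbeat],
    min_checkin_ratio: float = 0.9,
    max_crash_ratio: float = 0.05,
    restart_threshold: int = 2,
) -> bool:
    """Return True if the rollout should pause.

    Two independent signals, both measured against the number of hosts
    that have received the update so far:
    - too few of them have checked in (machines may be stuck in a boot loop)
    - too many of them report repeated restarts
    """
    if updated_hosts == 0:
        return False
    checkin_ratio = len(heartbeats) / updated_hosts
    crashing = sum(
        1 for hb in heartbeats if hb.restarts_last_n_minutes >= restart_threshold
    )
    crash_ratio = crashing / updated_hosts
    return checkin_ratio < min_checkin_ratio or crash_ratio > max_crash_ratio
```

The silence check matters as much as the crash check: a machine that cannot boot never gets to report its own crashes.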
@shanselman The amount of lines I didn't write I'd be responsible for...
If `git blame` was a reliable source of truth, no refactoring should be done.
However, I've weirdly recently seen a lot of this DEI-blaming. Is this the latest right-wing strategy? Think they were saying that for the secret service as well for recent events...
@jesper
I think it is part of the latest game play alongside ESG shaming.
@shanselman
@jesper @shanselman It's a pretty ubiquitous right-wing strategy these days, yes. Comes up all the time in aviation and higher ed.
@jesper @shanselman they're just using DEI as a dog whistle for racism. They can't use the N word openly or blame it on women, so they group all their grievances behind that acronym and call it a day.
@jesper @shanselman it's an interesting strategy. When times were less "woke", we had the ILOVEYOU worm amongst other big tech issues, or economies were ruined for years because of bugs/features in the banking system, or we had a lot of dead Kennedys. But somehow DEI is an evil that is now to blame for rarer, less severe events?
DEI and ESG have both become the boogeyman for an unhinged anti-democracy movement, funded by billionaires.
Gets blamed for everything nowadays. It's an effort to roll back 20th century progress.
https://www.cnn.com/2024/07/17/media/secret-service-agents-women-trump-shooting/index.html
https://www.theguardian.com/culture/2024/apr/21/dei-language-conservatives-baltimore
https://www.washingtonpost.com/technology/2024/02/10/bill-ackman-end-dei-industry/
https://www.newyorker.com/news/our-columnists/the-campaign-against-dei
https://www.motherjones.com/politics/2024/01/woke-capital-vivek-ramaswamy-esg-capitalism-finance/
https://slate.com/business/2022/06/woke-capitalism-esg-investing-republicans-mike-pence.html
https://www.cnn.com/2024/03/17/politics/dark-money-fga-ashcroft-invs/index.html
https://www.theguardian.com/us-news/2023/jun/22/rightwing-war-on-woke-capitalism-industry-interests
@Npars01 @jesper @shanselman Not just 20th / 21st century progress. These Christofascists want to roll back the Enlightenment and return to 17th century radical Calvinism.
@shanselman But, it's so much easier to say, "it's magic," than understand the complexity of zero-trust and everyone's role in such an environment.
@shanselman Project 2025 has started early it seems.
@shanselman as a disabled dev who made a point not to be a diversity hire (my experience has always been that disclosing my disability isn’t advantageous), I was going to write up a whole thing.
Ultimately, I don’t think I have much to add, besides wondering how much attention we should pay to the birdsite, given its continued decay, and reinforcing that good processes help experienced and inexperienced people alike, and only impossibly good people can consistently push through bad processes.
@shanselman these are all fundamentally economic, which is to say “resource allocation” and “risk allocation” problems. they’re the human flavour of undecidable, because of difficulties we have working with risks and apportioning costs. most successful companies are successful on the basis of externalizing costs (risks) outside their customers’ businesses, and the “best” ones keep those costs outside their own balance sheet too
@shanselman Crowdstrike itself is lovely to consider, because the same commons that gives it value (visibility into measurements from millions of computers) is also its chief form of societal danger (extremely correlated risk across those computers)
@shanselman Crowdstrike could do things to decorrelate that risk (at some cost), or All Of Us Who Need Computers could “socialize” the value to remove the link to the correlation of risk, but we are really not good at ecosystem thinking
@shanselman See also…everything, yeah, hmm
@shanselman honestly diverse hiring is more likely to solve these kinds of problems as it means you have more people with different experiences and viewpoints that might see things differently and point out issues that others have missed
As long as management hasn't beat it into them that they aren’t allowed to speak up that is
I saw an unsubstantiated claim this morning that the "channel file" to delete was full of NULs. If true, then the failure cannot even be down to a bad line of code, as it would also involve whatever tool generated the bad data file going very wrong.
However:
I am inclined to disbelieve this claim, as (for starters) this would result in a file without a valid PE header and a "channel file" is notionally an NT driver file as far as I can tell.
FWIW Crowdstrike posted an update that specifically refutes that hypothesis.
“This is not related to null bytes contained within Channel File 291 or any other Channel File.”
https://www.crowdstrike.com/blog/technical-details-on-todays-outage/
Yes. I caught up on that after I had caught up on the FediVerse posts.
I have a suspicion that the NULs thing is one of those Chinese Whispers distortions of someone talking about NULL pointers, in turn because they've just guessed that that was the STOP that occurred. (alas, see https://mastodonapp.uk/@JdeBP/112813708543808092, though)
I've certainly not seen an authoritative analysis of the specific crash that happens, yet, and certainly #CrowdStrike has not supplied one.
@shanselman Was thinking about this exactly. When a problem like this surfaces, it's not just a single failure that occurred but rather the culmination of several failures that coincided.
@shanselman The buck stops with the CEO, right? Given his history of incompetence I’m happy to hold him responsible. Rich white dudes who continue to fail upward is the DEI we have.
@disappearinjon @shanselman shhhh, you're not supposed to notice that
@shanselman exactly. Similar thread on reddit claiming it's because of RTO mandates.
It's bad luck and bad practices on both the Crowdstrike side *and* their clients who were running automatic updates/patches in prod with no staging testing first. It's everyone's fault, not one person.
@shanselman even the fact this is a "one line" failure misrepresents the problem. There were expectations in that file that were violated elsewhere - which is why the fix wasn't to that file, it was to the files that had the nulls that line was reading.
Is it an issue with that file, or the expectations that the programmer coded to?
Equally, DEI is a response within certain organisations to address the fact that the expectations of a meritocracy are violated by a number of systemic issues outside those organisations. DEI is only a problem in that we need to validate inputs from an environment hostile to minorities, which violates basic expectations that "the best" will always follow "the true path" to this career.
@craignicol @shanselman this assumes that the organization itself has not been warped by society's biases. Perhaps judging the competence of people with disparate backgrounds is an internal issue, believing the myth of meritocracy included.
Plus, DEI doesn't address class or adversity. We still have legacy admissions in higher ed & unequal opportunities. So, assuming the material differences in resources for training weren't worth more than "merit," some people would appear elite from that.
@craignicol @shanselman basically, I'm calling HR incompetent. They use lots of shortcuts based off of their own life experiences, which generally have little to do with the skills they're selecting for. And even when they do have experience from those fields, which may temper the credentialist impulse, people tend to judge people similar to themselves more favorably.
So, i agree with you. I just think dei should be characterized as a way of overcoming org's shortcomings that give rise to bias
@cykonot @shanselman oh yeah, DEI is a band aid over a myriad of systemic issues. Trying to fix those issues locally is absolutely a step in the right direction, and some of those initiatives will have a bigger impact than just in that organisation, but pretending those initiatives are anything close to resolving the systemic issues in the organisation or wider society is disingenuous at best, and *washing at worst.
In other words, DEI is a necessary reaction to the way society is currently structured, and helps to popularise the language to describe that structure, but it's nowhere near sufficient. Anyone who is attacking DEI for the minor dents it's making in the structure is someone whose identity is entwined with the status quo
@cykonot @shanselman as I said earlier, it's very hard for colonists to attack the system that provides their identity
@craignicol @shanselman an attack on something one identifies with is perceived as an attack on oneself.
Like, even if you aren't the CEO who was hired for having a history of failures like this, you may want to defend them out of credentialed solidarity. Lest your imposter syndrome blossom into something else
@craignicol @shanselman emphasizing the benefit to the organization, rather than larger social justice, tends to convince a different kind of person
@shanselman But this time, *which* "one line of code" is it?
(a) the one that got changed in the data file this week
(b) the top level "try ... catch" in the application code which did not get changed this week because it has never existed
?
@shanselman Or, or
(c) the one line of code in the regression tests to cover malformed data files, which also doesn't exist
?
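To make (b) and (c) concrete, here is a small sketch of a parser that rejects malformed input, a top-level guard that falls back to the last known good data, and the regression cases that would cover it. The file layout (a `CHNL` magic, a count, then 32-bit entries) is entirely made up for illustration, and the zero-filled file is just one example of bad input. The real thing is a kernel driver, not Python, so treat this as the shape of the idea only.

```python
MAGIC = b"CHNL"  # hypothetical 4-byte header, not the real channel file format


class MalformedChannelFile(ValueError):
    """Raised when a data file does not match the expected layout."""


def parse_channel_file(data: bytes) -> list[int]:
    """Parse MAGIC + 4-byte count + count 4-byte entries (little-endian)."""
    if len(data) < 8 or data[:4] != MAGIC:
        raise MalformedChannelFile("bad or missing header")
    count = int.from_bytes(data[4:8], "little")
    if len(data) != 8 + 4 * count:
        raise MalformedChannelFile("entry count does not match file size")
    return [int.from_bytes(data[i:i + 4], "little") for i in range(8, len(data), 4)]


def load_rules(data: bytes, last_known_good: list[int]) -> list[int]:
    # (b) the top-level guard: a bad file means "keep the old rules",
    # not "take the whole machine down".
    try:
        return parse_channel_file(data)
    except MalformedChannelFile:
        return last_known_good


def make_file(entries: list[int]) -> bytes:
    """Build a well-formed file in the hypothetical layout above."""
    body = b"".join(n.to_bytes(4, "little") for n in entries)
    return MAGIC + len(entries).to_bytes(4, "little") + body


def test_malformed_files_fall_back() -> None:
    # (c) the regression cases for malformed data files.
    fallback = [1, 2, 3]
    good = make_file([10, 20, 30])
    assert load_rules(good, fallback) == [10, 20, 30]
    assert load_rules(b"\x00" * 64, fallback) == fallback  # zero-filled file
    assert load_rules(good[:-2], fallback) == fallback     # truncated file
    assert load_rules(b"", fallback) == fallback           # empty file
```

None of the three "lines" is hard to write on its own. The point is that whether any of them exists is an organizational decision, not one person's.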
@shanselman Inclusion makes for better teams - the company I work for (in the "day job") is better for bringing in and supporting great people. Better commercially, but also makes it a better place to work.
@plwt @shanselman inclusion is a cheap way to get great software engineers. This is the engineering talent equivalent of arbitrage = free money.
@shanselman
or remote work with minimal in person coordination and overreliance upon git, slack with limited ability to get a global view
or neoliberal or VC economic pressures to keep costs low via outsourcing and contingent labor and blitz scaling...
@shanselman
I learnt that the same thing happened with the #white #crowdstrike CEO in 2010 at a different company.
Was that #DEI too?
Failing up!!!!!! @shera@mastodon.online @shanselman@hachyderm.io
@shanselman can't agree more.
@shanselman and I'm also wondering and concerned why in CRITIS organizations such updates are deployed untested. If they had tested the update on at least one system before rolling it out, nothing would have happened.
@shanselman Whether it's Boeing, CrowdStrike or any other prominent business failure, the culprit is much more likely to be shareholder pressure to cut costs and rush products than diversity in hiring.
Edit: And I suspect the shareholder community is a lot whiter than America as a whole.
@shanselman Firing the QA team (in Q4 2023, allegedly) might have something to do with #CrowdStrike issues today...