CrowdStrike downtime apparently caused by update that replaced a file with 42kb of zeroes

Aatube · 2 years ago (edited)

@tiramichu@lemm.ee · 2 years ago

If I send you on stage at the Olympic Games opening ceremony with a sealed envelope

And I say “This contains your script, just open it and read it”

And then when you open it, the script is blank

You’re gonna freak out

@Gork@lemm.ee · 2 years ago

Ah, makes sense. I guess a driver would completely freak out if that file gave no instructions and was just like “…”

@PriorityMotif@lemmy.world · 2 years ago

deleted by creator

@planish@sh.itjust.works · 2 years ago

That’s what the BSOD is. It tries to bring the system back to a nice safe freshly-booted state where e.g. the fans are running and the GPU is not happily drawing several kilowatts and trying to catch fire.

TimeSquirrel · 2 years ago (edited)

No try-catch, no early exit condition checking and return, just nuke the system and start over?

Aatube · 2 years ago

what do you propose, run faulty code that could maybe actually nuke your system, not just memory but storage as well?

Kogasa · 2 years ago

Catch and then what? Return to what?

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 2 years ago

Windows assumes that you installed that AV for a reason. If it suddenly faults, who’s to say it’s a bug and not some virus going ham on the AV? A BSOD is the most graceful exit you could do, ignoring and booting a potentially compromised system is a fairly big no-no (especially in systems that feel the need to install AV like this in the first place).

Morphit · 2 years ago

A page fault can be what triggers a catch, but you can’t unwind what a loaded module (the Crowdstrike driver) did before it crashed. It could have messed with Windows kernel internals and left them in a state that is not safe to continue. Rather than potentially damage the system, Windows stops with a BSOD. The only solution would be to not allow code to be loaded into the kernel at all, but that would make hardware drivers basically impossible.

@reddit_sux@lemmy.world · 2 years ago

BSOD is the ultimate catch statement of the OS. It will gracefully close all open data streams and exit. Of course it is not the usual exit, so it gives a graphic representation of what might have gone wrong.

If it were nuking the system, it wouldn’t show anything.

@Kaboom@reddthat.com · 2 years ago

For most things, yes. But if someone were to compromise the file, stopping as soon as the driver sees it is invalid is probably a good idea for security.

@deadbeef79000@lemmy.nz · 2 years ago

Except “freak out” could have various manifestations.

In this case it was “burn down the venue”.

It should have been “I’m sorry, there’s been an issue, let’s move on to the next speaker”

@Strykker@programming.dev · 2 years ago

Except since it was antivirus software, the system is basically told “I must be running for you to finish booting”, which does make sense, as it means the antivirus can watch the system before any malicious code can get its hooks into things.

Morphit · 2 years ago

I don’t think the kernel could continue like that. The driver runs in kernel mode and took a null pointer exception. The kernel can’t know how badly it’s been screwed by that, the only feasible option is to BSOD.

The driver itself is where the error handling should take place. First off it ought to have static checks to prove it can’t have trivial memory errors like this. Secondly, if a configuration file fails to load, it should make a determination about whether it’s safe to continue or halt the system to prevent a potential exploit. You know, instead of shitting its pants and letting Windows handle it.

@the_crotch@sh.itjust.works · 2 years ago

> In this case it was “burn down the venue”.

It was more like “barricade the doors until a swat team sniper gets a clear shot at you”.

@deadbeef79000@lemmy.nz · 2 years ago

Hmmmm.

More like standing there and loudly shitting your pants and spreading it around the stage.

@Thann@lemmy.ml · 2 years ago

The envelope contains a barrel of diesel and a lit flare

@OozingPositron@feddit.cl · 2 years ago

Computers have social anxiety.

@tiramichu@lemm.ee · 2 years ago

You’re right of course and that should be on Microsoft to better implement their driver loading. But yes.

Morphit · 2 years ago

The driver is in kernel mode. If it crashes, the kernel has no idea if any internal structures have been left in an inconsistent state. If it doesn’t halt then it has the potential to cause all sorts of damage.

@sigmaklimgrindset@sopuli.xyz · 2 years ago

Great layman’s explanation.

@Imgonnatrythis@sh.itjust.works · 2 years ago

Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pommel horse will be replaced by yours truly!”

@Takios@discuss.tchncs.de · 2 years ago

Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.

@deadbeef79000@lemmy.nz · 2 years ago

Poorly written code can’t.

In this case:

1. Load config data
2. If data is valid: use config data
3. If data is invalid: crash entire OS

Is just poor code.
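
For what it’s worth, the other branch is not much code. This is only a rough user-space sketch in Python (a real kernel driver would be C and look very different), and the `MAGIC` header, the function names and the “return nothing instead of crashing” behaviour are all made up for illustration. Nothing in this thread describes the real file format.

```python
from typing import Optional

# Hypothetical 4-byte header; the real file format is not described in this thread.
MAGIC = b"CFG1"


def is_valid_config(data: bytes) -> bool:
    """Cheap sanity checks before anything tries to interpret the bytes."""
    if len(data) <= len(MAGIC):
        return False  # empty or truncated
    if not any(data):
        return False  # every byte is zero
    return data.startswith(MAGIC)


def load_config(data: bytes) -> Optional[bytes]:
    """Return the payload if the data looks sane, otherwise None (no crash)."""
    if not is_valid_config(data):
        return None
    return data[len(MAGIC):]
```

With this, `load_config(bytes(42 * 1024))` (42 KB of zeroes) comes back as `None`, and the caller gets to decide what “invalid” should mean instead of falling over.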

@5C5C5C@programming.dev · 2 years ago

When talking about the driver level, you can’t always just proceed to the next thing when an error happens.

Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.

In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held for ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.

The real answer here is crowdstrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

Morphit · 2 years ago

This error isn’t intentionally crashing because of a security risk, though that could happen. It’s a null pointer exception, so evidently there were no static or runtime checks in place that could have prevented it or handled it more gracefully. This was presumably a bug in the driver for a long time, then a faulty config file came along and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect; how these live config updates are deployed and monitored is another.

@deadbeef79000@lemmy.nz · 2 years ago

> But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

In which case this should’ve been documented behaviour and probably configurable.

@CeeBee_Eh@lemmy.world · 2 years ago (edited)

That’s a bad analogy. CrowdStrike’s driver encountering an error isn’t the same as not having disk IO or a memory corruption. If CrowdStrike’s driver ~~didn’t load at all~~ wasn’t installed the system could still boot.

It should absolutely be expected that if the CrowdStrike driver itself encounters an error, there should be a process that allows the system to gracefully recover. The issue is that CrowdStrike likely thought of their code as not being able to crash as they likely only ever tested with good configs, and thus never considered a graceful failure of their driver.

@5C5C5C@programming.dev · 2 years ago

I don’t doubt that in this case it’s both silly and unacceptable that their driver was having this catastrophic failure, and it was probably caused by systemic failure at the company, likely driven by hubris and/or cost-cutting measures.

Although I wouldn’t take it as a given that the system should be allowed to continue if the anti-virus doesn’t load properly more generally.

For an enterprise business system, it’s entirely plausible that if a crucial anti-virus driver can’t load properly then the system itself may be compromised by malware, or at the very least the system may be unacceptably vulnerable to malware if it’s allowed to finish booting. At that point the risk of harm that may come from allowing the system to continue booting could outweigh the cost of demanding manual intervention.

In this specific case, given the scale and fallout of the failure, it probably would’ve been preferable to let the system continue booting to a point where it could receive a new update, but all I’m saying is that I’m not surprised more generally that an OS just goes ahead and treats an anti-virus driver failure as BSOD-worthy.

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 2 years ago

If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.

@CeeBee_Eh@lemmy.world · 2 years ago

You know there’s a whole other scenario where the system can simply boot the last known good config.

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 2 years ago

And what guarantees that that “last known good config” is available, not compromised and there’s no malicious actor trying to force the system to use a config that has a vulnerability?

@CeeBee_Eh@lemmy.world · 2 years ago (edited)

The following:

An internal backup of previous configs
Encrypted copies
Massive warnings in the system that current loaded config has failed integrity check

There’s a load of other checks that could be employed. This is literally no different than securing the OS itself.

This is essentially a solved problem, but even then it’s impossible to make any system 100% secure. As the person you replied to said: “this is poor code”

Edit: just to add, failure for the system to boot should NEVER be the desired outcome. Especially when the party implementing that is a 3rd party service. The people who set up these servers are expecting them to operate for things to work. Nothing is gained from a non-booting critical system and there’s literally EVERYTHING to lose. If it’s critical then it must be operational.

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 2 years ago

The 3rd party service is AV. You do not want to boot a potentially compromised or insecure system that is unable to start its AV properly, and have it potentially access other critical systems. That’s a recipe for a perhaps more local but also more painful disaster. It makes sense that a critical enterprise system does not boot if something is off. No AV means the system is a security risk and should not boot and connect to other critical/sensitive systems, period.

These sorts of errors should be alleviated through backup systems and prevented by not auto-updating these sorts of systems.

Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

@Takios@discuss.tchncs.de · 2 years ago

I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.

The code is probably just:

1. Load config data
2. Do something with data

And 2 fails unexpectedly because the data is garbage and wasn’t checked if it’s valid.

Morphit · 2 years ago

You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it’d be to just send an error report back to the developers that an unexpected condition is being hit and just continuing without loading that one faulty definition file.
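
Roughly what I mean, as a user-space Python sketch. In an actual kernel driver you can’t just wrap a bad memory access in a try block, so the equivalent there would be validating the data before touching it. The file names, the one-rule-per-line format and `parse_definition` are all hypothetical.

```python
from typing import Dict, List, Tuple


def parse_definition(raw: bytes) -> List[str]:
    """Hypothetical parser: one rule per line, refuses empty or all-zero input."""
    if not raw or not any(raw):
        raise ValueError("definition file is empty or all zeroes")
    # Invalid UTF-8 raises UnicodeDecodeError, which is a subclass of ValueError.
    return raw.decode("utf-8").splitlines()


def load_definitions(files: Dict[str, bytes]) -> Tuple[List[str], List[str]]:
    """Load every definition file that parses; collect errors for the rest."""
    rules: List[str] = []
    errors: List[str] = []
    for name, raw in files.items():
        try:
            rules.extend(parse_definition(raw))
        except ValueError as exc:
            # Carry on without this one file; the message is what would be
            # sent back to the developers as an error report.
            errors.append(f"{name}: {exc}")
    return rules, errors
```

Whether “carry on without it” or “refuse to boot” is the right reaction is a policy decision, but either way it’s the driver making it deliberately.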

@CeeBee_Eh@lemmy.world · 2 years ago

If there’s an error, use last known good config. So many systems do this.
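
Something like this, as a very simplified Python sketch. `choose_config`, the status strings and the `not_all_zero` check are invented for illustration, and a real implementation would need to protect the stored copy too.

```python
from typing import Callable, Optional, Tuple


def choose_config(
    new: bytes,
    last_known_good: Optional[bytes],
    is_valid: Callable[[bytes], bool],
) -> Tuple[Optional[bytes], str]:
    """Pick which config to run with, and say which one was picked."""
    if is_valid(new):
        return new, "new"
    if last_known_good is not None and is_valid(last_known_good):
        return last_known_good, "last-known-good"
    # Nothing usable: the caller decides whether to run without a config
    # or to refuse to start.
    return None, "none"


def not_all_zero(data: bytes) -> bool:
    """Stand-in validity check: at least one non-zero byte."""
    return any(data)
```

So `choose_config(bytes(8), b"old", not_all_zero)` falls back to the old copy, and the status string can drive a warning that the current update was rejected.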

@ToyDork · 2 years ago

Unfortunately, an OS that covers such cases is a lost monetization opportunity, fuck the system, use a Linux distro, you get the idea. Microsoft makes money off of tech support for people too unversed in computers to fix it themselves.

@Hazzia@infosec.pub · 2 years ago

I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?

@Couldbealeotard@lemmy.world · 2 years ago

What are the chances that Crowdstrike started using AI to do their update deployments, and they just won’t admit it?

@Cocodapuf@lemmy.world · 2 years ago

I’m nominating this for the “best metaphor of the day” award.

Well done!

DigitalDilemma · 2 years ago

Nice analogy, except you’d check the script before you tried to use it. Computers are really good at crc/hash checking files to verify their integrity, and that’s exactly what a privileged process like antivirus should do with every source of information.
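
A minimal sketch of that check in Python, using the standard `hashlib` and `hmac` modules. The helper names are mine, and it assumes the expected digest arrives through some trusted channel, which is the hard part in practice.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()


def matches_digest(data: bytes, expected_hex: str) -> bool:
    """True if the data hashes to the expected digest (constant-time compare)."""
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())
```

Usage would be along the lines of `if not matches_digest(file_bytes, published_digest): refuse_to_load()`, where `refuse_to_load` is whatever the driver’s policy is. Note that a hash only catches a file that changed after it was hashed; if the digest was computed over an already-broken file, it will happily match.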

@JasonDJ@lemmy.zip · 2 years ago (edited)

The funny bit is, I’m sure more than a few people at Crowdstrike are preparing 3 envelopes right now.

@crystalmerchant@lemmy.world · 2 years ago

This guy ELI5s

@CeeBee_Eh@lemmy.world · 2 years ago

Ah yes. So Windows is the screaming in terror version and other systems are the “oh, sorry everyone, looks like there’s an error. Let’s just move on to the next bit” version.

CrowdStrike downtime apparently caused by update that replaced a file with 42kb of zeroes

christian_taillon (@christian_tail)