CrowdStrike Incident

Overview

I woke up on Friday, July 19, 2024, and read that a massive IT outage was in progress, affecting airlines, financial institutions, and various other businesses worldwide. I was aware of CrowdStrike prior to this outage, and it was not at all surprising to me that something like this finally happened at this scale. While I have never been a customer of CrowdStrike, I have used similar products in the past, and the way they deployed updates always made me nervous. These updates have the potential to cripple every workstation and server in an organization with a single minor issue, and that is exactly what happened.

The outage occurred because of a bad update deployed to the CrowdStrike Falcon Sensor product, which caused millions of Windows workstations and servers to crash with a blue screen of death (BSoD). The faulty update also prevented those devices from restarting normally, so correcting the issue required manual intervention (automated recovery options are also available). According to a blog post from Microsoft, approximately 8.5 million Windows devices were affected by this issue from CrowdStrike. This is potentially the largest IT outage in history (so far), and it demonstrates how a single vendor can cripple critical infrastructure.

Full disclosure: I have never worked with CrowdStrike before, but I am familiar with their software. The solutions they provide are not unique, and other vendors offer similar products. Because it is 2024, all of these solutions are somehow “AI integrated”, at least according to their marketing teams.

Since this is an ongoing event, I will update this post as more information is revealed. The CrowdStrike CEO has been called to testify before Congress about the incident, so it will be interesting to see if that happens.

Updates

  • July 24, 2024: CrowdStrike has released a Preliminary Post Incident Review (PIR) regarding the issue.
  • July 30, 2024: Microsoft has released an update regarding the incident, and how the corrupted file was able to cause the issue in the Windows kernel in the first place.
  • August 6, 2024: CrowdStrike has released the RCA Exec Summary for the incident.

What Happened?

A lot of information has been written about this, so I will sum this up as quickly as possible:

  • CrowdStrike Falcon Sensor is an endpoint security (EDR) agent that is available for the Linux, macOS, and Windows operating systems. This issue only affected Windows, although it has been revealed that a similar issue happened with Linux systems earlier in 2024.
  • CrowdStrike pushes updates to the software as required, with definition updates and signatures to identify zero-day attacks.
  • The CrowdStrike Falcon software installs itself as a Windows kernel driver, which gives it visibility and access to the entire operating system.
  • The kernel driver loads content files provided by CrowdStrike, which happens without any interaction from the user.
  • The driver is configured as a boot-start driver, which means the operating system will not start if the driver fails to load. This is important, as it is why the affected machines could not boot properly.
  • On July 19, 2024, CrowdStrike pushed a content update that was corrupt and invalid, and loading it caused the operating system to crash.
  • Since the faulty update remained on the Windows device and the software was configured as a boot driver, the operating system was not able to start (a simplified sketch of this failure pattern follows the list).
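
To make the failure mode concrete, here is a minimal user-mode C sketch of the general pattern. This is not CrowdStrike’s actual code; the file layout, field names, and magic value are invented for illustration. The point is the bounds check: if a parser trusts the entry count in a content file’s header without validating it against the amount of data actually present, it will read past the end of the buffer, and in a kernel-mode driver an invalid memory access like that results in a bugcheck (BSoD) rather than a recoverable error.

```c
/*
 * Illustrative only: a user-mode sketch of the general failure pattern,
 * NOT CrowdStrike's actual code. The file layout and magic value are
 * hypothetical.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical on-disk layout: a header followed by fixed-size entries. */
struct content_header {
    uint32_t magic;
    uint32_t entry_count;   /* claimed number of entries */
};

struct content_entry {
    uint32_t rule_id;
    uint32_t flags;
};

/* Returns 0 on success, -1 if the file is malformed. */
static int load_content(const uint8_t *buf, size_t len)
{
    if (len < sizeof(struct content_header))
        return -1;                        /* too small to hold a header */

    struct content_header hdr;
    memcpy(&hdr, buf, sizeof(hdr));

    if (hdr.magic != 0x43525354u)         /* hypothetical magic value */
        return -1;

    /* The critical bounds check: does the buffer actually contain as many
     * entries as the header claims? Skipping this check and indexing into
     * the entry array anyway is the kind of out-of-bounds access that
     * crashes a kernel-mode driver. */
    size_t needed = sizeof(hdr) + (size_t)hdr.entry_count * sizeof(struct content_entry);
    if (hdr.entry_count > 100000u || needed > len)
        return -1;

    for (uint32_t i = 0; i < hdr.entry_count; i++) {
        struct content_entry e;
        memcpy(&e, buf + sizeof(hdr) + i * sizeof(e), sizeof(e));
        printf("entry %u: rule_id=%u flags=0x%x\n", i, e.rule_id, e.flags);
    }
    return 0;
}

int main(void)
{
    /* A deliberately malformed file: the header claims 1000 entries but
     * the buffer ends right after the header. */
    uint8_t bad[8] = { 0x54, 0x53, 0x52, 0x43, 0xE8, 0x03, 0x00, 0x00 };
    if (load_content(bad, sizeof(bad)) != 0)
        printf("rejected malformed content file\n");
    return 0;
}
```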

I am a fan of Dave Plummer’s YouTube channel (Dave’s Garage), and he posted a video explaining the issue and why the CrowdStrike Falcon Sensor software caused Windows to crash and require recovery:

Dave Plummer posted a follow-up video about this topic on July 24, 2024:

At the end of the day this was a massive failure on CrowdStrike’s part:

  • They pushed an update that was invalid. CrowdStrike’s QA process should have caught this before release.
  • They pushed the update to all clients at the same time instead of staggering the rollout.
  • The Falcon Sensor kernel driver did not have adequate error checking in place.

I am not here to defend Microsoft, but the amount of negative press they have received over this issue is unwarranted. Everything in the Windows operating system worked as it was supposed to; it was the software from CrowdStrike that caused the problem.

Remediation

Microsoft has posted instructions on how to fix the issue with CrowdStrike in two articles, KB5042421 and KB5042426, and CrowdStrike has posted instructions as well. Microsoft has also created a recovery tool that can automate the process with a bootable USB drive; details can be found in the KB5042429 article.

Fixing the issue is straightforward and, at a high level, involves the following steps (a scripted sketch of step 3 appears after the list):

  1. Boot Windows into Safe Mode.
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike folder.
  3. Delete the file(s) matching C-00000291*.sys.
  4. Reboot the device.
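
For organizations with many affected machines, step 3 can be scripted. Below is a sketch in C of just that step using the Win32 API; it assumes it is run with administrator rights from an environment where the system volume is accessible (for example Safe Mode, or recovery media after entering a BitLocker recovery key). In practice most admins used a one-line del command, a PowerShell script, or Microsoft’s recovery tool instead.

```c
/* Sketch of step 3 above: delete channel files matching C-00000291*.sys.
 * Requires administrator rights and an unlocked system volume. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *dir = "C:\\Windows\\System32\\drivers\\CrowdStrike\\";
    const char *pattern = "C:\\Windows\\System32\\drivers\\CrowdStrike\\C-00000291*.sys";

    WIN32_FIND_DATAA fd;
    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE) {
        printf("No matching channel files found (or folder not present).\n");
        return 0;
    }

    do {
        char path[MAX_PATH];
        snprintf(path, sizeof(path), "%s%s", dir, fd.cFileName);
        if (DeleteFileA(path))
            printf("Deleted %s\n", path);
        else
            printf("Failed to delete %s (error %lu)\n", path, GetLastError());
    } while (FindNextFileA(h, &fd));

    FindClose(h);
    return 0;
}
```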

Some people were also able to resolve the issue by rebooting the Windows device upwards of fifteen times.

There are a few things that can complicate this recovery, including BitLocker recovery (or any third-party encryption software) and whether the user can log in while in Safe Mode. Many organizations use tools to manage the local administrator account (such as LAPS), so a regular user would not have credentials to log in. Even if they did, they may have a limited account that cannot delete files in the Windows directory.

How to Avoid This in the Future?

There are a few ways to avoid this kind of issue in the future. The most obvious is to stagger the deployment of updates so that a bad update only reaches a small subset of users. At the same time, the company deploying the software should ensure that its QA process catches catastrophic issues like this in the testing phase, avoiding the problem altogether.
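
As a rough illustration of what a staggered rollout looks like, here is a minimal C sketch of ring-based deployment. It assumes each host has a stable identifier that can be hashed into a bucket from 0 to 99; the hosts, percentages, and hash choice are invented for the example, and this is a generic pattern rather than CrowdStrike’s actual update pipeline.

```c
/* Generic ring-based rollout sketch, not any vendor's actual pipeline. */
#include <stdint.h>
#include <stdio.h>

/* FNV-1a hash: a cheap, stable way to map a host ID to a bucket. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* A host receives the update only if its bucket is below the current
 * rollout percentage. */
static int update_enabled(const char *host_id, unsigned rollout_percent)
{
    return (fnv1a(host_id) % 100u) < rollout_percent;
}

int main(void)
{
    const char *hosts[] = { "host-0001", "host-0002", "host-0003", "host-0004" };
    /* Stage 1: canary ring (1%), then 10%, 50%, and finally everyone. */
    unsigned stages[] = { 1, 10, 50, 100 };

    for (size_t s = 0; s < sizeof(stages) / sizeof(stages[0]); s++) {
        printf("rollout at %u%%:\n", stages[s]);
        for (size_t i = 0; i < sizeof(hosts) / sizeof(hosts[0]); i++)
            printf("  %s -> %s\n", hosts[i],
                   update_enabled(hosts[i], stages[s]) ? "update" : "hold");
    }
    return 0;
}
```

A bad update caught at the 1% ring affects a small canary population instead of every customer at once.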

Another way to limit the impact of an issue like this is to avoid running the same security product on every workstation and server. More diversity ensures that a single vendor cannot crash all devices in the organization at the same time. The problem with this approach is that security companies are aggressive in the sales phase and want their software on everything.

Only time will tell how this outage will affect the IT landscape.
