CrowdStrike Didn’t Take Its Own Advice

and the whole world paid for it.

Late last night, Windows machines just about everywhere started rebooting and crashing to a Recovery version of the infamous Blue Screen of Death. They didn’t come back. It wasn’t all Windows machines, mind you, but it was many machines, and it was difficult to find any rhyme or reason to it.

Photos started to emerge on social media. Laptops around a conference room table in Asia uniformly showed the Recovery screen to the amusement and bewilderment of employees. Arrivals and departures vanished from airline monitors as travelers looked on at the BSoD in confusion. Australian self-checkout lanes shut down. Rumors emerged about a global cyberattack, and it looked like the most successful one in history.

The damage was only beginning. Medical prescription and patient record systems failed. Some hospitals initiated a “diversion” protocol in which inbound or arriving patients are turned away due to a concern about the facility’s ability to safely treat them. The New York Times reported 911 emergency service disruption in multiple states. Security expert Troy Hunt would go on to describe the event as being similar to “Y2K, except it’s actually happened this time.”

“Not quite an update”

By morning, major players in computing (including Google) recognized a pattern in which Windows virtual machines using CrowdStrike security software were crashing and unable to reboot. CrowdStrike confirmed that a faulty update to their CrowdStrike Falcon agent was distributed at 4:09 UTC, saying 80% of Windows VMs affected would “self-heal” during a reboot.

It’s unclear how many did.

Apparently the specific issue was that CrowdStrike made a “content change” that caused subscribed machines to acquire a corrupted kernel driver file. That file borked their ability to start up at just about the deepest level, which is why the mistake took out pretty much every machine it reached. If that sounds to you like “a bad update went out,” I’d say that’s a pretty valid assessment, and we’ll get into that shortly.

In estimating just how many machines went down, Wired pointed to market research firm Gartner, which says CrowdStrike accounts for about 14 percent of revenue in the security software market. While most people had never heard the name a week ago, this means the company was indeed in a position where one false move could ripple outward and touch business services and devices in every part of day-to-day life worldwide.

So shouldn’t this have been, y’know, not able to happen so easily?

It’s common in the business world to operate under strict rules against installing every operating system and vendor update the moment it becomes available. Admins tend to push out patches and new versions on their own schedule, after they’ve run whatever testing they wish and had a chance to observe how the change is doing elsewhere in the wild.

This, on the other hand, didn’t even sound like an auto-update situation. It was much more like Thanos snapping his fingers with the Infinity Gauntlet and wiping out half of the world’s technology.

I visited the CrowdStrike subreddit to see what I could learn about this part. r/crowdstrike has years’ worth of questions and advice from corporate admins, and the group is moderated by what appears to be a CrowdStrike employee. Four years ago, it just so happens that someone asked about this very scenario. The wording is a little difficult to follow, but the essence of the question is, what are the chances of a bad automatic update?

The CrowdStrike moderator weighed in, revealing that CrowdStrike pushes these changes about every two weeks. Admins can configure the software to update fully automatically, or they can take advantage of settings referred to as “N-1” or “N-2,” meaning that the software stays one or two updates behind the latest. This has the advantage of updating automatically while still trailing back a few weeks in case of, well, the very scenario that just happened.

“I personally recommend my clients use N-1 for all but their critical machines/services,” the employee/mod says. In effect, everyone agrees that running automatic updates in production with no human intervention is almost never the right approach.
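To make the N-1 and N-2 idea concrete, here is a minimal sketch of how a trailing-version policy could pick which release a machine runs. It is only an illustration of the concept as the moderator describes it: the function name `target_version`, the `lag` parameter, and the version strings are all made up, and nothing here reflects how the Falcon agent actually implements its policies.

```python
# Hypothetical release history, oldest first. These version strings are invented.
RELEASES = ["7.13", "7.14", "7.15", "7.16"]


def target_version(releases: list[str], lag: int) -> str:
    """Return the release a host should run under an "N minus lag" policy.

    lag=0 means the latest release (N), lag=1 means N-1, lag=2 means N-2.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if not releases:
        raise ValueError("no releases available")
    # Clamp the index so a short history never reaches past the oldest release.
    index = max(len(releases) - 1 - lag, 0)
    return releases[index]


if __name__ == "__main__":
    for lag in (0, 1, 2):
        label = "N" if lag == 0 else f"N-{lag}"
        print(f"{label}: {target_version(RELEASES, lag)}")
```

The point of the lag is time: with a release roughly every two weeks, an N-1 machine only picks up a given version after everyone on N has already been running it for about that long.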

This led to a new question: did this many administrators around the globe have the bad idea setting turned on?

Probably not.

Up until earlier today, CrowdStrike’s “Director of OverWatch” (I have no idea) was present and relatively active on Twitter, and even tried helping Earth navigate the disaster after it began. In a now-deleted Tweet, he says, “There is a faulty channel file, so not quite an update.” He goes on to share an early workaround which may depend on your device and the state of the rest of your network.

“Not quite an update.”

They did a thing, and devices everywhere got a corrupted file. I’ve seen it called a “content change” and a “config change,” but the CrowdStrike team seems a little hesitant just to call it an “update.” Probably because if they did, admins everywhere would rightly object. These are the admins who were doing what they were supposed to, who were promised complete control over how and when they accepted changes, and who somehow still got that angry phone call from their employers in the middle of the night. They would point out that the change should have gone through the official update pipeline, and that everyone under an N-1 or N-2 policy should have been protected from it.
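The distinction matters because of who each kind of change reaches. The toy model below is my own sketch of the argument, not a description of CrowdStrike’s real delivery mechanism: it assumes versioned releases respect each host’s trailing policy while “content” files go to every host regardless. The `Host` class, the host names, and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    lag: int  # 0 = latest (N), 1 = N-1, 2 = N-2


def hosts_receiving_release(hosts: list[Host], releases_behind_latest: int) -> list[str]:
    """Hosts whose policy lets them run a release this many versions behind the latest.

    A brand-new release is 0 behind, so only lag-0 hosts take it right away.
    """
    return [h.name for h in hosts if h.lag <= releases_behind_latest]


def hosts_receiving_content(hosts: list[Host]) -> list[str]:
    """Assumption for this sketch: content files ignore the lag policy entirely."""
    return [h.name for h in hosts]


# Invented fleet for illustration.
FLEET = [Host("kiosk", 0), Host("database", 1), Host("er-terminal", 2)]

if __name__ == "__main__":
    print("new versioned release reaches:", hosts_receiving_release(FLEET, 0))
    print("new content file reaches:     ", hosts_receiving_content(FLEET))
```

Under those assumptions, a bad versioned release hits only the machines set to take the latest, while a bad content file hits the whole fleet at once, cautious policy or not.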

It sounds to me like CrowdStrike has a very sensible update system in place, and even a pretty healthy mindset about how to use it.

It seems like they just didn’t.

Was it about speed?

One thing you need to know about CrowdStrike’s CEO, George Kurtz, is that he likes to go fast. He moved from accounting into security at Price Waterhouse after college, and in the late 90s co-wrote Hacking Exposed, which I may even have a copy of somewhere downstairs. This early success in security led him to start Foundstone, his own security consulting business, which later joined the McAfee family.

In a rather on-the-nose profile in The Straits Times titled CrowdStrike’s George Kurtz is in a race against hackers, Kurtz tells the editor about the time he saw a fellow passenger on a flight take out his laptop and wait for Kurtz’s own McAfee software to run for fifteen minutes before the machine was ready for use. This bothered him. Before long, Kurtz resigned to work on launching CrowdStrike.

Today, George Kurtz’s net worth is estimated above $3.2 billion. Speed is still his obsession. He collects fast cars and drives for CrowdStrike Racing with pretty competitive results. In 2020, Kurtz turned heads after telling a panel audience that one of the biggest lessons he’s learned as a CEO is that he never regrets firing someone too fast, only waiting too long.

As for CrowdStrike itself, the company advertises threat detection rates between six and eleven times faster than competitors. The Los Angeles Times has described continuous, automatic updates as part of the service CrowdStrike provides.

It’s the race against hackers, and it’s Tech School 101: you’re hoping to take the weekend off and sleep tight tonight. Hackers won’t rest and might not eat until they’ve broken into your system. There’s not a moment to lose. Kurtz and CrowdStrike all know this, and it seems maybe they’re not leaving everything up to system admins after all.

What now?

As we head into the coming weeks (and, probably, future hearings), I find myself wondering whether CrowdStrike will lean further into the position that this was not technically a software update, just a configuration change that somehow caused an unwanted file to appear on crashed devices. I sort of anticipate a you-can’t-handle-the-truth-style monologue about the scary adversaries stalking us from the dark web night and day, waiting for the opportunity to strike, which in turn makes Kurtz and CrowdStrike’s work necessary, so kindly get off their backs. He tested those waters this morning in his interview on TODAY. Unfortunately for his defense, CrowdStrike accidentally hit us all about as hard as we’ve ever been hit.

And yet, CrowdStrike’s emphasis on speed is not without merit, and that complicates things. Security breaches do take place in the blink of an eye, and being slow and steady with system changes does come with its own set of risks. It’s difficult to imagine a new law or regulation for the industry that doesn’t stifle open competition or further slow down the process.

But we just watched one SYS file bring the world to a screeching halt, and we have to do better than that.


Todd Mitchell
