
CrowdStrike Lessons: Liability Shields Fuel Risky Practices, Expert Warns

by Lynn Parramore | Jul 30, 2024

July 19 dawned with despair as CrowdStrike’s update sparked a seismic cybersecurity disaster.

A minuscule code glitch transformed Windows computers into ticking time bombs, causing widespread crashes that paralyzed airlines, banks, hospitals, and government agencies. The fallout was massive: over $5 billion in direct losses for Fortune 500 companies, with healthcare and banking sectors facing nearly $3 billion in combined damage. Stranded passengers, disrupted 911 centers, and delayed surgeries underscored the disaster’s profound impact. Indirect losses for people whose plans and activities were interrupted will likely run even higher.

This debacle casts a glaring spotlight on the fragility of the cybersecurity industry, a brutal reminder of the risks inherent in a market where consolidation, lack of oversight, and inadequate testing breed vulnerability. With firms like CrowdStrike holding sway over critical systems, a single misstep can set off a chain reaction of chaos. It is a wake-up call for lawmakers and regulators to step up their game.

Digital security analyst Muayyad Al-Chalabi joins the Institute for New Economic Thinking to advocate for a more resilient and diverse cybersecurity infrastructure, identifying key players responsible for widespread failures that hit ordinary people the hardest.

Lynn Parramore: What exactly happened in the wee hours of Friday, July 19th?

Muayyad Al-Chalabi: CrowdStrike, a leading cybersecurity firm, monitors global threats by collecting and analyzing data to detect anomalies and malware. Essentially, it’s a massive surveillance operation. The idea is that they do that monitoring, and they learn things in order to “protect” the end devices — your laptops, your computers — whatever that end device may be. They issue updates to look for the anomalies so that they can stop them or quarantine them.

I believe they perform these updates multiple times a day. On July 19th, one of these updates had an error that caused laptops, computers, and other end devices to enter what they call “panic mode,” resulting in a freeze and a blue screen.

LP: The dreaded “blue screen of death”—the digital equivalent of your computer yelling, “I’m melting down right before your eyes!”

MA: Correct. Once that happens, there’s a recovery process, which is manual. Fortunately, the manual process on your laptop or computer is straightforward—though it might seem hard if you’re a layperson and not familiar with moving files around.

The challenge is that large businesses and enterprises use a security feature called BitLocker that issues a key, which is about forty-eight characters long. Each machine needs its own recovery key from some IT group. Since your machine is already locked up, you can’t access it to retrieve the key. Somebody has to give you the recovery key on the phone or some other method. That’s why the recovery process became very cumbersome and it took a long time.

In summary, an error occurred, and the recovery procedure, which is manual and requires a lot of elaborate stuff, took a significant amount of time.

LP: Why did this one buggy update do so much damage?

MA: The global spread of the infection happened very quickly. It began at 4:09 AM UTC. CrowdStrike rolled out a fix at 5:27 AM UTC the same day, but in the 78 minutes it took their team to address the issue, the infection had already spread worldwide. The distribution mechanism was so efficient that it quickly affected everyone.
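The window described here can be checked with a few lines of arithmetic. The sketch below uses only the two UTC timestamps quoted above and Microsoft's figure of 8.5 million affected devices cited later in the interview. The average rate it prints rests on the assumption that every affected device received the faulty file inside that window, which the interview does not confirm.

```python
from datetime import datetime, timezone

# Timestamps quoted in the interview (UTC, July 19, 2024).
RELEASED = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIXED = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# Microsoft's estimate of affected devices, cited later in the interview.
AFFECTED_DEVICES = 8_500_000


def exposure_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes between the faulty release and the fix."""
    return int((end - start).total_seconds() // 60)


def average_devices_per_minute(devices: int, minutes: int) -> int:
    """Average hit rate, ASSUMING every affected device was hit inside the window."""
    return round(devices / minutes)


window = exposure_minutes(RELEASED, FIXED)
print(window)                                                # 78
print(average_devices_per_minute(AFFECTED_DEVICES, window))  # 108974
```

Under that assumption, the update reached on the order of a hundred thousand machines per minute, which is why a fix issued little more than an hour later came too late.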

LP: CrowdStrike is blaming the testing software that it uses for this error. Why wasn’t the testing better?

MA: I just think they got away with it for such a long time.

LP: So it wasn’t a one-off problem?

MA: No, it wasn’t. This is not the software. There are two things going on. One has to do with CrowdStrike’s engine software, called “Falcon,” the cloud-based security platform that’s supposed to keep your endpoints safe and spot threats in real time.

The other piece has to do with what they call “signatures.” These are the files that get updated to look for things. CrowdStrike calls them “sensors.” It’s a driver that sits inside the operating system. It’s that file, which is a very small file, that caused the havoc. So it’s not, I would think, an engine software error per se. Rather, it’s the file that caused the operating system, which is the Windows operating system, to malfunction.

LP: Do you think the file issue was a pre-existing problem, or did it arise specifically because of this testing error?

MA: There’s no indication that the file error happened just because they made changes to their testing procedures or they implemented testing. I don’t believe that’s the case.

LP: What’s your view of who is responsible for the fiasco?

MA: There are really four elements of responsibility. One is CrowdStrike. Second is Microsoft and the operating system. Third is the enterprises that accept these auto updates without checking on the effects they will have on them. And the fourth element is this whole regulatory system and the issue of market concentration of businesses and so on.

LP: A lot of blame to go around.

MA: Right. CrowdStrike issues these updates, and they got away with it for a long time. The interesting thing on the Microsoft side is that Microsoft has a process to do testing before a rollout, including drivers from various vendors.

It’s not just CrowdStrike that has these files; think about other companies like Netflix or the manufacturers of printers—all of these involve drivers. Microsoft has a procedure to test and validate these drivers before they’re rolled out to ensure everything works properly. However, there’s a loophole that allows some code to bypass the usual testing process. And that’s what CrowdStrike has been using so that they don’t have to go through the longer cycle of testing through Microsoft.

LP: So, a weakness on Microsoft’s side allowed CrowdStrike’s unsafe updates to go on for a long time.

MA: Yes, the reason goes two ways. If you are CrowdStrike, you want to do these updates frequently because you want to catch the bad guys, and testing through Microsoft might take you longer. You might be behind. This creates a dilemma: balancing the need for rapid updates with the time required for thorough testing. They got away with using the loophole to bypass Microsoft testing for a long time. But this time, an error popped up.

LP: What about independent testing? There seem to be indications that CrowdStrike made an effort to restrict it.

MA: I honestly don’t know. Generally, most security software companies operate the same way when it comes to handling proprietary information and so on. You’ve got McAfee and Norton plus others, and the biggest two are supposed to be Microsoft and CrowdStrike according to market research firms such as Gartner. I believe independent testing is usually not performed.

With large enterprises, however, things are different. For instance, some enterprises don’t allow updates to be rolled out without thorough testing first. This is because enterprises have specialized applications that interact with the operating system, and even minor updates to the operating system can mess up these applications. So updates are usually tested on our specific applications before they are rolled out. If the enterprises don’t do this, it’s a problem. That’s why I say that you have four areas of concern: CrowdStrike, Microsoft, the enterprises themselves, as well as the regulatory issues.

LP: Looks like a colossal market failure on several fronts: You’ve got seller-side problems with CrowdStrike’s questionable updating and testing practices and Microsoft’s failure to address vulnerabilities – something downplayed in the press. On the buyer side, many large companies are also neglecting proper testing, while smaller companies lack the resources to do so. I empathize with smaller companies—they’re like patients relying on a doctor’s expertise, trusting these major players to provide reliable solutions.

MA: Yes, with the smaller companies/businesses, it’s an issue of trust. I would think they trusted both CrowdStrike and Microsoft. As for the larger companies, like Delta, for example, I’m sorry, shame on you.

LP: Because the big guys do have the money and the resources to make sure that these products work, and they didn’t bother to do it.

MA: Yes. Sometimes, you may need to prioritize thorough checks over speed, meaning it’s crucial to catch issues early, even if it slows down the process. People often take risks. And again, it’s a problem on the regulatory side, because these companies exploit liability loopholes. If liability were a factor, companies might be less inclined to take these risks, in my opinion. That issue lies in regulatory and liability laws, which effectively buffer Microsoft and others from accountability.

LP: So if these companies were held accountable, they might prioritize resilience more.

MA: Yes. Resilience has two components. One is stopping issues from happening, which I think is very hard. You’re going to get hit. The second, more important component, is how quickly you can recover when they do. You may not be able to avoid problems entirely — there are things outside your control — it’s like getting sick. But you do have some control over how effectively you respond and recover. The focus should be on improving the response and recovery when you do get hit. I think what the CrowdStrike situation has shown is that nobody was prepared for the recovery procedures. Or the recovery procedures were inadequate.

I want to give you an example of companies addressing the issue of retrieving recovery keys via automated systems when laptops lock up. The main challenge is the labor-intensive process of having to call into call centers to get the key. We developed a system that automates this process. You could call in, and based on the phone number, the user could be authenticated and receive the recovery key automatically without human involvement. Somebody still has to manually enter the key, but this system reduces dependency on call center staff and speeds up the recovery process.
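A minimal sketch of the kind of automated lookup described above might look like the following. Everything in it is hypothetical: the `Enrollment` record, the in-memory `DIRECTORY`, the phone number, and the placeholder key are illustrative stand-ins, not the system the speaker built. The PIN check is an added assumption, since a caller's phone number alone can be spoofed.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Enrollment:
    """Hypothetical record linking a registered phone number to one device."""
    phone: str
    pin: str           # assumed second factor; not mentioned in the interview
    device_id: str
    recovery_key: str  # placeholder digits, not a real recovery key


# Hypothetical directory; a real deployment would query a protected key store.
DIRECTORY = {
    "+15550100": Enrollment(
        phone="+15550100",
        pin="4821",
        device_id="LAPTOP-001",
        recovery_key="111111-222222-333333-444444-555555-666666-777777-888888",
    ),
}


def lookup_recovery_key(caller_phone: str, pin: str) -> Optional[str]:
    """Return the caller's recovery key, or None if authentication fails."""
    entry = DIRECTORY.get(caller_phone)
    if entry is None:
        return None  # unknown caller: hand off to a human agent
    # Constant-time comparison avoids leaking the PIN through timing.
    if not hmac.compare_digest(entry.pin.encode(), pin.encode()):
        return None
    return entry.recovery_key


print(lookup_recovery_key("+15550100", "4821") is not None)  # True
print(lookup_recovery_key("+15550100", "0000"))              # None
print(lookup_recovery_key("+15550199", "4821"))              # None
```

The point of the design is that the only remaining manual step is the user typing the key into the locked machine, so call center staffing stops being the bottleneck during a mass outage.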

The CrowdStrike problem arose because many end devices required a recovery key, and there wasn’t enough staff to handle the high volume of requests, whether in large enterprises or small businesses. For instance, in hotels and gyms, it took up to three days to reboot machines simply because they didn’t have the recovery key.

The key to resilience is having a fast recovery procedure. It seems that many enterprises have not invested enough in planning what to do when things go wrong.

LP: And yet things do go wrong pretty often. With the CrowdStrike disaster, some were reminded of the SolarWinds cybersecurity breach where attackers inserted malicious code into updates of SolarWinds’ Orion software, compromising thousands of organizations, including U.S. government agencies and major corporations. The attackers reportedly breached the system using the absurdly insecure password “solarwinds123”!

MA: Network failures have impacted major companies like Amazon, Microsoft, and Google in the past, often caused by misconfigurations in files and routing tables. Recovery processes were slow and cumbersome. So that’s one group.

Then you’ve got the second group: hackers and other nefarious actors. In the case of SolarWinds, hackers targeted the company because it was the predominant provider of network management systems. By compromising SolarWinds, they were able to piggyback into multiple systems, much like a stowaway sneaking onto a plane to cause havoc.

With CrowdStrike, even if the issue wasn’t due to something nefarious happening internally, it raises a crucial question: who is protecting the protectors? What are their procedures for safeguarding themselves? These virus scanners hold vast amounts of data from businesses—such as traffic analyses, application usage, transaction frequencies, and more—through their surveillance and monitoring. CrowdStrike and other cybersecurity firms are sitting on a vast amount of sensitive information.

Recall what happened with the AT&T data breach. The hackers didn’t go after AT&T directly; they went after the cloud provider, Snowflake, and stole the data from there. For example, CrowdStrike uses Microsoft as its cloud provider, and other security firms do the same.

Now we’re delving into dependencies: these vendors rely on each other, and in the end, only a few key players have a comprehensive view of everything.

LP: The “too big to fail” analogy has surfaced here. These companies, like the big banks, face little regulatory pressure or liability, as you note, and when things go wrong, it’s the everyday people who end up getting screwed.

MA: I use the analogy of the one percenters that cause havoc for everybody else. And guess what? In this case, it was exactly one percent of the system that went down (8.5 million devices, according to Microsoft). But it caused major disruptions for everyone else. This highlights a bigger issue that nobody has looked at: even though we have what appear to be multiple vendors and players, if they all use the same small component, like one feature on an operating system, it can create massive problems. I don’t believe anybody has done a true supply chain dependency analysis.

What I mean by dependency is about impact, not just connections. A one percent failure can have huge consequences worldwide. In this case, a one percent failure essentially brought the entire global system to its knees. Hospitals in the US, Europe, and elsewhere couldn’t operate because they couldn’t access patient records, and MRIs failed because they relied on precise dosage information. This illustrates the cascading effects of such failures. We tend to ignore the one percent at our peril. China was not affected since they do not use US security firms such as CrowdStrike, though some US hotel chains in China were affected.

LP: What you’re saying highlights the fact that on the regulatory side, we need experts who truly understand the complexities involved. We need better models for systemic risk, more information on how these issues occur, and effective strategies for prevention and response.

MA: Yes. Some have defended Microsoft and argued that they had to grant third parties like CrowdStrike access to their operating system due to EU regulations, specifically the General Data Protection Regulation (GDPR), which mandated this access as part of their settlement with the EU. That’s their argument. I think that’s bogus.

Granting access without proper checks and balances is not what the regulation says. To build resilience and incentivize companies to improve, the focus should be on three key areas: First, ensuring the right talent is in place to handle security and recovery. Second, developing robust systems for recovery, such as BitLocker. Third, establishing effective processes to support these systems and ensure resilience. Both selling companies and buyers, like Delta and others, need to address these aspects to enhance their security and recovery capabilities.

It’s people, systems, and processes.

LP: What happens if we don’t tackle these issues?

MA: From risk management, these guys are thinking, oh, gee, it happens once a year, so let me take the risk. I think that’s the mode we’re in. So, from a risk management perspective, addressing these issues only once a year, once every three years, or just a single time is not good enough.

LP: How does CrowdStrike’s market share affect its role for big companies and impact interdependencies?

MA: There are other firms besides CrowdStrike, which holds an 18 percent market share. However, CrowdStrike services about 60 percent of the Fortune 500 and 40 percent of the Fortune 1000 companies. This means that many large companies heavily rely on CrowdStrike without fully vetting their products, creating a dependency on this single provider.

Small businesses have largely outsourced their cloud services to various providers, even down to simple apps for everyday services like laundry. We live in a world with extensive interdependencies. For example, a significant issue that wasn’t widely discussed with CrowdStrike was that airport parking lots became inoperable because payment systems for credit cards were down. People could not get out. At Baltimore Washington International Airport, hotels couldn’t check guests in due to a lack of access to credit card verification systems. I was able to check in because I was a regular guest, but it took them three days to validate my credit card.

Even though CrowdStrike has only an 18% market share, the rest of the market is still significantly dependent on that 18%, and this dependency has not been fully addressed or understood.

It seems we’re constantly relearning these lessons. Despite repeated experiences, businesses often seem surprised and need to learn again. Over the past 20 years, many companies have prioritized short-run revenue over resilience, which is crucial but costly. Investing in resilience costs money, but it’s essential for long-term stability.

LP: Focusing on short-term shareholder value often leads companies to skimp on security, but does it ultimately cost them more in the long run?

MA: Something I want to know, and it’s more of a question than anything else: How effective are these cybersecurity companies at stopping attacks relative to their cost? In other words, how many cyberattacks have they prevented over the last umpteen years compared to how much companies spend on them? It’s a question. No doubt businesses do suffer as a result of security breaches.

Supposedly, CrowdStrike’s website has a lot of information. They boast about how businesses have consolidated their security needs with them, presenting numerous case studies that claim customers saved six dollars for every dollar spent with CrowdStrike. Is that true? It’s on their website, but how did they come up with that number?

We’ve observed various issues that come in different flavors, but they often share a common root cause. First, the shift to cloud services has led to significant breaches, such as the theft of AT&T’s 100 million customer records. Second, major cloud providers like Amazon and Azure have experienced massive network failures. Third, there are vulnerabilities in management systems that allow for infiltration. Lastly, failures in surveillance systems have led to widespread disruptions.

Think about the airlines’ failures, with planes potentially crashing when sensors fail. For redundancy, three sensors are installed. When there is a mismatch between the sensors’ data, the majority rule is used, i.e., trust the two sensors that give the same data. Sometimes, an event like a bird strike could make two sensors malfunction. There’s an overreliance these days on sensing, specifically sensing that is given to automated agents. We need to reevaluate how sensor failures affect automated systems. When sensors fail, the resulting issues can quickly propagate through these systems due to their rapid processing capabilities, amplifying the original problem.
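The majority rule described above, and the way it breaks down when two sensors fail together, can be illustrated with a short sketch. The function name, the tolerance, and the sample readings are assumptions chosen for illustration and do not come from any real avionics system.

```python
from typing import Optional, Sequence


def majority_vote(readings: Sequence[float], tolerance: float = 0.5) -> Optional[float]:
    """Return the value that at least two of three sensors agree on.

    Two readings "agree" when they differ by no more than `tolerance`
    (an illustrative threshold). Returns None when no pair agrees.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    a, b, c = readings
    for x, y in ((a, b), (a, c), (b, c)):
        if abs(x - y) <= tolerance:
            return (x + y) / 2  # average of the agreeing pair
    return None


# One faulty sensor: the two healthy readings outvote it.
print(majority_vote([4.0, 4.0, 15.0]))   # 4.0

# Two sensors damaged the same way: the faulty pair outvotes the healthy one.
print(majority_vote([4.0, 15.0, 15.0]))  # 15.0

# All three disagree: the vote cannot produce a trusted value.
print(majority_vote([1.0, 5.0, 9.0]))    # None
```

The second case is the one the speaker warns about: the voting logic works exactly as designed, yet an automated system acting on its output would be acting on bad data.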

The national security implications are significant due to these dependencies we’ve been discussing. I’m concerned about what can happen with the government increasingly relying on commercial companies, which handle extensive data collection. The business models of some companies include a lot of government data, and you see the CEO reportedly earning a salary of over a billion dollars. Many of these companies are either directly owned or heavily influenced by billionaires, intertwining corporate practices with political interests. It is a complex issue that merits further examination.

LP: Right, billionaires are tightly linked to our regulatory system, as they are showing us right now in the election cycle.

MA: Yes. Reid Hoffman, the LinkedIn billionaire, just gave $7 million to the Harris campaign, and then he goes on CNN demanding the firing of people at the FTC who regulate.

LP: He’s unhappy with FTC Chair Lina Khan, who has taken a strong stance against Big Tech, and is now demanding her removal.

MA: Lots of Silicon Valley is not happy with Lina Khan. There are legitimate concerns for smaller companies—the costs of regulatory approval can be prohibitively high. For example, I know of a company that was acquiring another for $40 billion. The regulatory approval alone cost $27 million. While this is a minor expense for big firms, it’s a huge burden for smaller ones. This cost might make them think twice about future acquisitions and instead, opt to license technology and hire a few smart people as an alternative. This shift can increase dependencies across the board in ways that aren’t always visible. Many hadn’t heard of CrowdStrike until a major incident brought it to attention. Smaller companies, though less visible, are deeply integrated into critical systems.