CrowdStrike outage calamity: What happened and RapidScale’s response

On Friday, July 19th, 2024, banks, airlines, hospitals, retail chains, law enforcement agencies, and government offices around the world woke up to the blue screen of death.

Jul 24, 2024 |Sarah Davis |4 Minute Read


LGA baggage claim display with error message on it after CrowdStrike outage

The CrowdStrike outage impacted various facets of technology and transportation, including subways, airports, and even Times Square billboards. (Image source: Wikipedia)

Overnight, cybersecurity company CrowdStrike had released a routine update of its Falcon sensor security software. It was supposed to help detect security threats for businesses globally. 

What Caused the CrowdStrike Outage? 

CrowdStrike reportedly released the Falcon sensor update between 11:09 PM EST on July 18th and 12:27 AM EST on July 19th (04:09 to 05:27 UTC on July 19th). Any customers whose systems were online or updated their software during that window were affected. 
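To make the exposure window concrete, here is a minimal Python sketch that checks whether a host's last check-in falls inside the update window quoted above. It assumes "EST" means a fixed UTC-5 offset, and the host names and check-in times are hypothetical values made up for illustration. A host that was online in the window was only potentially affected, so treat the result as a first filter rather than a diagnosis.

```python
from datetime import datetime, timedelta, timezone

# "EST" as the article writes it: a fixed UTC-5 offset (an assumption; no DST handling).
EST = timezone(timedelta(hours=-5), name="EST")

# Update window as reported in the article.
WINDOW_START = datetime(2024, 7, 18, 23, 9, tzinfo=EST)   # 11:09 PM EST, July 18
WINDOW_END = datetime(2024, 7, 19, 0, 27, tzinfo=EST)     # 12:27 AM EST, July 19


def in_update_window(seen_at: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the update window."""
    if seen_at.tzinfo is None:
        raise ValueError("seen_at must be timezone-aware")
    # Aware datetimes compare correctly across time zones.
    return WINDOW_START <= seen_at <= WINDOW_END


# Hypothetical check-in times in UTC, made up for illustration only.
checkins = {
    "host-a": datetime(2024, 7, 19, 3, 55, tzinfo=timezone.utc),
    "host-b": datetime(2024, 7, 19, 4, 40, tzinfo=timezone.utc),
    "host-c": datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc),
}

for host, seen_at in checkins.items():
    status = "possibly affected" if in_update_window(seen_at) else "outside window"
    print(f"{host}: {status}")
```

Using timezone-aware timestamps avoids the most common mistake here, which is comparing a UTC log time against a local-time window.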

Reports indicate that faulty code that was not caught during quality assurance (QA) caused global outages for companies using Microsoft’s Windows operating system. Mac and Linux hosts weren’t impacted. 

 

An estimated 8.5 million Windows devices were impacted, and more than 5,000 flights have been canceled since Friday. The outage could cost more than $1 billion. 

 

The issue was in a Channel File. Channel Files are typically updated multiple times per day to ensure systems are protected against the latest security threats. These types of updates are typically tested in a smaller pool before being released to all customers.

Over half of Fortune 500 companies and the top U.S. cybersecurity agency, the Cybersecurity and Infrastructure Security Agency, use CrowdStrike to protect their organizations against hacking. 

Dive deeper into the stats in our infographic.


RapidScale’s Response to the CrowdStrike Outage
 

RapidScale’s teams were prepared for an event like this. With an escalation plan in place, our experts were up in the early hours of Friday morning, working together to put out the fires.  

RapidScale’s experts went into each of our affected clients’ servers individually to manually update the code. Each server took about 5 minutes to recover, and with 60 experts on deck, RapidScale was able to resolve 60 servers every 5 minutes.
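The staffing arithmetic above can be worked through in a few lines of Python. This is a rough sketch that assumes every engineer handles one server at a time and every fix takes the full 5 minutes. The fleet size of 600 servers is a hypothetical number used only to show the calculation, not a figure from RapidScale.

```python
import math

ENGINEERS = 60           # experts on deck, from the article
MINUTES_PER_SERVER = 5   # approximate recovery time per server, from the article


def servers_per_hour(engineers: int, minutes_per_server: int) -> int:
    """Servers recovered per hour if each engineer fixes one server at a time."""
    return engineers * (60 // minutes_per_server)


def estimated_minutes(servers: int, engineers: int = ENGINEERS,
                      minutes_per_server: int = MINUTES_PER_SERVER) -> int:
    """Wall-clock minutes to work through a fleet in parallel batches."""
    batches = math.ceil(servers / engineers)
    return batches * minutes_per_server


print(servers_per_hour(ENGINEERS, MINUTES_PER_SERVER))  # 720 servers per hour
print(estimated_minutes(600))  # hypothetical fleet of 600 servers: 50 minutes
```

Real recoveries are less uniform than this model, since some servers only needed a reboot while others needed manual file removal.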

That’s the benefit of having a legion of IT experts by your side that operates like a well-oiled machine – even in the chaos and calamity of a major global outage, when there’s no way to mass-deploy or automate a solution and each server has to be fixed individually. That’s the power of having an expert team with a plethora of resources dedicated to helping your organization during a critical event. 

The result? All of RapidScale’s customers’ issues were resolved by Friday night.  

For some servers, it was as simple as going through a reboot once CrowdStrike shared the fix.  

For others, that didn’t work. They got stuck in a reboot loop, which required manual remediation. Since it was impossible to log into the machines, RapidScale’s experts pulled the data off those machines, deleted the corrupted file, put the data back onto the server, and restarted it. From there, the fix had to be deployed server by server. 
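As an illustration of the "delete the corrupted file" step, here is a minimal Python sketch that looks for the faulty channel file on a recovered Windows volume. The directory and the `C-00000291*.sys` pattern follow CrowdStrike's publicly posted workaround. Everything else is an assumption: the function names, the mount point `/mnt/recovered-disk`, and the idea of scripting the step at all. This is not RapidScale's actual procedure, and it defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# File pattern from CrowdStrike's published workaround for this incident.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(volume_root: Path) -> list[Path]:
    """List matching channel files under the CrowdStrike driver directory."""
    driver_dir = volume_root / "Windows" / "System32" / "drivers" / "CrowdStrike"
    if not driver_dir.is_dir():
        return []
    return sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))


def remove_faulty_channel_files(volume_root: Path, dry_run: bool = True) -> list[Path]:
    """Return the matching files; delete them only when dry_run is False."""
    matches = find_faulty_channel_files(volume_root)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Hypothetical mount point for a disk attached to a working recovery machine.
    root = Path("/mnt/recovered-disk")
    for path in remove_faulty_channel_files(root, dry_run=True):
        print(f"would delete: {path}")
```

Path matching on a case-sensitive filesystem may need adjusting if the mounted volume uses different capitalization for these directories.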

In the background, RapidScale also drew on its Linux expertise to restore all but one data center server before the East Coast even came online on Friday morning. The final server was a physical box and took as long on its own as all the others combined. 

How Long Did the CrowdStrike Outage Last? 

Journey through the unfolding events with this comprehensive timeline: 

  • 12:27 AM: CrowdStrike finishes rolling out the Falcon sensor software update, and the CrowdStrike outage starts. 
  • 12:30 AM: RapidScale’s monitoring tool detects that servers are down. 
  • 1:00 AM: RapidScale’s experts are at their keyboards, investigating the problem before CrowdStrike announces the outage. 
  • 2:30 AM: CrowdStrike issues a manual resolution for the software. RapidScale starts working to resolve the issue, even for clients who don’t yet know there is an issue.  
  • 5:00 AM: The first RapidScale customer comes back online while many businesses are just waking up to the outage. 
  • 1:29 PM: The final RapidScale private cloud customer is resolved. 
  • 9:00 PM: The final RapidScale public cloud customer is resolved after extenuating issues with domain controllers. A RapidScale engineer stays in touch with the customer throughout the day to resolve the issue.  
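To put numbers on the timeline above, the following Python sketch computes how long after the start of the outage each milestone occurred. It assumes all of the listed times fall on July 19th in a single time zone. The short event labels are paraphrased from the bullets and are not official names.

```python
from datetime import datetime, timedelta

# (hour, minute) in 24-hour time on July 19, 2024, taken from the timeline above.
TIMELINE = [
    ("outage starts", (0, 27)),
    ("monitoring detects servers down", (0, 30)),
    ("experts investigating", (1, 0)),
    ("CrowdStrike issues manual resolution", (2, 30)),
    ("first customer back online", (5, 0)),
    ("final private cloud customer resolved", (13, 29)),
    ("final public cloud customer resolved", (21, 0)),
]


def elapsed_since_start(timeline):
    """Map each event label to the time elapsed since the first event."""
    stamps = [(label, datetime(2024, 7, 19, h, m)) for label, (h, m) in timeline]
    start = stamps[0][1]
    return {label: stamp - start for label, stamp in stamps}


for label, delta in elapsed_since_start(TIMELINE).items():
    hours, remainder = divmod(int(delta.total_seconds()), 3600)
    print(f"+{hours:02d}h{remainder // 60:02d}m  {label}")
```

By this count, detection came 3 minutes after the outage began, the first customer was back 4 hours 33 minutes in, and the last customer was resolved 20 hours 33 minutes in.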

RapidScale has a set of experts for private cloud and a set of experts for public cloud, but they came together, shared insights, and worked as a unified team to get every customer back online.  

The issue was noticed in Australia first. It quickly spread to the United States, Asia, Europe, New Zealand, and other regions.  

CrowdStrike released steps to fix affected systems, but the solution required manually weeding out the flawed code. Since it’s not an easy fix, the issue reportedly continued for many businesses into the weekend following the crash and could continue to have impacts for the upcoming week.  


Some reports say the issue could take weeks to be fully resolved.  


On July 21, CrowdStrike said in a statement on LinkedIn that it was testing a technique to “accelerate impacted system remediation.” 
 


CrowdStrike’s Response to the Outage
 

In a statement released on Friday, CrowdStrike Founder & CEO George Kurtz stated, “I want to sincerely apologize directly to all of you for today’s outage.”  

He also acknowledged that “adversaries and bad actors will try to exploit events like this” and encouraged companies to “remain vigilant” and engage with CrowdStrike representatives.  

He ended the statement by saying, “As we resolve this incident, you have my commitment to provide full transparency on how this occurred and steps we’re taking to prevent anything like this from happening again.” 

 

Turning Chaos Into Calm: The RapidScale Way 

When you have certified cloud experts like the team at RapidScale working by your side, even the greatest chaos becomes calm. While many organizations were scrambling to manually update their servers with uncorrupted code, RapidScale’s customers could rest easier with a team of IT experts doing all the manual work. 

Ready to simplify IT and unleash innovation for your organization? Talk to our experts today. 

 

Sources 

  • CNN Business, https://edition.cnn.com/2024/07/21/business/crowdstrike-outage-cost/index.html
  • CrowdStrike, https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
  • CrowdStrike, https://www.crowdstrike.com/blog/to-our-customers-and-partners/
  • Forrester, https://www.forrester.com/blogs/crowdstrike-global-outage-critical-next-steps-for-tech-and-security-leaders/
  • Independent, https://www.independent.co.uk/tech/microsoft-it-outage-crowdstrike-windows-flights-latest-update-b2583539.html
  • Reuters, https://www.reuters.com/technology/cybersecurity/crowdstrike-update-that-caused-global-outage-likely-skipped-checks-experts-say-2024-07-20/
  • SBS Cybersecurity, https://sbscyber.com/blog/security-advisory-crowdstrike-outage-due-to-faulty-windows-update
  • Technology Magazine, https://technologymagazine.com/cloud-and-cybersecurity/worldwide-it-outage-the-pressure-on-cybersecurity-vendors