The Three Things We Learned as QA Experts from the CrowdStrike Incident of July 2024

And None of Them Is to "Do More Testing"

This week has been significant in both security and QA circles. CrowdStrike had been a brand known mostly to institutional investors and security experts; overnight it became common vernacular worldwide as the software behind an outage that affected almost everyone. The incident caused widespread disruptions, exposing weaknesses not only in CrowdStrike's DevOps processes but also in its crisis communications strategy. Watching this unfold, three key lessons became clear to us as QA experts, and surprisingly, none of them is "do more testing".

1. There's No Hiding in a Site-Down Situation, So Be Transparent with Everyone

CrowdStrike initially informed its customers through internal support channels rather than openly acknowledging the issue to all users. End users, and even some on-duty IT employees, had no idea what was happening until information started leaking from those support chats onto Reddit. CrowdStrike isn't some obscure piece of software: it runs at the kernel level and everyone is affected, so you want to control the flow of information rather than force people to dig through unofficial posts from other affected users on social media just to get their systems back up and running. This points to a failure of clear communication with your customers.

MAJOR LEARNING: Ensure you have an SOP for emergency communications, and reach your customers where they are. Internal support forums are not sufficient when your software is ubiquitous.

2. Overnight Deployment Windows Don't Give You "Time to Sort Things Out"

When software is used globally, deploying overnight where you live may seem to reduce risk, but it's easy to forget that your customers are everywhere, including places where it is the middle of the working day. CrowdStrike rolled out its update overnight North American time, and the first people affected were in Australia: their screens went blue ahead of everyone else's, in the middle of their workday. Critical-path patching and deployments cannot be treated casually just because it's nighttime in your own time zone.

MAJOR LEARNING: Your overnight is someone else's workday. Ensure there is an A-Team available to respond to any issues.

3. Even the Most Rigorous Tests Can Be Foiled by "Assumptions"

CrowdStrike assumed that the successful testing of its March deployment of Template Types was proof that its testing and validation pipeline could be trusted. On the back of that assumption, a small 40KB Rapid Response Content update in July 2024 slipped through validation and was released into the wild, where it triggered an out-of-bounds memory read. The size of an update obviously does not correlate with its complexity or risk, but more importantly, validation that is never exercised against boundary cases is a major risk that can eventually lead to exactly this kind of quality incident, as the simplified sketch below illustrates.
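To make the boundary-case point concrete, here is a minimal, hypothetical sketch in Python. It is not CrowdStrike's actual code, and the names and field counts are purely illustrative: a template that declares more input fields than the content update actually supplies will read past the end of the inputs unless a validation step explicitly compares the counts, and a single deliberate boundary-case test is enough to catch the mismatch before anything ships.

```python
# Hypothetical, simplified example -- not CrowdStrike's actual code.
import pytest


class Template:
    """A content template that declares how many input fields it expects."""

    def __init__(self, name: str, expected_fields: int):
        self.name = name
        self.expected_fields = expected_fields


def validate(template: Template, inputs: list[str]) -> None:
    """The boundary check: fail fast when declared and supplied counts disagree.

    Without this check, the interpreter below would happily read past the end
    of the supplied inputs -- the moral equivalent of an out-of-bounds memory
    access in kernel-level code.
    """
    if len(inputs) != template.expected_fields:
        raise ValueError(
            f"{template.name}: expected {template.expected_fields} inputs, "
            f"got {len(inputs)}"
        )


def interpret(template: Template, inputs: list[str]) -> str:
    """Reads every declared field, trusting that validation already ran."""
    return "|".join(inputs[i] for i in range(template.expected_fields))


def test_validator_rejects_undersupplied_inputs():
    # A deliberate off-by-one boundary case: the template declares 21 fields,
    # but the content update only supplies 20.
    template = Template(name="example-template", expected_fields=21)
    with pytest.raises(ValueError):
        validate(template, ["value"] * 20)
```

The point is not the code itself but the habit: every assumption baked into a validator ("the counts will always match") deserves at least one test that deliberately breaks it.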

What did we learn, and what's the major takeaway?

The major learning is this: thorough testing without assumptions is essential, and a quality mindset doesn't allow assumptions to creep into the QA process.

The major takeaway, looking back at this incident, is something we've been advocating for many years: QA isn't just about testing. It's about a mindset that puts quality and user experience above all else.

The three things we've learned have less to do with the technical side of testing and more to do with the human side of it: embedding quality and user experience as an integral part of your organizational culture. Communication, assumptions, and risk management are human activities, and a quality-first mindset forces us to do better for our customers.

