What everybody should learn from CrowdStrike
Friday October 25, 2024
On July 19th 2024, roughly 8.5 million Windows machines around the world crashed and were unable to restart. Within hours, the bug, which was not in Windows itself but in security software from CrowdStrike, was discovered and fixed. However, it took manual action from sysadmins to get everything up and running again, and some companies were affected for days. CrowdStrike published their root cause analysis for everyone to read, which is commendable. While CrowdStrike's customers worry about what the company itself learns, I asked myself: what can I learn from this?
Disclaimer
First of all I have to say that, luckily, I was not affected in any way by this incident. I am also not in any way connected to CrowdStrike. All interpretations and opinions in this blog are my own, based on publicly available information and my own knowledge and assumptions. Anything in this blog regarding CrowdStrike should not be taken as established fact.
Learning from mistakes
In software engineering (and probably in other engineering disciplines as well), we try to understand the cause of any issue that occurs. Often a bug is easily fixed, but that is purely reactive (“we observed something to be wrong, we addressed the symptom”). To prevent a similar issue from occurring again, we need to look at the cause(s) that led to the issue being introduced. The direct cause of the issue may not be the one thing to address, though: often that cause is itself caused by something else.
This is why the process to investigate and prevent further issues is called root cause analysis. While different techniques exist, the commonality is repeatedly asking “why?”.
(If you want to learn more about root cause analysis, Wikipedia is probably a good place to start.)
We perform root cause analysis to learn from mistakes and prevent the same (or similar) mistakes in the future. In practice these learnings are often not shared, sometimes not even within a company. So when the chance arises to learn from what is probably one of the largest outages of the decade (foreshadowing?), we should grab it with both hands! If you have the time, I recommend reading the CrowdStrike RCA document yourself to make up your own mind.
CrowdStrike outage root cause
So what did CrowdStrike identify as the root cause?
Well, they found multiple “root” causes, all related to engineering. Could all of those engineering issues really be root causes of this problem?
To keep it understandable and generic (so everybody can learn), I have rephrased the identified causes based on my personal interpretation:
- Not all configuration that the software used at runtime was tested against the software before being applied.
- A common programming mistake was not caught.
- Testing of the software used a limited (and in hindsight inaccurate) mock.
- A logic error.
- As far as I can tell, a repeat of point 1.
- Releasing configuration directly to production.
If we categorize these quickly, we can identify points 2 and 4 as human error. Points 1, 3 and 5 relate to testing methodology. Point 6 is mostly its own thing and may actually point us towards something not related to engineering.
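To make the kind of mistake behind points 2 and 4 a bit more concrete, here is a minimal sketch in Python. It is purely hypothetical: the function names, field layout, and field count are my own invention and have nothing to do with CrowdStrike’s actual (C++) code. The point is how easily a parser that silently assumes a fixed number of fields blows up when the configuration supplies a different number.

```python
# Hypothetical sketch; field names and layout are invented for illustration.

def parse_rule_unsafe(fields: list[str]) -> dict:
    # Silently assumes the configuration always delivers exactly three fields.
    # A configuration update with fewer fields crashes with an IndexError.
    return {
        "pattern": fields[0],
        "action": fields[1],
        "severity": fields[2],
    }


EXPECTED_FIELD_COUNT = 3


def parse_rule_safe(fields: list[str]) -> dict:
    # Validate the input against what the code expects before using it,
    # and fail in a controlled way instead of deep inside the parser.
    if len(fields) != EXPECTED_FIELD_COUNT:
        raise ValueError(
            f"expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}"
        )
    return parse_rule_unsafe(fields)
```

The difference between the two versions is exactly the difference between a controlled error and a crash on somebody else’s machine.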
Unpacking the findings
Let’s think about what these findings actually mean.
Wherever people (or LLMs, for that matter) do the work, mistakes are made, and the more complex the software gets, the easier it is to make them. While adequate testing should catch these logic errors, there is always a chance that something slips through.
And testing does look to be a problem: half of the identified causes clearly point to issues with testing. The final cause hints at a testing problem as well, since whatever gets pushed to clients is apparently not run in a controlled environment first. In my opinion, any human mistake that reaches production can be rephrased as a testing problem, but since the testing is also done by humans you have to stop somewhere.
Thorough testing, especially without breaking the bank, is hard. However, looking at the root causes regarding testing, the improvements needed to address them seem to be low-hanging fruit. Best practices in software engineering have stressed the importance of testing on multiple layers for years (the test pyramid that showcases this idea dates from 2009, fifteen years ago!). While mocks can be of great use when testing small details of a subsystem in a unit test, they are no replacement for the real thing at the higher levels of the pyramid. The larger the system you mock, the more likely it is that not all of its behavior is covered by the mock. And on that topic, the unit tests were clearly not adequate (if they existed at all), since it is good practice to test not only the happy flow but also the obvious edge cases.
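As a minimal sketch of what “not only the happy flow” means in practice: assuming the hypothetical parse_rule_safe from the earlier sketch is in scope, a handful of pytest-style unit tests could cover the obvious edge cases as well. The test data is again invented for illustration.

```python
import pytest

# Assumes parse_rule_safe from the earlier hypothetical sketch is defined
# in (or imported into) this module.


def test_happy_flow():
    rule = parse_rule_safe(["*.example", "block", "high"])
    assert rule["action"] == "block"


def test_too_few_fields_is_rejected():
    # The edge case a happy-flow-only suite would miss.
    with pytest.raises(ValueError):
        parse_rule_safe(["*.example", "block"])


def test_too_many_fields_is_rejected():
    with pytest.raises(ValueError):
        parse_rule_safe(["*.example", "block", "high", "extra"])


def test_empty_input_is_rejected():
    with pytest.raises(ValueError):
        parse_rule_safe([])
```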
(As an aside, mutation testing can help you assess whether your unit tests are adequate. I gave a talk about it earlier this year.)
Talking about good practices: you obviously have some static analysis tooling in your build pipeline to make sure you don’t make obvious mistakes. The more complicated the problem you are trying to solve, the greater the chance that the complicated parts are correct and well tested, while the mistakes that do get made are the silly ones. Also, trying out the software in a controlled (acceptance) environment is a must; nobody wants to break production! And to avoid breaking all your customers at once, you can do a staged roll-out and notice issues before they impact everyone.
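To illustrate the staged roll-out idea, here is a small sketch of one way to do it: each machine is deterministically assigned to a bucket, and a release only reaches machines whose bucket falls within the current stage’s percentage. The stage percentages and hashing scheme are assumptions for illustration, not a description of anybody’s actual pipeline.

```python
import hashlib

# Hypothetical staged roll-out: stage sizes and hashing scheme are assumptions.
ROLLOUT_STAGES = [1, 5, 25, 100]  # percent of the fleet per stage


def bucket_for(machine_id: str) -> int:
    # Stable bucket in the range 0-99, derived from the machine identifier.
    digest = hashlib.sha256(machine_id.encode()).hexdigest()
    return int(digest, 16) % 100


def should_receive_update(machine_id: str, stage: int) -> bool:
    # A machine receives the update once the current stage's percentage
    # covers its bucket.
    return bucket_for(machine_id) < ROLLOUT_STAGES[stage]
```

If the first stage starts crashing machines, only around 1% of the fleet is affected and the roll-out can be halted before the later stages.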
CrowdStrike is not unique
In general, this gives the impression that good practices in the industry could have helped to avoid the issue. So does that make CrowdStrike a bad software company? Maybe; we just don’t know. These kinds of mistakes certainly do not make CrowdStrike stand out: there are plenty of companies and software projects out there, closed-source and open-source, with similar or worse issues.
Inadequate testing can be catastrophic
Whether the issues we discussed are problematic depends entirely on what type of software you write. It is merely annoying if you have to wait unnecessarily long for your game to load, which in the grand scheme of things is not too bad. However, it can cost a lot of money and a research opportunity if your spacecraft malfunctions, or cost human lives when your radiation therapy machine overdoses patients. The amount and type of testing a software project needs depends on both the likelihood of a failure and its impact. For high-impact failures, you had better get the likelihood down!
The CrowdStrike software we are talking about here is client software that runs on other people’s machines (not under the control of the software vendor), which is red flag number one. Patching software on somebody else’s machine is usually a lot more complicated than on your own (as it would be for a SaaS product, for example), especially when the update mechanism itself is broken. Furthermore, the software hooks into the kernel (it is security software after all, so that is not uncommon), which is red flag number two: a kernel module crash usually crashes the whole system. It also runs very early in the boot sequence and can actually stop the machine from booting when it fails, which is red flag number three. After all, if the machine does not boot, you cannot update the software that causes it not to boot.
In summary: no control over the runtime environment means fixing anything will be difficult; running in the kernel means crashes have a high impact; and running at boot makes the other two issues a lot worse.
No control over the machines + running in the kernel = you had better make sure it works, and once you are sure it does, check again!
What everybody can learn from this
This outage illustrates the need for adequate testing. I am not arguing that all software must be tested until it is virtually without faults. Instead, the amount of testing (and the tests you focus on) should match the risk and impact of a failure. (I don’t like bureaucracy, although I may be advocating for risk assessments here!)
Even with a well-considered test scope, the implementation of the tests must also be done right. When an issue only occurs for a specific input, it is easy to miss if you test components solely in isolation, especially if your input handling was not properly written and tested. Black-box testing, with the possible configurations included in the test cases, is a necessity for confidence in the combination of the system with its configuration(s). Alternatively, techniques like mutation testing and property-based testing can be applied to gain confidence that the software works for all inputs, but this comes at a hefty cost. Having a production-like acceptance environment and staged roll-outs are useful tools to spot issues before they have a high impact.
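As a sketch of what property-based testing could look like here, using the Python hypothesis library and the hypothetical parse_rule_safe from earlier: the property is simply that for any list of input fields, the parser either returns a complete rule or raises a controlled error, and never crashes with anything else.

```python
from hypothesis import given, strategies as st

# Assumes parse_rule_safe from the earlier hypothetical sketch is in scope.


@given(st.lists(st.text()))
def test_parser_never_fails_uncontrolled(fields):
    try:
        rule = parse_rule_safe(fields)
    except ValueError:
        return  # a controlled rejection is fine
    # If parsing succeeds, the result must be complete.
    assert set(rule) == {"pattern", "action", "severity"}
```

hypothesis will generate many input lists, including the empty list and oversized ones, which is exactly the class of input that happy-flow tests tend to forget.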
Adding endless layers of defense is obviously not a sustainable way of making and testing software. In aerospace, the Swiss cheese model is often used as an analogy for layered defenses. By not relying on a single layer of defense against an issue, the chance of a catastrophic failure becomes much lower: the holes in each layer must line up for a catastrophic event to happen.
If CrowdStrike had applied such a strategy, the issues identified as “root causes” would have been caught before ending up on customers’ systems. There would not have been a CrowdStrike outage, or at least not this one.
Final thoughts
So can I do better in my projects? Almost certainly.
Is it worth the investment? Maybe.
For me, this story reinforces the belief that the level of testing should serve a purpose by reducing risk where it counts. Situations where the risk and/or impact of faults are high should be dealt with (first). And for that, identifying the risks and discussing them should be the first step.
That may give new roles to testing tools such as mutation testing or property-based testing. How about we use them to find the holes in the layers of cheese, and then discuss whether additional layers are necessary?
I want to end with advice to CrowdStrike, and to anyone else who recognized their own project somewhere in this article: please perform root cause analysis, and dig deeper than the engineering causes. Talking about organizational issues, team expectations and responsibilities, and challenging the priorities of superiors is not easy. It is, however, necessary to understand the forces behind the technological challenges in order to really drive change, and to align the priorities of everyone involved, including stakeholders. Change may prevent an outage like this from ever happening again; at the very least, it will lower the chances significantly. That appeals to customers, shareholders, and employees alike.
Comments?
Drop me an e-mail at janjelle@jjkester.nl.