The end of last week was marked by the largest outage ever to hit Windows PCs running CrowdStrike software, which is designed to protect against cyberattacks. Following its investigation, CrowdStrike said the outage was caused by a bug in its testing software, which failed to properly validate the update that was distributed to millions of PCs on Friday.
At the same time, CrowdStrike promised to test updates to its software more thoroughly in the future and to roll out packages in phases to avoid a repeat of the incident. As a reminder, CrowdStrike’s Falcon software is used by companies around the world to protect against cyberattacks and is installed on millions of PCs. On Friday, the company began distributing a Falcon update that was supposed to gather telemetry on possible new threat techniques. Such updates are released regularly, but in this case one of them caused Windows PCs to crash on a massive scale.
CrowdStrike typically releases two types of updates. Sensor Content packages update Falcon’s content on the user’s device and run at the Windows kernel level. Rapid Response Content packages update the signatures that the Falcon sensor uses to detect malware. In this case, a tiny 40 KB Rapid Response Content file crashed 8.5 million computers.
Sensor Content updates are typically not deployed from the cloud. They include artificial intelligence and machine learning models that allow CrowdStrike to improve its malware detection capabilities over the long term. These capabilities include so-called “Template Types”: code for new detections whose behavior is configured by the content delivered to users’ devices.
CrowdStrike has a cloud platform that is used to manage the company’s products and to validate the contents of update packages before they are widely distributed. Last week, the company released two Rapid Response Content updates at once. It has now been determined that a bug in the content validation tool allowed both packages to pass validation, even though one of them was faulty and ultimately led to the massive failure.
Although CrowdStrike performs automated and manual testing of updates before mass distribution, in this case the testing was evidently not thorough enough. Previous deployments of “Template Types” had given the company “confidence in the checks performed by content validators,” so CrowdStrike assumed that another rollout of this kind would not cause complications. As a result, the Falcon sensor received the problematic content with the Rapid Response Content update, loaded it into its content interpreter, and then failed when it attempted to access memory outside the valid address space. Falcon could not handle this error, which caused Windows to crash.
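The failure mode described above, an interpreter reading past the end of its input while evaluating delivered content, can be illustrated with a minimal sketch. This is not CrowdStrike’s code: the names Rule, ContentError, evaluate and safe_evaluate are hypothetical, and the rule format (compare one input field to an expected value) is an assumption made purely for illustration. Python would raise an exception on an out-of-range read by itself; the explicit check stands in for what native kernel-level code has to do manually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical detection rule: matches when inputs[field_index] == expected."""
    field_index: int
    expected: str


class ContentError(Exception):
    """Raised when delivered content cannot be evaluated safely."""


def evaluate(rule: Rule, inputs: list[str]) -> bool:
    # Check the index before reading. Native code without this check would
    # read memory outside the input buffer.
    if not 0 <= rule.field_index < len(inputs):
        raise ContentError(
            f"rule reads field {rule.field_index}, "
            f"but only {len(inputs)} inputs were supplied"
        )
    return inputs[rule.field_index] == rule.expected


def safe_evaluate(rules: list[Rule], inputs: list[str]) -> tuple[list[Rule], list[Rule]]:
    """Evaluate every rule; set aside invalid rules instead of failing outright."""
    matches: list[Rule] = []
    rejected: list[Rule] = []
    for rule in rules:
        try:
            if evaluate(rule, inputs):
                matches.append(rule)
        except ContentError:
            rejected.append(rule)
    return matches, rejected
```

In this sketch a malformed rule ends up in the rejected list while the remaining rules are still evaluated, which is the kind of graceful handling that the sensor lacked in this incident.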
To prevent similar incidents in the future, CrowdStrike intends to improve its testing of Rapid Response Content updates by adding tests on local developer systems, deploying packages in stages, and adding the ability to roll back to a previous state. In addition, developers will run extra tools on their systems to stress test updates and identify errors. The stability of update packages and of the Rapid Response Content interface will also be tested. Finally, CrowdStrike will update its cloud-based content validation tool and improve error handling in the content interpreter, which is part of the Falcon sensor.
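A staged deployment with rollback, as mentioned above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function staged_rollout, its deploy, healthy and rollback callbacks, and the 1% / 10% / 100% stage sizes are all hypothetical and do not describe CrowdStrike’s actual process.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # cumulative share of hosts
) -> bool:
    """Deploy in growing stages; roll back every updated host on the first failure."""
    updated: list[str] = []
    done = 0
    for fraction in stages:
        # Number of hosts that should be updated once this stage completes.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[done:target]:
            deploy(host)
            updated.append(host)
        done = max(done, target)
        # Check health before widening the rollout any further.
        if not all(healthy(host) for host in updated):
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

The point of the design is that a faulty update reaches only the first small group of machines before the health check stops the rollout, rather than all machines at once.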