Developing a Strategy for Container Vulnerability Management
A tactical guide to find your ideal scanning and patching approach for container images
While there are plenty of posts and articles on how to run vulnerability scans and why scanning matters, little is said about the different scanning tools and whether they fit into your pipeline strategy. This blog post gives you a first look at the aspects you may want to consider when choosing your CVE scanner and reflects on possible scanning strategies you may want to adopt in your game.
Developing the right process to manage Common Vulnerabilities and Exposures (CVEs) in container images is a crucial first step towards improving your cluster’s security posture. In my humble opinion, several pieces have to interact with each other to establish a successful container security strategy.
However, analogous to learning how to play chess, you will not start learning opening strategies before you know how to move each piece.
In this regard, container vulnerability management is pretty much like learning to move your pawns. The moves are straightforward, and positioning your pawns (scanning) in the right spots will determine whether your flanks are left open or not. Done right, it can be a big game-changer for your team and, who knows, maybe you reach the other side of the board and keep your attacker in check.
Now, enough with the metaphors and let’s take a look at why container vulnerability management is an important part of your overall container security strategy.
3 reasons why container vulnerability management is important
Reason 1: The agony of complexity by introducing flexibility
Containerizing your applications has its perks, no doubt. You can encapsulate functionalities, spin up applications on almost any system, and scale them up and down. However, all this flexibility has its price, which is best understood when you take a closer look at the anatomy of a container:
Each image incorporates everything needed to run your code (and sometimes it even has a few extra ingredients!). Besides containing your code, a single container image can contain numerous dependencies and tools, each with its own version. This does not only refer to 3rd party dependencies in your application but also to language frameworks, runtimes, and tools needed on your operating system. Additionally, the base image itself can be anything from small (e.g. slim, Alpine, etc.) to a full-blown OS image. It may even be another image hosting an application (like WordPress) that is extended with additional features or functionalities. This flexibility has a huge impact on the probability of introducing vulnerabilities with every layer added to the image.
Keep in mind that we are talking about the anatomy of one single container image. The more containerized applications you own, the more components you will probably need to maintain to get rid of vulnerable and outdated packages. Now, there are some best practices to reduce this complexity a bit (minimize base images, multi-stage builds, etc.), and the guide provided by Snyk and Docker might help you in this regard (https://snyk.io/learn/container-security/). However, you will always face a trade-off between the flexibility in your image design and complexity.
Reason 2: The illusion of short container lifespans
Even though you may have heard that “the average lifespan of a container image is several minutes instead of days”, the statement can be misleading. According to Sysdig’s 2019 Container Usage Report, at least 14 % of running containers were not shut down for a week or longer. I have experienced even longer lifespans, especially when you host tools for pipelining, monitoring, or storage management on the cluster. The report further shows that only half of all container images are updated on a “less than a weekly” basis (2019 Container Usage Report). The lifespan of vulnerable and exploitable containers existing unpatched in your environment should therefore not be underestimated.
Reason 3: There is nothing more annoying than being punished for carelessness
Don’t forget that finding and exploiting known vulnerabilities belongs to the standard repertoire of any attacker. Thanks to exploitation tools such as Metasploit, you do not even need to know how to construct the attack yourself. Being a victim of an exploited vulnerability that has not been remediated can be a painful experience, especially if a fix for that vulnerability was already available.
Selecting a fitting container scanning strategy
Now that we have talked about the importance of container vulnerability management, let’s have a look at “how the pawns are moved”. To be more specific, we will take a look at possible vulnerability scanning strategies around your container lifecycle. There are three possible locations where you can incorporate a scan:
- Local/After Image Build
- On Registry
- During Runtime
Independently of those locations, each container vulnerability management approach will follow the same principle illustrated below:
Strategy 1 — Local vulnerability scans and breaking the workflow before pushing the image to the registry
In this strategy, images are scanned right after they are built, either by baking an inline scanner into your Dockerfile or by running a script, CLI command, or CI plugin. The results of the scan are compared to a pre-defined threshold before any push to the trusted registry. If the threshold is not exceeded, the image is pushed and the cycle continues. Otherwise, the workflow breaks the build pipeline and the image owner is notified of the scan results.
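The threshold check in this workflow can be sketched in a few lines of Python. This is only a minimal illustration, assuming the scanner output has already been parsed into a list of severity labels. The names `SEVERITY_RANK`, `exceeds_threshold`, and `gate` are hypothetical and not tied to any specific scanner.

```python
# Hypothetical severity ordering; real scanners may use other labels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def exceeds_threshold(findings, max_allowed):
    """Return True if any finding is more severe than max_allowed.

    findings: list of severity strings, e.g. ["low", "high"]
    max_allowed: highest severity that is still tolerated
    """
    limit = SEVERITY_RANK[max_allowed]
    return any(SEVERITY_RANK[sev] > limit for sev in findings)


def gate(findings, max_allowed="medium"):
    """Decide whether the pipeline may push the image."""
    if exceeds_threshold(findings, max_allowed):
        return "break"  # stop the build and notify the image owner
    return "push"       # image may go to the trusted registry
```

In a CI job, a "break" result would typically translate into a non-zero exit code so that the push step never runs.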
Strategy 2 — Trigger scan on the registry after push
This strategy is probably the most widely used one. An image is first built and pushed to a registry. Right after the image is pushed, a scan is triggered either externally (e.g. CLI command, API call, CI plugin, etc.) or internally (built-in trigger in your registry). After the scan is performed, the results are returned to your workflow, so that you can proceed with the threshold evaluation. The exact workflow in this approach strongly depends on the implementation of your vulnerability scanner and its integration with the registry. Some registries, like Artifactory or Harbor, are able to prevent vulnerable images from being pulled and therefore deployed. This feature can also be useful in combination with periodic scans across the repositories within the registry, which apply the latest vulnerability information to existing images (see Artifactory, see Harbor).
Strategy 3 — Registry scanning with staging registry
Now, let’s have a look at a similar but slightly more complex strategy. The approach involves not one but two registries: one for staging and one for production images. Analogous to Strategy 2, an external or internal event on the push will trigger the image scan on the staging registry. If the image is clean, it will be pushed to the production registry. If it does not meet the threshold, it will remain in the staging registry and trigger an alert to the image owner. The production registry will store only the latest clean images that are allowed to be deployed in production. Older versions can be deleted (via retention time), depending on your registry tool. A periodic scan on the production registry can also alert the image owner about necessary actions and prevent the deployment of images that became vulnerable over time.
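The promotion step between the two registries could look roughly like the following sketch. It is deliberately simplified: the two registries are plain dictionaries, `scan` is a placeholder callable that returns the number of critical findings, and `promote_clean_images` is a hypothetical name. A real setup would call your registry and scanner APIs instead.

```python
def promote_clean_images(staging, production, scan, max_critical=0):
    """Copy images from staging to production if they pass the scan.

    staging / production: dicts mapping image reference -> image data,
        standing in for two registries.
    scan: callable returning the number of critical findings for an image.
    Returns the list of image references that stayed behind in staging.
    """
    rejected = []
    for ref, image in staging.items():
        if scan(image) <= max_critical:
            production[ref] = image   # clean: promote to production
        else:
            rejected.append(ref)      # stays in staging; alert the owner
    return rejected
```

The returned list is what you would feed into your alerting, so that image owners learn why their image never reached production.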
Strategy 4 — Scanning vulnerabilities during runtime
The fourth strategic approach is scanning container images during runtime. This scanning capability is very strong, as it shows you the actual vulnerability exposure of all running containers. This also includes containers that do not run your own images, for example system containers or tools. It becomes even more important when you cannot restrict users to deploying only from trusted image registries. Container runtime vulnerability scans run in an automated manner on your cluster, scanning all images in all accessible pods and benchmarking the findings against your defined thresholds. As the images are already deployed, you will need to define an adequate response behaviour for exceeded thresholds. There are two possible options:
1. Alert the image or cluster owner on the findings
2. Alert and isolate/kill vulnerable containers after a grace period
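The two response options can be combined into one small decision function. The sketch below assumes you record when a finding was first detected for a running container. The seven-day grace period and the name `runtime_response` are illustrative assumptions, not recommendations from any tool.

```python
from datetime import datetime, timedelta


def runtime_response(first_detected, now, grace_period=timedelta(days=7)):
    """Pick a response for a running container that exceeds the threshold.

    first_detected: when the finding was first reported for this container
    now: current time
    Returns "alert" within the grace period and "isolate" afterwards.
    """
    if now - first_detected < grace_period:
        return "alert"    # option 1: notify the image or cluster owner
    return "isolate"      # option 2: grace period expired, isolate or kill
```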
Strategy 5 — Combining Local, Registry, and Runtime Vulnerability Scanning and adding an Admission Control
As you might have already expected by now, the scanning strategies introduced so far can be combined in various ways to leverage their advantages. Local scans can help you detect findings early on and avoid unnecessary pushes to your registry/registries. Runtime scanning and registry scanning can interact with each other as well, but may require one additional puzzle piece: the Kubernetes admission controller. Admission controllers hook into the Kubernetes API request lifecycle and can forward deployment decisions to external sources such as your scanning results. The runtime scanner can verify that each image that is about to be deployed has been scanned (e.g. on the registry, or during runtime) and holds up against your threshold.
This strategy feels like the most holistic one, as scanning becomes part of your full container lifecycle.
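To make the admission control idea a bit more tangible, here is a rough sketch of the decision logic behind a validating admission webhook. It takes an AdmissionReview request for a Pod and answers with an AdmissionReview response. The `scan_passed` lookup is a hypothetical stand-in for a query against your scanner's results, and the sketch only looks at regular containers (init containers are ignored for brevity).

```python
def review_pod(admission_review, scan_passed):
    """Build an AdmissionReview response for a Pod creation request.

    admission_review: the AdmissionReview request as a dict
    scan_passed: dict mapping image reference -> True/False, standing in
        for a lookup against your scanner's results (hypothetical)
    """
    request = admission_review["request"]
    containers = request["object"]["spec"]["containers"]
    # Images that were never scanned or that failed the threshold are denied.
    blocked = [c["image"] for c in containers
               if not scan_passed.get(c["image"], False)]
    response = {"uid": request["uid"], "allowed": not blocked}
    if blocked:
        response["status"] = {
            "message": "images failed or missing scan: " + ", ".join(blocked)
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Treating unknown images as denied is a design choice in this sketch. It is what enforces "deploy only what has been scanned", but you may prefer a softer default while rolling the control out.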
Tips when selecting your vulnerability scanning tool
Now that we have seen some strategies, we are ready to pick a random scanning tool and get started, right? Easier said than done. There are numerous scanning tools on the market, and they all have their quirks and perks. The decision on which tool to add to your workflow should take some of the following aspects into consideration:
I. Your Budget: Obviously, the scanning tool should fit your budget, and the majority of tools on the market are not free.
II. A fit to your preferred scanning strategy: As you may expect, not all tools support all positions along the lifecycle. Therefore, not every solution is eligible for every strategic approach. Here is a short overview of tools I know and their scan positioning along the image lifecycle (Note: The capabilities of the products can of course change over time and there is no claim for completeness of the list):
III. The supported layers and frameworks:
Vulnerability scanners normally cover base images, system tools, runtimes, and frameworks. However, not every scanner also covers software composition analysis (3rd party dependencies and license violations) for your application. Keep this limitation in mind if this capability is crucial to you and not yet covered by other scanning tools in your pipeline (Tip: Crosscheck with your Application Security Testing tools). Also make sure to check whether your programming languages and frameworks are supported. The same applies to less common operating systems in base images, such as AWS Bottlerocket, openSUSE, or Windows containers, which may not be supported.
IV. The information richness of the scan results:
When you receive your first scan results, you may want to take a closer look at the information provided by the scanner. Besides the accuracy of findings, the richness of information varies considerably across the tool landscape. Personally, I consider the following information very useful when handling CVEs:
- Time of performed scan
- CVE Database version
- Full image name (including repository/image:tag and digest)
- Overview of image content (including types, versions, and layers)
- (CVE-)IDs of all findings
- Title and description of the finding
- CVSS score if available (maybe v2 and v3)
- Components affected by a vulnerability
- Available fixed versions
- Link to external sources for further information
- In case of runtime scanning, any information that helps you track down the image owner (e.g. labels, pod, namespace, cluster)
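If you want to normalize the output of different scanners, the fields above map quite naturally onto a small data model. The following is just one possible sketch. The class and field names (`Finding`, `ScanReport`, `fixable`) are my own assumptions and do not correspond to the output format of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Finding:
    cve_id: str
    title: str
    component: str                       # affected package or library
    installed_version: str
    fixed_version: Optional[str] = None  # None if no fix is available
    cvss_v3: Optional[float] = None
    references: List[str] = field(default_factory=list)


@dataclass
class ScanReport:
    image: str            # repository/image:tag
    digest: str
    scanned_at: str       # time of the scan, e.g. as ISO 8601 string
    db_version: str       # version of the CVE database used
    findings: List[Finding] = field(default_factory=list)

    def fixable(self):
        """Findings for which a fixed version is already available."""
        return [f for f in self.findings if f.fixed_version]
```

A helper like `fixable()` is handy for prioritization, since findings with an available fix are usually the ones you can act on right away.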
V. The accuracy and completeness of your scan results:
As nearly all image scanners differ in their detection and classification capabilities, it is hard to say which scanning tool is more accurate without verifying the findings yourself. You should be very careful when comparing the severity of detected findings between different scanning tools. Since there is no strict definition of how to classify a finding (e.g. critical, high, medium, low, etc.), tool providers will either rely on the classification found in the CVE database they use or define their own classification (e.g. based on CVSS scoring values). Severity is therefore a highly subjective measure. The same is true with regard to the number of findings. Some providers may only flag patchable vulnerabilities, whereas others also flag vulnerabilities for which no fix is available yet. If you are interested in these observations, you may want to read Raesene’s blogpost series.
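A small example shows how the same score can end up with different labels. The first function follows the qualitative rating scale of the CVSS v3.x specification, the second the three-level rating NVD uses for CVSS v2 scores. The function names are my own, and individual scanners may apply yet another mapping.

```python
def cvss_v3_severity(score):
    """Qualitative rating as defined by the CVSS v3.x specification."""
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def cvss_v2_severity(score):
    """Three-level rating as used by NVD for CVSS v2 scores."""
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    return "high"
```

A score of 9.8 is "critical" under the v3 scale but only "high" under the v2 scale, so a threshold such as "block on critical" behaves differently depending on which scale your tool applies.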
VI. Additional capabilities the scanning tool may provide:
Even though our focus in this post lies on container vulnerability scanning, it is worth noting that some scanning tools provide additional capabilities beyond vulnerability scanning that can be quite powerful when it comes to improving your security posture. Interesting features can be:
- Extended scan scope that covers vulnerability scanning for your worker nodes and cluster components
- Extended scanning capabilities that provide compliance benchmark scans
- Extended scanning capabilities that allow signature-based malware scans
- Capabilities to define admission controls and runtime security rules
- Capabilities that detect malicious or suspicious behaviour on the cluster
VII. The operational effort to install and maintain your scanner:
A vulnerability scanner can incorporate large CVE databases that increase its setup time and operational effort. This aspect is especially important if the databases are stored within your environment and not consumed as an external service. Depending on the tool’s overall architecture, you may also experience varying levels of complexity when installing and updating it.
VIII. The vulnerability databases the scanner uses and the update frequency:
True to the motto “shit in = shit out”, the best scanning tool will be of no use to you if it relies on outdated vulnerability information. Hence, you need to make sure that the tool provider updates its vulnerability database frequently and that these updates reach your scanner installation as well.
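A freshness check on the database is easy to automate. The sketch below assumes you can obtain the timestamp of the last database update from your scanner in some way. The 24-hour limit and the name `database_is_stale` are arbitrary assumptions for illustration.

```python
from datetime import datetime, timedelta


def database_is_stale(last_updated, now, max_age=timedelta(hours=24)):
    """Return True if the vulnerability database is older than max_age."""
    return now - last_updated > max_age
```

You could run such a check before every scan and fail the job (or at least raise an alert) when the database is stale, instead of silently trusting outdated results.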
IX. Resources needed to perform a scan:
This aspect is very important for scanning tools deployed on your cluster. Scan jobs consume resources, and when your scanner is slow while the number of scan jobs is high, you can run into problems such as exhausting your cluster resources. It is quite likely that the number of scan jobs will be high in the beginning, especially when your registries store large image histories or you operate huge clusters. To cope with the load, you may want to ensure that scans run fast enough and that the scanner is scalable (e.g. allowing scans to run in parallel).
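Running scans in parallel with a bounded number of workers is one way to balance throughput against resource consumption. The following sketch uses Python's standard thread pool. The `scan` callable is a placeholder for whatever triggers a scan in your setup, and `scan_all` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_all(images, scan, max_workers=4):
    """Run scan(image) for every image with a bounded number of workers.

    images: list of image references
    scan: callable that scans one image and returns its result
    Returns a dict mapping each image reference to its scan result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(scan, images))
    return dict(zip(images, results))
```

The `max_workers` limit is the knob that keeps an initial backlog of scan jobs from starving the other workloads on your cluster.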
X. Does the scan happen fully locally in your environment or does the scanner send your detailed data to a cloud service:
Some of the scanning tools available on the market analyse the contents of your container images and send a list of the contents/artefacts out of your cluster to a cloud service. Only a few scanning tools operate fully air-gapped or offer this capability as an option. Keep this in mind if you don’t want to share any details of your internal projects with the outside world or if you’re forced to operate in a fully air-gapped environment. You may therefore also want to check the general deployment model of the scanning tool.
Even though we have only covered the first piece on our chessboard, you can see that there are quite a few options on the table. Adapting the various scanning strategies and finding your new favourite scanning tool is up to you!
What to do after your vulnerability management setting is set up, you may ask?
Well, there are plenty of container security measures you could pick up next. I would recommend taking a closer look at the definition of runtime security policies (e.g. for networking, processes, and file access) to avoid weak configurations across your cluster. Also, secret management in container environments is a very interesting topic. But that’s a topic for another blog post!