# Match quality testing

This form of testing compares the results from various releases of grype using a
static set of reference container images. The kinds of comparisons made are:

1) "relative": find the vulnerability matching differences between both tools
   for a given image. This helps identify when a change has occurred in matching
   behavior and where the changes are.

2) "against labels": pair each tool's results for an image with ground truth. This
   helps identify how well the matching behavior is performing (did it get
   better or worse).

## Getting started

To capture raw tool output and store it in the local `.yardstick` directory for
further analysis:

```
make capture
```

To analyze the tool output and evaluate a pass/fail result:

```
make validate
```

A pass/fail result is shown in the output, with the reasons for any failure
listed explicitly.

## What are the quality gate criteria

The label comparison results are used to determine a pass/fail result,
specifically with the following criteria (sketched in code after this list):

- fail when the current grype F1 score drops below the last grype release's F1
  score (or the F1 score is indeterminate)
- fail when the indeterminate match percentage is greater than 10% in the
  current grype results
- fail when there is a rise in false negatives (FNs) relative to the results
  from the last grype release
- otherwise, pass
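
In rough terms, the gate boils down to logic like the following minimal sketch
(the type and names here are hypothetical illustrations, not the actual gating
code):

```
from dataclasses import dataclass

@dataclass
class ComparisonSummary:
    # hypothetical per-image summary of a tool's label comparison results
    f1_score: float | None        # None when the score is indeterminate
    indeterminate_percent: float  # % of matches with no (or conflicting) labels
    false_negatives: int          # matches that should have been made but were missed

def evaluate_gate(current: ComparisonSummary, last_release: ComparisonSummary):
    """Sketch of the pass/fail criteria listed above."""
    reasons = []

    # fail when the F1 score dropped below the last release (or cannot be computed)
    if current.f1_score is None:
        reasons.append("current F1 score is indeterminate")
    elif current.f1_score < last_release.f1_score:
        reasons.append("F1 score dropped relative to the last grype release")

    # fail when too many matches could not be paired with a label
    if current.indeterminate_percent > 10:
        reasons.append("indeterminate matches exceed 10%")

    # fail on any rise in false negatives relative to the last release
    if current.false_negatives > last_release.false_negatives:
        reasons.append("false negatives rose relative to the last grype release")

    return len(reasons) == 0, reasons
```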

F1 score is the primary way that tool matching performance is characterized. F1
score combines the true positive (TP), false positive (FP), and false negative
(FN) counts into a single metric between 0 and 1. Ideally the F1 score for an
image-tool pair should be 1. F1 score is a good way to summarize the matching
performance but does not explain why the matching performance is what it is.
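
For reference, F1 is the harmonic mean of precision and recall, which in terms
of these counts reduces to a single expression (a standard formula, not
grype-specific code):

```
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall = 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        raise ValueError("F1 is undefined when TP, FP, and FN are all zero")
    return 2 * tp / denominator
```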

Indeterminate matches are matches from results that could not be paired with a
label (TP or FP). This could also mean that multiple conflicting labels were
found for a single match. The more indeterminate matches there are, the less
confident you can be about the F1 score. Ideally there should be 0 indeterminate
matches, but this is difficult to achieve since vulnerability data is constantly
changing.
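
One plausible way to derive that percentage is sketched below (assuming it is
taken over all matches produced for a given image-tool pair; the actual
accounting lives in the gating tooling):

```
def indeterminate_percent(tp: int, fp: int, indeterminate: int) -> float:
    """Share of matches that could not be paired with a TP or FP label."""
    total_matches = tp + fp + indeterminate
    if total_matches == 0:
        return 0.0
    return 100.0 * indeterminate / total_matches
```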

False negatives represent matches that should have been made by the tool but
were missed. We should always make certain that this value does not increase
between releases of grype.

## Assumptions

1. **Comparing vulnerability results taken at different times is invalid**.
   We leverage the yardstick result-set feature to capture all vulnerability
   results at one time for a specific image and tool set. Why? If we run grype
   at version `a` on Monday and grype at version `b` on Tuesday and attempt to
   compare the results, any differences found will not be immediately
   attributable to a cause. That is, it is entirely possible that the
   vulnerability database from the run of `b` simply had more up-to-date
   information; if `grype@a` were run at the same time (on Tuesday), this
   explanation could be almost entirely eliminated.

2. **Comparing vulnerability results across images with different digests is invalid**.
   It may be very tempting to compare vulnerability results for
   `alpine:3.2` from Monday and `alpine:3.2` from Tuesday to see if there are
   any changes. However, this is potentially inaccurate: the image references
   are for the same tag, but the publisher may have pushed a new image with
   differing content (see the illustration after this list). Any change could
   lead to different vulnerability matching results, but we are only interested
   in vulnerability match differences that are due to actionable reasons (grype
   matcher logic problems or SBOM input data into matchers).
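
To make the distinction concrete, here is the difference between a mutable tag
reference and an immutable digest-pinned reference (the digest below is a
placeholder, not a real value):

```
# a tag is mutable: the publisher can repoint it at new content at any time
tag_reference = "alpine:3.2"

# a digest pins the exact image content: it always resolves to the same bytes
digest_reference = "alpine:3.2@sha256:<digest>"  # placeholder digest
```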

## Approach

Vulnerability matching has essentially two inputs:

- the packages that were found in the scanned artifact
- the vulnerability data from upstream providers (e.g. NVD, GHSA, etc.)

These are both moving targets!

We may implement more catalogers in syft that surface more packages discovered
over time (for the same artifact scanned). Also, the world is continually finding
and reporting new vulnerabilities. The more moving parts there are in this form
of testing, the harder it is to come to a conclusion about the actual quality of
the output over time.

To reduce the eroding value over time, we've decided to change as many moving
targets into fixed targets as possible:

- Vulnerability results beyond a particular year are ignored (the current config
  allows for <= 2020). Though there are still retroactive CVEs created, this
  helps a lot in terms of keeping vulnerability results relatively stable.

- SBOMs are used as input into grype instead of the raw container images. This
  allows the artifacts under test to remain truly fixed and saves a lot of time
  when capturing grype results (as the container image is no longer needed
  during analysis).

- For the captured SBOMs, container images must be referenced with a digest, not
  just a tag. In case we update a tool version (say syft), we want to make
  certain that we are scanning the exact same artifact later when we re-run the
  analysis.

- Versions of tools used are fixed to a specific `major.minor.patch` release.
  This allows us to account for capability differences between tool runs.

To reduce the maintenance effort of this comparison over time, there are a few
things to keep in mind:

- Once an image is labeled (at a specific digest), the image digest should be
  considered immutable (never updated). Why? It takes a lot of effort to label
  images, and there are no "clearly safe" assumptions that can be made when it
  comes to migrating labels from one image to another, no matter how "similar"
  the images may be. There is also no value in updating the image; these images
  are not being executed and their only purpose is to survey the matching
  performance of grype. In the philosophy of "maximizing fixed points", it
  doesn't make sense to change these assets. Over time it may be that we remove
  assets that are no longer useful for comparison, but this should rarely be
  done.

- Consider not changing the CVE year max-ceiling (currently set to 2020).
  Pushing this ceiling will likely raise the number of unlabeled matches
  significantly for all images. Only bump this ceiling if all possible matches
  are labeled.