# Match quality testing
This form of testing compares the results from various releases of grype using a
static set of reference container images. The kinds of comparisons made are:
1) "relative": find the vulnerability matching differences between both tools
for a given image. This helps identify when a change has occurred in matching
behavior and where the changes are.
2) "against labels": pair each tool results for an image with ground truth. This
helps identify how well the matching behavior is performing (did it get
better or worse).
## Getting started
To capture raw tool output and store it in the local `.yardstick` directory for
further analysis:
```
make capture
```
To analyze the tool output and evaluate a pass/fail result:
```
make validate
```
A pass/fail result is shown in the output, with the reasons for any failure
listed explicitly.
## What are the quality gate criteria?
The label comparison results are used to determine a pass/fail result,
specifically with the following criteria:
- fail when the current grype F1 score drops below the last grype release's F1
score (or the F1 score is indeterminate)
- fail when the percentage of indeterminate matches in the current grype
results is > 10%
- fail when there is a rise in false negatives (FNs) relative to the results
from the last grype release
- otherwise, pass

F1 score is the primary way that tool matching performance is characterized. F1
score combines the TP, FP, and FN counts into a single metric between 0 and 1.
Ideally the F1 score for an image-tool pair should be 1. F1 score is a good way
to summarize the matching performance but does not explain why the matching
performance is what it is.
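
For reference, F1 can be computed directly from the match counts; here is a
minimal sketch (not taken from the gate implementation):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # F1 is the harmonic mean of precision (TP / (TP + FP)) and recall
    # (TP / (TP + FN)), which reduces to 2*TP / (2*TP + FP + FN)
    if tp + fp + fn == 0:
        return 0.0  # no matches and no labels; treated as 0 here by convention
    return (2 * tp) / (2 * tp + fp + fn)
```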

Indeterminate matches are matches from the results that could not be paired with
a label (TP or FP). This can also mean that multiple conflicting labels were
found for a single match. The more indeterminate matches there are, the less
confident you can be about the F1 score. Ideally there should be 0 indeterminate
matches, but this is difficult to achieve since vulnerability data is constantly
changing.

False negatives represent matches that should have been made by the tool but
were missed. We should always make certain that this value does not increase
between releases of grype.
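
Putting the three criteria together, a minimal sketch of the gate logic might
look like the following (the type and function names here are hypothetical
illustrations, not the actual gate implementation):

```python
from dataclasses import dataclass

@dataclass
class LabelComparisonSummary:
    # hypothetical per-run summary derived from the label comparison results
    f1_score: float
    indeterminate_percent: float  # percentage of matches with no (or conflicting) labels
    false_negatives: int

def passes_gate(current: LabelComparisonSummary, last_release: LabelComparisonSummary) -> bool:
    if current.f1_score < last_release.f1_score:
        return False  # F1 score regressed relative to the last grype release
    if current.indeterminate_percent > 10:
        return False  # too many matches could not be paired with a TP/FP label
    if current.false_negatives > last_release.false_negatives:
        return False  # matches that should have been made are now being missed
    return True
```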
## Assumptions
1. **Comparing vulnerability results taken at different times is invalid**.
We leverage the yardstick result-set feature to capture all vulnerability
results at one time for a specific image and tool set. Why? If we use grype
at version `a` on Monday and grype at version `b` on Tuesday and then attempt
to compare the results, any differences found will not be immediately
attributable to a cause. It is entirely possible that the vulnerability
database used in the run of `b` simply had more up-to-date information; if
`grype@a` were run at the same time (on Tuesday), this explanation could be
almost entirely eliminated.
2. **Comparing vulnerability results across images with different digests is invalid**.
It may be very tempting to compare vulnerability results for
`alpine:3.2` from Monday and `alpine:3.2` from Tuesday to see if there are
any changes. However, this is potentially inaccurate: the image references
use the same tag, but the publisher may have pushed a new image with
differing content. Any content change could lead to different vulnerability
matching results, but we are only interested in differences that are due to
actionable reasons (grype matcher logic problems or the SBOM input data fed
into the matchers).
## Approach
Vulnerability matching has essentially two inputs:
- the packages that were found in the scanned artifact
- the vulnerability data from upstream providers (e.g. NVD, GHSA, etc.)

These are both moving targets!

We may implement more catalogers in syft that surface more packages over time
(for the same artifact scanned). Also, the world is continually finding
and reporting new vulnerabilities. The more moving parts there are in this form
of testing the harder it is to come to a conclusion about the actual quality of
the output over time.

To keep the value of this testing from eroding over time, we've decided to
change as many moving targets into fixed targets as possible:
- Vulnerability results beyond a particular year are ignored (the current config
allows for <= 2020). Though retroactive CVEs are still created, this helps a
lot in keeping vulnerability results relatively stable.
- SBOMs are used as input into grype instead of the raw container images (see
the sketch after this list). This allows the artifacts under test to remain
truly fixed and saves a lot of time when capturing grype results (as the
container image is no longer needed during analysis).
- For the captured SBOMs, container images must be referenced by digest, not
just by tag. If we later update a tool version (say, syft), we want to be
certain that we are scanning the exact same artifact when we re-run the
analysis.
- Versions of tools used are fixed to a specific `major.minor.patch` release.
This allows us to account for capability differences between tool runs.
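
For illustration, capturing an SBOM once for a digest-pinned image and then
scanning that SBOM (rather than the image) could look like the sketch below.
The image digest and file name are placeholders, and the real capture is
orchestrated through yardstick rather than an ad hoc script like this:

```python
import subprocess

# placeholder reference: always a digest-pinned image, never a bare tag
image = "docker.io/library/alpine@sha256:<digest>"

# capture the SBOM once with syft (JSON output) so the artifact under test stays fixed
with open("alpine.sbom.json", "w") as sbom_file:
    subprocess.run(["syft", image, "-o", "json"], stdout=sbom_file, check=True)

# later grype runs consume the stored SBOM via the "sbom:" input scheme,
# so the container image itself is not needed during analysis
subprocess.run(["grype", "sbom:alpine.sbom.json"], check=True)
```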
To reduce maintenance effort of this comparison over time there are a few things
to keep in mind:
- Once an image is labeled (at a specific digest) the image digest should be
considered immutable (never updated). Why? It takes a lot of effort to label
images and there are no "clearly safe" assumptions that can be made when it
comes to migrating labels from one image to another no matter how "similar"
the images may be. There is also no value in updating the image; these images
are not being executed and their only purpose is to survey the matching
performance of grype. In the philosophy of "maximizing fixed points" it
doesn't make sense to change these assets. Over time it may be that we remove
assets that are no longer useful for comparison, but this should rarely be
done.
- Consider not changing the CVE year max-ceiling (currently set to 2020).
Pushing this ceiling up will likely raise the number of unlabeled matches
significantly for all images. Only bump this ceiling if all possible matches
are labeled.