# Benchmarking `factor`

## Microbenchmarking deterministic functions

We currently use [`criterion`] to benchmark deterministic functions,
such as `gcd` and `table::factor`.

Those benchmarks can simply be run with `cargo bench` as usual, but they may
require a recent version of Rust, *i.e.* the project's minimum supported
version of Rust does not apply to the benchmarks.
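
For illustration, such a benchmark could look like the sketch below; the file
name, the `gcd` stand-in and the input values are made up for the example and
are not the project's actual benchmark code.

```rust
// Sketch of a `criterion` benchmark (e.g. `benches/gcd.rs`, registered in
// `Cargo.toml` with `harness = false`); `gcd` here is a stand-in for the
// real function under test.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in: Euclid's algorithm.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

fn bench_gcd(c: &mut Criterion) {
    c.bench_function("gcd", |b| {
        // `black_box` keeps the compiler from const-folding the call away.
        b.iter(|| gcd(black_box(2_971_215_073), black_box(1_836_311_903)))
    });
}

criterion_group!(benches, bench_gcd);
criterion_main!(benches);
```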
However, µbenchmarks are by nature unstable: not only are they specific to
the hardware, operating system version, etc., but they are noisy and affected
by other tasks on the system (browser, compile jobs, etc.), which can cause
`criterion` to report spurious performance improvements and regressions.

This can be mitigated by getting as close to [idealised conditions][lemire]
as possible:
- minimize the amount of computation and I/O running concurrently with the
  benchmark, *i.e.* close your browser and IM clients, don't compile at the
  same time, etc. ;
- ensure the CPU's [frequency stays constant] during the benchmark ;
- [isolate a **physical** core], set it to `nohz_full`, and pin the benchmark
  to it, so it won't be preempted in the middle of a measurement ;
- disable ASLR by running `setarch -R cargo bench`, so we can compare results
  across multiple executions.

[`criterion`]: https://bheisler.github.io/criterion.rs/book/index.html
[lemire]: https://lemire.me/blog/2018/01/16/microbenchmarking-calls-for-idealized-conditions/
[isolate a **physical** core]: https://pyperf.readthedocs.io/en/latest/system.html#isolate-cpus-on-linux
[frequency stays constant]: XXXTODO
### Guidance for designing µbenchmarks

*Note:* this guidance is specific to `factor` and takes its application domain
into account; do not expect it to generalise to other projects. It is based
on Daniel Lemire's [*Microbenchmarking calls for idealized conditions*][lemire],
which I recommend reading if you want to add benchmarks to `factor`.
1. Select a small, self-contained, deterministic component

   `gcd` and `table::factor` are good examples of this:
   - no I/O or access to external data structures ;
   - no call into other components ;
   - behaviour is deterministic: no RNG, no concurrency, ... ;
   - the test's body is *fast* (~100ns for `gcd`, ~10µs for `factor::table`),
     so each sample takes a very short time, minimizing variability and
     maximizing the number of samples we can take in a given time.

2. Benchmarks are immutable (once merged in `uutils`)

   Modifying a benchmark means previously-collected values cannot meaningfully
   be compared, silently giving nonsensical results. If you must modify an
   existing benchmark, rename it.

3. Test common cases

   We are interested in overall performance, rather than specific edge-cases;
   use **reproducibly-randomised inputs**, sampling from either all possible
   input values or some subset of interest (see the sketch after this list).

4. Use [`criterion`], `criterion::black_box`, ...

   `criterion` isn't perfect, but it is also much better than ad-hoc
   solutions in each benchmark.
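
Tying points 3 and 4 together, the sketch below shows one way to benchmark a
function over reproducibly-randomised inputs: the seed is fixed, so every run
sees the same sample, while `black_box` keeps the compiler from optimising the
inputs away. The `rand` dependency, the seed, the sample size and the `gcd`
stand-in are assumptions made for the example, not actual project code.

```rust
// Sketch: reproducibly-randomised inputs for a `criterion` benchmark
// (assumes the `rand` crate, 0.8 API).
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{rngs::StdRng, Rng, SeedableRng};

// Stand-in for the function under test.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn bench_gcd_random(c: &mut Criterion) {
    // Fixed seed: the "random" inputs are identical on every run and every
    // machine, so results stay comparable across executions.
    let mut rng = StdRng::seed_from_u64(42);
    let inputs: Vec<(u64, u64)> = (0..1_000).map(|_| (rng.gen(), rng.gen())).collect();

    c.bench_function("gcd (randomised inputs)", |b| {
        // Each measured iteration runs over the whole input sample.
        b.iter(|| {
            for &(x, y) in &inputs {
                black_box(gcd(black_box(x), black_box(y)));
            }
        })
    });
}

criterion_group!(benches, bench_gcd_random);
criterion_main!(benches);
```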
## Wishlist

### Configurable statistical estimators

`criterion` always uses the arithmetic mean as its estimator; in µbenchmarks,
where the code under test is fully deterministic and the measurements are
subject to additive, positive noise, [the minimum is more appropriate][lemire].
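
As a toy illustration (the numbers below are made up, not real measurements):
with a fixed true cost and purely additive, non-negative noise, the smallest
sample recovers the true cost, while the mean drifts upward with every noisy
sample.

```rust
// Toy illustration with made-up numbers: minimum vs. mean under additive,
// positive noise. The "true" cost is 100 ns; noise only ever adds to it.
fn main() {
    let samples_ns = [100.0_f64, 115.0, 103.0, 160.0, 101.0];

    let min = samples_ns.iter().copied().fold(f64::INFINITY, f64::min);
    let mean = samples_ns.iter().sum::<f64>() / samples_ns.len() as f64;

    println!("min  = {min} ns");  // 100 ns: the true cost
    println!("mean = {mean} ns"); // 115.8 ns: inflated by the noise
}
```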
### CI & reproducible performance testing

Measuring performance on real hardware is important, as it relates directly
to what users of `factor` experience; however, such measurements are subject
to the constraints of the real world, and aren't perfectly reproducible.
Moreover, the mitigations described above aren't achievable in virtualized,
multi-tenant environments such as CI.

Instead, we could run the µbenchmarks on a simulated CPU with [`cachegrind`],
measure execution “time” in that model (in CI), and use it to detect and report
performance improvements and regressions.

[`iai`] is an implementation of this idea for Rust.
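
For instance, an [`iai`] benchmark could look like the sketch below (assuming
the `iai` crate as a dev-dependency and `valgrind` available; the `gcd`
stand-in and its inputs are again made up for the example):

```rust
// Sketch of an `iai` benchmark (e.g. `benches/iai_gcd.rs`, registered with
// `harness = false`). `iai` runs the function under cachegrind and reports
// instruction and cache counts instead of wall-clock time.
use iai::black_box;

// Stand-in for the function under test.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn iai_gcd() -> u64 {
    gcd(black_box(2_971_215_073), black_box(1_836_311_903))
}

iai::main!(iai_gcd);
```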
[`cachegrind`]: https://www.valgrind.org/docs/manual/cg-manual.html
[`iai`]: https://bheisler.github.io/criterion.rs/book/iai/iai.html

### Comparing randomised implementations across multiple inputs
`factor` is a challenging target for system benchmarks as it combines two
characteristics:

1. integer factoring algorithms are randomised, with large variance in
   execution time ;

2. different inputs also have large differences in factoring time, and these
   differences follow no natural, linear ordering of the inputs.

If (1) were untrue (i.e. if execution time weren't random), we could faithfully
compare two implementations (two successive versions, or `uutils` and GNU)
using a scatter plot, where each axis corresponds to the performance of one
implementation.

Similarly, without (2) we could plot numbers on the X axis and their factoring
time on the Y axis, using multiple lines for various quantiles. The large
differences in factoring times for successive numbers mean that such a plot
would be unreadable.