Benchmarking factor
Microbenchmarking deterministic functions
We currently use criterion to benchmark deterministic functions, such as gcd and table::factor.
Those benchmarks can simply be executed with cargo bench as usual,
but may require a recent version of Rust, i.e. the project's minimum
supported version of Rust does not apply to the benchmarks.
However, µbenchmarks are by nature unstable: not only are they specific to
the hardware, operating system version, etc., but they are noisy and affected
by other tasks on the system (browser, compile jobs, etc.), which can cause
criterion to report spurious performance improvements and regressions.
This can be mitigated by getting as close to idealised conditions as possible:
- minimize the amount of computation and I/O running concurrently with the benchmark, i.e. close your browser and IM clients, don't compile at the same time, etc. ;
- ensure the CPU's frequency stays constant during the benchmark ;
- isolate a physical core, set it to nohz_full, and pin the benchmark to it, so it won't be preempted in the middle of a measurement ;
- disable ASLR by running setarch -R cargo bench, so we can compare results across multiple executions.
Guidance for designing µbenchmarks
Note: this guidance is specific to factor and takes its application domain
into account; do not expect it to generalise to other projects. It is based
on Daniel Lemire's "Microbenchmarking calls for idealized conditions",
which I recommend reading if you want to add benchmarks to factor.
-
Select a small, self-contained, deterministic component
gcd and table::factor are good examples of such:
- no I/O or access to external data structures ;
- no calls into other components ;
- behaviour is deterministic: no RNG, no concurrency, ... ;
- the test's body is fast (~100ns for gcd, ~10µs for table::factor), so each sample takes a very short time, minimizing variability and maximizing the number of samples we can take in a given time.
-
Benchmarks are immutable (once merged in uutils)
Modifying a benchmark means previously-collected values cannot meaningfully be compared, silently giving nonsensical results. If you must modify an existing benchmark, rename it.
-
Test common cases
We are interested in overall performance, rather than specific edge-cases; use reproducibly-randomised inputs, sampling from either all possible input values or some subset of interest.
-
Use criterion, criterion::black_box, ...
criterion isn't perfect, but it is much better than ad-hoc solutions in each benchmark.
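To make the "deterministic" and "reproducibly-randomised inputs" points concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the gcd below is a plain Euclid stand-in rather than the project's implementation, SplitMix64 and make_inputs are hypothetical helpers, and std::hint::black_box stands in for criterion::black_box. In an actual benchmark, the loop body would live inside a criterion benchmark rather than in main.

```rust
use std::hint::black_box;

/// SplitMix64: a tiny fixed-seed generator, used here only so that the
/// inputs are identical on every run (hypothetical helper).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Reproducibly-randomised inputs: the same seed always yields the same pairs.
fn make_inputs(seed: u64, count: usize) -> Vec<(u64, u64)> {
    let mut rng = SplitMix64(seed);
    (0..count).map(|_| (rng.next_u64(), rng.next_u64())).collect()
}

/// Stand-in for the function under test (Euclid's algorithm).
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

fn main() {
    // A fixed seed: every run of the benchmark sees the same inputs.
    let inputs = make_inputs(0xDEAD_BEEF, 1_000);
    let mut acc = 0u64;
    for &(a, b) in &inputs {
        // black_box asks the compiler (best-effort) not to precompute the
        // call from known inputs or to discard its result.
        acc ^= black_box(gcd(black_box(a), black_box(b)));
    }
    println!("checksum: {acc}");
}
```

Sampling from the full u64 range is only one choice; a subset of interest (say, small inputs) would be generated the same way from a fixed seed.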
Wishlist
Configurable statistical estimators
criterion always uses the arithmetic average as estimator; in µbenchmarks,
where the code under test is fully deterministic and the measurements are
subject to additive, positive noise, the minimum is more appropriate.
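A small sketch of why the minimum suits this noise model, assuming made-up numbers: a deterministic routine that costs exactly 100 ns, plus non-negative noise on each sample. The mean and minimum helpers are hypothetical and are not part of criterion.

```rust
/// Arithmetic mean of the samples (in nanoseconds).
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Minimum of the samples (in nanoseconds).
fn minimum(samples: &[f64]) -> f64 {
    samples.iter().copied().fold(f64::INFINITY, f64::min)
}

fn main() {
    // Made-up numbers: the code under test always costs 100 ns...
    let true_cost = 100.0;
    // ...and each measurement adds some non-negative noise on top.
    let noise = [0.0, 3.0, 0.0, 250.0, 12.0, 0.0, 40.0, 1.0];
    let samples: Vec<f64> = noise.iter().map(|n| true_cost + n).collect();

    println!("mean    = {} ns", mean(&samples));
    println!("minimum = {} ns", minimum(&samples));
}
```

With these samples the mean is 138.25 ns, dragged up by a single 250 ns outlier, while the minimum is the noise-free 100 ns.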
CI & reproducible performance testing
Measuring performance on real hardware is important, as it relates directly
to what users of factor experience; however, such measurements are subject
to the constraints of the real world, and aren't perfectly reproducible.
Moreover, the mitigations described above aren't achievable in
virtualized, multi-tenant environments such as CI.
Instead, we could run the µbenchmarks in a simulated CPU with cachegrind,
measure execution “time” in that model (in CI), and use it to detect and report
performance improvements and regressions.
iai is an implementation of this idea for Rust.
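The "detect and report" step could look roughly like the sketch below. It is purely illustrative: the Verdict type, the compare function, and the 1% threshold are all assumptions, and the counts are made up rather than taken from cachegrind or iai. Since simulated counts are reproducible, a small threshold can be used without drowning in noise.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Improvement,
    Regression,
    Unchanged,
}

/// Compare two simulated costs (e.g. instruction counts) and flag relative
/// changes larger than `threshold` (a fraction: 0.01 means 1%).
/// Hypothetical helper; assumes `baseline` is non-zero.
fn compare(baseline: u64, current: u64, threshold: f64) -> Verdict {
    let change = (current as f64 - baseline as f64) / baseline as f64;
    if change > threshold {
        Verdict::Regression
    } else if change < -threshold {
        Verdict::Improvement
    } else {
        Verdict::Unchanged
    }
}

fn main() {
    // Made-up counts for the base branch and for a proposed change.
    let (baseline, current) = (1_000_000, 1_050_000);
    println!("{:?}", compare(baseline, current, 0.01));
}
```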
Comparing randomised implementations across multiple inputs
factor is a challenging target for system benchmarks as it combines two
characteristics:
1. integer factoring algorithms are randomised, with large variance in execution time ;
2. different inputs also have large differences in factoring time, which correspond to no natural, linear ordering of the inputs.
If (1) were untrue (i.e. if execution time wasn't random), we could faithfully
compare 2 implementations (2 successive versions, or uutils and GNU) using
a scatter plot, where each axis corresponds to the performance of one implementation.
Similarly, without (2) we could plot numbers on the X axis and their factoring time on the Y axis, using multiple lines for various quantiles. The large differences in factoring times for successive numbers mean that such a plot would be unreadable.
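For reference, the per-input quantiles that such a plot would draw can be computed as in the sketch below. The quantile helper is hypothetical and uses the nearest-rank definition, and the timings are made up, chosen only to show two successive inputs with very different distributions.

```rust
/// Nearest-rank quantile of a non-empty slice, for `q` in (0, 1].
/// Hypothetical helper, not part of factor.
fn quantile(samples: &[u64], q: f64) -> u64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest value with at least q * n samples at or below it.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn main() {
    // Made-up timings (in µs) for repeated runs on two successive inputs.
    let runs: [(u64, [u64; 5]); 2] = [
        (1_000_000, [12, 9, 15, 10, 11]),
        (1_000_001, [480, 950, 510, 2_300, 700]),
    ];
    for (n, times) in &runs {
        println!(
            "{n}: p10 = {} µs, p50 = {} µs, p90 = {} µs",
            quantile(times, 0.1),
            quantile(times, 0.5),
            quantile(times, 0.9),
        );
    }
}
```

Each input yields one point per quantile line; when neighbouring inputs differ by orders of magnitude, as in this made-up data, the lines jump around too much to read.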