coreutils/src/uu/factor/BENCHMARKING.md

5 KiB

Benchmarking factor

The benchmarks for factor are located under tests/benches/factor and can be invoked with cargo bench in that directory.

They are located outside the uu_factor crate, as they do not comply with the project's minimum supported Rust version, i.e. may require a newer version of rustc.

Microbenchmarking deterministic functions

We currently use criterion to benchmark deterministic functions, such as gcd and table::factor.

However, microbenchmarks are by nature unstable: not only are they specific to the hardware, operating system version, etc., but they are noisy and affected by other tasks on the system (browser, compile jobs, etc.), which can cause criterion to report spurious performance improvements and regressions.

This can be mitigated by getting as close to idealized conditions as possible:

  • minimize the amount of computation and I/O running concurrently to the benchmark, i.e. close your browser and IM clients, don't compile at the same time, etc. ;
  • ensure the CPU's [frequency stays constant] during the benchmark ;
  • isolate a physical core, set it to nohz_full, and pin the benchmark to it, so it won't be preempted in the middle of a measurement ;
  • disable ASLR by running setarch -R cargo bench, so we can compare results across multiple executions.

[frequency stays constant]: ...

Guidance for designing microbenchmarks

Note: this guidance is specific to factor and takes its application domain into account; do not expect it to generalize to other projects. It is based on Daniel Lemire's Microbenchmarking calls for idealized conditions, which I recommend reading if you want to add benchmarks to factor.

  1. Select a small, self-contained, deterministic component (gcd and table::factor are good examples):

    • no I/O or access to external data structures ;
    • no call into other components ;
    • behavior is deterministic: no RNG, no concurrency, ... ;
    • the test's body is fast (~100ns for gcd, ~10µs for factor::table), so each sample takes a very short time, minimizing variability and maximizing the numbers of samples we can take in a given time.
  2. Benchmarks are immutable (once merged in uutils)

    Modifying a benchmark means previously-collected values cannot meaningfully be compared, silently giving nonsensical results. If you must modify an existing benchmark, rename it.

  3. Test common cases

    We are interested in overall performance, rather than specific edge-cases; use reproducibly-randomized inputs, sampling from either all possible input values or some subset of interest.

  4. Use criterion, criterion::black_box, ...

    criterion isn't perfect, but it is also much better than ad-hoc solutions in each benchmark.

Wishlist

Configurable statistical estimators

criterion always uses the arithmetic average as estimator; in microbenchmarks, where the code under test is fully deterministic and the measurements are subject to additive, positive noise, the minimum is more appropriate.

CI & reproducible performance testing

Measuring performance on real hardware is important, as it relates directly to what users of factor experience; however, such measurements are subject to the constraints of the real-world, and aren't perfectly reproducible. Moreover, the mitigation for it (described above) isn't achievable in virtualized, multi-tenant environments such as CI.

Instead, we could run the microbenchmarks in a simulated CPU with cachegrind, measure execution “time” in that model (in CI), and use it to detect and report performance improvements and regressions.

iai is an implementation of this idea for Rust.

Comparing randomized implementations across multiple inputs

factor is a challenging target for system benchmarks as it combines two characteristics:

  1. integer factoring algorithms are randomized, with large variance in execution time ;

  2. various inputs also have large differences in factoring time, that corresponds to no natural, linear ordering of the inputs.

If (1) was untrue (i.e. if execution time wasn't random), we could faithfully compare 2 implementations (2 successive versions, or uutils and GNU) using a scatter plot, where each axis corresponds to the perf. of one implementation.

Similarly, without (2) we could plot numbers on the X axis and their factoring time on the Y axis, using multiple lines for various quantiles. The large differences in factoring times for successive numbers, mean that such a plot would be unreadable.