# Benchmarking `factor`

<!-- spell-checker:ignore (names) Daniel Lemire * Lemire's ; (misc) nohz -->

The benchmarks for `factor` are located under `tests/benches/factor`
and can be invoked with `cargo bench` in that directory.

They are located outside the `uu_factor` crate, as they are not required to
comply with the project's minimum supported Rust version, *i.e.* they may
require a newer version of `rustc`.

## Microbenchmarking deterministic functions

We currently use [`criterion`] to benchmark deterministic functions,
such as `gcd` and `table::factor`.

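
For illustration, a minimal `criterion` benchmark might look like the sketch
below. The `gcd` shown here is a self-contained stand-in rather than
`uu_factor`'s actual implementation, and in a typical `criterion` setup such a
file is declared as a `[[bench]]` target with `harness = false` in the
benchmark crate's `Cargo.toml`.

```rust
// Sketch of a `criterion` microbenchmark; `gcd` is a placeholder workload so
// the example is self-contained -- real benchmarks import the function under
// test from `uu_factor`.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn gcd(mut a: u64, mut b: u64) -> u64 {
    // Euclid's algorithm, used here only as a stand-in workload.
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

fn bench_gcd(c: &mut Criterion) {
    // `black_box` keeps the compiler from constant-folding the inputs
    // or discarding the result.
    c.bench_function("gcd", |b| b.iter(|| gcd(black_box(2048), black_box(3072))));
}

criterion_group!(benches, bench_gcd);
criterion_main!(benches);
```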
However, microbenchmarks are by nature unstable: not only are they specific to
the hardware, operating system version, etc., but they are noisy and affected
by other tasks on the system (browser, compile jobs, etc.), which can cause
`criterion` to report spurious performance improvements and regressions.

This can be mitigated by getting as close to [idealized conditions][lemire]
as possible:

- minimize the amount of computation and I/O running concurrently with the
  benchmark, *i.e.* close your browser and IM clients, don't compile at the
  same time, etc.;
- ensure the CPU's [frequency stays constant] during the benchmark;
- [isolate a **physical** core], set it to `nohz_full`, and pin the benchmark
  to it, so it won't be preempted in the middle of a measurement;
- disable ASLR by running `setarch -R cargo bench`, so we can compare results
  across multiple executions.

[`criterion`]: https://bheisler.github.io/criterion.rs/book/index.html
[lemire]: https://lemire.me/blog/2018/01/16/microbenchmarking-calls-for-idealized-conditions/
[isolate a **physical** core]: https://pyperf.readthedocs.io/en/latest/system.html#isolate-cpus-on-linux
[frequency stays constant]: ... <!-- ToDO -->

### Guidance for designing microbenchmarks

*Note:* this guidance is specific to `factor` and takes its application domain
into account; do not expect it to generalize to other projects. It is based
on Daniel Lemire's [*Microbenchmarking calls for idealized conditions*][lemire],
which I recommend reading if you want to add benchmarks to `factor`.

1. Select a small, self-contained, deterministic component
   (`gcd` and `table::factor` are good examples):
   - no I/O or access to external data structures;
   - no calls into other components;
   - behavior is deterministic: no RNG, no concurrency, ...;
   - the test's body is *fast* (~100ns for `gcd`, ~10µs for `table::factor`),
     so each sample takes a very short time, minimizing variability and
     maximizing the number of samples we can take in a given time.

2. Benchmarks are immutable (once merged in `uutils`)

   Modifying a benchmark means previously-collected values can no longer be
   meaningfully compared, silently giving nonsensical results. If you must
   modify an existing benchmark, rename it.

3. Test common cases

   We are interested in overall performance, rather than specific edge-cases;
   use **reproducibly-randomized inputs**, sampling from either all possible
   input values or some subset of interest (see the sketch after this list).

4. Use [`criterion`], `criterion::black_box`, ...

   `criterion` isn't perfect, but it is still much better than ad-hoc
   solutions in each benchmark.

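
To make points 3 and 4 concrete, here is a sketch of a benchmark over
reproducibly-randomized inputs: a fixed seed guarantees the same input
sequence on every run, so results remain comparable across executions. The use
of the `rand` crate and all names below are illustrative assumptions, not
necessarily what the existing benchmarks do.

```rust
// Sketch: reproducibly-randomized inputs (points 3 and 4 above).
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{rngs::StdRng, Rng, SeedableRng};

// Placeholder for the component under test (e.g. `table::factor`).
fn function_under_test(n: u64) -> u64 {
    n.count_ones() as u64
}

fn bench_random_inputs(c: &mut Criterion) {
    // Fixed seed: the "random" inputs are identical on every run and machine,
    // so previously collected results stay meaningful.
    let mut rng = StdRng::seed_from_u64(42);
    let inputs: Vec<u64> = (0..1_000).map(|_| rng.gen()).collect();

    c.bench_function("function_under_test/random-u64", |b| {
        b.iter(|| {
            for &n in &inputs {
                black_box(function_under_test(black_box(n)));
            }
        })
    });
}

criterion_group!(benches, bench_random_inputs);
criterion_main!(benches);
```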
## Wishlist

### Configurable statistical estimators

`criterion` always uses the arithmetic average as its estimator; in microbenchmarks,
where the code under test is fully deterministic and the measurements are
subject to additive, positive noise, [the minimum is more appropriate][lemire].

### CI & reproducible performance testing

Measuring performance on real hardware is important, as it relates directly
to what users of `factor` experience; however, such measurements are subject
to the constraints of the real world, and aren't perfectly reproducible.
Moreover, the mitigations described above aren't achievable in
virtualized, multi-tenant environments such as CI.

Instead, we could run the microbenchmarks on a simulated CPU with [`cachegrind`],
measure execution “time” in that model (in CI), and use it to detect and report
performance improvements and regressions.

[`iai`] is an implementation of this idea for Rust.

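
As a rough sketch (not something currently wired into our CI), an [`iai`]
benchmark looks much like a `criterion` one, except that the benchmarked
functions take no arguments and the reported numbers are instruction and
cache-event counts from [`cachegrind`] rather than wall-clock time. The `gcd`
below is again a local stand-in for the real code.

```rust
// Hypothetical `iai` benchmark: instruction and cache-event counts replace
// wall-clock time, so results stay stable even on noisy, virtualized hosts.
use iai::black_box;

// Stand-in workload; a real benchmark would call into `uu_factor` so the
// counts reflect the shipped code.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

fn iai_gcd() -> u64 {
    gcd(black_box(2048), black_box(3072))
}

iai::main!(iai_gcd);
```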
[`cachegrind`]: https://www.valgrind.org/docs/manual/cg-manual.html
[`iai`]: https://bheisler.github.io/criterion.rs/book/iai/iai.html

### Comparing randomized implementations across multiple inputs

`factor` is a challenging target for system benchmarks as it combines two
characteristics:

1. integer factoring algorithms are randomized, with large variance in
   execution time;

2. different inputs also show large differences in factoring time, and those
   differences do not correspond to any natural, linear ordering of the inputs.

If (1) were untrue (*i.e.* if execution time weren't random), we could faithfully
compare two implementations (two successive versions, or `uutils` and GNU) using
a scatter plot, where each axis corresponds to the performance of one implementation.

Similarly, without (2) we could plot numbers on the X axis and their factoring
time on the Y axis, using multiple lines for various quantiles. The large
differences in factoring times for successive numbers mean that such a plot
would be unreadable.