* Use buffered stdout to reduce write sys calls.
This simple change yielded the biggest performace gain.
* Use `for_byte_record_with_terminator` from the `bstr` crate.
This is to minimize the per line copying needed by
`BufReader::read_until`. The `cut_fields` and `cut_fields_delimiter`
functions used `read_until` to iterate over lines. That required copying
each input line to the line buffer. With
`for_byte_record_with_terminator` copying is minimized as it calls our
closure with a reference to BufReader's buffer most of the time. It
needs to copy (internally) only to process any incomplete lines at the
end of the buffer.
* Re-write `Searcher` to use `memchr`.
Switch from the naive implementation to one that uses `memchr`.
* Rewrite `cut_bytes` almost entirely.
This was already well optimized. The performance gain in this case is
not from avoiding copying. In fact, it needed zero copying whereas new
implementation introduces some copying similar to `cut_fields` described
above. But the occassional copying cost is more than offset by the use
of the very fast `memchr` inside `for_byte_record_with_terminator`.
This change also simplifies the code significantly. Removed the `buffer`
module.
- refactor internal version specifications to be ">=M.m.p" (where M.m.p is *already published*)
## [why]
Loosening internal version dependencies decreases the coupling between packages such
that packages can be published in a looser order. It allows the packages to be version
updated and published in tandem (ie, by using `cargo workspace ...`). Once published,
the internal versions can then be updated (again, to an *already published* package
version), as needed.