Implement the `--line-bytes` option to `split`. In this mode, the
program tries to write as many lines of the input as possible to each
chunk of output without exceeding a specified byte limit. The new
`LineBytesChunkWriter` struct represents this functionality.
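The packing rule described above can be sketched as follows. This is only an illustration, not the actual `LineBytesChunkWriter`: the helper name `line_bytes_chunks` is hypothetical, and cutting an over-long line into limit-sized pieces is an assumption about the edge case.

```rust
/// Greedily pack whole lines into chunks of at most `limit` bytes.
/// Assumption: a single line longer than `limit` is cut into
/// `limit`-sized pieces (the real writer may behave differently).
fn line_bytes_chunks(input: &[u8], limit: usize) -> Vec<Vec<u8>> {
    assert!(limit > 0, "the byte limit must be positive");
    let mut chunks = Vec::new();
    let mut current: Vec<u8> = Vec::new();
    for line in input.split_inclusive(|&b| b == b'\n') {
        // Flush the current chunk if this line would overflow it.
        if !current.is_empty() && current.len() + line.len() > limit {
            chunks.push(std::mem::take(&mut current));
        }
        // A line that is too long even for an empty chunk is cut up.
        let mut rest = line;
        while rest.len() > limit {
            let (head, tail) = rest.split_at(limit);
            chunks.push(head.to_vec());
            rest = tail;
        }
        current.extend_from_slice(rest);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```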
Implement `-n l/k/N` option, where the `k`th chunk of the input file
is written to stdout. For example,
$ seq -w 0 99 > f; split -n l/3/10 f
20
21
22
23
24
25
26
27
28
29
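One way to picture the selection of the `k`th chunk is sketched below. The function `kth_line_chunk` is hypothetical, and the rule it uses (a line belongs to the chunk in which its first byte falls, with the last chunk absorbing any remainder) is an assumption that happens to reproduce the example above; the actual implementation may treat boundary lines differently.

```rust
/// Return the lines of chunk `k` (1-based) out of `n` chunks.
/// Assumption: a line belongs to the chunk containing its first byte,
/// chunks are `len / n` bytes wide, and the last chunk takes the rest.
fn kth_line_chunk(input: &str, k: usize, n: usize) -> String {
    assert!(n > 0 && (1..=n).contains(&k));
    let chunk_size = input.len() / n;
    let start = (k - 1) * chunk_size;
    let end = if k == n { input.len() } else { k * chunk_size };
    let mut out = String::new();
    let mut offset = 0;
    for line in input.split_inclusive('\n') {
        if offset >= start && offset < end {
            out.push_str(line);
        }
        offset += line.len();
    }
    out
}
```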
Add the `-e` flag, which indicates whether to elide (that is, remove)
empty files that would have been created by the `-n` option.
The `-n` command-line argument gives a specific number of chunks into
which the input files will be split. If the number of chunks is
greater than the number of bytes, then empty files will be created for
the excess chunks. But if `-e` is given, then empty files will not be
created.
For example, contrast
$ printf 'a\n' > f && split -e -n 3 f && cat xaa xab xac
a
cat: xac: No such file or directory
with
$ printf 'a\n' > f && split -n 3 f && cat xaa xab xac
a
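The effect of `-e` can be sketched as a filter over the planned chunk sizes. The helper `chunk_sizes` is hypothetical, and the way the remainder is spread over the first chunks is an assumption chosen to match the two-byte example above, not a description of the real code.

```rust
/// Plan the size of each chunk when splitting `num_bytes` into
/// `num_chunks` chunks. With `elide_empty`, zero-sized chunks are
/// dropped, so no file would be created for them.
/// Assumption: the remainder goes to the first chunks, one byte each.
fn chunk_sizes(num_bytes: u64, num_chunks: u64, elide_empty: bool) -> Vec<u64> {
    assert!(num_chunks > 0);
    let base = num_bytes / num_chunks;
    let extra = num_bytes % num_chunks;
    (0..num_chunks)
        .map(|i| base + u64::from(i < extra))
        .filter(|&size| !(elide_empty && size == 0))
        .collect()
}
```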
Clean up unit tests in the `dd` crate to make them easier to
manage. This commit does a few things:
* move test cases that test the complete functionality of the `dd`
program from the `dd_unit_tests` module up to the
`tests/by-util/test_dd.rs` module so that they can take advantage of
the testing framework and common testing tools provided by uutils,
* move test cases that test internal functions of the `dd`
implementation into the `tests` module within `dd.rs` so that they
live closer to the code they are testing,
* replace test cases defined by macros with test cases defined by
plain old functions to make the test cases easier to read at a
glance.
Add support for the `-x` command-line option to `split`. This option
causes `split` to produce filenames with hexadecimal suffixes instead
of the default alphabetic suffixes.
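As a rough illustration of what a hexadecimal suffix looks like, the sketch below formats a chunk index as zero-padded lowercase hex. The helper `hex_name` and the fixed `width` parameter are assumptions for the example only; it does not model how `split` chooses or grows the suffix length.

```rust
/// Build a chunk filename with a zero-padded hexadecimal suffix.
/// Assumption: `width` is fixed by the caller; an index that needs
/// more digits simply produces a longer suffix here.
fn hex_name(prefix: &str, index: usize, width: usize) -> String {
    format!("{}{:0width$x}", prefix, index, width = width)
}
```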
Replace `ByteSplitter` and `LineSplitter` with `ByteChunkWriter` and
`LineChunkWriter` respectively. This results in a more maintainable
design and an increase in the speed of splitting by lines.
Correct the `test_split::test_suffixes_exhausted` test case so that it
actually exercises the intended behavior of `split`. Previously, the
test fixture contained 26 bytes. After this commit, the test fixture
contains 27 bytes. When using a suffix width of one, only 26 filenames
should be available when naming chunk files---one for each lowercase
ASCII letter. This commit ensures that the filenames will be exhausted
as intended by the test.
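The arithmetic behind the fixture change can be sketched like this: with a suffix width of one there are exactly 26 alphabetic suffixes, so a 27th chunk has no name. The function `alphabetic_suffix` is a hypothetical stand-in, not the code under test.

```rust
/// Map a chunk index to a fixed-width lowercase alphabetic suffix,
/// or `None` when the index does not fit in `width` letters.
fn alphabetic_suffix(mut index: usize, width: usize) -> Option<String> {
    let mut bytes = vec![b'a'; width];
    // Fill the letters from the least significant position.
    for slot in bytes.iter_mut().rev() {
        *slot = b'a' + (index % 26) as u8;
        index /= 26;
    }
    // Anything left over means the suffixes are exhausted.
    if index > 0 {
        None
    } else {
        String::from_utf8(bytes).ok()
    }
}
```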
When this option is present, the files argument is not processed.
Instead, the list of files is read from the provided file, with names
separated by the ASCII NUL (\0) character. When files0-from is '-',
the file list is read from stdin.
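A minimal sketch of reading a NUL-separated file list is shown below. The helper `parse_files0` is hypothetical; silently skipping empty names and decoding names lossily as UTF-8 are simplifying assumptions, not the behavior of the real option.

```rust
/// Split the contents of a files0-from source into file names.
/// Assumptions: empty names are skipped and names are decoded
/// lossily; a real implementation would likely keep raw bytes.
fn parse_files0(data: &[u8]) -> Vec<String> {
    data.split(|&b| b == 0)
        .filter(|name| !name.is_empty())
        .map(|name| String::from_utf8_lossy(name).into_owned())
        .collect()
}
```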
This allows for `-t` to take invalid Unicode (but still single-byte) values
on Unix-like platforms. Other platforms, which as of the time of this commit
do not support `OsStr::as_bytes()`, could possibly be supported in the future,
but would require design decisions as to what that means.
Fix two issues with the filename creation algorithm. First, this
corrects the behavior of the `-a` option. This commit ensures a
failure occurs when the number of chunks exceeds the number of
filenames representable with the specified fixed width:
$ printf "%0.sa" {1..11} | split -d -b 1 -a 1
split: output file suffixes exhausted
Second, this corrects the default behavior when `-a` is not specified
on the command line. Previously, the filenames always had suffixes of
length 2. This commit corrects the behavior to follow the algorithm
implied by GNU split, where the filename lengths grow dynamically by
two characters once the number of chunks grows sufficiently large:
$ printf "%0.sa" {1..91} | ./target/debug/coreutils split -d -b 1 \
> && ls x* | tail
x81
x82
x83
x84
x85
x86
x87
x88
x89
x9000
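The growth pattern visible in the listing above (`x89` followed by `x9000`) can be sketched for decimal suffixes as follows. The function `dynamic_decimal_suffix` is hypothetical and the rule is inferred from that output only: each time a block of names runs out, the suffix gains a leading `9` and one more digit.

```rust
/// Decimal suffix for chunk `i` when the width grows dynamically.
/// Inferred pattern (an assumption): 00..89, then 9000..9899,
/// then 990000..998999, and so on.
fn dynamic_decimal_suffix(mut i: u64) -> String {
    let mut group = 0usize; // number of leading '9' characters
    let mut capacity = 90u64; // names available in the current group
    while i >= capacity {
        i -= capacity;
        group += 1;
        capacity *= 10;
    }
    format!("{}{:0width$}", "9".repeat(group), i, width = group + 2)
}
```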
* hashsum: support --check for var. length outputs
Add the ability for `hashsum --check` to work with algorithms with
variable output length. Previously, the program would terminate with an
error due to constructing an invalid regular expression.
* fixup! hashsum: support --check for var. length outputs
* tac: correct behavior of -b option
Correct the behavior of `tac -b` to match that of GNU coreutils
`tac`. Specifically, this changes `tac -b` to assume *leading* line
separators instead of the default *trailing* line separators.
Before this commit, the (incorrect) behavior was
$ printf "/abc/def" | tac -b -s "/"
def/abc/
After this commit, the behavior is
$ printf "/abc/def" | tac -b -s "/"
/def/abc
Fixes #2262.
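The leading-separator interpretation can be sketched as below. The function `tac_before` is a hypothetical illustration that works on an in-memory string with a non-empty literal separator; it is not the code from this commit.

```rust
/// Reverse records, treating each separator as the *start* of a record.
/// Assumption: `sep` is a non-empty literal string, not a regex.
fn tac_before(input: &str, sep: &str) -> String {
    // Byte offsets at which a record (separator first) begins.
    let starts: Vec<usize> = input.match_indices(sep).map(|(i, _)| i).collect();
    let mut out = String::with_capacity(input.len());
    let mut end = input.len();
    for &start in starts.iter().rev() {
        out.push_str(&input[start..end]);
        end = start;
    }
    // Any text before the first separator is emitted last.
    out.push_str(&input[..end]);
    out
}
```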
* fixup! tac: correct behavior of -b option
* fixup! tac: correct behavior of -b option
Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
- Adds words to cspell exceptions
- Converts test macros to use Default trait.
- Converts parser to use Default trait.
- Adds Windows-friendly test files for block/unblock when nl is present
in test/spec file.
This reimplements version_cmp, which is used in sort and ls to sort
according to versions.
However, it is not bug-for-bug identical with GNU's implementation.
I reported a bug with GNU here:
https://lists.gnu.org/archive/html/bug-coreutils/2021-06/msg00045.html
This implementation does not contain the bugs regarding the handling of
file extensions and null bytes.
We were reporting "no match" when sorting something like "0 ". This is
because we don't distinguish between 0 and invalid lines when sorting.
For debug output we have to get this information back.
Closes #2254. We should only inherit global settings for keys when there
are absolutely no options attached to the key.
The default key (matching the whole line) is implicitly added only if no
keys are supplied.
Improved some error messages by including more context.
* sort: disable support for thousand separators
In order to be compatible with GNU, we have to disable thousands
separators. GNU does not enable them for the C locale, either.
Once we add support for locales we can add this feature back.
* sort: delete unused fixtures
* sort: compare -0 and 0 equal
I must have misunderstood this when implementing, but GNU considers
-0, 0, and invalid numbers to be equal.
* sort: strip blanks before applying the char index
* sort: don't crash when key start is after key end
* sort: add "no match" for months at the first non-whitespace char
We should put the "^ no match for key" indicator at the first
non-whitespace character of a field.
* sort: improve support for e notation
* sort: use matches! macros
Fix a bug in which `head` failed to print headings for `stdin` inputs
when reading from multiple files, and fix another bug in which `head`
failed to print a blank line between the contents of a file and the
heading for the next file when reading multiple files. The output now
matches that of GNU `head`.
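The intended layout can be sketched like this. The function `render_with_headers` is hypothetical and takes already-truncated contents; showing `-` as `standard input` follows GNU `head`, while the rest is a simplification.

```rust
/// Join several inputs the way a multi-file `head` run prints them:
/// a heading per input and a blank line before every heading but the first.
/// Assumption: the name "-" stands for stdin.
fn render_with_headers(inputs: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (i, (name, contents)) in inputs.iter().enumerate() {
        if i > 0 {
            out.push('\n');
        }
        let shown = if *name == "-" { "standard input" } else { *name };
        out.push_str(&format!("==> {} <==\n", shown));
        out.push_str(contents);
    }
    out
}
```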
When merging files with -m, we need to prioritize files that occur
earlier in the command line arguments.
This also makes the extsort merge step (and thus extsort itself) stable again.
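A minimal sketch of a stable k-way merge is shown below: ties on the sort key are broken by the index of the input, so records from earlier inputs come out first. The function `merge_stable` and the `(key, payload)` record shape are assumptions for illustration, not the structure used in `sort`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs; on equal keys, earlier inputs win.
/// Records are hypothetical `(key, payload)` pairs.
fn merge_stable<'a>(files: &[Vec<(u32, &'a str)>]) -> Vec<&'a str> {
    // Min-heap ordered by (key, file index, position within the file).
    let mut heap = BinaryHeap::new();
    for (file_index, records) in files.iter().enumerate() {
        if let Some(&(key, _)) = records.first() {
            heap.push(Reverse((key, file_index, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((_, file_index, pos))) = heap.pop() {
        merged.push(files[file_index][pos].1);
        if let Some(&(key, _)) = files[file_index].get(pos + 1) {
            heap.push(Reverse((key, file_index, pos + 1)));
        }
    }
    merged
}
```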
Moved argument parsing to clap, added tests to cover using "-" as
stdin and passing in too many file arguments, and updated the "wrap"
error message in the tests.
This adds a --debug flag, which, when activated, will draw lines below
the characters that are actually used for comparisons.
This is not a complete implementation of --debug. According to the man
page for GNU sort, it should "annotate the part of the line used to
sort, and warn about questionable usage to stderr". Warning about
"questionable usage" is not part of this patch.
This change required some adjustments to be able to get the range that
is actually used for comparisons. Most notably, general numeric comparisons
were rewritten, fixing some bugs along the way.
Testing is mostly done by adding fixtures for the expected debug output of
existing tests.
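The annotation idea can be sketched as follows: given a line and the range of columns used for the comparison, emit the line followed by a row of underscores beneath that range. The function `debug_annotation` is hypothetical, assumes one column per character, and only approximates the real output format.

```rust
use std::ops::Range;

/// Render a line plus a marker row under the columns in `range`.
/// Assumption: columns equal character positions (ASCII input).
/// An empty range is reported with the "no match" indicator.
fn debug_annotation(line: &str, range: Range<usize>) -> String {
    let indent = " ".repeat(range.start);
    let marker = if range.is_empty() {
        "^ no match for key".to_string()
    } else {
        "_".repeat(range.len())
    };
    format!("{}\n{}{}\n", line, indent, marker)
}
```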
* sort: implement numeric string comparison
This implements -n and -h using a string comparison algorithm instead
of parsing each number to an f64 and comparing those.
This should result in a moderate performance increase and eliminate loss
of precision.
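The core idea of comparing numbers as strings can be sketched for the simplest case. The function `cmp_unsigned_decimal` is hypothetical and assumes both inputs consist only of ASCII digits; signs, decimal points and unit suffixes, which the real comparison has to handle, are left out.

```rust
use std::cmp::Ordering;

/// Compare two non-negative decimal integers given as digit strings,
/// without converting them to a numeric type.
/// Assumption: inputs contain only ASCII digits.
fn cmp_unsigned_decimal(a: &str, b: &str) -> Ordering {
    let a = a.trim_start_matches('0');
    let b = b.trim_start_matches('0');
    // More digits means a larger number; equal lengths compare digit by digit.
    a.len().cmp(&b.len()).then_with(|| a.cmp(b))
}
```

Because no value is ever parsed, numbers with more digits than an f64 can represent exactly still compare correctly.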
* cache parsed f64 numbers
For general numeric comparisons we have to parse numbers as f64,
as this behavior is explicitly documented by GNU coreutils.
We can however cache the parsed value to speed up comparisons.
* fix leading zeroes for negative numbers
* use more appropriate name for exponent
* improvements to the parse function
* move checks into main loop and fix thousands separator condition
* remove unneeded checks
* rustfmt
* Various fixes and performance improvements
* fix a typo
Co-authored-by: Michael Debertol <michael.debertol@gmail.com>
* Fix month parse for months with leading whitespace
* Implement test for months whitespace fix
* Confirm human numeric works as expected with whitespace with a test
* Correct arg help value name for --parallel
* Fix SemVer non version lines/empty line sorting with a test
Co-authored-by: Sylvestre Ledru <sledru@mozilla.com>
Co-authored-by: Michael Debertol <michael.debertol@gmail.com>
* cat: Unrevert splice patch
* cat: Add fifo test
* cat: Add tests for error cases
* cat: Add tests for character devices
* wc: Make sure we handle short splice writes
* cat: Fix tests for 1.40.0 compiler
* cat: Run rustfmt on test_cat.rs
* Run 'cargo +1.40.0 update'
* Various fixes and performance improvements
* fix a typo
Co-authored-by: Michael Debertol <michael.debertol@gmail.com>
Co-authored-by: Sylvestre Ledru <sledru@mozilla.com>
Treat tab chars as advancing to the next tab stop rather than having a fixed
8-column width.
Also treat tab as a whitespace split target only when splitting on word
boundaries.
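The tab-stop rule can be sketched as a column counter. The function `display_width` and the constant `TAB_WIDTH` are hypothetical; tab stops every 8 columns and one column per non-tab character are assumptions made for the example.

```rust
/// Assumed distance between tab stops.
const TAB_WIDTH: usize = 8;

/// Column reached after printing `s`, with tabs advancing to the
/// next tab stop instead of always adding a fixed width.
/// Assumption: every other character occupies exactly one column.
fn display_width(s: &str) -> usize {
    let mut col = 0;
    for c in s.chars() {
        if c == '\t' {
            col += TAB_WIDTH - col % TAB_WIDTH;
        } else {
            col += 1;
        }
    }
    col
}
```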