The previous encoding handling was unnecessarily complex. This commit removes the enum that specifies the handling and instead has two separate methods to collect the strings either with lossy conversion or by ignoring invalidly encoded strings.
Outside of tests, only `accept_any` was used, meaning that this unnecessarily complicated the code. The behaviour of `accept_any` is now the default (and only) option.
* Fix a bug in split where chunking would be skipped when the chunk size
happened to be an exact divisor of the buffer size used to read the
input stream.
The issue here was that the file was being split byte-wise in chunks of 1G.
The input stream was being read in chunks of 8KB, which evenly divides
the chunk size. Because the check to allocate the next output chunk was
done at the bottom of the loop previously, it would never occur because
the current input chunk was fully consumed at that point. By moving the
check to the top of the loop (but still late enough that we know we have
bytes to write) we resolve this issue.
This scenario is unfortunately hard to write a test for, since we don't
explicitly control the input chunk size.
Fixes https://github.com/uutils/coreutils/issues/3790
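The reordering described above can be sketched as follows. This is a minimal in-memory illustration, not the actual `split` code: `split_bytes`, `read_size`, and `chunk_size` are hypothetical names, and slices stand in for the input stream and output files. The point is that the "open the next output chunk" check sits at the top of the inner loop, where we already know there are bytes left to write.

```rust
/// Split `input` into chunks of at most `chunk_size` bytes while consuming
/// it `read_size` bytes at a time (a stand-in for the fixed-size reads).
/// All names here are hypothetical; this is not the uutils implementation.
fn split_bytes(input: &[u8], read_size: usize, chunk_size: usize) -> Vec<Vec<u8>> {
    assert!(read_size > 0 && chunk_size > 0);
    let mut chunks: Vec<Vec<u8>> = Vec::new();
    for buf in input.chunks(read_size) {
        let mut rest = buf;
        while !rest.is_empty() {
            // Check at the top of the loop: there are bytes to write, so
            // start a new output chunk if none exists or the current one
            // is full. This also fires when the previous read ended
            // exactly on a chunk boundary.
            let need_new = chunks.last().map_or(true, |c| c.len() == chunk_size);
            if need_new {
                chunks.push(Vec::new());
            }
            let current = chunks.last_mut().unwrap();
            let room = chunk_size - current.len();
            let n = room.min(rest.len());
            current.extend_from_slice(&rest[..n]);
            rest = &rest[n..];
        }
    }
    chunks
}
```

With `read_size = 8` and `chunk_size = 16` (the read size evenly divides the chunk size), 32 input bytes still yield two chunks of 16 bytes each.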
* tail: fix race condition (fix #3765)
There exists a race condition (RC) that can occur if changes to a path
happen after the initial print loop in `uu_tail()`, but before the
path is added to the notify-Watcher thread in `follow()`.
To minimize the window where the RC can occur, this moves starting the
Watcher thread and adding paths to it from `follow()` to the initial
print loop in `uu_tail()`.
Additionally, to make sure the RC cannot happen in
"gnu/tests/tail-2/F-headers.sh", the error message that is used as a trigger
in this test, is delayed until the path is added to the Watcher thread.
* build-gnu: remove workarounds for tail
Remove workarounds for "tests/tail-2/F-headers.sh" which are
(presumably) no longer needed because of the race condition fix.
* tail: refactor to minimize chances of RC
Move "adding paths to Watcher thread" to its own loop and run this loop
before the initial tail-print-loop in order to minimize the window for
race conditions.
* cp: Refactor `reflink`/`sparse` handling to enable `--sparse` flag
`--sparse` and `--reflink` options have a lot of similarities:
- They have similar options (`always`, `never`, `auto`)
- Both need OS specific handling
- They can be mutually exclusive
Prior to this change, `sparse` was defined as `CopyMode`, but `reflink`
wasn't. Given the similarities, it makes sense to handle them similarly.
The idea behind this change is to move all OS-specific file copy
handling into the `copy_on_write_*` functions. Those functions then
dispatch to the correct logic depending on the arguments (at the moment,
the tuple `(reflink, sparse)`).
Also, move the handling of `--reflink=never` from `copy_file` to the
`copy_on_write_*` functions, at the cost of a bit of code duplication,
to allow `copy_on_write_*` to handle all cases (and later handle
`--reflink=never` with `--sparse`).
* cp: Implement `--sparse` flag
This begins to address #3362
At the moment, only the `--sparse=always` logic matches the requirement
from the GNU cp info page, i.e. always make holes in the destination when
possible.
Sparse copy is done by copying the source to the destination block by
block (blocks being of the destination's fs block size). If the block
only holds NUL bytes, we don't write to the destination.
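The block-by-block idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the code from this commit: `sparse_copy` and `read_block` are hypothetical names, the block size is passed in rather than queried from the destination file system, and a generic `Write + Seek` destination stands in for the real file.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Fill `buf` as far as possible; returns the number of bytes read
/// (less than `buf.len()` only at end of input).
fn read_block<R: Read>(src: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match src.read(&mut buf[filled..]) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Copy `src` to `dst` block by block. Blocks holding only NUL bytes are
/// skipped by seeking forward instead of writing.
/// Returns (total bytes processed, bytes skipped as holes).
fn sparse_copy<R: Read, W: Write + Seek>(
    src: &mut R,
    dst: &mut W,
    block_size: usize,
) -> io::Result<(u64, u64)> {
    assert!(block_size > 0);
    let mut buf = vec![0u8; block_size];
    let (mut total, mut skipped) = (0u64, 0u64);
    loop {
        let n = read_block(src, &mut buf)?;
        if n == 0 {
            break;
        }
        if buf[..n].iter().all(|&b| b == 0) {
            // All-NUL block: leave a hole by moving the write position.
            dst.seek(SeekFrom::Current(n as i64))?;
            skipped += n as u64;
        } else {
            dst.write_all(&buf[..n])?;
        }
        total += n as u64;
    }
    Ok((total, skipped))
}
```

One caveat of this sketch: if the source ends in an all-NUL block, seeking alone does not extend the destination, so a caller working with a real file would still need to set the final length (for example with `File::set_len`).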
About `--sparse=auto`: according to GNU cp info page, the destination
file will be made sparse if the source file is sparse as well. The next
step is likely to use `lseek` with `SEEK_HOLE` to detect whether the source
file has holes. Currently, this has the same behaviour as
`--sparse=never`. This `SEEK_HOLE` logic can also be applied to
`--sparse=always` to improve performance when copying sparse files.
About `--sparse=never`: from my understanding, it is not guaranteed that
Rust's `fs::copy` will always produce a file with no holes, as
["platform-specific behavior may change in the
future"](https://doc.rust-lang.org/std/fs/fn.copy.html#platform-specific-behavior)
About other platforms:
- `macos`: The solution may be to use `fcntl` command `F_PUNCHHOLE`.
- `windows`: I only see `FSCTL_SET_SPARSE`.
This should pass the following GNU tests:
- `tests/cp/sparse.sh`
- `tests/cp/sparse-2.sh`
- `tests/cp/sparse-extents.sh`
- `tests/cp/sparse-extents-2.sh`
`sparse-perf.sh` needs `--sparse=auto`, and in particular a way to skip
holes in the source file.
Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
* ls: Implement --zero flag. (#2929)
This flag can be used to provide easily machine-parseable output from
ls, as discussed in the GNU bug report
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49716.
There are some peculiarities with this flag:
- The current GNU ls implementation of the `--zero` flag implies some
other flags. Those can be overridden by setting those flags after
`--zero` in the command line.
- This flag is not compatible with `--dired`. This patch is not 100%
compliant with GNU ls: GNU ls `--zero` will fail if `--dired` and
`-l` are set, while with this patch only `--dired` is needed for the
command to fail.
We also add `--dired` flag to the parser, with no additional behaviour
change.
Testing done:
```
$ bash util/build-gnu.sh
[...]
$ bash util/run-gnu-test.sh tests/ls/zero-option.sh
[...]
PASS: tests/ls/zero-option.sh
============================================================================
Testsuite summary for GNU coreutils 9.1.36-8ec11
============================================================================
# TOTAL: 1
# PASS: 1
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
```
* Use the US way to spell Behavior
* Fix formatting with cargo fmt -- tests/by-util/test_ls.rs
* Simplify --zero flag overriding logic by using index_of
Also, allow multiple --zero flags, as this is possible with GNU ls
command. Only the last one is taken into account.
Co-authored-by: Sylvestre Ledru <sledru@mozilla.com>
* Speed up sum by using reasonable read buffer sizes.
Use a 4K read buffer for each of the checksum functions, which seems
reasonable. This improves the performance of BSD checksums on
odyssey1024.txt from 399ms to 325ms on my laptop, and of SysV
checksums from 242ms to 67ms.
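As an illustration of the buffered approach, here is a minimal sketch of the BSD checksum computed over a 4K read buffer. The function name `bsd_sum` and its return shape are assumptions for this example; the real utility reports a block count rather than a byte count.

```rust
use std::io::{self, Read};

/// Compute the 16-bit BSD checksum of `reader`, reading through a 4K
/// buffer instead of byte by byte. Returns (checksum, bytes read).
/// Hypothetical helper, not the uutils function.
fn bsd_sum<R: Read>(mut reader: R) -> io::Result<(u16, usize)> {
    let mut buf = [0u8; 4096];
    let mut checksum: u16 = 0;
    let mut total = 0usize;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        for &byte in &buf[..n] {
            // Rotate right by one bit, then add the byte modulo 2^16.
            checksum = checksum.rotate_right(1).wrapping_add(u16::from(byte));
        }
        total += n;
    }
    Ok((checksum, total))
}
```

Any `Read` implementation works as input, for example `bsd_sum(&b"ab"[..])` or an opened `File`.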
* Add BENCHMARKING.md for `sum`.
* Add comment regarding block sizes.
* Improve portability of BENCHMARKING.md
* Make `div_ceil` const and enhance comment.
Byte, character, and line counting can all be done on the raw bytes
of the incoming stream without decoding the Unicode characters. This
fact was previously exploited in specific fast paths for counting
characters and counting lines. This change unifies those fast paths into
a single shared fast path, using const generics to specialize the
function for each use case. This has the benefit of making sure that all
combinations of these Unicode-oblivious fast paths benefit from the same
optimization.
On my laptop, this speeds up `wc -clm odyssey1024.txt` from 840ms to
120ms. I experimented with using a filter loop for line counting, but
continuing to use the bytecount crate came out ahead by a significant
margin.
When wc is invoked with only the -m flag, we only need to count the
number of Unicode characters in the input. In order to do so, we don't
actually need to decode the input bytes into characters. Rather, we can
simply count the number of non-continuation bytes in the UTF-8 stream,
since every character will contain exactly one non-continuation byte.
On my laptop, this speeds up `wc -m odyssey1024.txt` from 745ms to
109ms.
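A minimal sketch of the counting trick, assuming well-formed UTF-8 input (the function name `count_chars` is hypothetical): continuation bytes have the bit pattern `10xxxxxx`, so every other byte starts a character.

```rust
/// Count Unicode characters in a UTF-8 byte stream without decoding it,
/// by counting the bytes that are not continuation bytes (0b10xxxxxx).
fn count_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

For valid UTF-8 this agrees with `str::chars().count()`, but it only inspects one byte at a time and never builds a `char`.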
* wc: specialize scanning loop on settings.
The primary computational loop in wc (iterating over all the
characters and computing word lengths, etc) is configured by a
number of boolean options that control the text-scanning behavior.
If we monomorphize the loop for each possible combination of
scanning configurations, rustc is able to generate better code
for each instantiation, at the least by removing the conditional
checks on each iteration, and possibly by allowing things like
vectorization.
On my computer (aarch64/macos), I am seeing at least a 5% performance
improvement in release builds on all wc flag configurations
(other than those that were already specialized) against
odyssey1024.txt, with wc -l showing the greatest improvement at 15%.
* Reduce the size of the wc dispatch table by half.
By extracting the handling of hand-written fast-paths to the
same dispatch as the automatic specializations, we can avoid
needing to pass `show_bytes` as a const generic to
`word_count_from_reader_specialized`. Eliminating this parameter
halves the number of arms in the dispatch.
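The specialization and its dispatch can be sketched as below. This is an illustration under assumptions, not the wc code: `scan` and `scan_dispatch` are hypothetical names and only two boolean settings are modelled. Each runtime flag becomes a const generic parameter, and a `match` selects the monomorphized instance.

```rust
/// Scanning loop specialized at compile time on two settings.
/// Inside each instantiation the `if COUNT_*` checks are constants,
/// so the compiler can drop the untaken branches.
fn scan<const COUNT_LINES: bool, const COUNT_WORDS: bool>(text: &str) -> (usize, usize) {
    let mut lines = 0usize;
    let mut words = 0usize;
    let mut in_word = false;
    for ch in text.chars() {
        if COUNT_LINES && ch == '\n' {
            lines += 1;
        }
        if COUNT_WORDS {
            if ch.is_whitespace() {
                in_word = false;
            } else if !in_word {
                in_word = true;
                words += 1;
            }
        }
    }
    (lines, words)
}

/// Dispatch table from runtime flags to the specialized instances.
fn scan_dispatch(count_lines: bool, count_words: bool, text: &str) -> (usize, usize) {
    match (count_lines, count_words) {
        (false, false) => scan::<false, false>(text),
        (false, true) => scan::<false, true>(text),
        (true, false) => scan::<true, false>(text),
        (true, true) => scan::<true, true>(text),
    }
}
```

The table has one arm per combination, so every additional boolean parameter doubles its size; conversely, removing one parameter from the specialized function halves the number of arms.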
* ls: handle looping symlinks infinite printing
* ls: better coloring and printing symlinks when dereferenced
* tests/ls: add dereferencing and symlink loop tests
* ls: reformat changed using rustfmt
* ls: follow clippy advice for cleaner code
* uucore/fs: fix FileInformation to open directory handles in Windows as
well
tee is supposed to exit when there is nothing left to write to. For
finite inputs, it can be hard to determine whether this functions
correctly, but for tee of infinite streams, it is very important to
exit when there is nothing more to write to.
This is part of fixing the tee tests. 'yes' is used by the GNU test
suite to identify what the SIGPIPE exit code is on the target
platform. By trapping SIGPIPE, it creates a requirement that other
utilities also trap SIGPIPE (and exit 0 after SIGPIPE). This is
sometimes at odds with their desired behaviour.
When the monitored process exits, the GNU version of timeout will
preserve its exit status, including the signal state.
This is a partial fix for timeout to enable the tee tests to pass. It
removes the default Rust trap for SIGPIPE, and kills itself with the
same signal that its child exited with to preserve the signal state.
This has the following behaviours. On Unix:
- The default is to exit on pipe errors, and warn on other errors.
- "--output-error=warn" means to warn on all errors
- "--output-error", "--output-error=warn-nopipe" and "-p" all mean
that pipe errors are suppressed, all other errors warn.
- "--output-error=exit" means to warn and exit on all errors.
- "--output-error=exit-nopipe" means to suppress pipe errors, and to
warn and exit on all other errors.
On non-Unix platforms, all pipe behaviours are ignored, so the default
is effectively "--output-error=warn" and "warn-nopipe" is identical.
The only meaningful option is "--output-error=exit" which is identical
to "--output-error=exit-nopipe" on these platforms.
Note that warnings give a non-zero exit code, but do not halt writing
to non-erroring targets.
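The Unix behaviours listed above can be summarized as a small decision table. The sketch below is only a restatement of that list: the type and function names are hypothetical, and whether the default mode prints a message before exiting on a pipe error is not stated here, so it is modelled as a plain `Exit`.

```rust
/// Hypothetical model of the `--output-error` settings on Unix.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OutputErrorMode {
    Default,    // no --output-error given
    Warn,       // --output-error=warn
    WarnNoPipe, // --output-error, --output-error=warn-nopipe, -p
    Exit,       // --output-error=exit
    ExitNoPipe, // --output-error=exit-nopipe
}

/// What to do when writing to one of the outputs fails.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Action {
    Suppress,
    Warn,
    Exit,
    WarnAndExit,
}

fn on_write_error(mode: OutputErrorMode, is_pipe_error: bool) -> Action {
    match (mode, is_pipe_error) {
        (OutputErrorMode::Default, true) => Action::Exit,
        (OutputErrorMode::Default, false) => Action::Warn,
        (OutputErrorMode::Warn, _) => Action::Warn,
        (OutputErrorMode::WarnNoPipe, true) => Action::Suppress,
        (OutputErrorMode::WarnNoPipe, false) => Action::Warn,
        (OutputErrorMode::Exit, _) => Action::WarnAndExit,
        (OutputErrorMode::ExitNoPipe, true) => Action::Suppress,
        (OutputErrorMode::ExitNoPipe, false) => Action::WarnAndExit,
    }
}
```

On non-Unix platforms the `is_pipe_error` distinction would effectively always be `false`, which is why `warn` and `warn-nopipe` (and likewise `exit` and `exit-nopipe`) coincide there.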
Add a `uucore::fs::is_symlink()` function that takes in a
`std::path::Path` and decides whether the given path is a symbolic
link. This is essentially a backport of the `Path::is_symlink()`
function that appears in Rust version 1.58.0. This commit also
replaces some now-duplicate code in `chmod`, `cp`, `ln`, and `rmdir`
that checks whether a path is a symbolic link with a call to
`is_symlink()`.
Technically, this commit slightly changes the behavior of
`cp`. Previously, there was a line of code like this:
    if fs::symlink_metadata(&source)?.file_type().is_symlink() {
where the `?` operator propagates an error from `symlink_metadata()`
to the caller. Now the line of code is:
    if is_symlink(source) {
in which any error from `symlink_metadata()` has been converted to
just be a `false` value. I believe this is a satisfactory tradeoff to
make, since an error in accessing the file will likely cause an error
later in the same code path.
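For reference, a helper with this behaviour can be written in a few lines. This is a sketch of what such a function might look like, not necessarily the exact body in `uucore`: any error from `symlink_metadata()` is mapped to `false`, which is the tradeoff discussed above.

```rust
use std::fs;
use std::path::Path;

/// Return true if `path` itself is a symbolic link. Errors (for example a
/// missing file or a permission problem) are reported as `false`.
fn is_symlink<P: AsRef<Path>>(path: P) -> bool {
    fs::symlink_metadata(path).map_or(false, |m| m.file_type().is_symlink())
}
```

Because `symlink_metadata()` does not follow links, a dangling symbolic link still yields `true`.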
Fix a bug in which `cp` incorrectly exited with an error when
attempting to copy the attributes of a dangling symbolic link (that
is, when running `cp -P -p`).
Fixes #3531.
Refactor common code used in several places into a convenience
function `is_symlink()` that behaves like `Path::is_symlink()` added
in Rust 1.58.0. (We support earlier versions of Rust so we cannot use
the standard library version of this function.)
This change extracts a utility already present in ls into uucore.
This utility is used by dir and vdir too, which are adjusted to
look it up in uucore. No further changes to ls, dir or vdir are intended.
The change here largely adjusts the output of uu_wc to match
that of GNU wc. It goes far enough to make the unit tests
pass, but some differences remain. One specific
difference I did not tackle is that GNU wc will not align the
output columns (compute_number_width() -> 1) in the specific case
of the input for --files0-from=- being a named pipe, not real stdin.
This difference can be triggered using the following two invocations.
- wc --files0-from=- < files0 # use a named pipe, GNU does align
- cat files0- | wc --files0-from=- # use real stdin, GNU does not
align.
* tail: reduce CPU load for polling
This reduces the CPU load for polling drastically (from ~80% down to ~5%)
by removing/fixing several previous workarounds related to polling,
while still passing all related GNU test-suite checks.
* set Notify::PollWatcher delay to: sleep_sec/10 instead of
sleep_sec/100
* set recv_timeout to sleep_sec instead of sleep_sec/100
* remove the manual polling of watched files
Bugs:
* fix an issue with headers to consistently pass
"test_follow_name_retry_headers" and "gnu/tests/tail-2/overlay-headers.sh"
Code clean-up and refactor
* make fields of struct FileHandling private (and add getters/setters)
to ensure that the paths are absolute and match the paths returned by
Notify::Events
* replace calls to "crash!" with "return USimpleError"
* clean-up formatting
When `--backup` is supplied, `cp` will take a backup of *destination* before *source* is copied. When `--backup=simple` is supplied, it is possible for the backup path for *destination* to equal the path for *source*, destroying source before the copy is made. This change prevents this by returning an error instead.
This fixes https://github.com/uutils/coreutils/issues/3629
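The check can be sketched as follows. The names `simple_backup_path` and `backup_would_clobber_source` are hypothetical, and the comparison is a plain path comparison; the real code may resolve paths differently. A simple backup appends a suffix (`~` by default in GNU cp) to the destination name.

```rust
use std::path::{Path, PathBuf};

/// Path of the simple backup for `dest`: the destination name with
/// `suffix` appended. Hypothetical helper for illustration.
fn simple_backup_path(dest: &Path, suffix: &str) -> PathBuf {
    let mut name = dest.as_os_str().to_os_string();
    name.push(suffix);
    PathBuf::from(name)
}

/// True if backing up `dest` would overwrite `source` before it is copied.
fn backup_would_clobber_source(source: &Path, dest: &Path, suffix: &str) -> bool {
    simple_backup_path(dest, suffix) == source
}
```

For example, copying `a~` to `a` with `--backup=simple` would first move `a` to `a~`, destroying the source, so this case is reported as an error instead.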
Update `dd` to only print a concise form of the number of bytes with
an SI prefix (like "1 MB" or "2 GB") if the number is at least
1000. Similarly, only print the concise form with an IEC prefix (like
"1 MiB" or "2 GiB") if the number is at least 1024. For example,
    $ head -c 999 /dev/zero | dd > /dev/null
    1+1 records in
    1+1 records out
    999 bytes copied, 0.0 s, 999.0 KB/s
    $ head -c 1000 /dev/zero | dd > /dev/null
    1+1 records in
    1+1 records out
    1000 bytes (1000 B) copied, 0.0 s, 1000.0 KB/s
    $ head -c 1024 /dev/zero | dd > /dev/null
    2+0 records in
    2+0 records out
    1024 bytes (1 KB, 1024 B) copied, 0.0 s, 1.0 MB/s
An unnecessary trailing space was being added because we were padding for alignment,
which is not required with -m.
Fixes #3608
Signed-off-by: anastygnome <noreplygitemail@protonmail.com>
Several binaries have been added to `hashsum` that have never been part
of GNU Coreutils:
- `sha3*sum` (uutils/coreutils#869)
- `shake*sum` (uutils/coreutils#987)
- `b3sum` (uutils/coreutils#3108 and uutils/coreutils#3164)
In particular, the `--bits` option, and the `--no-names` option added in
uutils/coreutils#3361, are not valid for any GNU Coreutils `*sum` binary
(as of Coreutils 9.0).
This commit refactors the argument parsing so that `--bits` and
`--no-names` become invalid options for the binaries intended to match
the GNU Coreutils API, instead of being ignored options. It also
refactors the custom binary name handling to distinguish between
binaries intended to match the GNU Coreutils API, and binaries that
don't have that constraint.
Part of uutils/coreutils#2930.
Make `mktemp` exit with an error if the `--suffix` option is the empty
string and the template argument does not end in an "X". Previously,
the program succeeded.
Before this commit,
$ mktemp --suffix= aXXXb
apBEb
After this commit,
$ mktemp --suffix= aXXXb
mktemp: with --suffix, template 'aXXXb' must end in X
This PR reuses the defined --sync flag for df, which was previously a
no-op, assigns its default value and uses the fact that
get_all_filesystems takes CLI options as a parameter to perform a sync
operation before any fs call.
Fixes #3564
Signed-off-by: anastygnome <noreplygitemail@protonmail.com>
Fix a bug in which `mktemp` would replace everything in the template
argument from the first 'X' to the last 'X' with random bytes, instead
of just replacing the last contiguous block of 'X's.
Before this commit,
$ mktemp XXX_XXX
2meCpfM
After this commit,
$ mktemp XXX_XXX
XXX_Rp5
This fixes test cases `suffix2f` and `suffix2d` in
`tests/misc/mktemp.pl` in the GNU coreutils test suite.
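A minimal sketch of locating the last contiguous block of 'X's, assuming a hypothetical helper `split_template` that returns the prefix, the number of 'X's to randomize, and the suffix:

```rust
/// Split a template at its last contiguous run of 'X' characters.
/// Returns (prefix, number of Xs, suffix), or None if there is no 'X'.
/// Hypothetical helper, not the uutils `mktemp` code.
fn split_template(template: &str) -> Option<(&str, usize, &str)> {
    // Index just past the last 'X' in the template.
    let end = template.rfind('X')? + 1;
    // Walk back over the run of 'X's that ends there.
    let start = template[..end].trim_end_matches('X').len();
    Some((&template[..start], end - start, &template[end..]))
}
```

With this split, `XXX_XXX` keeps the prefix `XXX_` and only the trailing three characters are replaced.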
Simplify the logic of computing the file path parameters (the
directory, prefix, suffix, and number of random characters) for the
temporary file created by `mktemp`. This commit adds an `Options`
struct as a layer of indirection between the application logic and
`clap`, and a `Params` struct whose associated function is responsible
for determining the file path parameters from the `Options`. This is
an improvement because the previous code had some logic for
determining file path parameters in one place and some in another
place.
Show the "total" label in the "source" column or in the "target"
column if the "source" column is not visible.
Before this commit,
$ df --total --output=target .
Mounted on
/
-
After this commit,
$ df --total --output=target .
Mounted on
/
total
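The placement rule can be stated as a small function. This is a sketch with hypothetical names (`Column`, `total_label_column`) and only a few columns modelled; it is not the df implementation.

```rust
/// A few of the df output columns, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Column {
    Source,
    Size,
    Target,
}

/// Pick the column that carries the "total" label: "source" if it is
/// visible, otherwise "target" if that is visible, otherwise none.
fn total_label_column(visible: &[Column]) -> Option<Column> {
    if visible.contains(&Column::Source) {
        Some(Column::Source)
    } else if visible.contains(&Column::Target) {
        Some(Column::Target)
    } else {
        None
    }
}
```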
Include the suffix in the error message produced by `mktemp` when
there are too few Xs in the template. Before this commit,
$ mktemp --suffix=X aXX
mktemp: too few X's in template 'aXX'
After this commit,
$ mktemp --suffix=X aXX
mktemp: too few X's in template 'aXXX'
This matches the behavior of GNU `mktemp`.
Correct the error message when the template argument contains a path
separator in its suffix. Before this commit:
$ mktemp aXXX/b
mktemp: too few X's in template 'b'
After this commit:
$ mktemp aXXX/b
mktemp: invalid suffix '/b', contains directory separator
This error message is more appropriate and matches the behavior of GNU
mktemp.
On macOS, path.is_dir() can be false for a directory
if it was passed via a redirect, e.g. `tail < DIR`
* fix some tests for macOS
Cleanup:
* fix clippy/spell-checker
* fix build for windows by refactoring stdin_is_pipe_or_fifo()
Correct the error message produced by `mktemp` when `--tmpdir` is
given and the template is an absolute path:
$ mktemp --tmpdir=a /XXX
mktemp: invalid template, '/XXX'; with --tmpdir, it may not be absolute
This is WIP or even WONT-FIX because there's a workaround in Rust's
stdlib which prevents us from detecting a closed FD.
see also the discussion at:
https://github.com/uutils/coreutils/issues/2873
Update `chown` to allow setting the owner of a file to a numeric user
ID regardless of whether a corresponding username exists on the
system.
For example,
$ touch f && sudo chown 12345 f
succeeds even though there is no named user with ID 12345.
Fixes #3380.
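One way to express this fallback, sketched with a hypothetical `resolve_owner` helper and a caller-supplied `lookup` closure standing in for the system user database:

```rust
/// Resolve an owner spec to a numeric user ID: try it as a user name
/// first, and if no such user exists, accept it as a raw numeric ID.
/// `lookup` is a stand-in for a real user database query.
fn resolve_owner(spec: &str, lookup: impl Fn(&str) -> Option<u32>) -> Option<u32> {
    lookup(spec).or_else(|| spec.parse::<u32>().ok())
}
```

With a lookup that only knows `root`, the spec `12345` still resolves to ID 12345, while a non-numeric unknown name resolves to nothing.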
Correct the error that arises from a path separator in the prefix
portion of a template argument provided to `mktemp`. Before this
commit, the error message was incorrect:
$ mktemp -t a/bXXX
mktemp: failed to create file via template 'a/bXXX': No such file or directory (os error 2) at path "/tmp/a/bege"
After this commit, the error message is correct:
$ mktemp -t a/bXXX
mktemp: invalid template, 'a/bXXX', contains directory separator
The code was failing to check for a path separator in the prefix
portion of the template.
Change the return type of the `parse_template()` helper function in
the `mktemp` program so that it returns `Result<..., MkTempError>`
instead of `UResult<...>`. This separates the lower level helper
function from the higher level `UResult` abstraction and will make it
easier to refactor this code in future commits.
The refactoring consists of three parts:
1) Introduction of SizeFormat & HumanReadable enums
2) Addition of a size_format field to the options struct
3) Movement of header logic from BlockSize to Header
Fixes an issue with cp not copying symlinks in spite of the -P (no-dereference) option.
Fix issue #3364
Performance improvements
Avoid useless read from metadata and reuse previous dest information
Signed-off-by: anastygnome <noreplygit@protonmail.com>
Fix a bug in `mktemp` where it was not respecting the path given by
the positional argument. Previously, it would place the temporary file
whose name is induced by a given template in the `/tmp` directory,
like this:
$ mktemp XXX
/tmp/LJr
$ mktemp d/XXX
/tmp/d/IhS
After this commit, it respects the directory given in the template
argument:
$ mktemp XXX
LJr
$ mktemp d/XXX
d/IhS
Fixes #3440.
* Fix a timing related bug with polling (---disable-inotify) where some
Events weren't delivered fast enough by `Notify::PollWatcher` to pass all
of tests/tail-2/retry.sh and test_tail::{test_retry4, retry7}.
* uu_tail now reverts to polling automatically if inotify backend reports
too many open files (this mimics the behavior of GNU's tail).
This makes uu_tail pass the "gnu/tests/tail-2/inotify-only-regular" test
again by adding support for character devices.
test_tail:
* add test_follow_inotify_only_regular
* add clippy fixes for windows