special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
use nu_test_support::nu;
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn basic_binary_starts_with() {
|
2023-07-17 16:43:51 +00:00
|
|
|
let actual = nu!(r#"
|
special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
"hello world" | into binary | bytes starts-with 0x[68 65 6c 6c 6f]
|
2023-07-17 16:43:51 +00:00
|
|
|
"#);
|
special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
|
|
|
|
assert_eq!(actual.out, "true");
|
|
|
|
}
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn basic_string_fails() {
|
2023-07-17 16:43:51 +00:00
|
|
|
let actual = nu!(r#"
|
special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
"hello world" | bytes starts-with 0x[68 65 6c 6c 6f]
|
2023-07-17 16:43:51 +00:00
|
|
|
"#);
|
special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
|
2023-07-14 03:20:35 +00:00
|
|
|
assert!(actual.err.contains("command doesn't support"));
|
special-case ExternalStream in bytes starts-with (#8203)
# Description
`bytes starts-with` converts the input into a `Value` before running
.starts_with to find if the binary matches. This has two side effects:
it makes the code simpler, only dealing in whole values, and simplifying
a lot of input pipeline handling and value transforming it would
otherwise have to do. _Especially_ in the presence of a cell path to
drill into. It also makes buffers the entire input into memory, which
can take up a lot of memory when dealing with large files, especially if
you only want to check the first few bytes (like for a magic number).
This PR adds a special branch on PipelineData::ExternalStream with a
streaming version of starts_with.
# User-Facing Changes
Opening large files and running bytes starts-with on them will not take
a long time.
# Tests + Formatting
Don't forget to add tests that cover your changes.
Make sure you've run and fixed any issues with these commands:
- `cargo fmt --all -- --check` to check standard code formatting (`cargo
fmt --all` applies these changes)
- `cargo clippy --workspace -- -D warnings -D clippy::unwrap_used -A
clippy::needless_collect` to check that you're using the standard code
style
- `cargo test --workspace` to check that all tests pass
# Drawbacks
Streaming checking is more complicated, and there may be bugs. I tested
it with multiple chunks with string data and binary data and it seems to
work alright up to 8k and over bytes, though.
The existing `operate` method still exists because the way it handles
cell paths and values is complicated. This causes some "code
duplication", or at least some intent duplication, between the value
code and the streaming code. This might be worthwhile considering the
performance gains (approaching infinity on larger inputs).
Another thing to consider is that my ExternalStream branch considers
string data as valid input. The operate branch only parses Binary
values, so it would fail. `open` is kind of unpredictable on whether it
returns string data or binary data, even when passing `--raw`. I think
this can be a problem but not really one I'm trying to tackle in this
PR, so, it's worth considering.
2023-02-26 14:17:44 +00:00
|
|
|
assert_eq!(actual.out, "");
|
|
|
|
}
|
|
|
|
|
Replace `ExternalStream` with new `ByteStream` type (#12774)
# Description
This PR introduces a `ByteStream` type which is a `Read`-able stream of
bytes. Internally, it has an enum over three different byte stream
sources:
```rust
pub enum ByteStreamSource {
Read(Box<dyn Read + Send + 'static>),
File(File),
Child(ChildProcess),
}
```
This is in comparison to the current `RawStream` type, which is an
`Iterator<Item = Vec<u8>>` and has to allocate for each read chunk.
Currently, `PipelineData::ExternalStream` serves a weird dual role where
it is either external command output or a wrapper around `RawStream`.
`ByteStream` makes this distinction more clear (via `ByteStreamSource`)
and replaces `PipelineData::ExternalStream` in this PR:
```rust
pub enum PipelineData {
Empty,
Value(Value, Option<PipelineMetadata>),
ListStream(ListStream, Option<PipelineMetadata>),
ByteStream(ByteStream, Option<PipelineMetadata>),
}
```
The PR is relatively large, but a decent amount of it is just repetitive
changes.
This PR fixes #7017, fixes #10763, and fixes #12369.
This PR also improves performance when piping external commands. Nushell
should, in most cases, have competitive pipeline throughput compared to,
e.g., bash.
| Command | Before (MB/s) | After (MB/s) | Bash (MB/s) |
| -------------------------------------------------- | -------------:|
------------:| -----------:|
| `throughput \| rg 'x'` | 3059 | 3744 | 3739 |
| `throughput \| nu --testbin relay o> /dev/null` | 3508 | 8087 | 8136 |
# User-Facing Changes
- This is a breaking change for the plugin communication protocol,
because the `ExternalStreamInfo` was replaced with `ByteStreamInfo`.
Plugins now only have to deal with a single input stream, as opposed to
the previous three streams: stdout, stderr, and exit code.
- The output of `describe` has been changed for external/byte streams.
- Temporary breaking change: `bytes starts-with` no longer works with
byte streams. This is to keep the PR smaller, and `bytes ends-with`
already does not work on byte streams.
- If a process core dumped, then instead of having a `Value::Error` in
the `exit_code` column of the output returned from `complete`, it now is
a `Value::Int` with the negation of the signal number.
# After Submitting
- Update docs and book as necessary
- Release notes (e.g., plugin protocol changes)
- Adapt/convert commands to work with byte streams (high priority is
`str length`, `bytes starts-with`, and maybe `bytes ends-with`).
- Refactor the `tee` code, Devyn has already done some work on this.
---------
Co-authored-by: Devyn Cairns <devyn.cairns@gmail.com>
2024-05-16 14:11:18 +00:00
|
|
|
// #[test]
|
|
|
|
// fn short_stream_binary() {
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[01]) 5 | bytes starts-with 0x[010101]
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "true");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn short_stream_mismatch() {
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[010203]) 5 | bytes starts-with 0x[010204]
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "false");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn short_stream_binary_overflow() {
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[01]) 5 | bytes starts-with 0x[010101010101]
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "false");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_binary() {
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[01]) 32768 | bytes starts-with 0x[010101]
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "true");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_binary_overflow() {
|
|
|
|
// // .. ranges are inclusive..inclusive, so we don't need to +1 to check for an overflow
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[01]) 32768 | bytes starts-with (0..32768 | each {|| 0x[01] } | bytes collect)
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "false");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_binary_exact() {
|
|
|
|
// // ranges are inclusive..inclusive, so we don't need to +1 to check for an overflow
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater (0x[01020304]) 8192 | bytes starts-with (0..<8192 | each {|| 0x[01020304] } | bytes collect)
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "true");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_string_exact() {
|
|
|
|
// // ranges are inclusive..inclusive, so we don't need to +1 to check for an overflow
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// nu --testbin repeater hell 8192 | bytes starts-with (0..<8192 | each {|| "hell" | into binary } | bytes collect)
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(actual.out, "true");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_mixed_exact() {
|
|
|
|
// // ranges are inclusive..inclusive, so we don't need to +1 to check for an overflow
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// let binseg = (0..<2048 | each {|| 0x[003d9fbf] } | bytes collect)
|
|
|
|
// let strseg = (0..<2048 | each {|| "hell" | into binary } | bytes collect)
|
|
|
|
|
|
|
|
// nu --testbin repeat_bytes 003d9fbf 2048 68656c6c 2048 | bytes starts-with (bytes build $binseg $strseg)
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(
|
|
|
|
// actual.err, "",
|
|
|
|
// "invocation failed. command line limit likely reached"
|
|
|
|
// );
|
|
|
|
// assert_eq!(actual.out, "true");
|
|
|
|
// }
|
|
|
|
|
|
|
|
// #[test]
|
|
|
|
// fn long_stream_mixed_overflow() {
|
|
|
|
// // ranges are inclusive..inclusive, so we don't need to +1 to check for an overflow
|
|
|
|
// let actual = nu!(r#"
|
|
|
|
// let binseg = (0..<2048 | each {|| 0x[003d9fbf] } | bytes collect)
|
|
|
|
// let strseg = (0..<2048 | each {|| "hell" | into binary } | bytes collect)
|
|
|
|
|
|
|
|
// nu --testbin repeat_bytes 003d9fbf 2048 68656c6c 2048 | bytes starts-with (bytes build $binseg $strseg 0x[01])
|
|
|
|
// "#);
|
|
|
|
|
|
|
|
// assert_eq!(
|
|
|
|
// actual.err, "",
|
|
|
|
// "invocation failed. command line limit likely reached"
|
|
|
|
// );
|
|
|
|
// assert_eq!(actual.out, "false");
|
|
|
|
// }
|