The translation is fairly direct though it adds some duplication, for example
there are multiple "match" statements that mimic function overloading.
Rust has no overloading, and we cannot have generic methods in the Node trait
(due to a Rust limitation, the error is like "cannot be made into an object")
so we include the type name in method names.
Give clients like "indent_visitor_t" a Rust companion ("IndentVisitor")
that takes care of the AST traversal while the AST consumption remains
in C++ for now. In future, "IndentVisitor" should absorb the entirety of
"indent_visitor_t". This pattern requires that "fish_indent" be exposed
includable header to the CXX bridge.
Alternatively, we could define FFI wrappers for recursive AST traversal.
Rust requires we separate the AST visitors for "mut" and "const"
scenarios. Take this opportunity to concretize both visitors:
The only client that requires mutable access is the populator. To match the
structure of the C++ populator which makes heavy use of function overloading,
we need to add a bunch of functions to the trait. Since there is no other
mutable visit, this seems acceptable.
The "const" visitors never use "will_visit_fields_of()" or
"did_visit_fields_of()", so remove them (though this is debatable).
Like in the C++ implementation, the AST nodes themselves are largely defined
via macros. Union fields like "Statement" and "ArgumentOrRedirection"
do currently not use macros but may in future.
This commit also introduces a precedent for a type that is defined in one
CXX bridge and used in another one - "ParseErrorList". To make this work
we need to manually define "ExternType".
There is one annoyance with CXX: functions that take explicit lifetime
parameters require to be marked as unsafe. This makes little sense
because functions that return `&Foo` with implicit lifetime can be
misused the same way on the C++ side.
One notable change is that we cannot directly port "find_block_open_keyword()"
(which is used to compute an error) because it relies on the stack of visited
nodes. We cannot modify a stack of node references while we do the "mut"
walk. Happily, an idiomatic solution is easy: we can tell the AST visitor
to backtrack to the parent node and create the error there.
Since "node_t::accept_base" is no longer a template we don't need the
"node_visitation_t" trampoline anymore.
The added copying at the FFI boundary makes things slower (memcpy dominates
the profile) but it's not unusable, which is good news:
$ hyperfine ./fish.{old,new}" -c 'source ../share/completions/git.fish'"
Benchmark 1: ./fish.old -c 'source ../share/completions/git.fish'
Time (mean ± σ): 195.5 ms ± 2.9 ms [User: 190.1 ms, System: 4.4 ms]
Range (min … max): 193.2 ms … 205.1 ms 15 runs
Benchmark 2: ./fish.new -c 'source ../share/completions/git.fish'
Time (mean ± σ): 677.5 ms ± 62.0 ms [User: 665.4 ms, System: 10.0 ms]
Range (min … max): 611.7 ms … 805.5 ms 10 runs
Summary
'./fish.old -c 'source ../share/completions/git.fish'' ran
3.47 ± 0.32 times faster than './fish.new -c 'source ../share/completions/git.fish''
Leftovers:
- Enum variants are still snakecase; I didn't get around to changing this yet.
- "ast_type_to_string()" still returns a snakecase name. This could be
changed since it's not user visible.
Most of it is duplicated, hence untested.
Functions like mbrtowc are not exposed by the libc crate, so declare them
ourselves.
Since we don't know the definition of C macros, add two big hacks to make
this work:
1. Replace MB_LEN_MAX and mbstate_t with values (resp types) that should
be large enough for any implementation.
2. Detect the definition of MB_CUR_MAX in the build script. This requires
more changes for each new libc. We could also use this approach for 1.
Additionally, this commit brings a small behavior change to
read_unquoted_escape(): we cannot decode surrogate code points like \UDE01
into a Rust char, so use � (\UFFFD, replacement character) instead.
Previously, we added such code points to a wcstring; looks like they were
ignored when printed.
Let's hope this doesn't causes build failures for e.g. musl: I just
know it's good on macOS and our Linux CI.
It's been a long time.
One fix this brings, is I discovered we #include assert.h or cassert
in a lot of places. If those ever happen to be in a file that doesn't
include common.h, or we are before common.h gets included, we're
unawaringly working with the system 'assert' macro again, which
may get disabled for debug builds or at least has different
behavior on crash. We undef 'assert' and redefine it in common.h.
Those were all eliminated, except in one catch-22 spot for
maybe.h: it can't include common.h. A fix might be to
make a fish_assert.h that *usually* common.h exports.
This makes the awkward case
fish: Unexpected end of string, square brackets do not match
echo f[oo # not valid, no matching ]
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(that `]` is simply the last character on the line, it's firmly in a comment)
less awkward by only marking the starting brace.
The implementation here is awkward mostly because the tok_t
communicates two things: The error location and how to carry on.
So we need to store the error length separately, and this is the first
time we've done so.
It's possible we can make this simpler.
This fixes a regression about where we report errors:
echo error(here
old: ^
fixed: ^
Commit 0c22f67bd (Remove the old parser bits, 2020-07-02) removed
uses of "error_offset_within_token" so we always report errors at
token start. Add it back, hopefully restoring the 3.1.2 behavior.
Note that for cases like
echo "$("
we report "unbalanced quotes" because we treat the $( as double
quote. Giving a better error seems hard because of the ambguity -
we don't know if quote is meant to be inside or outside the command
substitution.
Very slight performance increase (1% when parsing *all .fish scripts
in fish-shell*), but this removes a useless variable and some
.c_str()inging.
Theoretically it should also remove some wcslen() calls, but those
seem to be optimized out?
This removes the "did_visit" message because it doesn't really add
anything.
For example:
```
ast-construction: make job_list 0x55a6d19729f0
ast-construction: make job_conjunction 0x55a6d1971c00
ast-construction: will_visit job_conjunction 0x55a6d1971c00
ast-construction: will_visit job 0x55a6d1971c18
ast-construction: variable_assignment_list size: 0
ast-construction: will_visit statement 0x55a6d1971c48
ast-construction: make decorated_statement 0x55a6d1972650
ast-construction: will_visit decorated_statement 0x55a6d1972650
ast-construction: make argument_or_redirection 0x55a6d1968310
ast-construction: will_visit argument_or_redirection 0x55a6d1968310
ast-construction: make argument 0x55a6d197b0b0
ast-construction: did_visit argument_or_redirection 0x55a6d1968310
ast-construction: argument_or_redirection_list size: 1
ast-construction: did_visit decorated_statement 0x55a6d1972650
ast-construction: did_visit statement 0x55a6d1971c48
ast-construction: job_continuation_list size: 0
ast-construction: did_visit job 0x55a6d1971c18
ast-construction: job_conjunction_continuation_list size: 0
ast-construction: did_visit job_conjunction 0x55a6d1971c00
ast-construction: job_list size: 1
```
those "did_visit" messages all correspond to "will_visit" ones. They
are effectively block delimiters like `end` or `}`.
If we remove them it turns into:
```
ast-construction: make job_list 0x55a6d19729f0
ast-construction: make job_conjunction 0x55a6d1971c00
ast-construction: will_visit job_conjunction 0x55a6d1971c00
ast-construction: will_visit job 0x55a6d1971c18
ast-construction: variable_assignment_list size: 0
ast-construction: will_visit statement 0x55a6d1971c48
ast-construction: make decorated_statement 0x55a6d1972650
ast-construction: will_visit decorated_statement 0x55a6d1972650
ast-construction: make argument_or_redirection 0x55a6d1968310
ast-construction: will_visit argument_or_redirection 0x55a6d1968310
ast-construction: make argument 0x55a6d197b0b0
ast-construction: argument_or_redirection_list size: 1
ast-construction: job_continuation_list size: 0
ast-construction: job_conjunction_continuation_list size: 0
ast-construction: job_list size: 1
```
Which is still unambiguous because of the indentation.
(this is still *super verbose* and we might want to remove it from the
`*` "all" debug category and only allow turning it on explicitly)
In an interactive shell, typing "for x in (<RET>" would print an error:
fish: Expected end of the statement, but found a parse_token_type_t::tokenizer_error
Our tokenizer converts "(" into a special error token, hence this message.
Fix two cases by not reporting errors, but only if we allow parsing incomplete
input. I'm not really sure if this is necessary, but it's sufficient.
Fixes#7693
This reworks the "a=" detection to be simpler.
If we detect a variable assignment that produces an error,
simply consume it.
We also take the opportunity to not highlight it as an error,
and add some tests.
Original commit is 1ca05d32d3.
Typing that command in an interactive prompt would make the highlighter thread
eat up CPU and memory. Probably not the right fix; I think the token should
already have been consumed when the error is detected, then there is no need
to consume it when unwinding.
This is the first commit of a series intended to replace the existing
"parse tree" machinery. It adds a new abstract syntax tree and uses a more
normal recursive descent parser.
Initially there are no users of the new ast. The following commits will
replace parse_tree -> ast for all usages.