Aleksey Kladov 2018-08-24 18:14:21 +03:00
parent 6cade3f6d8
commit 2812015d40
5 changed files with 102 additions and 672 deletions

107
README.md

@ -5,13 +5,110 @@
[![Build status](https://ci.appveyor.com/api/projects/status/j56x1hbje8rdg6xk/branch/master?svg=true)](https://ci.appveyor.com/project/matklad/libsyntax2/branch/master)
libsyntax2.0 is an **experimental** parser of the Rust language,
intended for the use in IDEs.
[RFC](https://github.com/rust-lang/rfcs/pull/2256).
libsyntax2.0 is an **experimental** implementation of the corresponding [RFC](https://github.com/rust-lang/rfcs/pull/2256).
See [`docs`](./docs) folder to learn how libsyntax2 works, and check
[`CONTRIBUTING.md`](./CONTRIBUTING.md) if you want to contribute!
**WARNING** everything is in a bit of a flux recently, the docs are obsolete,
see the recent work on red/green trees.
## Quick Start
```
$ cargo test
$ cargo parse < crates/libsyntax2/src/lib.rs
```
## Trying It Out
This installs experimental VS Code plugin
```
$ cargo install-code
```
It's better to remove existing Rust plugins to avoid interference.
Warning: plugin is not intended for general use, has a lot of rough
edges and missing features (notably, no code completion). That said,
while originally libsyntax2 was developed in IntelliJ, @matklad now
uses this plugin (and thus, libsyntax2) to develop libsyntax2, and it
doesn't hurt too much :-)
### Features:
* syntax highlighting (LSP does not have API for it, so impl is hacky
  and sometimes falls back to the horrible built-in highlighting)
* commands (`ctrl+shift+p` or keybindings)
  - **Show Rust Syntax Tree** (use it to verify that plugin works)
  - **Rust Extend Selection** (works with multiple cursors)
  - **Rust Matching Brace** (knows the difference between `<` and `<`)
  - **Rust Parent Module**
  - **Rust Join Lines** (deals with trailing commas)
* **Go to symbol in file**
* **Go to symbol in workspace** (no support for Cargo deps yet)
* code actions:
  - Flip `,` in comma separated lists
  - Add `#[derive]` to struct/enum
  - Add `impl` block to struct/enum
  - Run tests at caret
* **Go to definition** ("correct" for `mod foo;` decls, index-based for functions).
## Code Walk-Through
### `crates/libsyntax2`
- `yellow`, red/green syntax tree, heavily inspired [by this](https://github.com/apple/swift/tree/ab68f0d4cbf99cdfa672f8ffe18e433fddc8b371/lib/Syntax)
- `grammar`, the actual parser
- `parser_api/parser_impl` bridges the tree-agnostic parser from `grammar` with `yellow` trees
- `grammar.ron` RON description of the grammar, which is used to
generate `syntax_kinds` and `ast` modules.
- `algo`: generic tree algorithms, including `walk` for O(1) stack
space tree traversal (this is cool) and `visit` for type-driven
visiting the nodes (this is double plus cool, if you understand how
`Visitor` works, you understand libsyntax2).
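To illustrate how a traversal can run in O(1) stack space, here is a minimal sketch. It is not the actual `algo::walk`: the `NodeData` layout, the `WalkEvent` enum and the function signature are assumptions made for the example. The idea is that if every node stores links to its parent, first child and next sibling, the next step of the traversal can be computed from the current event alone, so no recursion and no explicit stack are needed.

```rust
/// Hypothetical flat tree: each node links to its parent, first child
/// and next sibling by index into a slice.
pub struct NodeData {
    pub kind: &'static str,
    pub parent: Option<usize>,
    pub first_child: Option<usize>,
    pub next_sibling: Option<usize>,
}

/// Hypothetical event type: a node is entered before its children
/// and left after them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalkEvent {
    Enter(usize),
    Leave(usize),
}

/// Traverses the subtree rooted at `root`. The only state carried
/// between steps is the next event, so stack usage is constant.
pub fn walk(nodes: &[NodeData], root: usize) -> impl Iterator<Item = WalkEvent> + '_ {
    let mut next = Some(WalkEvent::Enter(root));
    std::iter::from_fn(move || {
        let current = next?;
        next = match current {
            // After entering a node, descend into its first child,
            // or leave it right away if it is a leaf.
            WalkEvent::Enter(idx) => match nodes[idx].first_child {
                Some(child) => Some(WalkEvent::Enter(child)),
                None => Some(WalkEvent::Leave(idx)),
            },
            // Leaving the root ends the traversal.
            WalkEvent::Leave(idx) if idx == root => None,
            // Otherwise move on to the next sibling, or climb to the parent.
            WalkEvent::Leave(idx) => match nodes[idx].next_sibling {
                Some(sibling) => Some(WalkEvent::Enter(sibling)),
                None => nodes[idx].parent.map(WalkEvent::Leave),
            },
        };
        Some(current)
    })
}
```

A caller that only needs preorder visits can filter for `WalkEvent::Enter`; one that needs nesting depth can count `Enter` and `Leave` events.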
### `crates/libeditor`
Most of the IDE features live here. Unlike `libanalysis`, `libeditor` is
single-file and is basically a bunch of pure functions.
### `crates/libanalysis`
A stateful library for analyzing many Rust files as they change.
`WorldState` is a mutable entity (clojure's atom) which holds current
state, incorporates changes and hands out `World`s --- immutable
consistent snapshots of `WorldState`, which actually power analysis.
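The split between a mutable state and immutable snapshots could be sketched roughly as follows. This is an illustration only: the field layout, `change_file`, `snapshot` and `file_text` are hypothetical names, not the actual `libanalysis` API. The sketch assumes that sharing is done with `Arc`, so handing out a snapshot is a cheap pointer copy and later changes never affect snapshots that were already handed out.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Immutable snapshot of all known files (hypothetical layout).
#[derive(Clone, Default)]
pub struct World {
    files: Arc<HashMap<String, Arc<String>>>,
}

impl World {
    /// Text of a file as it was when the snapshot was taken.
    pub fn file_text(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(|text| text.as_str())
    }
}

/// Mutable entity that holds the current state.
#[derive(Default)]
pub struct WorldState {
    current: World,
}

impl WorldState {
    /// Incorporates a change. `Arc::make_mut` clones the map only if
    /// some snapshot still shares it (copy-on-write).
    pub fn change_file(&mut self, path: &str, text: &str) {
        let files = Arc::make_mut(&mut self.current.files);
        files.insert(path.to_string(), Arc::new(text.to_string()));
    }

    /// Hands out a consistent snapshot of the current state.
    pub fn snapshot(&self) -> World {
        self.current.clone()
    }
}
```

Analysis code that holds a `World` can run on another thread while the editor keeps feeding changes into `WorldState`.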
### `crates/server`
An LSP implementation which uses `libanalysis` for managing state and
`libeditor` for actually doing useful stuff.
### `crates/cli`
A CLI interface to libsyntax
### `crates/tools`
Code-gen tasks, used to develop libsyntax2:
- `cargo gen-kinds` -- generate `ast` and `syntax_kinds`
- `cargo gen-tests` -- collect inline tests from grammar
- `cargo install-code` -- build and install VS Code extension and server
### `code`
VS Code plugin
## License


@ -1,93 +0,0 @@
# Design and open questions about libsyntax
The high-level description of the architecture is in RFC.md. You might
also want to dig through https://github.com/matklad/fall/ which
contains some pretty interesting stuff built using similar ideas
(warning: it is completely undocumented, poorly written and in general
not the thing which I recommend to study (yes, this is
self-contradictory)).
## Tree
The centerpiece of this whole endeavor is the syntax tree, in the
`tree` module. Open questions:
- how to best represent errors, to take advantage of the fact that
they are rare, but to enable fully-persistent style structure
sharing between tree nodes?
- should we make red/green split from Roslyn more pronounced?
- one can layout nodes in a single array in such a way that children
of the node form a continuous slice. Seems nifty, but do we need it?
- should we use SoA or AoS for NodeData?
- should we split leaf nodes and internal nodes into separate arrays?
Can we use it to save some bits here and there? (leaves don't need
first_child field, for example).
## Parser
The syntax tree is produced using a three-staged process.
First, a raw text is split into tokens with a lexer (the `lexer` module).
Lexer has a peculiar signature: it is an `Fn(&str) -> Token`, where token
is a pair of `SyntaxKind` (you should have read the `tree` module and RFC
by this time! :)) and a len. That is, lexer chomps only the first
token of the input. This forces the lexer to be stateless, and makes
it possible to implement incremental relexing easily.
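A lexer with this shape could look like the sketch below. It is a toy, not the `lexer` module: the four token kinds and the names `first_token`, `prefix_len` and `tokenize` are assumptions made for the example. What it shows is that a function which only chomps the first token carries no state between calls, and that lexing a whole string (or relexing from any token boundary) is a plain loop over it.

```rust
/// Hypothetical token kinds; a real `SyntaxKind` set is much larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    Whitespace,
    Ident,
    Number,
    Error,
}

/// A token is a kind plus a length in bytes. It does not borrow the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: SyntaxKind,
    pub len: usize,
}

/// Byte length of the longest prefix whose chars all satisfy `pred`.
fn prefix_len(text: &str, pred: impl Fn(char) -> bool) -> usize {
    text.char_indices()
        .find(|&(_, c)| !pred(c))
        .map_or(text.len(), |(idx, _)| idx)
}

/// Chomps only the first token of `text`. Panics on empty input.
pub fn first_token(text: &str) -> Token {
    let first = text.chars().next().expect("input must be non-empty");
    let (kind, len) = if first.is_whitespace() {
        (SyntaxKind::Whitespace, prefix_len(text, char::is_whitespace))
    } else if first.is_alphabetic() || first == '_' {
        (SyntaxKind::Ident, prefix_len(text, |c| c.is_alphanumeric() || c == '_'))
    } else if first.is_ascii_digit() {
        (SyntaxKind::Number, prefix_len(text, |c| c.is_ascii_digit()))
    } else {
        // Anything unrecognized becomes a one-char error token.
        (SyntaxKind::Error, first.len_utf8())
    };
    Token { kind, len }
}

/// Lexing a whole string is a loop over `first_token`.
pub fn tokenize(mut text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    while !text.is_empty() {
        let token = first_token(text);
        tokens.push(token);
        text = &text[token.len..];
    }
    tokens
}
```

Because `first_token` depends only on the text it is given, an incremental relexer can restart it at the token boundary just before an edit and stop as soon as the new tokens line up with the old ones again.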
Then, the bulk of work, the parser turns a stream of tokens into
stream of events (the `parser` module; of particular interest are
the `parser/event` and `parser/parser` modules, which contain parsing
API, and the `parser/grammar` module, which contains actual parsing code
for various Rust syntactic constructs). Note that the parser **does not**
construct a tree right away. This is done for several reasons:
* to decouple the actual tree data structure from the parser: you can
build any data structure you want from the stream of events
* to make parsing fast: you can produce a list of events without
allocations
* to make it easy to tweak tree structure. Consider this code:
```
#[cfg(test)]
pub fn foo() {}
```
Here, the attribute and the `pub` keyword must be the children of
the `fn` node. However, when parsing them, we don't yet know if
there would be a function ahead: it very well might be a `struct`
there. If we use events, we generally don't care about this *in
parser* and just spit them in order.
* (Is this true?) to make incremental reparsing easier: you can reuse
the same rope data structure for all of the original string, the
tokens and the events.
The parser also does not know about whitespace tokens: it's the job of
the next layer to assign whitespace and comments to nodes. However,
parser can remap contextual tokens, like `>>` or `union`, so it has
access to the text.
And at last, the TreeBuilder converts a flat stream of events into a
tree structure. It also *should* be responsible for attaching comments
and rebalancing the tree, but it does not do this yet :)
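The last stage can be sketched as follows. The `Event` enum, the `Tree` struct and `build_tree` are hypothetical and much simpler than whatever `parser/event` and the real TreeBuilder do; in particular, the sketch ignores whitespace, comments and the reordering needed for cases like attributes before `fn`. It only shows that a flat, well-nested event stream is enough to build a tree, and that the tree type is chosen by the builder rather than by the parser.

```rust
/// Hypothetical event type emitted by a parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// Open an internal node of the given kind.
    Start(&'static str),
    /// Attach a leaf token of the given kind to the open node.
    Token(&'static str),
    /// Close the most recently opened node.
    Finish,
}

/// One possible tree representation. A different builder could
/// produce a different data structure from the same events.
#[derive(Debug, PartialEq, Eq)]
pub struct Tree {
    pub kind: &'static str,
    pub children: Vec<Tree>,
}

/// Converts a flat event stream into a tree. Returns `None` if the
/// events are unbalanced or do not form exactly one root node.
pub fn build_tree(events: &[Event]) -> Option<Tree> {
    let mut stack: Vec<Tree> = Vec::new();
    let mut root = None;
    for &event in events {
        match event {
            Event::Start(kind) => {
                // A second top-level node is not allowed.
                if root.is_some() {
                    return None;
                }
                stack.push(Tree { kind, children: Vec::new() });
            }
            Event::Token(kind) => {
                let leaf = Tree { kind, children: Vec::new() };
                stack.last_mut()?.children.push(leaf);
            }
            Event::Finish => {
                let node = stack.pop()?;
                match stack.last_mut() {
                    Some(parent) => parent.children.push(node),
                    None => root = Some(node),
                }
            }
        }
    }
    if stack.is_empty() { root } else { None }
}
```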
## Validator
Parser and lexer accept a lot of *invalid* code intentionally. The
idea is to post-process the tree and to do proper error reporting,
literal conversion and quick-fix suggestions. There is no
design/implementation for this yet.
## AST
Nothing yet, see `AstNode` in `fall`.


@ -1,494 +0,0 @@
- Feature Name: libsyntax2.0
- Start Date: 2017-12-30
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
>I think the lack of reusability comes in object-oriented languages,
>not functional languages. Because the problem with object-oriented
>languages is they've got all this implicit environment that they
>carry around with them. You wanted a banana but what you got was a
>gorilla holding the banana and the entire jungle.
>
>If you have referentially transparent code, if you have pure
>functions — all the data comes in its input arguments and everything
>goes out and leave no state behind — it's incredibly reusable.
>
> **Joe Armstrong**
# Summary
[summary]: #summary
The long-term plan is to rewrite libsyntax parser and syntax tree data
structure to create a software component independent of the rest of
rustc compiler and suitable for the needs of IDEs and code
editors. This RFC is the first step of this plan, whose goal is to
find out if this is possible at least in theory. If it is possible,
the next steps would be a prototype implementation as a crates.io
crate and a separate RFC for integrating the prototype with rustc,
other tools, and eventual libsyntax removal.
Note that this RFC does not propose to stabilize any API for working
with rust syntax: the semver version of the hypothetical library would
be `0.1.0`. It is intended to be used by tools, which are currently
closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls`
and hypothetical `rustfix`. While it would be possible to create
third-party tools on top of the new libsyntax, the burden of adapting
to breaking changes would be on authors of such tools.
# Motivation
[motivation]: #motivation
There are two main drawbacks with the current version of libsyntax:
* It is tightly integrated with the compiler and hard to use
independently
* The AST representation is not well-suited for use inside IDEs
## IDE support
There are several differences in how IDEs and compilers typically
treat source code.
In the compiler, it is convenient to transform the source
code into Abstract Syntax Tree form, which is independent of the
surface syntax. For example, it's convenient to discard comments,
whitespaces and desugar some syntactic constructs in terms of the
simpler ones.
In contrast, IDEs work much closer to the source code, so it is
crucial to preserve full information about the original text. For
example, IDE may adjust indentation after typing a `}` which closes a
block, and to do this correctly, IDE must be aware of syntax (that is,
that `}` indeed closes some block, and is not a syntax error) and of
all whitespaces and comments. So, IDE suitable AST should explicitly
account for syntactic elements, not considered important by the
compiler.
Another difference is that IDEs typically work with incomplete and
syntactically invalid code. This boils down to two parser properties.
First, the parser must produce syntax tree even if some required input
is missing. For example, for input `fn foo` the function node should
be present in the parse, despite the fact that there are no parameters
or body. Second, the parser must be able to skip over parts of input
it can't recognize and aggressively recover from errors. That is, the
syntax tree data structure should be able to handle both missing and
extra nodes.
IDEs also need the ability to incrementally reparse and relex source
code after the user types. A smart IDE would use syntax tree structure
to handle editing commands (for example, to add/remove trailing commas
after join/split lines actions), so parsing time can be very
noticeable.
Currently rustc uses the classical AST approach, and preserves some of
the source code information in the form of spans in the AST. It is not
clear if this structure can fulfill all IDE requirements.
## Reusability
In theory, the parser can be a pure function, which takes a `&str` as
an input, and produces a `ParseTree` as an output.
This is great for reusability: for example, you can compile this
function to WASM and use it for fast client-side validation of syntax
on the rust playground, or you can develop tools like `rustfmt` on
stable Rust outside of rustc repository, or you can embed the parser
into your favorite IDE or code editor.
This is also great for correctness: with such simple interface, it's
possible to write property-based tests to thoroughly compare two
different implementations of the parser. It's also straightforward to
create a comprehensive test suite, because all the inputs and outputs
are trivially serializable to human-readable text.
Another benefit is performance: with this signature, you can cache a
parse tree for each file, with trivial strategy for cache invalidation
(invalidate an entry when the underlying file changes). On top of such
a cache it is possible to build a smart code indexer which maintains
the set of symbols in the project, watches files for changes and
automatically reindexes only changed files.
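The caching argument can be made concrete with a small sketch. Everything here is hypothetical: `ParseTree` and `parse` are stand-ins for a real tree and a real pure parser, and `ParseCache` is one possible shape for the cache. Because the parser is assumed to be a pure function of the text, an entry is stale exactly when the text changed, and no other invalidation logic is needed.

```rust
use std::collections::HashMap;

/// Stand-in for a real parse tree (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParseTree {
    pub token_count: usize,
}

/// Stand-in for a pure parser: the output depends only on `text`.
pub fn parse(text: &str) -> ParseTree {
    ParseTree { token_count: text.split_whitespace().count() }
}

/// Per-file cache of parse trees, keyed by path.
#[derive(Default)]
pub struct ParseCache {
    /// For each path: the text that was parsed and the resulting tree.
    entries: HashMap<String, (String, ParseTree)>,
    /// Number of times `parse` was actually called.
    pub reparses: usize,
}

impl ParseCache {
    /// Returns the tree for `path`, reparsing only if `text` differs
    /// from the text the cached tree was built from.
    pub fn get(&mut self, path: &str, text: &str) -> &ParseTree {
        let stale = self
            .entries
            .get(path)
            .map_or(true, |(cached, _)| cached.as_str() != text);
        if stale {
            self.reparses += 1;
            self.entries
                .insert(path.to_string(), (text.to_string(), parse(text)));
        }
        &self.entries[path].1
    }
}
```

An indexer built on top of such a cache would call `get` for every watched file and redo symbol extraction only for the files where a reparse happened.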
Unfortunately, the current libsyntax is far from this ideal. For
example, even the lexer makes use of the `FileMap` which is
essentially a global state of the compiler which represents all known
files. As a data point, it turned out to be easier to move `rustfmt`
into the main `rustc` repository than to move libsyntax outside!
# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
Not applicable.
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation
It is not clear if a single parser can accommodate the needs of the
compiler and the IDE, but there is hope that it is possible. The RFC
proposes to develop libsyntax2.0 as an experimental crates.io crate. If
the experiment turns out to be a success, the second RFC will propose
to integrate it with all existing tools and `rustc`.
Next, a syntax tree data structure is proposed for libsyntax2.0. It
seems to have the following important properties:
* It is lossless and faithfully represents the original source code,
including explicit nodes for comments and whitespace.
* It is flexible and allows to encode arbitrary node structure,
even for invalid syntax.
* It is minimal: it stores small amount of data and has no
dependencies. For instance, it does not need compiler's string
interner or literal data representation.
* While the tree itself is minimal, it is extensible in a sense that
it is possible to associate arbitrary data with certain nodes in a
type-safe way.
It is not clear if this representation is the best one. It is heavily
inspired by [PSI] data structure which is used in [IntelliJ] based IDEs
and in the [Kotlin] compiler.
[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
[IntelliJ]: https://github.com/JetBrains/intellij-community/
[Kotlin]: https://kotlinlang.org/
## Untyped Tree
The main idea is to store the minimal amount of information in the
tree itself, and instead lean heavily on the source code for the
actual data about identifier names, constant values etc.
All nodes in the tree are of the same type and store a constant for
the syntactic category of the element and a range in the source code.
Here is a minimal implementation of this data structure with some Rust
syntactic categories
```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct NodeKind(u16);
pub struct File {
text: String,
nodes: Vec<NodeData>,
}
struct NodeData {
kind: NodeKind,
range: (u32, u32),
parent: Option<u32>,
first_child: Option<u32>,
next_sibling: Option<u32>,
}
#[derive(Clone, Copy)]
pub struct Node<'f> {
file: &'f File,
idx: u32,
}
pub struct Children<'f> {
next: Option<Node<'f>>,
}
impl File {
pub fn root<'f>(&'f self) -> Node<'f> {
assert!(!self.nodes.is_empty());
Node { file: self, idx: 0 }
}
}
impl<'f> Node<'f> {
pub fn kind(&self) -> NodeKind {
self.data().kind
}
pub fn text(&self) -> &'f str {
let (start, end) = self.data().range;
&self.file.text[start as usize..end as usize]
}
pub fn parent(&self) -> Option<Node<'f>> {
self.as_node(self.data().parent)
}
pub fn children(&self) -> Children<'f> {
Children { next: self.as_node(self.data().first_child) }
}
fn data(&self) -> &'f NodeData {
&self.file.nodes[self.idx as usize]
}
fn as_node(&self, idx: Option<u32>) -> Option<Node<'f>> {
idx.map(|idx| Node { file: self.file, idx })
}
}
impl<'f> Iterator for Children<'f> {
type Item = Node<'f>;
fn next(&mut self) -> Option<Node<'f>> {
let next = self.next;
self.next = next.and_then(|node| node.as_node(node.data().next_sibling));
next
}
}
pub const ERROR: NodeKind = NodeKind(0);
pub const WHITESPACE: NodeKind = NodeKind(1);
pub const STRUCT_KW: NodeKind = NodeKind(2);
pub const IDENT: NodeKind = NodeKind(3);
pub const L_CURLY: NodeKind = NodeKind(4);
pub const R_CURLY: NodeKind = NodeKind(5);
pub const COLON: NodeKind = NodeKind(6);
pub const COMMA: NodeKind = NodeKind(7);
pub const AMP: NodeKind = NodeKind(8);
pub const LINE_COMMENT: NodeKind = NodeKind(9);
pub const FILE: NodeKind = NodeKind(10);
pub const STRUCT_DEF: NodeKind = NodeKind(11);
pub const FIELD_DEF: NodeKind = NodeKind(12);
pub const TYPE_REF: NodeKind = NodeKind(13);
```
Here is a rust snippet and the corresponding parse tree:
```rust
struct Foo {
field1: u32,
&
// non-doc comment
field2:
}
```
```
FILE
  STRUCT_DEF
    STRUCT_KW
    WHITESPACE
    IDENT
    WHITESPACE
    L_CURLY
    WHITESPACE
    FIELD_DEF
      IDENT
      COLON
      WHITESPACE
      TYPE_REF
        IDENT
    COMMA
    WHITESPACE
    ERROR
      AMP
    WHITESPACE
    FIELD_DEF
      LINE_COMMENT
      WHITESPACE
      IDENT
      COLON
      ERROR
    WHITESPACE
    R_CURLY
```
Note several features of the tree:
* All whitespace and comments are explicitly accounted for.
* The node for `STRUCT_DEF` contains the error element for `&`, but
still represents the following field correctly.
* The second field of the struct is incomplete: `FIELD_DEF` node for
it contains an `ERROR` element, but nevertheless has the correct
`NodeKind`.
* The non-documenting comment is correctly attached to the following
field.
## Typed Tree
It's hard to work with this raw parse tree, because it is untyped:
node containing a struct definition has the same API as the node for
the struct field. But it's possible to add a strongly typed layer on
top of this raw tree, and get a zero-cost AST. Here is an example
which adds type-safe wrappers for structs and fields:
```rust
// generic infrastructure
pub trait AstNode<'f>: Copy + 'f {
fn new(node: Node<'f>) -> Option<Self>;
fn node(&self) -> Node<'f>;
}
pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option<Node<'f>> {
node.children().find(|child| child.kind() == kind)
}
pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box<dyn Iterator<Item = A> + 'f> {
Box::new(node.children().filter_map(A::new))
}
// AST elements, specific to Rust
#[derive(Clone, Copy)]
pub struct StructDef<'f>(Node<'f>);
#[derive(Clone, Copy)]
pub struct FieldDef<'f>(Node<'f>);
#[derive(Clone, Copy)]
pub struct TypeRef<'f>(Node<'f>);
pub trait NameOwner<'f>: AstNode<'f> {
fn name_ident(&self) -> Node<'f> {
child_of_kind(self.node(), IDENT).unwrap()
}
fn name(&self) -> &'f str { self.name_ident().text() }
}
impl<'f> AstNode<'f> for StructDef<'f> {
fn new(node: Node<'f>) -> Option<Self> {
if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None }
}
fn node(&self) -> Node<'f> { self.0 }
}
impl<'f> NameOwner<'f> for StructDef<'f> {}
impl<'f> StructDef<'f> {
pub fn fields(&self) -> Box<dyn Iterator<Item = FieldDef<'f>> + 'f> {
ast_children(self.node())
}
}
impl<'f> AstNode<'f> for FieldDef<'f> {
fn new(node: Node<'f>) -> Option<Self> {
if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None }
}
fn node(&self) -> Node<'f> { self.0 }
}
impl<'f> FieldDef<'f> {
pub fn type_ref(&self) -> Option<TypeRef<'f>> {
ast_children(self.node()).next()
}
}
impl<'f> NameOwner<'f> for FieldDef<'f> {}
impl<'f> AstNode<'f> for TypeRef<'f> {
fn new(node: Node<'f>) -> Option<Self> {
if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None }
}
fn node(&self) -> Node<'f> { self.0 }
}
```
Note that although AST wrappers provide a type-safe access to the
tree, they are still represented as indexes, so clients of the syntax
tree can easily associate additional data with AST nodes by storing
it in a side-table.
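A side table of this kind might be as small as the sketch below. `NodeId` and `SideTable` are hypothetical names: the assumption is that a node can be identified by its index in the file's `nodes` array, so a client can key a map by that index without touching the tree itself.

```rust
use std::collections::HashMap;

/// Hypothetical stable key for a node: its index in the file's
/// `nodes` array.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NodeId(pub u32);

/// Extra per-node data of type `T`, stored outside the tree.
pub struct SideTable<T> {
    data: HashMap<NodeId, T>,
}

impl<T> SideTable<T> {
    pub fn new() -> Self {
        SideTable { data: HashMap::new() }
    }

    /// Associates `value` with `node`, returning the previous value if any.
    pub fn insert(&mut self, node: NodeId, value: T) -> Option<T> {
        self.data.insert(node, value)
    }

    /// Looks up the data associated with `node`.
    pub fn get(&self, node: NodeId) -> Option<&T> {
        self.data.get(&node)
    }
}
```

A type checker, for example, could keep a `SideTable` of inferred types keyed by the ids of `TYPE_REF` nodes, while the tree stays minimal.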
## Missing Source Code
The crucial feature of this syntax tree is that it is just a view into
the original source code. And this poses a problem for the Rust
language, because not all compiled Rust code is represented in the
form of source code! Specifically, Rust has a powerful macro system,
which effectively allows to create and parse additional source code at
compile time. It is not entirely clear that the proposed parsing
framework is able to handle this use case, and it's the main purpose
of this RFC to figure it out. The current idea for handling macros is
to make each macro expansion produce a triple of (expansion text,
syntax tree, hygiene information), where hygiene information is a side
table, which colors different ranges of the expansion text according
to the original syntactic context.
## Implementation plan
This RFC proposes huge changes to the internals of the compiler, so
it's important to proceed carefully and incrementally. The following
plan is suggested:
* RFC discussion about the theoretical feasibility of the proposal,
and the best representation for the syntax tree.
* Implementation of the proposal as a completely separate crates.io
crate, by refactoring existing libsyntax source code to produce a
new tree.
* A prototype implementation of the macro expansion on top of the new
syntax tree.
* Additional round of discussion/RFC about merging with the mainline
compiler.
# Drawbacks
[drawbacks]: #drawbacks
- No harm will be done as long as the new libsyntax exists as an
experiment on crates.io. However, actually using it in the compiler
and other tools would require massive refactorings.
- It's difficult to know upfront if the proposed syntax tree would
actually work well in both the compiler and IDE. It may be possible
that some drawbacks will be discovered during implementation.
# Rationale and alternatives
[alternatives]: #alternatives
- Incrementally add more information about source code to the current
AST.
- Move the current libsyntax to crates.io as is. In the past, there
were several failed attempts to do that.
- Explore alternative representations for the parse tree.
- Use parser generator instead of hand written parser. Using the
parser from libsyntax directly would be easier, and hand-written
LL-style parsers usually have much better error recovery than
generated LR-style ones.
# Unresolved questions
[unresolved]: #unresolved-questions
- Is it at all possible to represent Rust parser as a pure function of
the source code? It seems like the answer is yes, because the
language and especially macros were cleverly designed with this
use-case in mind.
- Is it possible to implement macro expansion using the proposed
framework? This is the main question of this RFC. The proposed
solution of synthesizing source code on the fly seems workable: it's
not that different from the current implementation, which
synthesizes token trees.
- How to actually phase out current libsyntax, if libsyntax2.0 turns
out to be a success?


@ -1,44 +0,0 @@
# libsyntax2.0 testing infrastructure
Libsyntax2.0 tests are in the `tests/data` directory. Each test is a
pair of files, an `.rs` file with Rust code and a `.txt` file with a
human-readable representation of syntax tree.
The test suite is intended to be independent from a particular parser:
that's why it is just a list of files.
The test suite is intended to be progressive: that is, if you want to
write a Rust parser, you can TDD it by working through the tests in
order. That's why each test file begins with the number. Generally,
tests should be added in order of the appearance of corresponding
functionality in libsyntax2.0. If a bug in parser is uncovered, a
**new** test should be created instead of modifying an existing one:
it is preferable to have a gazillion of small isolated test files,
rather than a single file which covers all edge cases. It's okay for
files to have the same name except for the leading number. In general,
test suite should be append-only: old tests should not be modified,
new tests should be created instead.
Note that only `ok` tests are normative: `err` tests test error
recovery and it is totally ok for a parser to not implement any error
recovery at all. However, for libsyntax2.0 we do care about error
recovery, and we do care about precise and useful error messages.
There are also so-called "inline tests". They appear as the comments
with a `test` header in the source code, like this:
```rust
// test fn_basic
// fn foo() {}
fn function(p: &mut Parser) {
// ...
}
```
You can run `cargo collect-tests` command to collect all inline tests
into `tests/data/inline` directory. The main advantage of inline tests
is that they help to illustrate what the relevant code is doing.
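Collecting such tests from source text could work roughly like the sketch below. This is not the code behind `cargo collect-tests`: the `InlineTest` struct, the `collect_tests` function and the exact comment conventions (a `// test NAME` header followed by consecutive `// ` lines) are assumptions based on the example above.

```rust
/// Hypothetical representation of one inline test.
#[derive(Debug, PartialEq, Eq)]
pub struct InlineTest {
    pub name: String,
    pub text: String,
}

/// Scans `source` for `// test NAME` headers. The comment lines that
/// directly follow a header form the body of that test.
pub fn collect_tests(source: &str) -> Vec<InlineTest> {
    let mut tests = Vec::new();
    // Indentation is ignored so tests can sit next to nested code.
    let mut lines = source.lines().map(str::trim_start).peekable();
    while let Some(line) = lines.next() {
        let name = match line.strip_prefix("// test ") {
            Some(name) => name.trim().to_string(),
            None => continue,
        };
        let mut text = String::new();
        // Consume consecutive `// ` lines, stripping the comment marker.
        while let Some(body) = lines.peek().and_then(|l| l.strip_prefix("// ")) {
            text.push_str(body);
            text.push('\n');
            lines.next();
        }
        tests.push(InlineTest { name, text });
    }
    tests
}
```

Each collected test could then be written to a file named after the test, with the parser's output stored next to it as the expected result.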
Contribution opportunity: design and implement testing infrastructure
for validators.


@ -1,36 +0,0 @@
# Tools used to implement libsyntax
libsyntax uses several tools to help with development.
Each tool is a binary in the [tools/](../tools) package.
You can run them via `cargo run` command.
```
cargo run --package tools --bin tool
```
There are also aliases in [.cargo/config](../.cargo/config),
so the following also works:
```
cargo tool
```
## Tool: `gen`
This tool reads a "grammar" from [grammar.ron](../grammar.ron) and
generates the `syntax_kinds.rs` file. You should run this tool if you
add new keywords or syntax elements.
## Tool: `parse`
This tool reads rust source code from the standard input, parses it,
and prints the result to stdout.
## Tool: `collect-tests`
This tool collects inline tests from comments in libsyntax2 source code
and places them into `tests/data/inline` directory.