rust-analyzer

mirror of https://github.com/rust-lang/rust-analyzer synced 2024-12-27 05:23:24 +00:00


bors e64610dbbe Auto merge of #17037 - davidsemakula:token-set-collisions, r=Veykril

internal: improve `TokenSet` implementation and add reserved keywords

The current `TokenSet` type represents "A bit-set of `SyntaxKind`s" as a newtype `u128`.
Internally, the flag for each `SyntaxKind` variant in the bit-set is set as the n-th LSB (least significant bit) via a bit-wise left shift operation, where n is the discriminant.
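The layout described above can be sketched as follows. This is a minimal illustration, not the code in `crates/parser/src/token_set.rs`: the `Kind` enum is a hypothetical four-variant stand-in for the generated `SyntaxKind`, and the method names are assumptions chosen for the example.

```rust
/// Hypothetical stand-in for the parser's `SyntaxKind`; the real enum is
/// generated and has far more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u16)]
pub enum Kind {
    Semicolon = 0,
    Comma = 1,
    LParen = 2,
    RParen = 3,
}

/// A bit-set of `Kind`s stored in a single `u128` newtype.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TokenSet(u128);

impl TokenSet {
    pub const EMPTY: TokenSet = TokenSet(0);

    /// Sets one bit per kind; usable in `const` contexts.
    pub const fn new(kinds: &[Kind]) -> TokenSet {
        let mut bits = 0u128;
        let mut i = 0;
        while i < kinds.len() {
            bits |= mask(kinds[i]);
            i += 1;
        }
        TokenSet(bits)
    }

    /// Set union is a single bit-wise OR.
    pub const fn union(self, other: TokenSet) -> TokenSet {
        TokenSet(self.0 | other.0)
    }

    /// Membership is a single bit-wise AND.
    pub const fn contains(&self, kind: Kind) -> bool {
        self.0 & mask(kind) != 0
    }
}

/// The flag for a kind is the n-th least significant bit,
/// where n is the kind's discriminant.
const fn mask(kind: Kind) -> u128 {
    1u128 << (kind as usize)
}
```

With this layout, `TokenSet::new(&[Kind::Comma, Kind::RParen])` has bits 1 and 3 set, so `contains(Kind::Comma)` is true and `contains(Kind::Semicolon)` is false.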

Edit: This is problematic because there are currently ~121 token `SyntaxKind`s, so adding new token kinds for the missing reserved keywords pushes the number of token `SyntaxKind`s above 128, which makes this ["mask"](7a8374c162/crates/parser/src/token_set.rs (L31-L33)) operation overflow.
~~This is problematic because there's currently 266 SyntaxKinds, so this ["mask"](7a8374c162/crates/parser/src/token_set.rs (L31-L33)) operation silently overflows in release mode.~~
~~This leads to a single flag/bit in the bit-set being shared by multiple `SyntaxKind`s~~.
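To make the 128 limit concrete, here is a small sketch using the standard library's `checked_shl` and `wrapping_shl` on `u128`. The helper names `checked_mask` and `wrapping_mask` are hypothetical and exist only for this illustration; they are not part of the parser crate.

```rust
/// Returns the single-bit mask for `discriminant`, or `None` when the
/// shift amount is 128 or more and the bit cannot fit in a `u128`.
pub fn checked_mask(discriminant: u32) -> Option<u128> {
    1u128.checked_shl(discriminant)
}

/// Shows what a wrapped shift would yield: `wrapping_shl` keeps only the
/// low 7 bits of the shift amount, so 130 behaves like 2.
pub fn wrapping_mask(discriminant: u32) -> u128 {
    1u128.wrapping_shl(discriminant)
}
```

Here `checked_mask(127)` is `Some(..)` while `checked_mask(128)` is `None`, and `wrapping_mask(130)` equals `wrapping_mask(2)`. In other words, a discriminant of 128 or more either fails outright or ends up sharing a bit with a smaller discriminant.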

This PR:
- Changes the wrapped type for `TokenSet` from `u128` to `[u64; 3]` ~~`[u*; N]` (currently `[u16; 17]`) where `u*` can be any desirable unsigned integer type and `N` is the minimum array length needed to represent all token `SyntaxKind`s without any collisions~~.
- Edit: Add assertion that `TokenSet`s only include token `SyntaxKind`s
- Edit: Add ~7 missing [reserved keywords](https://doc.rust-lang.org/stable/reference/keywords.html#reserved-keywords)
- ~~Moves the definition of the `TokenSet` type to grammar codegen in xtask, so that `N` is adjusted automatically (depending on the chosen `u*` "base" type) when new `SyntaxKind`s are added~~.
- ~~Updates the `token_set_works_for_tokens` unit test to include the `__LAST` `SyntaxKind` as a way of catching overflows in tests.~~
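A rough sketch of the `[u64; 3]` representation with a token-only assertion might look like the following. It is written under stated assumptions rather than copied from the PR: it takes raw `usize` discriminants instead of `SyntaxKind` values, `WideTokenSet` is a hypothetical name, and `LAST_TOKEN_DISCRIMINANT = 130` is a made-up bound standing in for whatever the generated enum defines.

```rust
/// Hypothetical upper bound on token discriminants for this sketch; the
/// real bound would come from the generated `SyntaxKind` enum.
const LAST_TOKEN_DISCRIMINANT: usize = 130;

/// A bit-set backed by three 64-bit words, giving 192 usable bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct WideTokenSet([u64; 3]);

impl WideTokenSet {
    pub const EMPTY: WideTokenSet = WideTokenSet([0; 3]);

    /// Builds a set from raw discriminants (a stand-in for `SyntaxKind`s).
    pub const fn new(discriminants: &[usize]) -> WideTokenSet {
        let mut words = [0u64; 3];
        let mut i = 0;
        while i < discriminants.len() {
            let d = discriminants[i];
            // Only token kinds may be stored in the set.
            assert!(d <= LAST_TOKEN_DISCRIMINANT, "not a token kind");
            // Word index is d / 64, bit index within the word is d % 64.
            words[d / 64] |= 1u64 << (d % 64);
            i += 1;
        }
        WideTokenSet(words)
    }

    /// Union is a word-by-word bit-wise OR.
    pub const fn union(self, other: WideTokenSet) -> WideTokenSet {
        WideTokenSet([
            self.0[0] | other.0[0],
            self.0[1] | other.0[1],
            self.0[2] | other.0[2],
        ])
    }

    /// Non-token discriminants are never members.
    pub const fn contains(&self, d: usize) -> bool {
        d <= LAST_TOKEN_DISCRIMINANT && self.0[d / 64] & (1u64 << (d % 64)) != 0
    }
}
```

With three words, discriminant 130 lands in word 2 at bit 2, so a set containing 130 does not report 2 as a member. `union` stays a fixed three-word OR, which keeps it usable in `const` contexts.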

~~Currently `u16` is arbitrarily chosen as the `u*` "base" type mostly because it strikes a good balance (IMO) between unused bits and readability of the generated `TokenSet` code (especially the [`union` method](7a8374c162/crates/parser/src/token_set.rs (L26-L28))), but I'm open to other suggestions or a better methodology for choosing `u*` type.~~

~~I considered using a third-party crate for the bit-set, but a direct implementation seems simple enough without adding any new dependencies. I'm not strongly opposed to using a third-party crate though, if that's preferred.~~

~~Finally, I haven't had the chance to review issues, to figure out if there are any parser issues caused by collisions due to the current implementation that may be fixed by this PR - I just stumbled upon the issue while adding "new" keywords to solve #16858~~

Edit: fixes #16858

2024-04-16 07:00:12 +00:00

src Auto merge of #17037 - davidsemakula:token-set-collisions, r=Veykril 2024-04-16 07:00:12 +00:00

Cargo.toml Move diagnostics docs generation into xtask/codegen 2024-03-04 17:14:14 +01:00