Rwlock downgrade
Tracking Issue: #128203
This PR adds a `downgrade` method for `RwLock` / `RwLockWriteGuard` on all currently supported platforms.
Outstanding questions:
- [x] ~~Does the `futex.rs` change affect performance at all? It doesn't seem like it will but we can't be certain until we bench it...~~
- [x] ~~Should the SOLID platform implementation [be ported over](https://github.com/rust-lang/rust/pull/128219#discussion_r1693470090) to the `queue.rs` implementation to allow it to support downgrades?~~
Liberate `aarch64-gnu-debug` from the shackles of `--test-args=clang`
### Changes
- Drop `--test-args=clang` from `aarch64-gnu-debug` so run-make tests that are `//@ needs-force-clang-based-tests` no longer only run if their test name contains `clang` (which is a very cool footgun).
- Reorganize run-make-suport library slightly to accommodate a raw gcc invocation.
- Fix `tests/run-make/mte-ffi/rmake.rs` to use `gcc` instead of *a* c compiler.
try-job: aarch64-gnu
try-job: aarch64-gnu-debug
improve codegen of fmt_num to delete unreachable panic
it seems LLVM doesn't realize that `curr` is always decremented at least once in either loop formatting characters of the input string by their appropriate radix, and so the later `&buf[curr..]` generates a check for out-of-bounds access and panic. this is unreachable in reality as even for `x == T::zero()` we'll produce at least the character `Self::digit(T::zero())`, yielding at least one character output, and `curr` will always be at least one below `buf.len()`.
adjust `fmt_int` to make this fact more obvious to the compiler, which fortunately (or unfortunately) results in a measurable performance improvement for workloads heavy on formatting integers.
in the program i'd noticed this in, you can see the `cmp $0x80,%rdi; ja 7c` here, which branches to a slice index fail helper:
<img width="660" alt="before" src="https://github.com/rust-lang/rust/assets/4615790/ac482d54-21f8-494b-9c83-4beadc3ca0ef">
where after this change the function is broadly similar, but smaller, with one fewer registers updated in each pass through the loop in addition the never-taken `cmp/ja` being gone:
<img width="646" alt="after" src="https://github.com/rust-lang/rust/assets/4615790/1bee1d76-b674-43ec-9b21-4587364563aa">
this represents a ~2-3% difference in runtime in my [admittedly comically i32-formatting-bound](https://github.com/athre0z/disas-bench/blob/master/bench/yaxpeax/src/main.rs#L58-L67) use case (printing x86 instructions, including i32 displacements and immediates) as measured on a ryzen 9 3950x.
the impact on `<impl LowerHex for i8>::fmt` is both more dramatic and less impactful: it continues to have a loop that is evaluated at most twice, though the compiler doesn't know that to unroll it. the generated code there is identical to the impl for `i32`. there, the smaller loop body has less effect on runtime, and removing the never-taken slice bounds check is offset by whatever address recalculation is happening with the `lea/add/neg` at the end of the loop. it behaves about the same before and after.
---
i initially measured slightly better outcomes using `unreachable_unchecked()` here instead, but that was hacking on std and rebuilding with `-Z build-std` on an older rustc (nightly 5b377cece, 2023-06-30). it does not yield better outcomes now, so i see no reason to proceed with that approach at all.
<details>
<summary>initial notes about that, seemingly irrelevant on modern rustc</summary>
i went through a few tries at getting llvm to understand the bounds check isn't necessary, but i should mention the _best_ i'd seen here was actually from the existing `fmt_int` with a diff like
```diff
if x == zero {
// No more digits left to accumulate.
break;
};
}
}
+
+ if curr >= buf.len() {
+ unsafe { core::hint::unreachable_unchecked(); }
+ }
let buf = &buf[curr..];
```
posting a random PR to `rust-lang/rust` to do that without a really really compelling reason seemed a bit absurd, so i tried to work that into something that seems more palatable at a glance. but if you're interested, that certainly produced better (x86_64) code through LLVM. in that case with `buf.iter_mut().rev()` as the iterator, `<impl LowerHex for i8>::fmt` actually unrolls into something like
```
put_char(x & 0xf);
let mut len = 1;
if x > 0xf {
put_char((x >> 4) & 0xf);
len = 2;
}
pad_integral(buf[buf.len() - len..]);
```
it's pretty cool! `<impl LowerHex for i32>::fmt` also was slightly better. that all resulted in closer to an 6% difference in my use case.
</details>
---
i have not looked at formatters other than LowerHex/UpperHex with this change, though i'd be a bit shocked if any were _worse_.
(i have absolutely _no_ idea how you'd regression test this, but that might be just my not knowing what the right tool for that would be in rust-lang/rust. i'm of half a mind that this is small and fiddly enough to not be worth landing lest it quietly regress in the future anyway. but i didn't want to discard the idea without at least offering it upstream here)
[perf] rustdoc: Perform less work when cleaning middle::ty parenthesized generic args
CC #132697. I presume the perf regression it caused (if real) boils down to query invocation overhead, namely of `def_kind` & `trait_def` as we don't seem to be decoding more often from the crate metadata.
I won't try the obvious and reduce the amount of query calls by threading information via params as that would render the code awkward.
So instead I'm simply trying to attack some low-hanging fruits in the vicinity.
---
Previously, we would `clean_middle_generic_args` *unconditionally* inside `clean_middle_generic_args_with_constraints` even though we didn't actually use its result for parenthesized generic args (`Trait(...) -> ...`).
Now, we only call `clean_middle_generic_args` when necessary. Lastly, I've simplified `clean_middle_generic_args_with_constraints`.
---
r? ghost
Delete the `cfg(not(parallel))` serial compiler
Since it's inception a long time ago, the parallel compiler and its cfgs have been a maintenance burden. This was a necessary evil the allow iteration while not degrading performance because of synchronization overhead.
But this time is over. Thanks to the amazing work by the parallel working group (and the dyn sync crimes), the parallel compiler has now been fast enough to be shipped by default in nightly for quite a while now.
Stable and beta have still been on the serial compiler, because they can't use `-Zthreads` anyways.
But this is quite suboptimal:
- the maintenance burden still sucks
- we're not testing the serial compiler in nightly
Because of these reasons, it's time to end it. The serial compiler has served us well in the years since it was split from the parallel one, but it's over now.
Let the knight slay one head of the two-headed dragon!
#113349
Note that the default is still 1 thread, as more than 1 thread is still fairly broken.
cc `@onur-ozkan` to see if i did the bootstrap field removal correctly, `@SparrowLii` on the sync parts
move all mono-time checks into their own folder, and their own query
The mono item collector currently also drives two mono-time checks: the lint for "large moves", and the check whether function calls are done with all the required target features.
Instead of doing this "inside" the collector, this PR refactors things so that we have a new `rustc_monomorphize::mono_checks` module providing a per-instance query that does these checks. We already have a per-instance query for the ABI checks, so this should be "free" for incremental builds. Non-incremental builds might do a bit more work now since we now have two separate MIR visits (in the collector and the mono-time checks) -- but one of them is cached in case the MIR doesn't change, which is nice.
This slightly changes behavior of the large-move check since the "move_size_spans" deduplication logic now only works per-instance, not globally across the entire collector.
Cc `@saethlin` since you're also doing some work related to queries and caching and monomorphization, though I don't know if there's any interaction here.
Emit warning when calling/declaring functions with unavailable vectors.
On some architectures, vector types may have a different ABI depending on whether the relevant target features are enabled. (The ABI when the feature is disabled is often not specified, but LLVM implements some de-facto ABI.)
As discussed in rust-lang/lang-team#235, this turns out to very easily lead to unsound code.
This commit makes it a post-monomorphization future-incompat warning to declare or call functions using those vector types in a context in which the corresponding target features are disabled, if using an ABI for which the difference is relevant. This ensures that these functions are always called with a consistent ABI.
See the [nomination comment](https://github.com/rust-lang/rust/pull/127731#issuecomment-2288558187) for more discussion.
Part of #116558
r? RalfJung
Trim and tidy includes in `rustc_llvm`
These includes tend to accumulate over time, and are usually only removed when something breaks in a new LLVM version, so it's nice to clean them up manually once in a while.
General strategy used for this PR:
- Remove all includes from `LLVMWrapper.h` that aren't needed by the header itself, transplanting them to individual source files as necessary.
- For each source file, temporarily remove each include if doing so doesn't cause a compile error.
- If a “required” include looks like it shouldn't be needed, try replacing it with its sub-includes, then trim that list.
- After doing all of the above, go back and re-add any removed include if the file does actually use things defined in that header, even if the header happens to also be included by something else.
Revert using `HEAP` static in Windows alloc
Fixes#131468
This does the minimum to remove the `HEAP` static that was causing chromium issues. It would be worth having a more substantial look at this module but for now I think this addresses the immediate issue.
cc `@danakj`
Only disable cache if predicate has opaques within it
This is an alternative to https://github.com/rust-lang/rust/pull/132075.
This refines the check implemented in https://github.com/rust-lang/rust/pull/126024 to only disable the global cache if the predicate being considered has opaques in it. This is still theoretically unsound, since goals can indirectly rely on opaques in the defining scope, but we're much less likely to hit it.
It doesn't totally fix https://github.com/rust-lang/rust/issues/132064: for example, `lemmy` goes from 1:29 (on rust 1.81) to 9:53 (on nightly) to 4:07 (after this PR). But I think it's at least *more* sound than a total revert :/
r? lcnr