bevy/crates
Gonçalo Rica Pais da Silva 9c0fca072d
Optimise Entity with repr align & manual PartialOrd/Ord (#10558)
# Objective

- Follow up on https://github.com/bevyengine/bevy/pull/10519, diving
deeper into optimising `Entity`, because the `derive`d `PartialOrd`
`partial_cmp` produces suboptimal codegen:
https://github.com/rust-lang/rust/issues/106107
- Fixes #2346.

## Solution

Given the previous PR's solution and the other existing LLVM codegen
bug, a further optimisation of `Entity` seemed possible. A manual
`PartialOrd` impl on its own did not initially produce better codegen
than the derived version. However, once `Entity` was given
`#[repr(align(8))]`, the codegen improved remarkably, and improved again
once the fields in `Entity` were rearranged to correspond to a `u64`
layout (Rust doesn't seem to reorder the fields this way automatically).
The field order and `align(8)` additions also reduced `to_bits` codegen
to a single `mov` op. In turn, this led me to replace the previous
"non-shortcircuiting" impl of `PartialEq::eq` with a direct `to_bits`
comparison.

The result was remarkably better codegen across the board, even for
hashtable lookups.

The current baseline codegen is as follows:
https://godbolt.org/z/zTW1h8PnY

Assuming the following example struct that mirrors the existing
`Entity` definition:

```rust
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct FakeU64 {
    high: u32,
    low: u32,
}
```

the output for `to_bits` is as follows:

```
example::FakeU64::to_bits:
        shl     rdi, 32
        mov     eax, esi
        or      rax, rdi
        ret
```

Changing the struct to:
```rust
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}
```
and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`,
`to_bits` now optimises to:
```
example::FakeU64::to_bits:
        mov     rax, rdi
        ret
```
The full codegen example for this PR is here for reference:
https://godbolt.org/z/n4Mjx165a
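
The manual implementations themselves are not reproduced in this description. A minimal sketch, assuming they simply delegate to `to_bits`, might look like the following. The `new` constructor and the `greater_than` helper are illustrative names, and the linked Godbolt example remains the authoritative version.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for `Entity`, using the field order from the
// example above (low half first, 8-byte alignment).
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}

impl FakeU64 {
    // Illustrative constructor; not part of the original example.
    pub const fn new(high: u32, low: u32) -> Self {
        Self { low, high }
    }

    // Combine both halves into a single 64-bit value.
    pub const fn to_bits(self) -> u64 {
        ((self.high as u64) << 32) | (self.low as u64)
    }
}

impl PartialEq for FakeU64 {
    #[inline]
    fn eq(&self, other: &Self) -> bool {
        // One 64-bit comparison instead of two 32-bit ones.
        self.to_bits() == other.to_bits()
    }
}

impl PartialOrd for FakeU64 {
    #[inline]
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        // Delegate to `Ord` so both traits always agree.
        Some(self.cmp(other))
    }
}

impl Ord for FakeU64 {
    #[inline]
    fn cmp(&self, other: &Self) -> Ordering {
        self.to_bits().cmp(&other.to_bits())
    }
}

// Helper mirroring the `greater_than` symbol in the assembly listings.
pub fn greater_than(a: FakeU64, b: FakeU64) -> bool {
    a > b
}
```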

To highlight, `gt` comparison goes from
```
example::greater_than:
        cmp     edi, edx
        jae     .LBB3_2
        xor     eax, eax
        ret
.LBB3_2:
        setne   dl
        cmp     esi, ecx
        seta    al
        or      al, dl
        ret
```
to
```
example::greater_than:
        cmp     rdi, rsi
        seta    al
        ret
```

As explained on Discord by @scottmcm :

>The root issue here, as far as I understand it, is that LLVM's
middle-end is inexplicably unwilling to merge loads if that would make
them under-aligned. It leaves that entirely up to its target-specific
back-end, and thus a bunch of the things that you'd expect it to do that
would fix this just don't happen.

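To make the alignment point concrete, the sketch below compares the size and alignment of the two layouts. `Unaligned` and `Aligned` are hypothetical names for the two `FakeU64` variants shown earlier. The values in the comments are what rustc reports on typical 64-bit targets; the default representation does not guarantee them.

```rust
use std::mem::{align_of, size_of};

// Original layout: two `u32` fields, so the struct is only 4-byte aligned.
#[allow(dead_code)]
pub struct Unaligned {
    high: u32,
    low: u32,
}

// Layout from this PR: same fields, but forced to 8-byte alignment.
#[allow(dead_code)]
#[repr(align(8))]
pub struct Aligned {
    low: u32,
    high: u32,
}

// Returns (size, alignment) for `Unaligned`, `Aligned` and `u64`.
pub fn layout_report() -> [(usize, usize); 3] {
    [
        (size_of::<Unaligned>(), align_of::<Unaligned>()), // typically (8, 4)
        (size_of::<Aligned>(), align_of::<Aligned>()),     // (8, 8)
        (size_of::<u64>(), align_of::<u64>()),             // (8, 8) on x86-64
    ]
}
```

With only 4-byte alignment, a single 8-byte load of the struct would be under-aligned, which is the case the quote describes LLVM's middle-end declining to handle.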
## Benchmarks

Before discussing benchmarks, everything was tested on the following
specs:

- AMD Ryzen 7950X 16C/32T CPU
- 64GB 5200 RAM
- AMD RX7900XT 20GB Gfx card
- Manjaro KDE on Wayland

I made use of the new entity hashing benchmarks to see how this PR would
improve things there. I first benchmarked an implementation with the
alignment and field ordering changes that kept the existing
"non-shortcircuiting" `PartialEq` implementation; this is the
`ord_shortcircuit` column. The `to_bits` `PartialEq` implementation is
the `ord_to_bits` column. The `main_ord` column is the existing baseline
from the `main` branch.


![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261)

My machine is not really set up for benchmarking, so some results are
within noise, but there is a clear improvement with the
non-shortcircuiting implementation over the baseline, and a further
improvement with the `to_bits` implementation.

On my machine, a fair number of the stress tests were not showing any
difference (indicating other bottlenecks), but I was able to get a clear
difference with `many_foxes` with a fox count of 10,000:

Test with `cargo run --example many_foxes --features
bevy/trace_tracy,wayland --release -- --count 10000`:


![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3)

On average, a framerate of about 28-29 FPS improved to 30-32 FPS. "This
trace" represents the current PR's perf, while "External trace"
represents the `main` branch baseline.

## Changelog

Changed: micro-optimised `Entity` alignment and field ordering, and
provided manual `PartialOrd`/`Ord` impls, to help LLVM optimise further.

## Migration Guide

Any `unsafe` code relying on field ordering of `Entity` or sufficiently
cursed shenanigans should change to reflect the different internal
representation and alignment requirements of `Entity`.
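
One way to avoid depending on the internal layout at all is to go through the integer representation instead of casting pointers or transmuting. The sketch below uses a hypothetical `FakeU64` stand-in with `to_bits`/`from_bits` helpers, and `write_id`/`read_id` are illustrative names. It is a sketch of the idea under those assumptions, not code from this PR.

```rust
// Hypothetical stand-in for `Entity` with the new layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}

impl FakeU64 {
    pub const fn to_bits(self) -> u64 {
        ((self.high as u64) << 32) | (self.low as u64)
    }

    pub const fn from_bits(bits: u64) -> Self {
        Self {
            low: bits as u32,          // truncates to the lower 32 bits
            high: (bits >> 32) as u32, // upper 32 bits
        }
    }
}

// Write the id into a byte buffer without relying on field order
// or on the buffer being 8-byte aligned.
pub fn write_id(id: FakeU64, out: &mut [u8; 8]) {
    *out = id.to_bits().to_le_bytes();
}

// Read the id back; no pointer casts, so alignment is not a concern.
pub fn read_id(bytes: &[u8; 8]) -> FakeU64 {
    FakeU64::from_bits(u64::from_le_bytes(*bytes))
}
```

Because the bytes only ever pass through `u64`, this keeps working if the field order or alignment changes again.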

Co-authored-by: james7132 <contact@jamessliu.com>
Co-authored-by: NathanW <nathansward@comcast.net>
2023-11-18 20:04:37 +00:00
bevy_a11y Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_animation Add Debug, PartialEq and Eq derives to bevy_animation. (#10562) 2023-11-15 14:05:04 +00:00
bevy_app add regression test for #10385/#10389 (#10609) 2023-11-18 12:53:09 +00:00
bevy_asset Do not panic when failing to create assets folder (#10613) (#10614) 2023-11-17 22:06:08 +00:00
bevy_audio Add VolumeLevel::ZERO (#10608) 2023-11-17 15:15:56 +00:00
bevy_core Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_core_pipeline Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_derive Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_diagnostic Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_dylib Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_dynamic_plugin Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_ecs Optimise Entity with repr align & manual PartialOrd/Ord (#10558) 2023-11-18 20:04:37 +00:00
bevy_ecs_compile_fail_tests Updates for rust 1.73 (#10035) 2023-10-06 00:31:10 +00:00
bevy_encase_derive Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_gilrs Update Event send methods to return EventId (#10551) 2023-11-16 17:20:43 +00:00
bevy_gizmos Gizmo Arrows (#10550) 2023-11-15 14:19:15 +00:00
bevy_gltf Explicit color conversion methods (#10321) 2023-11-15 16:47:32 +00:00
bevy_hierarchy Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_input Update Event send methods to return EventId (#10551) 2023-11-16 17:20:43 +00:00
bevy_internal Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_log Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_macro_utils Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_macros_compile_fail_tests bevy_derive: Fix #[deref] breaking other attributes (#9551) 2023-08-28 17:36:18 +00:00
bevy_math Add and impl Primitives (#10580) 2023-11-17 20:10:30 +00:00
bevy_mikktspace Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_pbr Ensure ExtendedMaterial works with reflection (to enable bevy_egui_inspector integration) (#10548) 2023-11-15 12:48:36 +00:00
bevy_ptr Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_reflect Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_reflect_compile_fail_tests Improve TypeUuid's derive macro error messages (#9315) 2023-10-02 12:42:01 +00:00
bevy_render Explicit color conversion methods (#10321) 2023-11-15 16:47:32 +00:00
bevy_scene Use handles for queued scenes in SceneSpawner (#10619) 2023-11-18 01:27:05 +00:00
bevy_sprite Add PartialEq to Anchor (#10424) 2023-11-07 08:36:10 +00:00
bevy_tasks Make FakeTask public on singlethreaded context (#10517) 2023-11-15 14:29:43 +00:00
bevy_text Improved Text Rendering (#10537) 2023-11-14 13:44:25 +00:00
bevy_time Add discard_overstep function to Time<Fixed> (#10453) 2023-11-18 00:50:26 +00:00
bevy_transform Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_ui Fix panic when using image in UiMaterial (#10591) 2023-11-16 21:31:25 +00:00
bevy_utils Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_window Release 0.12 (#10362) 2023-11-04 17:24:23 +00:00
bevy_winit Revert App::run() behavior/Remove winit specific code from bevy_app (#10389) 2023-11-16 21:50:17 +00:00