# Objective
- Fixes #16078
## Solution
- Rename things to clarify that we _want_ unclipped depth for
directional light shadow views, and need some way of disabling the GPU's
built-in depth clipping
- Use DEPTH_CLIP_CONTROL instead of the fragment shader emulation on
supported platforms (see the sketch after this list)
- Pass only the clip position depth instead of the whole clip position
between the vertex and fragment shaders (unclear whether this helps
performance; the compiler might optimize it anyway)
- Meshlets
- HW raster always uses DEPTH_CLIP_CONTROL since it targets a more
limited set of platforms
- SW raster was not handling DEPTH_CLAMP_ORTHO correctly; it ended up
doing essentially nothing.
- This PR made me realize that SW raster technically should have depth
clipping for all views that are not directional light shadows, but I
decided not to write it. I'm not sure it ever matters in practice; if
proven otherwise, I can add it.
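
For illustration, a minimal sketch (not Bevy's actual pipeline code) of how the two paths could be selected, using wgpu's `Features::DEPTH_CLIP_CONTROL` and `PrimitiveState::unclipped_depth`; the shader def name matches this PR, the helper function is hypothetical:

```rust
use wgpu::{Features, PrimitiveState};

// Hypothetical helper: pick hardware unclipped depth when available,
// otherwise fall back to the fragment shader emulation path.
fn shadow_view_depth_handling(features: Features) -> (PrimitiveState, Vec<&'static str>) {
    let mut shader_defs = Vec::new();
    let mut primitive = PrimitiveState::default();
    if features.contains(Features::DEPTH_CLIP_CONTROL) {
        // Disable the GPU's built-in depth clipping entirely.
        primitive.unclipped_depth = true;
    } else {
        // Emulate unclipped depth by clamping depth in the fragment shader.
        shader_defs.push("UNCLIPPED_DEPTH_ORTHO_EMULATION");
    }
    (primitive, shader_defs)
}
```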
## Testing
- Did you test these changes? If so, how?
- Lighting example. Both opaque (no fragment shader) and alpha masked
geometry (fragment shader emulation) are working with
depth_clip_control, and both work when it's turned off. Also tested
meshlet example.
- Are there any parts that need more testing?
- Performance. I can't figure out a good test scene.
- How can other people (reviewers) test your changes? Is there anything
specific they need to know?
- Toggle depth_clip_control_supported in prepass/mod.rs line 323 to turn
this PR on or off.
- If relevant, what platforms did you test these changes on, and are
there any important ones you can't test?
- Native
---
## Migration Guide
- `MeshPipelineKey::DEPTH_CLAMP_ORTHO` is now
`MeshPipelineKey::UNCLIPPED_DEPTH_ORTHO`
- The `DEPTH_CLAMP_ORTHO` shader def has been renamed to
`UNCLIPPED_DEPTH_ORTHO_EMULATION`
- `clip_position_unclamped: vec4<f32>` is now `unclipped_depth: f32`
# Objective
- wgpu 0.20 stopped zero-initializing workgroup variables by default.
This broke some applications (cough foresight cough), which now have to
work around it. wgpu exposes a compilation option that zero-initializes
workgroup memory by default, but Bevy does not expose it.
## Solution
- Expose the compilation option wgpu gives us (sketch at the end of the
migration guide below)
## Testing
- Ran examples: 3d_scene, compute_shader_game_of_life, gpu_readback,
lines, specialized_mesh_pipeline. They all work.
- Confirmed the fix for our own problems.
---
## Migration Guide
- Add `zero_initialize_workgroup_memory: false` to
`ComputePipelineDescriptor` or `RenderPipelineDescriptor` structs to
preserve the Bevy 0.14 behavior, or add
`zero_initialize_workgroup_memory: true` to restore the Bevy 0.13
behavior.
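
A minimal sketch of the migration, assuming a Bevy 0.15-style descriptor with the new field; `base` stands in for however the descriptor is already being built (hypothetical helper):

```rust
use bevy::render::render_resource::ComputePipelineDescriptor;

// Hypothetical helper: opt back into the Bevy 0.13 behavior.
fn with_zeroed_workgroup_memory(base: ComputePipelineDescriptor) -> ComputePipelineDescriptor {
    ComputePipelineDescriptor {
        // true  = restore Bevy 0.13 behavior (workgroup memory zero-initialized)
        // false = keep the Bevy 0.14 behavior (uninitialized workgroup memory)
        zero_initialize_workgroup_memory: true,
        ..base
    }
}
```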
# Objective
- Make the meshlet fill cluster buffers pass slightly faster
- Address https://github.com/bevyengine/bevy/issues/15920 for meshlets
- Added PreviousGlobalTransform as a required meshlet component to avoid
extra archetype moves, slightly alleviating
https://github.com/bevyengine/bevy/issues/14681 for meshlets
- Enforce that `MeshletPlugin::cluster_buffer_slots` is not greater than
2^25 (glitches will occur otherwise; see the sketch below). Technically
this field controls the post-LOD/culling cluster count, while the issue
is with the pre-LOD/culling cluster count, but the check is still valid
now, and in the future this will be more true.
Needs to be merged after https://github.com/bevyengine/bevy/pull/15846
and https://github.com/bevyengine/bevy/pull/15886
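
A minimal sketch of the new limit check (names match this PR's description; the exact assert in Bevy may differ):

```rust
// Values above 2^25 cause glitches (see the objective above).
const MAX_CLUSTER_BUFFER_SLOTS: u32 = 1 << 25;

fn validate_cluster_buffer_slots(cluster_buffer_slots: u32) {
    assert!(
        cluster_buffer_slots <= MAX_CLUSTER_BUFFER_SLOTS,
        "MeshletPlugin::cluster_buffer_slots must not exceed 2^25, got {cluster_buffer_slots}"
    );
}
```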
## Solution
- The old pass dispatched one thread per cluster and did a binary search
over the instances to find which instance each cluster belongs to, and
which meshlet index within the instance it is.
- The new pass dispatches one workgroup per instance, and has the
workgroup loop over all meshlets in the instance in order to write out
the cluster data (see the sketch after this list).
- Use a push constant instead of arrayLength to fix the linked bug
- Remap the 1D dispatch to 2D for software raster only if actually
needed, to save on spawning excess workgroups
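
A CPU-side Rust sketch of the new pass's logic (the real version is a WGSL compute shader with one workgroup per instance; all names here are illustrative):

```rust
// One "workgroup" (here: loop iteration) per instance; each instance writes
// out the IDs for all of its own clusters, so no binary search is needed.
fn fill_cluster_buffers(
    instance_meshlet_counts: &[u32],       // clusters per instance
    instance_meshlet_slice_starts: &[u32], // first meshlet index per instance
    cluster_instance_ids: &mut [u32],
    cluster_meshlet_ids: &mut [u32],
) {
    let mut cluster = 0usize; // on the GPU this offset comes from a prefix sum
    for (instance_id, &count) in instance_meshlet_counts.iter().enumerate() {
        for i in 0..count {
            cluster_instance_ids[cluster] = instance_id as u32;
            cluster_meshlet_ids[cluster] = instance_meshlet_slice_starts[instance_id] + i;
            cluster += 1;
        }
    }
}
```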
## Testing
- Did you test these changes? If so, how?
- Ran the meshlet example, and an example with 1041 instances of 32217
meshlets per instance. Profiled the second scene with Nsight; it went
from 0.55ms -> 0.40ms. Small savings; we're pretty much VRAM bandwidth
bound at this point.
- How can other people (reviewers) test your changes? Is there anything
specific they need to know?
- Run the meshlet example
## Changelog (non-meshlets)
- `PreviousGlobalTransform` now implements the `Default` trait
# Objective
1. Prevent weird glitches with stray pixels scattered around the scene
![image](https://github.com/user-attachments/assets/f12adb38-5996-4dc7-bea6-bd326b7317e1)
2. Prevent weird glitchy full-screen triangles that pop up and destroy
perf (SW rasterizing huge triangles is slow)
![image](https://github.com/user-attachments/assets/d3705427-13a5-47bc-a54b-756f0409da0b)
## Solution
1. Use floating point math in the SW rasterizer bounding box calculation
to handle negative vertex coordinates, and add backface culling
2. Force hardware raster for clusters that clip the near plane, and let
the hardware rasterizer handle the clipping
I also adjusted the SW rasterizer threshold to < 64 pixels (a little
better perf in my test scene, but a more comprehensive test is still
needed), and enabled backface culling for the hardware raster pipeline.
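
A Rust sketch of the two rasterizer fixes (the real code is WGSL; `glam::Vec2` stands in for screen-space vertex positions, and the winding/sign convention is illustrative):

```rust
use glam::Vec2;

// Returns false if the triangle can be skipped before rasterization.
fn triangle_survives(v0: Vec2, v1: Vec2, v2: Vec2, viewport: Vec2) -> bool {
    // Backface culling: the signed double area flips sign for backfacing
    // triangles (which sign is "front" depends on winding and y-direction).
    let double_area = (v1 - v0).perp_dot(v2 - v0);
    if double_area <= 0.0 {
        return false;
    }
    // Floating point bounding box: min can legitimately go negative here,
    // which the old integer math mishandled; clamp to the viewport instead.
    let min = v0.min(v1).min(v2).max(Vec2::ZERO);
    let max = v0.max(v1).max(v2).min(viewport - 1.0);
    min.x <= max.x && min.y <= max.y
}
```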
## Testing
- Did you test these changes? If so, how?
- Yes, on an example scene. Issues no longer occur.
- Are there any parts that need more testing?
- No.
- How can other people (reviewers) test your changes? Is there anything
specific they need to know?
- Run the meshlet example.
# Objective
- Faster meshlet rasterization path for small triangles
- Avoid having to allocate and write out a triangle buffer
- Refactor gpu_scene.rs
## Solution
- Replace the 32-bit visbuffer texture with a 64-bit visbuffer buffer,
where the high 32 bits encode depth, and the low 32 bits encode the
existing cluster + triangle IDs (see the packing sketch after this
list). We can't use 64-bit textures; wgpu/naga doesn't support atomic
ops on textures yet.
- Instead of writing out a buffer of packed cluster + triangle IDs (per
triangle) to raster, the culling pass now writes out a buffer of just
cluster IDs (per cluster, so less memory allocated, cheaper to write
out).
- Clusters for software raster are allocated from the left side
- Clusters for hardware raster are allocated in the same buffer, from
the right side
- The buffer size is fixed at `MeshletPlugin` build time, and should be
set to a reasonable value for your scene (there is no warning on
overflow, and no good way to determine what value you need outside of
RenderDoc - I plan to fix this in a future PR adding a meshlet stats
overlay)
- Currently I don't have a heuristic for software vs hardware raster
selection for each cluster. The existing code is just a placeholder. I
need to profile on a release scene and come up with a heuristic,
probably in a future PR.
- The culling shader is getting pretty hard to follow at this point, but
I don't want to spend time improving it as the entire shader/pass is
getting rewritten/replaced in the near future.
- Software raster is a compute workgroup per-cluster. Each workgroup
loads and transforms the <=64 vertices of the cluster, and then
rasterizes the <=64 triangles of the cluster.
- Two variants are implemented: scanline for clusters with any larger
triangles (still smaller than what hardware is good at), and brute force
for very tiny triangles
- Once the shader determines that a pixel should be filled in, it does
an atomicMax() on the visbuffer to store the results, copying how Nanite
works
- On devices with a low max-workgroups-per-dispatch limit, an extra
compute pass is inserted before software raster to convert from a 1D to
a 2D dispatch (I don't think 3D would ever be necessary).
- I haven't implemented the top-left rule or subpixel precision yet, I'm
leaving that for a future PR since I get usable results without it for
now
- Resources used:
https://kristoffer-dyrkorn.github.io/triangle-rasterizer and chapters
6-8 of
https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index
- Hardware raster now spawns 64*3 vertex invocations per meshlet,
instead of the actual meshlet vertex count. Extra invocations just
early-exit.
- While this is slower than the existing system, hardware draws should
be rare now that software raster is usable, and it saves a ton of memory
using the unified cluster ID buffer. This would be fixed if wgpu had
support for mesh shaders.
- Instead of writing to a color+depth attachment, the hardware raster
pass also does the same atomic visbuffer writes that software raster
uses.
- We have to bind a dummy render target anyway, as wgpu doesn't
currently support render passes without any attachments
- Material IDs are no longer written out during the main rasterization
passes.
- If we had async compute queues, we could overlap the software and
hardware raster passes.
- New material and depth resolve passes run at the end of the visbuffer
node, and write out view depth and material ID depth textures
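
A Rust sketch of the 64-bit visbuffer packing both raster paths use (the real writes are WGSL atomics; the payload layout here is illustrative, and reverse-z depth is assumed):

```rust
// Pack depth into the high 32 bits so a single atomicMax keeps the winning
// fragment's depth and cluster + triangle IDs together.
fn pack_visbuffer_entry(depth: f32, cluster_id: u32, triangle_id: u32) -> u64 {
    // For non-negative floats the bit pattern orders the same as the value,
    // and with reverse-z a larger depth is nearer the camera.
    let depth_bits = depth.to_bits() as u64;
    // Illustrative layout: triangle ID in the low bits, cluster ID above it.
    let payload = ((cluster_id as u64) << 7) | triangle_id as u64;
    (depth_bits << 32) | payload
}
```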
### Misc changes
- Fixed cluster culling importing, but never actually using, the
previous view uniforms when doing occlusion culling
- Fixed incorrectly adding the LOD error twice when building the meshlet
mesh
- Split up the gpu_scene module into meshlet_mesh_manager,
instance_manager, and resource_manager
- resource_manager is still too complex and inefficient (extract and
prepare are way too expensive). I plan on improving this in a future PR,
but for now ResourceManager is mostly a 1:1 port of the leftover
MeshletGpuScene bits.
- Material draw passes have been renamed to the more accurate "material
shade passes", along with some other misc renaming (in the future, these
will even be compute shaders, not actual draw calls)
---
## Migration Guide
- TBD (ask me at the end of the release for meshlet changes as a whole)
---------
Co-authored-by: vero <email@atlasdostal.com>
* Rename cull_meshlets -> cull_clusters
* Rename meshlet_visible -> cluster_visible
* Add an if statement around meshlet_second_pass_candidates writes,
maybe a small performance win.
# Objective
- Using multiple raster passes to generate the depth pyramid is
extremely slow
- Pulling data from the source image is the largest bottleneck, it's
important to sample in a cache-aware pattern
- Barriers and pipeline drains between the raster passes are the second
largest bottleneck
- Each separate RenderPass on the CPU is _really_ expensive
## Solution
- Port [FidelityFX SPD](https://gpuopen.com/fidelityfx-spd) to WGSL,
replacing meshlet's existing multiple raster passes with a ~~single~~
two compute dispatches. The lack of coherent buffers means we have to do
the last 64x64 tile from mip 7+ in a separate dispatch to ensure the mip
6 writes were flushed :(
- Workgroup shared memory version only at the moment, as the
subgroup-operations version is blocked on our upgrade to wgpu 0.20
(#13186)
- Don't enforce a power-of-2 depth pyramid texture size; simply scaling
each mip by 0.5 is fine (see the sketch below)
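
A sketch of the relaxed sizing, assuming the pyramid just halves (flooring) each level instead of rounding the source up or down to a power of 2:

```rust
// Compute the mip sizes for a depth pyramid over a width x height depth
// buffer; each level is half the previous, clamped to at least 1 texel.
fn depth_pyramid_mip_sizes(mut width: u32, mut height: u32) -> Vec<(u32, u32)> {
    let mut sizes = Vec::new();
    while width > 1 || height > 1 {
        width = (width / 2).max(1);
        height = (height / 2).max(1);
        sizes.push((width, height));
    }
    sizes
}
```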
# Objective
- Per-cluster (instance of a meshlet) data upload is ridiculously
expensive in both CPU and GPU time (8 bytes per cluster, millions of
clusters; you very quickly run into PCIe bandwidth maximums, plus lots
of CPU-side copies and mallocs).
- We need to be uploading only per-instance/entity data. Anything else
needs to be done on the GPU.
## Solution
- Per instance, upload:
- `meshlet_instance_meshlet_counts_prefix_sum` - An exclusive prefix sum
over the count of how many clusters each instance has.
- `meshlet_instance_meshlet_slice_starts` - The starting index of the
meshlets for each instance within the `meshlets` buffer.
- A new `fill_cluster_buffers` pass, run once at the start of the frame,
has a thread per cluster. Each thread does a binary search of
`meshlet_instance_meshlet_counts_prefix_sum` to find which instance it
belongs to, then uses that plus `meshlet_instance_meshlet_slice_starts`
to find which meshlet within the instance it is (see the sketch after
this list). The shader then writes out the per-cluster instance/meshlet
ID buffers for later passes to quickly read from.
- I've gone from 45 -> 180 FPS in my stress test scene, and saved
~30ms/frame of overall CPU/GPU time.
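
A Rust sketch of the per-thread lookup (the real version is WGSL; `partition_point` plays the role of the binary search):

```rust
// Given a cluster ID, recover which instance owns it and which meshlet
// within the `meshlets` buffer it refers to.
fn cluster_ids(
    cluster_id: u32,
    counts_prefix_sum: &[u32], // exclusive prefix sum of per-instance cluster counts
    slice_starts: &[u32],      // first meshlet index per instance
) -> (u32, u32) {
    // The owning instance is the last one whose prefix sum is <= cluster_id.
    let instance_id = counts_prefix_sum.partition_point(|&s| s <= cluster_id) - 1;
    let meshlet_within_instance = cluster_id - counts_prefix_sum[instance_id];
    let meshlet_id = slice_starts[instance_id] + meshlet_within_instance;
    (instance_id as u32, meshlet_id)
}
```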
Keeping track of explicit visibility per cluster between frames does not
work with LODs, and leads to worse culling (using the final depth buffer
from the previous frame is more accurate).
Instead, we need to generate a second depth pyramid after the second
raster pass, and then use that in the first culling pass in the next
frame to test if a cluster would have been visible last frame or not.
As part of these changes, the write_index_buffer pass has been folded
into the culling pass for a large performance gain, and to avoid
tracking a lot of extra state that would be needed between passes.
Prepass previous model/view stuff was adapted to work with meshlets as
well.
Also fixed a bug with materials, and other misc improvements.
---------
Co-authored-by: François <mockersf@gmail.com>
Co-authored-by: atlas dostal <rodol@rivalrebels.com>
Co-authored-by: vero <email@atlasdostal.com>
Co-authored-by: Patrick Walton <pcwalton@mimiga.net>
Co-authored-by: Robert Swain <robert.swain@gmail.com>