Mirrors/bevy

mirror of https://github.com/bevyengine/bevy synced 2025-02-19 15:38:36 +00:00

Author	SHA1	Message	Date
JMS55	a0faf9cd01	More triangles/vertices per meshlet (#15023 ) ### Builder changes - Increased meshlet max vertices/triangles from 64v/64t to 255v/128t (meshoptimizer won't allow 256v sadly). This gives us a much greater percentage of meshlets with max triangle count (128). Still not perfect, we still end up with some tiny <=10 triangle meshlets that never really get simplified, but it's progress. - Removed the error target limit. Now we allow meshoptimizer to simplify as much as possible. No reason to cap this out, as the cluster culling code will choose a good LOD level anyways. Again leads to higher quality LOD trees. - After some discussion and consulting the Nanite slides again, changed meshlet group error from _adding_ the max child's error to the group error, to doing `group_error = max(group_error, max_child_error)`. Error is already cumulative between LODs as the edges we're collapsing during simplification get longer each time. - Bumped the 65% simplification threshold to allow up to 95% of the original geometry (e.g. accept simplification as valid even if we only simplified 5% of the triangles). This gives us closer to log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the DAG. Still more work to be done in the future here. Maybe trying METIS for meshlet building instead of meshoptimizer. Using ~8 clusters per group instead of ~4 might also make a big difference. The Nanite slides say that they have 8-32 meshlets per group, suggesting some kind of heuristic. Unfortunately meshopt's compute_cluster_bounds won't work with large groups atm (https://github.com/zeux/meshoptimizer/discussions/750#discussioncomment-10562641) so hard to test. Based on discussion from https://github.com/bevyengine/bevy/discussions/14998, https://github.com/zeux/meshoptimizer/discussions/750, and discord. ### Runtime changes - cluster:triangle packed IDs are now stored 25:7 instead of 26:6 bits, as max triangles per cluster are now 128 instead of 64 - Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices to account for the new max triangles limit - Hardware raster now outputs NaN triangles (0 / 0) instead of zero-positioned triangles for extra vertex invocations over the cluster triangle count. Shouldn't really be a difference idt, but I did it anyways. - Software raster now does 128 threads per workgroup instead of 64 threads. Each thread now loads, projects, and caches a vertex (vertices 0-127), and then if needed does so again (vertices 128-254). Each thread then rasterizes one of 128 triangles. - Fixed a bug with `needs_dispatch_remap`. I had the condition backwards in my last PR, I probably committed it by accident after testing the non-default code path on my GPU.	2024-09-08 17:55:57 +00:00
JMS55	6cc96f4c1f	Meshlet software raster + start of cleanup (#14623 ) # Objective - Faster meshlet rasterization path for small triangles - Avoid having to allocate and write out a triangle buffer - Refactor gpu_scene.rs ## Solution - Replace the 32bit visbuffer texture with a 64bit visbuffer buffer, where the left 32 bits encode depth, and the right 32 bits encode the existing cluster + triangle IDs. Can't use 64bit textures, wgpu/naga doesn't support atomic ops on textures yet. - Instead of writing out a buffer of packed cluster + triangle IDs (per triangle) to raster, the culling pass now writes out a buffer of just cluster IDs (per cluster, so less memory allocated, cheaper to write out). - Clusters for software raster are allocated from the left side - Clusters for hardware raster are allocated in the same buffer, from the right side - The buffer size is fixed at MeshletPlugin build time, and should be set to a reasonable value for your scene (no warning on overflow, and no good way to determine what value you need outside of renderdoc - I plan to fix this in a future PR adding a meshlet stats overlay) - Currently I don't have a heuristic for software vs hardware raster selection for each cluster. The existing code is just a placeholder. I need to profile on a release scene and come up with a heuristic, probably in a future PR. - The culling shader is getting pretty hard to follow at this point, but I don't want to spend time improving it as the entire shader/pass is getting rewritten/replaced in the near future. - Software raster is a compute workgroup per-cluster. Each workgroup loads and transforms the <=64 vertices of the cluster, and then rasterizes the <=64 triangles of the cluster. - Two variants are implemented: Scanline for clusters with any larger triangles (still smaller than hardware is good at), and brute-force for very very tiny triangles - Once the shader determines that a pixel should be filled in, it does an atomicMax() on the visbuffer to store the results, copying how Nanite works - On devices with a low max workgroups per dispatch limit, an extra compute pass is inserted before software raster to convert from a 1d to 2d dispatch (I don't think 3d would ever be necessary). - I haven't implemented the top-left rule or subpixel precision yet, I'm leaving that for a future PR since I get usable results without it for now - Resources used: https://kristoffer-dyrkorn.github.io/triangle-rasterizer and chapters 6-8 of https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index - Hardware raster now spawns 64*3 vertex invocations per meshlet, instead of the actual meshlet vertex count. Extra invocations just early-exit. - While this is slower than the existing system, hardware draws should be rare now that software raster is usable, and it saves a ton of memory using the unified cluster ID buffer. This would be fixed if wgpu had support for mesh shaders. - Instead of writing to a color+depth attachment, the hardware raster pass also does the same atomic visbuffer writes that software raster uses. - We have to bind a dummy render target anyways, as wgpu doesn't currently support render passes without any attachments - Material IDs are no longer written out during the main rasterization passes. - If we had async compute queues, we could overlap the software and hardware raster passes. - New material and depth resolve passes run at the end of the visbuffer node, and write out view depth and material ID depth textures ### Misc changes - Fixed cluster culling importing, but never actually using the previous view uniforms when doing occlusion culling - Fixed incorrectly adding the LOD error twice when building the meshlet mesh - Splitup gpu_scene module into meshlet_mesh_manager, instance_manager, and resource_manager - resource_manager is still too complex and inefficient (extract and prepare are way too expensive). I plan on improving this in a future PR, but for now ResourceManager is mostly a 1:1 port of the leftover MeshletGpuScene bits. - Material draw passes have been renamed to the more accurate material shade pass, as well as some other misc renaming (in the future, these will be compute shaders even, and not actual draw calls) --- ## Migration Guide - TBD (ask me at the end of the release for meshlet changes as a whole) --------- Co-authored-by: vero <email@atlasdostal.com>	2024-08-26 17:54:34 +00:00
re0312	65628ed4aa	fix meshlet example (#14471 ) # Objective - meshlet example has broken since #14273 ## Solution - disable msaa in meshlet example Co-authored-by: François Mockers <mockersf@gmail.com>	2024-07-25 15:22:11 +00:00
Sou1gh0st	9da18cce2a	Add support for environment map transformation (#14290 ) # Objective - Fixes: https://github.com/bevyengine/bevy/issues/14036 ## Solution - Add a world space transformation for the environment sample direction. ## Testing - I have tested the newly added `transform` field using the newly added `rotate_environment_map` example. https://github.com/user-attachments/assets/2de77c65-14bc-48ee-b76a-fb4e9782dbdb ## Migration Guide - Since we have added a new filed to the `EnvironmentMapLight` struct, users will need to include `..default()` or some rotation value in their initialization code.	2024-07-19 15:00:50 +00:00
JMS55	6e8d43a037	Faster MeshletMesh deserialization (#14193 ) # Objective - Using bincode to deserialize binary into a MeshletMesh is expensive (~77ms for a 5mb file). ## Solution - Write a custom deserializer using bytemuck's Pod types and slice casting. - Total asset load time has gone from ~102ms to ~12ms. - Change some types I never meant to be public to private and other misc cleanup. ## Testing - Ran the meshlet example and added timing spans to the asset loader. --- ## Changelog - Improved `MeshletMesh` loading speed - The `MeshletMesh` disk format has changed, and `MESHLET_MESH_ASSET_VERSION` has been bumped - `MeshletMesh` fields are now private - Renamed `MeshletMeshSaverLoad` to `MeshletMeshSaverLoader` - The `Meshlet`, `MeshletBoundingSpheres`, and `MeshletBoundingSphere` types are now private - Removed `MeshletMeshSaveOrLoadError::SerializationOrDeserialization` - Added `MeshletMeshSaveOrLoadError::WrongFileType` ## Migration Guide - Regenerate your `MeshletMesh` assets, as the disk format has changed, and `MESHLET_MESH_ASSET_VERSION` has been bumped - `MeshletMesh` fields are now private - `MeshletMeshSaverLoad` is now named `MeshletMeshSaverLoader` - The `Meshlet`, `MeshletBoundingSpheres`, and `MeshletBoundingSphere` types are now private - `MeshletMeshSaveOrLoadError::SerializationOrDeserialization` has been removed - Added `MeshletMeshSaveOrLoadError::WrongFileType`, match on this variant if you match on `MeshletMeshSaveOrLoadError`	2024-07-15 15:06:02 +00:00
François Mockers	a8751390aa	revert reflections PR changes to the meshlet example (#13539 ) # Objective - #13418 introduced unwanted changes to the meshlet example ## Solution - Remove them	2024-05-27 19:48:18 +00:00
Patrick Walton	f398674e51	Implement opt-in sharp screen-space reflections for the deferred renderer, with improved raymarching code. (#13418 ) This commit, a revamp of #12959, implements screen-space reflections (SSR), which approximate real-time reflections based on raymarching through the depth buffer and copying samples from the final rendered frame. This patch is a relatively minimal implementation of SSR, so as to provide a flexible base on which to customize and build in the future. However, it's based on the production-quality [raymarching code by Tomasz Stachowiak](https://gist.github.com/h3r2tic/9c8356bdaefbe80b1a22ae0aaee192db). For a general basic overview of screen-space reflections, see [1](https://lettier.github.io/3d-game-shaders-for-beginners/screen-space-reflection.html). The raymarching shader uses the basic algorithm of tracing forward in large steps, refining that trace in smaller increments via binary search, and then using the secant method. No temporal filtering or roughness blurring, is performed at all; for this reason, SSR currently only operates on very shiny surfaces. No acceleration via the hierarchical Z-buffer is implemented (though note that https://github.com/bevyengine/bevy/pull/12899 will add the infrastructure for this). Reflections are traced at full resolution, which is often considered slow. All of these improvements and more can be follow-ups. SSR is built on top of the deferred renderer and is currently only supported in that mode. Forward screen-space reflections are possible albeit uncommon (though e.g. Doom Eternal uses them); however, they require tracing from the previous frame, which would add complexity. This patch leaves the door open to implementing SSR in the forward rendering path but doesn't itself have such an implementation. Screen-space reflections aren't supported in WebGL 2, because they require sampling from the depth buffer, which Naga can't do because of a bug (`sampler2DShadow` is incorrectly generated instead of `sampler2D`; this is the same reason why depth of field is disabled on that platform). To add screen-space reflections to a camera, use the `ScreenSpaceReflectionsBundle` bundle or the `ScreenSpaceReflectionsSettings` component. In addition to `ScreenSpaceReflectionsSettings`, `DepthPrepass` and `DeferredPrepass` must also be present for the reflections to show up. The `ScreenSpaceReflectionsSettings` component contains several settings that artists can tweak, and also comes with sensible defaults. A new example, `ssr`, has been added. It's loosely based on the [three.js ocean sample](https://threejs.org/examples/webgl_shaders_ocean.html), but all the assets are original. Note that the three.js demo has no screen-space reflections and instead renders a mirror world. In contrast to #12959, this demo tests not only a cube but also a more complex model (the flight helmet). ## Changelog ### Added * Screen-space reflections can be enabled for very smooth surfaces by adding the `ScreenSpaceReflections` component to a camera. Deferred rendering must be enabled for the reflections to appear. ![Screenshot 2024-05-18 143555](https://github.com/bevyengine/bevy/assets/157897/b8675b39-8a89-433e-a34e-1b9ee1233267) ![Screenshot 2024-05-18 143606](https://github.com/bevyengine/bevy/assets/157897/cc9e1cd0-9951-464a-9a08-e589210e5606)	2024-05-27 13:43:40 +00:00
François Mockers	1c15ac647a	Example setup for tooling (#13088 ) # Objective - #12755 introduced the need to download a file to run an example - This means the example fails to run in CI without downloading that file ## Solution - Add a new metadata to examples "setup" that provides setup instructions - Replace the URL in the meshlet example to one that can actually be downloaded - example-showcase execute the setup before running an example	2024-05-02 20:10:09 +00:00
JMS55	e1a0da0fa6	Meshlet LOD-compatible two-pass occlusion culling (#12898 ) Keeping track of explicit visibility per cluster between frames does not work with LODs, and leads to worse culling (using the final depth buffer from the previous frame is more accurate). Instead, we need to generate a second depth pyramid after the second raster pass, and then use that in the first culling pass in the next frame to test if a cluster would have been visible last frame or not. As part of these changes, the write_index_buffer pass has been folded into the culling pass for a large performance gain, and to avoid tracking a lot of extra state that would be needed between passes. Prepass previous model/view stuff was adapted to work with meshlets as well. Also fixed a bug with materials, and other misc improvements. --------- Co-authored-by: François <mockersf@gmail.com> Co-authored-by: atlas dostal <rodol@rivalrebels.com> Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: Patrick Walton <pcwalton@mimiga.net> Co-authored-by: Robert Swain <robert.swain@gmail.com>	2024-04-28 05:30:20 +00:00
JMS55	6d6810c90d	Meshlet continuous LOD (#12755 ) Adds a basic level of detail system to meshlets. An extremely brief summary is as follows: * In `from_mesh.rs`, once we've built the first level of clusters, we group clusters, simplify the new mega-clusters, and then split the simplified groups back into regular sized clusters. Repeat several times (ideally until you can't anymore). This forms a directed acyclic graph (DAG), where the children are the meshlets from the previous level, and the parents are the more simplified versions of their children. The leaf nodes are meshlets formed from the original mesh. * In `cull_meshlets.wgsl`, each cluster selects whether to render or not based on the LOD bounding sphere (different than the culling bounding sphere) of the current meshlet, the LOD bounding sphere of its parent (the meshlet group from simplification), and the simplification error relative to its children of both the current meshlet and its parent meshlet. This kind of breaks two pass occlusion culling, which will be fixed in a future PR by using an HZB from the previous frame to get the initial list of occluders. Many, _many_ improvements to be done in the future https://github.com/bevyengine/bevy/issues/11518, not least of which is code quality and speed. I don't even expect this to work on many types of input meshes. This is just a basic implementation/draft for collaboration. Arguable how much we want to do in this PR, I'll leave that up to maintainers. I've erred on the side of "as basic as possible". References: * Slides 27-77 (video available on youtube) https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf * https://blog.traverseresearch.nl/creating-a-directed-acyclic-graph-from-a-mesh-1329e57286e5 * https://jglrxavpok.github.io/2024/01/19/recreating-nanite-lod-generation.html, https://jglrxavpok.github.io/2024/03/12/recreating-nanite-faster-lod-generation.html, https://jglrxavpok.github.io/2024/04/02/recreating-nanite-runtime-lod-selection.html, and https://github.com/jglrxavpok/Carrot * https://github.com/gents83/INOX/tree/master/crates/plugins/binarizer/src * https://cs418.cs.illinois.edu/website/text/nanite.html ![image](https://github.com/bevyengine/bevy/assets/47158642/e40bff9b-7d0c-4a19-a3cc-2aad24965977) ![image](https://github.com/bevyengine/bevy/assets/47158642/442c7da3-7761-4da7-9acd-37f15dd13e26) --------- Co-authored-by: Ricky Taylor <rickytaylor26@gmail.com> Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: François <mockersf@gmail.com> Co-authored-by: atlas dostal <rodol@rivalrebels.com> Co-authored-by: Patrick Walton <pcwalton@mimiga.net>	2024-04-23 21:43:53 +00:00
JMS55	4f20faaa43	Meshlet rendering (initial feature) (#10164 ) # Objective - Implements a more efficient, GPU-driven (https://github.com/bevyengine/bevy/issues/1342) rendering pipeline based on meshlets. - Meshes are split into small clusters of triangles called meshlets, each of which acts as a mini index buffer into the larger mesh data. Meshlets can be compressed, streamed, culled, and batched much more efficiently than monolithic meshes. ![image](https://github.com/bevyengine/bevy/assets/47158642/cb2aaad0-7a9a-4e14-93b0-15d4e895b26a) ![image](https://github.com/bevyengine/bevy/assets/47158642/7534035b-1eb7-4278-9b99-5322e4401715) # Misc * Future work: https://github.com/bevyengine/bevy/issues/11518 * Nanite reference: https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf Two pass occlusion culling explained very well: https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501 --------- Co-authored-by: Ricky Taylor <rickytaylor26@gmail.com> Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: François <mockersf@gmail.com> Co-authored-by: atlas dostal <rodol@rivalrebels.com>	2024-03-25 19:08:27 +00:00

11 commits