bevy/crates/bevy_pbr/src/render/light.rs

1850 lines
69 KiB
Rust
Raw Normal View History

Deferred Renderer (#9258) # Objective - Add a [Deferred Renderer](https://en.wikipedia.org/wiki/Deferred_shading) to Bevy. - This allows subsequent passes to access per pixel material information before/during shading. - Accessing this per pixel material information is needed for some features, like GI. It also makes other features (ex. Decals) simpler to implement and/or improves their capability. There are multiple approaches to accomplishing this. The deferred shading approach works well given the limitations of WebGPU and WebGL2. Motivation: [I'm working on a GI solution for Bevy](https://youtu.be/eH1AkL-mwhI) # Solution - The deferred renderer is implemented with a prepass and a deferred lighting pass. - The prepass renders opaque objects into the Gbuffer attachment (`Rgba32Uint`). The PBR shader generates a `PbrInput` in mostly the same way as the forward implementation and then [packs it into the Gbuffer](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L168). - The deferred lighting pass unpacks the `PbrInput` and [feeds it into the pbr() function](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L65), then outputs the shaded color data. - There is now a resource [DefaultOpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L599) that can be used to set the default render method for opaque materials. If materials return `None` from [opaque_render_method()](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L131) the `DefaultOpaqueRendererMethod` will be used. Otherwise, custom materials can also explicitly choose to only support Deferred or Forward by returning the respective [OpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L603) - Deferred materials can be used seamlessly along with both opaque and transparent forward rendered materials in the same scene. The [deferred rendering example](https://github.com/DGriffin91/bevy/blob/deferred/examples/3d/deferred_rendering.rs) does this. - The deferred renderer does not support MSAA. If any deferred materials are used, MSAA must be disabled. Both TAA and FXAA are supported. - Deferred rendering supports WebGL2/WebGPU. ## Custom deferred materials - Custom materials can support both deferred and forward at the same time. The [StandardMaterial](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L166) does this. So does [this example](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56). - Custom deferred materials that require PBR lighting can create a `PbrInput`, write it to the deferred GBuffer and let it be rendered by the `PBRDeferredLightingPlugin`. - Custom deferred materials that require custom lighting have two options: 1. Use the base_color channel of the `PbrInput` combined with the `STANDARD_MATERIAL_FLAGS_UNLIT_BIT` flag. [Example.](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56) (If the unlit bit is set, the base_color is stored as RGB9E5 for extra precision) 2. A Custom Deferred Lighting pass can be created, either overriding the default, or running in addition. The a depth buffer is used to limit rendering to only the required fragments for each deferred lighting pass. Materials can set their respective depth id via the [deferred_lighting_pass_id](https://github.com/DGriffin91/bevy/blob/b79182d2a32cac28c4213c2457a53ac2cc885332/crates/bevy_pbr/src/prepass/prepass_io.wgsl#L95) attachment. The custom deferred lighting pass plugin can then set [its corresponding depth](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L37). Then with the lighting pass using [CompareFunction::Equal](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/mod.rs#L335), only the fragments with a depth that equal the corresponding depth written in the material will be rendered. Custom deferred lighting plugins can also be created to render the StandardMaterial. The default deferred lighting plugin can be bypassed with `DefaultPlugins.set(PBRDeferredLightingPlugin { bypass: true })` --------- Co-authored-by: nickrart <nickolas.g.russell@gmail.com>
2023-10-12 22:10:38 +00:00
use bevy_core_pipeline::core_3d::{Transparent3d, CORE_3D_DEPTH_FORMAT};
use bevy_ecs::entity::EntityHashMap;
use bevy_ecs::prelude::*;
use bevy_math::{Mat4, UVec3, UVec4, Vec2, Vec3, Vec3Swizzles, Vec4, Vec4Swizzles};
use bevy_render::{
camera::Camera,
2021-06-02 02:59:17 +00:00
color::Color,
mesh::Mesh,
primitives::{CascadesFrusta, CubemapFrusta, Frustum},
render_asset::RenderAssets,
Make render graph slots optional for most cases (#8109) # Objective - Currently, the render graph slots are only used to pass the view_entity around. This introduces significant boilerplate for very little value. Instead of using slots for this, make the view_entity part of the `RenderGraphContext`. This also means we won't need to have `IN_VIEW` on every node and and we'll be able to use the default impl of `Node::input()`. ## Solution - Add `view_entity: Option<Entity>` to the `RenderGraphContext` - Update all nodes to use this instead of entity slot input --- ## Changelog - Add optional `view_entity` to `RenderGraphContext` ## Migration Guide You can now get the view_entity directly from the `RenderGraphContext`. When implementing the Node: ```rust // 0.10 struct FooNode; impl FooNode { const IN_VIEW: &'static str = "view"; } impl Node for FooNode { fn input(&self) -> Vec<SlotInfo> { vec![SlotInfo::new(Self::IN_VIEW, SlotType::Entity)] } fn run( &self, graph: &mut RenderGraphContext, // ... ) -> Result<(), NodeRunError> { let view_entity = graph.get_input_entity(Self::IN_VIEW)?; // ... Ok(()) } } // 0.11 struct FooNode; impl Node for FooNode { fn run( &self, graph: &mut RenderGraphContext, // ... ) -> Result<(), NodeRunError> { let view_entity = graph.view_entity(); // ... Ok(()) } } ``` When adding the node to the graph, you don't need to specify a slot_edge for the view_entity. ```rust // 0.10 let mut graph = RenderGraph::default(); graph.add_node(FooNode::NAME, node); let input_node_id = draw_2d_graph.set_input(vec![SlotInfo::new( graph::input::VIEW_ENTITY, SlotType::Entity, )]); graph.add_slot_edge( input_node_id, graph::input::VIEW_ENTITY, FooNode::NAME, FooNode::IN_VIEW, ); // add_node_edge ... // 0.11 let mut graph = RenderGraph::default(); graph.add_node(FooNode::NAME, node); // add_node_edge ... ``` ## Notes This PR paired with #8007 will help reduce a lot of annoying boilerplate with the render nodes. Depending on which one gets merged first. It will require a bit of clean up work to make both compatible. I tagged this as a breaking change, because using the old system to get the view_entity will break things because it's not a node input slot anymore. ## Notes for reviewers A lot of the diffs are just removing the slots in every nodes and graph creation. The important part is mostly in the graph_runner/CameraDriverNode.
2023-03-21 20:11:13 +00:00
render_graph::{Node, NodeRunError, RenderGraphContext},
render_phase::*,
render_resource::*,
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
renderer::{RenderContext, RenderDevice, RenderQueue},
2021-06-02 02:59:17 +00:00
texture::*,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
view::{ExtractedView, RenderLayers, ViewVisibility, VisibleEntities},
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
Extract,
2021-06-02 02:59:17 +00:00
};
use bevy_transform::{components::GlobalTransform, prelude::Transform};
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
#[cfg(feature = "trace")]
use bevy_utils::tracing::info_span;
Mesh vertex buffer layouts (#3959) This PR makes a number of changes to how meshes and vertex attributes are handled, which the goal of enabling easy and flexible custom vertex attributes: * Reworks the `Mesh` type to use the newly added `VertexAttribute` internally * `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat` * `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting" * Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently. * Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key). * Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally) . Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because its wrong! Hashes by themselves aren't enough because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR. * Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool! To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now: ```rust impl SpecializedMeshPipeline for MeshPipeline { type Key = MeshPipelineKey; fn specialize( &self, key: Self::Key, layout: &MeshVertexBufferLayout, ) -> RenderPipelineDescriptor { let mut vertex_attributes = vec![ Mesh::ATTRIBUTE_POSITION.at_shader_location(0), Mesh::ATTRIBUTE_NORMAL.at_shader_location(1), Mesh::ATTRIBUTE_UV_0.at_shader_location(2), ]; let mut shader_defs = Vec::new(); if layout.contains(Mesh::ATTRIBUTE_TANGENT) { shader_defs.push(String::from("VERTEX_TANGENTS")); vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3)); } let vertex_buffer_layout = layout .get_layout(&vertex_attributes) .expect("Mesh is missing a vertex attribute"); ``` Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex postions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`. This is still a draft because I still need to: * Add more docs * Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it. * Consider breaking out the PreHash / hashbrown changes into a separate PR. * Add an example illustrating this change * Verify that the "mesh-specialized pipeline de-duplication code" works properly Please dont yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap. Alternative to #3120 Fixes #3030 Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
use bevy_utils::{
Automatic batching/instancing of draw commands (#9685) # Objective - Implement the foundations of automatic batching/instancing of draw commands as the next step from #89 - NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data. ## Solution - The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on. - The following assumptions have been made: - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases - View bindings are constant across a phase for a given draw function as phases are per-view - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size. - Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago. - Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed. - Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components. ## Benchmarks All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`: ```rust ( exit_after: Some(1500) ) ``` ...in order to run each test for 1500 frames. The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined this means that the same frames should be rendered both on main and on the branch. The graphs compare main (yellow) to this PR (red). ### 3D Mesh `many_cubes --benchmark` <img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4"> The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time. The 1000th frame is identical using the flip tool: ![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4615 seconds ``` ### 3D Mesh `many_cubes --benchmark --material-texture-count 10` <img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf"> This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched. The 1000th frame is identical in flip: ![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4537 seconds ``` ### 3D Mesh `many_cubes --benchmark --vary-per-instance` <img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f"> This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities. The 1000th frame is identical according to flip: ![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4568 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` <img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4"> This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time. Capturing a screenshot of the 1000th frame main vs PR gives: ![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40) ``` Mean: 0.001222 Weighted median: 0.750432 1st weighted quartile: 0.453494 3rd weighted quartile: 0.969758 Min: 0.000000 Max: 0.990296 Evaluation time: 0.4255 seconds ``` So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected. ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10` <img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c"> This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay). ![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f) ``` Mean: 0.002877 Weighted median: 0.964980 1st weighted quartile: 0.668871 3rd weighted quartile: 0.982749 Min: 0.000000 Max: 0.992377 Evaluation time: 0.4301 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance` <img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb"> This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws. The 1000th frame is the same: ![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97) ``` Mean: 0.001214 Weighted median: 0.937499 1st weighted quartile: 0.635467 3rd weighted quartile: 0.979085 Min: 0.000000 Max: 0.988971 Evaluation time: 0.4462 seconds ``` ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` <img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43"> This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10` <img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413"> This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance` <img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6"> This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results. ### Summary | Benchmark | % change in median frame time | | ------------- | ------------- | | many_cubes | 🟩 -30% | | many_cubes 10 materials | 🟩 -5% | | many_cubes unique materials | 🟩 ~0% | | bevymark mesh2d | 🟩 -50% | | bevymark mesh2d 10 materials | 🟩 -50% | | bevymark mesh2d unique materials | 🟩 -7% | | bevymark sprite | 🟥 2% | | bevymark sprite 10 materials | 🟥 0.6% | | bevymark sprite unique materials | 🟥 4.1% | --- ## Changelog - Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Nicola Papale <nico@nicopap.ch>
2023-09-21 22:12:34 +00:00
nonmax::NonMaxU32,
Mesh vertex buffer layouts (#3959) This PR makes a number of changes to how meshes and vertex attributes are handled, which the goal of enabling easy and flexible custom vertex attributes: * Reworks the `Mesh` type to use the newly added `VertexAttribute` internally * `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat` * `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting" * Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently. * Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key). * Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally) . Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because its wrong! Hashes by themselves aren't enough because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR. * Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool! To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now: ```rust impl SpecializedMeshPipeline for MeshPipeline { type Key = MeshPipelineKey; fn specialize( &self, key: Self::Key, layout: &MeshVertexBufferLayout, ) -> RenderPipelineDescriptor { let mut vertex_attributes = vec![ Mesh::ATTRIBUTE_POSITION.at_shader_location(0), Mesh::ATTRIBUTE_NORMAL.at_shader_location(1), Mesh::ATTRIBUTE_UV_0.at_shader_location(2), ]; let mut shader_defs = Vec::new(); if layout.contains(Mesh::ATTRIBUTE_TANGENT) { shader_defs.push(String::from("VERTEX_TANGENTS")); vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3)); } let vertex_buffer_layout = layout .get_layout(&vertex_attributes) .expect("Mesh is missing a vertex attribute"); ``` Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex postions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`. This is still a draft because I still need to: * Add more docs * Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it. * Consider breaking out the PreHash / hashbrown changes into a separate PR. * Add an example illustrating this change * Verify that the "mesh-specialized pipeline de-duplication code" works properly Please dont yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap. Alternative to #3120 Fixes #3030 Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
tracing::{error, warn},
};
Automatic batching/instancing of draw commands (#9685) # Objective - Implement the foundations of automatic batching/instancing of draw commands as the next step from #89 - NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data. ## Solution - The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on. - The following assumptions have been made: - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases - View bindings are constant across a phase for a given draw function as phases are per-view - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size. - Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago. - Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed. - Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components. ## Benchmarks All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`: ```rust ( exit_after: Some(1500) ) ``` ...in order to run each test for 1500 frames. The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined this means that the same frames should be rendered both on main and on the branch. The graphs compare main (yellow) to this PR (red). ### 3D Mesh `many_cubes --benchmark` <img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4"> The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time. The 1000th frame is identical using the flip tool: ![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4615 seconds ``` ### 3D Mesh `many_cubes --benchmark --material-texture-count 10` <img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf"> This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched. The 1000th frame is identical in flip: ![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4537 seconds ``` ### 3D Mesh `many_cubes --benchmark --vary-per-instance` <img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f"> This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities. The 1000th frame is identical according to flip: ![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4568 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` <img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4"> This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time. Capturing a screenshot of the 1000th frame main vs PR gives: ![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40) ``` Mean: 0.001222 Weighted median: 0.750432 1st weighted quartile: 0.453494 3rd weighted quartile: 0.969758 Min: 0.000000 Max: 0.990296 Evaluation time: 0.4255 seconds ``` So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected. ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10` <img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c"> This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay). ![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f) ``` Mean: 0.002877 Weighted median: 0.964980 1st weighted quartile: 0.668871 3rd weighted quartile: 0.982749 Min: 0.000000 Max: 0.992377 Evaluation time: 0.4301 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance` <img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb"> This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws. The 1000th frame is the same: ![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97) ``` Mean: 0.001214 Weighted median: 0.937499 1st weighted quartile: 0.635467 3rd weighted quartile: 0.979085 Min: 0.000000 Max: 0.988971 Evaluation time: 0.4462 seconds ``` ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` <img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43"> This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10` <img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413"> This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance` <img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6"> This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results. ### Summary | Benchmark | % change in median frame time | | ------------- | ------------- | | many_cubes | 🟩 -30% | | many_cubes 10 materials | 🟩 -5% | | many_cubes unique materials | 🟩 ~0% | | bevymark mesh2d | 🟩 -50% | | bevymark mesh2d 10 materials | 🟩 -50% | | bevymark mesh2d unique materials | 🟩 -7% | | bevymark sprite | 🟥 2% | | bevymark sprite 10 materials | 🟥 0.6% | | bevymark sprite unique materials | 🟥 4.1% | --- ## Changelog - Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Nicola Papale <nico@nicopap.ch>
2023-09-21 22:12:34 +00:00
use std::{hash::Hash, num::NonZeroU64, ops::Range};
2021-06-02 02:59:17 +00:00
use crate::*;
2021-11-22 23:16:36 +00:00
#[derive(Component)]
2021-06-02 02:59:17 +00:00
pub struct ExtractedPointLight {
color: Color,
bevy_pbr2: Improve lighting units and documentation (#2704) # Objective A question was raised on Discord about the units of the `PointLight` `intensity` member. After digging around in the bevy_pbr2 source code and [Google Filament documentation](https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower) I discovered that the intention by Filament was that the 'intensity' value for point lights would be in lumens. This makes a lot of sense as these are quite relatable units given basically all light bulbs I've seen sold over the past years are rated in lumens as people move away from thinking about how bright a bulb is relative to a non-halogen incandescent bulb. However, it seems that the derivation of the conversion between luminous power (lumens, denoted `Φ` in the Filament formulae) and luminous intensity (lumens per steradian, `I` in the Filament formulae) was missed and I can see why as it is tucked right under equation 58 at the link above. As such, while the formula states that for a point light, `I = Φ / 4 π` we have been using `intensity` as if it were luminous intensity `I`. Before this PR, the intensity field is luminous intensity in lumens per steradian. After this PR, the intensity field is luminous power in lumens, [as suggested by Filament](https://google.github.io/filament/Filament.html#table_lighttypesunits) (unfortunately the link jumps to the table's caption so scroll up to see the actual table). I appreciate that it may be confusing to call this an intensity, but I think this is intended as more of a non-scientific, human-relatable general term with a bit of hand waving so that most light types can just have an intensity field and for most of them it works in the same way or at least with some relatable value. I'm inclined to think this is reasonable rather than throwing terms like luminous power, luminous intensity, blah at users. ## Solution - Documented the `PointLight` `intensity` member as 'luminous power' in units of lumens. - Added a table of examples relating from various types of household lighting to lumen values. - Added in the mapping from luminous power to luminous intensity when premultiplying the intensity into the colour before it is made into a graphics uniform. - Updated the documentation in `pbr.wgsl` to clarify the earlier confusion about the missing `/ 4 π`. - Bumped the intensity of the point lights in `3d_scene_pipelined` to 1600 lumens. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2021-08-23 23:48:11 +00:00
/// luminous intensity in lumens per steradian
2021-06-02 02:59:17 +00:00
intensity: f32,
range: f32,
radius: f32,
transform: GlobalTransform,
shadows_enabled: bool,
shadow_depth_bias: f32,
shadow_normal_bias: f32,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
spot_light_angles: Option<(f32, f32)>,
}
#[derive(Component, Debug)]
pub struct ExtractedDirectionalLight {
color: Color,
illuminance: f32,
Take DirectionalLight's GlobalTransform into account when calculating shadow map volume (not just direction) (#6384) # Objective This PR fixes #5789, by enabling movable (and scalable) directional light shadow volumes. ## Solution This PR changes `ExtractedDirectionalLight` to hold a copy of the `DirectionalLight` entity's `GlobalTransform`, instead of just a `direction` vector. This allows the shadow map volume (as defined by the light's `shadow_projection` field) to be transformed honoring translation _and_ scale transforms, and not just rotation. It also augments the texel size calculation (used to determine the `shadow_normal_bias`) so that it now takes into account the upper bound of the x/y/z scale of the `GlobalTransform`. This change makes the directional light extraction code more consistent with point and spot lights (that already use `transform`), and allows easily moving and scaling the shadow volume along with a player entity based on camera distance/angle, immediately enabling more real world use cases until we have a more sophisticated adaptive implementation, such as the one described in #3629. **Note:** While it was previously possible to update the projection achieving a similar effect, depending on the light direction and distance to the origin, the fact that the shadow map camera was always positioned at the origin with a hardcoded `Vec3::Y` up value meant you would get sub-optimal or inconsistent/incorrect results. --- ## Changelog ### Changed - `DirectionalLight` shadow volumes now honor translation and scale transforms ## Migration Guide - If your directional lights were positioned at the origin and not scaled (the default, most common scenario) no changes are needed on your part; it just works as before; - If you previously had a system for dynamically updating directional light shadow projections, you might now be able to simplify your code by updating the directional light entity's transform instead; - In the unlikely scenario that a scene with directional lights that previously rendered shadows correctly has missing shadows, make sure your directional lights are positioned at (0, 0, 0) and are not scaled to a size that's too large or too small.
2022-11-04 20:12:26 +00:00
transform: GlobalTransform,
shadows_enabled: bool,
shadow_depth_bias: f32,
shadow_normal_bias: f32,
cascade_shadow_config: CascadeShadowConfig,
cascades: EntityHashMap<Vec<Cascade>>,
frusta: EntityHashMap<Vec<Frustum>>,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
render_layers: RenderLayers,
2021-06-02 02:59:17 +00:00
}
#[derive(Copy, Clone, ShaderType, Default, Debug)]
pub struct GpuPointLight {
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// For point lights: the lower-right 2x2 values of the projection matrix [2][2] [2][3] [3][2] [3][3]
// For spot lights: 2 components of the direction (x,z), spot_scale and spot_offset
light_custom_data: Vec4,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
color_inverse_square_range: Vec4,
position_radius: Vec4,
flags: u32,
shadow_depth_bias: f32,
shadow_normal_bias: f32,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
spot_light_tan_angle: f32,
}
#[derive(ShaderType)]
pub struct GpuPointLightsUniform {
data: Box<[GpuPointLight; MAX_UNIFORM_BUFFER_POINT_LIGHTS]>,
}
impl Default for GpuPointLightsUniform {
fn default() -> Self {
Self {
data: Box::new([GpuPointLight::default(); MAX_UNIFORM_BUFFER_POINT_LIGHTS]),
}
}
}
#[derive(ShaderType, Default)]
pub struct GpuPointLightsStorage {
#[size(runtime)]
data: Vec<GpuPointLight>,
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
pub enum GpuPointLights {
Uniform(UniformBuffer<GpuPointLightsUniform>),
Storage(StorageBuffer<GpuPointLightsStorage>),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
impl GpuPointLights {
fn new(buffer_binding_type: BufferBindingType) -> Self {
match buffer_binding_type {
BufferBindingType::Storage { .. } => Self::storage(),
BufferBindingType::Uniform => Self::uniform(),
}
}
fn uniform() -> Self {
Self::Uniform(UniformBuffer::default())
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
fn storage() -> Self {
Self::Storage(StorageBuffer::default())
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
fn set(&mut self, mut lights: Vec<GpuPointLight>) {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
match self {
GpuPointLights::Uniform(buffer) => {
let len = lights.len().min(MAX_UNIFORM_BUFFER_POINT_LIGHTS);
let src = &lights[..len];
let dst = &mut buffer.get_mut().data[..len];
dst.copy_from_slice(src);
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
GpuPointLights::Storage(buffer) => {
buffer.get_mut().data.clear();
buffer.get_mut().data.append(&mut lights);
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
}
fn write_buffer(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
match self {
GpuPointLights::Uniform(buffer) => buffer.write_buffer(render_device, render_queue),
GpuPointLights::Storage(buffer) => buffer.write_buffer(render_device, render_queue),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
pub fn binding(&self) -> Option<BindingResource> {
match self {
GpuPointLights::Uniform(buffer) => buffer.binding(),
GpuPointLights::Storage(buffer) => buffer.binding(),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
pub fn min_size(buffer_binding_type: BufferBindingType) -> NonZeroU64 {
match buffer_binding_type {
BufferBindingType::Storage { .. } => GpuPointLightsStorage::min_size(),
BufferBindingType::Uniform => GpuPointLightsUniform::min_size(),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
// NOTE: These must match the bit flags in bevy_pbr/src/render/mesh_view_types.wgsl!
bitflags::bitflags! {
#[repr(transparent)]
struct PointLightFlags: u32 {
const SHADOWS_ENABLED = 1 << 0;
const SPOT_LIGHT_Y_NEGATIVE = 1 << 1;
const NONE = 0;
const UNINITIALIZED = 0xFFFF;
}
}
#[derive(Copy, Clone, ShaderType, Default, Debug)]
pub struct GpuDirectionalCascade {
view_projection: Mat4,
texel_size: f32,
far_bound: f32,
}
#[derive(Copy, Clone, ShaderType, Default, Debug)]
pub struct GpuDirectionalLight {
cascades: [GpuDirectionalCascade; MAX_CASCADES_PER_LIGHT],
color: Vec4,
dir_to_light: Vec3,
flags: u32,
shadow_depth_bias: f32,
shadow_normal_bias: f32,
num_cascades: u32,
cascades_overlap_proportion: f32,
depth_texture_base_index: u32,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
render_layers: u32,
2021-06-02 02:59:17 +00:00
}
// NOTE: These must match the bit flags in bevy_pbr/src/render/mesh_view_types.wgsl!
bitflags::bitflags! {
#[repr(transparent)]
struct DirectionalLightFlags: u32 {
const SHADOWS_ENABLED = 1 << 0;
const NONE = 0;
const UNINITIALIZED = 0xFFFF;
}
}
#[derive(Copy, Clone, Debug, ShaderType)]
2021-06-02 02:59:17 +00:00
pub struct GpuLights {
directional_lights: [GpuDirectionalLight; MAX_DIRECTIONAL_LIGHTS],
bevy_pbr2: Add support for most of the StandardMaterial textures (#4) * bevy_pbr2: Add support for most of the StandardMaterial textures Normal maps are not included here as they require tangents in a vertex attribute. * bevy_pbr2: Ensure RenderCommandQueue is ready for PbrShaders init * texture_pipelined: Add a light to the scene so we can see stuff * WIP bevy_pbr2: back to front sorting hack * bevy_pbr2: Uniform control flow for texture sampling in pbr.frag From 'fintelia' on the Bevy Render Rework Round 2 discussion: "My understanding is that GPUs these days never use the "execute both branches and select the result" strategy. Rather, what they do is evaluate the branch condition on all threads of a warp, and jump over it if all of them evaluate to false. If even a single thread needs to execute the if statement body, however, then the remaining threads are paused until that is completed." * bevy_pbr2: Simplify texture and sampler names The StandardMaterial_ prefix is no longer needed * bevy_pbr2: Match default 'AmbientColor' of current bevy_pbr for now * bevy_pbr2: Convert from non-linear to linear sRGB for the color uniform * bevy_pbr2: Add pbr_pipelined example * Fix view vector in pbr frag to work in ortho * bevy_pbr2: Use a 90 degree y fov and light range projection for lights * bevy_pbr2: Add AmbientLight resource * bevy_pbr2: Convert PointLight color to linear sRGB for use in fragment shader * bevy_pbr2: pbr.frag: Rename PointLight.projection to view_projection The uniform contains the view_projection matrix so this was incorrect. * bevy_pbr2: PointLight is an OmniLight as it has a radius * bevy_pbr2: Factoring out duplicated code * bevy_pbr2: Implement RenderAsset for StandardMaterial * Remove unnecessary texture and sampler clones * fix comment formatting * remove redundant Buffer:from * Don't extract meshes when their material textures aren't ready * make missing textures in the queue step an error Co-authored-by: Aevyrie <aevyrie@gmail.com> Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2021-06-27 23:10:23 +00:00
ambient_color: Vec4,
// xyz are x/y/z cluster dimensions and w is the number of clusters
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
cluster_dimensions: UVec4,
// xy are vec2<f32>(cluster_dimensions.xy) / vec2<f32>(view.width, view.height)
// z is cluster_dimensions.z / log(far / near)
// w is cluster_dimensions.z * log(near) / log(far / near)
cluster_factors: Vec4,
n_directional_lights: u32,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// offset from spot light's light index to spot light's shadow map index
spot_light_shadowmap_offset: i32,
2021-06-02 02:59:17 +00:00
}
// NOTE: this must be kept in sync with the same constants in pbr.frag
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
pub const MAX_UNIFORM_BUFFER_POINT_LIGHTS: usize = 256;
//NOTE: When running bevy on Adreno GPU chipsets in WebGL, any value above 1 will result in a crash
// when loading the wgsl "pbr_functions.wgsl" in the function apply_fog.
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]
pub const MAX_DIRECTIONAL_LIGHTS: usize = 1;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
))]
pub const MAX_DIRECTIONAL_LIGHTS: usize = 10;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
))]
pub const MAX_CASCADES_PER_LIGHT: usize = 4;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]
pub const MAX_CASCADES_PER_LIGHT: usize = 1;
2021-06-02 02:59:17 +00:00
#[derive(Resource, Clone)]
pub struct ShadowSamplers {
pub point_light_sampler: Sampler,
pub directional_light_sampler: Sampler,
2021-06-02 02:59:17 +00:00
}
// TODO: this pattern for initializing the shaders / pipeline isn't ideal. this should be handled by the asset system
impl FromWorld for ShadowSamplers {
2021-06-02 02:59:17 +00:00
fn from_world(world: &mut World) -> Self {
let render_device = world.resource::<RenderDevice>();
2021-06-21 23:28:52 +00:00
ShadowSamplers {
point_light_sampler: render_device.create_sampler(&SamplerDescriptor {
address_mode_u: AddressMode::ClampToEdge,
address_mode_v: AddressMode::ClampToEdge,
address_mode_w: AddressMode::ClampToEdge,
mag_filter: FilterMode::Linear,
min_filter: FilterMode::Linear,
mipmap_filter: FilterMode::Nearest,
compare: Some(CompareFunction::GreaterEqual),
..Default::default()
}),
directional_light_sampler: render_device.create_sampler(&SamplerDescriptor {
address_mode_u: AddressMode::ClampToEdge,
address_mode_v: AddressMode::ClampToEdge,
address_mode_w: AddressMode::ClampToEdge,
mag_filter: FilterMode::Linear,
min_filter: FilterMode::Linear,
mipmap_filter: FilterMode::Nearest,
compare: Some(CompareFunction::GreaterEqual),
..Default::default()
}),
}
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
#[derive(Component)]
pub struct ExtractedClusterConfig {
/// Special near value for cluster calculations
near: f32,
Dynamic light clusters (#3968) # Objective provide some customisation for default cluster setup avoid "cluster index lists is full" in all cases (using a strategy outlined by @superdump) ## Solution Add ClusterConfig enum (which can be inserted into a view at any time) to allow specifying cluster setup with variants: - None (do not do any light assignment - for views which do not require light info, e.g. minimaps etc) - Single (one cluster) - XYZ (explicit cluster counts in each dimension) - FixedZ (most similar to current - specify Z-slices and total, then x and y counts are dynamically determined to give approximately square clusters based on current aspect ratio) Defaults to FixedZ { total: 4096, z: 24 } which is similar to the current setup. Per frame, estimate the number of indices that would be required for the current config and decrease the cluster counts / increase the cluster sizes in the x and y dimensions if the index list would be too small. notes: - I didn't put ClusterConfig in the camera bundles to avoid introducing a dependency from bevy_render to bevy_pbr. the ClusterConfig enum comes with a pbr-centric impl block so i didn't want to move that into bevy_render either. - ~Might want to add None variant to cluster config for views that don't care about lights?~ - Not well tested for orthographic - ~there's a cluster_muck branch on my repo which includes some diagnostics / a modified lighting example which may be useful for tyre-kicking~ (outdated, i will bring it up to date if required) anecdotal timings: FPS on the lighting demo is negligibly better (~5%), maybe due to a small optimisation constraining the light aabb to be in front of the camera FPS on the lighting demo with 100 extra lights added is ~33% faster, and also renders correctly as the cluster index count is no longer exceeded
2022-03-08 04:56:42 +00:00
far: f32,
/// Number of clusters in `X` / `Y` / `Z` in the view frustum
dimensions: UVec3,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
enum ExtractedClustersPointLightsElement {
ClusterHeader(u32, u32),
LightEntity(Entity),
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
#[derive(Component)]
pub struct ExtractedClustersPointLights {
data: Vec<ExtractedClustersPointLightsElement>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
pub fn extract_clusters(
mut commands: Commands,
views: Extract<Query<(Entity, &Clusters, &Camera)>>,
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
) {
for (entity, clusters, camera) in &views {
if !camera.is_active {
continue;
}
let num_entities: usize = clusters.lights.iter().map(|l| l.entities.len()).sum();
let mut data = Vec::with_capacity(clusters.lights.len() + num_entities);
for cluster_lights in &clusters.lights {
data.push(ExtractedClustersPointLightsElement::ClusterHeader(
cluster_lights.point_light_count as u32,
cluster_lights.spot_light_count as u32,
));
for l in &cluster_lights.entities {
data.push(ExtractedClustersPointLightsElement::LightEntity(*l));
}
}
Accept Bundles for insert and remove. Deprecate insert/remove_bundle (#6039) # Objective Take advantage of the "impl Bundle for Component" changes in #2975 / add the follow up changes discussed there. ## Solution - Change `insert` and `remove` to accept a Bundle instead of a Component (for both Commands and World) - Deprecate `insert_bundle`, `remove_bundle`, and `remove_bundle_intersection` - Add `remove_intersection` --- ## Changelog - Change `insert` and `remove` now accept a Bundle instead of a Component (for both Commands and World) - `insert_bundle` and `remove_bundle` are deprecated ## Migration Guide Replace `insert_bundle` with `insert`: ```rust // Old (0.8) commands.spawn().insert_bundle(SomeBundle::default()); // New (0.9) commands.spawn().insert(SomeBundle::default()); ``` Replace `remove_bundle` with `remove`: ```rust // Old (0.8) commands.entity(some_entity).remove_bundle::<SomeBundle>(); // New (0.9) commands.entity(some_entity).remove::<SomeBundle>(); ``` Replace `remove_bundle_intersection` with `remove_intersection`: ```rust // Old (0.8) world.entity_mut(some_entity).remove_bundle_intersection::<SomeBundle>(); // New (0.9) world.entity_mut(some_entity).remove_intersection::<SomeBundle>(); ``` Consider consolidating as many operations as possible to improve ergonomics and cut down on archetype moves: ```rust // Old (0.8) commands.spawn() .insert_bundle(SomeBundle::default()) .insert(SomeComponent); // New (0.9) - Option 1 commands.spawn().insert(( SomeBundle::default(), SomeComponent, )) // New (0.9) - Option 2 commands.spawn_bundle(( SomeBundle::default(), SomeComponent, )) ``` ## Next Steps Consider changing `spawn` to accept a bundle and deprecate `spawn_bundle`.
2022-09-21 21:47:53 +00:00
commands.get_or_spawn(entity).insert((
ExtractedClustersPointLights { data },
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
ExtractedClusterConfig {
near: clusters.near,
Dynamic light clusters (#3968) # Objective provide some customisation for default cluster setup avoid "cluster index lists is full" in all cases (using a strategy outlined by @superdump) ## Solution Add ClusterConfig enum (which can be inserted into a view at any time) to allow specifying cluster setup with variants: - None (do not do any light assignment - for views which do not require light info, e.g. minimaps etc) - Single (one cluster) - XYZ (explicit cluster counts in each dimension) - FixedZ (most similar to current - specify Z-slices and total, then x and y counts are dynamically determined to give approximately square clusters based on current aspect ratio) Defaults to FixedZ { total: 4096, z: 24 } which is similar to the current setup. Per frame, estimate the number of indices that would be required for the current config and decrease the cluster counts / increase the cluster sizes in the x and y dimensions if the index list would be too small. notes: - I didn't put ClusterConfig in the camera bundles to avoid introducing a dependency from bevy_render to bevy_pbr. the ClusterConfig enum comes with a pbr-centric impl block so i didn't want to move that into bevy_render either. - ~Might want to add None variant to cluster config for views that don't care about lights?~ - Not well tested for orthographic - ~there's a cluster_muck branch on my repo which includes some diagnostics / a modified lighting example which may be useful for tyre-kicking~ (outdated, i will bring it up to date if required) anecdotal timings: FPS on the lighting demo is negligibly better (~5%), maybe due to a small optimisation constraining the light aabb to be in front of the camera FPS on the lighting demo with 100 extra lights added is ~33% faster, and also renders correctly as the cluster index count is no longer exceeded
2022-03-08 04:56:42 +00:00
far: clusters.far,
dimensions: clusters.dimensions,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
},
));
}
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
#[allow(clippy::too_many_arguments)]
2021-06-02 02:59:17 +00:00
pub fn extract_lights(
mut commands: Commands,
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
point_light_shadow_map: Extract<Res<PointLightShadowMap>>,
directional_light_shadow_map: Extract<Res<DirectionalLightShadowMap>>,
global_point_lights: Extract<Res<GlobalVisiblePointLights>>,
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310) # Objective Fixes #4907. Fixes #838. Fixes #5089. Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114 Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities. Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information. ## Solution Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function. Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate. This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views. Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice. Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite. ![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png) Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale. ## Follow up work * Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?. * Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset. * Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc). --- ## Changelog * ComputedVisibility now takes hierarchy visibility into account. * 2D, UI and Light entities now use the ComputedVisibility component. ## Migration Guide If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead: ```rust // before (0.7) fn system(query: Query<&Visibility>) { for visibility in query.iter() { if visibility.is_visible { log!("found visible entity"); } } } // after (0.8) fn system(query: Query<&ComputedVisibility>) { for visibility in query.iter() { if visibility.is_visible() { log!("found visible entity"); } } } ``` Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
point_lights: Extract<
Query<(
&PointLight,
&CubemapVisibleEntities,
&GlobalTransform,
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
&ViewVisibility,
&CubemapFrusta,
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310) # Objective Fixes #4907. Fixes #838. Fixes #5089. Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114 Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities. Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information. ## Solution Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function. Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate. This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views. Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice. Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite. ![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png) Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale. ## Follow up work * Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?. * Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset. * Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc). --- ## Changelog * ComputedVisibility now takes hierarchy visibility into account. * 2D, UI and Light entities now use the ComputedVisibility component. ## Migration Guide If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead: ```rust // before (0.7) fn system(query: Query<&Visibility>) { for visibility in query.iter() { if visibility.is_visible { log!("found visible entity"); } } } // after (0.8) fn system(query: Query<&ComputedVisibility>) { for visibility in query.iter() { if visibility.is_visible() { log!("found visible entity"); } } } ``` Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
)>,
>,
spot_lights: Extract<
Query<(
&SpotLight,
&VisibleEntities,
&GlobalTransform,
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
&ViewVisibility,
&Frustum,
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310) # Objective Fixes #4907. Fixes #838. Fixes #5089. Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114 Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities. Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information. ## Solution Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function. Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate. This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views. Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice. Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite. ![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png) Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale. ## Follow up work * Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?. * Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset. * Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc). --- ## Changelog * ComputedVisibility now takes hierarchy visibility into account. * 2D, UI and Light entities now use the ComputedVisibility component. ## Migration Guide If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead: ```rust // before (0.7) fn system(query: Query<&Visibility>) { for visibility in query.iter() { if visibility.is_visible { log!("found visible entity"); } } } // after (0.8) fn system(query: Query<&ComputedVisibility>) { for visibility in query.iter() { if visibility.is_visible() { log!("found visible entity"); } } } ``` Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
)>,
>,
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
directional_lights: Extract<
Query<
(
Entity,
&DirectionalLight,
&CascadesVisibleEntities,
&Cascades,
&CascadeShadowConfig,
&CascadesFrusta,
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
&GlobalTransform,
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
&ViewVisibility,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
Option<&RenderLayers>,
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
),
Without<SpotLight>,
>,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
>,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
mut previous_point_lights_len: Local<usize>,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
mut previous_spot_lights_len: Local<usize>,
2021-06-02 02:59:17 +00:00
) {
// NOTE: These shadow map resources are extracted here as they are used here too so this avoids
// races between scheduling of ExtractResourceSystems and this system.
if point_light_shadow_map.is_changed() {
commands.insert_resource(point_light_shadow_map.clone());
}
if directional_light_shadow_map.is_changed() {
commands.insert_resource(directional_light_shadow_map.clone());
}
Scale normal bias by texel size (#26) * 3d_scene_pipelined: Use a shallower directional light angle to provoke acne * cornell_box_pipelined: Remove bias tweaks * bevy_pbr2: Simplify shadow biases by moving them to linear depth * bevy_pbr2: Do not use DepthBiasState * bevy_pbr2: Do not use bilinear filtering for sampling depth textures * pbr.wgsl: Remove unnecessary comment * bevy_pbr2: Do manual shadow map depth comparisons for more flexibility * examples: Add shadow_biases_pipelined example This is useful for stress testing biases. * bevy_pbr2: Scale the point light normal bias by the shadow map texel size This allows the normal bias to be small close to the light source where the shadow map texel to screen texel ratio is high, but is appropriately large further away from the light source where the shadow map texel can easily cover multiple screen texels. * shadow_biases_pipelined: Add support for toggling directional / point light * shadow_biases_pipelined: Cleanup * bevy_pbr2: Scale the directional light normal bias by the shadow map texel size * shadow_biases_pipelined: Fit the orthographic projection around the scene * bevy_pbr2: Directional lights should have no shadows outside their projection Before this change, sampling a fragment position from outside the ndc volume would result in the return sample being clamped to the edge in x,y or possibly always casting a shadow for fragment positions past the orthographic projection's far plane. * bevy_pbr2: Fix the default directional light normal bias * Revert "bevy_pbr2: Do manual shadow map depth comparisons for more flexibility" This reverts commit 7df1bab38a42d8a33bc50ca583d4be37bd9c9f0d. * shadow_biases_pipelined: Adjust directional light normal bias in 0.1 increments * pbr.wgsl: Add a couple of clarifying comments * Revert "bevy_pbr2: Do not use bilinear filtering for sampling depth textures" This reverts commit f53baab0232ce218866a45cad6902b470f4cf2c4. * shadow_biases_pipelined: Print usage to terminal
2021-07-19 19:20:59 +00:00
// This is the point light shadow map texel size for one face of the cube as a distance of 1.0
// world unit from the light.
// point_light_texel_size = 2.0 * 1.0 * tan(PI / 4.0) / cube face width in texels
// PI / 4.0 is half the cube face fov, tan(PI / 4.0) = 1.0, so this simplifies to:
// point_light_texel_size = 2.0 / cube face width in texels
// NOTE: When using various PCF kernel sizes, this will need to be adjusted, according to:
// https://catlikecoding.com/unity/tutorials/custom-srp/point-and-spot-shadows/
let point_light_texel_size = 2.0 / point_light_shadow_map.size as f32;
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
let mut point_lights_values = Vec::with_capacity(*previous_point_lights_len);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
for entity in global_point_lights.iter().copied() {
let Ok((point_light, cubemap_visible_entities, transform, view_visibility, frusta)) =
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310) # Objective Fixes #4907. Fixes #838. Fixes #5089. Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114 Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities. Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information. ## Solution Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function. Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate. This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views. Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice. Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite. ![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png) Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale. ## Follow up work * Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?. * Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset. * Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc). --- ## Changelog * ComputedVisibility now takes hierarchy visibility into account. * 2D, UI and Light entities now use the ComputedVisibility component. ## Migration Guide If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead: ```rust // before (0.7) fn system(query: Query<&Visibility>) { for visibility in query.iter() { if visibility.is_visible { log!("found visible entity"); } } } // after (0.8) fn system(query: Query<&ComputedVisibility>) { for visibility in query.iter() { if visibility.is_visible() { log!("found visible entity"); } } } ``` Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
point_lights.get(entity)
else {
continue;
};
if !view_visibility.get() {
continue;
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
// TODO: This is very much not ideal. We should be able to re-use the vector memory.
// However, since exclusive access to the main world in extract is ill-advised, we just clone here.
let render_cubemap_visible_entities = cubemap_visible_entities.clone();
let extracted_point_light = ExtractedPointLight {
color: point_light.color,
// NOTE: Map from luminous power in lumens to luminous intensity in lumens per steradian
// for a point light. See https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower
// for details.
intensity: point_light.intensity / (4.0 * std::f32::consts::PI),
range: point_light.range,
radius: point_light.radius,
transform: *transform,
shadows_enabled: point_light.shadows_enabled,
shadow_depth_bias: point_light.shadow_depth_bias,
// The factor of SQRT_2 is for the worst-case diagonal offset
shadow_normal_bias: point_light.shadow_normal_bias
* point_light_texel_size
* std::f32::consts::SQRT_2,
spot_light_angles: None,
};
point_lights_values.push((
entity,
(
extracted_point_light,
render_cubemap_visible_entities,
(*frusta).clone(),
),
));
2021-06-02 02:59:17 +00:00
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
*previous_point_lights_len = point_lights_values.len();
commands.insert_or_spawn_batch(point_lights_values);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let mut spot_lights_values = Vec::with_capacity(*previous_spot_lights_len);
for entity in global_point_lights.iter().copied() {
if let Ok((spot_light, visible_entities, transform, view_visibility, frustum)) =
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
spot_lights.get(entity)
{
if !view_visibility.get() {
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310) # Objective Fixes #4907. Fixes #838. Fixes #5089. Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114 Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities. Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information. ## Solution Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function. Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate. This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views. Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice. Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite. ![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png) Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale. ## Follow up work * Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?. * Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset. * Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc). --- ## Changelog * ComputedVisibility now takes hierarchy visibility into account. * 2D, UI and Light entities now use the ComputedVisibility component. ## Migration Guide If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead: ```rust // before (0.7) fn system(query: Query<&Visibility>) { for visibility in query.iter() { if visibility.is_visible { log!("found visible entity"); } } } // after (0.8) fn system(query: Query<&ComputedVisibility>) { for visibility in query.iter() { if visibility.is_visible() { log!("found visible entity"); } } } ``` Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
continue;
}
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
// TODO: This is very much not ideal. We should be able to re-use the vector memory.
// However, since exclusive access to the main world in extract is ill-advised, we just clone here.
let render_visible_entities = visible_entities.clone();
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let texel_size =
2.0 * spot_light.outer_angle.tan() / directional_light_shadow_map.size as f32;
spot_lights_values.push((
entity,
(
ExtractedPointLight {
color: spot_light.color,
// NOTE: Map from luminous power in lumens to luminous intensity in lumens per steradian
// for a point light. See https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower
// for details.
// Note: Filament uses a divisor of PI for spot lights. We choose to use the same 4*PI divisor
// in both cases so that toggling between point light and spot light keeps lit areas lit equally,
// which seems least surprising for users
intensity: spot_light.intensity / (4.0 * std::f32::consts::PI),
range: spot_light.range,
radius: spot_light.radius,
transform: *transform,
shadows_enabled: spot_light.shadows_enabled,
shadow_depth_bias: spot_light.shadow_depth_bias,
// The factor of SQRT_2 is for the worst-case diagonal offset
shadow_normal_bias: spot_light.shadow_normal_bias
* texel_size
* std::f32::consts::SQRT_2,
spot_light_angles: Some((spot_light.inner_angle, spot_light.outer_angle)),
},
render_visible_entities,
*frustum,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
),
));
}
}
*previous_spot_lights_len = spot_lights_values.len();
commands.insert_or_spawn_batch(spot_lights_values);
for (
entity,
directional_light,
visible_entities,
cascades,
cascade_config,
frusta,
transform,
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
view_visibility,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
maybe_layers,
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
) in &directional_lights
{
Split `ComputedVisibility` into two components to allow for accurate change detection and speed up visibility propagation (#9497) # Objective Fix #8267. Fixes half of #7840. The `ComputedVisibility` component contains two flags: hierarchy visibility, and view visibility (whether its visible to any cameras). Due to the modular and open-ended way that view visibility is computed, it triggers change detection every single frame, even when the value does not change. Since hierarchy visibility is stored in the same component as view visibility, this means that change detection for inherited visibility is completely broken. At the company I work for, this has become a real issue. We are using change detection to only re-render scenes when necessary. The broken state of change detection for computed visibility means that we have to to rely on the non-inherited `Visibility` component for now. This is workable in the early stages of our project, but since we will inevitably want to use the hierarchy, we will have to either: 1. Roll our own solution for computed visibility. 2. Fix the issue for everyone. ## Solution Split the `ComputedVisibility` component into two: `InheritedVisibilty` and `ViewVisibility`. This allows change detection to behave properly for `InheritedVisibility`. View visiblity is still erratic, although it is less useful to be able to detect changes for this flavor of visibility. Overall, this actually simplifies the API. Since the visibility system consists of self-explaining components, it is much easier to document the behavior and usage. This approach is more modular and "ECS-like" -- one could strip out the `ViewVisibility` component entirely if it's not needed, and rely only on inherited visibility. --- ## Changelog - `ComputedVisibility` has been removed in favor of: `InheritedVisibility` and `ViewVisiblity`. ## Migration Guide The `ComputedVisibilty` component has been split into `InheritedVisiblity` and `ViewVisibility`. Replace any usages of `ComputedVisibility::is_visible_in_hierarchy` with `InheritedVisibility::get`, and replace `ComputedVisibility::is_visible_in_view` with `ViewVisibility::get`. ```rust // Before: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, computed_visibility: ComputedVisibility::default(), }); // After: commands.spawn(VisibilityBundle { visibility: Visibility::Inherited, inherited_visibility: InheritedVisibility::default(), view_visibility: ViewVisibility::default(), }); ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_hierarchy() { // After: fn my_system(q: Query<&InheritedVisibility>) { for inherited_visibility in &q { if inherited_visibility.get() { ``` ```rust // Before: fn my_system(q: Query<&ComputedVisibilty>) { for vis in &q { if vis.is_visible_in_view() { // After: fn my_system(q: Query<&ViewVisibility>) { for view_visibility in &q { if view_visibility.get() { ``` ```rust // Before: fn my_system(mut q: Query<&mut ComputedVisibilty>) { for vis in &mut q { vis.set_visible_in_view(); // After: fn my_system(mut q: Query<&mut ViewVisibility>) { for view_visibility in &mut q { view_visibility.set(); ``` --------- Co-authored-by: Robert Swain <robert.swain@gmail.com>
2023-09-01 13:00:18 +00:00
if !view_visibility.get() {
continue;
}
Make `RenderStage::Extract` run on the render world (#4402) # Objective - Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource. - However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource. - This meant that effectively only one extract which wrote to resources could run at a time. - We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that. ## Solution - Move the extract stage to run on the render world. - Add the main world as a `MainWorld` resource. - Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`. ## Future work It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on. We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519 It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too. ## Todo I still need to add doc comments to `Extract`. --- ## Changelog ### Changed - The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. Resources on the render world can now be accessed using `ResMut` during extract. ### Removed - `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead ## Migration Guide The `Extract` `RenderStage` now runs on the render world (instead of the main world as before). You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it. For example, if previously your extract system looked like: ```rust fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { for cloud in clouds.iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` the new version would be: ```rust fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` The diff is: ```diff --- a/src/clouds.rs +++ b/src/clouds.rs @@ -1,5 +1,5 @@ -fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) { - for cloud in clouds.iter() { +fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) { + for cloud in clouds.value().iter() { commands.get_or_spawn(cloud).insert(Cloud); } } ``` You can now also access resources from the render world using the normal system parameters during `Extract`: ```rust fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) { *render_assets = source_assets.clone(); } ``` Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met. Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
// TODO: As above
let render_visible_entities = visible_entities.clone();
Accept Bundles for insert and remove. Deprecate insert/remove_bundle (#6039) # Objective Take advantage of the "impl Bundle for Component" changes in #2975 / add the follow up changes discussed there. ## Solution - Change `insert` and `remove` to accept a Bundle instead of a Component (for both Commands and World) - Deprecate `insert_bundle`, `remove_bundle`, and `remove_bundle_intersection` - Add `remove_intersection` --- ## Changelog - Change `insert` and `remove` now accept a Bundle instead of a Component (for both Commands and World) - `insert_bundle` and `remove_bundle` are deprecated ## Migration Guide Replace `insert_bundle` with `insert`: ```rust // Old (0.8) commands.spawn().insert_bundle(SomeBundle::default()); // New (0.9) commands.spawn().insert(SomeBundle::default()); ``` Replace `remove_bundle` with `remove`: ```rust // Old (0.8) commands.entity(some_entity).remove_bundle::<SomeBundle>(); // New (0.9) commands.entity(some_entity).remove::<SomeBundle>(); ``` Replace `remove_bundle_intersection` with `remove_intersection`: ```rust // Old (0.8) world.entity_mut(some_entity).remove_bundle_intersection::<SomeBundle>(); // New (0.9) world.entity_mut(some_entity).remove_intersection::<SomeBundle>(); ``` Consider consolidating as many operations as possible to improve ergonomics and cut down on archetype moves: ```rust // Old (0.8) commands.spawn() .insert_bundle(SomeBundle::default()) .insert(SomeComponent); // New (0.9) - Option 1 commands.spawn().insert(( SomeBundle::default(), SomeComponent, )) // New (0.9) - Option 2 commands.spawn_bundle(( SomeBundle::default(), SomeComponent, )) ``` ## Next Steps Consider changing `spawn` to accept a bundle and deprecate `spawn_bundle`.
2022-09-21 21:47:53 +00:00
commands.get_or_spawn(entity).insert((
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
ExtractedDirectionalLight {
color: directional_light.color,
illuminance: directional_light.illuminance,
Take DirectionalLight's GlobalTransform into account when calculating shadow map volume (not just direction) (#6384) # Objective This PR fixes #5789, by enabling movable (and scalable) directional light shadow volumes. ## Solution This PR changes `ExtractedDirectionalLight` to hold a copy of the `DirectionalLight` entity's `GlobalTransform`, instead of just a `direction` vector. This allows the shadow map volume (as defined by the light's `shadow_projection` field) to be transformed honoring translation _and_ scale transforms, and not just rotation. It also augments the texel size calculation (used to determine the `shadow_normal_bias`) so that it now takes into account the upper bound of the x/y/z scale of the `GlobalTransform`. This change makes the directional light extraction code more consistent with point and spot lights (that already use `transform`), and allows easily moving and scaling the shadow volume along with a player entity based on camera distance/angle, immediately enabling more real world use cases until we have a more sophisticated adaptive implementation, such as the one described in #3629. **Note:** While it was previously possible to update the projection achieving a similar effect, depending on the light direction and distance to the origin, the fact that the shadow map camera was always positioned at the origin with a hardcoded `Vec3::Y` up value meant you would get sub-optimal or inconsistent/incorrect results. --- ## Changelog ### Changed - `DirectionalLight` shadow volumes now honor translation and scale transforms ## Migration Guide - If your directional lights were positioned at the origin and not scaled (the default, most common scenario) no changes are needed on your part; it just works as before; - If you previously had a system for dynamically updating directional light shadow projections, you might now be able to simplify your code by updating the directional light entity's transform instead; - In the unlikely scenario that a scene with directional lights that previously rendered shadows correctly has missing shadows, make sure your directional lights are positioned at (0, 0, 0) and are not scaled to a size that's too large or too small.
2022-11-04 20:12:26 +00:00
transform: *transform,
shadows_enabled: directional_light.shadows_enabled,
shadow_depth_bias: directional_light.shadow_depth_bias,
// The factor of SQRT_2 is for the worst-case diagonal offset
shadow_normal_bias: directional_light.shadow_normal_bias * std::f32::consts::SQRT_2,
cascade_shadow_config: cascade_config.clone(),
cascades: cascades.cascades.clone(),
frusta: frusta.frusta.clone(),
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
render_layers: maybe_layers.copied().unwrap_or_default(),
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
},
render_visible_entities,
));
}
2021-06-02 02:59:17 +00:00
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
pub(crate) const POINT_LIGHT_NEAR_Z: f32 = 0.1f32;
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
pub(crate) struct CubeMapFace {
pub(crate) target: Vec3,
pub(crate) up: Vec3,
2021-07-01 23:48:55 +00:00
}
// Cubemap faces are [+X, -X, +Y, -Y, +Z, -Z], per https://www.w3.org/TR/webgpu/#texture-view-creation
// Note: Cubemap coordinates are left-handed y-up, unlike the rest of Bevy.
// See https://registry.khronos.org/vulkan/specs/1.2/html/chap16.html#_cube_map_face_selection
//
// For each cubemap face, we take care to specify the appropriate target/up axis such that the rendered
// texture using Bevy's right-handed y-up coordinate space matches the expected cubemap face in
// left-handed y-up cubemap coordinates.
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
pub(crate) const CUBE_MAP_FACES: [CubeMapFace; 6] = [
// +X
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::X,
up: Vec3::Y,
2021-07-01 23:48:55 +00:00
},
// -X
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::NEG_X,
up: Vec3::Y,
2021-07-01 23:48:55 +00:00
},
// +Y
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::Y,
2021-07-01 23:48:55 +00:00
up: Vec3::Z,
},
// -Y
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::NEG_Y,
up: Vec3::NEG_Z,
2021-07-01 23:48:55 +00:00
},
// +Z (with left-handed conventions, pointing forwards)
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::NEG_Z,
up: Vec3::Y,
2021-07-01 23:48:55 +00:00
},
// -Z (with left-handed conventions, pointing backwards)
2021-07-01 23:48:55 +00:00
CubeMapFace {
target: Vec3::Z,
up: Vec3::Y,
2021-07-01 23:48:55 +00:00
},
];
fn face_index_to_name(face_index: usize) -> &'static str {
match face_index {
0 => "+x",
1 => "-x",
2 => "+y",
3 => "-y",
4 => "+z",
5 => "-z",
_ => "invalid",
}
}
2021-11-22 23:16:36 +00:00
#[derive(Component)]
pub struct ShadowView {
Keep track of when a texture is first cleared (#10325) # Objective - Custom render passes, or future passes in the engine (such as https://github.com/bevyengine/bevy/pull/10164) need a better way to know and indicate to the core passes whether the view color/depth/prepass attachments have been cleared or not yet this frame, to know if they should clear it themselves or load it. ## Solution - For all render targets (depth textures, shadow textures, prepass textures, main textures) use an atomic bool to track whether or not each texture has been cleared this frame. Abstracted away in the new ColorAttachment and DepthAttachment wrappers. --- ## Changelog - Changed `ViewTarget::get_color_attachment()`, removed arguments. - Changed `ViewTarget::get_unsampled_color_attachment()`, removed arguments. - Removed `Camera3d::clear_color`. - Removed `Camera2d::clear_color`. - Added `Camera::clear_color`. - Added `ExtractedCamera::clear_color`. - Added `ColorAttachment` and `DepthAttachment` wrappers. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - Core render passes now track when a texture is first bound as an attachment in order to decide whether to clear or load it. ## Migration Guide - Remove arguments to `ViewTarget::get_color_attachment()` and `ViewTarget::get_unsampled_color_attachment()`. - Configure clear color on `Camera` instead of on `Camera3d` and `Camera2d`. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - `ViewDepthTexture` must now be created via the `new()` method --------- Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
2023-12-31 00:37:37 +00:00
pub depth_attachment: DepthAttachment,
pub pass_name: String,
2021-06-02 02:59:17 +00:00
}
2021-11-22 23:16:36 +00:00
#[derive(Component)]
pub struct ViewShadowBindings {
pub point_light_depth_texture: Texture,
pub point_light_depth_texture_view: TextureView,
pub directional_light_depth_texture: Texture,
pub directional_light_depth_texture_view: TextureView,
}
2021-11-22 23:16:36 +00:00
#[derive(Component)]
pub struct ViewLightEntities {
2021-06-02 02:59:17 +00:00
pub lights: Vec<Entity>,
}
2021-11-22 23:16:36 +00:00
#[derive(Component)]
pub struct ViewLightsUniformOffset {
pub offset: u32,
2021-06-02 02:59:17 +00:00
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
// NOTE: Clustered-forward rendering requires 3 storage buffer bindings so check that
// at least that many are supported using this constant and SupportedBindingType::from_device()
pub const CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT: u32 = 3;
Make `Resource` trait opt-in, requiring `#[derive(Resource)]` V2 (#5577) *This PR description is an edited copy of #5007, written by @alice-i-cecile.* # Objective Follow-up to https://github.com/bevyengine/bevy/pull/2254. The `Resource` trait currently has a blanket implementation for all types that meet its bounds. While ergonomic, this results in several drawbacks: * it is possible to make confusing, silent mistakes such as inserting a function pointer (Foo) rather than a value (Foo::Bar) as a resource * it is challenging to discover if a type is intended to be used as a resource * we cannot later add customization options (see the [RFC](https://github.com/bevyengine/rfcs/blob/main/rfcs/27-derive-component.md) for the equivalent choice for Component). * dependencies can use the same Rust type as a resource in invisibly conflicting ways * raw Rust types used as resources cannot preserve privacy appropriately, as anyone able to access that type can read and write to internal values * we cannot capture a definitive list of possible resources to display to users in an editor ## Notes to reviewers * Review this commit-by-commit; there's effectively no back-tracking and there's a lot of churn in some of these commits. *ira: My commits are not as well organized :')* * I've relaxed the bound on Local to Send + Sync + 'static: I don't think these concerns apply there, so this can keep things simple. Storing e.g. a u32 in a Local is fine, because there's a variable name attached explaining what it does. * I think this is a bad place for the Resource trait to live, but I've left it in place to make reviewing easier. IMO that's best tackled with https://github.com/bevyengine/bevy/issues/4981. ## Changelog `Resource` is no longer automatically implemented for all matching types. Instead, use the new `#[derive(Resource)]` macro. ## Migration Guide Add `#[derive(Resource)]` to all types you are using as a resource. If you are using a third party type as a resource, wrap it in a tuple struct to bypass orphan rules. Consider deriving `Deref` and `DerefMut` to improve ergonomics. `ClearColor` no longer implements `Component`. Using `ClearColor` as a component in 0.8 did nothing. Use the `ClearColorConfig` in the `Camera3d` and `Camera2d` components instead. Co-authored-by: Alice <alice.i.cecile@gmail.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com> Co-authored-by: devil-ira <justthecooldude@gmail.com> Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-08-08 21:36:35 +00:00
#[derive(Resource)]
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
pub struct GlobalLightMeta {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
pub gpu_point_lights: GpuPointLights,
pub entity_to_index: EntityHashMap<usize>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
impl FromWorld for GlobalLightMeta {
fn from_world(world: &mut World) -> Self {
Self::new(
world
.resource::<RenderDevice>()
.get_supported_read_only_binding_type(CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT),
)
}
}
impl GlobalLightMeta {
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
Self {
gpu_point_lights: GpuPointLights::new(buffer_binding_type),
entity_to_index: EntityHashMap::default(),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
}
Make `Resource` trait opt-in, requiring `#[derive(Resource)]` V2 (#5577) *This PR description is an edited copy of #5007, written by @alice-i-cecile.* # Objective Follow-up to https://github.com/bevyengine/bevy/pull/2254. The `Resource` trait currently has a blanket implementation for all types that meet its bounds. While ergonomic, this results in several drawbacks: * it is possible to make confusing, silent mistakes such as inserting a function pointer (Foo) rather than a value (Foo::Bar) as a resource * it is challenging to discover if a type is intended to be used as a resource * we cannot later add customization options (see the [RFC](https://github.com/bevyengine/rfcs/blob/main/rfcs/27-derive-component.md) for the equivalent choice for Component). * dependencies can use the same Rust type as a resource in invisibly conflicting ways * raw Rust types used as resources cannot preserve privacy appropriately, as anyone able to access that type can read and write to internal values * we cannot capture a definitive list of possible resources to display to users in an editor ## Notes to reviewers * Review this commit-by-commit; there's effectively no back-tracking and there's a lot of churn in some of these commits. *ira: My commits are not as well organized :')* * I've relaxed the bound on Local to Send + Sync + 'static: I don't think these concerns apply there, so this can keep things simple. Storing e.g. a u32 in a Local is fine, because there's a variable name attached explaining what it does. * I think this is a bad place for the Resource trait to live, but I've left it in place to make reviewing easier. IMO that's best tackled with https://github.com/bevyengine/bevy/issues/4981. ## Changelog `Resource` is no longer automatically implemented for all matching types. Instead, use the new `#[derive(Resource)]` macro. ## Migration Guide Add `#[derive(Resource)]` to all types you are using as a resource. If you are using a third party type as a resource, wrap it in a tuple struct to bypass orphan rules. Consider deriving `Deref` and `DerefMut` to improve ergonomics. `ClearColor` no longer implements `Component`. Using `ClearColor` as a component in 0.8 did nothing. Use the `ClearColorConfig` in the `Camera3d` and `Camera2d` components instead. Co-authored-by: Alice <alice.i.cecile@gmail.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com> Co-authored-by: devil-ira <justthecooldude@gmail.com> Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-08-08 21:36:35 +00:00
#[derive(Resource, Default)]
2021-06-02 02:59:17 +00:00
pub struct LightMeta {
pub view_gpu_lights: DynamicUniformBuffer<GpuLights>,
2021-06-02 02:59:17 +00:00
}
2021-11-22 23:16:36 +00:00
#[derive(Component)]
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
pub enum LightEntity {
Directional {
light_entity: Entity,
cascade_index: usize,
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
},
Point {
light_entity: Entity,
face_index: usize,
},
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
Spot {
light_entity: Entity,
},
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
}
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
pub fn calculate_cluster_factors(
near: f32,
far: f32,
z_slices: f32,
is_orthographic: bool,
) -> Vec2 {
if is_orthographic {
Vec2::new(-near, z_slices / (-far - -near))
} else {
let z_slices_of_ln_zfar_over_znear = (z_slices - 1.0) / (far / near).ln();
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
Vec2::new(
z_slices_of_ln_zfar_over_znear,
near.ln() * z_slices_of_ln_zfar_over_znear,
)
}
}
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// this method of constructing a basis from a vec3 is used by glam::Vec3::any_orthonormal_pair
// we will also construct it in the fragment shader and need our implementations to match,
// so we reproduce it here to avoid a mismatch if glam changes. we also switch the handedness
// could move this onto transform but it's pretty niche
pub(crate) fn spot_light_view_matrix(transform: &GlobalTransform) -> Mat4 {
// the matrix z_local (opposite of transform.forward())
let fwd_dir = transform.back().extend(0.0);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let sign = 1f32.copysign(fwd_dir.z);
let a = -1.0 / (fwd_dir.z + sign);
let b = fwd_dir.x * fwd_dir.y * a;
let up_dir = Vec4::new(
1.0 + sign * fwd_dir.x * fwd_dir.x * a,
sign * b,
-sign * fwd_dir.x,
0.0,
);
let right_dir = Vec4::new(-b, -sign - fwd_dir.y * fwd_dir.y * a, fwd_dir.y, 0.0);
Mat4::from_cols(
right_dir,
up_dir,
fwd_dir,
transform.translation().extend(1.0),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
)
}
pub(crate) fn spot_light_projection_matrix(angle: f32) -> Mat4 {
// spot light projection FOV is 2x the angle from spot light center to outer edge
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
Mat4::perspective_infinite_reverse_rh(angle * 2.0, 1.0, POINT_LIGHT_NEAR_Z)
}
2021-07-08 03:01:41 +00:00
#[allow(clippy::too_many_arguments)]
2021-06-02 02:59:17 +00:00
pub fn prepare_lights(
mut commands: Commands,
mut texture_cache: ResMut<TextureCache>,
2021-06-21 23:28:52 +00:00
render_device: Res<RenderDevice>,
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
render_queue: Res<RenderQueue>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
mut global_light_meta: ResMut<GlobalLightMeta>,
2021-06-02 02:59:17 +00:00
mut light_meta: ResMut<LightMeta>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
views: Query<
Implement minimal reflection probes (fixed macOS, iOS, and Android). (#11366) This pull request re-submits #10057, which was backed out for breaking macOS, iOS, and Android. I've tested this version on macOS and Android and on the iOS simulator. # Objective This pull request implements *reflection probes*, which generalize environment maps to allow for multiple environment maps in the same scene, each of which has an axis-aligned bounding box. This is a standard feature of physically-based renderers and was inspired by [the corresponding feature in Blender's Eevee renderer]. ## Solution This is a minimal implementation of reflection probes that allows artists to define cuboid bounding regions associated with environment maps. For every view, on every frame, a system builds up a list of the nearest 4 reflection probes that are within the view's frustum and supplies that list to the shader. The PBR fragment shader searches through the list, finds the first containing reflection probe, and uses it for indirect lighting, falling back to the view's environment map if none is found. Both forward and deferred renderers are fully supported. A reflection probe is an entity with a pair of components, *LightProbe* and *EnvironmentMapLight* (as well as the standard *SpatialBundle*, to position it in the world). The *LightProbe* component (along with the *Transform*) defines the bounding region, while the *EnvironmentMapLight* component specifies the associated diffuse and specular cubemaps. A frequent question is "why two components instead of just one?" The advantages of this setup are: 1. It's readily extensible to other types of light probes, in particular *irradiance volumes* (also known as ambient cubes or voxel global illumination), which use the same approach of bounding cuboids. With a single component that applies to both reflection probes and irradiance volumes, we can share the logic that implements falloff and blending between multiple light probes between both of those features. 2. It reduces duplication between the existing *EnvironmentMapLight* and these new reflection probes. Systems can treat environment maps attached to cameras the same way they treat environment maps applied to reflection probes if they wish. Internally, we gather up all environment maps in the scene and place them in a cubemap array. At present, this means that all environment maps must have the same size, mipmap count, and texture format. A warning is emitted if this restriction is violated. We could potentially relax this in the future as part of the automatic mipmap generation work, which could easily do texture format conversion as part of its preprocessing. An easy way to generate reflection probe cubemaps is to bake them in Blender and use the `export-blender-gi` tool that's part of the [`bevy-baked-gi`] project. This tool takes a `.blend` file containing baked cubemaps as input and exports cubemap images, pre-filtered with an embedded fork of the [glTF IBL Sampler], alongside a corresponding `.scn.ron` file that the scene spawner can use to recreate the reflection probes. Note that this is intentionally a minimal implementation, to aid reviewability. Known issues are: * Reflection probes are basically unsupported on WebGL 2, because WebGL 2 has no cubemap arrays. (Strictly speaking, you can have precisely one reflection probe in the scene if you have no other cubemaps anywhere, but this isn't very useful.) * Reflection probes have no falloff, so reflections will abruptly change when objects move from one bounding region to another. * As mentioned before, all cubemaps in the world of a given type (diffuse or specular) must have the same size, format, and mipmap count. Future work includes: * Blending between multiple reflection probes. * A falloff/fade-out region so that reflected objects disappear gradually instead of vanishing all at once. * Irradiance volumes for voxel-based global illumination. This should reuse much of the reflection probe logic, as they're both GI techniques based on cuboid bounding regions. * Support for WebGL 2, by breaking batches when reflection probes are used. These issues notwithstanding, I think it's best to land this with roughly the current set of functionality, because this patch is useful as is and adding everything above would make the pull request significantly larger and harder to review. --- ## Changelog ### Added * A new *LightProbe* component is available that specifies a bounding region that an *EnvironmentMapLight* applies to. The combination of a *LightProbe* and an *EnvironmentMapLight* offers *reflection probe* functionality similar to that available in other engines. [the corresponding feature in Blender's Eevee renderer]: https://docs.blender.org/manual/en/latest/render/eevee/light_probes/reflection_cubemaps.html [`bevy-baked-gi`]: https://github.com/pcwalton/bevy-baked-gi [glTF IBL Sampler]: https://github.com/KhronosGroup/glTF-IBL-Sampler
2024-01-19 07:33:52 +00:00
(Entity, &ExtractedView, &ExtractedClusterConfig),
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
With<RenderPhase<Transparent3d>>,
>,
ambient_light: Res<AmbientLight>,
point_light_shadow_map: Res<PointLightShadowMap>,
directional_light_shadow_map: Res<DirectionalLightShadowMap>,
mut max_directional_lights_warning_emitted: Local<bool>,
mut max_cascades_per_light_warning_emitted: Local<bool>,
point_lights: Query<(
Entity,
&ExtractedPointLight,
AnyOf<(&CubemapFrusta, &Frustum)>,
)>,
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
directional_lights: Query<(Entity, &ExtractedDirectionalLight)>,
2021-06-02 02:59:17 +00:00
) {
let views_iter = views.iter();
let views_count = views_iter.len();
let Some(mut view_gpu_lights_writer) =
light_meta
.view_gpu_lights
.get_writer(views_count, &render_device, &render_queue)
else {
return;
};
2021-06-02 02:59:17 +00:00
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
// Pre-calculate for PointLights
let cube_face_projection =
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
Mat4::perspective_infinite_reverse_rh(std::f32::consts::FRAC_PI_2, 1.0, POINT_LIGHT_NEAR_Z);
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
let cube_face_rotations = CUBE_MAP_FACES
.iter()
.map(|CubeMapFace { target, up }| Transform::IDENTITY.looking_at(*target, *up))
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
.collect::<Vec<_>>();
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
global_light_meta.entity_to_index.clear();
let mut point_lights: Vec<_> = point_lights.iter().collect::<Vec<_>>();
let mut directional_lights: Vec<_> = directional_lights.iter().collect::<Vec<_>>();
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
))]
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let max_texture_array_layers = render_device.limits().max_texture_array_layers as usize;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
))]
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let max_texture_cubes = max_texture_array_layers / 6;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let max_texture_array_layers = 1;
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let max_texture_cubes = 1;
if !*max_directional_lights_warning_emitted && directional_lights.len() > MAX_DIRECTIONAL_LIGHTS
{
warn!(
"The amount of directional lights of {} is exceeding the supported limit of {}.",
directional_lights.len(),
MAX_DIRECTIONAL_LIGHTS
);
*max_directional_lights_warning_emitted = true;
}
if !*max_cascades_per_light_warning_emitted
&& directional_lights
.iter()
.any(|(_, light)| light.cascade_shadow_config.bounds.len() > MAX_CASCADES_PER_LIGHT)
{
warn!(
"The number of cascades configured for a directional light exceeds the supported limit of {}.",
MAX_CASCADES_PER_LIGHT
);
*max_cascades_per_light_warning_emitted = true;
}
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let point_light_count = point_lights
.iter()
.filter(|light| light.1.spot_light_angles.is_none())
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.count();
let point_light_shadow_maps_count = point_lights
.iter()
.filter(|light| light.1.shadows_enabled && light.1.spot_light_angles.is_none())
.count()
.min(max_texture_cubes);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let directional_shadow_enabled_count = directional_lights
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.iter()
.take(MAX_DIRECTIONAL_LIGHTS)
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.filter(|(_, light)| light.shadows_enabled)
.count()
.min(max_texture_array_layers / MAX_CASCADES_PER_LIGHT);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let spot_light_shadow_maps_count = point_lights
.iter()
.filter(|(_, light, _)| light.shadows_enabled && light.spot_light_angles.is_some())
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.count()
.min(max_texture_array_layers - directional_shadow_enabled_count * MAX_CASCADES_PER_LIGHT);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// Sort lights by
// - point-light vs spot-light, so that we can iterate point lights and spot lights in contiguous blocks in the fragment shader,
// - then those with shadows enabled first, so that the index can be used to render at most `point_light_shadow_maps_count`
// point light shadows and `spot_light_shadow_maps_count` spot light shadow maps,
// - then by entity as a stable key to ensure that a consistent set of lights are chosen if the light count limit is exceeded.
point_lights.sort_by(|(entity_1, light_1, _), (entity_2, light_2, _)| {
point_light_order(
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
(
entity_1,
&light_1.shadows_enabled,
&light_1.spot_light_angles.is_some(),
),
(
entity_2,
&light_2.shadows_enabled,
&light_2.spot_light_angles.is_some(),
),
)
});
// Sort lights by
// - those with shadows enabled first, so that the index can be used to render at most `directional_light_shadow_maps_count`
// directional light shadows
// - then by entity as a stable key to ensure that a consistent set of lights are chosen if the light count limit is exceeded.
directional_lights.sort_by(|(entity_1, light_1), (entity_2, light_2)| {
directional_light_order(
(entity_1, &light_1.shadows_enabled),
(entity_2, &light_2.shadows_enabled),
)
});
if global_light_meta.entity_to_index.capacity() < point_lights.len() {
global_light_meta
.entity_to_index
.reserve(point_lights.len());
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
let mut gpu_point_lights = Vec::new();
for (index, &(entity, light, _)) in point_lights.iter().enumerate() {
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
let mut flags = PointLightFlags::NONE;
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// Lights are sorted, shadow enabled lights are first
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
if light.shadows_enabled
&& (index < point_light_shadow_maps_count
|| (light.spot_light_angles.is_some()
&& index - point_light_count < spot_light_shadow_maps_count))
{
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
flags |= PointLightFlags::SHADOWS_ENABLED;
}
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let (light_custom_data, spot_light_tan_angle) = match light.spot_light_angles {
Some((inner, outer)) => {
let light_direction = light.transform.forward();
if light_direction.y.is_sign_negative() {
flags |= PointLightFlags::SPOT_LIGHT_Y_NEGATIVE;
}
let cos_outer = outer.cos();
let spot_scale = 1.0 / f32::max(inner.cos() - cos_outer, 1e-4);
let spot_offset = -cos_outer * spot_scale;
(
// For spot lights: the direction (x,z), spot_scale and spot_offset
light_direction.xz().extend(spot_scale).extend(spot_offset),
outer.tan(),
)
}
None => {
(
// For point lights: the lower-right 2x2 values of the projection matrix [2][2] [2][3] [3][2] [3][3]
Vec4::new(
cube_face_projection.z_axis.z,
cube_face_projection.z_axis.w,
cube_face_projection.w_axis.z,
cube_face_projection.w_axis.w,
),
// unused
0.0,
)
}
};
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
gpu_point_lights.push(GpuPointLight {
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
light_custom_data,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
// premultiply color by intensity
// we don't use the alpha at all, so no reason to multiply only [0..3]
color_inverse_square_range: (Vec4::from_slice(&light.color.as_linear_rgba_f32())
* light.intensity)
.xyz()
.extend(1.0 / (light.range * light.range)),
position_radius: light.transform.translation().extend(light.radius),
flags: flags.bits(),
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
shadow_depth_bias: light.shadow_depth_bias,
shadow_normal_bias: light.shadow_normal_bias,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
spot_light_tan_angle,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
});
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
global_light_meta.entity_to_index.insert(entity, index);
}
let mut gpu_directional_lights = [GpuDirectionalLight::default(); MAX_DIRECTIONAL_LIGHTS];
let mut num_directional_cascades_enabled = 0usize;
for (index, (_light_entity, light)) in directional_lights
.iter()
.enumerate()
.take(MAX_DIRECTIONAL_LIGHTS)
{
let mut flags = DirectionalLightFlags::NONE;
// Lights are sorted, shadow enabled lights are first
if light.shadows_enabled && (index < directional_shadow_enabled_count) {
flags |= DirectionalLightFlags::SHADOWS_ENABLED;
}
let num_cascades = light
.cascade_shadow_config
.bounds
.len()
.min(MAX_CASCADES_PER_LIGHT);
gpu_directional_lights[index] = GpuDirectionalLight {
// Filled in later.
cascades: [GpuDirectionalCascade::default(); MAX_CASCADES_PER_LIGHT],
// premultiply color by illuminance
// we don't use the alpha at all, so no reason to multiply only [0..3]
color: Vec4::from_slice(&light.color.as_linear_rgba_f32()) * light.illuminance,
// direction is negated to be ready for N.L
dir_to_light: light.transform.back(),
flags: flags.bits(),
shadow_depth_bias: light.shadow_depth_bias,
shadow_normal_bias: light.shadow_normal_bias,
num_cascades: num_cascades as u32,
cascades_overlap_proportion: light.cascade_shadow_config.overlap_proportion,
depth_texture_base_index: num_directional_cascades_enabled as u32,
light renderlayers (#10742) # Objective add `RenderLayers` awareness to lights. lights default to `RenderLayers::layer(0)`, and must intersect the camera entity's `RenderLayers` in order to affect the camera's output. note that lights already use renderlayers to filter meshes for shadow casting. this adds filtering lights per view based on intersection of camera layers and light layers. fixes #3462 ## Solution PointLights and SpotLights are assigned to individual views in `assign_lights_to_clusters`, so we simply cull the lights which don't match the view layers in that function. DirectionalLights are global, so we - add the light layers to the `DirectionalLight` struct - add the view layers to the `ViewUniform` struct - check for intersection before processing the light in `apply_pbr_lighting` potential issue: when mesh/light layers are smaller than the view layers weird results can occur. e.g: camera = layers 1+2 light = layers 1 mesh = layers 2 the mesh does not cast shadows wrt the light as (1 & 2) == 0. the light affects the view as (1+2 & 1) != 0. the view renders the mesh as (1+2 & 2) != 0. so the mesh is rendered and lit, but does not cast a shadow. this could be fixed (so that the light would not affect the mesh in that view) by adding the light layers to the point and spot light structs, but i think the setup is pretty unusual, and space is at a premium in those structs (adding 4 bytes more would reduce the webgl point+spot light max count to 240 from 256). I think typical usage is for cameras to have a single layer, and meshes/lights to maybe have multiple layers to render to e.g. minimaps as well as primary views. if there is a good use case for the above setup and we should support it, please let me know. --- ## Migration Guide Lights no longer affect all `RenderLayers` by default, now like cameras and meshes they default to `RenderLayers::layer(0)`. To recover the previous behaviour and have all lights affect all views, add a `RenderLayers::all()` component to the light entity.
2023-12-12 19:45:37 +00:00
render_layers: light.render_layers.bits(),
};
if index < directional_shadow_enabled_count {
num_directional_cascades_enabled += num_cascades;
}
}
global_light_meta.gpu_point_lights.set(gpu_point_lights);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
global_light_meta
.gpu_point_lights
.write_buffer(&render_device, &render_queue);
2021-06-02 02:59:17 +00:00
// set up light data for each view
Implement minimal reflection probes (fixed macOS, iOS, and Android). (#11366) This pull request re-submits #10057, which was backed out for breaking macOS, iOS, and Android. I've tested this version on macOS and Android and on the iOS simulator. # Objective This pull request implements *reflection probes*, which generalize environment maps to allow for multiple environment maps in the same scene, each of which has an axis-aligned bounding box. This is a standard feature of physically-based renderers and was inspired by [the corresponding feature in Blender's Eevee renderer]. ## Solution This is a minimal implementation of reflection probes that allows artists to define cuboid bounding regions associated with environment maps. For every view, on every frame, a system builds up a list of the nearest 4 reflection probes that are within the view's frustum and supplies that list to the shader. The PBR fragment shader searches through the list, finds the first containing reflection probe, and uses it for indirect lighting, falling back to the view's environment map if none is found. Both forward and deferred renderers are fully supported. A reflection probe is an entity with a pair of components, *LightProbe* and *EnvironmentMapLight* (as well as the standard *SpatialBundle*, to position it in the world). The *LightProbe* component (along with the *Transform*) defines the bounding region, while the *EnvironmentMapLight* component specifies the associated diffuse and specular cubemaps. A frequent question is "why two components instead of just one?" The advantages of this setup are: 1. It's readily extensible to other types of light probes, in particular *irradiance volumes* (also known as ambient cubes or voxel global illumination), which use the same approach of bounding cuboids. With a single component that applies to both reflection probes and irradiance volumes, we can share the logic that implements falloff and blending between multiple light probes between both of those features. 2. It reduces duplication between the existing *EnvironmentMapLight* and these new reflection probes. Systems can treat environment maps attached to cameras the same way they treat environment maps applied to reflection probes if they wish. Internally, we gather up all environment maps in the scene and place them in a cubemap array. At present, this means that all environment maps must have the same size, mipmap count, and texture format. A warning is emitted if this restriction is violated. We could potentially relax this in the future as part of the automatic mipmap generation work, which could easily do texture format conversion as part of its preprocessing. An easy way to generate reflection probe cubemaps is to bake them in Blender and use the `export-blender-gi` tool that's part of the [`bevy-baked-gi`] project. This tool takes a `.blend` file containing baked cubemaps as input and exports cubemap images, pre-filtered with an embedded fork of the [glTF IBL Sampler], alongside a corresponding `.scn.ron` file that the scene spawner can use to recreate the reflection probes. Note that this is intentionally a minimal implementation, to aid reviewability. Known issues are: * Reflection probes are basically unsupported on WebGL 2, because WebGL 2 has no cubemap arrays. (Strictly speaking, you can have precisely one reflection probe in the scene if you have no other cubemaps anywhere, but this isn't very useful.) * Reflection probes have no falloff, so reflections will abruptly change when objects move from one bounding region to another. * As mentioned before, all cubemaps in the world of a given type (diffuse or specular) must have the same size, format, and mipmap count. Future work includes: * Blending between multiple reflection probes. * A falloff/fade-out region so that reflected objects disappear gradually instead of vanishing all at once. * Irradiance volumes for voxel-based global illumination. This should reuse much of the reflection probe logic, as they're both GI techniques based on cuboid bounding regions. * Support for WebGL 2, by breaking batches when reflection probes are used. These issues notwithstanding, I think it's best to land this with roughly the current set of functionality, because this patch is useful as is and adding everything above would make the pull request significantly larger and harder to review. --- ## Changelog ### Added * A new *LightProbe* component is available that specifies a bounding region that an *EnvironmentMapLight* applies to. The combination of a *LightProbe* and an *EnvironmentMapLight* offers *reflection probe* functionality similar to that available in other engines. [the corresponding feature in Blender's Eevee renderer]: https://docs.blender.org/manual/en/latest/render/eevee/light_probes/reflection_cubemaps.html [`bevy-baked-gi`]: https://github.com/pcwalton/bevy-baked-gi [glTF IBL Sampler]: https://github.com/KhronosGroup/glTF-IBL-Sampler
2024-01-19 07:33:52 +00:00
for (entity, extracted_view, clusters) in &views {
let point_light_depth_texture = texture_cache.get(
&render_device,
TextureDescriptor {
size: Extent3d {
width: point_light_shadow_map.size as u32,
height: point_light_shadow_map.size as u32,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
depth_or_array_layers: point_light_shadow_maps_count.max(1) as u32 * 6,
},
mip_level_count: 1,
sample_count: 1,
dimension: TextureDimension::D2,
Deferred Renderer (#9258) # Objective - Add a [Deferred Renderer](https://en.wikipedia.org/wiki/Deferred_shading) to Bevy. - This allows subsequent passes to access per pixel material information before/during shading. - Accessing this per pixel material information is needed for some features, like GI. It also makes other features (ex. Decals) simpler to implement and/or improves their capability. There are multiple approaches to accomplishing this. The deferred shading approach works well given the limitations of WebGPU and WebGL2. Motivation: [I'm working on a GI solution for Bevy](https://youtu.be/eH1AkL-mwhI) # Solution - The deferred renderer is implemented with a prepass and a deferred lighting pass. - The prepass renders opaque objects into the Gbuffer attachment (`Rgba32Uint`). The PBR shader generates a `PbrInput` in mostly the same way as the forward implementation and then [packs it into the Gbuffer](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L168). - The deferred lighting pass unpacks the `PbrInput` and [feeds it into the pbr() function](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L65), then outputs the shaded color data. - There is now a resource [DefaultOpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L599) that can be used to set the default render method for opaque materials. If materials return `None` from [opaque_render_method()](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L131) the `DefaultOpaqueRendererMethod` will be used. Otherwise, custom materials can also explicitly choose to only support Deferred or Forward by returning the respective [OpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L603) - Deferred materials can be used seamlessly along with both opaque and transparent forward rendered materials in the same scene. The [deferred rendering example](https://github.com/DGriffin91/bevy/blob/deferred/examples/3d/deferred_rendering.rs) does this. - The deferred renderer does not support MSAA. If any deferred materials are used, MSAA must be disabled. Both TAA and FXAA are supported. - Deferred rendering supports WebGL2/WebGPU. ## Custom deferred materials - Custom materials can support both deferred and forward at the same time. The [StandardMaterial](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L166) does this. So does [this example](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56). - Custom deferred materials that require PBR lighting can create a `PbrInput`, write it to the deferred GBuffer and let it be rendered by the `PBRDeferredLightingPlugin`. - Custom deferred materials that require custom lighting have two options: 1. Use the base_color channel of the `PbrInput` combined with the `STANDARD_MATERIAL_FLAGS_UNLIT_BIT` flag. [Example.](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56) (If the unlit bit is set, the base_color is stored as RGB9E5 for extra precision) 2. A Custom Deferred Lighting pass can be created, either overriding the default, or running in addition. The a depth buffer is used to limit rendering to only the required fragments for each deferred lighting pass. Materials can set their respective depth id via the [deferred_lighting_pass_id](https://github.com/DGriffin91/bevy/blob/b79182d2a32cac28c4213c2457a53ac2cc885332/crates/bevy_pbr/src/prepass/prepass_io.wgsl#L95) attachment. The custom deferred lighting pass plugin can then set [its corresponding depth](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L37). Then with the lighting pass using [CompareFunction::Equal](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/mod.rs#L335), only the fragments with a depth that equal the corresponding depth written in the material will be rendered. Custom deferred lighting plugins can also be created to render the StandardMaterial. The default deferred lighting plugin can be bypassed with `DefaultPlugins.set(PBRDeferredLightingPlugin { bypass: true })` --------- Co-authored-by: nickrart <nickolas.g.russell@gmail.com>
2023-10-12 22:10:38 +00:00
format: CORE_3D_DEPTH_FORMAT,
label: Some("point_light_shadow_map_texture"),
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
Wgpu 0.15 (#7356) # Objective Update Bevy to wgpu 0.15. ## Changelog - Update to wgpu 0.15, wgpu-hal 0.15.1, and naga 0.11 - Users can now use the [DirectX Shader Compiler](https://github.com/microsoft/DirectXShaderCompiler) (DXC) on Windows with DX12 for faster shader compilation and ShaderModel 6.0+ support (requires `dxcompiler.dll` and `dxil.dll`, which are included in DXC downloads from [here](https://github.com/microsoft/DirectXShaderCompiler/releases/latest)) ## Migration Guide ### WGSL Top-Level `let` is now `const` All top level constants are now declared with `const`, catching up with the wgsl spec. `let` is no longer allowed at the global scope, only within functions. ```diff -let SOME_CONSTANT = 12.0; +const SOME_CONSTANT = 12.0; ``` #### `TextureDescriptor` and `SurfaceConfiguration` now requires a `view_formats` field The new `view_formats` field in the `TextureDescriptor` is used to specify a list of formats the texture can be re-interpreted to in a texture view. Currently only changing srgb-ness is allowed (ex. `Rgba8Unorm` <=> `Rgba8UnormSrgb`). You should set `view_formats` to `&[]` (empty) unless you have a specific reason not to. #### The DirectX Shader Compiler (DXC) is now supported on DX12 DXC is now the default shader compiler when using the DX12 backend. DXC is Microsoft's replacement for their legacy FXC compiler, and is faster, less buggy, and allows for modern shader features to be used (ShaderModel 6.0+). DXC requires `dxcompiler.dll` and `dxil.dll` to be available, otherwise it will log a warning and fall back to FXC. You can get `dxcompiler.dll` and `dxil.dll` by downloading the latest release from [Microsoft's DirectXShaderCompiler github repo](https://github.com/microsoft/DirectXShaderCompiler/releases/latest) and copying them into your project's root directory. These must be included when you distribute your Bevy game/app/etc if you plan on supporting the DX12 backend and are using DXC. `WgpuSettings` now has a `dx12_shader_compiler` field which can be used to choose between either FXC or DXC (if you pass None for the paths for DXC, it will check for the .dlls in the working directory).
2023-01-29 20:27:30 +00:00
view_formats: &[],
},
);
let directional_light_depth_texture = texture_cache.get(
2021-06-21 23:28:52 +00:00
&render_device,
2021-06-02 02:59:17 +00:00
TextureDescriptor {
size: Extent3d {
width: (directional_light_shadow_map.size as u32)
.min(render_device.limits().max_texture_dimension_2d),
height: (directional_light_shadow_map.size as u32)
.min(render_device.limits().max_texture_dimension_2d),
depth_or_array_layers: (num_directional_cascades_enabled
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
+ spot_light_shadow_maps_count)
.max(1) as u32,
},
2021-06-02 02:59:17 +00:00
mip_level_count: 1,
sample_count: 1,
dimension: TextureDimension::D2,
Deferred Renderer (#9258) # Objective - Add a [Deferred Renderer](https://en.wikipedia.org/wiki/Deferred_shading) to Bevy. - This allows subsequent passes to access per pixel material information before/during shading. - Accessing this per pixel material information is needed for some features, like GI. It also makes other features (ex. Decals) simpler to implement and/or improves their capability. There are multiple approaches to accomplishing this. The deferred shading approach works well given the limitations of WebGPU and WebGL2. Motivation: [I'm working on a GI solution for Bevy](https://youtu.be/eH1AkL-mwhI) # Solution - The deferred renderer is implemented with a prepass and a deferred lighting pass. - The prepass renders opaque objects into the Gbuffer attachment (`Rgba32Uint`). The PBR shader generates a `PbrInput` in mostly the same way as the forward implementation and then [packs it into the Gbuffer](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L168). - The deferred lighting pass unpacks the `PbrInput` and [feeds it into the pbr() function](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L65), then outputs the shaded color data. - There is now a resource [DefaultOpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L599) that can be used to set the default render method for opaque materials. If materials return `None` from [opaque_render_method()](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L131) the `DefaultOpaqueRendererMethod` will be used. Otherwise, custom materials can also explicitly choose to only support Deferred or Forward by returning the respective [OpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L603) - Deferred materials can be used seamlessly along with both opaque and transparent forward rendered materials in the same scene. The [deferred rendering example](https://github.com/DGriffin91/bevy/blob/deferred/examples/3d/deferred_rendering.rs) does this. - The deferred renderer does not support MSAA. If any deferred materials are used, MSAA must be disabled. Both TAA and FXAA are supported. - Deferred rendering supports WebGL2/WebGPU. ## Custom deferred materials - Custom materials can support both deferred and forward at the same time. The [StandardMaterial](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L166) does this. So does [this example](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56). - Custom deferred materials that require PBR lighting can create a `PbrInput`, write it to the deferred GBuffer and let it be rendered by the `PBRDeferredLightingPlugin`. - Custom deferred materials that require custom lighting have two options: 1. Use the base_color channel of the `PbrInput` combined with the `STANDARD_MATERIAL_FLAGS_UNLIT_BIT` flag. [Example.](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56) (If the unlit bit is set, the base_color is stored as RGB9E5 for extra precision) 2. A Custom Deferred Lighting pass can be created, either overriding the default, or running in addition. The a depth buffer is used to limit rendering to only the required fragments for each deferred lighting pass. Materials can set their respective depth id via the [deferred_lighting_pass_id](https://github.com/DGriffin91/bevy/blob/b79182d2a32cac28c4213c2457a53ac2cc885332/crates/bevy_pbr/src/prepass/prepass_io.wgsl#L95) attachment. The custom deferred lighting pass plugin can then set [its corresponding depth](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L37). Then with the lighting pass using [CompareFunction::Equal](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/mod.rs#L335), only the fragments with a depth that equal the corresponding depth written in the material will be rendered. Custom deferred lighting plugins can also be created to render the StandardMaterial. The default deferred lighting plugin can be bypassed with `DefaultPlugins.set(PBRDeferredLightingPlugin { bypass: true })` --------- Co-authored-by: nickrart <nickolas.g.russell@gmail.com>
2023-10-12 22:10:38 +00:00
format: CORE_3D_DEPTH_FORMAT,
label: Some("directional_light_shadow_map_texture"),
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
Wgpu 0.15 (#7356) # Objective Update Bevy to wgpu 0.15. ## Changelog - Update to wgpu 0.15, wgpu-hal 0.15.1, and naga 0.11 - Users can now use the [DirectX Shader Compiler](https://github.com/microsoft/DirectXShaderCompiler) (DXC) on Windows with DX12 for faster shader compilation and ShaderModel 6.0+ support (requires `dxcompiler.dll` and `dxil.dll`, which are included in DXC downloads from [here](https://github.com/microsoft/DirectXShaderCompiler/releases/latest)) ## Migration Guide ### WGSL Top-Level `let` is now `const` All top level constants are now declared with `const`, catching up with the wgsl spec. `let` is no longer allowed at the global scope, only within functions. ```diff -let SOME_CONSTANT = 12.0; +const SOME_CONSTANT = 12.0; ``` #### `TextureDescriptor` and `SurfaceConfiguration` now requires a `view_formats` field The new `view_formats` field in the `TextureDescriptor` is used to specify a list of formats the texture can be re-interpreted to in a texture view. Currently only changing srgb-ness is allowed (ex. `Rgba8Unorm` <=> `Rgba8UnormSrgb`). You should set `view_formats` to `&[]` (empty) unless you have a specific reason not to. #### The DirectX Shader Compiler (DXC) is now supported on DX12 DXC is now the default shader compiler when using the DX12 backend. DXC is Microsoft's replacement for their legacy FXC compiler, and is faster, less buggy, and allows for modern shader features to be used (ShaderModel 6.0+). DXC requires `dxcompiler.dll` and `dxil.dll` to be available, otherwise it will log a warning and fall back to FXC. You can get `dxcompiler.dll` and `dxil.dll` by downloading the latest release from [Microsoft's DirectXShaderCompiler github repo](https://github.com/microsoft/DirectXShaderCompiler/releases/latest) and copying them into your project's root directory. These must be included when you distribute your Bevy game/app/etc if you plan on supporting the DX12 backend and are using DXC. `WgpuSettings` now has a `dx12_shader_compiler` field which can be used to choose between either FXC or DXC (if you pass None for the paths for DXC, it will check for the .dlls in the working directory).
2023-01-29 20:27:30 +00:00
view_formats: &[],
2021-06-02 02:59:17 +00:00
},
);
let mut view_lights = Vec::new();
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
let is_orthographic = extracted_view.projection.w_axis.w == 1.0;
let cluster_factors_zw = calculate_cluster_factors(
clusters.near,
Dynamic light clusters (#3968) # Objective provide some customisation for default cluster setup avoid "cluster index lists is full" in all cases (using a strategy outlined by @superdump) ## Solution Add ClusterConfig enum (which can be inserted into a view at any time) to allow specifying cluster setup with variants: - None (do not do any light assignment - for views which do not require light info, e.g. minimaps etc) - Single (one cluster) - XYZ (explicit cluster counts in each dimension) - FixedZ (most similar to current - specify Z-slices and total, then x and y counts are dynamically determined to give approximately square clusters based on current aspect ratio) Defaults to FixedZ { total: 4096, z: 24 } which is similar to the current setup. Per frame, estimate the number of indices that would be required for the current config and decrease the cluster counts / increase the cluster sizes in the x and y dimensions if the index list would be too small. notes: - I didn't put ClusterConfig in the camera bundles to avoid introducing a dependency from bevy_render to bevy_pbr. the ClusterConfig enum comes with a pbr-centric impl block so i didn't want to move that into bevy_render either. - ~Might want to add None variant to cluster config for views that don't care about lights?~ - Not well tested for orthographic - ~there's a cluster_muck branch on my repo which includes some diagnostics / a modified lighting example which may be useful for tyre-kicking~ (outdated, i will bring it up to date if required) anecdotal timings: FPS on the lighting demo is negligibly better (~5%), maybe due to a small optimisation constraining the light aabb to be in front of the camera FPS on the lighting demo with 100 extra lights added is ~33% faster, and also renders correctly as the cluster index count is no longer exceeded
2022-03-08 04:56:42 +00:00
clusters.far,
clusters.dimensions.z as f32,
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
is_orthographic,
);
let n_clusters = clusters.dimensions.x * clusters.dimensions.y * clusters.dimensions.z;
let mut gpu_lights = GpuLights {
directional_lights: gpu_directional_lights,
ambient_color: Vec4::from_slice(&ambient_light.color.as_linear_rgba_f32())
* ambient_light.brightness,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
cluster_factors: Vec4::new(
clusters.dimensions.x as f32 / extracted_view.viewport.z as f32,
clusters.dimensions.y as f32 / extracted_view.viewport.w as f32,
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
cluster_factors_zw.x,
cluster_factors_zw.y,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
),
cluster_dimensions: clusters.dimensions.extend(n_clusters),
n_directional_lights: directional_lights.iter().len() as u32,
// spotlight shadow maps are stored in the directional light array, starting at num_directional_cascades_enabled.
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// the spot lights themselves start in the light array at point_light_count. so to go from light
// index to shadow map index, we need to subtract point light count and add directional shadowmap count.
spot_light_shadowmap_offset: num_directional_cascades_enabled as i32
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
- point_light_count as i32,
2021-06-02 02:59:17 +00:00
};
// TODO: this should select lights based on relevance to the view instead of the first ones that show up in a query
for &(light_entity, light, (point_light_frusta, _)) in point_lights
.iter()
// Lights are sorted, shadow enabled lights are first
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.take(point_light_shadow_maps_count)
.filter(|(_, light, _)| light.shadows_enabled)
{
let light_index = *global_light_meta
.entity_to_index
.get(&light_entity)
.unwrap();
// ignore scale because we don't want to effectively scale light radius and range
// by applying those as a view transform to shadow map rendering of objects
// and ignore rotation because we want the shadow map projections to align with the axes
let view_translation = GlobalTransform::from_translation(light.transform.translation());
for (face_index, (view_rotation, frustum)) in cube_face_rotations
.iter()
.zip(&point_light_frusta.unwrap().frusta)
.enumerate()
{
let depth_texture_view =
point_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("point_light_shadow_map_texture_view"),
format: None,
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: (light_index * 6 + face_index) as u32,
array_layer_count: Some(1u32),
});
let view_light_entity = commands
Spawn now takes a Bundle (#6054) # Objective Now that we can consolidate Bundles and Components under a single insert (thanks to #2975 and #6039), almost 100% of world spawns now look like `world.spawn().insert((Some, Tuple, Here))`. Spawning an entity without any components is an extremely uncommon pattern, so it makes sense to give spawn the "first class" ergonomic api. This consolidated api should be made consistent across all spawn apis (such as World and Commands). ## Solution All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input: ```rust // before: commands .spawn() .insert((A, B, C)); world .spawn() .insert((A, B, C); // after commands.spawn((A, B, C)); world.spawn((A, B, C)); ``` All existing instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api. A new `spawn_empty` has been added, replacing the old `spawn` api. By allowing `world.spawn(some_bundle)` to replace `world.spawn().insert(some_bundle)`, this opened the door to removing the initial entity allocation in the "empty" archetype / table done in `spawn()` (and subsequent move to the actual archetype in `.insert(some_bundle)`). This improves spawn performance by over 10%: ![image](https://user-images.githubusercontent.com/2694663/191627587-4ab2f949-4ccd-4231-80eb-80dd4d9ad6b9.png) To take this measurement, I added a new `world_spawn` benchmark. Unfortunately, optimizing `Commands::spawn` is slightly less trivial, as Commands expose the Entity id of spawned entities prior to actually spawning. Doing the optimization would (naively) require assurances that the `spawn(some_bundle)` command is applied before all other commands involving the entity (which would not necessarily be true, if memory serves). Optimizing `Commands::spawn` this way does feel possible, but it will require careful thought (and maybe some additional checks), which deserves its own PR. For now, it has the same performance characteristics of the current `Commands::spawn_bundle` on main. **Note that 99% of this PR is simple renames and refactors. The only code that needs careful scrutiny is the new `World::spawn()` impl, which is relatively straightforward, but it has some new unsafe code (which re-uses battle tested BundlerSpawner code path).** --- ## Changelog - All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input - All instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api - World and Commands now have `spawn_empty()`, which is equivalent to the old `spawn()` behavior. ## Migration Guide ```rust // Old (0.8): commands .spawn() .insert_bundle((A, B, C)); // New (0.9) commands.spawn((A, B, C)); // Old (0.8): commands.spawn_bundle((A, B, C)); // New (0.9) commands.spawn((A, B, C)); // Old (0.8): let entity = commands.spawn().id(); // New (0.9) let entity = commands.spawn_empty().id(); // Old (0.8) let entity = world.spawn().id(); // New (0.9) let entity = world.spawn_empty(); ```
2022-09-23 19:55:54 +00:00
.spawn((
ShadowView {
Keep track of when a texture is first cleared (#10325) # Objective - Custom render passes, or future passes in the engine (such as https://github.com/bevyengine/bevy/pull/10164) need a better way to know and indicate to the core passes whether the view color/depth/prepass attachments have been cleared or not yet this frame, to know if they should clear it themselves or load it. ## Solution - For all render targets (depth textures, shadow textures, prepass textures, main textures) use an atomic bool to track whether or not each texture has been cleared this frame. Abstracted away in the new ColorAttachment and DepthAttachment wrappers. --- ## Changelog - Changed `ViewTarget::get_color_attachment()`, removed arguments. - Changed `ViewTarget::get_unsampled_color_attachment()`, removed arguments. - Removed `Camera3d::clear_color`. - Removed `Camera2d::clear_color`. - Added `Camera::clear_color`. - Added `ExtractedCamera::clear_color`. - Added `ColorAttachment` and `DepthAttachment` wrappers. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - Core render passes now track when a texture is first bound as an attachment in order to decide whether to clear or load it. ## Migration Guide - Remove arguments to `ViewTarget::get_color_attachment()` and `ViewTarget::get_unsampled_color_attachment()`. - Configure clear color on `Camera` instead of on `Camera3d` and `Camera2d`. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - `ViewDepthTexture` must now be created via the `new()` method --------- Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
2023-12-31 00:37:37 +00:00
depth_attachment: DepthAttachment::new(depth_texture_view, Some(0.0)),
pass_name: format!(
"shadow pass point light {} {}",
light_index,
face_index_to_name(face_index)
),
},
ExtractedView {
viewport: UVec4::new(
0,
0,
point_light_shadow_map.size as u32,
point_light_shadow_map.size as u32,
),
transform: view_translation * *view_rotation,
view_projection: None,
projection: cube_face_projection,
separate tonemapping and upscaling passes (#3425) Attempt to make features like bloom https://github.com/bevyengine/bevy/pull/2876 easier to implement. **This PR:** - Moves the tonemapping from `pbr.wgsl` into a separate pass - also add a separate upscaling pass after the tonemapping which writes to the swap chain (enables resolution-independant rendering and post-processing after tonemapping) - adds a `hdr` bool to the camera which controls whether the pbr and sprite shaders render into a `Rgba16Float` texture **Open questions:** - ~should the 2d graph work the same as the 3d one?~ it is the same now - ~The current solution is a bit inflexible because while you can add a post processing pass that writes to e.g. the `hdr_texture`, you can't write to a separate `user_postprocess_texture` while reading the `hdr_texture` and tell the tone mapping pass to read from the `user_postprocess_texture` instead. If the tonemapping and upscaling render graph nodes were to take in a `TextureView` instead of the view entity this would almost work, but the bind groups for their respective input textures are already created in the `Queue` render stage in the hardcoded order.~ solved by creating bind groups in render node **New render graph:** ![render_graph](https://user-images.githubusercontent.com/22177966/147767249-57dd4229-cfab-4ec5-9bf3-dc76dccf8e8b.png) <details> <summary>Before</summary> ![render_graph_old](https://user-images.githubusercontent.com/22177966/147284579-c895fdbd-4028-41cf-914c-e1ffef60e44e.png) </details> Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-10-26 20:13:59 +00:00
hdr: false,
color_grading: Default::default(),
},
*frustum,
RenderPhase::<Shadow>::default(),
LightEntity::Point {
light_entity,
face_index,
},
))
.id();
view_lights.push(view_light_entity);
}
2021-06-02 02:59:17 +00:00
}
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// spot lights
for (light_index, &(light_entity, light, (_, spot_light_frustum))) in point_lights
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
.iter()
.skip(point_light_count)
.take(spot_light_shadow_maps_count)
.enumerate()
{
let spot_view_matrix = spot_light_view_matrix(&light.transform);
let spot_view_transform = spot_view_matrix.into();
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let angle = light.spot_light_angles.expect("lights should be sorted so that \
[point_light_count..point_light_count + spot_light_shadow_maps_count] are spot lights").1;
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let spot_projection = spot_light_projection_matrix(angle);
let depth_texture_view =
directional_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("spot_light_shadow_map_texture_view"),
format: None,
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: (num_directional_cascades_enabled + light_index) as u32,
array_layer_count: Some(1u32),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
});
let view_light_entity = commands
Spawn now takes a Bundle (#6054) # Objective Now that we can consolidate Bundles and Components under a single insert (thanks to #2975 and #6039), almost 100% of world spawns now look like `world.spawn().insert((Some, Tuple, Here))`. Spawning an entity without any components is an extremely uncommon pattern, so it makes sense to give spawn the "first class" ergonomic api. This consolidated api should be made consistent across all spawn apis (such as World and Commands). ## Solution All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input: ```rust // before: commands .spawn() .insert((A, B, C)); world .spawn() .insert((A, B, C); // after commands.spawn((A, B, C)); world.spawn((A, B, C)); ``` All existing instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api. A new `spawn_empty` has been added, replacing the old `spawn` api. By allowing `world.spawn(some_bundle)` to replace `world.spawn().insert(some_bundle)`, this opened the door to removing the initial entity allocation in the "empty" archetype / table done in `spawn()` (and subsequent move to the actual archetype in `.insert(some_bundle)`). This improves spawn performance by over 10%: ![image](https://user-images.githubusercontent.com/2694663/191627587-4ab2f949-4ccd-4231-80eb-80dd4d9ad6b9.png) To take this measurement, I added a new `world_spawn` benchmark. Unfortunately, optimizing `Commands::spawn` is slightly less trivial, as Commands expose the Entity id of spawned entities prior to actually spawning. Doing the optimization would (naively) require assurances that the `spawn(some_bundle)` command is applied before all other commands involving the entity (which would not necessarily be true, if memory serves). Optimizing `Commands::spawn` this way does feel possible, but it will require careful thought (and maybe some additional checks), which deserves its own PR. For now, it has the same performance characteristics of the current `Commands::spawn_bundle` on main. **Note that 99% of this PR is simple renames and refactors. The only code that needs careful scrutiny is the new `World::spawn()` impl, which is relatively straightforward, but it has some new unsafe code (which re-uses battle tested BundlerSpawner code path).** --- ## Changelog - All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input - All instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api - World and Commands now have `spawn_empty()`, which is equivalent to the old `spawn()` behavior. ## Migration Guide ```rust // Old (0.8): commands .spawn() .insert_bundle((A, B, C)); // New (0.9) commands.spawn((A, B, C)); // Old (0.8): commands.spawn_bundle((A, B, C)); // New (0.9) commands.spawn((A, B, C)); // Old (0.8): let entity = commands.spawn().id(); // New (0.9) let entity = commands.spawn_empty().id(); // Old (0.8) let entity = world.spawn().id(); // New (0.9) let entity = world.spawn_empty(); ```
2022-09-23 19:55:54 +00:00
.spawn((
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
ShadowView {
Keep track of when a texture is first cleared (#10325) # Objective - Custom render passes, or future passes in the engine (such as https://github.com/bevyengine/bevy/pull/10164) need a better way to know and indicate to the core passes whether the view color/depth/prepass attachments have been cleared or not yet this frame, to know if they should clear it themselves or load it. ## Solution - For all render targets (depth textures, shadow textures, prepass textures, main textures) use an atomic bool to track whether or not each texture has been cleared this frame. Abstracted away in the new ColorAttachment and DepthAttachment wrappers. --- ## Changelog - Changed `ViewTarget::get_color_attachment()`, removed arguments. - Changed `ViewTarget::get_unsampled_color_attachment()`, removed arguments. - Removed `Camera3d::clear_color`. - Removed `Camera2d::clear_color`. - Added `Camera::clear_color`. - Added `ExtractedCamera::clear_color`. - Added `ColorAttachment` and `DepthAttachment` wrappers. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - Core render passes now track when a texture is first bound as an attachment in order to decide whether to clear or load it. ## Migration Guide - Remove arguments to `ViewTarget::get_color_attachment()` and `ViewTarget::get_unsampled_color_attachment()`. - Configure clear color on `Camera` instead of on `Camera3d` and `Camera2d`. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - `ViewDepthTexture` must now be created via the `new()` method --------- Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
2023-12-31 00:37:37 +00:00
depth_attachment: DepthAttachment::new(depth_texture_view, Some(0.0)),
pass_name: format!("shadow pass spot light {light_index}"),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
},
ExtractedView {
viewport: UVec4::new(
0,
0,
directional_light_shadow_map.size as u32,
directional_light_shadow_map.size as u32,
),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
transform: spot_view_transform,
projection: spot_projection,
view_projection: None,
separate tonemapping and upscaling passes (#3425) Attempt to make features like bloom https://github.com/bevyengine/bevy/pull/2876 easier to implement. **This PR:** - Moves the tonemapping from `pbr.wgsl` into a separate pass - also add a separate upscaling pass after the tonemapping which writes to the swap chain (enables resolution-independant rendering and post-processing after tonemapping) - adds a `hdr` bool to the camera which controls whether the pbr and sprite shaders render into a `Rgba16Float` texture **Open questions:** - ~should the 2d graph work the same as the 3d one?~ it is the same now - ~The current solution is a bit inflexible because while you can add a post processing pass that writes to e.g. the `hdr_texture`, you can't write to a separate `user_postprocess_texture` while reading the `hdr_texture` and tell the tone mapping pass to read from the `user_postprocess_texture` instead. If the tonemapping and upscaling render graph nodes were to take in a `TextureView` instead of the view entity this would almost work, but the bind groups for their respective input textures are already created in the `Queue` render stage in the hardcoded order.~ solved by creating bind groups in render node **New render graph:** ![render_graph](https://user-images.githubusercontent.com/22177966/147767249-57dd4229-cfab-4ec5-9bf3-dc76dccf8e8b.png) <details> <summary>Before</summary> ![render_graph_old](https://user-images.githubusercontent.com/22177966/147284579-c895fdbd-4028-41cf-914c-e1ffef60e44e.png) </details> Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-10-26 20:13:59 +00:00
hdr: false,
color_grading: Default::default(),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
},
*spot_light_frustum.unwrap(),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
RenderPhase::<Shadow>::default(),
LightEntity::Spot { light_entity },
))
.id();
view_lights.push(view_light_entity);
}
// directional lights
let mut directional_depth_texture_array_index = 0u32;
for (light_index, &(light_entity, light)) in directional_lights
.iter()
.enumerate()
.take(directional_shadow_enabled_count)
{
let cascades = light
.cascades
.get(&entity)
.unwrap()
.iter()
.take(MAX_CASCADES_PER_LIGHT);
let frusta = light
.frusta
.get(&entity)
.unwrap()
.iter()
.take(MAX_CASCADES_PER_LIGHT);
for (cascade_index, ((cascade, frusta), bound)) in cascades
.zip(frusta)
.zip(&light.cascade_shadow_config.bounds)
.enumerate()
{
gpu_lights.directional_lights[light_index].cascades[cascade_index] =
GpuDirectionalCascade {
view_projection: cascade.view_projection,
texel_size: cascade.texel_size,
far_bound: *bound,
};
let depth_texture_view =
directional_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("directional_light_shadow_map_array_texture_view"),
format: None,
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: directional_depth_texture_array_index,
array_layer_count: Some(1u32),
});
directional_depth_texture_array_index += 1;
let view_light_entity = commands
.spawn((
ShadowView {
Keep track of when a texture is first cleared (#10325) # Objective - Custom render passes, or future passes in the engine (such as https://github.com/bevyengine/bevy/pull/10164) need a better way to know and indicate to the core passes whether the view color/depth/prepass attachments have been cleared or not yet this frame, to know if they should clear it themselves or load it. ## Solution - For all render targets (depth textures, shadow textures, prepass textures, main textures) use an atomic bool to track whether or not each texture has been cleared this frame. Abstracted away in the new ColorAttachment and DepthAttachment wrappers. --- ## Changelog - Changed `ViewTarget::get_color_attachment()`, removed arguments. - Changed `ViewTarget::get_unsampled_color_attachment()`, removed arguments. - Removed `Camera3d::clear_color`. - Removed `Camera2d::clear_color`. - Added `Camera::clear_color`. - Added `ExtractedCamera::clear_color`. - Added `ColorAttachment` and `DepthAttachment` wrappers. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - Core render passes now track when a texture is first bound as an attachment in order to decide whether to clear or load it. ## Migration Guide - Remove arguments to `ViewTarget::get_color_attachment()` and `ViewTarget::get_unsampled_color_attachment()`. - Configure clear color on `Camera` instead of on `Camera3d` and `Camera2d`. - Moved `ClearColor` and `ClearColorConfig` from `bevy::core_pipeline::clear_color` to `bevy::render::camera`. - `ViewDepthTexture` must now be created via the `new()` method --------- Co-authored-by: vero <email@atlasdostal.com> Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
2023-12-31 00:37:37 +00:00
depth_attachment: DepthAttachment::new(depth_texture_view, Some(0.0)),
pass_name: format!(
"shadow pass directional light {light_index} cascade {cascade_index}"),
},
ExtractedView {
viewport: UVec4::new(
0,
0,
directional_light_shadow_map.size as u32,
directional_light_shadow_map.size as u32,
),
transform: GlobalTransform::from(cascade.view_transform),
projection: cascade.projection,
view_projection: Some(cascade.view_projection),
hdr: false,
color_grading: Default::default(),
},
*frusta,
RenderPhase::<Shadow>::default(),
LightEntity::Directional {
light_entity,
cascade_index,
},
))
.id();
view_lights.push(view_light_entity);
}
}
let point_light_depth_texture_view =
point_light_depth_texture
2021-07-01 23:48:55 +00:00
.texture
.create_view(&TextureViewDescriptor {
label: Some("point_light_shadow_map_array_texture_view"),
2021-07-01 23:48:55 +00:00
format: None,
// NOTE: iOS Simulator is missing CubeArray support so we use Cube instead.
// See https://github.com/bevyengine/bevy/pull/12052 - remove if support is added.
#[cfg(all(
not(ios_simulator),
any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
)
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
))]
2021-07-01 23:48:55 +00:00
dimension: Some(TextureViewDimension::CubeArray),
#[cfg(any(
ios_simulator,
all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu"))
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
))]
dimension: Some(TextureViewDimension::Cube),
Deferred Renderer (#9258) # Objective - Add a [Deferred Renderer](https://en.wikipedia.org/wiki/Deferred_shading) to Bevy. - This allows subsequent passes to access per pixel material information before/during shading. - Accessing this per pixel material information is needed for some features, like GI. It also makes other features (ex. Decals) simpler to implement and/or improves their capability. There are multiple approaches to accomplishing this. The deferred shading approach works well given the limitations of WebGPU and WebGL2. Motivation: [I'm working on a GI solution for Bevy](https://youtu.be/eH1AkL-mwhI) # Solution - The deferred renderer is implemented with a prepass and a deferred lighting pass. - The prepass renders opaque objects into the Gbuffer attachment (`Rgba32Uint`). The PBR shader generates a `PbrInput` in mostly the same way as the forward implementation and then [packs it into the Gbuffer](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L168). - The deferred lighting pass unpacks the `PbrInput` and [feeds it into the pbr() function](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L65), then outputs the shaded color data. - There is now a resource [DefaultOpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L599) that can be used to set the default render method for opaque materials. If materials return `None` from [opaque_render_method()](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L131) the `DefaultOpaqueRendererMethod` will be used. Otherwise, custom materials can also explicitly choose to only support Deferred or Forward by returning the respective [OpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L603) - Deferred materials can be used seamlessly along with both opaque and transparent forward rendered materials in the same scene. The [deferred rendering example](https://github.com/DGriffin91/bevy/blob/deferred/examples/3d/deferred_rendering.rs) does this. - The deferred renderer does not support MSAA. If any deferred materials are used, MSAA must be disabled. Both TAA and FXAA are supported. - Deferred rendering supports WebGL2/WebGPU. ## Custom deferred materials - Custom materials can support both deferred and forward at the same time. The [StandardMaterial](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L166) does this. So does [this example](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56). - Custom deferred materials that require PBR lighting can create a `PbrInput`, write it to the deferred GBuffer and let it be rendered by the `PBRDeferredLightingPlugin`. - Custom deferred materials that require custom lighting have two options: 1. Use the base_color channel of the `PbrInput` combined with the `STANDARD_MATERIAL_FLAGS_UNLIT_BIT` flag. [Example.](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56) (If the unlit bit is set, the base_color is stored as RGB9E5 for extra precision) 2. A Custom Deferred Lighting pass can be created, either overriding the default, or running in addition. The a depth buffer is used to limit rendering to only the required fragments for each deferred lighting pass. Materials can set their respective depth id via the [deferred_lighting_pass_id](https://github.com/DGriffin91/bevy/blob/b79182d2a32cac28c4213c2457a53ac2cc885332/crates/bevy_pbr/src/prepass/prepass_io.wgsl#L95) attachment. The custom deferred lighting pass plugin can then set [its corresponding depth](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L37). Then with the lighting pass using [CompareFunction::Equal](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/mod.rs#L335), only the fragments with a depth that equal the corresponding depth written in the material will be rendered. Custom deferred lighting plugins can also be created to render the StandardMaterial. The default deferred lighting plugin can be bypassed with `DefaultPlugins.set(PBRDeferredLightingPlugin { bypass: true })` --------- Co-authored-by: nickrart <nickolas.g.russell@gmail.com>
2023-10-12 22:10:38 +00:00
aspect: TextureAspect::DepthOnly,
2021-07-01 23:48:55 +00:00
base_mip_level: 0,
mip_level_count: None,
2021-07-02 01:05:20 +00:00
base_array_layer: 0,
2021-07-01 23:48:55 +00:00
array_layer_count: None,
});
let directional_light_depth_texture_view = directional_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("directional_light_shadow_map_array_texture_view"),
format: None,
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(any(
not(feature = "webgl"),
not(target_arch = "wasm32"),
feature = "webgpu"
))]
dimension: Some(TextureViewDimension::D2Array),
Update to wgpu 0.19 and raw-window-handle 0.6 (#11280) # Objective Keep core dependencies up to date. ## Solution Update the dependencies. wgpu 0.19 only supports raw-window-handle (rwh) 0.6, so bumping that was included in this. The rwh 0.6 version bump is just the simplest way of doing it. There might be a way we can take advantage of wgpu's new safe surface creation api, but I'm not familiar enough with bevy's window management to untangle it and my attempt ended up being a mess of lifetimes and rustc complaining about missing trait impls (that were implemented). Thanks to @MiniaczQ for the (much simpler) rwh 0.6 version bump code. Unblocks https://github.com/bevyengine/bevy/pull/9172 and https://github.com/bevyengine/bevy/pull/10812 ~~This might be blocked on cpal and oboe updating their ndk versions to 0.8, as they both currently target ndk 0.7 which uses rwh 0.5.2~~ Tested on android, and everything seems to work correctly (audio properly stops when minimized, and plays when re-focusing the app). --- ## Changelog - `wgpu` has been updated to 0.19! The long awaited arcanization has been merged (for more info, see https://gfx-rs.github.io/2023/11/24/arcanization.html), and Vulkan should now be working again on Intel GPUs. - Targeting WebGPU now requires that you add the new `webgpu` feature (setting the `RUSTFLAGS` environment variable to `--cfg=web_sys_unstable_apis` is still required). This feature currently overrides the `webgl2` feature if you have both enabled (the `webgl2` feature is enabled by default), so it is not recommended to add it as a default feature to libraries without putting it behind a flag that allows library users to opt out of it! In the future we plan on supporting wasm binaries that can target both webgl2 and webgpu now that wgpu added support for doing so (see https://github.com/bevyengine/bevy/issues/11505). - `raw-window-handle` has been updated to version 0.6. ## Migration Guide - `bevy_render::instance_index::get_instance_index()` has been removed as the webgl2 workaround is no longer required as it was fixed upstream in wgpu. The `BASE_INSTANCE_WORKAROUND` shaderdef has also been removed. - WebGPU now requires the new `webgpu` feature to be enabled. The `webgpu` feature currently overrides the `webgl2` feature so you no longer need to disable all default features and re-add them all when targeting `webgpu`, but binaries built with both the `webgpu` and `webgl2` features will only target the webgpu backend, and will only work on browsers that support WebGPU. - Places where you conditionally compiled things for webgl2 need to be updated because of this change, eg: - `#[cfg(any(not(feature = "webgl"), not(target_arch = "wasm32")))]` becomes `#[cfg(any(not(feature = "webgl") ,not(target_arch = "wasm32"), feature = "webgpu"))]` - `#[cfg(all(feature = "webgl", target_arch = "wasm32"))]` becomes `#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]` - `if cfg!(all(feature = "webgl", target_arch = "wasm32"))` becomes `if cfg!(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))` - `create_texture_with_data` now also takes a `TextureDataOrder`. You can probably just set this to `TextureDataOrder::default()` - `TextureFormat`'s `block_size` has been renamed to `block_copy_size` - See the `wgpu` changelog for anything I might've missed: https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md --------- Co-authored-by: François <mockersf@gmail.com>
2024-01-26 18:14:21 +00:00
#[cfg(all(feature = "webgl", target_arch = "wasm32", not(feature = "webgpu")))]
dimension: Some(TextureViewDimension::D2),
Deferred Renderer (#9258) # Objective - Add a [Deferred Renderer](https://en.wikipedia.org/wiki/Deferred_shading) to Bevy. - This allows subsequent passes to access per pixel material information before/during shading. - Accessing this per pixel material information is needed for some features, like GI. It also makes other features (ex. Decals) simpler to implement and/or improves their capability. There are multiple approaches to accomplishing this. The deferred shading approach works well given the limitations of WebGPU and WebGL2. Motivation: [I'm working on a GI solution for Bevy](https://youtu.be/eH1AkL-mwhI) # Solution - The deferred renderer is implemented with a prepass and a deferred lighting pass. - The prepass renders opaque objects into the Gbuffer attachment (`Rgba32Uint`). The PBR shader generates a `PbrInput` in mostly the same way as the forward implementation and then [packs it into the Gbuffer](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L168). - The deferred lighting pass unpacks the `PbrInput` and [feeds it into the pbr() function](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L65), then outputs the shaded color data. - There is now a resource [DefaultOpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L599) that can be used to set the default render method for opaque materials. If materials return `None` from [opaque_render_method()](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L131) the `DefaultOpaqueRendererMethod` will be used. Otherwise, custom materials can also explicitly choose to only support Deferred or Forward by returning the respective [OpaqueRendererMethod](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/material.rs#L603) - Deferred materials can be used seamlessly along with both opaque and transparent forward rendered materials in the same scene. The [deferred rendering example](https://github.com/DGriffin91/bevy/blob/deferred/examples/3d/deferred_rendering.rs) does this. - The deferred renderer does not support MSAA. If any deferred materials are used, MSAA must be disabled. Both TAA and FXAA are supported. - Deferred rendering supports WebGL2/WebGPU. ## Custom deferred materials - Custom materials can support both deferred and forward at the same time. The [StandardMaterial](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/render/pbr.wgsl#L166) does this. So does [this example](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56). - Custom deferred materials that require PBR lighting can create a `PbrInput`, write it to the deferred GBuffer and let it be rendered by the `PBRDeferredLightingPlugin`. - Custom deferred materials that require custom lighting have two options: 1. Use the base_color channel of the `PbrInput` combined with the `STANDARD_MATERIAL_FLAGS_UNLIT_BIT` flag. [Example.](https://github.com/DGriffin91/bevy_glowy_orb_tutorial/blob/deferred/assets/shaders/glowy.wgsl#L56) (If the unlit bit is set, the base_color is stored as RGB9E5 for extra precision) 2. A Custom Deferred Lighting pass can be created, either overriding the default, or running in addition. The a depth buffer is used to limit rendering to only the required fragments for each deferred lighting pass. Materials can set their respective depth id via the [deferred_lighting_pass_id](https://github.com/DGriffin91/bevy/blob/b79182d2a32cac28c4213c2457a53ac2cc885332/crates/bevy_pbr/src/prepass/prepass_io.wgsl#L95) attachment. The custom deferred lighting pass plugin can then set [its corresponding depth](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/deferred_lighting.wgsl#L37). Then with the lighting pass using [CompareFunction::Equal](https://github.com/DGriffin91/bevy/blob/ec1465559f2c82001830e908fc02ff1d7c2efe51/crates/bevy_pbr/src/deferred/mod.rs#L335), only the fragments with a depth that equal the corresponding depth written in the material will be rendered. Custom deferred lighting plugins can also be created to render the StandardMaterial. The default deferred lighting plugin can be bypassed with `DefaultPlugins.set(PBRDeferredLightingPlugin { bypass: true })` --------- Co-authored-by: nickrart <nickolas.g.russell@gmail.com>
2023-10-12 22:10:38 +00:00
aspect: TextureAspect::DepthOnly,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: 0,
array_layer_count: None,
});
2021-07-01 23:48:55 +00:00
Accept Bundles for insert and remove. Deprecate insert/remove_bundle (#6039) # Objective Take advantage of the "impl Bundle for Component" changes in #2975 / add the follow up changes discussed there. ## Solution - Change `insert` and `remove` to accept a Bundle instead of a Component (for both Commands and World) - Deprecate `insert_bundle`, `remove_bundle`, and `remove_bundle_intersection` - Add `remove_intersection` --- ## Changelog - Change `insert` and `remove` now accept a Bundle instead of a Component (for both Commands and World) - `insert_bundle` and `remove_bundle` are deprecated ## Migration Guide Replace `insert_bundle` with `insert`: ```rust // Old (0.8) commands.spawn().insert_bundle(SomeBundle::default()); // New (0.9) commands.spawn().insert(SomeBundle::default()); ``` Replace `remove_bundle` with `remove`: ```rust // Old (0.8) commands.entity(some_entity).remove_bundle::<SomeBundle>(); // New (0.9) commands.entity(some_entity).remove::<SomeBundle>(); ``` Replace `remove_bundle_intersection` with `remove_intersection`: ```rust // Old (0.8) world.entity_mut(some_entity).remove_bundle_intersection::<SomeBundle>(); // New (0.9) world.entity_mut(some_entity).remove_intersection::<SomeBundle>(); ``` Consider consolidating as many operations as possible to improve ergonomics and cut down on archetype moves: ```rust // Old (0.8) commands.spawn() .insert_bundle(SomeBundle::default()) .insert(SomeComponent); // New (0.9) - Option 1 commands.spawn().insert(( SomeBundle::default(), SomeComponent, )) // New (0.9) - Option 2 commands.spawn_bundle(( SomeBundle::default(), SomeComponent, )) ``` ## Next Steps Consider changing `spawn` to accept a bundle and deprecate `spawn_bundle`.
2022-09-21 21:47:53 +00:00
commands.entity(entity).insert((
ViewShadowBindings {
point_light_depth_texture: point_light_depth_texture.texture,
point_light_depth_texture_view,
directional_light_depth_texture: directional_light_depth_texture.texture,
directional_light_depth_texture_view,
},
ViewLightEntities {
lights: view_lights,
},
ViewLightsUniformOffset {
offset: view_gpu_lights_writer.write(&gpu_lights),
},
));
2021-06-02 02:59:17 +00:00
}
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
}
// this must match CLUSTER_COUNT_SIZE in pbr.wgsl
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
// and must be large enough to contain MAX_UNIFORM_BUFFER_POINT_LIGHTS
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
const CLUSTER_COUNT_SIZE: u32 = 9;
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
const CLUSTER_OFFSET_MASK: u32 = (1 << (32 - (CLUSTER_COUNT_SIZE * 2))) - 1;
const CLUSTER_COUNT_MASK: u32 = (1 << CLUSTER_COUNT_SIZE) - 1;
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
// NOTE: With uniform buffer max binding size as 16384 bytes
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// that means we can fit 256 point lights in one uniform
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
// buffer, which means the count can be at most 256 so it
// needs 9 bits.
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
// The array of indices can also use u8 and that means the
// offset in to the array of indices needs to be able to address
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
// 16384 values. log2(16384) = 14 bits.
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
// We use 32 bits to store the offset and counts so
// we pack the offset into the upper 14 bits of a u32,
// the point light count into bits 9-17, and the spot light count into bits 0-8.
// [ 31 .. 18 | 17 .. 9 | 8 .. 0 ]
// [ offset | point light count | spot light count ]
bevy_pbr2: Fix clustering for orthographic projections (#3316) # Objective PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it. ## Solution - The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane which gives a floating point NaN and the clustering all breaks down. - Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 and using the exponential depth slicing, but that felt like a hack that didn't make sense. - As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). ## Screenshots Before: ![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png) After: <img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png"> Old renderer (slightly lighter due to slight difference in configured intensity): <img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
// NOTE: This assumes CPU and GPU endianness are the same which is true
// for all common and tested x86/ARM CPUs and AMD/NVIDIA/Intel/Apple/etc GPUs
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
fn pack_offset_and_counts(offset: usize, point_count: usize, spot_count: usize) -> u32 {
((offset as u32 & CLUSTER_OFFSET_MASK) << (CLUSTER_COUNT_SIZE * 2))
| (point_count as u32 & CLUSTER_COUNT_MASK) << CLUSTER_COUNT_SIZE
| (spot_count as u32 & CLUSTER_COUNT_MASK)
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
#[derive(ShaderType)]
struct GpuClusterLightIndexListsUniform {
data: Box<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
}
// NOTE: Assert at compile time that GpuClusterLightIndexListsUniform
// fits within the maximum uniform buffer binding size
const _: () = assert!(GpuClusterLightIndexListsUniform::SHADER_SIZE.get() <= 16384);
impl Default for GpuClusterLightIndexListsUniform {
fn default() -> Self {
Self {
data: Box::new([UVec4::ZERO; ViewClusterBindings::MAX_UNIFORM_ITEMS]),
}
}
}
#[derive(ShaderType)]
struct GpuClusterOffsetsAndCountsUniform {
data: Box<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
}
impl Default for GpuClusterOffsetsAndCountsUniform {
fn default() -> Self {
Self {
data: Box::new([UVec4::ZERO; ViewClusterBindings::MAX_UNIFORM_ITEMS]),
}
}
}
#[derive(ShaderType, Default)]
struct GpuClusterLightIndexListsStorage {
#[size(runtime)]
data: Vec<u32>,
}
#[derive(ShaderType, Default)]
struct GpuClusterOffsetsAndCountsStorage {
#[size(runtime)]
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
data: Vec<UVec4>,
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
enum ViewClusterBuffers {
Uniform {
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
cluster_light_index_lists: UniformBuffer<GpuClusterLightIndexListsUniform>,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
cluster_offsets_and_counts: UniformBuffer<GpuClusterOffsetsAndCountsUniform>,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
},
Storage {
cluster_light_index_lists: StorageBuffer<GpuClusterLightIndexListsStorage>,
cluster_offsets_and_counts: StorageBuffer<GpuClusterOffsetsAndCountsStorage>,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
},
}
impl ViewClusterBuffers {
fn new(buffer_binding_type: BufferBindingType) -> Self {
match buffer_binding_type {
BufferBindingType::Storage { .. } => Self::storage(),
BufferBindingType::Uniform => Self::uniform(),
}
}
fn uniform() -> Self {
ViewClusterBuffers::Uniform {
cluster_light_index_lists: UniformBuffer::default(),
cluster_offsets_and_counts: UniformBuffer::default(),
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
fn storage() -> Self {
ViewClusterBuffers::Storage {
cluster_light_index_lists: StorageBuffer::default(),
cluster_offsets_and_counts: StorageBuffer::default(),
}
}
}
#[derive(Component)]
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
pub struct ViewClusterBindings {
n_indices: usize,
n_offsets: usize,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
buffers: ViewClusterBuffers,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
impl ViewClusterBindings {
pub const MAX_OFFSETS: usize = 16384 / 4;
const MAX_UNIFORM_ITEMS: usize = Self::MAX_OFFSETS / 4;
Dynamic light clusters (#3968) # Objective provide some customisation for default cluster setup avoid "cluster index lists is full" in all cases (using a strategy outlined by @superdump) ## Solution Add ClusterConfig enum (which can be inserted into a view at any time) to allow specifying cluster setup with variants: - None (do not do any light assignment - for views which do not require light info, e.g. minimaps etc) - Single (one cluster) - XYZ (explicit cluster counts in each dimension) - FixedZ (most similar to current - specify Z-slices and total, then x and y counts are dynamically determined to give approximately square clusters based on current aspect ratio) Defaults to FixedZ { total: 4096, z: 24 } which is similar to the current setup. Per frame, estimate the number of indices that would be required for the current config and decrease the cluster counts / increase the cluster sizes in the x and y dimensions if the index list would be too small. notes: - I didn't put ClusterConfig in the camera bundles to avoid introducing a dependency from bevy_render to bevy_pbr. the ClusterConfig enum comes with a pbr-centric impl block so i didn't want to move that into bevy_render either. - ~Might want to add None variant to cluster config for views that don't care about lights?~ - Not well tested for orthographic - ~there's a cluster_muck branch on my repo which includes some diagnostics / a modified lighting example which may be useful for tyre-kicking~ (outdated, i will bring it up to date if required) anecdotal timings: FPS on the lighting demo is negligibly better (~5%), maybe due to a small optimisation constraining the light aabb to be in front of the camera FPS on the lighting demo with 100 extra lights added is ~33% faster, and also renders correctly as the cluster index count is no longer exceeded
2022-03-08 04:56:42 +00:00
pub const MAX_INDICES: usize = 16384;
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
Self {
n_indices: 0,
n_offsets: 0,
buffers: ViewClusterBuffers::new(buffer_binding_type),
}
}
pub fn clear(&mut self) {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
*cluster_light_index_lists.get_mut().data = [UVec4::ZERO; Self::MAX_UNIFORM_ITEMS];
*cluster_offsets_and_counts.get_mut().data = [UVec4::ZERO; Self::MAX_UNIFORM_ITEMS];
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
ViewClusterBuffers::Storage {
cluster_light_index_lists,
cluster_offsets_and_counts,
..
} => {
cluster_light_index_lists.get_mut().data.clear();
cluster_offsets_and_counts.get_mut().data.clear();
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
pub fn push_offset_and_counts(&mut self, offset: usize, point_count: usize, spot_count: usize) {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_offsets_and_counts,
..
} => {
let array_index = self.n_offsets >> 2; // >> 2 is equivalent to / 4
if array_index >= Self::MAX_UNIFORM_ITEMS {
warn!("cluster offset and count out of bounds!");
return;
}
let component = self.n_offsets & ((1 << 2) - 1);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let packed = pack_offset_and_counts(offset, point_count, spot_count);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
cluster_offsets_and_counts.get_mut().data[array_index][component] = packed;
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
ViewClusterBuffers::Storage {
cluster_offsets_and_counts,
..
} => {
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
cluster_offsets_and_counts.get_mut().data.push(UVec4::new(
offset as u32,
point_count as u32,
spot_count as u32,
0,
));
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
self.n_offsets += 1;
}
pub fn n_indices(&self) -> usize {
self.n_indices
}
pub fn push_index(&mut self, index: usize) {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
..
} => {
let array_index = self.n_indices >> 4; // >> 4 is equivalent to / 16
let component = (self.n_indices >> 2) & ((1 << 2) - 1);
let sub_index = self.n_indices & ((1 << 2) - 1);
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
let index = index as u32;
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
cluster_light_index_lists.get_mut().data[array_index][component] |=
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
index << (8 * sub_index);
}
ViewClusterBuffers::Storage {
cluster_light_index_lists,
..
} => {
cluster_light_index_lists.get_mut().data.push(index as u32);
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
self.n_indices += 1;
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
pub fn write_buffers(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
cluster_light_index_lists.write_buffer(render_device, render_queue);
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
}
ViewClusterBuffers::Storage {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
cluster_light_index_lists.write_buffer(render_device, render_queue);
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
}
}
}
pub fn light_index_lists_binding(&self) -> Option<BindingResource> {
match &self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
..
} => cluster_light_index_lists.binding(),
ViewClusterBuffers::Storage {
cluster_light_index_lists,
..
} => cluster_light_index_lists.binding(),
}
}
pub fn offsets_and_counts_binding(&self) -> Option<BindingResource> {
match &self.buffers {
ViewClusterBuffers::Uniform {
cluster_offsets_and_counts,
..
} => cluster_offsets_and_counts.binding(),
ViewClusterBuffers::Storage {
cluster_offsets_and_counts,
..
} => cluster_offsets_and_counts.binding(),
}
}
pub fn min_size_cluster_light_index_lists(
buffer_binding_type: BufferBindingType,
) -> NonZeroU64 {
match buffer_binding_type {
BufferBindingType::Storage { .. } => GpuClusterLightIndexListsStorage::min_size(),
BufferBindingType::Uniform => GpuClusterLightIndexListsUniform::min_size(),
}
}
pub fn min_size_cluster_offsets_and_counts(
buffer_binding_type: BufferBindingType,
) -> NonZeroU64 {
match buffer_binding_type {
BufferBindingType::Storage { .. } => GpuClusterOffsetsAndCountsStorage::min_size(),
BufferBindingType::Uniform => GpuClusterOffsetsAndCountsUniform::min_size(),
}
}
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
pub fn prepare_clusters(
mut commands: Commands,
render_device: Res<RenderDevice>,
render_queue: Res<RenderQueue>,
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
mesh_pipeline: Res<MeshPipeline>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
global_light_meta: Res<GlobalLightMeta>,
views: Query<(Entity, &ExtractedClustersPointLights), With<RenderPhase<Transparent3d>>>,
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
) {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
let render_device = render_device.into_inner();
let supports_storage_buffers = matches!(
mesh_pipeline.clustered_forward_buffer_binding_type,
BufferBindingType::Storage { .. }
);
for (entity, extracted_clusters) in &views {
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
let mut view_clusters_bindings =
ViewClusterBindings::new(mesh_pipeline.clustered_forward_buffer_binding_type);
view_clusters_bindings.clear();
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
for record in &extracted_clusters.data {
match record {
ExtractedClustersPointLightsElement::ClusterHeader(
point_light_count,
spot_light_count,
) => {
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
let offset = view_clusters_bindings.n_indices();
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
view_clusters_bindings.push_offset_and_counts(
offset,
*point_light_count as usize,
*spot_light_count as usize,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
);
}
ExtractedClustersPointLightsElement::LightEntity(entity) => {
if let Some(light_index) = global_light_meta.entity_to_index.get(entity) {
if view_clusters_bindings.n_indices() >= ViewClusterBindings::MAX_INDICES
&& !supports_storage_buffers
{
warn!("Cluster light index lists is full! The PointLights in the view are affecting too many clusters.");
break;
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
view_clusters_bindings.push_index(*light_index);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
}
}
}
}
Use storage buffers for clustered forward point lights (#3989) # Objective - Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene - Fixes #3605 - Based on top of #4079 This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR. ## Solution - Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability - Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, and the only difference is at initialisation time. - Appropriate shader defs in the shader code to handle the two cases ## Context on some decisions / open questions - I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately. - Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants? Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
view_clusters_bindings.write_buffers(render_device, &render_queue);
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
commands.get_or_spawn(entity).insert(view_clusters_bindings);
}
}
#[allow(clippy::too_many_arguments)]
pub fn queue_shadows<M: Material>(
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
shadow_draw_functions: Res<DrawFunctions<Shadow>>,
prepass_pipeline: Res<PrepassPipeline<M>>,
render_meshes: Res<RenderAssets<Mesh>>,
render_mesh_instances: Res<RenderMeshInstances>,
render_materials: Res<RenderMaterials<M>>,
Use EntityHashMap<Entity, T> for render world entity storage for better performance (#9903) # Objective - Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does. ## Solution - Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks for @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. - Define `EntityHashMap` type that uses the `EntityHashser` - Use `EntityHashMap<Entity, T>` for render world entity storage, including: - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D. - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D. - `SkinIndices` and `MorphIndices` - contains the skin and morph index for an entity, respectively - `ExtractedSprites` - `ExtractedUiNodes` ## Benchmarks All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none. ### 2D Meshes `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` #### `--ordered-z` This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb"> -39.1% median frame time. #### Random This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable. <img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4"> -7.2% median frame time. ### 3D Meshes `many_cubes --benchmark` <img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc"> -7.7% median frame time. ### Sprites **NOTE: On `main` sprites are using `SparseSet<Entity, T>`!** `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` #### `--ordered-z` This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67"> +13.0% median frame time. #### Random This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses. <img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033"> +0.6% median frame time. ### UI **NOTE: On `main` UI is using `SparseSet<Entity, T>`!** `many_buttons` <img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85"> +15.1% median frame time. ## Alternatives - Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage that `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`. - I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different. - The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved. - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems. - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense. --- ## Changelog - Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in a `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts. ## Migration Guide Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data. Before: ```rust fn queue_custom( material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>, ) { ... for (entity, mesh_transforms, mesh_handle) in &material_meshes { ... } } ``` After: ```rust fn queue_custom( render_mesh_instances: Res<RenderMeshInstances>, instance_entities: Query<Entity, With<InstanceMaterialData>>, ) { ... for entity in &instance_entities { let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; }; // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now // be found in `mesh_instance` which is a `RenderMeshInstance` ... } } ``` --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-09-27 08:28:28 +00:00
render_material_instances: Res<RenderMaterialInstances<M>>,
mut pipelines: ResMut<SpecializedMeshPipelines<PrepassPipeline<M>>>,
pipeline_cache: Res<PipelineCache>,
render_lightmaps: Res<RenderLightmaps>,
view_lights: Query<(Entity, &ViewLightEntities)>,
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
mut view_light_shadow_phases: Query<(&LightEntity, &mut RenderPhase<Shadow>)>,
point_light_entities: Query<&CubemapVisibleEntities, With<ExtractedPointLight>>,
directional_light_entities: Query<&CascadesVisibleEntities, With<ExtractedDirectionalLight>>,
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
spot_light_entities: Query<&VisibleEntities, With<ExtractedPointLight>>,
) where
M::Data: PartialEq + Eq + Hash + Clone,
{
for (entity, view_lights) in &view_lights {
let draw_shadow_mesh = shadow_draw_functions.read().id::<DrawPrepass<M>>();
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
for view_light_entity in view_lights.lights.iter().copied() {
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
let (light_entity, mut shadow_phase) =
view_light_shadow_phases.get_mut(view_light_entity).unwrap();
let is_directional_light = matches!(light_entity, LightEntity::Directional { .. });
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
let visible_entities = match light_entity {
LightEntity::Directional {
light_entity,
cascade_index,
} => directional_light_entities
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
.get(*light_entity)
.expect("Failed to get directional light visible entities")
.entities
.get(&entity)
.expect("Failed to get directional light visible entities for view")
.get(*cascade_index)
.expect("Failed to get directional light visible entities for cascade"),
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
LightEntity::Point {
light_entity,
face_index,
} => point_light_entities
.get(*light_entity)
.expect("Failed to get point light visible entities")
.get(*face_index),
Spotlights (#4715) # Objective add spotlight support ## Solution / Changelog - add spotlight angles (inner, outer) to ``PointLight`` struct. emitted light is linearly attenuated from 100% to 0% as angle tends from inner to outer. Direction is taken from the existing transform rotation. - add spotlight direction (vec3) and angles (f32,f32) to ``GpuPointLight`` struct (60 bytes -> 80 bytes) in ``pbr/render/lights.rs`` and ``mesh_view_bind_group.wgsl`` - reduce no-buffer-support max point light count to 204 due to above - use spotlight data to attenuate light in ``pbr.wgsl`` - do additional cluster culling on spotlights to minimise cost in ``assign_lights_to_clusters`` - changed one of the lights in the lighting demo to a spotlight - also added a ``spotlight`` demo - probably not justified but so reviewers can see it more easily ## notes increasing the size of the GpuPointLight struct on my machine reduces the FPS of ``many_lights -- sphere`` from ~150fps to 140fps. i thought this was a reasonable tradeoff, and felt better than handling spotlights separately which is possible but would mean introducing a new bind group, refactoring light-assignment code and adding new spotlight-specific code in pbr.wgsl. the FPS impact for smaller numbers of lights should be very small. the cluster culling strategy reintroduces the cluster aabb code which was recently removed... sorry. the aabb is used to get a cluster bounding sphere, which can then be tested fairly efficiently using the strategy described at the end of https://bartwronski.com/2017/04/13/cull-that-cone/. this works well with roughly cubic clusters (where the cluster z size is close to the same as x/y size), less well for other cases like single Z slice / tiled forward rendering. In the worst case we will end up just keeping the culling of the equivalent point light. Co-authored-by: François <mockersf@gmail.com>
2022-07-08 19:57:43 +00:00
LightEntity::Spot { light_entity } => spot_light_entities
.get(*light_entity)
.expect("Failed to get spot light visible entities"),
Frustum culling (#2861) # Objective Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M. macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes. ## Solution - Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc. - I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps. - I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations. - When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free. - For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB. - The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice. - `RenderLayers` were copied over from the current renderer. - `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras. - `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false. - The `ComputedVisibility` is used to decide which meshes to extract. - I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win. I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
};
// NOTE: Lights with shadow mapping disabled will have no visible entities
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
// so no meshes will be queued
for entity in visible_entities.iter().copied() {
let Some(mesh_instance) = render_mesh_instances.get(&entity) else {
continue;
};
Use EntityHashMap<Entity, T> for render world entity storage for better performance (#9903) # Objective - Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does. ## Solution - Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks for @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. - Define `EntityHashMap` type that uses the `EntityHashser` - Use `EntityHashMap<Entity, T>` for render world entity storage, including: - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D. - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D. - `SkinIndices` and `MorphIndices` - contains the skin and morph index for an entity, respectively - `ExtractedSprites` - `ExtractedUiNodes` ## Benchmarks All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none. ### 2D Meshes `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` #### `--ordered-z` This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb"> -39.1% median frame time. #### Random This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable. <img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4"> -7.2% median frame time. ### 3D Meshes `many_cubes --benchmark` <img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc"> -7.7% median frame time. ### Sprites **NOTE: On `main` sprites are using `SparseSet<Entity, T>`!** `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` #### `--ordered-z` This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67"> +13.0% median frame time. #### Random This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses. <img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033"> +0.6% median frame time. ### UI **NOTE: On `main` UI is using `SparseSet<Entity, T>`!** `many_buttons` <img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85"> +15.1% median frame time. ## Alternatives - Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage that `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`. - I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different. - The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved. - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems. - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense. --- ## Changelog - Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in a `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts. ## Migration Guide Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data. Before: ```rust fn queue_custom( material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>, ) { ... for (entity, mesh_transforms, mesh_handle) in &material_meshes { ... } } ``` After: ```rust fn queue_custom( render_mesh_instances: Res<RenderMeshInstances>, instance_entities: Query<Entity, With<InstanceMaterialData>>, ) { ... for entity in &instance_entities { let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; }; // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now // be found in `mesh_instance` which is a `RenderMeshInstance` ... } } ``` --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-09-27 08:28:28 +00:00
if !mesh_instance.shadow_caster {
continue;
}
let Some(material_asset_id) = render_material_instances.get(&entity) else {
continue;
};
Use EntityHashMap<Entity, T> for render world entity storage for better performance (#9903) # Objective - Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does. ## Solution - Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks for @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. - Define `EntityHashMap` type that uses the `EntityHashser` - Use `EntityHashMap<Entity, T>` for render world entity storage, including: - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D. - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D. - `SkinIndices` and `MorphIndices` - contains the skin and morph index for an entity, respectively - `ExtractedSprites` - `ExtractedUiNodes` ## Benchmarks All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none. ### 2D Meshes `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` #### `--ordered-z` This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb"> -39.1% median frame time. #### Random This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable. <img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4"> -7.2% median frame time. ### 3D Meshes `many_cubes --benchmark` <img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc"> -7.7% median frame time. ### Sprites **NOTE: On `main` sprites are using `SparseSet<Entity, T>`!** `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` #### `--ordered-z` This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67"> +13.0% median frame time. #### Random This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses. <img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033"> +0.6% median frame time. ### UI **NOTE: On `main` UI is using `SparseSet<Entity, T>`!** `many_buttons` <img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85"> +15.1% median frame time. ## Alternatives - Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage that `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`. - I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different. - The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved. - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems. - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense. --- ## Changelog - Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in a `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts. ## Migration Guide Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data. Before: ```rust fn queue_custom( material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>, ) { ... for (entity, mesh_transforms, mesh_handle) in &material_meshes { ... } } ``` After: ```rust fn queue_custom( render_mesh_instances: Res<RenderMeshInstances>, instance_entities: Query<Entity, With<InstanceMaterialData>>, ) { ... for entity in &instance_entities { let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; }; // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now // be found in `mesh_instance` which is a `RenderMeshInstance` ... } } ``` --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-09-27 08:28:28 +00:00
let Some(material) = render_materials.get(material_asset_id) else {
continue;
};
Use EntityHashMap<Entity, T> for render world entity storage for better performance (#9903) # Objective - Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does. ## Solution - Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks for @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. - Define `EntityHashMap` type that uses the `EntityHashser` - Use `EntityHashMap<Entity, T>` for render world entity storage, including: - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D. - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D. - `SkinIndices` and `MorphIndices` - contains the skin and morph index for an entity, respectively - `ExtractedSprites` - `ExtractedUiNodes` ## Benchmarks All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none. ### 2D Meshes `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` #### `--ordered-z` This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb"> -39.1% median frame time. #### Random This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable. <img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4"> -7.2% median frame time. ### 3D Meshes `many_cubes --benchmark` <img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc"> -7.7% median frame time. ### Sprites **NOTE: On `main` sprites are using `SparseSet<Entity, T>`!** `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` #### `--ordered-z` This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67"> +13.0% median frame time. #### Random This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses. <img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033"> +0.6% median frame time. ### UI **NOTE: On `main` UI is using `SparseSet<Entity, T>`!** `many_buttons` <img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85"> +15.1% median frame time. ## Alternatives - Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage that `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`. - I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different. - The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved. - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems. - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense. --- ## Changelog - Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in a `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts. ## Migration Guide Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data. Before: ```rust fn queue_custom( material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>, ) { ... for (entity, mesh_transforms, mesh_handle) in &material_meshes { ... } } ``` After: ```rust fn queue_custom( render_mesh_instances: Res<RenderMeshInstances>, instance_entities: Query<Entity, With<InstanceMaterialData>>, ) { ... for entity in &instance_entities { let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; }; // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now // be found in `mesh_instance` which is a `RenderMeshInstance` ... } } ``` --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-09-27 08:28:28 +00:00
let Some(mesh) = render_meshes.get(mesh_instance.mesh_asset_id) else {
continue;
};
let mut mesh_key =
MeshPipelineKey::from_primitive_topology(mesh.primitive_topology)
| MeshPipelineKey::DEPTH_PREPASS;
if mesh.morph_targets.is_some() {
mesh_key |= MeshPipelineKey::MORPH_TARGETS;
}
if is_directional_light {
mesh_key |= MeshPipelineKey::DEPTH_CLAMP_ORTHO;
}
// Even though we don't use the lightmap in the shadow map, the
// `SetMeshBindGroup` render command will bind the data for it. So
// we need to include the appropriate flag in the mesh pipeline key
// to ensure that the necessary bind group layout entries are
// present.
if render_lightmaps.render_lightmaps.contains_key(&entity) {
mesh_key |= MeshPipelineKey::LIGHTMAPPED;
}
mesh_key |= match material.properties.alpha_mode {
AlphaMode::Mask(_)
| AlphaMode::Blend
| AlphaMode::Premultiplied
| AlphaMode::Add => MeshPipelineKey::MAY_DISCARD,
_ => MeshPipelineKey::NONE,
};
let pipeline_id = pipelines.specialize(
&pipeline_cache,
&prepass_pipeline,
MaterialPipelineKey {
mesh_key,
bind_group_data: material.key.clone(),
},
&mesh.layout,
);
let pipeline_id = match pipeline_id {
Ok(id) => id,
Err(err) => {
error!("{}", err);
continue;
}
};
mesh_instance
.material_bind_group_id
.set(material.get_bind_group_id());
shadow_phase.add(Shadow {
draw_function: draw_shadow_mesh,
pipeline: pipeline_id,
entity,
distance: 0.0, // TODO: sort front-to-back
Automatic batching/instancing of draw commands (#9685) # Objective - Implement the foundations of automatic batching/instancing of draw commands as the next step from #89 - NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data. ## Solution - The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on. - The following assumptions have been made: - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases - View bindings are constant across a phase for a given draw function as phases are per-view - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size. - Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago. - Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed. - Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components. ## Benchmarks All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`: ```rust ( exit_after: Some(1500) ) ``` ...in order to run each test for 1500 frames. The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined this means that the same frames should be rendered both on main and on the branch. The graphs compare main (yellow) to this PR (red). ### 3D Mesh `many_cubes --benchmark` <img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4"> The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time. The 1000th frame is identical using the flip tool: ![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4615 seconds ``` ### 3D Mesh `many_cubes --benchmark --material-texture-count 10` <img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf"> This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched. The 1000th frame is identical in flip: ![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4537 seconds ``` ### 3D Mesh `many_cubes --benchmark --vary-per-instance` <img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f"> This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities. The 1000th frame is identical according to flip: ![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4568 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` <img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4"> This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time. Capturing a screenshot of the 1000th frame main vs PR gives: ![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40) ``` Mean: 0.001222 Weighted median: 0.750432 1st weighted quartile: 0.453494 3rd weighted quartile: 0.969758 Min: 0.000000 Max: 0.990296 Evaluation time: 0.4255 seconds ``` So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected. ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10` <img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c"> This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay). ![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f) ``` Mean: 0.002877 Weighted median: 0.964980 1st weighted quartile: 0.668871 3rd weighted quartile: 0.982749 Min: 0.000000 Max: 0.992377 Evaluation time: 0.4301 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance` <img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb"> This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws. The 1000th frame is the same: ![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97) ``` Mean: 0.001214 Weighted median: 0.937499 1st weighted quartile: 0.635467 3rd weighted quartile: 0.979085 Min: 0.000000 Max: 0.988971 Evaluation time: 0.4462 seconds ``` ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` <img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43"> This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10` <img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413"> This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance` <img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6"> This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results. ### Summary | Benchmark | % change in median frame time | | ------------- | ------------- | | many_cubes | 🟩 -30% | | many_cubes 10 materials | 🟩 -5% | | many_cubes unique materials | 🟩 ~0% | | bevymark mesh2d | 🟩 -50% | | bevymark mesh2d 10 materials | 🟩 -50% | | bevymark mesh2d unique materials | 🟩 -7% | | bevymark sprite | 🟥 2% | | bevymark sprite 10 materials | 🟥 0.6% | | bevymark sprite unique materials | 🟥 4.1% | --- ## Changelog - Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Nicola Papale <nico@nicopap.ch>
2023-09-21 22:12:34 +00:00
batch_range: 0..1,
dynamic_offset: None,
});
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
}
}
}
}
pub struct Shadow {
pub distance: f32,
pub entity: Entity,
pub pipeline: CachedRenderPipelineId,
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
pub draw_function: DrawFunctionId,
Automatic batching/instancing of draw commands (#9685) # Objective - Implement the foundations of automatic batching/instancing of draw commands as the next step from #89 - NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data. ## Solution - The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on. - The following assumptions have been made: - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases - View bindings are constant across a phase for a given draw function as phases are per-view - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size. - Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago. - Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed. - Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components. ## Benchmarks All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`: ```rust ( exit_after: Some(1500) ) ``` ...in order to run each test for 1500 frames. The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined this means that the same frames should be rendered both on main and on the branch. The graphs compare main (yellow) to this PR (red). ### 3D Mesh `many_cubes --benchmark` <img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4"> The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time. The 1000th frame is identical using the flip tool: ![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4615 seconds ``` ### 3D Mesh `many_cubes --benchmark --material-texture-count 10` <img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf"> This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched. The 1000th frame is identical in flip: ![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4537 seconds ``` ### 3D Mesh `many_cubes --benchmark --vary-per-instance` <img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f"> This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities. The 1000th frame is identical according to flip: ![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4568 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` <img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4"> This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time. Capturing a screenshot of the 1000th frame main vs PR gives: ![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40) ``` Mean: 0.001222 Weighted median: 0.750432 1st weighted quartile: 0.453494 3rd weighted quartile: 0.969758 Min: 0.000000 Max: 0.990296 Evaluation time: 0.4255 seconds ``` So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected. ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10` <img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c"> This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay). ![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f) ``` Mean: 0.002877 Weighted median: 0.964980 1st weighted quartile: 0.668871 3rd weighted quartile: 0.982749 Min: 0.000000 Max: 0.992377 Evaluation time: 0.4301 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance` <img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb"> This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws. The 1000th frame is the same: ![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97) ``` Mean: 0.001214 Weighted median: 0.937499 1st weighted quartile: 0.635467 3rd weighted quartile: 0.979085 Min: 0.000000 Max: 0.988971 Evaluation time: 0.4462 seconds ``` ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` <img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43"> This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10` <img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413"> This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance` <img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6"> This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results. ### Summary | Benchmark | % change in median frame time | | ------------- | ------------- | | many_cubes | 🟩 -30% | | many_cubes 10 materials | 🟩 -5% | | many_cubes unique materials | 🟩 ~0% | | bevymark mesh2d | 🟩 -50% | | bevymark mesh2d 10 materials | 🟩 -50% | | bevymark mesh2d unique materials | 🟩 -7% | | bevymark sprite | 🟥 2% | | bevymark sprite 10 materials | 🟥 0.6% | | bevymark sprite unique materials | 🟥 4.1% | --- ## Changelog - Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Nicola Papale <nico@nicopap.ch>
2023-09-21 22:12:34 +00:00
pub batch_range: Range<u32>,
pub dynamic_offset: Option<NonMaxU32>,
2021-06-02 02:59:17 +00:00
}
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
impl PhaseItem for Shadow {
Reorder render sets, refactor bevy_sprite to take advantage (#9236) This is a continuation of this PR: #8062 # Objective - Reorder render schedule sets to allow data preparation when phase item order is known to support improved batching - Part of the batching/instancing etc plan from here: https://github.com/bevyengine/bevy/issues/89#issuecomment-1379249074 - The original idea came from @inodentry and proved to be a good one. Thanks! - Refactor `bevy_sprite` and `bevy_ui` to take advantage of the new ordering ## Solution - Move `Prepare` and `PrepareFlush` after `PhaseSortFlush` - Add a `PrepareAssets` set that runs in parallel with other systems and sets in the render schedule. - Put prepare_assets systems in the `PrepareAssets` set - If explicit dependencies are needed on Mesh or Material RenderAssets then depend on the appropriate system. - Add `ManageViews` and `ManageViewsFlush` sets between `ExtractCommands` and Queue - Move `queue_mesh*_bind_group` to the Prepare stage - Rename them to `prepare_` - Put systems that prepare resources (buffers, textures, etc.) into a `PrepareResources` set inside `Prepare` - Put the `prepare_..._bind_group` systems into a `PrepareBindGroup` set after `PrepareResources` - Move `prepare_lights` to the `ManageViews` set - `prepare_lights` creates views and this must happen before `Queue` - This system needs refactoring to stop handling all responsibilities - Gather lights, sort, and create shadow map views. Store sorted light entities in a resource - Remove `BatchedPhaseItem` - Replace `batch_range` with `batch_size` representing how many items to skip after rendering the item or to skip the item entirely if `batch_size` is 0. - `queue_sprites` has been split into `queue_sprites` for queueing phase items and `prepare_sprites` for batching after the `PhaseSort` - `PhaseItem`s are still inserted in `queue_sprites` - After sorting adjacent compatible sprite phase items are accumulated into `SpriteBatch` components on the first entity of each batch, containing a range of vertex indices. The associated `PhaseItem`'s `batch_size` is updated appropriately. - `SpriteBatch` items are then drawn skipping over the other items in the batch based on the value in `batch_size` - A very similar refactor was performed on `bevy_ui` --- ## Changelog Changed: - Reordered and reworked render app schedule sets. The main change is that data is extracted, queued, sorted, and then prepared when the order of data is known. - Refactor `bevy_sprite` and `bevy_ui` to take advantage of the reordering. ## Migration Guide - Assets such as materials and meshes should now be created in `PrepareAssets` e.g. `prepare_assets<Mesh>` - Queueing entities to `RenderPhase`s continues to be done in `Queue` e.g. `queue_sprites` - Preparing resources (textures, buffers, etc.) should now be done in `PrepareResources`, e.g. `prepare_prepass_textures`, `prepare_mesh_uniforms` - Prepare bind groups should now be done in `PrepareBindGroups` e.g. `prepare_mesh_bind_group` - Any batching or instancing can now be done in `Prepare` where the order of the phase items is known e.g. `prepare_sprites` ## Next Steps - Introduce some generic mechanism to ensure items that can be batched are grouped in the phase item order, currently you could easily have `[sprite at z 0, mesh at z 0, sprite at z 0]` preventing batching. - Investigate improved orderings for building the MeshUniform buffer - Implementing batching across the rest of bevy --------- Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-08-27 14:33:49 +00:00
type SortKey = usize;
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
#[inline]
fn entity(&self) -> Entity {
self.entity
}
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
#[inline]
fn sort_key(&self) -> Self::SortKey {
Reorder render sets, refactor bevy_sprite to take advantage (#9236) This is a continuation of this PR: #8062 # Objective - Reorder render schedule sets to allow data preparation when phase item order is known to support improved batching - Part of the batching/instancing etc plan from here: https://github.com/bevyengine/bevy/issues/89#issuecomment-1379249074 - The original idea came from @inodentry and proved to be a good one. Thanks! - Refactor `bevy_sprite` and `bevy_ui` to take advantage of the new ordering ## Solution - Move `Prepare` and `PrepareFlush` after `PhaseSortFlush` - Add a `PrepareAssets` set that runs in parallel with other systems and sets in the render schedule. - Put prepare_assets systems in the `PrepareAssets` set - If explicit dependencies are needed on Mesh or Material RenderAssets then depend on the appropriate system. - Add `ManageViews` and `ManageViewsFlush` sets between `ExtractCommands` and Queue - Move `queue_mesh*_bind_group` to the Prepare stage - Rename them to `prepare_` - Put systems that prepare resources (buffers, textures, etc.) into a `PrepareResources` set inside `Prepare` - Put the `prepare_..._bind_group` systems into a `PrepareBindGroup` set after `PrepareResources` - Move `prepare_lights` to the `ManageViews` set - `prepare_lights` creates views and this must happen before `Queue` - This system needs refactoring to stop handling all responsibilities - Gather lights, sort, and create shadow map views. Store sorted light entities in a resource - Remove `BatchedPhaseItem` - Replace `batch_range` with `batch_size` representing how many items to skip after rendering the item or to skip the item entirely if `batch_size` is 0. - `queue_sprites` has been split into `queue_sprites` for queueing phase items and `prepare_sprites` for batching after the `PhaseSort` - `PhaseItem`s are still inserted in `queue_sprites` - After sorting adjacent compatible sprite phase items are accumulated into `SpriteBatch` components on the first entity of each batch, containing a range of vertex indices. The associated `PhaseItem`'s `batch_size` is updated appropriately. - `SpriteBatch` items are then drawn skipping over the other items in the batch based on the value in `batch_size` - A very similar refactor was performed on `bevy_ui` --- ## Changelog Changed: - Reordered and reworked render app schedule sets. The main change is that data is extracted, queued, sorted, and then prepared when the order of data is known. - Refactor `bevy_sprite` and `bevy_ui` to take advantage of the reordering. ## Migration Guide - Assets such as materials and meshes should now be created in `PrepareAssets` e.g. `prepare_assets<Mesh>` - Queueing entities to `RenderPhase`s continues to be done in `Queue` e.g. `queue_sprites` - Preparing resources (textures, buffers, etc.) should now be done in `PrepareResources`, e.g. `prepare_prepass_textures`, `prepare_mesh_uniforms` - Prepare bind groups should now be done in `PrepareBindGroups` e.g. `prepare_mesh_bind_group` - Any batching or instancing can now be done in `Prepare` where the order of the phase items is known e.g. `prepare_sprites` ## Next Steps - Introduce some generic mechanism to ensure items that can be batched are grouped in the phase item order, currently you could easily have `[sprite at z 0, mesh at z 0, sprite at z 0]` preventing batching. - Investigate improved orderings for building the MeshUniform buffer - Implementing batching across the rest of bevy --------- Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
2023-08-27 14:33:49 +00:00
self.pipeline.id()
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
}
#[inline]
fn draw_function(&self) -> DrawFunctionId {
self.draw_function
}
Allow unbatched render phases to use unstable sorts (#5049) # Objective Partially addresses #4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in #4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
2022-06-23 10:52:49 +00:00
#[inline]
fn sort(items: &mut [Self]) {
// The shadow phase is sorted by pipeline id for performance reasons.
// Grouping all draw commands using the same pipeline together performs
// better than rebinding everything at a high rate.
Use GpuArrayBuffer for MeshUniform (#9254) # Objective - Reduce the number of rebindings to enable batching of draw commands ## Solution - Use the new `GpuArrayBuffer` for `MeshUniform` data to store all `MeshUniform` data in arrays within fewer bindings - Sort opaque/alpha mask prepass, opaque/alpha mask main, and shadow phases also by the batch per-object data binding dynamic offset to improve performance on WebGL2. --- ## Changelog - Changed: Per-object `MeshUniform` data is now managed by `GpuArrayBuffer` as arrays in buffers that need to be indexed into. ## Migration Guide Accessing the `model` member of an individual mesh object's shader `Mesh` struct the old way where each `MeshUniform` was stored at its own dynamic offset: ```rust struct Vertex { @location(0) position: vec3<f32>, }; fn vertex(vertex: Vertex) -> VertexOutput { var out: VertexOutput; out.clip_position = mesh_position_local_to_clip( mesh.model, vec4<f32>(vertex.position, 1.0) ); return out; } ``` The new way where one needs to index into the array of `Mesh`es for the batch: ```rust struct Vertex { @builtin(instance_index) instance_index: u32, @location(0) position: vec3<f32>, }; fn vertex(vertex: Vertex) -> VertexOutput { var out: VertexOutput; out.clip_position = mesh_position_local_to_clip( mesh[vertex.instance_index].model, vec4<f32>(vertex.position, 1.0) ); return out; } ``` Note that using the instance_index is the default way to pass the per-object index into the shader, but if you wish to do custom rendering approaches you can pass it in however you like. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com>
2023-07-30 13:17:08 +00:00
radsort::sort_by_key(items, |item| item.sort_key());
Allow unbatched render phases to use unstable sorts (#5049) # Objective Partially addresses #4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in #4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
2022-06-23 10:52:49 +00:00
}
Automatic batching/instancing of draw commands (#9685) # Objective - Implement the foundations of automatic batching/instancing of draw commands as the next step from #89 - NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data. ## Solution - The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on. - The following assumptions have been made: - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases - View bindings are constant across a phase for a given draw function as phases are per-view - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size. - Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago. - Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed. - Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components. ## Benchmarks All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`: ```rust ( exit_after: Some(1500) ) ``` ...in order to run each test for 1500 frames. The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined this means that the same frames should be rendered both on main and on the branch. The graphs compare main (yellow) to this PR (red). ### 3D Mesh `many_cubes --benchmark` <img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4"> The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time. The 1000th frame is identical using the flip tool: ![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4615 seconds ``` ### 3D Mesh `many_cubes --benchmark --material-texture-count 10` <img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf"> This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched. The 1000th frame is identical in flip: ![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4537 seconds ``` ### 3D Mesh `many_cubes --benchmark --vary-per-instance` <img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f"> This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities. The 1000th frame is identical according to flip: ![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c) ``` Mean: 0.000000 Weighted median: 0.000000 1st weighted quartile: 0.000000 3rd weighted quartile: 0.000000 Min: 0.000000 Max: 0.000000 Evaluation time: 0.4568 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` <img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4"> This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time. Capturing a screenshot of the 1000th frame main vs PR gives: ![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40) ``` Mean: 0.001222 Weighted median: 0.750432 1st weighted quartile: 0.453494 3rd weighted quartile: 0.969758 Min: 0.000000 Max: 0.990296 Evaluation time: 0.4255 seconds ``` So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected. ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10` <img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c"> This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay). ![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f) ``` Mean: 0.002877 Weighted median: 0.964980 1st weighted quartile: 0.668871 3rd weighted quartile: 0.982749 Min: 0.000000 Max: 0.992377 Evaluation time: 0.4301 seconds ``` ### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance` <img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb"> This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws. The 1000th frame is the same: ![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97) ``` Mean: 0.001214 Weighted median: 0.937499 1st weighted quartile: 0.635467 3rd weighted quartile: 0.979085 Min: 0.000000 Max: 0.988971 Evaluation time: 0.4462 seconds ``` ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` <img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43"> This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10` <img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413"> This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't. ### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance` <img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6"> This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results. ### Summary | Benchmark | % change in median frame time | | ------------- | ------------- | | many_cubes | 🟩 -30% | | many_cubes 10 materials | 🟩 -5% | | many_cubes unique materials | 🟩 ~0% | | bevymark mesh2d | 🟩 -50% | | bevymark mesh2d 10 materials | 🟩 -50% | | bevymark mesh2d unique materials | 🟩 -7% | | bevymark sprite | 🟥 2% | | bevymark sprite 10 materials | 🟥 0.6% | | bevymark sprite unique materials | 🟥 4.1% | --- ## Changelog - Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance. --------- Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com> Co-authored-by: Nicola Papale <nico@nicopap.ch>
2023-09-21 22:12:34 +00:00
#[inline]
fn batch_range(&self) -> &Range<u32> {
&self.batch_range
}
#[inline]
fn batch_range_mut(&mut self) -> &mut Range<u32> {
&mut self.batch_range
}
#[inline]
fn dynamic_offset(&self) -> Option<NonMaxU32> {
self.dynamic_offset
}
#[inline]
fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
&mut self.dynamic_offset
}
Modular Rendering (#2831) This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts. To make rendering more modular, I made a number of changes: * Entities now drive rendering: * We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering" * To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815 * Reworked the `Draw` abstraction: * Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key" * Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems) * `Draw` trait and and `DrawFunctions` are now generic on PhaseItem * Modular / Ergonomic `DrawFunctions` via `RenderCommands` * RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations: ```rust pub type DrawPbr = ( SetPbrPipeline, SetMeshViewBindGroup<0>, SetStandardMaterialBindGroup<1>, SetTransformBindGroup<2>, DrawMesh, ); ``` * The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function: ```rust type DrawCustom = ( SetCustomMaterialPipeline, SetMeshViewBindGroup<0>, SetTransformBindGroup<2>, DrawMesh, ); ``` * ExtractComponentPlugin and UniformComponentPlugin: * Simple, standardized ways to easily extract individual components and write them to GPU buffers * Ported PBR and Sprite rendering to the new primitives above. * Removed staging buffer from UniformVec in favor of direct Queue usage * Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems). * Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark! * Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know! * RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers). * RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes. * Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
}
2021-06-02 02:59:17 +00:00
impl CachedRenderPipelinePhaseItem for Shadow {
#[inline]
fn cached_pipeline(&self) -> CachedRenderPipelineId {
self.pipeline
}
}
2021-06-02 02:59:17 +00:00
pub struct ShadowPassNode {
main_view_query: QueryState<&'static ViewLightEntities>,
view_light_query: QueryState<(&'static ShadowView, &'static RenderPhase<Shadow>)>,
2021-06-02 02:59:17 +00:00
}
impl ShadowPassNode {
pub fn new(world: &mut World) -> Self {
Self {
main_view_query: QueryState::new(world),
view_light_query: QueryState::new(world),
}
}
}
impl Node for ShadowPassNode {
fn update(&mut self, world: &mut World) {
self.main_view_query.update_archetypes(world);
self.view_light_query.update_archetypes(world);
}
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
fn run<'w>(
2021-06-02 02:59:17 +00:00
&self,
graph: &mut RenderGraphContext,
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
render_context: &mut RenderContext<'w>,
world: &'w World,
2021-06-02 02:59:17 +00:00
) -> Result<(), NodeRunError> {
Make render graph slots optional for most cases (#8109) # Objective - Currently, the render graph slots are only used to pass the view_entity around. This introduces significant boilerplate for very little value. Instead of using slots for this, make the view_entity part of the `RenderGraphContext`. This also means we won't need to have `IN_VIEW` on every node and and we'll be able to use the default impl of `Node::input()`. ## Solution - Add `view_entity: Option<Entity>` to the `RenderGraphContext` - Update all nodes to use this instead of entity slot input --- ## Changelog - Add optional `view_entity` to `RenderGraphContext` ## Migration Guide You can now get the view_entity directly from the `RenderGraphContext`. When implementing the Node: ```rust // 0.10 struct FooNode; impl FooNode { const IN_VIEW: &'static str = "view"; } impl Node for FooNode { fn input(&self) -> Vec<SlotInfo> { vec![SlotInfo::new(Self::IN_VIEW, SlotType::Entity)] } fn run( &self, graph: &mut RenderGraphContext, // ... ) -> Result<(), NodeRunError> { let view_entity = graph.get_input_entity(Self::IN_VIEW)?; // ... Ok(()) } } // 0.11 struct FooNode; impl Node for FooNode { fn run( &self, graph: &mut RenderGraphContext, // ... ) -> Result<(), NodeRunError> { let view_entity = graph.view_entity(); // ... Ok(()) } } ``` When adding the node to the graph, you don't need to specify a slot_edge for the view_entity. ```rust // 0.10 let mut graph = RenderGraph::default(); graph.add_node(FooNode::NAME, node); let input_node_id = draw_2d_graph.set_input(vec![SlotInfo::new( graph::input::VIEW_ENTITY, SlotType::Entity, )]); graph.add_slot_edge( input_node_id, graph::input::VIEW_ENTITY, FooNode::NAME, FooNode::IN_VIEW, ); // add_node_edge ... // 0.11 let mut graph = RenderGraph::default(); graph.add_node(FooNode::NAME, node); // add_node_edge ... ``` ## Notes This PR paired with #8007 will help reduce a lot of annoying boilerplate with the render nodes. Depending on which one gets merged first. It will require a bit of clean up work to make both compatible. I tagged this as a breaking change, because using the old system to get the view_entity will break things because it's not a node input slot anymore. ## Notes for reviewers A lot of the diffs are just removing the slots in every nodes and graph creation. The important part is mostly in the graph_runner/CameraDriverNode.
2023-03-21 20:11:13 +00:00
let view_entity = graph.view_entity();
2021-07-02 01:05:20 +00:00
if let Ok(view_lights) = self.main_view_query.get_manual(world, view_entity) {
for view_light_entity in view_lights.lights.iter().copied() {
let (view_light, shadow_phase) = self
.view_light_query
.get_manual(world, view_light_entity)
.unwrap();
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
let depth_stencil_attachment =
Some(view_light.depth_attachment.get_attachment(StoreOp::Store));
render_context.add_command_buffer_generation_task(move |render_device| {
#[cfg(feature = "trace")]
let _shadow_pass_span = info_span!("shadow_pass").entered();
let mut command_encoder =
render_device.create_command_encoder(&CommandEncoderDescriptor {
label: Some("shadow_pass_command_encoder"),
});
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
let render_pass = command_encoder.begin_render_pass(&RenderPassDescriptor {
label: Some(&view_light.pass_name),
color_attachments: &[],
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
depth_stencil_attachment,
Update to wgpu 0.18 (#10266) # Objective Keep up to date with wgpu. ## Solution Update the wgpu version. Currently blocked on naga_oil updating to naga 0.14 and releasing a new version. 3d scenes (or maybe any scene with lighting?) currently don't render anything due to ``` error: naga_oil bug, please file a report: composer failed to build a valid header: Type [2] '' is invalid = Capability Capabilities(CUBE_ARRAY_TEXTURES) is required ``` I'm not sure what should be passed in for `wgpu::InstanceFlags`, or if we want to make the gles3minorversion configurable (might be useful for debugging?) Currently blocked on https://github.com/bevyengine/naga_oil/pull/63, and https://github.com/gfx-rs/wgpu/issues/4569 to be fixed upstream in wgpu first. ## Known issues Amd+windows+vulkan has issues with texture_binding_arrays (see the image [here](https://github.com/bevyengine/bevy/pull/10266#issuecomment-1819946278)), but that'll be fixed in the next wgpu/naga version, and you can just use dx12 as a workaround for now (Amd+linux mesa+vulkan texture_binding_arrays are fixed though). --- ## Changelog Updated wgpu to 0.18, naga to 0.14.2, and naga_oil to 0.11. - Windows desktop GL should now be less painful as it no longer requires Angle. - You can now toggle shader validation and debug information for debug and release builds using `WgpuSettings.instance_flags` and [InstanceFlags](https://docs.rs/wgpu/0.18.0/wgpu/struct.InstanceFlags.html) ## Migration Guide - `RenderPassDescriptor` `color_attachments` (as well as `RenderPassColorAttachment`, and `RenderPassDepthStencilAttachment`) now use `StoreOp::Store` or `StoreOp::Discard` instead of a `boolean` to declare whether or not they should be stored. - `RenderPassDescriptor` now have `timestamp_writes` and `occlusion_query_set` fields. These can safely be set to `None`. - `ComputePassDescriptor` now have a `timestamp_writes` field. This can be set to `None` for now. - See the [wgpu changelog](https://github.com/gfx-rs/wgpu/blob/trunk/CHANGELOG.md#v0180-2023-10-25) for additional details
2023-12-14 02:45:47 +00:00
timestamp_writes: None,
occlusion_query_set: None,
});
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
let mut render_pass = TrackedRenderPass::new(&render_device, render_pass);
Multithreaded render command encoding (#9172) # Objective - Encoding many GPU commands (such as in a renderpass with many draws, such as the main opaque pass) onto a `wgpu::CommandEncoder` is very expensive, and takes a long time. - To improve performance, we want to perform the command encoding for these heavy passes in parallel. ## Solution - `RenderContext` can now queue up "command buffer generation tasks" which are closures that will generate a command buffer when called. - When finalizing the render context to produce the final list of command buffers, these tasks are run in parallel on the `ComputeTaskPool` to produce their corresponding command buffers. - The general idea is that the node graph will run in serial, but in a node, instead of doing rendering work, you can add tasks to do render work in parallel with other node's tasks that get ran at the end of the graph execution. ## Nodes Parallelized - `MainOpaquePass3dNode` - `PrepassNode` - `DeferredGBufferPrepassNode` - `ShadowPassNode` (One task per view) ## Future Work - For large number of draws calls, might be worth further subdividing passes into 2+ tasks. - Extend this to UI, 2d, transparent, and transmissive nodes? - Needs testing - small command buffers are inefficient - it may be worth reverting to the serial command encoder usage for render phases with few items. - All "serial" (traditional) rendering work must finish before parallel rendering tasks (the new stuff) can start to run. - There is still only one submission to the graphics queue at the end of the graph execution. There is still no ability to submit work earlier. ## Performance Improvement Thanks to @Elabajaba for testing on Bistro. ![image](https://github.com/bevyengine/bevy/assets/47158642/be50dafa-85eb-4da5-a5cd-c0a044f1e76f) TLDR: Without shadow mapping, this PR has no impact. _With_ shadow mapping, this PR gives **~40 more fps** than main. --- ## Changelog - `MainOpaquePass3dNode`, `PrepassNode`, `DeferredGBufferPrepassNode`, and each shadow map within `ShadowPassNode` are now encoded in parallel, giving _greatly_ increased CPU performance, mainly when shadow mapping is enabled. - Does not work on WASM or AMD+Windows+Vulkan. - Added `RenderContext::add_command_buffer_generation_task()`. - `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. ## Migration Guide `RenderContext::new()` now takes adapter info - Some render graph and Node related types and methods now have additional lifetime constraints. --------- Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com> Co-authored-by: François <mockersf@gmail.com>
2024-02-09 07:35:35 +00:00
shadow_phase.render(&mut render_pass, world, view_light_entity);
drop(render_pass);
command_encoder.finish()
});
}
2021-06-02 02:59:17 +00:00
}
Ok(())
}
}