// bevy/crates/bevy_pbr/src/render/light.rs

use crate::{
    point_light_order, AmbientLight, Clusters, CubemapVisibleEntities, DirectionalLight,
    DirectionalLightShadowMap, DrawMesh, GlobalVisiblePointLights, MeshPipeline, NotShadowCaster,
    PointLight, PointLightShadowMap, SetMeshBindGroup, VisiblePointLights, SHADOW_SHADER_HANDLE,
};
use bevy_asset::Handle;
use bevy_core_pipeline::Transparent3d;
use bevy_ecs::{
    prelude::*,
    system::{lifetimeless::*, SystemParamItem},
};
use bevy_math::{const_vec3, Mat4, UVec2, UVec3, UVec4, Vec2, Vec3, Vec4, Vec4Swizzles};
use bevy_render::{
    camera::{Camera, CameraProjection},
    color::Color,
    mesh::{Mesh, MeshVertexBufferLayout},
    render_asset::RenderAssets,
    render_graph::{Node, NodeRunError, RenderGraphContext, SlotInfo, SlotType},
    render_phase::{
        CachedRenderPipelinePhaseItem, DrawFunctionId, DrawFunctions, EntityPhaseItem,
        EntityRenderCommand, PhaseItem, RenderCommandResult, RenderPhase, SetItemPipeline,
        TrackedRenderPass,
    },
    render_resource::{std140::AsStd140, std430::AsStd430, *},
    renderer::{RenderContext, RenderDevice, RenderQueue},
    texture::*,
    view::{
        ExtractedView, ViewUniform, ViewUniformOffset, ViewUniforms, Visibility, VisibleEntities,
    },
};
use bevy_transform::components::GlobalTransform;
use bevy_utils::FloatOrd;
use bevy_utils::{
    tracing::{error, warn},
    HashMap,
};
use std::num::NonZeroU32;
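
/// Labels for the light-related render systems, so that other systems can be
/// ordered relative to them.
///
/// A minimal ordering sketch, assuming a render sub-app with the usual
/// `RenderStage::Extract` stage; `my_extract_system` is a hypothetical system
/// shown only to illustrate ordering:
///
/// ```ignore
/// use bevy_render::RenderStage;
///
/// render_app.add_system_to_stage(
///     RenderStage::Extract,
///     my_extract_system.after(RenderLightSystems::ExtractLights),
/// );
/// ```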
#[derive(Debug, Hash, PartialEq, Eq, Clone, SystemLabel)]
pub enum RenderLightSystems {
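    /// Copies the computed light clusters from the app world into the render world.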
    ExtractClusters,
    ExtractLights,
Clustered forward rendering (#3153)

# Objective

Implement clustered-forward rendering.

## Solution

~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add some comments into the code with references for the various bits and pieces, and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something works.

* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, and point lights are tested against each of the clusters to see if they would affect that volume within the scene; if so, they are added to a list of lights affecting that cluster. Then, when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that cluster are used in lighting calculations. This brings huge performance and scalability benefits, as most of the time lights are placed so that not many of them overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support, and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer (see the packing sketch below), and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment, if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck, though I don’t know if there are performance implications to reading from textures instead of uniform buffers.
* Because of the uniform buffer binding size limitations, we can support a maximum of 256 lights with the current size of the PointLight struct.
* The z-slicing method (i.e. the mapping from view-space z to a depth slice which defines the near and far planes of a cluster) is the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. The xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would just be much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material (I will add a link) which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically, for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10, but that’s just what we had from before, where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing the performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (it maxes out at 16, but I'm not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help in understanding deeper performance issues due to screen-space tiles spanning multiple clusters in depth, with divergent shader execution times.

Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough, and brings significant enough performance and scalability benefits, that it's worth integrating already now and then iterating on:

* Calculate the light’s effective range based on its intensity and physical falloff, and either just use this or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as, in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant, like closest to the camera, brightest, affecting the most pixels, …).
* Switch to using a texture instead of a uniform buffer.
* Figure out the / a better story for shadows.

I will also probably add an example that demonstrates some of the issues:

* What situations exhaust the space available in the uniform buffers:
  * Light range too large, making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters.
  * Light range set too small, producing visible artifacts where clusters the light would physically affect are not affected by the light.
* Perhaps some performance issues:
  * How many lights can be closely packed, or affect large portions of the view, before performance drops?
2021-12-09 03:08:54 +00:00
    PrepareClusters,
    PrepareLights,
    QueueShadows,
}
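To make the frustum culling notes above concrete, here is a minimal sketch of the early-out plane loop and the Lengyel-style 'relative radius' test they describe. The `HalfSpace`, `SimpleFrustum`, and `SimpleAabb` types below are simplified stand-ins for illustration (the real types live elsewhere in the renderer), not the engine's exact API:

```rust
use bevy_math::{Mat4, Vec3A, Vec4};

pub struct HalfSpace {
    // xyz = plane normal, w = signed distance from the origin along the normal
    pub normal_d: Vec4,
}

pub struct SimpleFrustum {
    pub planes: [HalfSpace; 6],
}

pub struct SimpleAabb {
    pub center: Vec3A,
    pub half_extents: Vec3A,
}

impl SimpleFrustum {
    pub fn intersects_obb(&self, aabb: &SimpleAabb, model_to_world: &Mat4) -> bool {
        let aabb_center_world = *model_to_world * aabb.center.extend(1.0);
        let axes = [
            Vec3A::from(model_to_world.x_axis),
            Vec3A::from(model_to_world.y_axis),
            Vec3A::from(model_to_world.z_axis),
        ];
        for plane in &self.planes {
            let p_normal = Vec3A::from(plane.normal_d);
            // 'Relative radius' of the box with respect to the plane normal:
            // project each world-space box axis onto the normal and sum the extents.
            let relative_radius = Vec3A::new(
                p_normal.dot(axes[0]),
                p_normal.dot(axes[1]),
                p_normal.dot(axes[2]),
            )
            .abs()
            .dot(aabb.half_extents);
            // Early-out as soon as the box lies entirely outside any one plane.
            if plane.normal_d.dot(aabb_center_world) + relative_radius <= 0.0 {
                return false;
            }
        }
        true
    }
}
```

And a toy illustration of the bit shifting and masking the clustered forward notes describe for fitting cluster light lists into a uniform buffer: with at most 256 point lights, an index fits in 8 bits, so four indices pack into each `u32`. The engine's actual layout differs; this only shows the idea:

```rust
// Pack four 8-bit light indices into one u32 (highest slot in the top byte).
fn pack_light_indices(indices: [u8; 4]) -> u32 {
    ((indices[0] as u32) << 24)
        | ((indices[1] as u32) << 16)
        | ((indices[2] as u32) << 8)
        | (indices[3] as u32)
}

// Recover the index in `slot` (0..=3) by shifting and masking.
fn unpack_light_index(packed: u32, slot: u32) -> u32 {
    (packed >> (24 - 8 * slot)) & 0xff
}
```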
bevy_pbr2: Add support for most of the StandardMaterial textures (#4)

* bevy_pbr2: Add support for most of the StandardMaterial textures

  Normal maps are not included here as they require tangents in a vertex attribute.

* bevy_pbr2: Ensure RenderCommandQueue is ready for PbrShaders init
* texture_pipelined: Add a light to the scene so we can see stuff
* WIP bevy_pbr2: back to front sorting hack
* bevy_pbr2: Uniform control flow for texture sampling in pbr.frag

  From 'fintelia' on the Bevy Render Rework Round 2 discussion: "My understanding is that GPUs these days never use the "execute both branches and select the result" strategy. Rather, what they do is evaluate the branch condition on all threads of a warp, and jump over it if all of them evaluate to false. If even a single thread needs to execute the if statement body, however, then the remaining threads are paused until that is completed."

* bevy_pbr2: Simplify texture and sampler names

  The StandardMaterial_ prefix is no longer needed

* bevy_pbr2: Match default 'AmbientColor' of current bevy_pbr for now
* bevy_pbr2: Convert from non-linear to linear sRGB for the color uniform (a sketch of this conversion follows below)
* bevy_pbr2: Add pbr_pipelined example
* Fix view vector in pbr frag to work in ortho
* bevy_pbr2: Use a 90 degree y fov and light range projection for lights
* bevy_pbr2: Add AmbientLight resource
* bevy_pbr2: Convert PointLight color to linear sRGB for use in fragment shader
* bevy_pbr2: pbr.frag: Rename PointLight.projection to view_projection

  The uniform contains the view_projection matrix, so the old name was incorrect.

* bevy_pbr2: PointLight is an OmniLight as it has a radius
* bevy_pbr2: Factor out duplicated code
* bevy_pbr2: Implement RenderAsset for StandardMaterial
* Remove unnecessary texture and sampler clones
* Fix comment formatting
* Remove redundant Buffer::from
* Don't extract meshes when their material textures aren't ready
* Make missing textures in the queue step an error

Co-authored-by: Aevyrie <aevyrie@gmail.com>
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2021-06-27 23:10:23 +00:00
pub struct ExtractedAmbientLight {
    color: Color,
    brightness: f32,
}
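The 'non-linear to linear sRGB' conversion mentioned in the notes above is the standard sRGB transfer function applied per channel; a minimal sketch:

```rust
// Standard sRGB EOTF: non-linear (gamma-encoded) -> linear, per channel in [0, 1].
fn srgb_to_linear(c: f32) -> f32 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}
```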
#[derive(Component)]
pub struct ExtractedPointLight {
    color: Color,
bevy_pbr2: Improve lighting units and documentation (#2704)

# Objective

A question was raised on Discord about the units of the `PointLight` `intensity` member. After digging around in the bevy_pbr2 source code and the [Google Filament documentation](https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower), I discovered that the intention by Filament was that the 'intensity' value for point lights would be in lumens. This makes a lot of sense, as these are quite relatable units given basically all light bulbs I've seen sold over the past years are rated in lumens, as people move away from thinking about how bright a bulb is relative to a non-halogen incandescent bulb. However, it seems that the derivation of the conversion between luminous power (lumens, denoted `Φ` in the Filament formulae) and luminous intensity (lumens per steradian, `I` in the Filament formulae) was missed, and I can see why, as it is tucked right under equation 58 at the link above. As such, while the formula states that for a point light `I = Φ / 4π`, we had been using `intensity` as if it were luminous intensity `I`.

Before this PR, the intensity field is luminous intensity in lumens per steradian. After this PR, the intensity field is luminous power in lumens, [as suggested by Filament](https://google.github.io/filament/Filament.html#table_lighttypesunits) (unfortunately the link jumps to the table's caption, so scroll up to see the actual table). I appreciate that it may be confusing to call this an intensity, but I think this is intended as more of a non-scientific, human-relatable general term, with a bit of hand waving, so that most light types can just have an intensity field, and for most of them it works in the same way or at least with some relatable value. I'm inclined to think this is reasonable, rather than throwing terms like luminous power and luminous intensity at users.

## Solution

- Documented the `PointLight` `intensity` member as 'luminous power' in units of lumens.
- Added a table of examples relating various types of household lighting to lumen values.
- Added in the mapping from luminous power to luminous intensity when premultiplying the intensity into the colour before it is made into a graphics uniform (a sketch of this conversion follows the struct below).
- Updated the documentation in `pbr.wgsl` to clarify the earlier confusion about the missing `/ 4π`.
- Bumped the intensity of the point lights in `3d_scene_pipelined` to 1600 lumens.

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2021-08-23 23:48:11 +00:00
    /// luminous intensity in lumens per steradian
    intensity: f32,
    range: f32,
    radius: f32,
    transform: GlobalTransform,
    shadows_enabled: bool,
    shadow_depth_bias: f32,
    shadow_normal_bias: f32,
}
pub type ExtractedPointLightShadowMap = PointLightShadowMap;
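The luminous power to luminous intensity mapping described in the lighting-units notes above is `I = Φ / 4π`; a minimal sketch of the conversion applied to the `intensity` value before it reaches the shader:

```rust
use std::f32::consts::PI;

// Luminous power Φ in lumens -> luminous intensity I in lumens per steradian.
fn luminous_power_to_intensity(luminous_power_lm: f32) -> f32 {
    luminous_power_lm / (4.0 * PI)
}
```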
#[derive(Component)]
pub struct ExtractedDirectionalLight {
    color: Color,
    illuminance: f32,
    direction: Vec3,
    projection: Mat4,
    shadows_enabled: bool,
    shadow_depth_bias: f32,
    shadow_normal_bias: f32,
    near: f32,
    far: f32,
}
pub type ExtractedDirectionalLightShadowMap = DirectionalLightShadowMap;
#[repr(C)]
Use storage buffers for clustered forward point lights (#3989)

# Objective

- Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene
- Fixes #3605
- Based on top of #4079

This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters, which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR.

## Solution

- Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability
- Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, so the only difference is at initialisation time.
- Add appropriate shader defs in the shader code to handle the two cases

## Context on some decisions / open questions

- I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported (see the usage sketch after the `GpuPointLights` impl below). I was thinking about diving into 'binding resource management', but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately.
- Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants?

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
#[derive(Copy, Clone, AsStd140, AsStd430, Default, Debug)]
pub struct GpuPointLight {
    // The lower-right 2x2 block of the projection matrix: elements [2][2], [2][3], [3][2], [3][3]
    projection_lr: Vec4,
    color_inverse_square_range: Vec4,
    position_radius: Vec4,
    flags: u32,
    shadow_depth_bias: f32,
    shadow_normal_bias: f32,
}
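A hedged sketch of how the packed fields above are plausibly populated from an `ExtractedPointLight`. The field meanings are inferred from the names (color premultiplied by intensity with `1/range²` in w, position with radius in w), so treat the details as illustrative rather than authoritative:

```rust
use bevy_math::{Mat4, Vec4};

fn pack_point_light(
    light: &ExtractedPointLight,
    flags: PointLightFlags,
    cube_face_projection: &Mat4,
) -> GpuPointLight {
    GpuPointLight {
        // Elements [2][2], [2][3], [3][2], [3][3] of the shadow projection.
        projection_lr: Vec4::new(
            cube_face_projection.z_axis.z,
            cube_face_projection.z_axis.w,
            cube_face_projection.w_axis.z,
            cube_face_projection.w_axis.w,
        ),
        // xyz = linear color * intensity, w = 1 / range^2 (assumed packing).
        color_inverse_square_range: (Vec4::from_slice(&light.color.as_linear_rgba_f32())
            * light.intensity)
            .truncate()
            .extend(1.0 / (light.range * light.range)),
        // xyz = world-space position, w = radius (assumed packing).
        position_radius: light.transform.translation.extend(light.radius),
        flags: flags.bits(),
        shadow_depth_bias: light.shadow_depth_bias,
        shadow_normal_bias: light.shadow_normal_bias,
    }
}
```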
pub enum GpuPointLights {
    Uniform {
        buffer: UniformVec<[GpuPointLight; MAX_UNIFORM_BUFFER_POINT_LIGHTS]>,
    },
    Storage {
        buffer: StorageBuffer<GpuPointLight>,
    },
}
impl GpuPointLights {
    fn new(buffer_binding_type: BufferBindingType) -> Self {
        match buffer_binding_type {
            BufferBindingType::Storage { .. } => Self::storage(),
            BufferBindingType::Uniform => Self::uniform(),
        }
    }

    fn uniform() -> Self {
        Self::Uniform {
            buffer: UniformVec::default(),
        }
    }

    fn storage() -> Self {
        Self::Storage {
            buffer: StorageBuffer::default(),
        }
    }

    fn clear(&mut self) {
        match self {
            GpuPointLights::Uniform { buffer } => buffer.clear(),
            GpuPointLights::Storage { buffer } => buffer.clear(),
        }
    }

    fn push(&mut self, mut lights: Vec<GpuPointLight>) {
        match self {
            GpuPointLights::Uniform { buffer } => {
                // NOTE: This iterator construction allows moving and padding with default
                // values and is like this to avoid unnecessary cloning.
                let gpu_point_lights = lights
                    .drain(..)
                    .chain(std::iter::repeat_with(GpuPointLight::default))
                    .take(MAX_UNIFORM_BUFFER_POINT_LIGHTS)
                    .collect::<Vec<_>>();
                buffer.push(gpu_point_lights.try_into().unwrap());
            }
            GpuPointLights::Storage { buffer } => {
                buffer.append(&mut lights);
            }
        }
    }

    fn write_buffer(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
        match self {
            GpuPointLights::Uniform { buffer } => buffer.write_buffer(render_device, render_queue),
            GpuPointLights::Storage { buffer } => buffer.write_buffer(render_device, render_queue),
        }
    }

    pub fn binding(&self) -> Option<BindingResource> {
        match self {
            GpuPointLights::Uniform { buffer } => buffer.binding(),
            GpuPointLights::Storage { buffer } => buffer.binding(),
        }
    }

    pub fn len(&self) -> usize {
        match self {
            GpuPointLights::Uniform { buffer } => buffer.len(),
            GpuPointLights::Storage { buffer } => buffer.values().len(),
        }
    }

    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }
}
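Usage sketch: how `GpuPointLights` might be constructed and filled. The call pattern is identical for both variants; only construction differs. The `>= 3` storage-buffer check mirrors the heuristic described in the storage-buffers notes above; the import paths and the exact limit check are assumptions, not this module's verbatim code:

```rust
use bevy_render::{
    render_resource::BufferBindingType,
    renderer::{RenderDevice, RenderQueue},
};

fn prepare_point_lights(
    lights: Vec<GpuPointLight>,
    render_device: &RenderDevice,
    render_queue: &RenderQueue,
) {
    // Prefer storage buffers when the device exposes enough bindings (assumed check).
    let binding_type = if render_device.limits().max_storage_buffers_per_shader_stage >= 3 {
        BufferBindingType::Storage { read_only: true }
    } else {
        BufferBindingType::Uniform
    };
    let mut gpu_point_lights = GpuPointLights::new(binding_type);
    gpu_point_lights.clear();
    // Pads to MAX_UNIFORM_BUFFER_POINT_LIGHTS in the uniform case, appends in the storage case.
    gpu_point_lights.push(lights);
    gpu_point_lights.write_buffer(render_device, render_queue);
}
```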
// NOTE: These must match the bit flags in bevy_pbr2/src/render/pbr.frag!
bitflags::bitflags! {
    #[repr(transparent)]
    struct PointLightFlags: u32 {
        const SHADOWS_ENABLED = (1 << 0);
        const NONE = 0;
        const UNINITIALIZED = 0xFFFF;
    }
}
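A small sketch of how these flags are likely derived during extraction/preparation; the helper function is hypothetical and shown only for illustration:

```rust
// Hypothetical helper: build the GPU flag bits from the extracted light state.
fn point_light_flags(shadows_enabled: bool) -> PointLightFlags {
    let mut flags = PointLightFlags::NONE;
    if shadows_enabled {
        flags |= PointLightFlags::SHADOWS_ENABLED;
    }
    flags
}
```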
#[repr(C)]
#[derive(Copy, Clone, AsStd140, Default, Debug)]
pub struct GpuDirectionalLight {
    view_projection: Mat4,
    color: Vec4,
    dir_to_light: Vec3,
    flags: u32,
    shadow_depth_bias: f32,
    shadow_normal_bias: f32,
}
// NOTE: These must match the bit flags in bevy_pbr2/src/render/pbr.frag!
bitflags::bitflags! {
    #[repr(transparent)]
    struct DirectionalLightFlags: u32 {
        const SHADOWS_ENABLED = (1 << 0);
        const NONE = 0;
        const UNINITIALIZED = 0xFFFF;
    }
}
#[repr(C)]
#[derive(Copy, Clone, Debug, AsStd140)]
pub struct GpuLights {
    // TODO: this comes first to work around a WGSL alignment issue. We need to solve this issue before releasing the renderer rework
    directional_lights: [GpuDirectionalLight; MAX_DIRECTIONAL_LIGHTS],
    ambient_color: Vec4,
    // xyz are the x/y/z cluster dimensions and w is the number of clusters
    cluster_dimensions: UVec4,
    // xy are vec2<f32>(cluster_dimensions.xy) / vec2<f32>(view.width, view.height)
    // z is cluster_dimensions.z / log(far / near)
    // w is cluster_dimensions.z * log(near) / log(far / near)
    cluster_factors: Vec4,
    n_directional_lights: u32,
}
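The `cluster_factors` comments above encode the exponential Doom-2016-style z-slicing: a fragment at view-space depth `view_z` maps to slice `floor(ln(view_z) * factors.z - factors.w)`, since `z_slices * ln(view_z / near) / ln(far / near)` expands to exactly that multiply-subtract. A sketch implementing those comments directly (function names are illustrative):

```rust
// z = cluster_dimensions.z / ln(far / near)
// w = cluster_dimensions.z * ln(near) / ln(far / near)
fn cluster_factors_zw(near: f32, far: f32, z_slices: f32) -> (f32, f32) {
    let log_far_over_near = (far / near).ln();
    (
        z_slices / log_far_over_near,
        z_slices * near.ln() / log_far_over_near,
    )
}

// The shader then only needs one log, one multiply-subtract, and a floor.
fn view_z_to_slice(view_z: f32, factors_z: f32, factors_w: f32) -> u32 {
    (view_z.ln() * factors_z - factors_w).floor().max(0.0) as u32
}
```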
// NOTE: this must be kept in sync with the same constants in pbr.frag
pub const MAX_UNIFORM_BUFFER_POINT_LIGHTS: usize = 256;
pub const MAX_DIRECTIONAL_LIGHTS: usize = 1;
pub const DIRECTIONAL_SHADOW_LAYERS: u32 = MAX_DIRECTIONAL_LIGHTS as u32;
pub const SHADOW_FORMAT: TextureFormat = TextureFormat::Depth32Float;
Pipeline Specialization, Shader Assets, and Shader Preprocessing (#3031)

## New Features

This adds the following to the new renderer:

* **Shader Assets**
  * Shaders are assets again! Users no longer need to call `include_str!` for their shaders
  * Shader hot-reloading
* **Shader Defs / Shader Preprocessing**
  * Shaders now support `# ifdef NAME`, `# ifndef NAME`, and `# endif` preprocessor directives
* **Bevy RenderPipelineDescriptor and RenderPipelineCache**
  * Bevy now provides its own `RenderPipelineDescriptor`, and the wgpu version is now exported as `RawRenderPipelineDescriptor`. This allows users to define pipelines with `Handle<Shader>` instead of needing to manually compile and reference `ShaderModules`, enables passing in shader defs to configure the shader preprocessor, makes hot reloading possible (because the descriptor can be owned and used to create new pipelines when a shader changes), and opens the doors to pipeline specialization.
  * The `RenderPipelineCache` now handles compiling and re-compiling Bevy RenderPipelineDescriptors. It has internal PipelineLayout and ShaderModule caches. Users receive a `CachedPipelineId`, which can be used to look up the actual `&RenderPipeline` during rendering.
* **Pipeline Specialization**
  * This enables defining per-entity-configurable pipelines that specialize on arbitrary custom keys. In practice this will involve specializing based on things like MSAA values, Shader Defs, Bind Group existence, and Vertex Layouts.
  * Adds a `SpecializedPipeline` trait and `SpecializedPipelines<MyPipeline>` resource. This is a simple layer that generates Bevy RenderPipelineDescriptors based on a custom key defined for the pipeline (a sketch follows below).
  * Specialized pipelines are also hot-reloadable.
  * This was the result of experimentation with two different approaches:
    1. **"Generic immediate-mode multi-key hash pipeline specialization"**
       * Breaks up the pipeline into multiple "identities" (the core pipeline definition, shader defs, mesh layout, bind group layout). Each of these identities has its own key. Looking up / compiling a specific version of a pipeline requires composing all of these keys together.
       * The benefit of this approach is that it works for all pipelines / the pipeline is fully identified by the keys. The multiple keys allow pre-hashing parts of the pipeline identity where possible (ex: precompute the mesh identity for all meshes).
       * The downside is that any per-entity data that informs the values of these keys could require expensive re-hashes. Computing each key for each sprite tanked bevymark performance (sprites don't actually need this level of specialization yet... but things like pbr and future sprite scenarios might).
       * This is the approach rafx used last time I checked.
    2. **"Custom key specialization"**
       * Pipelines by default are not specialized.
       * Pipelines that need specialization implement a SpecializedPipeline trait with a custom key associated type.
       * This allows specialization keys to encode exactly the amount of information required (instead of needing to be a combined hash of the entire pipeline). Generally this should fit in a small number of bytes. Per-entity specialization barely registers anymore on things like bevymark. It also makes things like "shader defs" way cheaper to hash, because we can use context-specific bitflags instead of strings.
       * Despite the extra trait, it actually generally makes pipeline definitions + lookups simpler: managing multiple keys (and making the appropriate calls to manage these keys) was way more complicated.
* I opted for custom key specialization. It performs better generally and in my opinion is better UX. Fortunately, the way this is implemented also allows for custom caches, as this all builds on a common abstraction: the RenderPipelineCache. The built-in custom key trait is just a simple, pre-defined way to interact with the cache.

## Callouts

* The SpecializedPipeline trait makes it easy to inherit pipeline configuration in custom pipelines. The changes to `custom_shader_pipelined` and the new `shader_defs_pipelined` example illustrate how much simpler it is to define custom pipelines based on the PbrPipeline.
* The shader preprocessor is currently pretty naive (it just uses regexes to process each line). Ultimately we might want to build a more custom parser for more performance + better error handling, but for now I'm happy to optimize for "easy to implement and understand".

## Next Steps

* Port compute pipelines to the new system
* Add more preprocessor directives (else, elif, import)
* More flexible vertex attribute specialization / enable cheaply specializing on specific mesh vertex layouts
pub struct ShadowPipeline {
pub view_layout: BindGroupLayout,
pub mesh_layout: BindGroupLayout,
pub skinned_mesh_layout: BindGroupLayout,
pub point_light_sampler: Sampler,
pub directional_light_sampler: Sampler,
}
// TODO: this pattern for initializing the shaders / pipeline isn't ideal. this should be handled by the asset system
impl FromWorld for ShadowPipeline {
fn from_world(world: &mut World) -> Self {
let world = world.cell();
let render_device = world.resource::<RenderDevice>();
let view_layout = render_device.create_bind_group_layout(&BindGroupLayoutDescriptor {
entries: &[
// View
BindGroupLayoutEntry {
binding: 0,
visibility: ShaderStages::VERTEX | ShaderStages::FRAGMENT,
ty: BindingType::Buffer {
ty: BufferBindingType::Uniform,
has_dynamic_offset: true,
min_binding_size: BufferSize::new(ViewUniform::std140_size_static() as u64),
},
count: None,
},
],
label: Some("shadow_view_layout"),
});
let mesh_pipeline = world.resource::<MeshPipeline>();
let skinned_mesh_layout = mesh_pipeline.skinned_mesh_layout.clone();
ShadowPipeline {
view_layout,
Shader Imports. Decouple Mesh logic from PBR (#3137)

## Shader Imports

This adds "whole file" shader imports. These come in two flavors:

### Asset Path Imports

```wgsl
// /assets/shaders/custom.wgsl
#import "shaders/custom_material.wgsl"

[[stage(fragment)]]
fn fragment() -> [[location(0)]] vec4<f32> {
    return get_color();
}
```

```wgsl
// /assets/shaders/custom_material.wgsl
[[block]]
struct CustomMaterial {
    color: vec4<f32>;
};
[[group(1), binding(0)]]
var<uniform> material: CustomMaterial;
```

### Custom Path Imports

Enables defining custom import paths. These are intended to be used by crates to export shader functionality:

```wgsl
// bevy_pbr2/src/render/pbr.wgsl
#import bevy_pbr::mesh_view_bind_group
#import bevy_pbr::mesh_bind_group

[[block]]
struct StandardMaterial {
    base_color: vec4<f32>;
    emissive: vec4<f32>;
    perceptual_roughness: f32;
    metallic: f32;
    reflectance: f32;
    flags: u32;
};

/* rest of PBR fragment shader here */
```

```rust
impl Plugin for MeshRenderPlugin {
    fn build(&self, app: &mut bevy_app::App) {
        let mut shaders = app.world.get_resource_mut::<Assets<Shader>>().unwrap();
        shaders.set_untracked(
            MESH_BIND_GROUP_HANDLE,
            Shader::from_wgsl(include_str!("mesh_bind_group.wgsl"))
                .with_import_path("bevy_pbr::mesh_bind_group"),
        );
        shaders.set_untracked(
            MESH_VIEW_BIND_GROUP_HANDLE,
            Shader::from_wgsl(include_str!("mesh_view_bind_group.wgsl"))
                .with_import_path("bevy_pbr::mesh_view_bind_group"),
        );
    }
}
```

By convention these should use rust-style module paths that start with the crate name. Ultimately we might enforce this convention.

Note that this feature implements _run time_ import resolution. Ultimately we should move the import logic into an asset preprocessor once Bevy gets support for that.

## Decouple Mesh Logic from PBR Logic via MeshRenderPlugin

This breaks out mesh rendering code from PBR material code, which improves the legibility of the code, decouples mesh logic from PBR logic, and opens the door for a future `MaterialPlugin<T: Material>` that handles all of the pipeline setup for arbitrary shader materials.

## Removed `RenderAsset<Shader>` in favor of extracting shaders into RenderPipelineCache

This simplifies the shader import implementation and removes the need to pass around `RenderAssets<Shader>`.

## RenderCommands are now fallible

This allows us to cleanly handle pipelines and shaders not being ready yet. We can abort a render command early in these cases, preventing Bevy from trying to set bind groups / make draw calls for pipelines that couldn't be bound. This could also be used in the future for things like "components not existing on entities yet".

## Next Steps

* Investigate using Naga for "partial typed imports" (ex: `#import bevy_pbr::material::StandardMaterial`, which would import only the `StandardMaterial` struct)
* Implement `MaterialPlugin<T: Material>` for low-boilerplate custom material shaders
* Move shader import logic into the asset preprocessor once Bevy gets support for that

Fixes #3132
mesh_layout: mesh_pipeline.mesh_layout.clone(),
skinned_mesh_layout,
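            // Comparison samplers for PCF shadow sampling; `GreaterEqual`
            // matches the reverse-z `depth_compare` used by the shadow pass.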
point_light_sampler: render_device.create_sampler(&SamplerDescriptor {
address_mode_u: AddressMode::ClampToEdge,
address_mode_v: AddressMode::ClampToEdge,
address_mode_w: AddressMode::ClampToEdge,
mag_filter: FilterMode::Linear,
min_filter: FilterMode::Linear,
mipmap_filter: FilterMode::Nearest,
compare: Some(CompareFunction::GreaterEqual),
..Default::default()
}),
directional_light_sampler: render_device.create_sampler(&SamplerDescriptor {
address_mode_u: AddressMode::ClampToEdge,
address_mode_v: AddressMode::ClampToEdge,
address_mode_w: AddressMode::ClampToEdge,
mag_filter: FilterMode::Linear,
min_filter: FilterMode::Linear,
mipmap_filter: FilterMode::Nearest,
compare: Some(CompareFunction::GreaterEqual),
..Default::default()
}),
}
}
}
bitflags::bitflags! {
#[repr(transparent)]
pub struct ShadowPipelineKey: u32 {
const NONE = 0;
        // Reserve the top three bits of the key for the mesh's primitive topology.
        const PRIMITIVE_TOPOLOGY_RESERVED_BITS = ShadowPipelineKey::PRIMITIVE_TOPOLOGY_MASK_BITS << ShadowPipelineKey::PRIMITIVE_TOPOLOGY_SHIFT_BITS;
}
}
impl ShadowPipelineKey {
const PRIMITIVE_TOPOLOGY_MASK_BITS: u32 = 0b111;
const PRIMITIVE_TOPOLOGY_SHIFT_BITS: u32 = 32 - 3;
pub fn from_primitive_topology(primitive_topology: PrimitiveTopology) -> Self {
let primitive_topology_bits = ((primitive_topology as u32)
& Self::PRIMITIVE_TOPOLOGY_MASK_BITS)
<< Self::PRIMITIVE_TOPOLOGY_SHIFT_BITS;
Self::from_bits(primitive_topology_bits).unwrap()
}
pub fn primitive_topology(&self) -> PrimitiveTopology {
let primitive_topology_bits =
(self.bits >> Self::PRIMITIVE_TOPOLOGY_SHIFT_BITS) & Self::PRIMITIVE_TOPOLOGY_MASK_BITS;
match primitive_topology_bits {
x if x == PrimitiveTopology::PointList as u32 => PrimitiveTopology::PointList,
x if x == PrimitiveTopology::LineList as u32 => PrimitiveTopology::LineList,
x if x == PrimitiveTopology::LineStrip as u32 => PrimitiveTopology::LineStrip,
x if x == PrimitiveTopology::TriangleList as u32 => PrimitiveTopology::TriangleList,
x if x == PrimitiveTopology::TriangleStrip as u32 => PrimitiveTopology::TriangleStrip,
_ => PrimitiveTopology::default(),
}
}
}
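Because the topology lives in the top three bits of the key, `from_primitive_topology` and `primitive_topology` above should round-trip for every variant. A hypothetical sanity check (assuming wgpu's five `PrimitiveTopology` variants, whose discriminants fit in the 3-bit mask):

```rust
#[cfg(test)]
mod shadow_pipeline_key_tests {
    use super::*;

    #[test]
    fn primitive_topology_round_trips() {
        for topology in [
            PrimitiveTopology::PointList,
            PrimitiveTopology::LineList,
            PrimitiveTopology::LineStrip,
            PrimitiveTopology::TriangleList,
            PrimitiveTopology::TriangleStrip,
        ] {
            let key = ShadowPipelineKey::from_primitive_topology(topology);
            assert_eq!(key.primitive_topology(), topology);
        }
    }
}
```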
Mesh vertex buffer layouts (#3959)

This PR makes a number of changes to how meshes and vertex attributes are handled, with the goal of enabling easy and flexible custom vertex attributes:

* Reworks the `Mesh` type to use the newly added `VertexAttribute` internally
    * `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat`
    * `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting"
* Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayout`s can then be used to generate actual `VertexBufferLayout`s according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently.
* Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key).
* Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally). Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because it's wrong! Hashes by themselves aren't enough, because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's `HashMap` ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count, because it was already in our tree. I will probably break these changes out into their own PR.
* Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different `MeshVertexBufferLayout`s. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool!

To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now:

```rust
impl SpecializedMeshPipeline for MeshPipeline {
    type Key = MeshPipelineKey;

    fn specialize(
        &self,
        key: Self::Key,
        layout: &MeshVertexBufferLayout,
    ) -> RenderPipelineDescriptor {
        let mut vertex_attributes = vec![
            Mesh::ATTRIBUTE_POSITION.at_shader_location(0),
            Mesh::ATTRIBUTE_NORMAL.at_shader_location(1),
            Mesh::ATTRIBUTE_UV_0.at_shader_location(2),
        ];
        let mut shader_defs = Vec::new();
        if layout.contains(Mesh::ATTRIBUTE_TANGENT) {
            shader_defs.push(String::from("VERTEX_TANGENTS"));
            vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3));
        }
        let vertex_buffer_layout = layout
            .get_layout(&vertex_attributes)
            .expect("Mesh is missing a vertex attribute");
```

Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex positions, normals, and uvs. We even got to remove `HAS_TANGENTS` from `MeshPipelineKey` and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`.

This is still a draft because I still need to:

* Add more docs
* Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it.
* Consider breaking out the PreHash / hashbrown changes into a separate PR.
* Add an example illustrating this change
* Verify that the "mesh-specialized pipeline de-duplication code" works properly

Please don't yell at me for not doing these things yet :) Just trying to get this in people's hands asap.

Alternative to #3120
Fixes #3030

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
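To connect this to the code below: queueing a shadow caster might then look roughly like the following hedged sketch. The system shape, the `casters` query, and the `GpuMesh` field names follow this PR's description rather than this file's actual queueing code, so treat every name here as illustrative.

```rust
// Illustrative only; assumes the `SpecializedMeshPipelines` companion
// resource described above and a `GpuMesh` exposing its primitive topology
// and `MeshVertexBufferLayout`.
fn queue_shadows_sketch(
    shadow_pipeline: Res<ShadowPipeline>,
    mut pipelines: ResMut<SpecializedMeshPipelines<ShadowPipeline>>,
    mut pipeline_cache: ResMut<RenderPipelineCache>,
    render_meshes: Res<RenderAssets<Mesh>>,
    casters: Query<&Handle<Mesh>>,
) {
    for mesh_handle in casters.iter() {
        if let Some(mesh) = render_meshes.get(mesh_handle) {
            let key = ShadowPipelineKey::from_primitive_topology(mesh.primitive_topology);
            // The mesh's vertex buffer layout is folded into the cache key, so
            // meshes producing identical `VertexBufferLayout`s share a pipeline.
            if let Ok(_pipeline_id) =
                pipelines.specialize(&mut pipeline_cache, &shadow_pipeline, key, &mesh.layout)
            {
                // Store the CachedPipelineId on the phase item for drawing.
            }
        }
    }
}
```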
impl SpecializedMeshPipeline for ShadowPipeline {
type Key = ShadowPipelineKey;
fn specialize(
&self,
key: Self::Key,
layout: &MeshVertexBufferLayout,
) -> Result<RenderPipelineDescriptor, SpecializedMeshPipelineError> {
let mut vertex_attributes = vec![Mesh::ATTRIBUTE_POSITION.at_shader_location(0)];
let mut bind_group_layout = vec![self.view_layout.clone()];
let mut shader_defs = Vec::new();
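        // Skinned meshes (those with joint indices and weights) get the
        // skinned bind group layout and a SKINNED shader def; all other
        // meshes use the plain mesh layout.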
if layout.contains(Mesh::ATTRIBUTE_JOINT_INDEX)
&& layout.contains(Mesh::ATTRIBUTE_JOINT_WEIGHT)
{
shader_defs.push(String::from("SKINNED"));
vertex_attributes.push(Mesh::ATTRIBUTE_JOINT_INDEX.at_shader_location(4));
vertex_attributes.push(Mesh::ATTRIBUTE_JOINT_WEIGHT.at_shader_location(5));
bind_group_layout.push(self.skinned_mesh_layout.clone());
} else {
bind_group_layout.push(self.mesh_layout.clone());
}
let vertex_buffer_layout = layout.get_layout(&vertex_attributes)?;
Ok(RenderPipelineDescriptor {
vertex: VertexState {
shader: SHADOW_SHADER_HANDLE.typed::<Shader>(),
entry_point: "vertex".into(),
shader_defs,
buffers: vec![vertex_buffer_layout],
},
fragment: None,
layout: Some(bind_group_layout),
primitive: PrimitiveState {
topology: key.primitive_topology(),
strip_index_format: None,
front_face: FrontFace::Ccw,
cull_mode: None,
unclipped_depth: false,
polygon_mode: PolygonMode::Fill,
conservative: false,
},
depth_stencil: Some(DepthStencilState {
format: SHADOW_FORMAT,
depth_write_enabled: true,
Use the infinite reverse right-handed perspective projection (#2543)

# Objective

Forward perspective projections have poor floating point precision distribution over the depth range. Reverse projections fare much better, and with a reverse projection an infinite far plane is not a problem, so no far plane is needed at all. The infinite reverse perspective projection has become the industry standard. The renderer rework is a great time to migrate to it.

## Solution

All perspective projections, including point lights, have been moved to using `glam::Mat4::perspective_infinite_reverse_rh()` and so have no far plane. As various depth textures are shared between orthographic and perspective projections, a quirk of this PR is that the near and far planes of the orthographic projection are swapped when the Mat4 is computed. This has no impact on 2D/3D orthographic projection usage, and provides consistency in shaders, texture clear values, etc. throughout the codebase.

## Known issues

For some reason, when looking along -Z, all geometry is black. The camera can be translated up/down / strafed left/right and geometry will still be black. Moving forward/backward or rotating the camera away from looking exactly along -Z causes everything to work as expected. I have tried to debug this issue, but on both macOS and Windows I get crashes when doing pixel debugging. If anyone could reproduce this and debug it I would be very grateful. Otherwise I will have to try to debug it further without pixel debugging, though the projections and such all looked fine to me.
            // Reverse-z: nearer fragments have larger depth values, so the
            // comparison is GreaterEqual rather than a forward-z Less.
            depth_compare: CompareFunction::GreaterEqual,
stencil: StencilState {
front: StencilFaceState::IGNORE,
back: StencilFaceState::IGNORE,
read_mask: 0,
write_mask: 0,
},
bias: DepthBiasState {
Scale normal bias by texel size (#26)

* 3d_scene_pipelined: Use a shallower directional light angle to provoke acne
* cornell_box_pipelined: Remove bias tweaks
* bevy_pbr2: Simplify shadow biases by moving them to linear depth
* bevy_pbr2: Do not use DepthBiasState
* bevy_pbr2: Do not use bilinear filtering for sampling depth textures
* pbr.wgsl: Remove unnecessary comment
* bevy_pbr2: Do manual shadow map depth comparisons for more flexibility
* examples: Add shadow_biases_pipelined example. This is useful for stress testing biases.
* bevy_pbr2: Scale the point light normal bias by the shadow map texel size. This allows the normal bias to be small close to the light source, where the shadow map texel to screen texel ratio is high, but appropriately large further away from the light source, where a shadow map texel can easily cover multiple screen texels.
* shadow_biases_pipelined: Add support for toggling directional / point light
* shadow_biases_pipelined: Cleanup
* bevy_pbr2: Scale the directional light normal bias by the shadow map texel size
* shadow_biases_pipelined: Fit the orthographic projection around the scene
* bevy_pbr2: Directional lights should have no shadows outside their projection. Before this change, sampling a fragment position from outside the ndc volume would result in the returned sample being clamped to the edge in x/y, or possibly always casting a shadow for fragment positions past the orthographic projection's far plane.
* bevy_pbr2: Fix the default directional light normal bias
* Revert "bevy_pbr2: Do manual shadow map depth comparisons for more flexibility" (this reverts commit 7df1bab38a42d8a33bc50ca583d4be37bd9c9f0d)
* shadow_biases_pipelined: Adjust directional light normal bias in 0.1 increments
* pbr.wgsl: Add a couple of clarifying comments
* Revert "bevy_pbr2: Do not use bilinear filtering for sampling depth textures" (this reverts commit f53baab0232ce218866a45cad6902b470f4cf2c4)
* shadow_biases_pipelined: Print usage to terminal
                    // No fixed-function depth bias: shadow biases are applied
                    // in the shader, scaled by shadow map texel size (see the
                    // commit notes above).
                    constant: 0,
                    slope_scale: 0.0,
clamp: 0.0,
},
}),
multisample: MultisampleState::default(),
            label: Some("shadow_pipeline".into()),
        })
    }
}
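
/// Render-world copy of a view's cluster configuration, extracted from the
/// camera's `Clusters` component by `extract_clusters` below.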
#[derive(Component)]
pub struct ExtractedClusterConfig {
    /// Special near value for cluster calculations
    near: f32,
    far: f32,
    /// Number of clusters in x / y / z in the view frustum
    dimensions: UVec3,
}
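
/// The per-cluster lists of point lights visible to a view, copied out of the
/// main world's `Clusters` during extraction.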
#[derive(Component)]
pub struct ExtractedClustersPointLights {
    data: Vec<VisiblePointLights>,
}
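
/// Extract system: copies each camera's `Clusters` into the render world as
/// `ExtractedClustersPointLights` and `ExtractedClusterConfig` components on
/// the matching render-world view entity.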
pub fn extract_clusters(mut commands: Commands, views: Query<(Entity, &Clusters), With<Camera>>) {
    for (entity, clusters) in views.iter() {
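        // `get_or_spawn` reuses the main-world entity id in the render world,
        // spawning the entity there first if needed, so the extracted
        // components line up with their source camera.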
        commands.get_or_spawn(entity).insert_bundle((
            ExtractedClustersPointLights {
                data: clusters.lights.clone(),
            },
            ExtractedClusterConfig {
                near: clusters.near,
                far: clusters.far,
                dimensions: clusters.dimensions,
            },
        ));
    }
}
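
/// Extract system: copies the ambient light and shadow-map settings into the
/// render world as resources, and extracts every visible point and
/// directional light along with the transforms and shadow settings needed to
/// render shadow maps.
///
/// As a sketch (assuming the usual `PbrPlugin` setup), both extract systems
/// are registered in the render app's `Extract` stage:
///
/// ```ignore
/// render_app
///     .add_system_to_stage(RenderStage::Extract, render::extract_clusters)
///     .add_system_to_stage(RenderStage::Extract, render::extract_lights);
/// ```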
#[allow(clippy::too_many_arguments)]
pub fn extract_lights(
    mut commands: Commands,
    ambient_light: Res<AmbientLight>,
    point_light_shadow_map: Res<PointLightShadowMap>,
    directional_light_shadow_map: Res<DirectionalLightShadowMap>,
    global_point_lights: Res<GlobalVisiblePointLights>,
    mut point_lights: Query<(&PointLight, &mut CubemapVisibleEntities, &GlobalTransform)>,
    mut directional_lights: Query<(
        Entity,
        &DirectionalLight,
        &mut VisibleEntities,
        &GlobalTransform,
        &Visibility,
    )>,
mut previous_point_lights_len: Local<usize>,
) {
commands.insert_resource(ExtractedAmbientLight {
color: ambient_light.color,
brightness: ambient_light.brightness,
});
commands.insert_resource::<ExtractedPointLightShadowMap>(point_light_shadow_map.clone());
commands.insert_resource::<ExtractedDirectionalLightShadowMap>(
directional_light_shadow_map.clone(),
);
// This is the point light shadow map texel size for one face of the cube at a distance of 1.0
// world unit from the light.
// point_light_texel_size = 2.0 * 1.0 * tan(PI / 4.0) / cube face width in texels
// PI / 4.0 is half the cube face fov, tan(PI / 4.0) = 1.0, so this simplifies to:
// point_light_texel_size = 2.0 / cube face width in texels
// NOTE: When using various PCF kernel sizes, this will need to be adjusted, according to:
// https://catlikecoding.com/unity/tutorials/custom-srp/point-and-spot-shadows/
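// e.g. (illustrative, assuming a 1024x1024 cube face rather than any particular configured
// size): 2.0 / 1024.0 ≈ 0.00195 world units per texel at a distance of 1.0 from the light.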
let point_light_texel_size = 2.0 / point_light_shadow_map.size as f32;
let mut point_lights_values = Vec::with_capacity(*previous_point_lights_len);
for entity in global_point_lights.iter().copied() {
if let Ok((point_light, cubemap_visible_entities, transform)) = point_lights.get_mut(entity)
{
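// Move the cubemap visibility lists into the render world instead of cloning them;
// the app-world component is left empty and is repopulated by the visibility systems
// before the next extraction.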
let render_cubemap_visible_entities =
std::mem::take(cubemap_visible_entities.into_inner());
point_lights_values.push((
entity,
(
ExtractedPointLight {
color: point_light.color,
// NOTE: Map from luminous power in lumens to luminous intensity in lumens per steradian
// for a point light. See https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower
// for details.
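// e.g. an 800 lm source: 800.0 / (4.0 * PI) ≈ 63.66 lm/sr (illustrative value, not a default).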
intensity: point_light.intensity / (4.0 * std::f32::consts::PI),
range: point_light.range,
radius: point_light.radius,
transform: *transform,
shadows_enabled: point_light.shadows_enabled,
shadow_depth_bias: point_light.shadow_depth_bias,
// The factor of SQRT_2 is for the worst-case diagonal offset
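// (a shadow map texel's diagonal is SQRT_2 times its edge length)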
shadow_normal_bias: point_light.shadow_normal_bias
* point_light_texel_size
* std::f32::consts::SQRT_2,
},
render_cubemap_visible_entities,
),
));
}
}
*previous_point_lights_len = point_lights_values.len();
commands.insert_or_spawn_batch(point_lights_values);
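// NOTE: Caching the previous frame's light count in the `Local` above lets
// `Vec::with_capacity` size the batch so that, in the common case where the number
// of lights is stable, extraction does not reallocate.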
for (entity, directional_light, visible_entities, transform, visibility) in
directional_lights.iter_mut()
{
if !visibility.is_visible {
continue;
}
// Calculate the directional light shadow map texel size using the largest x,y dimension of
// the orthographic projection divided by the shadow map resolution
// NOTE: When using various PCF kernel sizes, this will need to be adjusted, according to:
// https://catlikecoding.com/unity/tutorials/custom-srp/directional-shadows/
let largest_dimension = (directional_light.shadow_projection.right
- directional_light.shadow_projection.left)
.max(
directional_light.shadow_projection.top
- directional_light.shadow_projection.bottom,
);
let directional_light_texel_size =
largest_dimension / directional_light_shadow_map.size as f32;
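// e.g. (illustrative, assuming a 40x40 world unit shadow projection and a 4096 texel
// shadow map): 40.0 / 4096.0 ≈ 0.0098 world units per texel.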
let render_visible_entities = std::mem::take(visible_entities.into_inner());
commands.get_or_spawn(entity).insert_bundle((
ExtractedDirectionalLight {
color: directional_light.color,
illuminance: directional_light.illuminance,
direction: transform.forward(),
projection: directional_light.shadow_projection.get_projection_matrix(),
shadows_enabled: directional_light.shadows_enabled,
shadow_depth_bias: directional_light.shadow_depth_bias,
// The factor of SQRT_2 is for the worst-case diagonal offset
shadow_normal_bias: directional_light.shadow_normal_bias
* directional_light_texel_size
* std::f32::consts::SQRT_2,
near: directional_light.shadow_projection.near,
far: directional_light.shadow_projection.far,
},
render_visible_entities,
));
}
}
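As a rough companion to the frustum-culling notes above, here is a minimal sketch of the naïve early-out plane loop using Lengyel's 'relative radius', assuming glam's `Vec3A`/`Vec4`/`Mat4`. The `Aabb`, `Plane`, and `Frustum` shapes here are illustrative stand-ins, not bevy's exact definitions:

```rust
use glam::{Mat4, Vec3A, Vec4};

// Illustrative stand-ins; the real primitives live in bevy's render crates.
pub struct Aabb {
    pub center: Vec3A,
    pub half_extents: Vec3A,
}

pub struct Plane {
    // (nx, ny, nz, d) with the normal pointing into the frustum, so points
    // inside satisfy normal . p + d > 0.
    pub normal_d: Vec4,
}

pub struct Frustum {
    pub planes: [Plane; 6],
}

impl Frustum {
    /// Naive early-out test: reject as soon as the transformed AABB is fully
    /// outside any single plane.
    pub fn intersects_obb(&self, aabb: &Aabb, model: &Mat4) -> bool {
        let aabb_center_world = *model * aabb.center.extend(1.0);
        // The model matrix columns carry the AABB's local axes into world space.
        let axes = [
            Vec3A::from(model.x_axis),
            Vec3A::from(model.y_axis),
            Vec3A::from(model.z_axis),
        ];
        for plane in &self.planes {
            let p_normal = Vec3A::from(plane.normal_d);
            // Lengyel's 'relative radius': the projected extent of the box
            // along the plane normal, reducing the test to a sphere-plane test.
            let relative_radius = Vec3A::new(
                p_normal.dot(axes[0]),
                p_normal.dot(axes[1]),
                p_normal.dot(axes[2]),
            )
            .abs()
            .dot(aabb.half_extents);
            if plane.normal_d.dot(aabb_center_world) + relative_radius <= 0.0 {
                return false;
            }
        }
        true
    }
}
```

Using `Vec3A` throughout keeps the dot products on the SIMD path, which is the Vec3-to-Vec3A promotion win described above.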
Clustered forward rendering (#3153)

# Objective

Implement clustered-forward rendering.

## Solution

~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.

- The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, and point lights are tested against each of the clusters to see if they would affect that volume within the scene; if so, they are added to a list of lights affecting that cluster. Then, when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that cluster are used in lighting calculations. This brings huge performance and scalability benefits, as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
- WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer, and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck, though I don't know if there are performance implications to reading from textures instead of uniform buffers. (A sketch of the index packing appears below.)
- Because of the uniform buffer binding size limitations, we can support a maximum of 256 lights with the current size of the PointLight struct.
- The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It's an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
- Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests; for this number of clusters and lights, the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
- Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material (I will add a link) which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically, for now you need to sacrifice point light shadows for all but a handful of point lights if you don't want to kill performance. I set the limit to 10, but that's just what we had from before, where 10 was the maximum number of point lights before this PR.
- I added a couple of debug visualisations behind a shader def that were useful for seeing the performance impact of light distribution; I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (it maxes out at 16, but I'm not sure that's a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.

Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on:

- Calculate the light's effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as, in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
- Improve the z-slicing to use a larger first slice.
- More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant, like closest to the camera, brightest, affecting the most pixels, ...).
- Switch to using a texture instead of a uniform buffer.
- Figure out the / a better story for shadows.

I will also probably add an example that demonstrates some of the issues:

- What situations exhaust the space available in the uniform buffers:
  - Light range too large, making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters.
  - Light range set too small, producing visible artifacts where clusters the light would physically affect are not affected by the light.
- Perhaps some performance issues:
  - How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
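Since WebGL2 rules out storage buffers, the cluster light lists described above are packed into a uniform buffer with shifts and masks. A hedged sketch of the idea, packing four 8-bit light indices per `u32` (the real layout packs cluster offsets and counts with its own bit widths):

```rust
// Pack four 8-bit light indices into one u32 slot of a uniform buffer array.
fn pack_light_indices(indices: [u8; 4]) -> u32 {
    (indices[0] as u32) << 24
        | (indices[1] as u32) << 16
        | (indices[2] as u32) << 8
        | (indices[3] as u32)
}

// Recover index `slot` (0..4) from a packed word; this mirrors the
// shader-side shift-and-mask the commit message describes.
fn unpack_light_index(packed: u32, slot: u32) -> u8 {
    ((packed >> (24 - 8 * slot)) & 0xff) as u8
}
```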
// Near plane distance used for the point light shadow map projections.
pub(crate) const POINT_LIGHT_NEAR_Z: f32 = 0.1f32;
// Can't write e.g. `Vec3::X * -1.0` because `Mul` isn't const
const NEGATIVE_X: Vec3 = const_vec3!([-1.0, 0.0, 0.0]);
const NEGATIVE_Y: Vec3 = const_vec3!([0.0, -1.0, 0.0]);
const NEGATIVE_Z: Vec3 = const_vec3!([0.0, 0.0, -1.0]);
pub(crate) struct CubeMapFace {
pub(crate) target: Vec3,
pub(crate) up: Vec3,
}
// see https://www.khronos.org/opengl/wiki/Cubemap_Texture
pub(crate) const CUBE_MAP_FACES: [CubeMapFace; 6] = [
// 0 GL_TEXTURE_CUBE_MAP_POSITIVE_X
CubeMapFace {
target: NEGATIVE_X,
up: NEGATIVE_Y,
},
// 1 GL_TEXTURE_CUBE_MAP_NEGATIVE_X
CubeMapFace {
target: Vec3::X,
up: NEGATIVE_Y,
},
// 2 GL_TEXTURE_CUBE_MAP_POSITIVE_Y
CubeMapFace {
target: NEGATIVE_Y,
up: Vec3::Z,
},
// 3 GL_TEXTURE_CUBE_MAP_NEGATIVE_Y
CubeMapFace {
target: Vec3::Y,
up: NEGATIVE_Z,
},
// 4 GL_TEXTURE_CUBE_MAP_POSITIVE_Z
CubeMapFace {
target: NEGATIVE_Z,
up: NEGATIVE_Y,
},
// 5 GL_TEXTURE_CUBE_MAP_NEGATIVE_Z
CubeMapFace {
target: Vec3::Z,
up: NEGATIVE_Y,
},
];
fn face_index_to_name(face_index: usize) -> &'static str {
match face_index {
0 => "+x",
1 => "-x",
2 => "+y",
3 => "-y",
4 => "+z",
5 => "-z",
_ => "invalid",
}
}
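Each `CubeMapFace` pairs a view direction with an up vector. A shadow-pass view matrix for one face can be derived along these lines (a sketch assuming glam; the actual `prepare_lights` builds the face rotation from a transform rather than calling `Mat4::look_at_rh` directly):

```rust
use glam::{Mat4, Vec3};

// Illustrative: build the view matrix for one cube map face of a point light
// shadow pass from the face's target direction and up vector.
fn face_view_matrix(light_position: Vec3, target: Vec3, up: Vec3) -> Mat4 {
    // Look from the light along the face direction with the face's up vector.
    Mat4::look_at_rh(light_position, light_position + target, up)
}
```

Paired with a square 90° perspective projection whose near plane is `POINT_LIGHT_NEAR_Z`, this yields one of the six shadow views for a point light.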
// A view from which a shadow map is rendered: one per directional light, and
// one per point light cube map face.
#[derive(Component)]
pub struct ShadowView {
pub depth_texture_view: TextureView,
pub pass_name: String,
}
// Shadow map textures and texture views for the point and directional lights
// rendered for a camera view.
#[derive(Component)]
pub struct ViewShadowBindings {
pub point_light_depth_texture: Texture,
pub point_light_depth_texture_view: TextureView,
pub directional_light_depth_texture: Texture,
pub directional_light_depth_texture_view: TextureView,
}
// The light view entities (shadow views) associated with a camera view.
#[derive(Component)]
pub struct ViewLightEntities {
pub lights: Vec<Entity>,
}
// Dynamic offset of this view's GpuLights within LightMeta::view_gpu_lights.
#[derive(Component)]
pub struct ViewLightsUniformOffset {
pub offset: u32,
}
Use storage buffers for clustered forward point lights (#3989)

# Objective

- Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene
- Fixes #3605
- Based on top of #4079

This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters, which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR.

## Solution

- Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability
- Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them to make usage the same for both cases, so the only difference is at initialisation time. (A sketch of this pattern appears below.)
- Appropriate shader defs in the shader code to handle the two cases

## Context on some decisions / open questions

- I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management' but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately.
- Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants?

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
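A minimal sketch of the enum-over-buffer-kind pattern described above, with hypothetical stand-in buffer types (the real `GpuPointLights` wraps bevy_render's `UniformVec`/`StorageBuffer` and exposes more of their API):

```rust
use wgpu::BufferBindingType;

// Hypothetical stand-ins for bevy_render's UniformVec / StorageBuffer.
struct UniformVec<T>(Vec<T>);
struct StorageBuffer<T>(Vec<T>);

enum GpuLights<T> {
    Uniform(UniformVec<T>),
    Storage(StorageBuffer<T>),
}

impl<T> GpuLights<T> {
    // The only point where the two variants differ is construction; every
    // other method can be implemented uniformly over both.
    fn new(buffer_binding_type: BufferBindingType) -> Self {
        match buffer_binding_type {
            BufferBindingType::Storage { .. } => Self::Storage(StorageBuffer(Vec::new())),
            _ => Self::Uniform(UniformVec(Vec::new())),
        }
    }

    fn push(&mut self, value: T) {
        match self {
            Self::Uniform(UniformVec(values)) => values.push(value),
            Self::Storage(StorageBuffer(values)) => values.push(value),
        }
    }
}
```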
// NOTE: Clustered-forward rendering requires 3 storage buffer bindings, so use this
// constant with SupportedBindingType::from_device() to check that at least that many are supported.
pub const CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT: u32 = 3;
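The constant above feeds a capability check along these lines; a sketch against wgpu's `Limits` (the helper name here is illustrative, the real check lives on `RenderDevice`):

```rust
use wgpu::{BufferBindingType, Limits};

// Prefer read-only storage buffers when the device allows at least `required`
// storage buffer bindings per shader stage; otherwise fall back to uniforms.
fn supported_read_only_binding_type(limits: &Limits, required: u32) -> BufferBindingType {
    if limits.max_storage_buffers_per_shader_stage >= required {
        BufferBindingType::Storage { read_only: true }
    } else {
        BufferBindingType::Uniform
    }
}
```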
// Global GPU data for all point lights, plus a map from light entity to its
// index in the buffer.
pub struct GlobalLightMeta {
pub gpu_point_lights: GpuPointLights,
pub entity_to_index: HashMap<Entity, usize>,
}
impl FromWorld for GlobalLightMeta {
fn from_world(world: &mut World) -> Self {
Self::new(
world
.resource::<RenderDevice>()
.get_supported_read_only_binding_type(CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT),
)
}
}
impl GlobalLightMeta {
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
Self {
gpu_point_lights: GpuPointLights::new(buffer_binding_type),
entity_to_index: HashMap::default(),
}
}
}
// Per-view GpuLights uniforms and the bind group used by the shadow passes.
#[derive(Default)]
pub struct LightMeta {
pub view_gpu_lights: DynamicUniformVec<GpuLights>,
pub shadow_view_bind_group: Option<BindGroup>,
}
// Identifies which light a shadow view entity renders: a directional light,
// or one cube map face of a point light.
#[derive(Component)]
pub enum LightEntity {
Directional {
light_entity: Entity,
},
Point {
light_entity: Entity,
face_index: usize,
},
}
bevy_pbr2: Fix clustering for orthographic projections (#3316)

# Objective

PBR lighting was broken in the new renderer when using orthographic projections due to the way the depth slicing works for the clusters. Fix it.

## Solution

- The default orthographic projection near plane is 0.0. The perspective projection depth slicing does a division by the near plane, which gives a floating point NaN, and the clustering all breaks down.
- Orthographic projections have a linear depth mapping, so it made intuitive sense to me to do depth slicing with a linear mapping too. The alternative I saw was to try to handle the near plane being at 0.0 while keeping the exponential depth slicing, but that felt like a hack that didn't make sense.
- As such, I have added code that detects whether the projection is orthographic based on `projection[3][3] == 1.0` and then implemented the orthographic mapping case throughout (when computing cluster AABBs, and when mapping a view space position (or light) to a cluster id in both the rust and shader code). (A sketch of the detection appears below.)

## Screenshots

Before:

![before](https://user-images.githubusercontent.com/302146/145847278-5b1bca74-fbad-4cc5-8b49-384f6a377fdc.png)

After:

<img width="1392" alt="Screenshot 2021-12-13 at 16 36 53" src="https://user-images.githubusercontent.com/302146/145847314-6f3a2035-5d87-4896-8032-0c3e35e15b7d.png">

Old renderer (slightly lighter due to slight difference in configured intensity):

<img width="1392" alt="Screenshot 2021-12-13 at 16 42 23" src="https://user-images.githubusercontent.com/302146/145847391-6a5e6fe0-22da-4fc1-a6c7-440543689a63.png">
2021-12-14 23:42:35 +00:00
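The orthographic check described above can be read straight off the projection matrix; a minimal sketch, assuming a glam `Mat4` (column-major, so `projection[3][3]` is `w_axis.w`):

```rust
use glam::Mat4;

// An orthographic projection leaves w untouched (the bottom-right element is
// 1.0); a perspective projection writes a z-dependent value there instead.
fn is_orthographic(projection: &Mat4) -> bool {
    projection.w_axis.w == 1.0
}
```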
pub fn calculate_cluster_factors(
near: f32,
far: f32,
z_slices: f32,
is_orthographic: bool,
) -> Vec2 {
    if is_orthographic {
        // Linear depth slicing: returns factors (x, y) such that
        // slice = (view_z - x) * y, with view_z negative in front of the camera.
        Vec2::new(-near, z_slices / (-far - -near))
    } else {
        // Exponential (Doom 2016) depth slicing: returns factors (x, y) such
        // that slice = ln(-view_z) * x - y + 1.0.
        let z_slices_of_ln_zfar_over_znear = (z_slices - 1.0) / (far / near).ln();
Vec2::new(
z_slices_of_ln_zfar_over_znear,
near.ln() * z_slices_of_ln_zfar_over_znear,
)
}
}
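For reference, the factors returned above are consumed like this when mapping a view-space z (negative in front of the camera) to a cluster depth slice; a sketch mirroring the shader-side math, with the clamping here being an assumption:

```rust
use glam::Vec2;

fn view_z_to_z_slice(cluster_factors: Vec2, z_slices: u32, view_z: f32, is_orthographic: bool) -> u32 {
    let z_slice = if is_orthographic {
        // Linear: slice = (view_z + near) * z_slices / (near - far).
        ((view_z - cluster_factors.x) * cluster_factors.y).floor() as u32
    } else {
        // Exponential (Doom 2016): slice = ln(-view_z) * x - y + 1, i.e.
        // 1 + (z_slices - 1) * ln(-view_z / near) / ln(far / near).
        // -view_z keeps the log argument positive; `as u32` saturates at 0.
        ((-view_z).ln() * cluster_factors.x - cluster_factors.y + 1.0) as u32
    };
    // Clamp to the valid slice range.
    z_slice.min(z_slices - 1)
}
```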
#[allow(clippy::too_many_arguments)]
pub fn prepare_lights(
mut commands: Commands,
mut texture_cache: ResMut<TextureCache>,
render_device: Res<RenderDevice>,
render_queue: Res<RenderQueue>,
Clustered forward rendering (#3153)

# Objective

Implement clustered-forward rendering.

## Solution

~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.

* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, and point lights are tested against each cluster to see whether they would affect that volume within the scene; if so, they are added to a list of lights affecting that cluster. Then, when shading a fragment, which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that cluster are used in the lighting calculations. This brings huge performance and scalability benefits, because most of the time lights are placed so that not many of them overlap in terms of their spheres of influence, yet there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support, and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer, and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment, if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck, though I don't know if there are performance implications to reading from textures instead of uniform buffers.
* Because of the uniform buffer binding size limitations, we can support a maximum of 256 lights with the current size of the PointLight struct.
* The z-slicing method (i.e. the mapping from view-space z to a depth slice, which defines the near and far planes of a cluster) uses the Doom 2016 method. I need to add comments with references to this. It's an exponential function that simplifies well for the purposes of optimising the fragment shader; see the sketch of the mapping just below this description. The xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights or finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube-map point light shadow mapping for many lights: it does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material, to which I will add a link, that render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically, for now you need to sacrifice point light shadows for all but a handful of point lights if you don't want to kill performance. I set the limit to 10, but that's just what we had from before: 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing the performance impact of light distribution; I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (it maxes out at 16, though I'm not sure that's a reasonable maximum). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help in understanding deeper performance issues due to screen-space tiles spanning multiple clusters in depth with divergent shader execution times.

There are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place: a todo alongside the code, a GitHub issue, somewhere else?), but I think it works well enough and brings significant enough performance and scalability benefits that it's worth integrating now and iterating on:

* Calculate the light's effective range based on its intensity and physical falloff, and either just use this or take the minimum of it and the user-supplied range. This would avoid unnecessary lighting calculations for clusters that cannot be affected. It would need to take HDR tone mapping into account, as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for the most significant, like closest to the camera, brightest, affecting the most pixels, …).
* Switch to using a texture instead of a uniform buffer.
* Figure out the / a better story for shadows.

I will also probably add an example that demonstrates some of the issues:

* What situations exhaust the space available in the uniform buffers:
  * Light range too large, making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters.
  * Light range set too small, producing visible artifacts where clusters the light would physically affect are not affected by the light.
* Perhaps some performance issues:
  * How many lights can be closely packed, or affect large portions of the view, before performance drops?

2021-12-09 03:08:54 +00:00
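As a rough illustration of the z-slicing bullet above, here is a minimal sketch of a Doom-2016-style exponential slice mapping. The function and parameter names (`view_z_to_slice`, `z_near`, `z_far`, `z_slices`) are hypothetical stand-ins, not identifiers used in this file, and the constants in `main` are made up for the demo:

```rust
// Sketch only: not bevy_pbr's actual cluster code.
fn view_z_to_slice(view_z: f32, z_near: f32, z_far: f32, z_slices: u32) -> u32 {
    // slice = floor(ln(z / near) * slices / ln(far / near))
    // ln(far / near) is constant per view, so once it is precomputed the
    // fragment shader only needs one log plus a multiply.
    let slice = (view_z / z_near).ln() * z_slices as f32 / (z_far / z_near).ln();
    (slice.max(0.0) as u32).min(z_slices - 1)
}

fn main() {
    // Slices are uniform in log space: thin in world units near the camera,
    // wide far away. With near = 0.1, far = 1000.0, and 24 slices:
    for z in [0.1_f32, 1.0, 10.0, 100.0, 1000.0] {
        println!("view z = {z:7.1} -> slice {}", view_z_to_slice(z, 0.1, 1000.0, 24));
    }
}
```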
mut global_light_meta: ResMut<GlobalLightMeta>,
mut light_meta: ResMut<LightMeta>,
views: Query<
(Entity, &ExtractedView, &ExtractedClusterConfig),
With<RenderPhase<Transparent3d>>,
>,
bevy_pbr2: Add support for most of the StandardMaterial textures (#4)

* bevy_pbr2: Add support for most of the StandardMaterial textures. Normal maps are not included here as they require tangents in a vertex attribute.
* bevy_pbr2: Ensure RenderCommandQueue is ready for PbrShaders init
* texture_pipelined: Add a light to the scene so we can see stuff
* WIP bevy_pbr2: back-to-front sorting hack
* bevy_pbr2: Uniform control flow for texture sampling in pbr.frag. From 'fintelia' on the Bevy Render Rework Round 2 discussion: "My understanding is that GPUs these days never use the 'execute both branches and select the result' strategy. Rather, what they do is evaluate the branch condition on all threads of a warp, and jump over it if all of them evaluate to false. If even a single thread needs to execute the if statement body, however, then the remaining threads are paused until that is completed."
* bevy_pbr2: Simplify texture and sampler names; the StandardMaterial_ prefix is no longer needed
* bevy_pbr2: Match the default AmbientColor of the current bevy_pbr for now
* bevy_pbr2: Convert from non-linear to linear sRGB for the color uniform
* bevy_pbr2: Add pbr_pipelined example
* Fix the view vector in pbr.frag to work in ortho
* bevy_pbr2: Use a 90 degree y fov and light range projection for lights
* bevy_pbr2: Add AmbientLight resource
* bevy_pbr2: Convert PointLight color to linear sRGB for use in the fragment shader
* bevy_pbr2: pbr.frag: Rename PointLight.projection to view_projection. The uniform contains the view_projection matrix, so the old name was incorrect.
* bevy_pbr2: PointLight is an OmniLight as it has a radius
* bevy_pbr2: Factor out duplicated code
* bevy_pbr2: Implement RenderAsset for StandardMaterial
* Remove unnecessary texture and sampler clones
* Fix comment formatting
* Remove redundant Buffer::from
* Don't extract meshes when their material textures aren't ready
* Make missing textures in the queue step an error

Co-authored-by: Aevyrie <aevyrie@gmail.com>
Co-authored-by: Carter Anderson <mcanders1@gmail.com>

2021-06-27 23:10:23 +00:00
ambient_light: Res<ExtractedAmbientLight>,
point_light_shadow_map: Res<ExtractedPointLightShadowMap>,
directional_light_shadow_map: Res<ExtractedDirectionalLightShadowMap>,
Frustum culling (#2861)

# Objective

Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with an i9 9980HK 8c/16t CPU and a Radeon Pro 5500M. macOS does weird things with vsync, so even though vsync was off, it really looked like other applications or the desktop window compositor were sometimes interfering, but the difference could be even larger: I even saw up to 90+fps sometimes.

## Solution

- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding-volume functionality we needed to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the intersection tests necessary for the algorithms used. I also added `CubemapFrusta`, which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps and point light shadow maps.
- I did a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early out when a bounding volume is outside any one of the `Frustum` planes, and found them to be very similar, so I chose the simpler and more readable option (see the sketch after this description). I also compared using Vec3 and Vec3A, and it turned out that promoting Vec3s to Vec3A improved the performance of the culling significantly, because Vec3A operations use SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::intersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type yet; rather, an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with: the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer, and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and by the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, an entity can be marked as not visible in an editor. `ComputedVisibility`, on the other hand, is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not visible in its `Visibility` component, that will skip the culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional, as it is not always desired, and in that case testing 5 planes instead of 6 is a performance win.

I think that's about all. I discussed some of the design with @cart on Discord already, so hopefully it's not too far from being mergeable. It works well at least. 😄

2021-11-07 21:45:52 +00:00
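A hedged illustration of the naïve early-out test described above, assuming glam's `Vec3A` and `Vec4`; the `Plane` struct and the `sphere_intersects_frustum` name are made up for this sketch and are not the bevy_render API:

```rust
use glam::{Vec3A, Vec4};

// Illustrative only: not bevy_render's Frustum/Sphere code.
struct Plane {
    // (normal.x, normal.y, normal.z, d), with the normal pointing into the frustum.
    normal_d: Vec4,
}

fn sphere_intersects_frustum(center: Vec3A, radius: f32, planes: &[Plane; 6]) -> bool {
    for plane in planes {
        let normal = Vec3A::new(plane.normal_d.x, plane.normal_d.y, plane.normal_d.z);
        // Early out: a sphere entirely behind any single plane cannot
        // intersect the frustum, so stop testing the remaining planes.
        if normal.dot(center) + plane.normal_d.w + radius <= 0.0 {
            return false;
        }
    }
    true
}
```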
point_lights: Query<(Entity, &ExtractedPointLight)>,
directional_lights: Query<(Entity, &ExtractedDirectionalLight)>,
) {
Sprite Batching (#3060)

This implements the following:

* **Sprite Batching**: Collects sprites in a vertex buffer to draw many sprites with a single draw call. Sprites are batched by their `Handle<Image>` within a specific z-level. When possible, sprites are opportunistically batched _across_ z-levels (when no sprites with a different texture exist between two sprites with the same texture on different z-levels). With these changes, I can now get ~130,000 sprites at 60fps on the `bevymark_pipelined` example. A sketch of the batching rule follows below.
* **Sprite Color Tints**: The `Sprite` type now has a `color` field. Non-white color tints result in a specialized render pipeline that passes the color in as a vertex attribute. I chose to specialize this because passing vertex colors has a measurable price (without colors I get ~130,000 sprites on bevymark, with colors I get ~100,000 sprites). "Colored" sprites cannot be batched with "uncolored" sprites, but I think this is fine, because the chance of a "colored" sprite needing to batch with other "colored" sprites is generally much higher than an "uncolored" sprite needing to batch with a "colored" sprite.
* **Sprite Flipping**: Sprites can be flipped on their x or y axis using `Sprite::flip_x` and `Sprite::flip_y`. This is also true for `TextureAtlasSprite`.
* **Simpler BufferVec/UniformVec/DynamicUniformVec Clearing**: Improved the clearing interface by removing the need to know the size of the final buffer at the initial clear.

![image](https://user-images.githubusercontent.com/2694663/140001821-99be0d96-025d-489e-9bfa-ba19c1dc9548.png)

Note that this moves sprites away from entity-driven rendering and back to extracted lists. We _could_ use entities here, but it necessitates that an intermediate list is allocated / populated to collect and sort extracted sprites. This redundant copy, combined with the normal overhead of spawning extracted sprite entities, brings bevymark down to ~80,000 sprites at 60fps. I think making sprites a bit more fixed (by default) is worth it. I view this as acceptable because batching makes normal entity-driven rendering pretty useless anyway (and we would want to batch most custom materials too). We can still support custom shaders with custom bindings; we'll just need to define a specific interface for it.

2021-11-04 20:28:53 +00:00
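As a sketch of the batching rule described above: after sorting by z, consecutive sprites sharing a texture handle fold into one batch, which also merges across z-levels when no other texture interleaves. All types and names here (`ImageHandleId`, `ExtractedSprite`, `Batch`, `batch_sprites`) are illustrative stand-ins, not bevy_sprite's real ones:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct ImageHandleId(u64); // stand-in for Handle<Image>

struct ExtractedSprite {
    z: f32,
    image: ImageHandleId,
}

struct Batch {
    image: ImageHandleId,
    count: usize,
}

fn batch_sprites(mut sprites: Vec<ExtractedSprite>) -> Vec<Batch> {
    sprites.sort_by(|a, b| a.z.total_cmp(&b.z));
    let mut batches: Vec<Batch> = Vec::new();
    for sprite in &sprites {
        match batches.last_mut() {
            // Same texture as the previous sprite: grow the current batch,
            // even across a z-level change (opportunistic cross-z batching).
            Some(batch) if batch.image == sprite.image => batch.count += 1,
            _ => batches.push(Batch { image: sprite.image, count: 1 }),
        }
    }
    batches
}

fn main() {
    let (tex_a, tex_b) = (ImageHandleId(1), ImageHandleId(2));
    let sprites = vec![
        ExtractedSprite { z: 0.0, image: tex_a },
        ExtractedSprite { z: 1.0, image: tex_a }, // merges across z with the one above
        ExtractedSprite { z: 1.0, image: tex_b }, // new texture -> new batch
    ];
    for batch in batch_sprites(sprites) {
        println!("texture {} x{}", batch.image.0, batch.count);
    }
}
```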
light_meta.view_gpu_lights.clear();
// Pre-calculate the cube-face projection shared by all point lights: 90 degree fov, square aspect, infinite reverse-z
let cube_face_projection =
Mat4::perspective_infinite_reverse_rh(std::f32::consts::FRAC_PI_2, 1.0, POINT_LIGHT_NEAR_Z);
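// One view rotation per cube-map face: looking along +X, -X, +Y, -Y, +Z, and -Z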
let cube_face_rotations = CUBE_MAP_FACES
.iter()
.map(|CubeMapFace { target, up }| GlobalTransform::identity().looking_at(*target, *up))
.collect::<Vec<_>>();
global_light_meta.gpu_point_lights.clear();
global_light_meta.entity_to_index.clear();
let mut point_lights: Vec<_> = point_lights.iter().collect();
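// Each shadow-casting point light needs a whole cube map, i.e. 6 texture array layers, hence the division by 6 below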
#[cfg(not(feature = "webgl"))]
let max_point_light_shadow_maps = point_lights
.iter()
.filter(|light| light.1.shadows_enabled)
.count()
.min((render_device.limits().max_texture_array_layers / 6) as usize);
#[cfg(feature = "webgl")]
let max_point_light_shadow_maps = 1;
// Sort point lights with shadows enabled first, then by a stable key so that the index can be used
// to render at most `max_point_light_shadow_maps` point light shadows.
point_lights.sort_by(|(entity_1, light_1), (entity_2, light_2)| {
point_light_order(
(entity_1, &light_1.shadows_enabled),
(entity_2, &light_2.shadows_enabled),
)
});
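// `point_light_order` compares on this pattern (a sketch of the expected ordering, not its actual body):
//     shadows_enabled_2.cmp(shadows_enabled_1)     // shadow casters first
//         .then_with(|| entity_1.cmp(entity_2))    // Entity as the stable tie-breaker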
if global_light_meta.entity_to_index.capacity() < point_lights.len() {
global_light_meta
.entity_to_index
.reserve(point_lights.len());
}
Use storage buffers for clustered forward point lights (#3989)

# Objective

- Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene
- Fixes #3605
- Based on top of #4079

This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters, which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR.

## Solution

- Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability
- Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them so that usage is the same for both cases, and the only difference is at initialisation time. (A sketch of this pattern follows below.)
- Add appropriate shader defs in the shader code to handle the two cases

## Context on some decisions / open questions

- I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management', but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern from this PR, so I think it should be handled separately.
- Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants?

Co-authored-by: Carter Anderson <mcanders1@gmail.com>

2022-04-07 16:16:35 +00:00
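A minimal sketch of the enum-over-backings pattern described above, using plain `Vec`s as stand-ins for `UniformVec` and `StorageBuffer`; the type and method names are illustrative, not the real `GpuPointLights` API:

```rust
// Illustrative stand-in, not bevy_pbr's GpuPointLights.
enum GpuLightsBuffer<T> {
    // WebGL2 / limited devices: packed into a uniform buffer (16kB binding limit).
    Uniform(Vec<T>),
    // Devices with enough storage-buffer bindings: no such size limit.
    Storage(Vec<T>),
}

impl<T> GpuLightsBuffer<T> {
    // The availability check described above: fall back to the uniform
    // packing unless at least 3 storage buffers per shader stage exist.
    fn new(max_storage_buffers_per_shader_stage: u32) -> Self {
        if max_storage_buffers_per_shader_stage >= 3 {
            Self::Storage(Vec::new())
        } else {
            Self::Uniform(Vec::new())
        }
    }

    // After initialisation, call sites are identical for both variants.
    fn clear(&mut self) {
        match self {
            Self::Uniform(v) | Self::Storage(v) => v.clear(),
        }
    }

    fn push(&mut self, value: T) {
        match self {
            Self::Uniform(v) | Self::Storage(v) => v.push(value),
        }
    }
}
```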
let mut gpu_point_lights = Vec::new();
for (index, &(entity, light)) in point_lights.iter().enumerate() {
let mut flags = PointLightFlags::NONE;
// Lights are sorted so shadow-enabled lights come first; only the first max_point_light_shadow_maps of them get a shadow map
if light.shadows_enabled && index < max_point_light_shadow_maps {
flags |= PointLightFlags::SHADOWS_ENABLED;
}
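// Pack this light into the CPU-side list that backs the GPU binding. Note from
// #3153: uniform buffer bindings are limited to 16384 bytes per binding
// (notably on WebGL2), which caps this at 256 lights at the current size of
// GpuPointLight; #3989 lifts that cap by using a storage buffer where the
// device supports one.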
gpu_point_lights.push(GpuPointLight {
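// Only the lower-right 2x2 block of the cube-face projection matrix is packed:
// for a perspective projection, the clip-space z and w of a point depend only
// on the point's view-space z and these four coefficients, which is all that
// is needed to reconstruct a comparison depth for cubemap shadow sampling.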
projection_lr: Vec4::new(
cube_face_projection.z_axis.z,
cube_face_projection.z_axis.w,
cube_face_projection.w_axis.z,
cube_face_projection.w_axis.w,
),
// premultiply color by intensity; the alpha channel is unused, so there is
// no harm in multiplying all four components rather than only rgb
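// the w component holds 1 / range^2 so the shader can attenuate and cull by
// squared distance without a per-fragment division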
color_inverse_square_range: (Vec4::from_slice(&light.color.as_linear_rgba_f32())
* light.intensity)
.xyz()
.extend(1.0 / (light.range * light.range)),
position_radius: light.transform.translation.extend(light.radius),
flags: flags.bits,
shadow_depth_bias: light.shadow_depth_bias,
shadow_normal_bias: light.shadow_normal_bias,
});
global_light_meta.entity_to_index.insert(entity, index);
}
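// GpuPointLights is an enum wrapping either a UniformVec or a StorageBuffer
// variant (#3989); both variants expose the same API, so this upload path is
// identical whether or not storage buffers are available.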
global_light_meta.gpu_point_lights.push(gpu_point_lights);
global_light_meta
.gpu_point_lights
.write_buffer(&render_device, &render_queue);
// set up light data for each view
for (entity, extracted_view, clusters) in views.iter() {
let point_light_depth_texture = texture_cache.get(
&render_device,
TextureDescriptor {
size: Extent3d {
width: point_light_shadow_map.size as u32,
height: point_light_shadow_map.size as u32,
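// six consecutive layers (one cube face each) per shadow-casting point
// light; .max(1) keeps the texture valid when no point light casts shadows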
depth_or_array_layers: max_point_light_shadow_maps.max(1) as u32 * 6,
},
mip_level_count: 1,
sample_count: 1,
dimension: TextureDimension::D2,
format: SHADOW_FORMAT,
label: Some("point_light_shadow_map_texture"),
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
},
);
let directional_light_depth_texture = texture_cache.get(
&render_device,
TextureDescriptor {
size: Extent3d {
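// clamp the configured shadow map size to what the device actually supports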
width: (directional_light_shadow_map.size as u32)
.min(render_device.limits().max_texture_dimension_2d),
height: (directional_light_shadow_map.size as u32)
.min(render_device.limits().max_texture_dimension_2d),
depth_or_array_layers: DIRECTIONAL_SHADOW_LAYERS,
},
mip_level_count: 1,
sample_count: 1,
dimension: TextureDimension::D2,
format: SHADOW_FORMAT,
label: Some("directional_light_shadow_map_texture"),
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
},
);
let mut view_lights = Vec::new();
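// An orthographic projection has projection[3][3] (w_axis.w) == 1.0 where a
// perspective projection has 0.0, so this cheaply tells the two apart (#3316).
// Orthographic views need linear depth slicing: the default orthographic near
// plane is 0.0, and the perspective slicing divides by the near plane, which
// would produce NaN and break clustering.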
let is_orthographic = extracted_view.projection.w_axis.w == 1.0;
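// For perspective views, these zw factors implement the exponential
// ("Doom 2016") depth slicing from #3153: the slice for view-space depth z is
// an affine function of log(z), roughly
//   slice = log(z) * n_slices / log(far / near) - log(near) * n_slices / log(far / near)
// so only that scale and bias need to be passed to the shader. Orthographic
// views use a linear mapping instead (#3316).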
let cluster_factors_zw = calculate_cluster_factors(
clusters.near,
clusters.far,
clusters.dimensions.z as f32,
is_orthographic,
);
let n_clusters = clusters.dimensions.x * clusters.dimensions.y * clusters.dimensions.z;
let mut gpu_lights = GpuLights {
directional_lights: [GpuDirectionalLight::default(); MAX_DIRECTIONAL_LIGHTS],
ambient_color: Vec4::from_slice(&ambient_light.color.as_linear_rgba_f32())
* ambient_light.brightness,
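// xy: clusters per pixel, since the xy grid divisions are regular in screen
// space; zw: the depth-slice factors computed above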
cluster_factors: Vec4::new(
clusters.dimensions.x as f32 / extracted_view.width as f32,
clusters.dimensions.y as f32 / extracted_view.height as f32,
cluster_factors_zw.x,
cluster_factors_zw.y,
),
cluster_dimensions: clusters.dimensions.extend(n_clusters),
n_directional_lights: directional_lights.iter().len() as u32,
};
// TODO: this should select lights based on relevance to the view instead of the first ones that show up in a query
for &(light_entity, light) in point_lights
.iter()
// Lights are sorted so that shadow-enabled lights come first
.take(max_point_light_shadow_maps)
.filter(|(_, light)| light.shadows_enabled)
{
let light_index = *global_light_meta
.entity_to_index
.get(&light_entity)
.unwrap();
// Ignore scale, because we don't want to effectively scale the light's radius and range
// by applying it as a view transform to shadow map rendering of objects,
// and ignore rotation, because we want the shadow map projections to align with the axes
let view_translation = GlobalTransform::from_translation(light.transform.translation);
for (face_index, view_rotation) in cube_face_rotations.iter().enumerate() {
let depth_texture_view =
point_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("point_light_shadow_map_texture_view"),
format: None,
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
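// Each point light owns 6 consecutive array layers of the shadow map
// texture, one per cube face, so this face's layer index is
// `light_index * 6 + face_index`.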
base_array_layer: (light_index * 6 + face_index) as u32,
array_layer_count: NonZeroU32::new(1),
});
let view_light_entity = commands
.spawn()
.insert_bundle((
ShadowView {
depth_texture_view,
pass_name: format!(
"shadow pass point light {} {}",
light_index,
face_index_to_name(face_index)
),
},
ExtractedView {
width: point_light_shadow_map.size as u32,
height: point_light_shadow_map.size as u32,
transform: view_translation * *view_rotation,
projection: cube_face_projection,
near: POINT_LIGHT_NEAR_Z,
far: light.range,
},
RenderPhase::<Shadow>::default(),
LightEntity::Point {
light_entity,
face_index,
},
))
.id();
view_lights.push(view_light_entity);
}
}
for (i, (light_entity, light)) in directional_lights
.iter()
.enumerate()
.take(MAX_DIRECTIONAL_LIGHTS)
{
// direction is negated to be ready for the N·L dot product with the surface normal
let dir_to_light = -light.direction;
// convert from illuminance (lux) to candelas
//
// exposure is hard coded at the moment but should be replaced
// by values coming from the camera
// see: https://google.github.io/filament/Filament.html#imagingpipeline/physicallybasedcamera/exposuresettings
const APERTURE: f32 = 4.0;
const SHUTTER_SPEED: f32 = 1.0 / 250.0;
const SENSITIVITY: f32 = 100.0;
let ev100 =
f32::log2(APERTURE * APERTURE / SHUTTER_SPEED) - f32::log2(SENSITIVITY / 100.0);
let exposure = 1.0 / (f32::powf(2.0, ev100) * 1.2);
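// Worked example with the constants above: ev100 = log2(16.0 / (1.0 / 250.0))
// = log2(4000.0) ≈ 11.97, so exposure = 1.0 / (4000.0 * 1.2) ≈ 2.08e-4, and a
// 100,000 lux illuminance (bright sunlight) maps to an intensity of ≈ 20.8.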
let intensity = light.illuminance * exposure;
// NOTE: A directional light seems to need its eye position on the line that runs through
// the world origin along the light's direction. I (Rob Swain) do not yet understand why
// it cannot be translated away from this line.
let view = Mat4::look_at_rh(Vec3::ZERO, light.direction, Vec3::Y);
// NOTE: This orthographic projection defines the volume within which shadows from a directional light can be cast
let projection = light.projection;
let mut flags = DirectionalLightFlags::NONE;
if light.shadows_enabled {
flags |= DirectionalLightFlags::SHADOWS_ENABLED;
}
gpu_lights.directional_lights[i] = GpuDirectionalLight {
// premultiply color by intensity
// we don't use the alpha channel at all, so there is no reason to restrict the multiply to [0..3]
color: Vec4::from_slice(&light.color.as_linear_rgba_f32()) * intensity,
dir_to_light,
// NOTE: Multiplying by `view` (not `view.inverse()`) is correct here: `look_at_rh`
// already returns a world-to-light-space matrix, which is what the shadow
// view-projection needs. (The `ExtractedView` below stores `view.inverse()`
// instead, as it expects a light-to-world transform.)
view_projection: projection * view,
flags: flags.bits,
shadow_depth_bias: light.shadow_depth_bias,
shadow_normal_bias: light.shadow_normal_bias,
};
if light.shadows_enabled {
let depth_texture_view =
directional_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("directional_light_shadow_map_texture_view"),
format: None,
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: i as u32,
array_layer_count: NonZeroU32::new(1),
});
let view_light_entity = commands
.spawn()
.insert_bundle((
ShadowView {
depth_texture_view,
pass_name: format!("shadow pass directional light {}", i),
},
ExtractedView {
width: directional_light_shadow_map.size as u32,
height: directional_light_shadow_map.size as u32,
transform: GlobalTransform::from_matrix(view.inverse()),
projection,
near: light.near,
far: light.far,
},
RenderPhase::<Shadow>::default(),
LightEntity::Directional { light_entity },
))
.id();
view_lights.push(view_light_entity);
}
}
let point_light_depth_texture_view =
point_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("point_light_shadow_map_array_texture_view"),
format: None,
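// NOTE: WebGL2 does not support cube map array textures, so the
// webgl feature falls back to a view over a single cube map.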
#[cfg(not(feature = "webgl"))]
dimension: Some(TextureViewDimension::CubeArray),
#[cfg(feature = "webgl")]
dimension: Some(TextureViewDimension::Cube),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: 0,
array_layer_count: None,
});
let directional_light_depth_texture_view = directional_light_depth_texture
.texture
.create_view(&TextureViewDescriptor {
label: Some("directional_light_shadow_map_array_texture_view"),
format: None,
#[cfg(not(feature = "webgl"))]
dimension: Some(TextureViewDimension::D2Array),
#[cfg(feature = "webgl")]
dimension: Some(TextureViewDimension::D2),
aspect: TextureAspect::All,
base_mip_level: 0,
mip_level_count: None,
base_array_layer: 0,
array_layer_count: None,
});
commands.entity(entity).insert_bundle((
ViewShadowBindings {
point_light_depth_texture: point_light_depth_texture.texture,
point_light_depth_texture_view,
directional_light_depth_texture: directional_light_depth_texture.texture,
directional_light_depth_texture_view,
},
ViewLightEntities {
lights: view_lights,
},
ViewLightsUniformOffset {
offset: light_meta.view_gpu_lights.push(gpu_lights),
},
));
}
light_meta
.view_gpu_lights
.write_buffer(&render_device, &render_queue);
}
// this must match CLUSTER_COUNT_SIZE in pbr.wgsl
// and must be large enough to contain MAX_UNIFORM_BUFFER_POINT_LIGHTS
const CLUSTER_COUNT_SIZE: u32 = 13;
const CLUSTER_OFFSET_MASK: u32 = (1 << (32 - CLUSTER_COUNT_SIZE)) - 1;
const CLUSTER_COUNT_MASK: u32 = (1 << CLUSTER_COUNT_SIZE) - 1;
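// With CLUSTER_COUNT_SIZE = 13, CLUSTER_OFFSET_MASK = (1 << 19) - 1 = 0x7FFFF
// and CLUSTER_COUNT_MASK = (1 << 13) - 1 = 0x1FFF.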
const POINT_LIGHT_INDEX_MASK: u32 = (1 << 8) - 1;
// NOTE: With the uniform buffer max binding size of 16384 bytes,
// we can fit at most 256 point lights in one uniform buffer,
// which means the count can be at most 256, so it needs 9 bits.
// The array of indices can also use u8, so the offset into
// the array of indices needs to be able to address
// 16384 values. log2(16384) = 14 bits.
// We use 32 bits to store the pair, so we choose to divide the
// remaining 9 bits proportionally to give some future room.
// This means we can pack the offset into the upper 19 bits of a u32
// and the count into the lower 13 bits.
// NOTE: This assumes CPU and GPU endianness are the same, which is true
// for all common and tested x86/ARM CPUs and AMD/NVIDIA/Intel/Apple/etc. GPUs.
fn pack_offset_and_count(offset: usize, count: usize) -> u32 {
((offset as u32 & CLUSTER_OFFSET_MASK) << CLUSTER_COUNT_SIZE)
| (count as u32 & CLUSTER_COUNT_MASK)
}
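// For illustration only, a hypothetical inverse of `pack_offset_and_count`
// (the real decode happens on the GPU in pbr.wgsl using the same shift and masks):
#[allow(dead_code)]
fn unpack_offset_and_count(packed: u32) -> (usize, usize) {
    (
        // offset: the upper 19 bits
        ((packed >> CLUSTER_COUNT_SIZE) & CLUSTER_OFFSET_MASK) as usize,
        // count: the lower 13 bits
        (packed & CLUSTER_COUNT_MASK) as usize,
    )
}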
enum ViewClusterBuffers {
Uniform {
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
cluster_light_index_lists: UniformVec<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
cluster_offsets_and_counts: UniformVec<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
},
Storage {
cluster_light_index_lists: StorageBuffer<u32>,
cluster_offsets_and_counts: StorageBuffer<UVec2>,
},
}
impl ViewClusterBuffers {
fn new(buffer_binding_type: BufferBindingType) -> Self {
match buffer_binding_type {
BufferBindingType::Storage { .. } => Self::storage(),
BufferBindingType::Uniform => Self::uniform(),
}
}
fn uniform() -> Self {
ViewClusterBuffers::Uniform {
cluster_light_index_lists: UniformVec::default(),
cluster_offsets_and_counts: UniformVec::default(),
}
}
fn storage() -> Self {
ViewClusterBuffers::Storage {
cluster_light_index_lists: StorageBuffer::default(),
cluster_offsets_and_counts: StorageBuffer::default(),
}
}
}
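// A minimal sketch of how the buffer binding type might be selected,
// following the `max_storage_buffers_per_shader_stage >= 3` check described
// in the #3989 notes above. The helper name and exact limit plumbing are
// assumptions for illustration, not this file's actual code:
fn illustrative_cluster_binding_type(render_device: &RenderDevice) -> BufferBindingType {
    if render_device.limits().max_storage_buffers_per_shader_stage >= 3 {
        BufferBindingType::Storage { read_only: true }
    } else {
        BufferBindingType::Uniform
    }
}
// The result would then feed initialisation: ViewClusterBuffers::new(binding_type).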
#[derive(Component)]
pub struct ViewClusterBindings {
n_indices: usize,
n_offsets: usize,
buffers: ViewClusterBuffers,
}
impl ViewClusterBindings {
// 16384 bytes per uniform binding / 4 bytes per packed offset-and-count u32
pub const MAX_OFFSETS: usize = 16384 / 4;
// Std140 arrays are arrays of 16-byte UVec4s, each holding 4 packed u32s
const MAX_UNIFORM_ITEMS: usize = Self::MAX_OFFSETS / 4;
Dynamic light clusters (#3968)

# Objective

- provide some customisation for default cluster setup
- avoid "cluster index lists is full" in all cases (using a strategy outlined by @superdump)

## Solution

Add ClusterConfig enum (which can be inserted into a view at any time) to allow specifying cluster setup with variants:
- None (do not do any light assignment - for views which do not require light info, e.g. minimaps etc)
- Single (one cluster)
- XYZ (explicit cluster counts in each dimension)
- FixedZ (most similar to current - specify Z-slices and total, then x and y counts are dynamically determined to give approximately square clusters based on current aspect ratio)

Defaults to FixedZ { total: 4096, z: 24 }, which is similar to the current setup.

Per frame, estimate the number of indices that would be required for the current config and decrease the cluster counts / increase the cluster sizes in the x and y dimensions if the index list would be too small.

notes:
- I didn't put ClusterConfig in the camera bundles to avoid introducing a dependency from bevy_render to bevy_pbr. The ClusterConfig enum comes with a pbr-centric impl block so I didn't want to move that into bevy_render either.
- ~Might want to add None variant to cluster config for views that don't care about lights?~
- Not well tested for orthographic
- ~there's a cluster_muck branch on my repo which includes some diagnostics / a modified lighting example which may be useful for tyre-kicking~ (outdated, I will bring it up to date if required)

anecdotal timings: FPS on the lighting demo is negligibly better (~5%), maybe due to a small optimisation constraining the light aabb to be in front of the camera. FPS on the lighting demo with 100 extra lights added is ~33% faster, and also renders correctly as the cluster index count is no longer exceeded.
2022-03-08 04:56:42 +00:00
// 8-bit light indices packed 4 to a u32: 16384 indices fill the 16kB binding
pub const MAX_INDICES: usize = 16384;
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
Self {
n_indices: 0,
n_offsets: 0,
buffers: ViewClusterBuffers::new(buffer_binding_type),
}
}
pub fn reserve_and_clear(&mut self) {
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
cluster_light_index_lists.clear();
cluster_light_index_lists.push([UVec4::ZERO; Self::MAX_UNIFORM_ITEMS]);
cluster_offsets_and_counts.clear();
cluster_offsets_and_counts.push([UVec4::ZERO; Self::MAX_UNIFORM_ITEMS]);
}
ViewClusterBuffers::Storage {
cluster_light_index_lists,
cluster_offsets_and_counts,
..
} => {
cluster_light_index_lists.clear();
cluster_offsets_and_counts.clear();
}
}
}
pub fn push_offset_and_count(&mut self, offset: usize, count: usize) {
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_offsets_and_counts,
..
} => {
let array_index = self.n_offsets >> 2; // >> 2 is equivalent to / 4
if array_index >= Self::MAX_UNIFORM_ITEMS {
warn!("cluster offset and count out of bounds!");
return;
}
let component = self.n_offsets & ((1 << 2) - 1); // & 3 is equivalent to % 4
let packed = pack_offset_and_count(offset, count);
cluster_offsets_and_counts.get_mut(0)[array_index][component] = packed;
}
ViewClusterBuffers::Storage {
cluster_offsets_and_counts,
..
} => {
cluster_offsets_and_counts.push(UVec2::new(offset as u32, count as u32));
}
}
self.n_offsets += 1;
}
pub fn n_indices(&self) -> usize {
self.n_indices
}
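    /// Packs a point light index into the cluster light index lists.
    ///
    /// In the uniform-buffer case, four 8-bit indices are packed per `u32`
    /// component (sixteen per `UVec4` element); in the storage-buffer case the
    /// index is pushed unpacked. A minimal sketch of the counter behaviour,
    /// assuming `bindings` is a freshly created `ViewClusterBindings`:
    ///
    /// ```ignore
    /// bindings.push_index(3);
    /// assert_eq!(bindings.n_indices(), 1);
    /// ```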
pub fn push_index(&mut self, index: usize) {
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
..
} => {
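                // Each `UVec4` element packs 16 eight-bit indices: 4 per u32
                // component, 4 components per element. Derive the element,
                // component, and byte position from the running index count.
                // E.g. n_indices == 21 -> element 1, component 1 (y), byte 1.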
let array_index = self.n_indices >> 4; // >> 4 is equivalent to / 16
let component = (self.n_indices >> 2) & ((1 << 2) - 1);
let sub_index = self.n_indices & ((1 << 2) - 1);
let index = index as u32 & POINT_LIGHT_INDEX_MASK;
cluster_light_index_lists.get_mut(0)[array_index][component] |=
index << (8 * sub_index);
}
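            // Storage buffers are not bound by the 16kB uniform limit, so
            // indices are stored unpacked, one u32 each.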
ViewClusterBuffers::Storage {
cluster_light_index_lists,
..
} => {
cluster_light_index_lists.push(index as u32);
}
}
self.n_indices += 1;
}
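    /// Writes the cluster light index lists and the per-cluster offsets and
    /// counts to the GPU, whichever buffer backing is in use.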
pub fn write_buffers(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
match &mut self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
cluster_light_index_lists.write_buffer(render_device, render_queue);
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
}
ViewClusterBuffers::Storage {
cluster_light_index_lists,
cluster_offsets_and_counts,
} => {
cluster_light_index_lists.write_buffer(render_device, render_queue);
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
}
}
}
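    /// Returns the binding resource for the cluster light index lists, if the
    /// underlying buffer has been written.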
pub fn light_index_lists_binding(&self) -> Option<BindingResource> {
match &self.buffers {
ViewClusterBuffers::Uniform {
cluster_light_index_lists,
..
} => cluster_light_index_lists.binding(),
ViewClusterBuffers::Storage {
cluster_light_index_lists,
..
} => cluster_light_index_lists.binding(),
}
}
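    /// Returns the binding resource for the per-cluster offsets and counts, if
    /// the underlying buffer has been written.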
pub fn offsets_and_counts_binding(&self) -> Option<BindingResource> {
match &self.buffers {
ViewClusterBuffers::Uniform {
cluster_offsets_and_counts,
..
} => cluster_offsets_and_counts.binding(),
ViewClusterBuffers::Storage {
cluster_offsets_and_counts,
..
} => cluster_offsets_and_counts.binding(),
}
}
}
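/// For each view that has a `Transparent3d` render phase, flattens the
/// extracted per-cluster light lists into a `ViewClusterBindings` and writes
/// the resulting buffers to the GPU.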
pub fn prepare_clusters(
mut commands: Commands,
render_device: Res<RenderDevice>,
render_queue: Res<RenderQueue>,
mesh_pipeline: Res<MeshPipeline>,
global_light_meta: Res<GlobalLightMeta>,
views: Query<
(
Entity,
&ExtractedClusterConfig,
&ExtractedClustersPointLights,
),
With<RenderPhase<Transparent3d>>,
>,
) {
let render_device = render_device.into_inner();
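    // Mirror the binding type the mesh pipeline chose at initialisation, so we
    // know whether the packed uniform-buffer index limit applies below.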
let supports_storage_buffers = matches!(
mesh_pipeline.clustered_forward_buffer_binding_type,
BufferBindingType::Storage { .. }
);
for (entity, cluster_config, extracted_clusters) in views.iter() {
let mut view_clusters_bindings =
ViewClusterBindings::new(mesh_pipeline.clustered_forward_buffer_binding_type);
view_clusters_bindings.reserve_and_clear();
let mut indices_full = false;
let mut cluster_index = 0;
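        // Clusters are visited in y, then x, then z order, matching the linear
        // layout of `extracted_clusters.data`.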
for _y in 0..cluster_config.dimensions.y {
for _x in 0..cluster_config.dimensions.x {
for _z in 0..cluster_config.dimensions.z {
let offset = view_clusters_bindings.n_indices();
let cluster_lights = &extracted_clusters.data[cluster_index];
let count = cluster_lights.len();
view_clusters_bindings.push_offset_and_count(offset, count);
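                    // Offsets and counts are always recorded; once the packed
                    // index list is full, no further indices are pushed.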
if !indices_full {
for entity in cluster_lights.iter() {
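                        // Map the light entity to its index in the global
                        // light list prepared earlier.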
if let Some(light_index) = global_light_meta.entity_to_index.get(entity)
{
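                            // The uniform-buffer path packs indices into a
                            // fixed-size array, so lights beyond MAX_INDICES
                            // must be dropped; storage buffers simply grow.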
if view_clusters_bindings.n_indices()
>= ViewClusterBindings::MAX_INDICES
&& !supports_storage_buffers
Clustered forward rendering (#3153) # Objective Implement clustered-forward rendering. ## Solution ~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works. * The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+). * WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers. * Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct * The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space. * Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers. * Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. 
There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR. * I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times. Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on. * Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is. * Improve the z-slicing to use a larger first slice. * More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …) * Switch to using a texture instead of uniform buffer * Figure out the / a better story for shadows I will also probably add an example that demonstrates some of the issues: * What situations exhaust the space available in the uniform buffers * Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters * Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light * Perhaps some performance issues * How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
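As referenced above, here is a minimal sketch of the Doom-2016-style exponential z-slicing, assuming `z_slices`, `near`, and `far` are the cluster grid's depth slice count and the view's near/far distances (illustrative names, not the exact Bevy implementation):

```rust
/// Maps a view-space depth to a cluster z-slice index. Slices are
/// distributed exponentially, so they get deeper further from the camera:
///   slice = floor(ln(z) * N / ln(far / near) - N * ln(near) / ln(far / near))
/// The `scale` and `bias` terms can be precomputed per view, which is what
/// makes this cheap to evaluate in the fragment shader.
fn view_z_to_slice(view_z: f32, z_slices: u32, near: f32, far: f32) -> u32 {
    // `view_z` is assumed positive (distance in front of the camera);
    // in view space you would negate z first.
    let n = z_slices as f32;
    let log_ratio = (far / near).ln();
    let scale = n / log_ratio;
    let bias = n * near.ln() / log_ratio;
    let slice = (view_z.ln() * scale - bias).floor().max(0.0) as u32;
    slice.min(z_slices - 1)
}
```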
                                {
                                    warn!("Cluster light index lists are full! The PointLights in the view are affecting too many clusters.");
                                    indices_full = true;
                                    break;
                                }
                                view_clusters_bindings.push_index(*light_index);
                            }
                        }
                    }
                    cluster_index += 1;
                }
            }
        }
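For context, `push_index` above packs small cluster light indices into `u32` words so the index list fits within a uniform buffer binding. A minimal sketch of that bit-packing idea, assuming 8-bit indices packed four to a `u32` (`IndexPacker` is an illustrative stand-in, not the actual `ViewClusterBindings` API):

```rust
/// Illustrative packer: stores 8-bit light indices four-to-a-u32 so the
/// cluster light index list fits in a size-limited uniform buffer.
struct IndexPacker {
    words: Vec<u32>,
    count: usize,
}

impl IndexPacker {
    fn new() -> Self {
        Self { words: Vec::new(), count: 0 }
    }

    fn push_index(&mut self, index: u32) {
        let slot = self.count % 4;
        if slot == 0 {
            // Start a new word every four indices.
            self.words.push(0);
        }
        // Mask to 8 bits and shift into the next free byte of the word.
        let byte = index & 0xFF;
        *self.words.last_mut().unwrap() |= byte << (8 * slot as u32);
        self.count += 1;
    }
}
```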
Use storage buffers for clustered forward point lights (#3989)

# Objective

- Make use of storage buffers, where they are available, for clustered forward bindings to support far more point lights in a scene
- Fixes #3605
- Based on top of #4079

This branch on an M1 Max can keep 60fps with about 2150 point lights of radius 1m in the Sponza scene where I've been testing. The bottleneck is mostly assigning lights to clusters, which grows faster than linearly (I think 1000 lights was about 1.5ms and 5000 was 7.5ms). I have seen papers and presentations leveraging compute shaders that can get this up to over 1 million. That said, I think any further optimisations should probably be done in a separate PR.

## Solution

- Add `RenderDevice` to the `Material` and `SpecializedMaterial` trait `::key()` functions to allow setting flags on the keys depending on feature/limit availability
- Make `GpuPointLights` and `ViewClusterBuffers` into enums containing `UniformVec` and `StorageBuffer` variants. Implement the necessary API on them so that usage is the same for both cases, and the only difference is at initialisation time.
- Add appropriate shader defs in the shader code to handle the two cases

## Context on some decisions / open questions

- I'm using `max_storage_buffers_per_shader_stage >= 3` as a check to see if storage buffers are supported. I was thinking about diving into 'binding resource management', but it feels like we don't have enough use cases to understand the problem yet, and it is mostly a separate concern to this PR, so I think it should be handled separately.
- Should `ViewClusterBuffers` and `ViewClusterBindings` be merged, duplicating the count variables into the enum variants?

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-04-07 16:16:35 +00:00
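A minimal sketch of the enum-based abstraction described in the solution above, assuming simplified placeholder backends (the real `ViewClusterBuffers` wraps `UniformVec`/`StorageBuffer` and differs in detail):

```rust
/// Sketch: one type that is backed by a packed uniform buffer on limited
/// platforms (e.g. WebGL2) and by a storage buffer where available.
enum ClusterBuffers {
    Uniform { data: Vec<u32> }, // bounded, bit-packed representation
    Storage { data: Vec<u32> }, // effectively unbounded
}

impl ClusterBuffers {
    /// Choose the backend from a device limit, mirroring the
    /// `max_storage_buffers_per_shader_stage >= 3` check described above.
    fn new(max_storage_buffers_per_shader_stage: u32) -> Self {
        if max_storage_buffers_per_shader_stage >= 3 {
            ClusterBuffers::Storage { data: Vec::new() }
        } else {
            ClusterBuffers::Uniform { data: Vec::new() }
        }
    }

    /// Call sites stay identical for both variants; only initialisation
    /// differs, which is the point of the enum design.
    fn push(&mut self, value: u32) {
        match self {
            ClusterBuffers::Uniform { data } | ClusterBuffers::Storage { data } => {
                data.push(value)
            }
        }
    }
}
```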
        view_clusters_bindings.write_buffers(render_device, &render_queue);
        commands.get_or_spawn(entity).insert(view_clusters_bindings);
    }
}
/// Creates the bind group for the shadow pass view uniforms (binding 0) and
/// stores it in `LightMeta` for use when rendering shadow maps.
pub fn queue_shadow_view_bind_group(
    render_device: Res<RenderDevice>,
Pipeline Specialization, Shader Assets, and Shader Preprocessing (#3031)

## New Features

This adds the following to the new renderer:

* **Shader Assets**
  * Shaders are assets again! Users no longer need to call `include_str!` for their shaders
  * Shader hot-reloading
* **Shader Defs / Shader Preprocessing**
  * Shaders now support `#ifdef NAME`, `#ifndef NAME`, and `#endif` preprocessor directives
* **Bevy RenderPipelineDescriptor and RenderPipelineCache**
  * Bevy now provides its own `RenderPipelineDescriptor`, and the wgpu version is now exported as `RawRenderPipelineDescriptor`. This allows users to define pipelines with `Handle<Shader>` instead of needing to manually compile and reference `ShaderModules`, enables passing in shader defs to configure the shader preprocessor, makes hot reloading possible (because the descriptor can be owned and used to create new pipelines when a shader changes), and opens the doors to pipeline specialization.
  * The `RenderPipelineCache` now handles compiling and re-compiling Bevy RenderPipelineDescriptors. It has internal PipelineLayout and ShaderModule caches. Users receive a `CachedPipelineId`, which can be used to look up the actual `&RenderPipeline` during rendering.
* **Pipeline Specialization**
  * This enables defining per-entity-configurable pipelines that specialize on arbitrary custom keys. In practice this will involve specializing based on things like MSAA values, Shader Defs, Bind Group existence, and Vertex Layouts.
  * Adds a `SpecializedPipeline` trait and `SpecializedPipelines<MyPipeline>` resource. This is a simple layer that generates Bevy RenderPipelineDescriptors based on a custom key defined for the pipeline.
  * Specialized pipelines are also hot-reloadable.
  * This was the result of experimentation with two different approaches:
    1. **"Generic immediate mode multi-key hash pipeline specialization"**
       * Breaks up the pipeline into multiple "identities" (the core pipeline definition, shader defs, mesh layout, bind group layout). Each of these identities has its own key, and looking up / compiling a specific version of a pipeline requires composing all of these keys together.
       * The benefit of this approach is that it works for all pipelines: the pipeline is fully identified by the keys, and the multiple keys allow pre-hashing parts of the pipeline identity where possible (e.g. pre-computing the mesh identity for all meshes).
       * The downside is that any per-entity data that informs the values of these keys could require expensive re-hashes. Computing each key for each sprite tanked bevymark performance (sprites don't actually need this level of specialization yet... but things like PBR and future sprite scenarios might).
       * This is the approach rafx used, last time I checked.
    2. **"Custom key specialization"**
       * Pipelines by default are not specialized.
       * Pipelines that need specialization implement a SpecializedPipeline trait with a custom key associated type.
       * This allows specialization keys to encode exactly the amount of information required (instead of needing to be a combined hash of the entire pipeline). Generally this should fit in a small number of bytes. Per-entity specialization barely registers anymore on things like bevymark. It also makes things like "shader defs" way cheaper to hash, because we can use context-specific bitflags instead of strings.
       * Despite the extra trait, it actually generally makes pipeline definitions and lookups simpler: managing multiple keys (and making the appropriate calls to manage these keys) was way more complicated.
* I opted for custom key specialization. It performs better generally and, in my opinion, is better UX. Fortunately, the way this is implemented also allows for custom caches, as this all builds on a common abstraction: the RenderPipelineCache. The built-in custom key trait is just a simple, pre-defined way to interact with the cache. A hedged sketch of the pattern follows the next function below.

## Callouts

* The SpecializedPipeline trait makes it easy to inherit pipeline configuration in custom pipelines. The changes to `custom_shader_pipelined` and the new `shader_defs_pipelined` example illustrate how much simpler it is to define custom pipelines based on the PbrPipeline.
* The shader preprocessor is currently pretty naive (it just uses regexes to process each line). Ultimately we might want to build a more custom parser for more performance and better error handling, but for now I'm happy to optimize for "easy to implement and understand".

## Next Steps

* Port compute pipelines to the new system
* Add more preprocessor directives (else, elif, import)
* More flexible vertex attribute specialization / enable cheaply specializing on specific mesh vertex layouts
2021-10-28 19:07:47 +00:00
    shadow_pipeline: Res<ShadowPipeline>,
    mut light_meta: ResMut<LightMeta>,
    view_uniforms: Res<ViewUniforms>,
) {
    if let Some(view_binding) = view_uniforms.uniforms.binding() {
        light_meta.shadow_view_bind_group =
            Some(render_device.create_bind_group(&BindGroupDescriptor {
                entries: &[BindGroupEntry {
                    binding: 0,
                    resource: view_binding,
                }],
                label: Some("shadow_view_bind_group"),
                layout: &shadow_pipeline.view_layout,
            }));
    }
}
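To make the custom-key specialization pattern from the #3031 notes above concrete, here is a hedged sketch under simplified assumptions (standalone placeholder types; the real `SpecializedPipelines` / `RenderPipelineCache` APIs differ):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Placeholder for a compiled pipeline descriptor.
struct PipelineDescriptor {
    label: &'static str,
}

/// Sketch of the custom-key approach: each pipeline that needs variants
/// defines a small, cheap-to-hash key encoding exactly what can vary
/// (e.g. MSAA sample count as a bitflag-sized value).
trait SpecializedPipelineSketch {
    type Key: Copy + Eq + Hash;
    fn specialize(&self, key: Self::Key) -> PipelineDescriptor;
}

/// Cache of specialized variants, keyed by the specialization key, so each
/// distinct key is only compiled once.
struct SpecializedPipelinesSketch<P: SpecializedPipelineSketch> {
    variants: HashMap<P::Key, PipelineDescriptor>,
}

impl<P: SpecializedPipelineSketch> SpecializedPipelinesSketch<P> {
    fn new() -> Self {
        Self { variants: HashMap::new() }
    }

    /// Returns the cached variant for `key`, compiling it on first use.
    fn specialize(&mut self, pipeline: &P, key: P::Key) -> &PipelineDescriptor {
        self.variants
            .entry(key)
            .or_insert_with(|| pipeline.specialize(key))
    }
}
```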
#[allow(clippy::too_many_arguments)]
/// Queues visible, shadow-casting meshes into each view light's `Shadow` render phase.
pub fn queue_shadows(
    shadow_draw_functions: Res<DrawFunctions<Shadow>>,
    shadow_pipeline: Res<ShadowPipeline>,
Frustum culling (#2861)

# Objective

Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with an i9 9980HK 8c/16t CPU and a Radeon Pro 5500M. macOS does weird things with vsync, so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even greater, as I even saw up to 90+fps sometimes.

## Solution

- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta`, which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps and point light shadow maps.
- I did a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes, and found them to be very similar, so I chose the simpler and more readable option (a sketch follows this note). I also compared using Vec3 and Vec3A, and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly, due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::intersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type yet; rather, an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer, and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and by the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility`, on the other hand, is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional, as it is not always desired, and in that case testing 5 planes instead of 6 is a performance win.

I think that's about all. I discussed some of the design with @cart on Discord already, so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
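A minimal sketch of the naïve early-out plane test mentioned above, assuming planes stored in normal-distance form (hypothetical `Plane`/`Sphere` types, not the exact Bevy ones):

```rust
/// A plane in normal-distance form: dot(normal, p) + d = 0.
struct Plane {
    normal: [f32; 3],
    d: f32,
}

struct Sphere {
    center: [f32; 3],
    radius: f32,
}

fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Returns false as soon as the sphere is fully outside any one plane
/// (the early-out); otherwise the sphere at least touches the frustum.
fn sphere_intersects_frustum(sphere: &Sphere, planes: &[Plane; 6]) -> bool {
    for plane in planes {
        let signed_distance = dot(plane.normal, sphere.center) + plane.d;
        if signed_distance < -sphere.radius {
            return false; // completely outside this plane
        }
    }
    true
}
```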
    casting_meshes: Query<&Handle<Mesh>, Without<NotShadowCaster>>,
    render_meshes: Res<RenderAssets<Mesh>>,
Mesh vertex buffer layouts (#3959)

This PR makes a number of changes to how meshes and vertex attributes are handled, with the goal of enabling easy and flexible custom vertex attributes:

* Reworks the `Mesh` type to use the newly added `VertexAttribute` internally
  * `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat`
  * `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting"
* Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently.
* Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait... no need to add any additional info to the specialization key).
* Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally). Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because it's wrong! Hashes by themselves aren't enough, because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR.
* Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool!

To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now:

```rust
impl SpecializedMeshPipeline for MeshPipeline {
    type Key = MeshPipelineKey;

    fn specialize(
        &self,
        key: Self::Key,
        layout: &MeshVertexBufferLayout,
    ) -> RenderPipelineDescriptor {
        let mut vertex_attributes = vec![
            Mesh::ATTRIBUTE_POSITION.at_shader_location(0),
            Mesh::ATTRIBUTE_NORMAL.at_shader_location(1),
            Mesh::ATTRIBUTE_UV_0.at_shader_location(2),
        ];
        let mut shader_defs = Vec::new();
        if layout.contains(Mesh::ATTRIBUTE_TANGENT) {
            shader_defs.push(String::from("VERTEX_TANGENTS"));
            vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3));
        }
        let vertex_buffer_layout = layout
            .get_layout(&vertex_attributes)
            .expect("Mesh is missing a vertex attribute");
```

Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex positions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`.

This is still a draft because I still need to:

* Add more docs
* Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it.
* Consider breaking out the PreHash / hashbrown changes into a separate PR.
* Add an example illustrating this change
* Verify that the "mesh-specialized pipeline de-duplication code" works properly

Please don't yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap.

Alternative to #3120
Fixes #3030

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
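A hedged sketch of the pre-hashing idea described above, assuming a simplified `Hashed<T>` that stores the value together with its hash so the hash is computed once and reused (the real `bevy_utils` type additionally pairs this with a pass-through hasher inside `PreHashMap` to avoid re-hashing the stored hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stores a value alongside its precomputed hash. Keeping the value means
/// equality can still be checked on hash collisions, unlike a bare u64 key.
#[derive(Clone, PartialEq, Eq)]
struct Hashed<T: Hash + Eq> {
    value: T,
    hash: u64,
}

impl<T: Hash + Eq> Hashed<T> {
    /// Hash the value exactly once, at construction time.
    fn new(value: T) -> Self {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        Self { value, hash }
    }
}

impl<T: Hash + Eq> Hash for Hashed<T> {
    /// Feed the precomputed hash instead of re-walking the whole value.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}
```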
    mut pipelines: ResMut<SpecializedMeshPipelines<ShadowPipeline>>,
    mut pipeline_cache: ResMut<PipelineCache>,
    view_lights: Query<&ViewLightEntities>,
    mut view_light_shadow_phases: Query<(&LightEntity, &mut RenderPhase<Shadow>)>,
    point_light_entities: Query<&CubemapVisibleEntities, With<ExtractedPointLight>>,
    directional_light_entities: Query<&VisibleEntities, With<ExtractedDirectionalLight>>,
) {
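    // For each view, queue a `Shadow` phase item for every visible,
    // shadow-casting mesh of each of the view's lights.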
for view_lights in view_lights.iter() {
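        // Look up the id of the `DrawShadowMesh` draw function registered for
        // the `Shadow` phase.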
let draw_shadow_mesh = shadow_draw_functions
.read()
.get_id::<DrawShadowMesh>()
.unwrap();
for view_light_entity in view_lights.lights.iter().copied() {
let (light_entity, mut shadow_phase) =
view_light_shadow_phases.get_mut(view_light_entity).unwrap();
let visible_entities = match light_entity {
LightEntity::Directional { light_entity } => directional_light_entities
.get(*light_entity)
.expect("Failed to get directional light visible entities"),
LightEntity::Point {
light_entity,
face_index,
} => point_light_entities
.get(*light_entity)
.expect("Failed to get point light visible entities")
.get(*face_index),
};
            // NOTE: Lights with shadow mapping disabled will have no visible entities,
// so no meshes will be queued
for entity in visible_entities.iter().copied() {
if let Ok(mesh_handle) = casting_meshes.get(entity) {
if let Some(mesh) = render_meshes.get(mesh_handle) {
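                        // Specialize the shadow pipeline on this mesh's
                        // primitive topology and vertex buffer layout; on
                        // failure, log the error and skip the mesh.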
let key =
ShadowPipelineKey::from_primitive_topology(mesh.primitive_topology);
let pipeline_id = pipelines.specialize(
&mut pipeline_cache,
&shadow_pipeline,
key,
&mesh.layout,
);
let pipeline_id = match pipeline_id {
Ok(id) => id,
Err(err) => {
error!("{}", err);
continue;
}
};
shadow_phase.add(Shadow {
draw_function: draw_shadow_mesh,
pipeline: pipeline_id,
entity,
distance: 0.0, // TODO: sort back-to-front
});
}
}
}
}
}
}
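/// A phase item for the shadow pass: one entry per shadow-casting mesh visible
/// to a light's view. `distance` is currently always 0.0 (see the TODO above),
/// so shadow items are effectively unsorted.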
pub struct Shadow {
pub distance: f32,
pub entity: Entity,
pub pipeline: CachedRenderPipelineId,
pub draw_function: DrawFunctionId,
}
impl PhaseItem for Shadow {
type SortKey = FloatOrd;
#[inline]
fn sort_key(&self) -> Self::SortKey {
FloatOrd(self.distance)
}
#[inline]
fn draw_function(&self) -> DrawFunctionId {
self.draw_function
}
}
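// Exposing the entity lets `EntityRenderCommand`s (like `SetMeshBindGroup`)
// fetch per-entity data when drawing this item.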
impl EntityPhaseItem for Shadow {
fn entity(&self) -> Entity {
self.entity
}
}
impl CachedRenderPipelinePhaseItem for Shadow {
#[inline]
fn cached_pipeline(&self) -> CachedRenderPipelineId {
self.pipeline
}
}
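/// A render graph node that encodes one depth-only render pass per light view
/// (`ViewLightEntities`) attached to the node's input view entity.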
pub struct ShadowPassNode {
main_view_query: QueryState<&'static ViewLightEntities>,
view_light_query: QueryState<(&'static ShadowView, &'static RenderPhase<Shadow>)>,
}
impl ShadowPassNode {
pub const IN_VIEW: &'static str = "view";
pub fn new(world: &mut World) -> Self {
Self {
main_view_query: QueryState::new(world),
view_light_query: QueryState::new(world),
}
}
}
impl Node for ShadowPassNode {
fn input(&self) -> Vec<SlotInfo> {
vec![SlotInfo::new(ShadowPassNode::IN_VIEW, SlotType::Entity)]
}
fn update(&mut self, world: &mut World) {
self.main_view_query.update_archetypes(world);
self.view_light_query.update_archetypes(world);
}
fn run(
&self,
graph: &mut RenderGraphContext,
render_context: &mut RenderContext,
world: &World,
) -> Result<(), NodeRunError> {
let view_entity = graph.get_input_entity(Self::IN_VIEW)?;
if let Ok(view_lights) = self.main_view_query.get_manual(world, view_entity) {
for view_light_entity in view_lights.lights.iter().copied() {
let (view_light, shadow_phase) = self
.view_light_query
.get_manual(world, view_light_entity)
.unwrap();
if shadow_phase.items.is_empty() {
continue;
}
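                // The shadow pass is depth-only: no color attachments, and the
                // depth buffer is cleared to 0.0 because lights use a reverse-Z
                // projection (the far plane maps to depth 0.0).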
let pass_descriptor = RenderPassDescriptor {
label: Some(&view_light.pass_name),
color_attachments: &[],
depth_stencil_attachment: Some(RenderPassDepthStencilAttachment {
view: &view_light.depth_texture_view,
depth_ops: Some(Operations {
load: LoadOp::Clear(0.0),
store: true,
}),
stencil_ops: None,
}),
};
let draw_functions = world.resource::<DrawFunctions<Shadow>>();
let render_pass = render_context
.command_encoder
.begin_render_pass(&pass_descriptor);
let mut draw_functions = draw_functions.write();
let mut tracked_pass = TrackedRenderPass::new(render_pass);
for item in &shadow_phase.items {
let draw_function = draw_functions.get_mut(item.draw_function).unwrap();
draw_function.draw(world, &mut tracked_pass, view_light_entity, item);
}
}
}
Ok(())
}
}
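/// The render command tuple used to draw a mesh into a shadow map: set the
/// specialized pipeline, bind the shadow view uniforms at group 0 and the mesh
/// bind group at group 1, then draw the mesh.
///
/// As a rough sketch (the registration itself lives elsewhere in this crate),
/// a draw function like this is typically wired up with something like
/// `render_app.add_render_command::<Shadow, DrawShadowMesh>()`.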
pub type DrawShadowMesh = (
SetItemPipeline,
SetShadowViewBindGroup<0>,
SetMeshBindGroup<1>,
DrawMesh,
);
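/// A render command that binds [`LightMeta`]'s shadow view bind group at bind
/// group index `I`, using the view's dynamic uniform offset.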
pub struct SetShadowViewBindGroup<const I: usize>;
impl<const I: usize> EntityRenderCommand for SetShadowViewBindGroup<I> {
type Param = (SRes<LightMeta>, SQuery<Read<ViewUniformOffset>>);
#[inline]
fn render<'w>(
view: Entity,
_item: Entity,
(light_meta, view_query): SystemParamItem<'w, '_, Self::Param>,
pass: &mut TrackedRenderPass<'w>,
) -> RenderCommandResult {
let view_uniform_offset = view_query.get(view).unwrap();
pass.set_bind_group(
I,
light_meta
.into_inner()
.shadow_view_bind_group
.as_ref()
.unwrap(),
&[view_uniform_offset.offset],
);
RenderCommandResult::Success
}
}