2021-08-25 05:57:57 +00:00
|
|
|
use crate::{
|
2023-03-02 08:21:21 +00:00
|
|
|
directional_light_order, point_light_order, AlphaMode, AmbientLight, Cascade,
|
|
|
|
CascadeShadowConfig, Cascades, CascadesVisibleEntities, Clusters, CubemapVisibleEntities,
|
|
|
|
DirectionalLight, DirectionalLightShadowMap, DrawPrepass, EnvironmentMapLight,
|
|
|
|
GlobalVisiblePointLights, Material, MaterialPipelineKey, MeshPipeline, MeshPipelineKey,
|
|
|
|
NotShadowCaster, PointLight, PointLightShadowMap, PrepassPipeline, RenderMaterials, SpotLight,
|
|
|
|
VisiblePointLights,
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
};
|
|
|
|
use bevy_asset::Handle;
|
Camera Driven Rendering (#4745)
This adds "high level camera driven rendering" to Bevy. The goal is to give users more control over what gets rendered (and where) without needing to deal with render logic. This will make scenarios like "render to texture", "multiple windows", "split screen", "2d on 3d", "3d on 2d", "pass layering", and more significantly easier.
Here is an [example of a 2d render sandwiched between two 3d renders (each from a different perspective)](https://gist.github.com/cart/4fe56874b2e53bc5594a182fc76f4915):
![image](https://user-images.githubusercontent.com/2694663/168411086-af13dec8-0093-4a84-bdd4-d4362d850ffa.png)
Users can now spawn a camera, point it at a RenderTarget (a texture or a window), and it will "just work".
Rendering to a second window is as simple as spawning a second camera and assigning it to a specific window id:
```rust
// main camera (main window)
commands.spawn_bundle(Camera2dBundle::default());
// second camera (other window)
commands.spawn_bundle(Camera2dBundle {
camera: Camera {
target: RenderTarget::Window(window_id),
..default()
},
..default()
});
```
Rendering to a texture is as simple as pointing the camera at a texture:
```rust
commands.spawn_bundle(Camera2dBundle {
camera: Camera {
target: RenderTarget::Texture(image_handle),
..default()
},
..default()
});
```
Cameras now have a "render priority", which controls the order they are drawn in. If you want to use a camera's output texture as a texture in the main pass, just set the priority to a number lower than the main pass camera (which defaults to `0`).
```rust
// main pass camera with a default priority of 0
commands.spawn_bundle(Camera2dBundle::default());
commands.spawn_bundle(Camera2dBundle {
camera: Camera {
target: RenderTarget::Texture(image_handle.clone()),
priority: -1,
..default()
},
..default()
});
commands.spawn_bundle(SpriteBundle {
texture: image_handle,
..default()
})
```
Priority can also be used to layer to cameras on top of each other for the same RenderTarget. This is what "2d on top of 3d" looks like in the new system:
```rust
commands.spawn_bundle(Camera3dBundle::default());
commands.spawn_bundle(Camera2dBundle {
camera: Camera {
// this will render 2d entities "on top" of the default 3d camera's render
priority: 1,
..default()
},
..default()
});
```
There is no longer the concept of a global "active camera". Resources like `ActiveCamera<Camera2d>` and `ActiveCamera<Camera3d>` have been replaced with the camera-specific `Camera::is_active` field. This does put the onus on users to manage which cameras should be active.
Cameras are now assigned a single render graph as an "entry point", which is configured on each camera entity using the new `CameraRenderGraph` component. The old `PerspectiveCameraBundle` and `OrthographicCameraBundle` (generic on camera marker components like Camera2d and Camera3d) have been replaced by `Camera3dBundle` and `Camera2dBundle`, which set 3d and 2d default values for the `CameraRenderGraph` and projections.
```rust
// old 3d perspective camera
commands.spawn_bundle(PerspectiveCameraBundle::default())
// new 3d perspective camera
commands.spawn_bundle(Camera3dBundle::default())
```
```rust
// old 2d orthographic camera
commands.spawn_bundle(OrthographicCameraBundle::new_2d())
// new 2d orthographic camera
commands.spawn_bundle(Camera2dBundle::default())
```
```rust
// old 3d orthographic camera
commands.spawn_bundle(OrthographicCameraBundle::new_3d())
// new 3d orthographic camera
commands.spawn_bundle(Camera3dBundle {
projection: OrthographicProjection {
scale: 3.0,
scaling_mode: ScalingMode::FixedVertical,
..default()
}.into(),
..default()
})
```
Note that `Camera3dBundle` now uses a new `Projection` enum instead of hard coding the projection into the type. There are a number of motivators for this change: the render graph is now a part of the bundle, the way "generic bundles" work in the rust type system prevents nice `..default()` syntax, and changing projections at runtime is much easier with an enum (ex for editor scenarios). I'm open to discussing this choice, but I'm relatively certain we will all come to the same conclusion here. Camera2dBundle and Camera3dBundle are much clearer than being generic on marker components / using non-default constructors.
If you want to run a custom render graph on a camera, just set the `CameraRenderGraph` component:
```rust
commands.spawn_bundle(Camera3dBundle {
camera_render_graph: CameraRenderGraph::new(some_render_graph_name),
..default()
})
```
Just note that if the graph requires data from specific components to work (such as `Camera3d` config, which is provided in the `Camera3dBundle`), make sure the relevant components have been added.
Speaking of using components to configure graphs / passes, there are a number of new configuration options:
```rust
commands.spawn_bundle(Camera3dBundle {
camera_3d: Camera3d {
// overrides the default global clear color
clear_color: ClearColorConfig::Custom(Color::RED),
..default()
},
..default()
})
commands.spawn_bundle(Camera3dBundle {
camera_3d: Camera3d {
// disables clearing
clear_color: ClearColorConfig::None,
..default()
},
..default()
})
```
Expect to see more of the "graph configuration Components on Cameras" pattern in the future.
By popular demand, UI no longer requires a dedicated camera. `UiCameraBundle` has been removed. `Camera2dBundle` and `Camera3dBundle` now both default to rendering UI as part of their own render graphs. To disable UI rendering for a camera, disable it using the CameraUi component:
```rust
commands
.spawn_bundle(Camera3dBundle::default())
.insert(CameraUi {
is_enabled: false,
..default()
})
```
## Other Changes
* The separate clear pass has been removed. We should revisit this for things like sky rendering, but I think this PR should "keep it simple" until we're ready to properly support that (for code complexity and performance reasons). We can come up with the right design for a modular clear pass in a followup pr.
* I reorganized bevy_core_pipeline into Core2dPlugin and Core3dPlugin (and core_2d / core_3d modules). Everything is pretty much the same as before, just logically separate. I've moved relevant types (like Camera2d, Camera3d, Camera3dBundle, Camera2dBundle) into their relevant modules, which is what motivated this reorganization.
* I adapted the `scene_viewer` example (which relied on the ActiveCameras behavior) to the new system. I also refactored bits and pieces to be a bit simpler.
* All of the examples have been ported to the new camera approach. `render_to_texture` and `multiple_windows` are now _much_ simpler. I removed `two_passes` because it is less relevant with the new approach. If someone wants to add a new "layered custom pass with CameraRenderGraph" example, that might fill a similar niche. But I don't feel much pressure to add that in this pr.
* Cameras now have `target_logical_size` and `target_physical_size` fields, which makes finding the size of a camera's render target _much_ simpler. As a result, the `Assets<Image>` and `Windows` parameters were removed from `Camera::world_to_screen`, making that operation much more ergonomic.
* Render order ambiguities between cameras with the same target and the same priority now produce a warning. This accomplishes two goals:
1. Now that there is no "global" active camera, by default spawning two cameras will result in two renders (one covering the other). This would be a silent performance killer that would be hard to detect after the fact. By detecting ambiguities, we can provide a helpful warning when this occurs.
2. Render order ambiguities could result in unexpected / unpredictable render results. Resolving them makes sense.
## Follow Up Work
* Per-Camera viewports, which will make it possible to render to a smaller area inside of a RenderTarget (great for something like splitscreen)
* Camera-specific MSAA config (should use the same "overriding" pattern used for ClearColor)
* Graph Based Camera Ordering: priorities are simple, but they make complicated ordering constraints harder to express. We should consider adopting a "graph based" camera ordering model with "before" and "after" relationships to other cameras (or build it "on top" of the priority system).
* Consider allowing graphs to run subgraphs from any nest level (aka a global namespace for graphs). Right now the 2d and 3d graphs each need their own UI subgraph, which feels "fine" in the short term. But being able to share subgraphs between other subgraphs seems valuable.
* Consider splitting `bevy_core_pipeline` into `bevy_core_2d` and `bevy_core_3d` packages. Theres a shared "clear color" dependency here, which would need a new home.
2022-06-02 00:12:17 +00:00
|
|
|
use bevy_core_pipeline::core_3d::Transparent3d;
|
2023-03-02 22:44:10 +00:00
|
|
|
use bevy_ecs::prelude::*;
|
2023-01-25 12:35:39 +00:00
|
|
|
use bevy_math::{Mat4, UVec3, UVec4, Vec2, Vec3, Vec3Swizzles, Vec4, Vec4Swizzles};
|
2021-12-14 03:58:23 +00:00
|
|
|
use bevy_render::{
|
2023-01-25 12:35:39 +00:00
|
|
|
camera::Camera,
|
2021-06-02 02:59:17 +00:00
|
|
|
color::Color,
|
2023-03-02 08:21:21 +00:00
|
|
|
mesh::Mesh,
|
2021-06-26 22:35:07 +00:00
|
|
|
render_asset::RenderAssets,
|
Make render graph slots optional for most cases (#8109)
# Objective
- Currently, the render graph slots are only used to pass the
view_entity around. This introduces significant boilerplate for very
little value. Instead of using slots for this, make the view_entity part
of the `RenderGraphContext`. This also means we won't need to have
`IN_VIEW` on every node and and we'll be able to use the default impl of
`Node::input()`.
## Solution
- Add `view_entity: Option<Entity>` to the `RenderGraphContext`
- Update all nodes to use this instead of entity slot input
---
## Changelog
- Add optional `view_entity` to `RenderGraphContext`
## Migration Guide
You can now get the view_entity directly from the `RenderGraphContext`.
When implementing the Node:
```rust
// 0.10
struct FooNode;
impl FooNode {
const IN_VIEW: &'static str = "view";
}
impl Node for FooNode {
fn input(&self) -> Vec<SlotInfo> {
vec![SlotInfo::new(Self::IN_VIEW, SlotType::Entity)]
}
fn run(
&self,
graph: &mut RenderGraphContext,
// ...
) -> Result<(), NodeRunError> {
let view_entity = graph.get_input_entity(Self::IN_VIEW)?;
// ...
Ok(())
}
}
// 0.11
struct FooNode;
impl Node for FooNode {
fn run(
&self,
graph: &mut RenderGraphContext,
// ...
) -> Result<(), NodeRunError> {
let view_entity = graph.view_entity();
// ...
Ok(())
}
}
```
When adding the node to the graph, you don't need to specify a slot_edge
for the view_entity.
```rust
// 0.10
let mut graph = RenderGraph::default();
graph.add_node(FooNode::NAME, node);
let input_node_id = draw_2d_graph.set_input(vec![SlotInfo::new(
graph::input::VIEW_ENTITY,
SlotType::Entity,
)]);
graph.add_slot_edge(
input_node_id,
graph::input::VIEW_ENTITY,
FooNode::NAME,
FooNode::IN_VIEW,
);
// add_node_edge ...
// 0.11
let mut graph = RenderGraph::default();
graph.add_node(FooNode::NAME, node);
// add_node_edge ...
```
## Notes
This PR paired with #8007 will help reduce a lot of annoying boilerplate
with the render nodes. Depending on which one gets merged first. It will
require a bit of clean up work to make both compatible.
I tagged this as a breaking change, because using the old system to get
the view_entity will break things because it's not a node input slot
anymore.
## Notes for reviewers
A lot of the diffs are just removing the slots in every nodes and graph
creation. The important part is mostly in the
graph_runner/CameraDriverNode.
2023-03-21 20:11:13 +00:00
|
|
|
render_graph::{Node, NodeRunError, RenderGraphContext},
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
render_phase::{
|
2023-03-02 22:44:10 +00:00
|
|
|
CachedRenderPipelinePhaseItem, DrawFunctionId, DrawFunctions, PhaseItem, RenderPhase,
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
},
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
render_resource::*,
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
renderer::{RenderContext, RenderDevice, RenderQueue},
|
2021-06-02 02:59:17 +00:00
|
|
|
texture::*,
|
2023-03-02 22:44:10 +00:00
|
|
|
view::{ComputedVisibility, ExtractedView, VisibleEntities},
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
Extract,
|
2021-06-02 02:59:17 +00:00
|
|
|
};
|
2022-07-16 00:51:12 +00:00
|
|
|
use bevy_transform::{components::GlobalTransform, prelude::Transform};
|
Mesh vertex buffer layouts (#3959)
This PR makes a number of changes to how meshes and vertex attributes are handled, which the goal of enabling easy and flexible custom vertex attributes:
* Reworks the `Mesh` type to use the newly added `VertexAttribute` internally
* `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat`
* `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting"
* Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently.
* Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key).
* Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally) . Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because its wrong! Hashes by themselves aren't enough because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR.
* Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool!
To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now:
```rust
impl SpecializedMeshPipeline for MeshPipeline {
type Key = MeshPipelineKey;
fn specialize(
&self,
key: Self::Key,
layout: &MeshVertexBufferLayout,
) -> RenderPipelineDescriptor {
let mut vertex_attributes = vec![
Mesh::ATTRIBUTE_POSITION.at_shader_location(0),
Mesh::ATTRIBUTE_NORMAL.at_shader_location(1),
Mesh::ATTRIBUTE_UV_0.at_shader_location(2),
];
let mut shader_defs = Vec::new();
if layout.contains(Mesh::ATTRIBUTE_TANGENT) {
shader_defs.push(String::from("VERTEX_TANGENTS"));
vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3));
}
let vertex_buffer_layout = layout
.get_layout(&vertex_attributes)
.expect("Mesh is missing a vertex attribute");
```
Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex postions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`.
This is still a draft because I still need to:
* Add more docs
* Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it.
* Consider breaking out the PreHash / hashbrown changes into a separate PR.
* Add an example illustrating this change
* Verify that the "mesh-specialized pipeline de-duplication code" works properly
Please dont yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap.
Alternative to #3120
Fixes #3030
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
|
|
|
use bevy_utils::{
|
|
|
|
tracing::{error, warn},
|
|
|
|
HashMap,
|
|
|
|
};
|
2023-04-26 15:34:23 +00:00
|
|
|
use std::{hash::Hash, num::NonZeroU64};
|
2021-06-02 02:59:17 +00:00
|
|
|
|
Migrate engine to Schedule v3 (#7267)
Huge thanks to @maniwani, @devil-ira, @hymm, @cart, @superdump and @jakobhellermann for the help with this PR.
# Objective
- Followup #6587.
- Minimal integration for the Stageless Scheduling RFC: https://github.com/bevyengine/rfcs/pull/45
## Solution
- [x] Remove old scheduling module
- [x] Migrate new methods to no longer use extension methods
- [x] Fix compiler errors
- [x] Fix benchmarks
- [x] Fix examples
- [x] Fix docs
- [x] Fix tests
## Changelog
### Added
- a large number of methods on `App` to work with schedules ergonomically
- the `CoreSchedule` enum
- `App::add_extract_system` via the `RenderingAppExtension` trait extension method
- the private `prepare_view_uniforms` system now has a public system set for scheduling purposes, called `ViewSet::PrepareUniforms`
### Removed
- stages, and all code that mentions stages
- states have been dramatically simplified, and no longer use a stack
- `RunCriteriaLabel`
- `AsSystemLabel` trait
- `on_hierarchy_reports_enabled` run criteria (now just uses an ad hoc resource checking run condition)
- systems in `RenderSet/Stage::Extract` no longer warn when they do not read data from the main world
- `RunCriteriaLabel`
- `transform_propagate_system_set`: this was a nonstandard pattern that didn't actually provide enough control. The systems are already `pub`: the docs have been updated to ensure that the third-party usage is clear.
### Changed
- `System::default_labels` is now `System::default_system_sets`.
- `App::add_default_labels` is now `App::add_default_sets`
- `CoreStage` and `StartupStage` enums are now `CoreSet` and `StartupSet`
- `App::add_system_set` was renamed to `App::add_systems`
- The `StartupSchedule` label is now defined as part of the `CoreSchedules` enum
- `.label(SystemLabel)` is now referred to as `.in_set(SystemSet)`
- `SystemLabel` trait was replaced by `SystemSet`
- `SystemTypeIdLabel<T>` was replaced by `SystemSetType<T>`
- The `ReportHierarchyIssue` resource now has a public constructor (`new`), and implements `PartialEq`
- Fixed time steps now use a schedule (`CoreSchedule::FixedTimeStep`) rather than a run criteria.
- Adding rendering extraction systems now panics rather than silently failing if no subapp with the `RenderApp` label is found.
- the `calculate_bounds` system, with the `CalculateBounds` label, is now in `CoreSet::Update`, rather than in `CoreSet::PostUpdate` before commands are applied.
- `SceneSpawnerSystem` now runs under `CoreSet::Update`, rather than `CoreStage::PreUpdate.at_end()`.
- `bevy_pbr::add_clusters` is no longer an exclusive system
- the top level `bevy_ecs::schedule` module was replaced with `bevy_ecs::scheduling`
- `tick_global_task_pools_on_main_thread` is no longer run as an exclusive system. Instead, it has been replaced by `tick_global_task_pools`, which uses a `NonSend` resource to force running on the main thread.
## Migration Guide
- Calls to `.label(MyLabel)` should be replaced with `.in_set(MySet)`
- Stages have been removed. Replace these with system sets, and then add command flushes using the `apply_system_buffers` exclusive system where needed.
- The `CoreStage`, `StartupStage, `RenderStage` and `AssetStage` enums have been replaced with `CoreSet`, `StartupSet, `RenderSet` and `AssetSet`. The same scheduling guarantees have been preserved.
- Systems are no longer added to `CoreSet::Update` by default. Add systems manually if this behavior is needed, although you should consider adding your game logic systems to `CoreSchedule::FixedTimestep` instead for more reliable framerate-independent behavior.
- Similarly, startup systems are no longer part of `StartupSet::Startup` by default. In most cases, this won't matter to you.
- For example, `add_system_to_stage(CoreStage::PostUpdate, my_system)` should be replaced with
- `add_system(my_system.in_set(CoreSet::PostUpdate)`
- When testing systems or otherwise running them in a headless fashion, simply construct and run a schedule using `Schedule::new()` and `World::run_schedule` rather than constructing stages
- Run criteria have been renamed to run conditions. These can now be combined with each other and with states.
- Looping run criteria and state stacks have been removed. Use an exclusive system that runs a schedule if you need this level of control over system control flow.
- For app-level control flow over which schedules get run when (such as for rollback networking), create your own schedule and insert it under the `CoreSchedule::Outer` label.
- Fixed timesteps are now evaluated in a schedule, rather than controlled via run criteria. The `run_fixed_timestep` system runs this schedule between `CoreSet::First` and `CoreSet::PreUpdate` by default.
- Command flush points introduced by `AssetStage` have been removed. If you were relying on these, add them back manually.
- Adding extract systems is now typically done directly on the main app. Make sure the `RenderingAppExtension` trait is in scope, then call `app.add_extract_system(my_system)`.
- the `calculate_bounds` system, with the `CalculateBounds` label, is now in `CoreSet::Update`, rather than in `CoreSet::PostUpdate` before commands are applied. You may need to order your movement systems to occur before this system in order to avoid system order ambiguities in culling behavior.
- the `RenderLabel` `AppLabel` was renamed to `RenderApp` for clarity
- `App::add_state` now takes 0 arguments: the starting state is set based on the `Default` impl.
- Instead of creating `SystemSet` containers for systems that run in stages, simply use `.on_enter::<State::Variant>()` or its `on_exit` or `on_update` siblings.
- `SystemLabel` derives should be replaced with `SystemSet`. You will also need to add the `Debug`, `PartialEq`, `Eq`, and `Hash` traits to satisfy the new trait bounds.
- `with_run_criteria` has been renamed to `run_if`. Run criteria have been renamed to run conditions for clarity, and should now simply return a bool.
- States have been dramatically simplified: there is no longer a "state stack". To queue a transition to the next state, call `NextState::set`
## TODO
- [x] remove dead methods on App and World
- [x] add `App::add_system_to_schedule` and `App::add_systems_to_schedule`
- [x] avoid adding the default system set at inappropriate times
- [x] remove any accidental cycles in the default plugins schedule
- [x] migrate benchmarks
- [x] expose explicit labels for the built-in command flush points
- [x] migrate engine code
- [x] remove all mentions of stages from the docs
- [x] verify docs for States
- [x] fix uses of exclusive systems that use .end / .at_start / .before_commands
- [x] migrate RenderStage and AssetStage
- [x] migrate examples
- [x] ensure that transform propagation is exported in a sufficiently public way (the systems are already pub)
- [x] ensure that on_enter schedules are run at least once before the main app
- [x] re-enable opt-in to execution order ambiguities
- [x] revert change to `update_bounds` to ensure it runs in `PostUpdate`
- [x] test all examples
- [x] unbreak directional lights
- [x] unbreak shadows (see 3d_scene, 3d_shape, lighting, transparaency_3d examples)
- [x] game menu example shows loading screen and menu simultaneously
- [x] display settings menu is a blank screen
- [x] `without_winit` example panics
- [x] ensure all tests pass
- [x] SubApp doc test fails
- [x] runs_spawn_local tasks fails
- [x] [Fix panic_when_hierachy_cycle test hanging](https://github.com/alice-i-cecile/bevy/pull/120)
## Points of Difficulty and Controversy
**Reviewers, please give feedback on these and look closely**
1. Default sets, from the RFC, have been removed. These added a tremendous amount of implicit complexity and result in hard to debug scheduling errors. They're going to be tackled in the form of "base sets" by @cart in a followup.
2. The outer schedule controls which schedule is run when `App::update` is called.
3. I implemented `Label for `Box<dyn Label>` for our label types. This enables us to store schedule labels in concrete form, and then later run them. I ran into the same set of problems when working with one-shot systems. We've previously investigated this pattern in depth, and it does not appear to lead to extra indirection with nested boxes.
4. `SubApp::update` simply runs the default schedule once. This sucks, but this whole API is incomplete and this was the minimal changeset.
5. `time_system` and `tick_global_task_pools_on_main_thread` no longer use exclusive systems to attempt to force scheduling order
6. Implemetnation strategy for fixed timesteps
7. `AssetStage` was migrated to `AssetSet` without reintroducing command flush points. These did not appear to be used, and it's nice to remove these bottlenecks.
8. Migration of `bevy_render/lib.rs` and pipelined rendering. The logic here is unusually tricky, as we have complex scheduling requirements.
## Future Work (ideally before 0.10)
- Rename schedule_v3 module to schedule or scheduling
- Add a derive macro to states, and likely a `EnumIter` trait of some form
- Figure out what exactly to do with the "systems added should basically work by default" problem
- Improve ergonomics for working with fixed timesteps and states
- Polish FixedTime API to match Time
- Rebase and merge #7415
- Resolve all internal ambiguities (blocked on better tools, especially #7442)
- Add "base sets" to replace the removed default sets.
2023-02-06 02:04:50 +00:00
|
|
|
#[derive(Debug, Hash, PartialEq, Eq, Clone, SystemSet)]
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
pub enum RenderLightSystems {
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
ExtractClusters,
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
ExtractLights,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
PrepareClusters,
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
PrepareLights,
|
|
|
|
QueueShadows,
|
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
2021-06-02 02:59:17 +00:00
|
|
|
pub struct ExtractedPointLight {
|
|
|
|
color: Color,
|
bevy_pbr2: Improve lighting units and documentation (#2704)
# Objective
A question was raised on Discord about the units of the `PointLight` `intensity` member.
After digging around in the bevy_pbr2 source code and [Google Filament documentation](https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower) I discovered that the intention by Filament was that the 'intensity' value for point lights would be in lumens. This makes a lot of sense as these are quite relatable units given basically all light bulbs I've seen sold over the past years are rated in lumens as people move away from thinking about how bright a bulb is relative to a non-halogen incandescent bulb.
However, it seems that the derivation of the conversion between luminous power (lumens, denoted `Φ` in the Filament formulae) and luminous intensity (lumens per steradian, `I` in the Filament formulae) was missed and I can see why as it is tucked right under equation 58 at the link above. As such, while the formula states that for a point light, `I = Φ / 4 π` we have been using `intensity` as if it were luminous intensity `I`.
Before this PR, the intensity field is luminous intensity in lumens per steradian. After this PR, the intensity field is luminous power in lumens, [as suggested by Filament](https://google.github.io/filament/Filament.html#table_lighttypesunits) (unfortunately the link jumps to the table's caption so scroll up to see the actual table).
I appreciate that it may be confusing to call this an intensity, but I think this is intended as more of a non-scientific, human-relatable general term with a bit of hand waving so that most light types can just have an intensity field and for most of them it works in the same way or at least with some relatable value. I'm inclined to think this is reasonable rather than throwing terms like luminous power, luminous intensity, blah at users.
## Solution
- Documented the `PointLight` `intensity` member as 'luminous power' in units of lumens.
- Added a table of examples relating from various types of household lighting to lumen values.
- Added in the mapping from luminous power to luminous intensity when premultiplying the intensity into the colour before it is made into a graphics uniform.
- Updated the documentation in `pbr.wgsl` to clarify the earlier confusion about the missing `/ 4 π`.
- Bumped the intensity of the point lights in `3d_scene_pipelined` to 1600 lumens.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2021-08-23 23:48:11 +00:00
|
|
|
/// luminous intensity in lumens per steradian
|
2021-06-02 02:59:17 +00:00
|
|
|
intensity: f32,
|
|
|
|
range: f32,
|
|
|
|
radius: f32,
|
|
|
|
transform: GlobalTransform,
|
2021-11-19 21:16:58 +00:00
|
|
|
shadows_enabled: bool,
|
2021-07-16 22:41:56 +00:00
|
|
|
shadow_depth_bias: f32,
|
|
|
|
shadow_normal_bias: f32,
|
2022-07-08 19:57:43 +00:00
|
|
|
spot_light_angles: Option<(f32, f32)>,
|
2021-07-08 02:49:33 +00:00
|
|
|
}
|
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
#[derive(Component, Debug)]
|
2021-07-08 02:49:33 +00:00
|
|
|
pub struct ExtractedDirectionalLight {
|
|
|
|
color: Color,
|
|
|
|
illuminance: f32,
|
Take DirectionalLight's GlobalTransform into account when calculating shadow map volume (not just direction) (#6384)
# Objective
This PR fixes #5789, by enabling movable (and scalable) directional light shadow volumes.
## Solution
This PR changes `ExtractedDirectionalLight` to hold a copy of the `DirectionalLight` entity's `GlobalTransform`, instead of just a `direction` vector. This allows the shadow map volume (as defined by the light's `shadow_projection` field) to be transformed honoring translation _and_ scale transforms, and not just rotation.
It also augments the texel size calculation (used to determine the `shadow_normal_bias`) so that it now takes into account the upper bound of the x/y/z scale of the `GlobalTransform`.
This change makes the directional light extraction code more consistent with point and spot lights (that already use `transform`), and allows easily moving and scaling the shadow volume along with a player entity based on camera distance/angle, immediately enabling more real world use cases until we have a more sophisticated adaptive implementation, such as the one described in #3629.
**Note:** While it was previously possible to update the projection achieving a similar effect, depending on the light direction and distance to the origin, the fact that the shadow map camera was always positioned at the origin with a hardcoded `Vec3::Y` up value meant you would get sub-optimal or inconsistent/incorrect results.
---
## Changelog
### Changed
- `DirectionalLight` shadow volumes now honor translation and scale transforms
## Migration Guide
- If your directional lights were positioned at the origin and not scaled (the default, most common scenario) no changes are needed on your part; it just works as before;
- If you previously had a system for dynamically updating directional light shadow projections, you might now be able to simplify your code by updating the directional light entity's transform instead;
- In the unlikely scenario that a scene with directional lights that previously rendered shadows correctly has missing shadows, make sure your directional lights are positioned at (0, 0, 0) and are not scaled to a size that's too large or too small.
2022-11-04 20:12:26 +00:00
|
|
|
transform: GlobalTransform,
|
2021-11-19 21:16:58 +00:00
|
|
|
shadows_enabled: bool,
|
2021-07-16 22:41:56 +00:00
|
|
|
shadow_depth_bias: f32,
|
|
|
|
shadow_normal_bias: f32,
|
2023-01-25 12:35:39 +00:00
|
|
|
cascade_shadow_config: CascadeShadowConfig,
|
|
|
|
cascades: HashMap<Entity, Vec<Cascade>>,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
#[derive(Copy, Clone, ShaderType, Default, Debug)]
|
2021-07-08 02:49:33 +00:00
|
|
|
pub struct GpuPointLight {
|
2022-07-08 19:57:43 +00:00
|
|
|
// For point lights: the lower-right 2x2 values of the projection matrix [2][2] [2][3] [3][2] [3][3]
|
|
|
|
// For spot lights: 2 components of the direction (x,z), spot_scale and spot_offset
|
|
|
|
light_custom_data: Vec4,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
color_inverse_square_range: Vec4,
|
|
|
|
position_radius: Vec4,
|
2021-11-19 21:16:58 +00:00
|
|
|
flags: u32,
|
2021-07-16 22:41:56 +00:00
|
|
|
shadow_depth_bias: f32,
|
|
|
|
shadow_normal_bias: f32,
|
2022-07-08 19:57:43 +00:00
|
|
|
spot_light_tan_angle: f32,
|
2021-07-08 02:49:33 +00:00
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
#[derive(ShaderType)]
|
|
|
|
pub struct GpuPointLightsUniform {
|
|
|
|
data: Box<[GpuPointLight; MAX_UNIFORM_BUFFER_POINT_LIGHTS]>,
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Default for GpuPointLightsUniform {
|
|
|
|
fn default() -> Self {
|
|
|
|
Self {
|
|
|
|
data: Box::new([GpuPointLight::default(); MAX_UNIFORM_BUFFER_POINT_LIGHTS]),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(ShaderType, Default)]
|
|
|
|
pub struct GpuPointLightsStorage {
|
|
|
|
#[size(runtime)]
|
|
|
|
data: Vec<GpuPointLight>,
|
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
pub enum GpuPointLights {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
Uniform(UniformBuffer<GpuPointLightsUniform>),
|
|
|
|
Storage(StorageBuffer<GpuPointLightsStorage>),
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
impl GpuPointLights {
|
|
|
|
fn new(buffer_binding_type: BufferBindingType) -> Self {
|
|
|
|
match buffer_binding_type {
|
|
|
|
BufferBindingType::Storage { .. } => Self::storage(),
|
|
|
|
BufferBindingType::Uniform => Self::uniform(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn uniform() -> Self {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
Self::Uniform(UniformBuffer::default())
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
fn storage() -> Self {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
Self::Storage(StorageBuffer::default())
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
fn set(&mut self, mut lights: Vec<GpuPointLight>) {
|
2022-04-07 16:16:35 +00:00
|
|
|
match self {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
GpuPointLights::Uniform(buffer) => {
|
|
|
|
let len = lights.len().min(MAX_UNIFORM_BUFFER_POINT_LIGHTS);
|
|
|
|
let src = &lights[..len];
|
|
|
|
let dst = &mut buffer.get_mut().data[..len];
|
|
|
|
dst.copy_from_slice(src);
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
GpuPointLights::Storage(buffer) => {
|
|
|
|
buffer.get_mut().data.clear();
|
|
|
|
buffer.get_mut().data.append(&mut lights);
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn write_buffer(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
|
|
|
|
match self {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
GpuPointLights::Uniform(buffer) => buffer.write_buffer(render_device, render_queue),
|
|
|
|
GpuPointLights::Storage(buffer) => buffer.write_buffer(render_device, render_queue),
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn binding(&self) -> Option<BindingResource> {
|
|
|
|
match self {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
GpuPointLights::Uniform(buffer) => buffer.binding(),
|
|
|
|
GpuPointLights::Storage(buffer) => buffer.binding(),
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
pub fn min_size(buffer_binding_type: BufferBindingType) -> NonZeroU64 {
|
|
|
|
match buffer_binding_type {
|
|
|
|
BufferBindingType::Storage { .. } => GpuPointLightsStorage::min_size(),
|
|
|
|
BufferBindingType::Uniform => GpuPointLightsUniform::min_size(),
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
2022-09-27 17:51:12 +00:00
|
|
|
// NOTE: These must match the bit flags in bevy_pbr/src/render/mesh_view_types.wgsl!
|
2021-11-19 21:16:58 +00:00
|
|
|
bitflags::bitflags! {
|
|
|
|
#[repr(transparent)]
|
|
|
|
struct PointLightFlags: u32 {
|
|
|
|
const SHADOWS_ENABLED = (1 << 0);
|
2022-07-08 19:57:43 +00:00
|
|
|
const SPOT_LIGHT_Y_NEGATIVE = (1 << 1);
|
2021-11-19 21:16:58 +00:00
|
|
|
const NONE = 0;
|
|
|
|
const UNINITIALIZED = 0xFFFF;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
#[derive(Copy, Clone, ShaderType, Default, Debug)]
|
2023-01-25 12:35:39 +00:00
|
|
|
pub struct GpuDirectionalCascade {
|
2021-07-08 02:49:33 +00:00
|
|
|
view_projection: Mat4,
|
2023-01-25 12:35:39 +00:00
|
|
|
texel_size: f32,
|
|
|
|
far_bound: f32,
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(Copy, Clone, ShaderType, Default, Debug)]
|
|
|
|
pub struct GpuDirectionalLight {
|
|
|
|
cascades: [GpuDirectionalCascade; MAX_CASCADES_PER_LIGHT],
|
2021-07-08 02:49:33 +00:00
|
|
|
color: Vec4,
|
|
|
|
dir_to_light: Vec3,
|
2021-11-19 21:16:58 +00:00
|
|
|
flags: u32,
|
2021-07-16 22:41:56 +00:00
|
|
|
shadow_depth_bias: f32,
|
|
|
|
shadow_normal_bias: f32,
|
2023-01-25 12:35:39 +00:00
|
|
|
num_cascades: u32,
|
|
|
|
cascades_overlap_proportion: f32,
|
|
|
|
depth_texture_base_index: u32,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2022-09-27 17:51:12 +00:00
|
|
|
// NOTE: These must match the bit flags in bevy_pbr/src/render/mesh_view_types.wgsl!
|
2021-11-19 21:16:58 +00:00
|
|
|
bitflags::bitflags! {
|
|
|
|
#[repr(transparent)]
|
|
|
|
struct DirectionalLightFlags: u32 {
|
|
|
|
const SHADOWS_ENABLED = (1 << 0);
|
|
|
|
const NONE = 0;
|
|
|
|
const UNINITIALIZED = 0xFFFF;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
#[derive(Copy, Clone, Debug, ShaderType)]
|
2021-06-02 02:59:17 +00:00
|
|
|
pub struct GpuLights {
|
2021-07-08 02:49:33 +00:00
|
|
|
directional_lights: [GpuDirectionalLight; MAX_DIRECTIONAL_LIGHTS],
|
2021-06-27 23:10:23 +00:00
|
|
|
ambient_color: Vec4,
|
2022-01-07 21:25:59 +00:00
|
|
|
// xyz are x/y/z cluster dimensions and w is the number of clusters
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
cluster_dimensions: UVec4,
|
|
|
|
// xy are vec2<f32>(cluster_dimensions.xy) / vec2<f32>(view.width, view.height)
|
|
|
|
// z is cluster_dimensions.z / log(far / near)
|
|
|
|
// w is cluster_dimensions.z * log(near) / log(far / near)
|
|
|
|
cluster_factors: Vec4,
|
2021-07-08 02:49:33 +00:00
|
|
|
n_directional_lights: u32,
|
2022-07-08 19:57:43 +00:00
|
|
|
// offset from spot light's light index to spot light's shadow map index
|
|
|
|
spot_light_shadowmap_offset: i32,
|
2023-02-20 00:02:40 +00:00
|
|
|
environment_map_smallest_specular_mip_level: u32,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2021-07-08 02:49:33 +00:00
|
|
|
// NOTE: this must be kept in sync with the same constants in pbr.frag
|
2022-04-07 16:16:35 +00:00
|
|
|
pub const MAX_UNIFORM_BUFFER_POINT_LIGHTS: usize = 256;
|
2022-11-03 07:09:51 +00:00
|
|
|
pub const MAX_DIRECTIONAL_LIGHTS: usize = 10;
|
2023-01-25 12:35:39 +00:00
|
|
|
#[cfg(not(feature = "webgl"))]
|
|
|
|
pub const MAX_CASCADES_PER_LIGHT: usize = 4;
|
|
|
|
#[cfg(feature = "webgl")]
|
|
|
|
pub const MAX_CASCADES_PER_LIGHT: usize = 1;
|
2021-06-02 02:59:17 +00:00
|
|
|
pub const SHADOW_FORMAT: TextureFormat = TextureFormat::Depth32Float;
|
|
|
|
|
2023-01-14 18:33:38 +00:00
|
|
|
#[derive(Resource, Clone)]
|
2023-03-02 08:21:21 +00:00
|
|
|
pub struct ShadowSamplers {
|
2021-07-08 02:49:33 +00:00
|
|
|
pub point_light_sampler: Sampler,
|
|
|
|
pub directional_light_sampler: Sampler,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// TODO: this pattern for initializing the shaders / pipeline isn't ideal. this should be handled by the asset system
|
2023-03-02 08:21:21 +00:00
|
|
|
impl FromWorld for ShadowSamplers {
|
2021-06-02 02:59:17 +00:00
|
|
|
fn from_world(world: &mut World) -> Self {
|
2022-04-25 23:19:13 +00:00
|
|
|
let render_device = world.resource::<RenderDevice>();
|
2021-06-21 23:28:52 +00:00
|
|
|
|
2023-03-02 08:21:21 +00:00
|
|
|
ShadowSamplers {
|
2021-11-04 21:47:57 +00:00
|
|
|
point_light_sampler: render_device.create_sampler(&SamplerDescriptor {
|
|
|
|
address_mode_u: AddressMode::ClampToEdge,
|
|
|
|
address_mode_v: AddressMode::ClampToEdge,
|
|
|
|
address_mode_w: AddressMode::ClampToEdge,
|
|
|
|
mag_filter: FilterMode::Linear,
|
|
|
|
min_filter: FilterMode::Linear,
|
|
|
|
mipmap_filter: FilterMode::Nearest,
|
|
|
|
compare: Some(CompareFunction::GreaterEqual),
|
|
|
|
..Default::default()
|
|
|
|
}),
|
|
|
|
directional_light_sampler: render_device.create_sampler(&SamplerDescriptor {
|
|
|
|
address_mode_u: AddressMode::ClampToEdge,
|
|
|
|
address_mode_v: AddressMode::ClampToEdge,
|
|
|
|
address_mode_w: AddressMode::ClampToEdge,
|
|
|
|
mag_filter: FilterMode::Linear,
|
|
|
|
min_filter: FilterMode::Linear,
|
|
|
|
mipmap_filter: FilterMode::Nearest,
|
|
|
|
compare: Some(CompareFunction::GreaterEqual),
|
|
|
|
..Default::default()
|
|
|
|
}),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
#[derive(Component)]
|
|
|
|
pub struct ExtractedClusterConfig {
|
2022-01-07 21:25:59 +00:00
|
|
|
/// Special near value for cluster calculations
|
|
|
|
near: f32,
|
2022-03-08 04:56:42 +00:00
|
|
|
far: f32,
|
2022-07-12 13:06:16 +00:00
|
|
|
/// Number of clusters in `X` / `Y` / `Z` in the view frustum
|
2022-03-24 00:20:27 +00:00
|
|
|
dimensions: UVec3,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(Component)]
|
|
|
|
pub struct ExtractedClustersPointLights {
|
|
|
|
data: Vec<VisiblePointLights>,
|
|
|
|
}
|
|
|
|
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
pub fn extract_clusters(
|
|
|
|
mut commands: Commands,
|
|
|
|
views: Extract<Query<(Entity, &Clusters), With<Camera>>>,
|
|
|
|
) {
|
2022-09-20 00:29:10 +00:00
|
|
|
for (entity, clusters) in &views {
|
2022-09-21 21:47:53 +00:00
|
|
|
commands.get_or_spawn(entity).insert((
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
ExtractedClustersPointLights {
|
|
|
|
data: clusters.lights.clone(),
|
|
|
|
},
|
|
|
|
ExtractedClusterConfig {
|
2022-01-07 21:25:59 +00:00
|
|
|
near: clusters.near,
|
2022-03-08 04:56:42 +00:00
|
|
|
far: clusters.far,
|
2022-03-24 00:20:27 +00:00
|
|
|
dimensions: clusters.dimensions,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
},
|
|
|
|
));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
#[allow(clippy::too_many_arguments)]
|
2021-06-02 02:59:17 +00:00
|
|
|
pub fn extract_lights(
|
|
|
|
mut commands: Commands,
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
point_light_shadow_map: Extract<Res<PointLightShadowMap>>,
|
|
|
|
directional_light_shadow_map: Extract<Res<DirectionalLightShadowMap>>,
|
|
|
|
global_point_lights: Extract<Res<GlobalVisiblePointLights>>,
|
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310)
# Objective
Fixes #4907. Fixes #838. Fixes #5089.
Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114
Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities.
Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information.
## Solution
Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function.
Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate.
This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views.
Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice.
Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite.
![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png)
Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale.
## Follow up work
* Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?.
* Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset.
* Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc).
---
## Changelog
* ComputedVisibility now takes hierarchy visibility into account.
* 2D, UI and Light entities now use the ComputedVisibility component.
## Migration Guide
If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead:
```rust
// before (0.7)
fn system(query: Query<&Visibility>) {
for visibility in query.iter() {
if visibility.is_visible {
log!("found visible entity");
}
}
}
// after (0.8)
fn system(query: Query<&ComputedVisibility>) {
for visibility in query.iter() {
if visibility.is_visible() {
log!("found visible entity");
}
}
}
```
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
|
|
|
point_lights: Extract<
|
|
|
|
Query<(
|
|
|
|
&PointLight,
|
|
|
|
&CubemapVisibleEntities,
|
|
|
|
&GlobalTransform,
|
|
|
|
&ComputedVisibility,
|
|
|
|
)>,
|
|
|
|
>,
|
|
|
|
spot_lights: Extract<
|
|
|
|
Query<(
|
|
|
|
&SpotLight,
|
|
|
|
&VisibleEntities,
|
|
|
|
&GlobalTransform,
|
|
|
|
&ComputedVisibility,
|
|
|
|
)>,
|
|
|
|
>,
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
directional_lights: Extract<
|
|
|
|
Query<
|
|
|
|
(
|
|
|
|
Entity,
|
|
|
|
&DirectionalLight,
|
2023-01-25 12:35:39 +00:00
|
|
|
&CascadesVisibleEntities,
|
|
|
|
&Cascades,
|
|
|
|
&CascadeShadowConfig,
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
&GlobalTransform,
|
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310)
# Objective
Fixes #4907. Fixes #838. Fixes #5089.
Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114
Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities.
Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information.
## Solution
Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function.
Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate.
This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views.
Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice.
Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite.
![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png)
Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale.
## Follow up work
* Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?.
* Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset.
* Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc).
---
## Changelog
* ComputedVisibility now takes hierarchy visibility into account.
* 2D, UI and Light entities now use the ComputedVisibility component.
## Migration Guide
If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead:
```rust
// before (0.7)
fn system(query: Query<&Visibility>) {
for visibility in query.iter() {
if visibility.is_visible {
log!("found visible entity");
}
}
}
// after (0.8)
fn system(query: Query<&ComputedVisibility>) {
for visibility in query.iter() {
if visibility.is_visible() {
log!("found visible entity");
}
}
}
```
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
|
|
|
&ComputedVisibility,
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
),
|
|
|
|
Without<SpotLight>,
|
|
|
|
>,
|
2022-07-08 19:57:43 +00:00
|
|
|
>,
|
2022-04-07 16:16:35 +00:00
|
|
|
mut previous_point_lights_len: Local<usize>,
|
2022-07-08 19:57:43 +00:00
|
|
|
mut previous_spot_lights_len: Local<usize>,
|
2021-06-02 02:59:17 +00:00
|
|
|
) {
|
2022-05-30 18:36:03 +00:00
|
|
|
// NOTE: These shadow map resources are extracted here as they are used here too so this avoids
|
|
|
|
// races between scheduling of ExtractResourceSystems and this system.
|
|
|
|
if point_light_shadow_map.is_changed() {
|
|
|
|
commands.insert_resource(point_light_shadow_map.clone());
|
|
|
|
}
|
|
|
|
if directional_light_shadow_map.is_changed() {
|
|
|
|
commands.insert_resource(directional_light_shadow_map.clone());
|
|
|
|
}
|
2021-07-19 19:20:59 +00:00
|
|
|
// This is the point light shadow map texel size for one face of the cube as a distance of 1.0
|
|
|
|
// world unit from the light.
|
|
|
|
// point_light_texel_size = 2.0 * 1.0 * tan(PI / 4.0) / cube face width in texels
|
|
|
|
// PI / 4.0 is half the cube face fov, tan(PI / 4.0) = 1.0, so this simplifies to:
|
|
|
|
// point_light_texel_size = 2.0 / cube face width in texels
|
|
|
|
// NOTE: When using various PCF kernel sizes, this will need to be adjusted, according to:
|
|
|
|
// https://catlikecoding.com/unity/tutorials/custom-srp/point-and-spot-shadows/
|
2021-08-25 05:57:57 +00:00
|
|
|
let point_light_texel_size = 2.0 / point_light_shadow_map.size as f32;
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
let mut point_lights_values = Vec::with_capacity(*previous_point_lights_len);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
for entity in global_point_lights.iter().copied() {
|
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310)
# Objective
Fixes #4907. Fixes #838. Fixes #5089.
Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114
Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities.
Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information.
## Solution
Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function.
Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate.
This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views.
Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice.
Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite.
![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png)
Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale.
## Follow up work
* Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?.
* Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset.
* Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc).
---
## Changelog
* ComputedVisibility now takes hierarchy visibility into account.
* 2D, UI and Light entities now use the ComputedVisibility component.
## Migration Guide
If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead:
```rust
// before (0.7)
fn system(query: Query<&Visibility>) {
for visibility in query.iter() {
if visibility.is_visible {
log!("found visible entity");
}
}
}
// after (0.8)
fn system(query: Query<&ComputedVisibility>) {
for visibility in query.iter() {
if visibility.is_visible() {
log!("found visible entity");
}
}
}
```
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
|
|
|
if let Ok((point_light, cubemap_visible_entities, transform, visibility)) =
|
|
|
|
point_lights.get(entity)
|
|
|
|
{
|
|
|
|
if !visibility.is_visible() {
|
|
|
|
continue;
|
|
|
|
}
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
// TODO: This is very much not ideal. We should be able to re-use the vector memory.
|
|
|
|
// However, since exclusive access to the main world in extract is ill-advised, we just clone here.
|
|
|
|
let render_cubemap_visible_entities = cubemap_visible_entities.clone();
|
2022-04-07 16:16:35 +00:00
|
|
|
point_lights_values.push((
|
|
|
|
entity,
|
|
|
|
(
|
|
|
|
ExtractedPointLight {
|
|
|
|
color: point_light.color,
|
|
|
|
// NOTE: Map from luminous power in lumens to luminous intensity in lumens per steradian
|
|
|
|
// for a point light. See https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower
|
|
|
|
// for details.
|
|
|
|
intensity: point_light.intensity / (4.0 * std::f32::consts::PI),
|
|
|
|
range: point_light.range,
|
|
|
|
radius: point_light.radius,
|
|
|
|
transform: *transform,
|
|
|
|
shadows_enabled: point_light.shadows_enabled,
|
|
|
|
shadow_depth_bias: point_light.shadow_depth_bias,
|
|
|
|
// The factor of SQRT_2 is for the worst-case diagonal offset
|
|
|
|
shadow_normal_bias: point_light.shadow_normal_bias
|
|
|
|
* point_light_texel_size
|
|
|
|
* std::f32::consts::SQRT_2,
|
2022-07-08 19:57:43 +00:00
|
|
|
spot_light_angles: None,
|
2022-04-07 16:16:35 +00:00
|
|
|
},
|
|
|
|
render_cubemap_visible_entities,
|
|
|
|
),
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
));
|
|
|
|
}
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
2022-04-07 16:16:35 +00:00
|
|
|
*previous_point_lights_len = point_lights_values.len();
|
|
|
|
commands.insert_or_spawn_batch(point_lights_values);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
let mut spot_lights_values = Vec::with_capacity(*previous_spot_lights_len);
|
|
|
|
for entity in global_point_lights.iter().copied() {
|
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310)
# Objective
Fixes #4907. Fixes #838. Fixes #5089.
Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114
Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities.
Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information.
## Solution
Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function.
Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate.
This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views.
Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice.
Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite.
![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png)
Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale.
## Follow up work
* Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?.
* Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset.
* Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc).
---
## Changelog
* ComputedVisibility now takes hierarchy visibility into account.
* 2D, UI and Light entities now use the ComputedVisibility component.
## Migration Guide
If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead:
```rust
// before (0.7)
fn system(query: Query<&Visibility>) {
for visibility in query.iter() {
if visibility.is_visible {
log!("found visible entity");
}
}
}
// after (0.8)
fn system(query: Query<&ComputedVisibility>) {
for visibility in query.iter() {
if visibility.is_visible() {
log!("found visible entity");
}
}
}
```
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
|
|
|
if let Ok((spot_light, visible_entities, transform, visibility)) = spot_lights.get(entity) {
|
|
|
|
if !visibility.is_visible() {
|
|
|
|
continue;
|
|
|
|
}
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
// TODO: This is very much not ideal. We should be able to re-use the vector memory.
|
|
|
|
// However, since exclusive access to the main world in extract is ill-advised, we just clone here.
|
|
|
|
let render_visible_entities = visible_entities.clone();
|
2022-07-08 19:57:43 +00:00
|
|
|
let texel_size =
|
|
|
|
2.0 * spot_light.outer_angle.tan() / directional_light_shadow_map.size as f32;
|
|
|
|
|
|
|
|
spot_lights_values.push((
|
|
|
|
entity,
|
|
|
|
(
|
|
|
|
ExtractedPointLight {
|
|
|
|
color: spot_light.color,
|
|
|
|
// NOTE: Map from luminous power in lumens to luminous intensity in lumens per steradian
|
|
|
|
// for a point light. See https://google.github.io/filament/Filament.html#mjx-eqn-pointLightLuminousPower
|
|
|
|
// for details.
|
|
|
|
// Note: Filament uses a divisor of PI for spot lights. We choose to use the same 4*PI divisor
|
|
|
|
// in both cases so that toggling between point light and spot light keeps lit areas lit equally,
|
|
|
|
// which seems least surprising for users
|
|
|
|
intensity: spot_light.intensity / (4.0 * std::f32::consts::PI),
|
|
|
|
range: spot_light.range,
|
|
|
|
radius: spot_light.radius,
|
|
|
|
transform: *transform,
|
|
|
|
shadows_enabled: spot_light.shadows_enabled,
|
|
|
|
shadow_depth_bias: spot_light.shadow_depth_bias,
|
|
|
|
// The factor of SQRT_2 is for the worst-case diagonal offset
|
|
|
|
shadow_normal_bias: spot_light.shadow_normal_bias
|
|
|
|
* texel_size
|
|
|
|
* std::f32::consts::SQRT_2,
|
|
|
|
spot_light_angles: Some((spot_light.inner_angle, spot_light.outer_angle)),
|
|
|
|
},
|
|
|
|
render_visible_entities,
|
|
|
|
),
|
|
|
|
));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
*previous_spot_lights_len = spot_lights_values.len();
|
|
|
|
commands.insert_or_spawn_batch(spot_lights_values);
|
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
for (
|
|
|
|
entity,
|
|
|
|
directional_light,
|
|
|
|
visible_entities,
|
|
|
|
cascades,
|
|
|
|
cascade_config,
|
|
|
|
transform,
|
|
|
|
visibility,
|
|
|
|
) in directional_lights.iter()
|
2022-03-05 03:23:01 +00:00
|
|
|
{
|
Visibilty Inheritance, universal ComputedVisibility and RenderLayers support (#5310)
# Objective
Fixes #4907. Fixes #838. Fixes #5089.
Supersedes #5146. Supersedes #2087. Supersedes #865. Supersedes #5114
Visibility is currently entirely local. Set a parent entity to be invisible, and the children are still visible. This makes it hard for users to hide entire hierarchies of entities.
Additionally, the semantics of `Visibility` vs `ComputedVisibility` are inconsistent across entity types. 3D meshes use `ComputedVisibility` as the "definitive" visibility component, with `Visibility` being just one data source. Sprites just use `Visibility`, which means they can't feed off of `ComputedVisibility` data, such as culling information, RenderLayers, and (added in this pr) visibility inheritance information.
## Solution
Splits `ComputedVisibilty::is_visible` into `ComputedVisibilty::is_visible_in_view` and `ComputedVisibilty::is_visible_in_hierarchy`. For each visible entity, `is_visible_in_hierarchy` is computed by propagating visibility down the hierarchy. The `ComputedVisibility::is_visible()` function combines these two booleans for the canonical "is this entity visible" function.
Additionally, all entities that have `Visibility` now also have `ComputedVisibility`. Sprites, Lights, and UI entities now use `ComputedVisibility` when appropriate.
This means that in addition to visibility inheritance, everything using Visibility now also supports RenderLayers. Notably, Sprites (and other 2d objects) now support `RenderLayers` and work properly across multiple views.
Also note that this does increase the amount of work done per sprite. Bevymark with 100,000 sprites on `main` runs in `0.017612` seconds and this runs in `0.01902`. That is certainly a gap, but I believe the api consistency and extra functionality this buys us is worth it. See [this thread](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for more info. Note that #5146 in combination with #5114 _are_ a viable alternative to this PR and _would_ perform better, but that comes at the cost of api inconsistencies and doing visibility calculations in the "wrong" place. The current visibility system does have potential for performance improvements. I would prefer to evolve that one system as a whole rather than doing custom hacks / different behaviors for each feature slice.
Here is a "split screen" example where the left camera uses RenderLayers to filter out the blue sprite.
![image](https://user-images.githubusercontent.com/2694663/178814868-2e9a2173-bf8c-4c79-8815-633899d492c3.png)
Note that this builds directly on #5146 and that @james7132 deserves the credit for the baseline visibility inheritance work. This pr moves the inherited visibility field into `ComputedVisibility`, then does the additional work of porting everything to `ComputedVisibility`. See my [comments here](https://github.com/bevyengine/bevy/pull/5146#issuecomment-1182783452) for rationale.
## Follow up work
* Now that lights use ComputedVisibility, VisibleEntities now includes "visible lights" in the entity list. Functionally not a problem as we use queries to filter the list down in the desired context. But we should consider splitting this out into a separate`VisibleLights` collection for both clarity and performance reasons. And _maybe_ even consider scoping `VisibleEntities` down to `VisibleMeshes`?.
* Investigate alternative sprite rendering impls (in combination with visibility system tweaks) that avoid re-generating a per-view fixedbitset of visible entities every frame, then checking each ExtractedEntity. This is where most of the performance overhead lives. Ex: we could generate ExtractedEntities per-view using the VisibleEntities list, avoiding the need for the bitset.
* Should ComputedVisibility use bitflags under the hood? This would cut down on the size of the component, potentially speed up the `is_visible()` function, and allow us to cheaply expand ComputedVisibility with more data (ex: split out local visibility and parent visibility, add more culling classes, etc).
---
## Changelog
* ComputedVisibility now takes hierarchy visibility into account.
* 2D, UI and Light entities now use the ComputedVisibility component.
## Migration Guide
If you were previously reading `Visibility::is_visible` as the "actual visibility" for sprites or lights, use `ComputedVisibilty::is_visible()` instead:
```rust
// before (0.7)
fn system(query: Query<&Visibility>) {
for visibility in query.iter() {
if visibility.is_visible {
log!("found visible entity");
}
}
}
// after (0.8)
fn system(query: Query<&ComputedVisibility>) {
for visibility in query.iter() {
if visibility.is_visible() {
log!("found visible entity");
}
}
}
```
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-15 23:24:42 +00:00
|
|
|
if !visibility.is_visible() {
|
2022-03-05 03:23:01 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
Make `RenderStage::Extract` run on the render world (#4402)
# Objective
- Currently, the `Extract` `RenderStage` is executed on the main world, with the render world available as a resource.
- However, when needing access to resources in the render world (e.g. to mutate them), the only way to do so was to get exclusive access to the whole `RenderWorld` resource.
- This meant that effectively only one extract which wrote to resources could run at a time.
- We didn't previously make `Extract`ing writing to the world a non-happy path, even though we want to discourage that.
## Solution
- Move the extract stage to run on the render world.
- Add the main world as a `MainWorld` resource.
- Add an `Extract` `SystemParam` as a convenience to access a (read only) `SystemParam` in the main world during `Extract`.
## Future work
It should be possible to avoid needing to use `get_or_spawn` for the render commands, since now the `Commands`' `Entities` matches up with the world being executed on.
We need to determine how this interacts with https://github.com/bevyengine/bevy/pull/3519
It's theoretically possible to remove the need for the `value` method on `Extract`. However, that requires slightly changing the `SystemParam` interface, which would make it more complicated. That would probably mess up the `SystemState` api too.
## Todo
I still need to add doc comments to `Extract`.
---
## Changelog
### Changed
- The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase.
Resources on the render world can now be accessed using `ResMut` during extract.
### Removed
- `Commands::spawn_and_forget`. Use `Commands::get_or_spawn(e).insert_bundle(bundle)` instead
## Migration Guide
The `Extract` `RenderStage` now runs on the render world (instead of the main world as before).
You must use the `Extract` `SystemParam` to access the main world during the extract phase. `Extract` takes a single type parameter, which is any system parameter (such as `Res`, `Query` etc.). It will extract this from the main world, and returns the result of this extraction when `value` is called on it.
For example, if previously your extract system looked like:
```rust
fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
for cloud in clouds.iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
the new version would be:
```rust
fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
The diff is:
```diff
--- a/src/clouds.rs
+++ b/src/clouds.rs
@@ -1,5 +1,5 @@
-fn extract_clouds(mut commands: Commands, clouds: Query<Entity, With<Cloud>>) {
- for cloud in clouds.iter() {
+fn extract_clouds(mut commands: Commands, mut clouds: Extract<Query<Entity, With<Cloud>>>) {
+ for cloud in clouds.value().iter() {
commands.get_or_spawn(cloud).insert(Cloud);
}
}
```
You can now also access resources from the render world using the normal system parameters during `Extract`:
```rust
fn extract_assets(mut render_assets: ResMut<MyAssets>, source_assets: Extract<Res<MyAssets>>) {
*render_assets = source_assets.clone();
}
```
Please note that all existing extract systems need to be updated to match this new style; even if they currently compile they will not run as expected. A warning will be emitted on a best-effort basis if this is not met.
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-07-08 23:56:33 +00:00
|
|
|
// TODO: As above
|
|
|
|
let render_visible_entities = visible_entities.clone();
|
2022-09-21 21:47:53 +00:00
|
|
|
commands.get_or_spawn(entity).insert((
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
ExtractedDirectionalLight {
|
2021-07-08 02:49:33 +00:00
|
|
|
color: directional_light.color,
|
|
|
|
illuminance: directional_light.illuminance,
|
Take DirectionalLight's GlobalTransform into account when calculating shadow map volume (not just direction) (#6384)
# Objective
This PR fixes #5789, by enabling movable (and scalable) directional light shadow volumes.
## Solution
This PR changes `ExtractedDirectionalLight` to hold a copy of the `DirectionalLight` entity's `GlobalTransform`, instead of just a `direction` vector. This allows the shadow map volume (as defined by the light's `shadow_projection` field) to be transformed honoring translation _and_ scale transforms, and not just rotation.
It also augments the texel size calculation (used to determine the `shadow_normal_bias`) so that it now takes into account the upper bound of the x/y/z scale of the `GlobalTransform`.
This change makes the directional light extraction code more consistent with point and spot lights (that already use `transform`), and allows easily moving and scaling the shadow volume along with a player entity based on camera distance/angle, immediately enabling more real world use cases until we have a more sophisticated adaptive implementation, such as the one described in #3629.
**Note:** While it was previously possible to update the projection achieving a similar effect, depending on the light direction and distance to the origin, the fact that the shadow map camera was always positioned at the origin with a hardcoded `Vec3::Y` up value meant you would get sub-optimal or inconsistent/incorrect results.
---
## Changelog
### Changed
- `DirectionalLight` shadow volumes now honor translation and scale transforms
## Migration Guide
- If your directional lights were positioned at the origin and not scaled (the default, most common scenario) no changes are needed on your part; it just works as before;
- If you previously had a system for dynamically updating directional light shadow projections, you might now be able to simplify your code by updating the directional light entity's transform instead;
- In the unlikely scenario that a scene with directional lights that previously rendered shadows correctly has missing shadows, make sure your directional lights are positioned at (0, 0, 0) and are not scaled to a size that's too large or too small.
2022-11-04 20:12:26 +00:00
|
|
|
transform: *transform,
|
2021-11-19 21:16:58 +00:00
|
|
|
shadows_enabled: directional_light.shadows_enabled,
|
2021-07-16 22:41:56 +00:00
|
|
|
shadow_depth_bias: directional_light.shadow_depth_bias,
|
2023-01-25 12:35:39 +00:00
|
|
|
// The factor of SQRT_2 is for the worst-case diagonal offset
|
|
|
|
shadow_normal_bias: directional_light.shadow_normal_bias * std::f32::consts::SQRT_2,
|
|
|
|
cascade_shadow_config: cascade_config.clone(),
|
|
|
|
cascades: cascades.cascades.clone(),
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
},
|
|
|
|
render_visible_entities,
|
|
|
|
));
|
2021-07-08 02:49:33 +00:00
|
|
|
}
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
pub(crate) const POINT_LIGHT_NEAR_Z: f32 = 0.1f32;
|
|
|
|
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
pub(crate) struct CubeMapFace {
|
|
|
|
pub(crate) target: Vec3,
|
|
|
|
pub(crate) up: Vec3,
|
2021-07-01 23:48:55 +00:00
|
|
|
}
|
|
|
|
|
2023-03-18 23:06:53 +00:00
|
|
|
// Cubemap faces are [+X, -X, +Y, -Y, +Z, -Z], per https://www.w3.org/TR/webgpu/#texture-view-creation
|
|
|
|
// Note: Cubemap coordinates are left-handed y-up, unlike the rest of Bevy.
|
|
|
|
// See https://registry.khronos.org/vulkan/specs/1.2/html/chap16.html#_cube_map_face_selection
|
|
|
|
//
|
|
|
|
// For each cubemap face, we take care to specify the appropriate target/up axis such that the rendered
|
|
|
|
// texture using Bevy's right-handed y-up coordinate space matches the expected cubemap face in
|
|
|
|
// left-handed y-up cubemap coordinates.
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
pub(crate) const CUBE_MAP_FACES: [CubeMapFace; 6] = [
|
2023-03-18 23:06:53 +00:00
|
|
|
// +X
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
2023-03-18 23:06:53 +00:00
|
|
|
target: Vec3::X,
|
|
|
|
up: Vec3::Y,
|
2021-07-01 23:48:55 +00:00
|
|
|
},
|
2023-03-18 23:06:53 +00:00
|
|
|
// -X
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
2023-03-18 23:06:53 +00:00
|
|
|
target: Vec3::NEG_X,
|
|
|
|
up: Vec3::Y,
|
2021-07-01 23:48:55 +00:00
|
|
|
},
|
2023-03-18 23:06:53 +00:00
|
|
|
// +Y
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
2023-03-18 23:06:53 +00:00
|
|
|
target: Vec3::Y,
|
2021-07-01 23:48:55 +00:00
|
|
|
up: Vec3::Z,
|
|
|
|
},
|
2023-03-18 23:06:53 +00:00
|
|
|
// -Y
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
2023-03-18 23:06:53 +00:00
|
|
|
target: Vec3::NEG_Y,
|
2022-07-03 19:55:33 +00:00
|
|
|
up: Vec3::NEG_Z,
|
2021-07-01 23:48:55 +00:00
|
|
|
},
|
2023-03-18 23:06:53 +00:00
|
|
|
// +Z (with left-handed conventions, pointing forwards)
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
2022-07-03 19:55:33 +00:00
|
|
|
target: Vec3::NEG_Z,
|
2023-03-18 23:06:53 +00:00
|
|
|
up: Vec3::Y,
|
2021-07-01 23:48:55 +00:00
|
|
|
},
|
2023-03-18 23:06:53 +00:00
|
|
|
// -Z (with left-handed conventions, pointing backwards)
|
2021-07-01 23:48:55 +00:00
|
|
|
CubeMapFace {
|
|
|
|
target: Vec3::Z,
|
2023-03-18 23:06:53 +00:00
|
|
|
up: Vec3::Y,
|
2021-07-01 23:48:55 +00:00
|
|
|
},
|
|
|
|
];
|
|
|
|
|
2021-07-08 02:49:33 +00:00
|
|
|
fn face_index_to_name(face_index: usize) -> &'static str {
|
|
|
|
match face_index {
|
|
|
|
0 => "+x",
|
|
|
|
1 => "-x",
|
|
|
|
2 => "+y",
|
|
|
|
3 => "-y",
|
|
|
|
4 => "+z",
|
|
|
|
5 => "-z",
|
|
|
|
_ => "invalid",
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
2021-11-19 21:16:58 +00:00
|
|
|
pub struct ShadowView {
|
2021-07-01 23:48:55 +00:00
|
|
|
pub depth_texture_view: TextureView,
|
2021-07-08 02:49:33 +00:00
|
|
|
pub pass_name: String,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
2021-11-19 21:16:58 +00:00
|
|
|
pub struct ViewShadowBindings {
|
2021-07-08 02:49:33 +00:00
|
|
|
pub point_light_depth_texture: Texture,
|
|
|
|
pub point_light_depth_texture_view: TextureView,
|
|
|
|
pub directional_light_depth_texture: Texture,
|
|
|
|
pub directional_light_depth_texture_view: TextureView,
|
2021-11-19 21:16:58 +00:00
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
2021-11-19 21:16:58 +00:00
|
|
|
pub struct ViewLightEntities {
|
2021-06-02 02:59:17 +00:00
|
|
|
pub lights: Vec<Entity>,
|
2021-11-19 21:16:58 +00:00
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
2021-11-19 21:16:58 +00:00
|
|
|
pub struct ViewLightsUniformOffset {
|
|
|
|
pub offset: u32,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
// NOTE: Clustered-forward rendering requires 3 storage buffer bindings so check that
|
|
|
|
// at least that many are supported using this constant and SupportedBindingType::from_device()
|
|
|
|
pub const CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT: u32 = 3;
|
|
|
|
|
Make `Resource` trait opt-in, requiring `#[derive(Resource)]` V2 (#5577)
*This PR description is an edited copy of #5007, written by @alice-i-cecile.*
# Objective
Follow-up to https://github.com/bevyengine/bevy/pull/2254. The `Resource` trait currently has a blanket implementation for all types that meet its bounds.
While ergonomic, this results in several drawbacks:
* it is possible to make confusing, silent mistakes such as inserting a function pointer (Foo) rather than a value (Foo::Bar) as a resource
* it is challenging to discover if a type is intended to be used as a resource
* we cannot later add customization options (see the [RFC](https://github.com/bevyengine/rfcs/blob/main/rfcs/27-derive-component.md) for the equivalent choice for Component).
* dependencies can use the same Rust type as a resource in invisibly conflicting ways
* raw Rust types used as resources cannot preserve privacy appropriately, as anyone able to access that type can read and write to internal values
* we cannot capture a definitive list of possible resources to display to users in an editor
## Notes to reviewers
* Review this commit-by-commit; there's effectively no back-tracking and there's a lot of churn in some of these commits.
*ira: My commits are not as well organized :')*
* I've relaxed the bound on Local to Send + Sync + 'static: I don't think these concerns apply there, so this can keep things simple. Storing e.g. a u32 in a Local is fine, because there's a variable name attached explaining what it does.
* I think this is a bad place for the Resource trait to live, but I've left it in place to make reviewing easier. IMO that's best tackled with https://github.com/bevyengine/bevy/issues/4981.
## Changelog
`Resource` is no longer automatically implemented for all matching types. Instead, use the new `#[derive(Resource)]` macro.
## Migration Guide
Add `#[derive(Resource)]` to all types you are using as a resource.
If you are using a third party type as a resource, wrap it in a tuple struct to bypass orphan rules. Consider deriving `Deref` and `DerefMut` to improve ergonomics.
`ClearColor` no longer implements `Component`. Using `ClearColor` as a component in 0.8 did nothing.
Use the `ClearColorConfig` in the `Camera3d` and `Camera2d` components instead.
Co-authored-by: Alice <alice.i.cecile@gmail.com>
Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: devil-ira <justthecooldude@gmail.com>
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-08-08 21:36:35 +00:00
|
|
|
#[derive(Resource)]
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
pub struct GlobalLightMeta {
|
2022-04-07 16:16:35 +00:00
|
|
|
pub gpu_point_lights: GpuPointLights,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
pub entity_to_index: HashMap<Entity, usize>,
|
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
impl FromWorld for GlobalLightMeta {
|
|
|
|
fn from_world(world: &mut World) -> Self {
|
|
|
|
Self::new(
|
|
|
|
world
|
|
|
|
.resource::<RenderDevice>()
|
|
|
|
.get_supported_read_only_binding_type(CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT),
|
|
|
|
)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl GlobalLightMeta {
|
|
|
|
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
|
|
|
|
Self {
|
|
|
|
gpu_point_lights: GpuPointLights::new(buffer_binding_type),
|
|
|
|
entity_to_index: HashMap::default(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Make `Resource` trait opt-in, requiring `#[derive(Resource)]` V2 (#5577)
*This PR description is an edited copy of #5007, written by @alice-i-cecile.*
# Objective
Follow-up to https://github.com/bevyengine/bevy/pull/2254. The `Resource` trait currently has a blanket implementation for all types that meet its bounds.
While ergonomic, this results in several drawbacks:
* it is possible to make confusing, silent mistakes such as inserting a function pointer (Foo) rather than a value (Foo::Bar) as a resource
* it is challenging to discover if a type is intended to be used as a resource
* we cannot later add customization options (see the [RFC](https://github.com/bevyengine/rfcs/blob/main/rfcs/27-derive-component.md) for the equivalent choice for Component).
* dependencies can use the same Rust type as a resource in invisibly conflicting ways
* raw Rust types used as resources cannot preserve privacy appropriately, as anyone able to access that type can read and write to internal values
* we cannot capture a definitive list of possible resources to display to users in an editor
## Notes to reviewers
* Review this commit-by-commit; there's effectively no back-tracking and there's a lot of churn in some of these commits.
*ira: My commits are not as well organized :')*
* I've relaxed the bound on Local to Send + Sync + 'static: I don't think these concerns apply there, so this can keep things simple. Storing e.g. a u32 in a Local is fine, because there's a variable name attached explaining what it does.
* I think this is a bad place for the Resource trait to live, but I've left it in place to make reviewing easier. IMO that's best tackled with https://github.com/bevyengine/bevy/issues/4981.
## Changelog
`Resource` is no longer automatically implemented for all matching types. Instead, use the new `#[derive(Resource)]` macro.
## Migration Guide
Add `#[derive(Resource)]` to all types you are using as a resource.
If you are using a third party type as a resource, wrap it in a tuple struct to bypass orphan rules. Consider deriving `Deref` and `DerefMut` to improve ergonomics.
`ClearColor` no longer implements `Component`. Using `ClearColor` as a component in 0.8 did nothing.
Use the `ClearColorConfig` in the `Camera3d` and `Camera2d` components instead.
Co-authored-by: Alice <alice.i.cecile@gmail.com>
Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: devil-ira <justthecooldude@gmail.com>
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-08-08 21:36:35 +00:00
|
|
|
#[derive(Resource, Default)]
|
2021-06-02 02:59:17 +00:00
|
|
|
pub struct LightMeta {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
pub view_gpu_lights: DynamicUniformBuffer<GpuLights>,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2021-11-22 23:16:36 +00:00
|
|
|
#[derive(Component)]
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
pub enum LightEntity {
|
|
|
|
Directional {
|
|
|
|
light_entity: Entity,
|
2023-01-25 12:35:39 +00:00
|
|
|
cascade_index: usize,
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
},
|
|
|
|
Point {
|
|
|
|
light_entity: Entity,
|
|
|
|
face_index: usize,
|
|
|
|
},
|
2022-07-08 19:57:43 +00:00
|
|
|
Spot {
|
|
|
|
light_entity: Entity,
|
|
|
|
},
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
}
|
2021-12-14 23:42:35 +00:00
|
|
|
pub fn calculate_cluster_factors(
|
|
|
|
near: f32,
|
|
|
|
far: f32,
|
|
|
|
z_slices: f32,
|
|
|
|
is_orthographic: bool,
|
|
|
|
) -> Vec2 {
|
|
|
|
if is_orthographic {
|
|
|
|
Vec2::new(-near, z_slices / (-far - -near))
|
|
|
|
} else {
|
2022-01-07 21:25:59 +00:00
|
|
|
let z_slices_of_ln_zfar_over_znear = (z_slices - 1.0) / (far / near).ln();
|
2021-12-14 23:42:35 +00:00
|
|
|
Vec2::new(
|
|
|
|
z_slices_of_ln_zfar_over_znear,
|
|
|
|
near.ln() * z_slices_of_ln_zfar_over_znear,
|
|
|
|
)
|
|
|
|
}
|
|
|
|
}
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
// this method of constructing a basis from a vec3 is used by glam::Vec3::any_orthonormal_pair
|
|
|
|
// we will also construct it in the fragment shader and need our implementations to match,
|
|
|
|
// so we reproduce it here to avoid a mismatch if glam changes. we also switch the handedness
|
|
|
|
// could move this onto transform but it's pretty niche
|
|
|
|
pub(crate) fn spot_light_view_matrix(transform: &GlobalTransform) -> Mat4 {
|
|
|
|
// the matrix z_local (opposite of transform.forward())
|
2022-07-16 00:51:12 +00:00
|
|
|
let fwd_dir = transform.back().extend(0.0);
|
2022-07-08 19:57:43 +00:00
|
|
|
|
|
|
|
let sign = 1f32.copysign(fwd_dir.z);
|
|
|
|
let a = -1.0 / (fwd_dir.z + sign);
|
|
|
|
let b = fwd_dir.x * fwd_dir.y * a;
|
|
|
|
let up_dir = Vec4::new(
|
|
|
|
1.0 + sign * fwd_dir.x * fwd_dir.x * a,
|
|
|
|
sign * b,
|
|
|
|
-sign * fwd_dir.x,
|
|
|
|
0.0,
|
|
|
|
);
|
|
|
|
let right_dir = Vec4::new(-b, -sign - fwd_dir.y * fwd_dir.y * a, fwd_dir.y, 0.0);
|
|
|
|
|
|
|
|
Mat4::from_cols(
|
|
|
|
right_dir,
|
|
|
|
up_dir,
|
|
|
|
fwd_dir,
|
2022-07-16 00:51:12 +00:00
|
|
|
transform.translation().extend(1.0),
|
2022-07-08 19:57:43 +00:00
|
|
|
)
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) fn spot_light_projection_matrix(angle: f32) -> Mat4 {
|
2023-04-08 16:22:46 +00:00
|
|
|
// spot light projection FOV is 2x the angle from spot light center to outer edge
|
2022-07-08 19:57:43 +00:00
|
|
|
Mat4::perspective_infinite_reverse_rh(angle * 2.0, 1.0, POINT_LIGHT_NEAR_Z)
|
|
|
|
}
|
|
|
|
|
2021-07-08 03:01:41 +00:00
|
|
|
#[allow(clippy::too_many_arguments)]
|
2021-06-02 02:59:17 +00:00
|
|
|
pub fn prepare_lights(
|
|
|
|
mut commands: Commands,
|
|
|
|
mut texture_cache: ResMut<TextureCache>,
|
2023-02-20 00:02:40 +00:00
|
|
|
images: Res<RenderAssets<Image>>,
|
2021-06-21 23:28:52 +00:00
|
|
|
render_device: Res<RenderDevice>,
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
render_queue: Res<RenderQueue>,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
mut global_light_meta: ResMut<GlobalLightMeta>,
|
2021-06-02 02:59:17 +00:00
|
|
|
mut light_meta: ResMut<LightMeta>,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
views: Query<
|
2023-02-20 00:02:40 +00:00
|
|
|
(
|
|
|
|
Entity,
|
|
|
|
&ExtractedView,
|
|
|
|
&ExtractedClusterConfig,
|
|
|
|
Option<&EnvironmentMapLight>,
|
|
|
|
),
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
With<RenderPhase<Transparent3d>>,
|
|
|
|
>,
|
2022-05-30 18:36:03 +00:00
|
|
|
ambient_light: Res<AmbientLight>,
|
|
|
|
point_light_shadow_map: Res<PointLightShadowMap>,
|
|
|
|
directional_light_shadow_map: Res<DirectionalLightShadowMap>,
|
2022-11-03 07:09:51 +00:00
|
|
|
mut max_directional_lights_warning_emitted: Local<bool>,
|
2023-01-25 12:35:39 +00:00
|
|
|
mut max_cascades_per_light_warning_emitted: Local<bool>,
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
point_lights: Query<(Entity, &ExtractedPointLight)>,
|
|
|
|
directional_lights: Query<(Entity, &ExtractedDirectionalLight)>,
|
2021-06-02 02:59:17 +00:00
|
|
|
) {
|
Sprite Batching (#3060)
This implements the following:
* **Sprite Batching**: Collects sprites in a vertex buffer to draw many sprites with a single draw call. Sprites are batched by their `Handle<Image>` within a specific z-level. When possible, sprites are opportunistically batched _across_ z-levels (when no sprites with a different texture exist between two sprites with the same texture on different z levels). With these changes, I can now get ~130,000 sprites at 60fps on the `bevymark_pipelined` example.
* **Sprite Color Tints**: The `Sprite` type now has a `color` field. Non-white color tints result in a specialized render pipeline that passes the color in as a vertex attribute. I chose to specialize this because passing vertex colors has a measurable price (without colors I get ~130,000 sprites on bevymark, with colors I get ~100,000 sprites). "Colored" sprites cannot be batched with "uncolored" sprites, but I think this is fine because the chance of a "colored" sprite needing to batch with other "colored" sprites is generally probably way higher than an "uncolored" sprite needing to batch with a "colored" sprite.
* **Sprite Flipping**: Sprites can be flipped on their x or y axis using `Sprite::flip_x` and `Sprite::flip_y`. This is also true for `TextureAtlasSprite`.
* **Simpler BufferVec/UniformVec/DynamicUniformVec Clearing**: improved the clearing interface by removing the need to know the size of the final buffer at the initial clear.
![image](https://user-images.githubusercontent.com/2694663/140001821-99be0d96-025d-489e-9bfa-ba19c1dc9548.png)
Note that this moves sprites away from entity-driven rendering and back to extracted lists. We _could_ use entities here, but it necessitates that an intermediate list is allocated / populated to collect and sort extracted sprites. This redundant copy, combined with the normal overhead of spawning extracted sprite entities, brings bevymark down to ~80,000 sprites at 60fps. I think making sprites a bit more fixed (by default) is worth it. I view this as acceptable because batching makes normal entity-driven rendering pretty useless anyway (and we would want to batch most custom materials too). We can still support custom shaders with custom bindings, we'll just need to define a specific interface for it.
2021-11-04 20:28:53 +00:00
|
|
|
light_meta.view_gpu_lights.clear();
|
2021-06-02 02:59:17 +00:00
|
|
|
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
// Pre-calculate for PointLights
|
|
|
|
let cube_face_projection =
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
Mat4::perspective_infinite_reverse_rh(std::f32::consts::FRAC_PI_2, 1.0, POINT_LIGHT_NEAR_Z);
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
let cube_face_rotations = CUBE_MAP_FACES
|
|
|
|
.iter()
|
2022-08-30 22:10:24 +00:00
|
|
|
.map(|CubeMapFace { target, up }| Transform::IDENTITY.looking_at(*target, *up))
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
.collect::<Vec<_>>();
|
|
|
|
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
global_light_meta.entity_to_index.clear();
|
2021-12-22 20:59:48 +00:00
|
|
|
|
|
|
|
let mut point_lights: Vec<_> = point_lights.iter().collect::<Vec<_>>();
|
2022-11-03 07:09:51 +00:00
|
|
|
let mut directional_lights: Vec<_> = directional_lights.iter().collect::<Vec<_>>();
|
2021-12-22 20:59:48 +00:00
|
|
|
|
2022-04-07 21:55:31 +00:00
|
|
|
#[cfg(not(feature = "webgl"))]
|
2022-07-08 19:57:43 +00:00
|
|
|
let max_texture_array_layers = render_device.limits().max_texture_array_layers as usize;
|
|
|
|
#[cfg(not(feature = "webgl"))]
|
|
|
|
let max_texture_cubes = max_texture_array_layers / 6;
|
|
|
|
#[cfg(feature = "webgl")]
|
|
|
|
let max_texture_array_layers = 1;
|
|
|
|
#[cfg(feature = "webgl")]
|
|
|
|
let max_texture_cubes = 1;
|
|
|
|
|
2022-11-03 07:09:51 +00:00
|
|
|
if !*max_directional_lights_warning_emitted && directional_lights.len() > MAX_DIRECTIONAL_LIGHTS
|
|
|
|
{
|
|
|
|
warn!(
|
|
|
|
"The amount of directional lights of {} is exceeding the supported limit of {}.",
|
|
|
|
directional_lights.len(),
|
|
|
|
MAX_DIRECTIONAL_LIGHTS
|
|
|
|
);
|
|
|
|
*max_directional_lights_warning_emitted = true;
|
|
|
|
}
|
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
if !*max_cascades_per_light_warning_emitted
|
|
|
|
&& directional_lights
|
|
|
|
.iter()
|
|
|
|
.any(|(_, light)| light.cascade_shadow_config.bounds.len() > MAX_CASCADES_PER_LIGHT)
|
|
|
|
{
|
|
|
|
warn!(
|
|
|
|
"The number of cascades configured for a directional light exceeds the supported limit of {}.",
|
|
|
|
MAX_CASCADES_PER_LIGHT
|
|
|
|
);
|
|
|
|
*max_cascades_per_light_warning_emitted = true;
|
|
|
|
}
|
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
let point_light_count = point_lights
|
2022-04-07 21:55:31 +00:00
|
|
|
.iter()
|
2022-07-25 16:24:54 +00:00
|
|
|
.filter(|light| light.1.spot_light_angles.is_none())
|
2022-07-08 19:57:43 +00:00
|
|
|
.count();
|
|
|
|
|
2022-07-25 16:24:54 +00:00
|
|
|
let point_light_shadow_maps_count = point_lights
|
|
|
|
.iter()
|
|
|
|
.filter(|light| light.1.shadows_enabled && light.1.spot_light_angles.is_none())
|
|
|
|
.count()
|
|
|
|
.min(max_texture_cubes);
|
2022-07-08 19:57:43 +00:00
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
let directional_shadow_enabled_count = directional_lights
|
2022-07-08 19:57:43 +00:00
|
|
|
.iter()
|
2022-11-03 07:09:51 +00:00
|
|
|
.take(MAX_DIRECTIONAL_LIGHTS)
|
2022-07-08 19:57:43 +00:00
|
|
|
.filter(|(_, light)| light.shadows_enabled)
|
2022-04-07 21:55:31 +00:00
|
|
|
.count()
|
2023-01-25 12:35:39 +00:00
|
|
|
.min(max_texture_array_layers / MAX_CASCADES_PER_LIGHT);
|
2022-04-07 21:55:31 +00:00
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
let spot_light_shadow_maps_count = point_lights
|
|
|
|
.iter()
|
|
|
|
.filter(|(_, light)| light.shadows_enabled && light.spot_light_angles.is_some())
|
|
|
|
.count()
|
2023-01-25 12:35:39 +00:00
|
|
|
.min(max_texture_array_layers - directional_shadow_enabled_count * MAX_CASCADES_PER_LIGHT);
|
2022-07-08 19:57:43 +00:00
|
|
|
|
|
|
|
// Sort lights by
|
|
|
|
// - point-light vs spot-light, so that we can iterate point lights and spot lights in contiguous blocks in the fragment shader,
|
|
|
|
// - then those with shadows enabled first, so that the index can be used to render at most `point_light_shadow_maps_count`
|
|
|
|
// point light shadows and `spot_light_shadow_maps_count` spot light shadow maps,
|
|
|
|
// - then by entity as a stable key to ensure that a consistent set of lights are chosen if the light count limit is exceeded.
|
2021-12-22 20:59:48 +00:00
|
|
|
point_lights.sort_by(|(entity_1, light_1), (entity_2, light_2)| {
|
2022-03-01 10:17:41 +00:00
|
|
|
point_light_order(
|
2022-07-08 19:57:43 +00:00
|
|
|
(
|
|
|
|
entity_1,
|
|
|
|
&light_1.shadows_enabled,
|
|
|
|
&light_1.spot_light_angles.is_some(),
|
|
|
|
),
|
|
|
|
(
|
|
|
|
entity_2,
|
|
|
|
&light_2.shadows_enabled,
|
|
|
|
&light_2.spot_light_angles.is_some(),
|
|
|
|
),
|
2022-03-01 10:17:41 +00:00
|
|
|
)
|
2021-12-22 20:59:48 +00:00
|
|
|
});
|
|
|
|
|
2022-11-03 07:09:51 +00:00
|
|
|
// Sort lights by
|
|
|
|
// - those with shadows enabled first, so that the index can be used to render at most `directional_light_shadow_maps_count`
|
|
|
|
// directional light shadows
|
|
|
|
// - then by entity as a stable key to ensure that a consistent set of lights are chosen if the light count limit is exceeded.
|
|
|
|
directional_lights.sort_by(|(entity_1, light_1), (entity_2, light_2)| {
|
|
|
|
directional_light_order(
|
|
|
|
(entity_1, &light_1.shadows_enabled),
|
|
|
|
(entity_2, &light_2.shadows_enabled),
|
|
|
|
)
|
|
|
|
});
|
|
|
|
|
2021-12-22 20:59:48 +00:00
|
|
|
if global_light_meta.entity_to_index.capacity() < point_lights.len() {
|
|
|
|
global_light_meta
|
|
|
|
.entity_to_index
|
|
|
|
.reserve(point_lights.len());
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
2021-12-22 20:59:48 +00:00
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
let mut gpu_point_lights = Vec::new();
|
2022-03-01 10:17:41 +00:00
|
|
|
for (index, &(entity, light)) in point_lights.iter().enumerate() {
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
let mut flags = PointLightFlags::NONE;
|
2022-07-08 19:57:43 +00:00
|
|
|
|
2021-12-22 20:59:48 +00:00
|
|
|
// Lights are sorted, shadow enabled lights are first
|
2022-07-08 19:57:43 +00:00
|
|
|
if light.shadows_enabled
|
|
|
|
&& (index < point_light_shadow_maps_count
|
|
|
|
|| (light.spot_light_angles.is_some()
|
|
|
|
&& index - point_light_count < spot_light_shadow_maps_count))
|
|
|
|
{
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
flags |= PointLightFlags::SHADOWS_ENABLED;
|
|
|
|
}
|
2022-07-08 19:57:43 +00:00
|
|
|
|
|
|
|
let (light_custom_data, spot_light_tan_angle) = match light.spot_light_angles {
|
|
|
|
Some((inner, outer)) => {
|
|
|
|
let light_direction = light.transform.forward();
|
|
|
|
if light_direction.y.is_sign_negative() {
|
|
|
|
flags |= PointLightFlags::SPOT_LIGHT_Y_NEGATIVE;
|
|
|
|
}
|
|
|
|
|
|
|
|
let cos_outer = outer.cos();
|
|
|
|
let spot_scale = 1.0 / f32::max(inner.cos() - cos_outer, 1e-4);
|
|
|
|
let spot_offset = -cos_outer * spot_scale;
|
|
|
|
|
|
|
|
(
|
|
|
|
// For spot lights: the direction (x,z), spot_scale and spot_offset
|
|
|
|
light_direction.xz().extend(spot_scale).extend(spot_offset),
|
|
|
|
outer.tan(),
|
|
|
|
)
|
|
|
|
}
|
|
|
|
None => {
|
|
|
|
(
|
|
|
|
// For point lights: the lower-right 2x2 values of the projection matrix [2][2] [2][3] [3][2] [3][3]
|
|
|
|
Vec4::new(
|
|
|
|
cube_face_projection.z_axis.z,
|
|
|
|
cube_face_projection.z_axis.w,
|
|
|
|
cube_face_projection.w_axis.z,
|
|
|
|
cube_face_projection.w_axis.w,
|
|
|
|
),
|
|
|
|
// unused
|
|
|
|
0.0,
|
|
|
|
)
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
gpu_point_lights.push(GpuPointLight {
|
2022-07-08 19:57:43 +00:00
|
|
|
light_custom_data,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
// premultiply color by intensity
|
|
|
|
// we don't use the alpha at all, so no reason to multiply only [0..3]
|
|
|
|
color_inverse_square_range: (Vec4::from_slice(&light.color.as_linear_rgba_f32())
|
|
|
|
* light.intensity)
|
|
|
|
.xyz()
|
|
|
|
.extend(1.0 / (light.range * light.range)),
|
2022-07-16 00:51:12 +00:00
|
|
|
position_radius: light.transform.translation().extend(light.radius),
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
flags: flags.bits,
|
|
|
|
shadow_depth_bias: light.shadow_depth_bias,
|
|
|
|
shadow_normal_bias: light.shadow_normal_bias,
|
2022-07-08 19:57:43 +00:00
|
|
|
spot_light_tan_angle,
|
2022-04-07 16:16:35 +00:00
|
|
|
});
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
global_light_meta.entity_to_index.insert(entity, index);
|
|
|
|
}
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
|
2022-11-03 07:09:51 +00:00
|
|
|
let mut gpu_directional_lights = [GpuDirectionalLight::default(); MAX_DIRECTIONAL_LIGHTS];
|
2023-01-25 12:35:39 +00:00
|
|
|
let mut num_directional_cascades_enabled = 0usize;
|
2022-11-03 07:09:51 +00:00
|
|
|
for (index, (_light_entity, light)) in directional_lights
|
|
|
|
.iter()
|
|
|
|
.enumerate()
|
|
|
|
.take(MAX_DIRECTIONAL_LIGHTS)
|
|
|
|
{
|
|
|
|
let mut flags = DirectionalLightFlags::NONE;
|
|
|
|
|
|
|
|
// Lights are sorted, shadow enabled lights are first
|
2023-01-25 12:35:39 +00:00
|
|
|
if light.shadows_enabled && (index < directional_shadow_enabled_count) {
|
2022-11-03 07:09:51 +00:00
|
|
|
flags |= DirectionalLightFlags::SHADOWS_ENABLED;
|
|
|
|
}
|
|
|
|
|
|
|
|
// convert from illuminance (lux) to candelas
|
|
|
|
//
|
|
|
|
// exposure is hard coded at the moment but should be replaced
|
|
|
|
// by values coming from the camera
|
|
|
|
// see: https://google.github.io/filament/Filament.html#imagingpipeline/physicallybasedcamera/exposuresettings
|
|
|
|
const APERTURE: f32 = 4.0;
|
|
|
|
const SHUTTER_SPEED: f32 = 1.0 / 250.0;
|
|
|
|
const SENSITIVITY: f32 = 100.0;
|
|
|
|
let ev100 = f32::log2(APERTURE * APERTURE / SHUTTER_SPEED) - f32::log2(SENSITIVITY / 100.0);
|
|
|
|
let exposure = 1.0 / (f32::powf(2.0, ev100) * 1.2);
|
|
|
|
let intensity = light.illuminance * exposure;
|
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
let num_cascades = light
|
|
|
|
.cascade_shadow_config
|
|
|
|
.bounds
|
|
|
|
.len()
|
|
|
|
.min(MAX_CASCADES_PER_LIGHT);
|
2022-11-03 07:09:51 +00:00
|
|
|
gpu_directional_lights[index] = GpuDirectionalLight {
|
2023-01-25 12:35:39 +00:00
|
|
|
// Filled in later.
|
|
|
|
cascades: [GpuDirectionalCascade::default(); MAX_CASCADES_PER_LIGHT],
|
2022-11-03 07:09:51 +00:00
|
|
|
// premultiply color by intensity
|
|
|
|
// we don't use the alpha at all, so no reason to multiply only [0..3]
|
|
|
|
color: Vec4::from_slice(&light.color.as_linear_rgba_f32()) * intensity,
|
2023-01-25 12:35:39 +00:00
|
|
|
// direction is negated to be ready for N.L
|
|
|
|
dir_to_light: light.transform.back(),
|
2022-11-03 07:09:51 +00:00
|
|
|
flags: flags.bits,
|
|
|
|
shadow_depth_bias: light.shadow_depth_bias,
|
|
|
|
shadow_normal_bias: light.shadow_normal_bias,
|
2023-01-25 12:35:39 +00:00
|
|
|
num_cascades: num_cascades as u32,
|
|
|
|
cascades_overlap_proportion: light.cascade_shadow_config.overlap_proportion,
|
|
|
|
depth_texture_base_index: num_directional_cascades_enabled as u32,
|
2022-11-03 07:09:51 +00:00
|
|
|
};
|
2023-01-25 12:35:39 +00:00
|
|
|
if index < directional_shadow_enabled_count {
|
|
|
|
num_directional_cascades_enabled += num_cascades;
|
|
|
|
}
|
2022-11-03 07:09:51 +00:00
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
global_light_meta.gpu_point_lights.set(gpu_point_lights);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
global_light_meta
|
|
|
|
.gpu_point_lights
|
|
|
|
.write_buffer(&render_device, &render_queue);
|
|
|
|
|
2021-06-02 02:59:17 +00:00
|
|
|
// set up light data for each view
|
2023-02-20 00:02:40 +00:00
|
|
|
for (entity, extracted_view, clusters, environment_map) in &views {
|
2021-07-08 02:49:33 +00:00
|
|
|
let point_light_depth_texture = texture_cache.get(
|
|
|
|
&render_device,
|
|
|
|
TextureDescriptor {
|
2021-08-25 05:57:57 +00:00
|
|
|
size: Extent3d {
|
|
|
|
width: point_light_shadow_map.size as u32,
|
|
|
|
height: point_light_shadow_map.size as u32,
|
2022-07-08 19:57:43 +00:00
|
|
|
depth_or_array_layers: point_light_shadow_maps_count.max(1) as u32 * 6,
|
2021-08-25 05:57:57 +00:00
|
|
|
},
|
2021-07-08 02:49:33 +00:00
|
|
|
mip_level_count: 1,
|
|
|
|
sample_count: 1,
|
|
|
|
dimension: TextureDimension::D2,
|
|
|
|
format: SHADOW_FORMAT,
|
2021-10-03 19:04:37 +00:00
|
|
|
label: Some("point_light_shadow_map_texture"),
|
2021-10-08 19:55:24 +00:00
|
|
|
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
|
Wgpu 0.15 (#7356)
# Objective
Update Bevy to wgpu 0.15.
## Changelog
- Update to wgpu 0.15, wgpu-hal 0.15.1, and naga 0.11
- Users can now use the [DirectX Shader Compiler](https://github.com/microsoft/DirectXShaderCompiler) (DXC) on Windows with DX12 for faster shader compilation and ShaderModel 6.0+ support (requires `dxcompiler.dll` and `dxil.dll`, which are included in DXC downloads from [here](https://github.com/microsoft/DirectXShaderCompiler/releases/latest))
## Migration Guide
### WGSL Top-Level `let` is now `const`
All top level constants are now declared with `const`, catching up with the wgsl spec.
`let` is no longer allowed at the global scope, only within functions.
```diff
-let SOME_CONSTANT = 12.0;
+const SOME_CONSTANT = 12.0;
```
#### `TextureDescriptor` and `SurfaceConfiguration` now requires a `view_formats` field
The new `view_formats` field in the `TextureDescriptor` is used to specify a list of formats the texture can be re-interpreted to in a texture view. Currently only changing srgb-ness is allowed (ex. `Rgba8Unorm` <=> `Rgba8UnormSrgb`). You should set `view_formats` to `&[]` (empty) unless you have a specific reason not to.
#### The DirectX Shader Compiler (DXC) is now supported on DX12
DXC is now the default shader compiler when using the DX12 backend. DXC is Microsoft's replacement for their legacy FXC compiler, and is faster, less buggy, and allows for modern shader features to be used (ShaderModel 6.0+). DXC requires `dxcompiler.dll` and `dxil.dll` to be available, otherwise it will log a warning and fall back to FXC.
You can get `dxcompiler.dll` and `dxil.dll` by downloading the latest release from [Microsoft's DirectXShaderCompiler github repo](https://github.com/microsoft/DirectXShaderCompiler/releases/latest) and copying them into your project's root directory. These must be included when you distribute your Bevy game/app/etc if you plan on supporting the DX12 backend and are using DXC.
`WgpuSettings` now has a `dx12_shader_compiler` field which can be used to choose between either FXC or DXC (if you pass None for the paths for DXC, it will check for the .dlls in the working directory).
2023-01-29 20:27:30 +00:00
|
|
|
view_formats: &[],
|
2021-07-08 02:49:33 +00:00
|
|
|
},
|
|
|
|
);
|
|
|
|
let directional_light_depth_texture = texture_cache.get(
|
2021-06-21 23:28:52 +00:00
|
|
|
&render_device,
|
2021-06-02 02:59:17 +00:00
|
|
|
TextureDescriptor {
|
2021-08-25 05:57:57 +00:00
|
|
|
size: Extent3d {
|
2022-01-04 20:08:12 +00:00
|
|
|
width: (directional_light_shadow_map.size as u32)
|
2022-02-16 21:17:37 +00:00
|
|
|
.min(render_device.limits().max_texture_dimension_2d),
|
2022-01-04 20:08:12 +00:00
|
|
|
height: (directional_light_shadow_map.size as u32)
|
2022-02-16 21:17:37 +00:00
|
|
|
.min(render_device.limits().max_texture_dimension_2d),
|
2023-01-25 12:35:39 +00:00
|
|
|
depth_or_array_layers: (num_directional_cascades_enabled
|
2022-07-08 19:57:43 +00:00
|
|
|
+ spot_light_shadow_maps_count)
|
|
|
|
.max(1) as u32,
|
2021-08-25 05:57:57 +00:00
|
|
|
},
|
2021-06-02 02:59:17 +00:00
|
|
|
mip_level_count: 1,
|
|
|
|
sample_count: 1,
|
|
|
|
dimension: TextureDimension::D2,
|
|
|
|
format: SHADOW_FORMAT,
|
2021-10-03 19:04:37 +00:00
|
|
|
label: Some("directional_light_shadow_map_texture"),
|
2021-10-08 19:55:24 +00:00
|
|
|
usage: TextureUsages::RENDER_ATTACHMENT | TextureUsages::TEXTURE_BINDING,
|
Wgpu 0.15 (#7356)
# Objective
Update Bevy to wgpu 0.15.
## Changelog
- Update to wgpu 0.15, wgpu-hal 0.15.1, and naga 0.11
- Users can now use the [DirectX Shader Compiler](https://github.com/microsoft/DirectXShaderCompiler) (DXC) on Windows with DX12 for faster shader compilation and ShaderModel 6.0+ support (requires `dxcompiler.dll` and `dxil.dll`, which are included in DXC downloads from [here](https://github.com/microsoft/DirectXShaderCompiler/releases/latest))
## Migration Guide
### WGSL Top-Level `let` is now `const`
All top level constants are now declared with `const`, catching up with the wgsl spec.
`let` is no longer allowed at the global scope, only within functions.
```diff
-let SOME_CONSTANT = 12.0;
+const SOME_CONSTANT = 12.0;
```
#### `TextureDescriptor` and `SurfaceConfiguration` now requires a `view_formats` field
The new `view_formats` field in the `TextureDescriptor` is used to specify a list of formats the texture can be re-interpreted to in a texture view. Currently only changing srgb-ness is allowed (ex. `Rgba8Unorm` <=> `Rgba8UnormSrgb`). You should set `view_formats` to `&[]` (empty) unless you have a specific reason not to.
#### The DirectX Shader Compiler (DXC) is now supported on DX12
DXC is now the default shader compiler when using the DX12 backend. DXC is Microsoft's replacement for their legacy FXC compiler, and is faster, less buggy, and allows for modern shader features to be used (ShaderModel 6.0+). DXC requires `dxcompiler.dll` and `dxil.dll` to be available, otherwise it will log a warning and fall back to FXC.
You can get `dxcompiler.dll` and `dxil.dll` by downloading the latest release from [Microsoft's DirectXShaderCompiler github repo](https://github.com/microsoft/DirectXShaderCompiler/releases/latest) and copying them into your project's root directory. These must be included when you distribute your Bevy game/app/etc if you plan on supporting the DX12 backend and are using DXC.
`WgpuSettings` now has a `dx12_shader_compiler` field which can be used to choose between either FXC or DXC (if you pass None for the paths for DXC, it will check for the .dlls in the working directory).
2023-01-29 20:27:30 +00:00
|
|
|
view_formats: &[],
|
2021-06-02 02:59:17 +00:00
|
|
|
},
|
|
|
|
);
|
|
|
|
let mut view_lights = Vec::new();
|
|
|
|
|
2021-12-14 23:42:35 +00:00
|
|
|
let is_orthographic = extracted_view.projection.w_axis.w == 1.0;
|
|
|
|
let cluster_factors_zw = calculate_cluster_factors(
|
2022-01-07 21:25:59 +00:00
|
|
|
clusters.near,
|
2022-03-08 04:56:42 +00:00
|
|
|
clusters.far,
|
2022-03-24 00:20:27 +00:00
|
|
|
clusters.dimensions.z as f32,
|
2021-12-14 23:42:35 +00:00
|
|
|
is_orthographic,
|
|
|
|
);
|
|
|
|
|
2022-03-24 00:20:27 +00:00
|
|
|
let n_clusters = clusters.dimensions.x * clusters.dimensions.y * clusters.dimensions.z;
|
2023-01-25 12:35:39 +00:00
|
|
|
let mut gpu_lights = GpuLights {
|
2022-11-03 07:09:51 +00:00
|
|
|
directional_lights: gpu_directional_lights,
|
2021-11-28 10:40:42 +00:00
|
|
|
ambient_color: Vec4::from_slice(&ambient_light.color.as_linear_rgba_f32())
|
|
|
|
* ambient_light.brightness,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
cluster_factors: Vec4::new(
|
2022-09-15 21:58:14 +00:00
|
|
|
clusters.dimensions.x as f32 / extracted_view.viewport.z as f32,
|
|
|
|
clusters.dimensions.y as f32 / extracted_view.viewport.w as f32,
|
2021-12-14 23:42:35 +00:00
|
|
|
cluster_factors_zw.x,
|
|
|
|
cluster_factors_zw.y,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
),
|
2022-03-24 00:20:27 +00:00
|
|
|
cluster_dimensions: clusters.dimensions.extend(n_clusters),
|
2021-07-08 02:49:33 +00:00
|
|
|
n_directional_lights: directional_lights.iter().len() as u32,
|
2023-01-25 12:35:39 +00:00
|
|
|
// spotlight shadow maps are stored in the directional light array, starting at num_directional_cascades_enabled.
|
2022-07-08 19:57:43 +00:00
|
|
|
// the spot lights themselves start in the light array at point_light_count. so to go from light
|
2022-07-25 16:24:54 +00:00
|
|
|
// index to shadow map index, we need to subtract point light count and add directional shadowmap count.
|
2023-01-25 12:35:39 +00:00
|
|
|
spot_light_shadowmap_offset: num_directional_cascades_enabled as i32
|
2022-07-08 19:57:43 +00:00
|
|
|
- point_light_count as i32,
|
2023-02-20 00:02:40 +00:00
|
|
|
environment_map_smallest_specular_mip_level: environment_map
|
|
|
|
.and_then(|env_map| images.get(&env_map.specular_map))
|
|
|
|
.map(|specular_map| specular_map.mip_level_count - 1)
|
|
|
|
.unwrap_or(0),
|
2021-06-02 02:59:17 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
// TODO: this should select lights based on relevance to the view instead of the first ones that show up in a query
|
2021-12-22 20:59:48 +00:00
|
|
|
for &(light_entity, light) in point_lights
|
|
|
|
.iter()
|
|
|
|
// Lights are sorted, shadow enabled lights are first
|
2022-07-08 19:57:43 +00:00
|
|
|
.take(point_light_shadow_maps_count)
|
2021-12-22 20:59:48 +00:00
|
|
|
.filter(|(_, light)| light.shadows_enabled)
|
|
|
|
{
|
|
|
|
let light_index = *global_light_meta
|
|
|
|
.entity_to_index
|
|
|
|
.get(&light_entity)
|
|
|
|
.unwrap();
|
|
|
|
// ignore scale because we don't want to effectively scale light radius and range
|
|
|
|
// by applying those as a view transform to shadow map rendering of objects
|
|
|
|
// and ignore rotation because we want the shadow map projections to align with the axes
|
2022-07-16 00:51:12 +00:00
|
|
|
let view_translation = GlobalTransform::from_translation(light.transform.translation());
|
2021-12-22 20:59:48 +00:00
|
|
|
|
|
|
|
for (face_index, view_rotation) in cube_face_rotations.iter().enumerate() {
|
|
|
|
let depth_texture_view =
|
|
|
|
point_light_depth_texture
|
|
|
|
.texture
|
|
|
|
.create_view(&TextureViewDescriptor {
|
|
|
|
label: Some("point_light_shadow_map_texture_view"),
|
|
|
|
format: None,
|
|
|
|
dimension: Some(TextureViewDimension::D2),
|
|
|
|
aspect: TextureAspect::All,
|
|
|
|
base_mip_level: 0,
|
|
|
|
mip_level_count: None,
|
|
|
|
base_array_layer: (light_index * 6 + face_index) as u32,
|
2023-04-26 15:34:23 +00:00
|
|
|
array_layer_count: Some(1u32),
|
2021-12-22 20:59:48 +00:00
|
|
|
});
|
|
|
|
|
|
|
|
let view_light_entity = commands
|
Spawn now takes a Bundle (#6054)
# Objective
Now that we can consolidate Bundles and Components under a single insert (thanks to #2975 and #6039), almost 100% of world spawns now look like `world.spawn().insert((Some, Tuple, Here))`. Spawning an entity without any components is an extremely uncommon pattern, so it makes sense to give spawn the "first class" ergonomic api. This consolidated api should be made consistent across all spawn apis (such as World and Commands).
## Solution
All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input:
```rust
// before:
commands
.spawn()
.insert((A, B, C));
world
.spawn()
.insert((A, B, C);
// after
commands.spawn((A, B, C));
world.spawn((A, B, C));
```
All existing instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api. A new `spawn_empty` has been added, replacing the old `spawn` api.
By allowing `world.spawn(some_bundle)` to replace `world.spawn().insert(some_bundle)`, this opened the door to removing the initial entity allocation in the "empty" archetype / table done in `spawn()` (and subsequent move to the actual archetype in `.insert(some_bundle)`).
This improves spawn performance by over 10%:
![image](https://user-images.githubusercontent.com/2694663/191627587-4ab2f949-4ccd-4231-80eb-80dd4d9ad6b9.png)
To take this measurement, I added a new `world_spawn` benchmark.
Unfortunately, optimizing `Commands::spawn` is slightly less trivial, as Commands expose the Entity id of spawned entities prior to actually spawning. Doing the optimization would (naively) require assurances that the `spawn(some_bundle)` command is applied before all other commands involving the entity (which would not necessarily be true, if memory serves). Optimizing `Commands::spawn` this way does feel possible, but it will require careful thought (and maybe some additional checks), which deserves its own PR. For now, it has the same performance characteristics of the current `Commands::spawn_bundle` on main.
**Note that 99% of this PR is simple renames and refactors. The only code that needs careful scrutiny is the new `World::spawn()` impl, which is relatively straightforward, but it has some new unsafe code (which re-uses battle tested BundlerSpawner code path).**
---
## Changelog
- All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input
- All instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api
- World and Commands now have `spawn_empty()`, which is equivalent to the old `spawn()` behavior.
## Migration Guide
```rust
// Old (0.8):
commands
.spawn()
.insert_bundle((A, B, C));
// New (0.9)
commands.spawn((A, B, C));
// Old (0.8):
commands.spawn_bundle((A, B, C));
// New (0.9)
commands.spawn((A, B, C));
// Old (0.8):
let entity = commands.spawn().id();
// New (0.9)
let entity = commands.spawn_empty().id();
// Old (0.8)
let entity = world.spawn().id();
// New (0.9)
let entity = world.spawn_empty();
```
2022-09-23 19:55:54 +00:00
|
|
|
.spawn((
|
2021-12-22 20:59:48 +00:00
|
|
|
ShadowView {
|
|
|
|
depth_texture_view,
|
|
|
|
pass_name: format!(
|
|
|
|
"shadow pass point light {} {}",
|
|
|
|
light_index,
|
|
|
|
face_index_to_name(face_index)
|
|
|
|
),
|
|
|
|
},
|
|
|
|
ExtractedView {
|
2022-09-15 21:58:14 +00:00
|
|
|
viewport: UVec4::new(
|
|
|
|
0,
|
|
|
|
0,
|
|
|
|
point_light_shadow_map.size as u32,
|
|
|
|
point_light_shadow_map.size as u32,
|
|
|
|
),
|
2021-12-22 20:59:48 +00:00
|
|
|
transform: view_translation * *view_rotation,
|
2023-01-25 12:35:39 +00:00
|
|
|
view_projection: None,
|
2021-12-22 20:59:48 +00:00
|
|
|
projection: cube_face_projection,
|
2022-10-26 20:13:59 +00:00
|
|
|
hdr: false,
|
2023-02-19 20:38:13 +00:00
|
|
|
color_grading: Default::default(),
|
2021-12-22 20:59:48 +00:00
|
|
|
},
|
|
|
|
RenderPhase::<Shadow>::default(),
|
|
|
|
LightEntity::Point {
|
|
|
|
light_entity,
|
|
|
|
face_index,
|
|
|
|
},
|
|
|
|
))
|
|
|
|
.id();
|
|
|
|
view_lights.push(view_light_entity);
|
2021-11-19 21:16:58 +00:00
|
|
|
}
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
// spot lights
|
|
|
|
for (light_index, &(light_entity, light)) in point_lights
|
|
|
|
.iter()
|
|
|
|
.skip(point_light_count)
|
|
|
|
.take(spot_light_shadow_maps_count)
|
|
|
|
.enumerate()
|
|
|
|
{
|
|
|
|
let spot_view_matrix = spot_light_view_matrix(&light.transform);
|
2022-07-16 00:51:12 +00:00
|
|
|
let spot_view_transform = spot_view_matrix.into();
|
2022-07-08 19:57:43 +00:00
|
|
|
|
|
|
|
let angle = light.spot_light_angles.expect("lights should be sorted so that \
|
2022-07-25 16:24:54 +00:00
|
|
|
[point_light_count..point_light_count + spot_light_shadow_maps_count] are spot lights").1;
|
2022-07-08 19:57:43 +00:00
|
|
|
let spot_projection = spot_light_projection_matrix(angle);
|
|
|
|
|
|
|
|
let depth_texture_view =
|
|
|
|
directional_light_depth_texture
|
|
|
|
.texture
|
|
|
|
.create_view(&TextureViewDescriptor {
|
|
|
|
label: Some("spot_light_shadow_map_texture_view"),
|
|
|
|
format: None,
|
|
|
|
dimension: Some(TextureViewDimension::D2),
|
|
|
|
aspect: TextureAspect::All,
|
|
|
|
base_mip_level: 0,
|
|
|
|
mip_level_count: None,
|
2023-01-25 12:35:39 +00:00
|
|
|
base_array_layer: (num_directional_cascades_enabled + light_index) as u32,
|
2023-04-26 15:34:23 +00:00
|
|
|
array_layer_count: Some(1u32),
|
2022-07-08 19:57:43 +00:00
|
|
|
});
|
|
|
|
|
|
|
|
let view_light_entity = commands
|
Spawn now takes a Bundle (#6054)
# Objective
Now that we can consolidate Bundles and Components under a single insert (thanks to #2975 and #6039), almost 100% of world spawns now look like `world.spawn().insert((Some, Tuple, Here))`. Spawning an entity without any components is an extremely uncommon pattern, so it makes sense to give spawn the "first class" ergonomic api. This consolidated api should be made consistent across all spawn apis (such as World and Commands).
## Solution
All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input:
```rust
// before:
commands
.spawn()
.insert((A, B, C));
world
.spawn()
.insert((A, B, C);
// after
commands.spawn((A, B, C));
world.spawn((A, B, C));
```
All existing instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api. A new `spawn_empty` has been added, replacing the old `spawn` api.
By allowing `world.spawn(some_bundle)` to replace `world.spawn().insert(some_bundle)`, this opened the door to removing the initial entity allocation in the "empty" archetype / table done in `spawn()` (and subsequent move to the actual archetype in `.insert(some_bundle)`).
This improves spawn performance by over 10%:
![image](https://user-images.githubusercontent.com/2694663/191627587-4ab2f949-4ccd-4231-80eb-80dd4d9ad6b9.png)
To take this measurement, I added a new `world_spawn` benchmark.
Unfortunately, optimizing `Commands::spawn` is slightly less trivial, as Commands expose the Entity id of spawned entities prior to actually spawning. Doing the optimization would (naively) require assurances that the `spawn(some_bundle)` command is applied before all other commands involving the entity (which would not necessarily be true, if memory serves). Optimizing `Commands::spawn` this way does feel possible, but it will require careful thought (and maybe some additional checks), which deserves its own PR. For now, it has the same performance characteristics of the current `Commands::spawn_bundle` on main.
**Note that 99% of this PR is simple renames and refactors. The only code that needs careful scrutiny is the new `World::spawn()` impl, which is relatively straightforward, but it has some new unsafe code (which re-uses battle tested BundlerSpawner code path).**
---
## Changelog
- All `spawn` apis (`World::spawn`, `Commands:;spawn`, `ChildBuilder::spawn`, and `WorldChildBuilder::spawn`) now accept a bundle as input
- All instances of `spawn_bundle` have been deprecated in favor of the new `spawn` api
- World and Commands now have `spawn_empty()`, which is equivalent to the old `spawn()` behavior.
## Migration Guide
```rust
// Old (0.8):
commands
.spawn()
.insert_bundle((A, B, C));
// New (0.9)
commands.spawn((A, B, C));
// Old (0.8):
commands.spawn_bundle((A, B, C));
// New (0.9)
commands.spawn((A, B, C));
// Old (0.8):
let entity = commands.spawn().id();
// New (0.9)
let entity = commands.spawn_empty().id();
// Old (0.8)
let entity = world.spawn().id();
// New (0.9)
let entity = world.spawn_empty();
```
2022-09-23 19:55:54 +00:00
|
|
|
.spawn((
|
2022-07-08 19:57:43 +00:00
|
|
|
ShadowView {
|
|
|
|
depth_texture_view,
|
2022-10-28 21:03:01 +00:00
|
|
|
pass_name: format!("shadow pass spot light {light_index}",),
|
2022-07-08 19:57:43 +00:00
|
|
|
},
|
|
|
|
ExtractedView {
|
2022-09-15 21:58:14 +00:00
|
|
|
viewport: UVec4::new(
|
|
|
|
0,
|
|
|
|
0,
|
|
|
|
directional_light_shadow_map.size as u32,
|
|
|
|
directional_light_shadow_map.size as u32,
|
|
|
|
),
|
2022-07-08 19:57:43 +00:00
|
|
|
transform: spot_view_transform,
|
|
|
|
projection: spot_projection,
|
2023-01-25 12:35:39 +00:00
|
|
|
view_projection: None,
|
2022-10-26 20:13:59 +00:00
|
|
|
hdr: false,
|
2023-02-19 20:38:13 +00:00
|
|
|
color_grading: Default::default(),
|
2022-07-08 19:57:43 +00:00
|
|
|
},
|
|
|
|
RenderPhase::<Shadow>::default(),
|
|
|
|
LightEntity::Spot { light_entity },
|
|
|
|
))
|
|
|
|
.id();
|
|
|
|
|
|
|
|
view_lights.push(view_light_entity);
|
|
|
|
}
|
|
|
|
|
2022-11-03 07:09:51 +00:00
|
|
|
// directional lights
|
2023-01-25 12:35:39 +00:00
|
|
|
let mut directional_depth_texture_array_index = 0u32;
|
2022-11-03 07:09:51 +00:00
|
|
|
for (light_index, &(light_entity, light)) in directional_lights
|
2021-07-08 02:49:33 +00:00
|
|
|
.iter()
|
|
|
|
.enumerate()
|
2023-01-25 12:35:39 +00:00
|
|
|
.take(directional_shadow_enabled_count)
|
2021-07-08 02:49:33 +00:00
|
|
|
{
|
2023-01-25 12:35:39 +00:00
|
|
|
for (cascade_index, (cascade, bound)) in light
|
|
|
|
.cascades
|
|
|
|
.get(&entity)
|
|
|
|
.unwrap()
|
|
|
|
.iter()
|
|
|
|
.take(MAX_CASCADES_PER_LIGHT)
|
|
|
|
.zip(&light.cascade_shadow_config.bounds)
|
|
|
|
.enumerate()
|
|
|
|
{
|
|
|
|
gpu_lights.directional_lights[light_index].cascades[cascade_index] =
|
|
|
|
GpuDirectionalCascade {
|
|
|
|
view_projection: cascade.view_projection,
|
|
|
|
texel_size: cascade.texel_size,
|
|
|
|
far_bound: *bound,
|
|
|
|
};
|
2021-07-08 02:49:33 +00:00
|
|
|
|
2023-01-25 12:35:39 +00:00
|
|
|
let depth_texture_view =
|
|
|
|
directional_light_depth_texture
|
|
|
|
.texture
|
|
|
|
.create_view(&TextureViewDescriptor {
|
|
|
|
label: Some("directional_light_shadow_map_array_texture_view"),
|
|
|
|
format: None,
|
|
|
|
dimension: Some(TextureViewDimension::D2),
|
|
|
|
aspect: TextureAspect::All,
|
|
|
|
base_mip_level: 0,
|
|
|
|
mip_level_count: None,
|
|
|
|
base_array_layer: directional_depth_texture_array_index,
|
2023-04-26 15:34:23 +00:00
|
|
|
array_layer_count: Some(1u32),
|
2023-01-25 12:35:39 +00:00
|
|
|
});
|
|
|
|
directional_depth_texture_array_index += 1;
|
|
|
|
|
|
|
|
let view_light_entity = commands
|
|
|
|
.spawn((
|
|
|
|
ShadowView {
|
|
|
|
depth_texture_view,
|
|
|
|
pass_name: format!(
|
|
|
|
"shadow pass directional light {light_index} cascade {cascade_index}"),
|
|
|
|
},
|
|
|
|
ExtractedView {
|
|
|
|
viewport: UVec4::new(
|
|
|
|
0,
|
|
|
|
0,
|
|
|
|
directional_light_shadow_map.size as u32,
|
|
|
|
directional_light_shadow_map.size as u32,
|
|
|
|
),
|
|
|
|
transform: GlobalTransform::from(cascade.view_transform),
|
|
|
|
projection: cascade.projection,
|
|
|
|
view_projection: Some(cascade.view_projection),
|
|
|
|
hdr: false,
|
2023-02-19 20:38:13 +00:00
|
|
|
color_grading: Default::default(),
|
2023-01-25 12:35:39 +00:00
|
|
|
},
|
|
|
|
RenderPhase::<Shadow>::default(),
|
|
|
|
LightEntity::Directional {
|
|
|
|
light_entity,
|
|
|
|
cascade_index,
|
|
|
|
},
|
|
|
|
))
|
|
|
|
.id();
|
|
|
|
view_lights.push(view_light_entity);
|
|
|
|
}
|
2021-07-08 02:49:33 +00:00
|
|
|
}
|
2022-11-03 07:09:51 +00:00
|
|
|
|
2021-07-08 02:49:33 +00:00
|
|
|
let point_light_depth_texture_view =
|
|
|
|
point_light_depth_texture
|
2021-07-01 23:48:55 +00:00
|
|
|
.texture
|
|
|
|
.create_view(&TextureViewDescriptor {
|
2021-10-03 19:04:37 +00:00
|
|
|
label: Some("point_light_shadow_map_array_texture_view"),
|
2021-07-01 23:48:55 +00:00
|
|
|
format: None,
|
2021-12-22 20:59:48 +00:00
|
|
|
#[cfg(not(feature = "webgl"))]
|
2021-07-01 23:48:55 +00:00
|
|
|
dimension: Some(TextureViewDimension::CubeArray),
|
2021-12-22 20:59:48 +00:00
|
|
|
#[cfg(feature = "webgl")]
|
|
|
|
dimension: Some(TextureViewDimension::Cube),
|
2021-07-01 23:48:55 +00:00
|
|
|
aspect: TextureAspect::All,
|
|
|
|
base_mip_level: 0,
|
|
|
|
mip_level_count: None,
|
2021-07-02 01:05:20 +00:00
|
|
|
base_array_layer: 0,
|
2021-07-01 23:48:55 +00:00
|
|
|
array_layer_count: None,
|
|
|
|
});
|
2021-07-08 02:49:33 +00:00
|
|
|
let directional_light_depth_texture_view = directional_light_depth_texture
|
|
|
|
.texture
|
|
|
|
.create_view(&TextureViewDescriptor {
|
2021-10-03 19:04:37 +00:00
|
|
|
label: Some("directional_light_shadow_map_array_texture_view"),
|
2021-07-08 02:49:33 +00:00
|
|
|
format: None,
|
2021-12-22 20:59:48 +00:00
|
|
|
#[cfg(not(feature = "webgl"))]
|
2021-07-08 02:49:33 +00:00
|
|
|
dimension: Some(TextureViewDimension::D2Array),
|
2021-12-22 20:59:48 +00:00
|
|
|
#[cfg(feature = "webgl")]
|
|
|
|
dimension: Some(TextureViewDimension::D2),
|
2021-07-08 02:49:33 +00:00
|
|
|
aspect: TextureAspect::All,
|
|
|
|
base_mip_level: 0,
|
|
|
|
mip_level_count: None,
|
|
|
|
base_array_layer: 0,
|
|
|
|
array_layer_count: None,
|
|
|
|
});
|
2021-07-01 23:48:55 +00:00
|
|
|
|
2022-09-21 21:47:53 +00:00
|
|
|
commands.entity(entity).insert((
|
2021-11-19 21:16:58 +00:00
|
|
|
ViewShadowBindings {
|
|
|
|
point_light_depth_texture: point_light_depth_texture.texture,
|
|
|
|
point_light_depth_texture_view,
|
|
|
|
directional_light_depth_texture: directional_light_depth_texture.texture,
|
|
|
|
directional_light_depth_texture_view,
|
|
|
|
},
|
|
|
|
ViewLightEntities {
|
|
|
|
lights: view_lights,
|
|
|
|
},
|
|
|
|
ViewLightsUniformOffset {
|
|
|
|
offset: light_meta.view_gpu_lights.push(gpu_lights),
|
|
|
|
},
|
|
|
|
));
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
Sprite Batching (#3060)
This implements the following:
* **Sprite Batching**: Collects sprites in a vertex buffer to draw many sprites with a single draw call. Sprites are batched by their `Handle<Image>` within a specific z-level. When possible, sprites are opportunistically batched _across_ z-levels (when no sprites with a different texture exist between two sprites with the same texture on different z levels). With these changes, I can now get ~130,000 sprites at 60fps on the `bevymark_pipelined` example.
* **Sprite Color Tints**: The `Sprite` type now has a `color` field. Non-white color tints result in a specialized render pipeline that passes the color in as a vertex attribute. I chose to specialize this because passing vertex colors has a measurable price (without colors I get ~130,000 sprites on bevymark, with colors I get ~100,000 sprites). "Colored" sprites cannot be batched with "uncolored" sprites, but I think this is fine because the chance of a "colored" sprite needing to batch with other "colored" sprites is generally probably way higher than an "uncolored" sprite needing to batch with a "colored" sprite.
* **Sprite Flipping**: Sprites can be flipped on their x or y axis using `Sprite::flip_x` and `Sprite::flip_y`. This is also true for `TextureAtlasSprite`.
* **Simpler BufferVec/UniformVec/DynamicUniformVec Clearing**: improved the clearing interface by removing the need to know the size of the final buffer at the initial clear.
![image](https://user-images.githubusercontent.com/2694663/140001821-99be0d96-025d-489e-9bfa-ba19c1dc9548.png)
Note that this moves sprites away from entity-driven rendering and back to extracted lists. We _could_ use entities here, but it necessitates that an intermediate list is allocated / populated to collect and sort extracted sprites. This redundant copy, combined with the normal overhead of spawning extracted sprite entities, brings bevymark down to ~80,000 sprites at 60fps. I think making sprites a bit more fixed (by default) is worth it. I view this as acceptable because batching makes normal entity-driven rendering pretty useless anyway (and we would want to batch most custom materials too). We can still support custom shaders with custom bindings, we'll just need to define a specific interface for it.
2021-11-04 20:28:53 +00:00
|
|
|
light_meta
|
|
|
|
.view_gpu_lights
|
|
|
|
.write_buffer(&render_device, &render_queue);
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
}
|
|
|
|
|
2022-03-01 10:17:41 +00:00
|
|
|
// this must match CLUSTER_COUNT_SIZE in pbr.wgsl
|
2022-04-07 16:16:35 +00:00
|
|
|
// and must be large enough to contain MAX_UNIFORM_BUFFER_POINT_LIGHTS
|
2022-07-08 19:57:43 +00:00
|
|
|
const CLUSTER_COUNT_SIZE: u32 = 9;
|
2022-03-01 10:17:41 +00:00
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
const CLUSTER_OFFSET_MASK: u32 = (1 << (32 - (CLUSTER_COUNT_SIZE * 2))) - 1;
|
2022-03-01 10:17:41 +00:00
|
|
|
const CLUSTER_COUNT_MASK: u32 = (1 << CLUSTER_COUNT_SIZE) - 1;
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
// NOTE: With uniform buffer max binding size as 16384 bytes
|
2022-07-08 19:57:43 +00:00
|
|
|
// that means we can fit 256 point lights in one uniform
|
2021-12-14 23:42:35 +00:00
|
|
|
// buffer, which means the count can be at most 256 so it
|
2022-03-01 10:17:41 +00:00
|
|
|
// needs 9 bits.
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
// The array of indices can also use u8 and that means the
|
|
|
|
// offset in to the array of indices needs to be able to address
|
2021-12-14 23:42:35 +00:00
|
|
|
// 16384 values. log2(16384) = 14 bits.
|
2022-07-08 19:57:43 +00:00
|
|
|
// We use 32 bits to store the offset and counts so
|
|
|
|
// we pack the offset into the upper 14 bits of a u32,
|
|
|
|
// the point light count into bits 9-17, and the spot light count into bits 0-8.
|
|
|
|
// [ 31 .. 18 | 17 .. 9 | 8 .. 0 ]
|
|
|
|
// [ offset | point light count | spot light count ]
|
2021-12-14 23:42:35 +00:00
|
|
|
// NOTE: This assumes CPU and GPU endianness are the same which is true
|
|
|
|
// for all common and tested x86/ARM CPUs and AMD/NVIDIA/Intel/Apple/etc GPUs
|
2022-07-08 19:57:43 +00:00
|
|
|
fn pack_offset_and_counts(offset: usize, point_count: usize, spot_count: usize) -> u32 {
|
|
|
|
((offset as u32 & CLUSTER_OFFSET_MASK) << (CLUSTER_COUNT_SIZE * 2))
|
|
|
|
| (point_count as u32 & CLUSTER_COUNT_MASK) << CLUSTER_COUNT_SIZE
|
|
|
|
| (spot_count as u32 & CLUSTER_COUNT_MASK)
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
#[derive(ShaderType)]
|
|
|
|
struct GpuClusterLightIndexListsUniform {
|
|
|
|
data: Box<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
|
|
|
|
}
|
|
|
|
|
|
|
|
// NOTE: Assert at compile time that GpuClusterLightIndexListsUniform
|
|
|
|
// fits within the maximum uniform buffer binding size
|
2022-07-03 19:55:33 +00:00
|
|
|
const _: () = assert!(GpuClusterLightIndexListsUniform::SHADER_SIZE.get() <= 16384);
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
|
|
|
|
impl Default for GpuClusterLightIndexListsUniform {
|
|
|
|
fn default() -> Self {
|
|
|
|
Self {
|
|
|
|
data: Box::new([UVec4::ZERO; ViewClusterBindings::MAX_UNIFORM_ITEMS]),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(ShaderType)]
|
|
|
|
struct GpuClusterOffsetsAndCountsUniform {
|
|
|
|
data: Box<[UVec4; ViewClusterBindings::MAX_UNIFORM_ITEMS]>,
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Default for GpuClusterOffsetsAndCountsUniform {
|
|
|
|
fn default() -> Self {
|
|
|
|
Self {
|
|
|
|
data: Box::new([UVec4::ZERO; ViewClusterBindings::MAX_UNIFORM_ITEMS]),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(ShaderType, Default)]
|
|
|
|
struct GpuClusterLightIndexListsStorage {
|
|
|
|
#[size(runtime)]
|
|
|
|
data: Vec<u32>,
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(ShaderType, Default)]
|
|
|
|
struct GpuClusterOffsetsAndCountsStorage {
|
|
|
|
#[size(runtime)]
|
2022-07-08 19:57:43 +00:00
|
|
|
data: Vec<UVec4>,
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
enum ViewClusterBuffers {
|
|
|
|
Uniform {
|
|
|
|
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists: UniformBuffer<GpuClusterLightIndexListsUniform>,
|
2022-04-07 16:16:35 +00:00
|
|
|
// NOTE: UVec4 is because all arrays in Std140 layout have 16-byte alignment
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_offsets_and_counts: UniformBuffer<GpuClusterOffsetsAndCountsUniform>,
|
2022-04-07 16:16:35 +00:00
|
|
|
},
|
|
|
|
Storage {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists: StorageBuffer<GpuClusterLightIndexListsStorage>,
|
|
|
|
cluster_offsets_and_counts: StorageBuffer<GpuClusterOffsetsAndCountsStorage>,
|
2022-04-07 16:16:35 +00:00
|
|
|
},
|
|
|
|
}
|
|
|
|
|
|
|
|
impl ViewClusterBuffers {
|
|
|
|
fn new(buffer_binding_type: BufferBindingType) -> Self {
|
|
|
|
match buffer_binding_type {
|
|
|
|
BufferBindingType::Storage { .. } => Self::storage(),
|
|
|
|
BufferBindingType::Uniform => Self::uniform(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn uniform() -> Self {
|
|
|
|
ViewClusterBuffers::Uniform {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists: UniformBuffer::default(),
|
|
|
|
cluster_offsets_and_counts: UniformBuffer::default(),
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn storage() -> Self {
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_light_index_lists: StorageBuffer::default(),
|
|
|
|
cluster_offsets_and_counts: StorageBuffer::default(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#[derive(Component)]
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
pub struct ViewClusterBindings {
|
|
|
|
n_indices: usize,
|
|
|
|
n_offsets: usize,
|
2022-04-07 16:16:35 +00:00
|
|
|
buffers: ViewClusterBuffers,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
impl ViewClusterBindings {
|
|
|
|
pub const MAX_OFFSETS: usize = 16384 / 4;
|
|
|
|
const MAX_UNIFORM_ITEMS: usize = Self::MAX_OFFSETS / 4;
|
2022-03-08 04:56:42 +00:00
|
|
|
pub const MAX_INDICES: usize = 16384;
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
pub fn new(buffer_binding_type: BufferBindingType) -> Self {
|
|
|
|
Self {
|
|
|
|
n_indices: 0,
|
|
|
|
n_offsets: 0,
|
|
|
|
buffers: ViewClusterBuffers::new(buffer_binding_type),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
pub fn clear(&mut self) {
|
2022-04-07 16:16:35 +00:00
|
|
|
match &mut self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
} => {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
*cluster_light_index_lists.get_mut().data = [UVec4::ZERO; Self::MAX_UNIFORM_ITEMS];
|
|
|
|
*cluster_offsets_and_counts.get_mut().data = [UVec4::ZERO; Self::MAX_UNIFORM_ITEMS];
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
..
|
|
|
|
} => {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists.get_mut().data.clear();
|
|
|
|
cluster_offsets_and_counts.get_mut().data.clear();
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
2022-07-08 19:57:43 +00:00
|
|
|
pub fn push_offset_and_counts(&mut self, offset: usize, point_count: usize, spot_count: usize) {
|
2022-04-07 16:16:35 +00:00
|
|
|
match &mut self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
..
|
|
|
|
} => {
|
|
|
|
let array_index = self.n_offsets >> 2; // >> 2 is equivalent to / 4
|
|
|
|
if array_index >= Self::MAX_UNIFORM_ITEMS {
|
|
|
|
warn!("cluster offset and count out of bounds!");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
let component = self.n_offsets & ((1 << 2) - 1);
|
2022-07-08 19:57:43 +00:00
|
|
|
let packed = pack_offset_and_counts(offset, point_count, spot_count);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_offsets_and_counts.get_mut().data[array_index][component] = packed;
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
..
|
|
|
|
} => {
|
2022-07-08 19:57:43 +00:00
|
|
|
cluster_offsets_and_counts.get_mut().data.push(UVec4::new(
|
|
|
|
offset as u32,
|
|
|
|
point_count as u32,
|
|
|
|
spot_count as u32,
|
|
|
|
0,
|
|
|
|
));
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
self.n_offsets += 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn n_indices(&self) -> usize {
|
|
|
|
self.n_indices
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn push_index(&mut self, index: usize) {
|
2022-04-07 16:16:35 +00:00
|
|
|
match &mut self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
..
|
|
|
|
} => {
|
|
|
|
let array_index = self.n_indices >> 4; // >> 4 is equivalent to / 16
|
|
|
|
let component = (self.n_indices >> 2) & ((1 << 2) - 1);
|
|
|
|
let sub_index = self.n_indices & ((1 << 2) - 1);
|
2022-07-08 19:57:43 +00:00
|
|
|
let index = index as u32;
|
2022-04-07 16:16:35 +00:00
|
|
|
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists.get_mut().data[array_index][component] |=
|
2022-04-07 16:16:35 +00:00
|
|
|
index << (8 * sub_index);
|
|
|
|
}
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
..
|
|
|
|
} => {
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
cluster_light_index_lists.get_mut().data.push(index as u32);
|
2022-04-07 16:16:35 +00:00
|
|
|
}
|
|
|
|
}
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
self.n_indices += 1;
|
|
|
|
}
|
2022-04-07 16:16:35 +00:00
|
|
|
|
|
|
|
pub fn write_buffers(&mut self, render_device: &RenderDevice, render_queue: &RenderQueue) {
|
|
|
|
match &mut self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
} => {
|
|
|
|
cluster_light_index_lists.write_buffer(render_device, render_queue);
|
|
|
|
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
|
|
|
|
}
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
} => {
|
|
|
|
cluster_light_index_lists.write_buffer(render_device, render_queue);
|
|
|
|
cluster_offsets_and_counts.write_buffer(render_device, render_queue);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn light_index_lists_binding(&self) -> Option<BindingResource> {
|
|
|
|
match &self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
..
|
|
|
|
} => cluster_light_index_lists.binding(),
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_light_index_lists,
|
|
|
|
..
|
|
|
|
} => cluster_light_index_lists.binding(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn offsets_and_counts_binding(&self) -> Option<BindingResource> {
|
|
|
|
match &self.buffers {
|
|
|
|
ViewClusterBuffers::Uniform {
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
..
|
|
|
|
} => cluster_offsets_and_counts.binding(),
|
|
|
|
ViewClusterBuffers::Storage {
|
|
|
|
cluster_offsets_and_counts,
|
|
|
|
..
|
|
|
|
} => cluster_offsets_and_counts.binding(),
|
|
|
|
}
|
|
|
|
}
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
|
|
|
|
pub fn min_size_cluster_light_index_lists(
|
|
|
|
buffer_binding_type: BufferBindingType,
|
|
|
|
) -> NonZeroU64 {
|
|
|
|
match buffer_binding_type {
|
|
|
|
BufferBindingType::Storage { .. } => GpuClusterLightIndexListsStorage::min_size(),
|
|
|
|
BufferBindingType::Uniform => GpuClusterLightIndexListsUniform::min_size(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn min_size_cluster_offsets_and_counts(
|
|
|
|
buffer_binding_type: BufferBindingType,
|
|
|
|
) -> NonZeroU64 {
|
|
|
|
match buffer_binding_type {
|
|
|
|
BufferBindingType::Storage { .. } => GpuClusterOffsetsAndCountsStorage::min_size(),
|
|
|
|
BufferBindingType::Uniform => GpuClusterOffsetsAndCountsUniform::min_size(),
|
|
|
|
}
|
|
|
|
}
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
pub fn prepare_clusters(
|
|
|
|
mut commands: Commands,
|
|
|
|
render_device: Res<RenderDevice>,
|
|
|
|
render_queue: Res<RenderQueue>,
|
2022-04-07 16:16:35 +00:00
|
|
|
mesh_pipeline: Res<MeshPipeline>,
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
global_light_meta: Res<GlobalLightMeta>,
|
|
|
|
views: Query<
|
|
|
|
(
|
|
|
|
Entity,
|
|
|
|
&ExtractedClusterConfig,
|
|
|
|
&ExtractedClustersPointLights,
|
|
|
|
),
|
|
|
|
With<RenderPhase<Transparent3d>>,
|
|
|
|
>,
|
|
|
|
) {
|
2022-04-07 16:16:35 +00:00
|
|
|
let render_device = render_device.into_inner();
|
|
|
|
let supports_storage_buffers = matches!(
|
|
|
|
mesh_pipeline.clustered_forward_buffer_binding_type,
|
|
|
|
BufferBindingType::Storage { .. }
|
|
|
|
);
|
2022-07-11 15:28:50 +00:00
|
|
|
for (entity, cluster_config, extracted_clusters) in &views {
|
2022-04-07 16:16:35 +00:00
|
|
|
let mut view_clusters_bindings =
|
|
|
|
ViewClusterBindings::new(mesh_pipeline.clustered_forward_buffer_binding_type);
|
Migrate to encase from crevice (#4339)
# Objective
- Unify buffer APIs
- Also see #4272
## Solution
- Replace vendored `crevice` with `encase`
---
## Changelog
Changed `StorageBuffer`
Added `DynamicStorageBuffer`
Replaced `UniformVec` with `UniformBuffer`
Replaced `DynamicUniformVec` with `DynamicUniformBuffer`
## Migration Guide
### `StorageBuffer`
removed `set_body()`, `values()`, `values_mut()`, `clear()`, `push()`, `append()`
added `set()`, `get()`, `get_mut()`
### `UniformVec` -> `UniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `len()`, `is_empty()`, `capacity()`, `push()`, `reserve()`, `clear()`, `values()`
added `set()`, `get()`
### `DynamicUniformVec` -> `DynamicUniformBuffer`
renamed `uniform_buffer()` to `buffer()`
removed `capacity()`, `reserve()`
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-05-18 21:09:21 +00:00
|
|
|
view_clusters_bindings.clear();
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
let mut indices_full = false;
|
|
|
|
|
|
|
|
let mut cluster_index = 0;
|
2022-03-24 00:20:27 +00:00
|
|
|
for _y in 0..cluster_config.dimensions.y {
|
|
|
|
for _x in 0..cluster_config.dimensions.x {
|
|
|
|
for _z in 0..cluster_config.dimensions.z {
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
let offset = view_clusters_bindings.n_indices();
|
|
|
|
let cluster_lights = &extracted_clusters.data[cluster_index];
|
2022-07-08 19:57:43 +00:00
|
|
|
view_clusters_bindings.push_offset_and_counts(
|
|
|
|
offset,
|
|
|
|
cluster_lights.point_light_count,
|
|
|
|
cluster_lights.spot_light_count,
|
|
|
|
);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
if !indices_full {
|
|
|
|
for entity in cluster_lights.iter() {
|
|
|
|
if let Some(light_index) = global_light_meta.entity_to_index.get(entity)
|
|
|
|
{
|
|
|
|
if view_clusters_bindings.n_indices()
|
|
|
|
>= ViewClusterBindings::MAX_INDICES
|
2022-04-07 16:16:35 +00:00
|
|
|
&& !supports_storage_buffers
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
{
|
|
|
|
warn!("Cluster light index lists is full! The PointLights in the view are affecting too many clusters.");
|
|
|
|
indices_full = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
view_clusters_bindings.push_index(*light_index);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
cluster_index += 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-04-07 16:16:35 +00:00
|
|
|
view_clusters_bindings.write_buffers(render_device, &render_queue);
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
|
|
|
|
commands.get_or_spawn(entity).insert(view_clusters_bindings);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-04 21:47:57 +00:00
|
|
|
#[allow(clippy::too_many_arguments)]
|
2023-03-02 08:21:21 +00:00
|
|
|
pub fn queue_shadows<M: Material>(
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
shadow_draw_functions: Res<DrawFunctions<Shadow>>,
|
2023-03-02 08:21:21 +00:00
|
|
|
prepass_pipeline: Res<PrepassPipeline<M>>,
|
|
|
|
casting_meshes: Query<(&Handle<Mesh>, &Handle<M>), Without<NotShadowCaster>>,
|
2021-11-04 21:47:57 +00:00
|
|
|
render_meshes: Res<RenderAssets<Mesh>>,
|
2023-03-02 08:21:21 +00:00
|
|
|
render_materials: Res<RenderMaterials<M>>,
|
|
|
|
mut pipelines: ResMut<SpecializedMeshPipelines<PrepassPipeline<M>>>,
|
2023-01-16 15:41:14 +00:00
|
|
|
pipeline_cache: Res<PipelineCache>,
|
2023-01-25 12:35:39 +00:00
|
|
|
view_lights: Query<(Entity, &ViewLightEntities)>,
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
mut view_light_shadow_phases: Query<(&LightEntity, &mut RenderPhase<Shadow>)>,
|
|
|
|
point_light_entities: Query<&CubemapVisibleEntities, With<ExtractedPointLight>>,
|
2023-01-25 12:35:39 +00:00
|
|
|
directional_light_entities: Query<&CascadesVisibleEntities, With<ExtractedDirectionalLight>>,
|
2022-07-08 19:57:43 +00:00
|
|
|
spot_light_entities: Query<&VisibleEntities, With<ExtractedPointLight>>,
|
2023-03-02 08:21:21 +00:00
|
|
|
) where
|
|
|
|
M::Data: PartialEq + Eq + Hash + Clone,
|
|
|
|
{
|
2023-01-25 12:35:39 +00:00
|
|
|
for (entity, view_lights) in &view_lights {
|
2023-03-02 08:21:21 +00:00
|
|
|
let draw_shadow_mesh = shadow_draw_functions.read().id::<DrawPrepass<M>>();
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
for view_light_entity in view_lights.lights.iter().copied() {
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
let (light_entity, mut shadow_phase) =
|
|
|
|
view_light_shadow_phases.get_mut(view_light_entity).unwrap();
|
2023-01-25 12:35:39 +00:00
|
|
|
let is_directional_light = matches!(light_entity, LightEntity::Directional { .. });
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
let visible_entities = match light_entity {
|
2023-01-25 12:35:39 +00:00
|
|
|
LightEntity::Directional {
|
|
|
|
light_entity,
|
|
|
|
cascade_index,
|
|
|
|
} => directional_light_entities
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
.get(*light_entity)
|
2023-01-25 12:35:39 +00:00
|
|
|
.expect("Failed to get directional light visible entities")
|
|
|
|
.entities
|
|
|
|
.get(&entity)
|
|
|
|
.expect("Failed to get directional light visible entities for view")
|
|
|
|
.get(*cascade_index)
|
|
|
|
.expect("Failed to get directional light visible entities for cascade"),
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
LightEntity::Point {
|
|
|
|
light_entity,
|
|
|
|
face_index,
|
|
|
|
} => point_light_entities
|
|
|
|
.get(*light_entity)
|
|
|
|
.expect("Failed to get point light visible entities")
|
|
|
|
.get(*face_index),
|
2022-07-08 19:57:43 +00:00
|
|
|
LightEntity::Spot { light_entity } => spot_light_entities
|
|
|
|
.get(*light_entity)
|
|
|
|
.expect("Failed to get spot light visible entities"),
|
Frustum culling (#2861)
# Objective
Implement frustum culling for much better performance on more complex scenes. With the Amazon Lumberyard Bistro scene, I was getting roughly 15fps without frustum culling and 60+fps with frustum culling on a MacBook Pro 16 with i9 9980HK 8c/16t CPU and Radeon Pro 5500M.
macOS does weird things with vsync so even though vsync was off, it really looked like sometimes other applications or the desktop window compositor were interfering, but the difference could be even more as I even saw up to 90+fps sometimes.
## Solution
- Until the https://github.com/bevyengine/rfcs/pull/12 RFC is completed, I wanted to implement at least some of the bounding volume functionality we needed to be able to unblock a bunch of rendering features and optimisations such as frustum culling, fitting the directional light orthographic projection to the relevant meshes in the view, clustered forward rendering, etc.
- I have added `Aabb`, `Frustum`, and `Sphere` types with only the necessary intersection tests for the algorithms used. I also added `CubemapFrusta` which contains a `[Frustum; 6]` and can be used by cube maps such as environment maps, and point light shadow maps.
- I did do a bit of benchmarking and optimisation on the intersection tests. I compared the [rafx parallel-comparison bitmask approach](https://github.com/aclysma/rafx/blob/c91bd5fcfdfa3f4d1b43507c32d84b94ffdf1b2e/rafx-visibility/src/geometry/frustum.rs#L64-L92) with a naïve loop that has an early-out in case of a bounding volume being outside of any one of the `Frustum` planes and found them to be very similar, so I chose the simpler and more readable option. I also compared using Vec3 and Vec3A and it turned out that promoting Vec3s to Vec3A improved performance of the culling significantly due to Vec3A operations using SIMD optimisations where Vec3 uses plain scalar operations.
- When loading glTF models, the vertex attribute accessors generally store the minimum and maximum values, which allows for adding AABBs to meshes loaded from glTF for free.
- For meshes without an AABB (`PbrBundle` deliberately does not have an AABB by default), a system is executed that scans over the vertex positions to find the minimum and maximum values along each axis. This is used to construct the AABB.
- The `Frustum::intersects_obb` and `Sphere::insersects_obb` algorithm is from Foundations of Game Engine Development 2: Rendering by Eric Lengyel. There is no OBB type, yet, rather an AABB and the model matrix are passed in as arguments. This calculates a 'relative radius' of the AABB with respect to the plane normal (the plane normal in the Sphere case being something I came up with as the direction pointing from the centre of the sphere to the centre of the AABB) such that it can then do a sphere-sphere intersection test in practice.
- `RenderLayers` were copied over from the current renderer.
- `VisibleEntities` was copied over from the current renderer and a `CubemapVisibleEntities` was added to support `PointLight`s for now. `VisibleEntities` are added to views (cameras and lights) and contain a `Vec<Entity>` that is populated by culling/visibility systems that run in PostUpdate of the app world, and are iterated over in the render world for, for example, queuing up meshes to be drawn by lights for shadow maps and the main pass for cameras.
- `Visibility` and `ComputedVisibility` components were added. The `Visibility` component is user-facing so that, for example, the entity can be marked as not visible in an editor. `ComputedVisibility` on the other hand is the result of the culling/visibility systems and takes `Visibility` into account. So if an entity is marked as not being visible in its `Visibility` component, that will skip culling/visibility intersection tests and just mark the `ComputedVisibility` as false.
- The `ComputedVisibility` is used to decide which meshes to extract.
- I had to add a way to get the far plane from the `CameraProjection` in order to define an explicit far frustum plane for culling. This should perhaps be optional as it is not always desired and in that case, testing 5 planes instead of 6 is a performance win.
I think that's about all. I discussed some of the design with @cart on Discord already so hopefully it's not too far from being mergeable. It works well at least. 😄
2021-11-07 21:45:52 +00:00
|
|
|
};
|
2021-11-19 21:16:58 +00:00
|
|
|
// NOTE: Lights with shadow mapping disabled will have no visible entities
|
Clustered forward rendering (#3153)
# Objective
Implement clustered-forward rendering.
## Solution
~~FIXME - in the interest of keeping the merge train moving, I'm submitting this PR now before the description is ready. I want to add in some comments into the code with references for the various bits and pieces and I want to describe some of the key decisions I made here. I'll do that as soon as I can.~~ Anyone reviewing is welcome to add review comments where you want to know more about how something or other works.
* The summary of the technique is that the view frustum is divided into a grid of sub-volumes called clusters, point lights are tested against each of the clusters to see if they would affect that volume within the scene and if so, added to a list of lights affecting that cluster. Then when shading a fragment which is a point on the surface of a mesh within the scene, the point is mapped to a cluster and only the lights affecting that clusters are used in lighting calculations. This brings huge performance and scalability benefits as most of the time lights are placed so that there are not that many that overlap each other in terms of their sphere of influence, but there may be many distinct point lights visible in the scene. Doing all the lighting calculations for all visible lights in the scene for every pixel on the screen quickly becomes a performance limitation. Clustered forward rendering allows us to make an approximate list of lights that affect each pixel, indeed each surface in the scene (as it works along the view z axis too, unlike tiled/forward+).
* WebGL2 is a platform we want to support and it does not support storage buffers. Uniform buffer bindings are limited to a maximum of 16384 bytes per binding. I used bit shifting and masking to pack the cluster light lists and various indices into a uniform buffer and the 16kB limit is very likely the first bottleneck in scaling the number of lights in a scene at the moment if the lights can affect many clusters due to their range or proximity to the camera (there are a lot of clusters close to the camera, which is an area for improvement). We could store the information in textures instead of uniform buffers to remove this bottleneck though I don’t know if there are performance implications to reading from textures instead if uniform buffers.
* Because of the uniform buffer binding size limitations we can support a maximum of 256 lights with the current size of the PointLight struct
* The z-slicing method (i.e. the mapping from view space z to a depth slice which defines the near and far planes of a cluster) is using the Doom 2016 method. I need to add comments with references to this. It’s an exponential function that simplifies well for the purposes of optimising the fragment shader. xy grid divisions are regular in screen space.
* Some optimisation work was done on the allocation of lights to clusters, which involves intersection tests, and for this number of clusters and lights the system has insignificant cost using a fairly naïve algorithm. I think for more lights / finer-grained clusters we could use a BVH, but at some point it would be just much better to use compute shaders and storage buffers.
* Something else to note is that it is absolutely infeasible to use plain cube map point light shadow mapping for many lights. It does not scale in terms of performance nor memory usage. There are some interesting methods I saw discussed in reference material that I will add a link to which render and update shadow maps piece-wise, but they also need compute shaders to work well. Basically for now you need to sacrifice point light shadows for all but a handful of point lights if you don’t want to kill performance. I set the limit to 10 but that’s just what we had from before where 10 was the maximum number of point lights before this PR.
* I added a couple of debug visualisations behind a shader def that were useful for seeing performance impact of light distribution - I should make the debug mode configurable without modifying the shader code. One mode shows the number of lights affecting each cluster by tinting toward red for few lights or green for many lights (maxes out at 16, but not sure that’s a reasonable max). The other shows which cluster the surface at a fragment belongs to by tinting it with a randomish colour. This can help to understand deeper performance issues due to screen space tiles spanning multiple clusters in depth with divergent shader execution times.
Also, there are more things that could be done as improvements, and I will document those somewhere (I'm not sure where will be the best place... in a todo alongside the code, a GitHub issue, somewhere else?) but I think it works well enough and brings significant performance and scalability benefits that it's worth integrating already now and then iterating on.
* Calculate the light’s effective range based on its intensity and physical falloff and either just use this, or take the minimum of the user-supplied range and this. This would avoid unnecessary lighting calculations for clusters that cannot be affected. This would need to take into account HDR tone mapping as in my not-fully-understanding-the-details understanding, the threshold is relative to how bright the scene is.
* Improve the z-slicing to use a larger first slice.
* More gracefully handle the cluster light list uniform buffer binding size limitations by prioritising which lights are included (some heuristic for most significant like closest to the camera, brightest, affecting the most pixels, …)
* Switch to using a texture instead of uniform buffer
* Figure out the / a better story for shadows
I will also probably add an example that demonstrates some of the issues:
* What situations exhaust the space available in the uniform buffers
* Light range too large making lights affect many clusters and so exhausting the space for the lists of lights that affect clusters
* Light range set to be too small producing visible artifacts where clusters the light would physically affect are not affected by the light
* Perhaps some performance issues
* How many lights can be closely packed or affect large portions of the view before performance drops?
2021-12-09 03:08:54 +00:00
|
|
|
// so no meshes will be queued
|
|
|
|
for entity in visible_entities.iter().copied() {
|
2023-03-02 08:21:21 +00:00
|
|
|
if let Ok((mesh_handle, material_handle)) = casting_meshes.get(entity) {
|
|
|
|
if let (Some(mesh), Some(material)) = (
|
|
|
|
render_meshes.get(mesh_handle),
|
|
|
|
render_materials.get(material_handle),
|
|
|
|
) {
|
|
|
|
let mut mesh_key =
|
|
|
|
MeshPipelineKey::from_primitive_topology(mesh.primitive_topology)
|
|
|
|
| MeshPipelineKey::DEPTH_PREPASS;
|
2023-01-25 12:35:39 +00:00
|
|
|
if is_directional_light {
|
2023-03-02 08:21:21 +00:00
|
|
|
mesh_key |= MeshPipelineKey::DEPTH_CLAMP_ORTHO;
|
|
|
|
}
|
|
|
|
let alpha_mode = material.properties.alpha_mode;
|
|
|
|
match alpha_mode {
|
|
|
|
AlphaMode::Mask(_) => {
|
|
|
|
mesh_key |= MeshPipelineKey::ALPHA_MASK;
|
|
|
|
}
|
|
|
|
AlphaMode::Blend | AlphaMode::Premultiplied | AlphaMode::Add => {
|
|
|
|
mesh_key |= MeshPipelineKey::BLEND_PREMULTIPLIED_ALPHA;
|
|
|
|
}
|
|
|
|
_ => {}
|
2023-01-25 12:35:39 +00:00
|
|
|
}
|
Mesh vertex buffer layouts (#3959)
This PR makes a number of changes to how meshes and vertex attributes are handled, which the goal of enabling easy and flexible custom vertex attributes:
* Reworks the `Mesh` type to use the newly added `VertexAttribute` internally
* `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat`
* `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting"
* Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently.
* Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key).
* Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally) . Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because its wrong! Hashes by themselves aren't enough because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR.
* Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool!
To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now:
```rust
impl SpecializedMeshPipeline for MeshPipeline {
type Key = MeshPipelineKey;
fn specialize(
&self,
key: Self::Key,
layout: &MeshVertexBufferLayout,
) -> RenderPipelineDescriptor {
let mut vertex_attributes = vec![
Mesh::ATTRIBUTE_POSITION.at_shader_location(0),
Mesh::ATTRIBUTE_NORMAL.at_shader_location(1),
Mesh::ATTRIBUTE_UV_0.at_shader_location(2),
];
let mut shader_defs = Vec::new();
if layout.contains(Mesh::ATTRIBUTE_TANGENT) {
shader_defs.push(String::from("VERTEX_TANGENTS"));
vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3));
}
let vertex_buffer_layout = layout
.get_layout(&vertex_attributes)
.expect("Mesh is missing a vertex attribute");
```
Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex postions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`.
This is still a draft because I still need to:
* Add more docs
* Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it.
* Consider breaking out the PreHash / hashbrown changes into a separate PR.
* Add an example illustrating this change
* Verify that the "mesh-specialized pipeline de-duplication code" works properly
Please dont yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap.
Alternative to #3120
Fixes #3030
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
|
|
|
let pipeline_id = pipelines.specialize(
|
2023-01-16 15:41:14 +00:00
|
|
|
&pipeline_cache,
|
2023-03-02 08:21:21 +00:00
|
|
|
&prepass_pipeline,
|
|
|
|
MaterialPipelineKey {
|
|
|
|
mesh_key,
|
|
|
|
bind_group_data: material.key.clone(),
|
|
|
|
},
|
Mesh vertex buffer layouts (#3959)
This PR makes a number of changes to how meshes and vertex attributes are handled, which the goal of enabling easy and flexible custom vertex attributes:
* Reworks the `Mesh` type to use the newly added `VertexAttribute` internally
* `VertexAttribute` defines the name, a unique `VertexAttributeId`, and a `VertexFormat`
* `VertexAttributeId` is used to produce consistent sort orders for vertex buffer generation, replacing the more expensive and often surprising "name based sorting"
* Meshes can be used to generate a `MeshVertexBufferLayout`, which defines the layout of the gpu buffer produced by the mesh. `MeshVertexBufferLayouts` can then be used to generate actual `VertexBufferLayouts` according to the requirements of a specific pipeline. This decoupling of "mesh layout" vs "pipeline vertex buffer layout" is what enables custom attributes. We don't need to standardize _mesh layouts_ or contort meshes to meet the needs of a specific pipeline. As long as the mesh has what the pipeline needs, it will work transparently.
* Mesh-based pipelines now specialize on `&MeshVertexBufferLayout` via the new `SpecializedMeshPipeline` trait (which behaves like `SpecializedPipeline`, but adds `&MeshVertexBufferLayout`). The integrity of the pipeline cache is maintained because the `MeshVertexBufferLayout` is treated as part of the key (which is fully abstracted from implementers of the trait ... no need to add any additional info to the specialization key).
* Hashing `MeshVertexBufferLayout` is too expensive to do for every entity, every frame. To make this scalable, I added a generalized "pre-hashing" solution to `bevy_utils`: `Hashed<T>` keys and `PreHashMap<K, V>` (which uses `Hashed<T>` internally) . Why didn't I just do the quick and dirty in-place "pre-compute hash and use that u64 as a key in a hashmap" that we've done in the past? Because its wrong! Hashes by themselves aren't enough because two different values can produce the same hash. Re-hashing a hash is even worse! I decided to build a generalized solution because this pattern has come up in the past and we've chosen to do the wrong thing. Now we can do the right thing! This did unfortunately require pulling in `hashbrown` and using that in `bevy_utils`, because avoiding re-hashes requires the `raw_entry_mut` api, which isn't stabilized yet (and may never be ... `entry_ref` has favor now, but also isn't available yet). If std's HashMap ever provides the tools we need, we can move back to that. Note that adding `hashbrown` doesn't increase our dependency count because it was already in our tree. I will probably break these changes out into their own PR.
* Specializing on `MeshVertexBufferLayout` has one non-obvious behavior: it can produce identical pipelines for two different MeshVertexBufferLayouts. To optimize the number of active pipelines / reduce re-binds while drawing, I de-duplicate pipelines post-specialization using the final `VertexBufferLayout` as the key. For example, consider a pipeline that needs the layout `(position, normal)` and is specialized using two meshes: `(position, normal, uv)` and `(position, normal, other_vec2)`. If both of these meshes result in `(position, normal)` specializations, we can use the same pipeline! Now we do. Cool!
To briefly illustrate, this is what the relevant section of `MeshPipeline`'s specialization code looks like now:
```rust
impl SpecializedMeshPipeline for MeshPipeline {
type Key = MeshPipelineKey;
fn specialize(
&self,
key: Self::Key,
layout: &MeshVertexBufferLayout,
) -> RenderPipelineDescriptor {
let mut vertex_attributes = vec![
Mesh::ATTRIBUTE_POSITION.at_shader_location(0),
Mesh::ATTRIBUTE_NORMAL.at_shader_location(1),
Mesh::ATTRIBUTE_UV_0.at_shader_location(2),
];
let mut shader_defs = Vec::new();
if layout.contains(Mesh::ATTRIBUTE_TANGENT) {
shader_defs.push(String::from("VERTEX_TANGENTS"));
vertex_attributes.push(Mesh::ATTRIBUTE_TANGENT.at_shader_location(3));
}
let vertex_buffer_layout = layout
.get_layout(&vertex_attributes)
.expect("Mesh is missing a vertex attribute");
```
Notice that this is _much_ simpler than it was before. And now any mesh with any layout can be used with this pipeline, provided it has vertex postions, normals, and uvs. We even got to remove `HAS_TANGENTS` from MeshPipelineKey and `has_tangents` from `GpuMesh`, because that information is redundant with `MeshVertexBufferLayout`.
This is still a draft because I still need to:
* Add more docs
* Experiment with adding error handling to mesh pipeline specialization (which would print errors at runtime when a mesh is missing a vertex attribute required by a pipeline). If it doesn't tank perf, we'll keep it.
* Consider breaking out the PreHash / hashbrown changes into a separate PR.
* Add an example illustrating this change
* Verify that the "mesh-specialized pipeline de-duplication code" works properly
Please dont yell at me for not doing these things yet :) Just trying to get this in peoples' hands asap.
Alternative to #3120
Fixes #3030
Co-authored-by: Carter Anderson <mcanders1@gmail.com>
2022-02-23 23:21:13 +00:00
|
|
|
&mesh.layout,
|
|
|
|
);
|
|
|
|
|
|
|
|
let pipeline_id = match pipeline_id {
|
|
|
|
Ok(id) => id,
|
|
|
|
Err(err) => {
|
|
|
|
error!("{}", err);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
shadow_phase.add(Shadow {
|
|
|
|
draw_function: draw_shadow_mesh,
|
|
|
|
pipeline: pipeline_id,
|
|
|
|
entity,
|
|
|
|
distance: 0.0, // TODO: sort back-to-front
|
|
|
|
});
|
2021-11-04 21:47:57 +00:00
|
|
|
}
|
|
|
|
}
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub struct Shadow {
|
|
|
|
pub distance: f32,
|
|
|
|
pub entity: Entity,
|
2022-03-23 00:27:26 +00:00
|
|
|
pub pipeline: CachedRenderPipelineId,
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
pub draw_function: DrawFunctionId,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
impl PhaseItem for Shadow {
|
2023-03-06 00:00:40 +00:00
|
|
|
type SortKey = usize;
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
|
2023-01-04 01:13:30 +00:00
|
|
|
#[inline]
|
|
|
|
fn entity(&self) -> Entity {
|
|
|
|
self.entity
|
|
|
|
}
|
|
|
|
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
#[inline]
|
|
|
|
fn sort_key(&self) -> Self::SortKey {
|
2023-03-06 00:00:40 +00:00
|
|
|
self.pipeline.id()
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#[inline]
|
|
|
|
fn draw_function(&self) -> DrawFunctionId {
|
|
|
|
self.draw_function
|
|
|
|
}
|
Allow unbatched render phases to use unstable sorts (#5049)
# Objective
Partially addresses #4291.
Speed up the sort phase for unbatched render phases.
## Solution
Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass.
## Performance
This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change.
On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction.
![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png)
## Future Work
There were prior discussions to add support for faster radix sorts in #4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems.
Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`.
Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while.
---
## Changelog
Added: `PhaseItem::sort`
## Migration Guide
RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`).
Co-authored-by: Federico Rinaldi <gisquerin@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: colepoirier <colepoirier@gmail.com>
2022-06-23 10:52:49 +00:00
|
|
|
|
|
|
|
#[inline]
|
|
|
|
fn sort(items: &mut [Self]) {
|
2023-03-06 00:00:40 +00:00
|
|
|
// The shadow phase is sorted by pipeline id for performance reasons.
|
|
|
|
// Grouping all draw commands using the same pipeline together performs
|
|
|
|
// better than rebinding everything at a high rate.
|
|
|
|
radsort::sort_by_key(items, |item| item.pipeline.id());
|
Allow unbatched render phases to use unstable sorts (#5049)
# Objective
Partially addresses #4291.
Speed up the sort phase for unbatched render phases.
## Solution
Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass.
## Performance
This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change.
On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction.
![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png)
## Future Work
There were prior discussions to add support for faster radix sorts in #4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems.
Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`.
Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while.
---
## Changelog
Added: `PhaseItem::sort`
## Migration Guide
RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`).
Co-authored-by: Federico Rinaldi <gisquerin@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: colepoirier <colepoirier@gmail.com>
2022-06-23 10:52:49 +00:00
|
|
|
}
|
Modular Rendering (#2831)
This changes how render logic is composed to make it much more modular. Previously, all extraction logic was centralized for a given "type" of rendered thing. For example, we extracted meshes into a vector of ExtractedMesh, which contained the mesh and material asset handles, the transform, etc. We looked up bindings for "drawn things" using their index in the `Vec<ExtractedMesh>`. This worked fine for built in rendering, but made it hard to reuse logic for "custom" rendering. It also prevented us from reusing things like "extracted transforms" across contexts.
To make rendering more modular, I made a number of changes:
* Entities now drive rendering:
* We extract "render components" from "app components" and store them _on_ entities. No more centralized uber lists! We now have true "ECS-driven rendering"
* To make this perform well, I implemented #2673 in upstream Bevy for fast batch insertions into specific entities. This was merged into the `pipelined-rendering` branch here: #2815
* Reworked the `Draw` abstraction:
* Generic `PhaseItems`: each draw phase can define its own type of "rendered thing", which can define its own "sort key"
* Ported the 2d, 3d, and shadow phases to the new PhaseItem impl (currently Transparent2d, Transparent3d, and Shadow PhaseItems)
* `Draw` trait and and `DrawFunctions` are now generic on PhaseItem
* Modular / Ergonomic `DrawFunctions` via `RenderCommands`
* RenderCommand is a trait that runs an ECS query and produces one or more RenderPass calls. Types implementing this trait can be composed to create a final DrawFunction. For example the DrawPbr DrawFunction is created from the following DrawCommand tuple. Const generics are used to set specific bind group locations:
```rust
pub type DrawPbr = (
SetPbrPipeline,
SetMeshViewBindGroup<0>,
SetStandardMaterialBindGroup<1>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* The new `custom_shader_pipelined` example illustrates how the commands above can be reused to create a custom draw function:
```rust
type DrawCustom = (
SetCustomMaterialPipeline,
SetMeshViewBindGroup<0>,
SetTransformBindGroup<2>,
DrawMesh,
);
```
* ExtractComponentPlugin and UniformComponentPlugin:
* Simple, standardized ways to easily extract individual components and write them to GPU buffers
* Ported PBR and Sprite rendering to the new primitives above.
* Removed staging buffer from UniformVec in favor of direct Queue usage
* Makes UniformVec much easier to use and more ergonomic. Completely removes the need for custom render graph nodes in these contexts (see the PbrNode and view Node removals and the much simpler call patterns in the relevant Prepare systems).
* Added a many_cubes_pipelined example to benchmark baseline 3d rendering performance and ensure there were no major regressions during this port. Avoiding regressions was challenging given that the old approach of extracting into centralized vectors is basically the "optimal" approach. However thanks to a various ECS optimizations and render logic rephrasing, we pretty much break even on this benchmark!
* Lifetimeless SystemParams: this will be a bit divisive, but as we continue to embrace "trait driven systems" (ex: ExtractComponentPlugin, UniformComponentPlugin, DrawCommand), the ergonomics of `(Query<'static, 'static, (&'static A, &'static B, &'static)>, Res<'static, C>)` were getting very hard to bear. As a compromise, I added "static type aliases" for the relevant SystemParams. The previous example can now be expressed like this: `(SQuery<(Read<A>, Read<B>)>, SRes<C>)`. If anyone has better ideas / conflicting opinions, please let me know!
* RunSystem trait: a way to define Systems via a trait with a SystemParam associated type. This is used to implement the various plugins mentioned above. I also added SystemParamItem and QueryItem type aliases to make "trait stye" ecs interactions nicer on the eyes (and fingers).
* RenderAsset retrying: ensures that render assets are only created when they are "ready" and allows us to create bind groups directly inside render assets (which significantly simplified the StandardMaterial code). I think ultimately we should swap this out on "asset dependency" events to wait for dependencies to load, but this will require significant asset system changes.
* Updated some built in shaders to account for missing MeshUniform fields
2021-09-23 06:16:11 +00:00
|
|
|
}
|
2021-06-02 02:59:17 +00:00
|
|
|
|
2022-03-23 00:27:26 +00:00
|
|
|
impl CachedRenderPipelinePhaseItem for Shadow {
|
2021-11-12 22:27:17 +00:00
|
|
|
#[inline]
|
2022-03-23 00:27:26 +00:00
|
|
|
fn cached_pipeline(&self) -> CachedRenderPipelineId {
|
2021-11-12 22:27:17 +00:00
|
|
|
self.pipeline
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-06-02 02:59:17 +00:00
|
|
|
pub struct ShadowPassNode {
|
2021-11-19 21:16:58 +00:00
|
|
|
main_view_query: QueryState<&'static ViewLightEntities>,
|
|
|
|
view_light_query: QueryState<(&'static ShadowView, &'static RenderPhase<Shadow>)>,
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
impl ShadowPassNode {
|
|
|
|
pub fn new(world: &mut World) -> Self {
|
|
|
|
Self {
|
|
|
|
main_view_query: QueryState::new(world),
|
|
|
|
view_light_query: QueryState::new(world),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Node for ShadowPassNode {
|
|
|
|
fn update(&mut self, world: &mut World) {
|
|
|
|
self.main_view_query.update_archetypes(world);
|
|
|
|
self.view_light_query.update_archetypes(world);
|
|
|
|
}
|
|
|
|
|
|
|
|
fn run(
|
|
|
|
&self,
|
|
|
|
graph: &mut RenderGraphContext,
|
2021-06-21 23:28:52 +00:00
|
|
|
render_context: &mut RenderContext,
|
2021-06-02 02:59:17 +00:00
|
|
|
world: &World,
|
|
|
|
) -> Result<(), NodeRunError> {
|
Make render graph slots optional for most cases (#8109)
# Objective
- Currently, the render graph slots are only used to pass the
view_entity around. This introduces significant boilerplate for very
little value. Instead of using slots for this, make the view_entity part
of the `RenderGraphContext`. This also means we won't need to have
`IN_VIEW` on every node and and we'll be able to use the default impl of
`Node::input()`.
## Solution
- Add `view_entity: Option<Entity>` to the `RenderGraphContext`
- Update all nodes to use this instead of entity slot input
---
## Changelog
- Add optional `view_entity` to `RenderGraphContext`
## Migration Guide
You can now get the view_entity directly from the `RenderGraphContext`.
When implementing the Node:
```rust
// 0.10
struct FooNode;
impl FooNode {
const IN_VIEW: &'static str = "view";
}
impl Node for FooNode {
fn input(&self) -> Vec<SlotInfo> {
vec![SlotInfo::new(Self::IN_VIEW, SlotType::Entity)]
}
fn run(
&self,
graph: &mut RenderGraphContext,
// ...
) -> Result<(), NodeRunError> {
let view_entity = graph.get_input_entity(Self::IN_VIEW)?;
// ...
Ok(())
}
}
// 0.11
struct FooNode;
impl Node for FooNode {
fn run(
&self,
graph: &mut RenderGraphContext,
// ...
) -> Result<(), NodeRunError> {
let view_entity = graph.view_entity();
// ...
Ok(())
}
}
```
When adding the node to the graph, you don't need to specify a slot_edge
for the view_entity.
```rust
// 0.10
let mut graph = RenderGraph::default();
graph.add_node(FooNode::NAME, node);
let input_node_id = draw_2d_graph.set_input(vec![SlotInfo::new(
graph::input::VIEW_ENTITY,
SlotType::Entity,
)]);
graph.add_slot_edge(
input_node_id,
graph::input::VIEW_ENTITY,
FooNode::NAME,
FooNode::IN_VIEW,
);
// add_node_edge ...
// 0.11
let mut graph = RenderGraph::default();
graph.add_node(FooNode::NAME, node);
// add_node_edge ...
```
## Notes
This PR paired with #8007 will help reduce a lot of annoying boilerplate
with the render nodes. Depending on which one gets merged first. It will
require a bit of clean up work to make both compatible.
I tagged this as a breaking change, because using the old system to get
the view_entity will break things because it's not a node input slot
anymore.
## Notes for reviewers
A lot of the diffs are just removing the slots in every nodes and graph
creation. The important part is mostly in the
graph_runner/CameraDriverNode.
2023-03-21 20:11:13 +00:00
|
|
|
let view_entity = graph.view_entity();
|
2021-07-02 01:05:20 +00:00
|
|
|
if let Ok(view_lights) = self.main_view_query.get_manual(world, view_entity) {
|
2021-06-18 18:21:18 +00:00
|
|
|
for view_light_entity in view_lights.lights.iter().copied() {
|
|
|
|
let (view_light, shadow_phase) = self
|
|
|
|
.view_light_query
|
|
|
|
.get_manual(world, view_light_entity)
|
|
|
|
.unwrap();
|
2022-05-02 20:22:30 +00:00
|
|
|
|
|
|
|
if shadow_phase.items.is_empty() {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2023-01-09 19:24:56 +00:00
|
|
|
let mut render_pass =
|
|
|
|
render_context.begin_tracked_render_pass(RenderPassDescriptor {
|
|
|
|
label: Some(&view_light.pass_name),
|
|
|
|
color_attachments: &[],
|
|
|
|
depth_stencil_attachment: Some(RenderPassDepthStencilAttachment {
|
|
|
|
view: &view_light.depth_texture_view,
|
|
|
|
depth_ops: Some(Operations {
|
|
|
|
load: LoadOp::Clear(0.0),
|
|
|
|
store: true,
|
|
|
|
}),
|
|
|
|
stencil_ops: None,
|
2021-06-18 18:21:18 +00:00
|
|
|
}),
|
2023-01-09 19:24:56 +00:00
|
|
|
});
|
2023-01-02 21:39:54 +00:00
|
|
|
|
|
|
|
shadow_phase.render(&mut render_pass, world, view_light_entity);
|
2021-06-18 18:21:18 +00:00
|
|
|
}
|
2021-06-02 02:59:17 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
Ok(())
|
|
|
|
}
|
|
|
|
}
|