Automatic batching/instancing of draw commands (#9685)

# Objective

- Implement the foundations of automatic batching/instancing of draw commands as the next step from #89
- NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding, such as mesh, material, and texture data.

## Solution

- The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on.
- The following assumptions have been made:
  - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases
  - View bindings are constant across a phase for a given draw function as phases are per-view
  - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size.
- Implement `GpuArrayBuffer` for `Mesh2dUniform` to store `Mesh2dUniform` in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago.
- Change the batch size to a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size, as the length of the range can be used in place of the size while the start and end can otherwise be whatever is needed.
- Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components.

## Benchmarks

All tests have been run on an M1 Max on AC power.
`bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`:

```rust
(
    exit_after: Some(1500)
)
```

...in order to run each test for 1500 frames.

The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same random values are generated. They also added benchmark modes that use a fixed delta time for animations. Combined, this means that the same frames should be rendered both on main and on the branch.

The graphs compare main (yellow) to this PR (red).

### 3D Mesh `many_cubes --benchmark`

<img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4">

The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time.

The 1000th frame is identical using the flip tool:

![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4615 seconds
```

### 3D Mesh `many_cubes --benchmark --material-texture-count 10`

<img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf">

This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'.
The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense, as randomly selecting from 10 materials gives roughly a 1 in 10 chance that two adjacent entities select the same material and can be batched, so about 10% of the draws are merged (11.7k × 0.9 ≈ 10.5k).

The 1000th frame is identical in flip:

![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4537 seconds
```

### 3D Mesh `many_cubes --benchmark --vary-per-instance`

<img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f">

This run varies the material data per instance by randomly generating its colour. This is the worst case for batching. That it performs about the same as `main` is a good thing, as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities.

The 1000th frame is identical according to flip:

![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4568 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d`

<img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4">

This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial.
Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time.

Capturing a screenshot of the 1000th frame main vs PR gives:

![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40)

```
Mean: 0.001222
Weighted median: 0.750432
1st weighted quartile: 0.453494
3rd weighted quartile: 0.969758
Min: 0.000000
Max: 0.990296
Evaluation time: 0.4255 seconds
```

So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected.

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10`

<img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c">

This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay).

![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f)

```
Mean: 0.002877
Weighted median: 0.964980
1st weighted quartile: 0.668871
3rd weighted quartile: 0.982749
Min: 0.000000
Max: 0.992377
Evaluation time: 0.4301 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance`

<img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb">

This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching.
Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws.

The 1000th frame is the same:

![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97)

```
Mean: 0.001214
Weighted median: 0.937499
1st weighted quartile: 0.635467
3rd weighted quartile: 0.979085
Min: 0.000000
Max: 0.988971
Evaluation time: 0.4462 seconds
```

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`

<img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43">

This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10`

<img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413">

This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance`

<img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6">

This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results.
### Summary

| Benchmark | % change in median frame time |
| ------------- | ------------- |
| many_cubes | 🟩 -30% |
| many_cubes 10 materials | 🟩 -5% |
| many_cubes unique materials | 🟩 ~0% |
| bevymark mesh2d | 🟩 -50% |
| bevymark mesh2d 10 materials | 🟩 -50% |
| bevymark mesh2d unique materials | 🟩 -7% |
| bevymark sprite | 🟥 2% |
| bevymark sprite 10 materials | 🟥 0.6% |
| bevymark sprite unique materials | 🟥 4.1% |

---

## Changelog

- Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance.

---------

Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
Co-authored-by: Nicola Papale <nico@nicopap.ch>
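
The core idea from the Solution section (walk the sorted phase and merge adjacent items when nothing that affects draw encoding changes) can be sketched roughly as below. This is a minimal illustration and not the code from this commit: `BatchKey`, `Item`, `batch_sorted_items` and `draw_count` are hypothetical names, the key is reduced to three integer ids, and giving merged items an empty range is an assumption made only for this sketch.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the state compared between adjacent phase items.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BatchKey {
    pipeline: u32,
    material_bind_group: u32,
    mesh: u32,
}

/// Hypothetical, simplified phase item holding only what batching needs.
#[derive(Debug)]
struct Item {
    key: BatchKey,
    /// Instance range to draw; empty means "merged into an earlier item".
    batch_range: Range<u32>,
}

/// Merges runs of adjacent items with equal keys. The first item of each run
/// ends up with a range covering the whole run; the rest get an empty range.
/// The slice is assumed to already be in sorted phase order.
fn batch_sorted_items(items: &mut [Item]) {
    let mut run_start = 0usize;
    for i in 0..items.len() {
        let instance = i as u32;
        if i > 0 && items[i].key == items[run_start].key {
            // Same state as the current run: extend the run, skip this draw.
            items[run_start].batch_range.end = instance + 1;
            items[i].batch_range = instance..instance;
        } else {
            // State changed (or first item): start a new run of length one.
            run_start = i;
            items[i].batch_range = instance..instance + 1;
        }
    }
}

/// Number of draw commands that would be encoded after batching.
fn draw_count(items: &[Item]) -> usize {
    items
        .iter()
        .filter(|item| !item.batch_range.is_empty())
        .count()
}
```

Because only adjacent items are compared, the result depends on the sort order of the phase. That is consistent with the benchmark notes above: randomly chosen materials without sorting by material bind group merge only occasionally, while one material per wave merges well.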
2024-11-10 07:04:33 +00:00 · 2023-09-22 00:12:34 +02:00 · 2023-09-22 00:12:34 +02:00 · 5c884c5a15
commit 5c884c5a15
parent e60249e59d
31 changed files with 773 additions and 304 deletions
--- a/assets/shaders/custom_gltf_2d.wgsl
+++ b/assets/shaders/custom_gltf_2d.wgsl
@@ -1,8 +1,9 @@
 #import bevy_sprite::mesh2d_view_bindings   globals
 #import bevy_sprite::mesh2d_bindings        mesh
-#import bevy_sprite::mesh2d_functions       mesh2d_position_local_to_clip
+#import bevy_sprite::mesh2d_functions       get_model_matrix, mesh2d_position_local_to_clip

 struct Vertex {
+    @builtin(instance_index) instance_index: u32,
    @location(0) position: vec3<f32>,
    @location(1) color: vec4<f32>,
    @location(2) barycentric: vec3<f32>,
@@ -17,7 +18,8 @@ struct VertexOutput {
 @vertex
 fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
-    out.clip_position = mesh2d_position_local_to_clip(mesh.model, vec4<f32>(vertex.position, 1.0));
+    let model = get_model_matrix(vertex.instance_index);
+    out.clip_position = mesh2d_position_local_to_clip(model, vec4<f32>(vertex.position, 1.0));
    out.color = vertex.color;
    out.barycentric = vertex.barycentric;
    return out;
--- a/crates/bevy_core_pipeline/src/core_2d/mod.rs
+++ b/crates/bevy_core_pipeline/src/core_2d/mod.rs
@@ -19,6 +19,8 @@ pub mod graph {
 }
 pub const CORE_2D: &str = graph::NAME;

+use std::ops::Range;
+
 pub use camera_2d::*;
 pub use main_pass_2d_node::*;

@@ -35,7 +37,7 @@ use bevy_render::{
    render_resource::CachedRenderPipelineId,
    Extract, ExtractSchedule, Render, RenderApp, RenderSet,
 };
-use bevy_utils::FloatOrd;
+use bevy_utils::{nonmax::NonMaxU32, FloatOrd};

 use crate::{tonemapping::TonemappingNode, upscaling::UpscalingNode};

@@ -83,7 +85,8 @@ pub struct Transparent2d {
    pub entity: Entity,
    pub pipeline: CachedRenderPipelineId,
    pub draw_function: DrawFunctionId,
-    pub batch_size: usize,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for Transparent2d {
@@ -111,8 +114,23 @@ impl PhaseItem for Transparent2d {
    }

    #[inline]
-    fn batch_size(&self) -> usize {
-        self.batch_size
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
    }
 }

--- a/crates/bevy_core_pipeline/src/core_3d/mod.rs
+++ b/crates/bevy_core_pipeline/src/core_3d/mod.rs
@@ -24,7 +24,7 @@ pub mod graph {
 }
 pub const CORE_3D: &str = graph::NAME;

-use std::cmp::Reverse;
+use std::{cmp::Reverse, ops::Range};

 pub use camera_3d::*;
 pub use main_opaque_pass_3d_node::*;
@@ -50,7 +50,7 @@ use bevy_render::{
    view::ViewDepthTexture,
    Extract, ExtractSchedule, Render, RenderApp, RenderSet,
 };
-use bevy_utils::{FloatOrd, HashMap};
+use bevy_utils::{nonmax::NonMaxU32, FloatOrd, HashMap};

 use crate::{
    prepass::{
@@ -135,7 +135,8 @@ pub struct Opaque3d {
    pub pipeline: CachedRenderPipelineId,
    pub entity: Entity,
    pub draw_function: DrawFunctionId,
-    pub batch_size: usize,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for Opaque3d {
@@ -164,8 +165,23 @@ impl PhaseItem for Opaque3d {
    }

    #[inline]
-    fn batch_size(&self) -> usize {
-        self.batch_size
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
    }
 }

@@ -181,7 +197,8 @@ pub struct AlphaMask3d {
    pub pipeline: CachedRenderPipelineId,
    pub entity: Entity,
    pub draw_function: DrawFunctionId,
-    pub batch_size: usize,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for AlphaMask3d {
@@ -210,8 +227,23 @@ impl PhaseItem for AlphaMask3d {
    }

    #[inline]
-    fn batch_size(&self) -> usize {
-        self.batch_size
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
    }
 }

@@ -227,7 +259,8 @@ pub struct Transparent3d {
    pub pipeline: CachedRenderPipelineId,
    pub entity: Entity,
    pub draw_function: DrawFunctionId,
-    pub batch_size: usize,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for Transparent3d {
@@ -255,8 +288,23 @@ impl PhaseItem for Transparent3d {
    }

    #[inline]
-    fn batch_size(&self) -> usize {
-        self.batch_size
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
    }
 }

--- a/crates/bevy_core_pipeline/src/prepass/mod.rs
+++ b/crates/bevy_core_pipeline/src/prepass/mod.rs
@@ -27,7 +27,7 @@

 pub mod node;

-use std::cmp::Reverse;
+use std::{cmp::Reverse, ops::Range};

 use bevy_ecs::prelude::*;
 use bevy_reflect::Reflect;
@@ -36,7 +36,7 @@ use bevy_render::{
    render_resource::{CachedRenderPipelineId, Extent3d, TextureFormat},
    texture::CachedTexture,
 };
-use bevy_utils::FloatOrd;
+use bevy_utils::{nonmax::NonMaxU32, FloatOrd};

 pub const DEPTH_PREPASS_FORMAT: TextureFormat = TextureFormat::Depth32Float;
 pub const NORMAL_PREPASS_FORMAT: TextureFormat = TextureFormat::Rgb10a2Unorm;
@@ -83,6 +83,8 @@ pub struct Opaque3dPrepass {
    pub entity: Entity,
    pub pipeline_id: CachedRenderPipelineId,
    pub draw_function: DrawFunctionId,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for Opaque3dPrepass {
@@ -109,6 +111,26 @@ impl PhaseItem for Opaque3dPrepass {
        // Key negated to match reversed SortKey ordering
        radsort::sort_by_key(items, |item| -item.distance);
    }
+
+    #[inline]
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
+    }
 }

 impl CachedRenderPipelinePhaseItem for Opaque3dPrepass {
@@ -128,6 +150,8 @@ pub struct AlphaMask3dPrepass {
    pub entity: Entity,
    pub pipeline_id: CachedRenderPipelineId,
    pub draw_function: DrawFunctionId,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for AlphaMask3dPrepass {
@@ -154,6 +178,26 @@ impl PhaseItem for AlphaMask3dPrepass {
        // Key negated to match reversed SortKey ordering
        radsort::sort_by_key(items, |item| -item.distance);
    }
+
+    #[inline]
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
+    }
 }

 impl CachedRenderPipelinePhaseItem for AlphaMask3dPrepass {
--- a/crates/bevy_gizmos/src/pipeline_2d.rs
+++ b/crates/bevy_gizmos/src/pipeline_2d.rs
@@ -178,7 +178,8 @@ fn queue_line_gizmos_2d(
                draw_function,
                pipeline,
                sort_key: FloatOrd(f32::INFINITY),
-                batch_size: 1,
+                batch_range: 0..1,
+                dynamic_offset: None,
            });
        }
    }
--- a/crates/bevy_gizmos/src/pipeline_3d.rs
+++ b/crates/bevy_gizmos/src/pipeline_3d.rs
@@ -192,7 +192,8 @@ fn queue_line_gizmos_3d(
                draw_function,
                pipeline,
                distance: 0.,
-                batch_size: 1,
+                batch_range: 0..1,
+                dynamic_offset: None,
            });
        }
    }
--- a/crates/bevy_math/src/affine3.rs
+++ b/crates/bevy_math/src/affine3.rs
@@ -1,4 +1,4 @@
-use glam::{Affine3A, Mat3, Vec3};
+use glam::{Affine3A, Mat3, Vec3, Vec3Swizzles, Vec4};

 /// Reduced-size version of `glam::Affine3A` for use when storage has
 /// significant performance impact. Convert to `glam::Affine3A` to do
@@ -10,6 +10,36 @@ pub struct Affine3 {
    pub translation: Vec3,
 }

+impl Affine3 {
+    /// Calculates the transpose of the affine 4x3 matrix to a 3x4 and formats it for packing into GPU buffers
+    #[inline]
+    pub fn to_transpose(&self) -> [Vec4; 3] {
+        let transpose_3x3 = self.matrix3.transpose();
+        [
+            transpose_3x3.x_axis.extend(self.translation.x),
+            transpose_3x3.y_axis.extend(self.translation.y),
+            transpose_3x3.z_axis.extend(self.translation.z),
+        ]
+    }
+
+    /// Calculates the inverse transpose of the 3x3 matrix and formats it for packing into GPU buffers
+    #[inline]
+    pub fn inverse_transpose_3x3(&self) -> ([Vec4; 2], f32) {
+        let inverse_transpose_3x3 = Affine3A::from(self).inverse().matrix3.transpose();
+        (
+            [
+                (inverse_transpose_3x3.x_axis, inverse_transpose_3x3.y_axis.x).into(),
+                (
+                    inverse_transpose_3x3.y_axis.yz(),
+                    inverse_transpose_3x3.z_axis.xy(),
+                )
+                    .into(),
+            ],
+            inverse_transpose_3x3.z_axis.z,
+        )
+    }
+}
+
 impl From<&Affine3A> for Affine3 {
    fn from(affine: &Affine3A) -> Self {
        Self {
--- a/crates/bevy_pbr/src/material.rs
+++ b/crates/bevy_pbr/src/material.rs
@@ -20,7 +20,6 @@ use bevy_ecs::{
    },
 };
 use bevy_render::{
-    extract_component::ExtractComponentPlugin,
    mesh::{Mesh, MeshVertexBufferLayout},
    prelude::Image,
    render_asset::{prepare_assets, RenderAssets},
@@ -29,13 +28,13 @@ use bevy_render::{
        RenderPhase, SetItemPipeline, TrackedRenderPass,
    },
    render_resource::{
-        AsBindGroup, AsBindGroupError, BindGroup, BindGroupLayout, OwnedBindingResource,
-        PipelineCache, RenderPipelineDescriptor, Shader, ShaderRef, SpecializedMeshPipeline,
-        SpecializedMeshPipelineError, SpecializedMeshPipelines,
+        AsBindGroup, AsBindGroupError, BindGroup, BindGroupId, BindGroupLayout,
+        OwnedBindingResource, PipelineCache, RenderPipelineDescriptor, Shader, ShaderRef,
+        SpecializedMeshPipeline, SpecializedMeshPipelineError, SpecializedMeshPipelines,
    },
    renderer::RenderDevice,
    texture::FallbackImage,
-    view::{ExtractedView, Msaa, VisibleEntities},
+    view::{ExtractedView, Msaa, ViewVisibility, VisibleEntities},
    Extract, ExtractSchedule, Render, RenderApp, RenderSet,
 };
 use bevy_utils::{tracing::error, HashMap, HashSet};
@@ -180,8 +179,7 @@ where
    M::Data: PartialEq + Eq + Hash + Clone,
 {
    fn build(&self, app: &mut App) {
-        app.init_asset::<M>()
-            .add_plugins(ExtractComponentPlugin::<Handle<M>>::extract_visible());
+        app.init_asset::<M>();

        if let Ok(render_app) = app.get_sub_app_mut(RenderApp) {
            render_app
@@ -193,7 +191,10 @@ where
                .init_resource::<ExtractedMaterials<M>>()
                .init_resource::<RenderMaterials<M>>()
                .init_resource::<SpecializedMeshPipelines<MaterialPipeline<M>>>()
-                .add_systems(ExtractSchedule, extract_materials::<M>)
+                .add_systems(
+                    ExtractSchedule,
+                    (extract_materials::<M>, extract_material_meshes::<M>),
+                )
                .add_systems(
                    Render,
                    (
@@ -225,6 +226,26 @@ where
    }
 }

+fn extract_material_meshes<M: Material>(
+    mut commands: Commands,
+    mut previous_len: Local<usize>,
+    query: Extract<Query<(Entity, &ViewVisibility, &Handle<M>)>>,
+) {
+    let mut values = Vec::with_capacity(*previous_len);
+    for (entity, view_visibility, material) in &query {
+        if view_visibility.get() {
+            // NOTE: MaterialBindGroupId is inserted here to avoid a table move. Upcoming changes
+            // to use SparseSet for render world entity storage will do this automatically.
+            values.push((
+                entity,
+                (material.clone_weak(), MaterialBindGroupId::default()),
+            ));
+        }
+    }
+    *previous_len = values.len();
+    commands.insert_or_spawn_batch(values);
+}
+
 /// A key uniquely identifying a specialized [`MaterialPipeline`].
 pub struct MaterialPipelineKey<M: Material> {
    pub mesh_key: MeshPipelineKey,
@@ -403,7 +424,12 @@ pub fn queue_material_meshes<M: Material>(
    msaa: Res<Msaa>,
    render_meshes: Res<RenderAssets<Mesh>>,
    render_materials: Res<RenderMaterials<M>>,
-    material_meshes: Query<(&Handle<M>, &Handle<Mesh>, &MeshTransforms)>,
+    mut material_meshes: Query<(
+        &Handle<M>,
+        &mut MaterialBindGroupId,
+        &Handle<Mesh>,
+        &MeshTransforms,
+    )>,
    images: Res<RenderAssets<Image>>,
    mut views: Query<(
        &ExtractedView,
@@ -467,8 +493,8 @@ pub fn queue_material_meshes<M: Material>(
        }
        let rangefinder = view.rangefinder3d();
        for visible_entity in &visible_entities.entities {
-            let Ok((material_handle, mesh_handle, mesh_transforms)) =
-                material_meshes.get(*visible_entity)
+            let Ok((material_handle, mut material_bind_group_id, mesh_handle, mesh_transforms)) =
+                material_meshes.get_mut(*visible_entity)
            else {
                continue;
            };
@@ -504,6 +530,8 @@ pub fn queue_material_meshes<M: Material>(
                }
            };

+            *material_bind_group_id = material.get_bind_group_id();
+
            let distance = rangefinder.distance_translation(&mesh_transforms.transform.translation)
                + material.properties.depth_bias;
            match material.properties.alpha_mode {
@@ -513,7 +541,8 @@ pub fn queue_material_meshes<M: Material>(
                        draw_function: draw_opaque_pbr,
                        pipeline: pipeline_id,
                        distance,
-                        batch_size: 1,
+                        batch_range: 0..1,
+                        dynamic_offset: None,
                    });
                }
                AlphaMode::Mask(_) => {
@@ -522,7 +551,8 @@ pub fn queue_material_meshes<M: Material>(
                        draw_function: draw_alpha_mask_pbr,
                        pipeline: pipeline_id,
                        distance,
-                        batch_size: 1,
+                        batch_range: 0..1,
+                        dynamic_offset: None,
                    });
                }
                AlphaMode::Blend
@@ -534,7 +564,8 @@ pub fn queue_material_meshes<M: Material>(
                        draw_function: draw_transparent_pbr,
                        pipeline: pipeline_id,
                        distance,
-                        batch_size: 1,
+                        batch_range: 0..1,
+                        dynamic_offset: None,
                    });
                }
            }
@@ -560,6 +591,15 @@ pub struct PreparedMaterial<T: Material> {
    pub properties: MaterialProperties,
 }

+#[derive(Component, Clone, Copy, Default, PartialEq, Eq, Deref, DerefMut)]
+pub struct MaterialBindGroupId(Option<BindGroupId>);
+
+impl<T: Material> PreparedMaterial<T> {
+    pub fn get_bind_group_id(&self) -> MaterialBindGroupId {
+        MaterialBindGroupId(Some(self.bind_group.id()))
+    }
+}
+
 #[derive(Resource)]
 pub struct ExtractedMaterials<M: Material> {
    extracted: Vec<(AssetId<M>, M)>,
--- a/crates/bevy_pbr/src/prepass/mod.rs
+++ b/crates/bevy_pbr/src/prepass/mod.rs
@@ -17,6 +17,7 @@ use bevy_ecs::{
 };
 use bevy_math::{Affine3A, Mat4};
 use bevy_render::{
+    batching::batch_and_prepare_render_phase,
    globals::{GlobalsBuffer, GlobalsUniform},
    mesh::MeshVertexBufferLayout,
    prelude::{Camera, Mesh},
@@ -158,7 +159,12 @@ where
                .add_systems(ExtractSchedule, extract_camera_previous_view_projection)
                .add_systems(
                    Render,
-                    prepare_previous_view_projection_uniforms.in_set(RenderSet::PrepareResources),
+                    (
+                        prepare_previous_view_projection_uniforms,
+                        batch_and_prepare_render_phase::<Opaque3dPrepass, MeshPipeline>,
+                        batch_and_prepare_render_phase::<AlphaMask3dPrepass, MeshPipeline>,
+                    )
+                        .in_set(RenderSet::PrepareResources),
                );
        }

@@ -849,6 +855,8 @@ pub fn queue_prepass_material_meshes<M: Material>(
                        draw_function: opaque_draw_prepass,
                        pipeline_id,
                        distance,
+                        batch_range: 0..1,
+                        dynamic_offset: None,
                    });
                }
                AlphaMode::Mask(_) => {
@@ -857,6 +865,8 @@ pub fn queue_prepass_material_meshes<M: Material>(
                        draw_function: alpha_mask_draw_prepass,
                        pipeline_id,
                        distance,
+                        batch_range: 0..1,
+                        dynamic_offset: None,
                    });
                }
                AlphaMode::Blend
--- a/crates/bevy_pbr/src/render/light.rs
+++ b/crates/bevy_pbr/src/render/light.rs
@@ -27,10 +27,11 @@ use bevy_render::{
 };
 use bevy_transform::{components::GlobalTransform, prelude::Transform};
 use bevy_utils::{
+    nonmax::NonMaxU32,
    tracing::{error, warn},
    HashMap,
 };
-use std::{hash::Hash, num::NonZeroU64};
+use std::{hash::Hash, num::NonZeroU64, ops::Range};

 #[derive(Component)]
 pub struct ExtractedPointLight {
@@ -1641,6 +1642,8 @@ pub fn queue_shadows<M: Material>(
                    pipeline: pipeline_id,
                    entity,
                    distance: 0.0, // TODO: sort front-to-back
+                    batch_range: 0..1,
+                    dynamic_offset: None,
                });
            }
        }
@@ -1652,6 +1655,8 @@ pub struct Shadow {
    pub entity: Entity,
    pub pipeline: CachedRenderPipelineId,
    pub draw_function: DrawFunctionId,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for Shadow {
@@ -1679,6 +1684,26 @@ impl PhaseItem for Shadow {
        // better than rebinding everything at a high rate.
        radsort::sort_by_key(items, |item| item.sort_key());
    }
+
+    #[inline]
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
+    }
 }

 impl CachedRenderPipelinePhaseItem for Shadow {
--- a/crates/bevy_pbr/src/render/mesh.rs
+++ b/crates/bevy_pbr/src/render/mesh.rs
@@ -1,8 +1,8 @@
 use crate::{
    environment_map, prepass, EnvironmentMapLight, FogMeta, GlobalLightMeta, GpuFog, GpuLights,
-    GpuPointLights, LightMeta, NotShadowCaster, NotShadowReceiver, PreviousGlobalTransform,
-    ScreenSpaceAmbientOcclusionTextures, Shadow, ShadowSamplers, ViewClusterBindings,
-    ViewFogUniformOffset, ViewLightsUniformOffset, ViewShadowBindings,
+    GpuPointLights, LightMeta, MaterialBindGroupId, NotShadowCaster, NotShadowReceiver,
+    PreviousGlobalTransform, ScreenSpaceAmbientOcclusionTextures, Shadow, ShadowSamplers,
+    ViewClusterBindings, ViewFogUniformOffset, ViewLightsUniformOffset, ViewShadowBindings,
    CLUSTERED_FORWARD_STORAGE_BUFFER_COUNT, MAX_CASCADES_PER_LIGHT, MAX_DIRECTIONAL_LIGHTS,
 };
 use bevy_app::Plugin;
@@ -16,11 +16,15 @@ use bevy_core_pipeline::{
 };
 use bevy_ecs::{
    prelude::*,
-    query::ROQueryItem,
+    query::{QueryItem, ROQueryItem},
    system::{lifetimeless::*, SystemParamItem, SystemState},
 };
-use bevy_math::{Affine3, Affine3A, Mat4, Vec2, Vec3Swizzles, Vec4};
+use bevy_math::{Affine3, Mat4, Vec2, Vec4};
 use bevy_render::{
+    batching::{
+        batch_and_prepare_render_phase, write_batched_instance_buffer, GetBatchData,
+        NoAutomaticBatching,
+    },
    globals::{GlobalsBuffer, GlobalsUniform},
    mesh::{
        skinning::{SkinnedMesh, SkinnedMeshInverseBindposes},
@ -29,7 +33,7 @@ use bevy_render::{
    },
    prelude::Msaa,
    render_asset::RenderAssets,
-    render_phase::{PhaseItem, RenderCommand, RenderCommandResult, RenderPhase, TrackedRenderPass},
+    render_phase::{PhaseItem, RenderCommand, RenderCommandResult, TrackedRenderPass},
    render_resource::*,
    renderer::{RenderDevice, RenderQueue},
    texture::{
@ -41,7 +45,6 @@ use bevy_render::{
 };
 use bevy_transform::components::GlobalTransform;
 use bevy_utils::{tracing::error, HashMap, Hashed};
-use fixedbitset::FixedBitSet;

 use crate::render::{
    morph::{extract_morphs, prepare_morphs, MorphIndex, MorphUniform},
@ -119,7 +122,15 @@ impl Plugin for MeshRenderPlugin {
                .add_systems(
                    Render,
                    (
-                        prepare_mesh_uniforms.in_set(RenderSet::PrepareResources),
+                        (
+                            batch_and_prepare_render_phase::<Opaque3d, MeshPipeline>,
+                            batch_and_prepare_render_phase::<Transparent3d, MeshPipeline>,
+                            batch_and_prepare_render_phase::<AlphaMask3d, MeshPipeline>,
+                            batch_and_prepare_render_phase::<Shadow, MeshPipeline>,
+                        )
+                            .in_set(RenderSet::PrepareResources),
+                        write_batched_instance_buffer::<MeshPipeline>
+                            .in_set(RenderSet::PrepareResourcesFlush),
                        prepare_skinned_meshes.in_set(RenderSet::PrepareResources),
                        prepare_morphs.in_set(RenderSet::PrepareResources),
                        prepare_mesh_bind_group.in_set(RenderSet::PrepareBindGroups),
@ -184,48 +195,13 @@ pub struct MeshUniform {

 impl From<&MeshTransforms> for MeshUniform {
    fn from(mesh_transforms: &MeshTransforms) -> Self {
-        let transpose_model_3x3 = mesh_transforms.transform.matrix3.transpose();
-        let transpose_previous_model_3x3 = mesh_transforms.previous_transform.matrix3.transpose();
-        let inverse_transpose_model_3x3 = Affine3A::from(&mesh_transforms.transform)
-            .inverse()
-            .matrix3
-            .transpose();
+        let (inverse_transpose_model_a, inverse_transpose_model_b) =
+            mesh_transforms.transform.inverse_transpose_3x3();
        Self {
-            transform: [
-                transpose_model_3x3
-                    .x_axis
-                    .extend(mesh_transforms.transform.translation.x),
-                transpose_model_3x3
-                    .y_axis
-                    .extend(mesh_transforms.transform.translation.y),
-                transpose_model_3x3
-                    .z_axis
-                    .extend(mesh_transforms.transform.translation.z),
-            ],
-            previous_transform: [
-                transpose_previous_model_3x3
-                    .x_axis
-                    .extend(mesh_transforms.previous_transform.translation.x),
-                transpose_previous_model_3x3
-                    .y_axis
-                    .extend(mesh_transforms.previous_transform.translation.y),
-                transpose_previous_model_3x3
-                    .z_axis
-                    .extend(mesh_transforms.previous_transform.translation.z),
-            ],
-            inverse_transpose_model_a: [
-                (
-                    inverse_transpose_model_3x3.x_axis,
-                    inverse_transpose_model_3x3.y_axis.x,
-                )
-                    .into(),
-                (
-                    inverse_transpose_model_3x3.y_axis.yz(),
-                    inverse_transpose_model_3x3.z_axis.xy(),
-                )
-                    .into(),
-            ],
-            inverse_transpose_model_b: inverse_transpose_model_3x3.z_axis.z,
+            transform: mesh_transforms.transform.to_transpose(),
+            previous_transform: mesh_transforms.previous_transform.to_transpose(),
+            inverse_transpose_model_a,
+            inverse_transpose_model_b,
            flags: mesh_transforms.flags,
        }
    }
@ -234,7 +210,7 @@ impl From<&MeshTransforms> for MeshUniform {
 // NOTE: These must match the bit flags in bevy_pbr/src/render/mesh_types.wgsl!
 bitflags::bitflags! {
    #[repr(transparent)]
-    struct MeshFlags: u32 {
+    pub struct MeshFlags: u32 {
        const SHADOW_RECEIVER            = (1 << 0);
        // Indicates the sign of the determinant of the 3x3 model matrix. If the sign is positive,
        // then the flag should be set, else it should not be set.
@ -361,7 +337,12 @@ pub fn extract_skinned_meshes(
            SkinnedMeshJoints::build(skin, &inverse_bindposes, &joint_query, &mut uniform.buffer)
        {
            last_start = last_start.max(skinned_joints.index as usize);
-            values.push((entity, skinned_joints.to_buffer_index()));
+            // NOTE: The skinned joints uniform buffer has to be bound at a dynamic offset per
+            // entity and so cannot currently be batched.
+            values.push((
+                entity,
+                (skinned_joints.to_buffer_index(), NoAutomaticBatching),
+            ));
        }
    }

@ -374,63 +355,6 @@ pub fn extract_skinned_meshes(
    commands.insert_or_spawn_batch(values);
 }

-#[allow(clippy::too_many_arguments)]
-pub fn prepare_mesh_uniforms(
-    mut seen: Local<FixedBitSet>,
-    mut commands: Commands,
-    mut previous_len: Local<usize>,
-    render_device: Res<RenderDevice>,
-    render_queue: Res<RenderQueue>,
-    mut gpu_array_buffer: ResMut<GpuArrayBuffer<MeshUniform>>,
-    views: Query<(
-        &RenderPhase<Opaque3d>,
-        &RenderPhase<Transparent3d>,
-        &RenderPhase<AlphaMask3d>,
-    )>,
-    shadow_views: Query<&RenderPhase<Shadow>>,
-    meshes: Query<(Entity, &MeshTransforms)>,
-) {
-    gpu_array_buffer.clear();
-    seen.clear();
-
-    let mut indices = Vec::with_capacity(*previous_len);
-    let mut push_indices = |(mesh, mesh_uniform): (Entity, &MeshTransforms)| {
-        let index = mesh.index() as usize;
-        if !seen.contains(index) {
-            if index >= seen.len() {
-                seen.grow(index + 1);
-            }
-            seen.insert(index);
-            indices.push((mesh, gpu_array_buffer.push(mesh_uniform.into())));
-        }
-    };
-
-    for (opaque_phase, transparent_phase, alpha_phase) in &views {
-        meshes
-            .iter_many(opaque_phase.iter_entities())
-            .for_each(&mut push_indices);
-
-        meshes
-            .iter_many(transparent_phase.iter_entities())
-            .for_each(&mut push_indices);
-
-        meshes
-            .iter_many(alpha_phase.iter_entities())
-            .for_each(&mut push_indices);
-    }
-
-    for shadow_phase in &shadow_views {
-        meshes
-            .iter_many(shadow_phase.iter_entities())
-            .for_each(&mut push_indices);
-    }
-
-    *previous_len = indices.len();
-    commands.insert_or_spawn_batch(indices);
-
-    gpu_array_buffer.write_buffer(&render_device, &render_queue);
-}
-
 #[derive(Resource, Clone)]
 pub struct MeshPipeline {
    pub view_layout: BindGroupLayout,
@ -713,6 +637,26 @@ impl MeshPipeline {
    }
 }

+impl GetBatchData for MeshPipeline {
+    type Query = (
+        Option<&'static MaterialBindGroupId>,
+        &'static Handle<Mesh>,
+        &'static MeshTransforms,
+    );
+    type CompareData = (Option<MaterialBindGroupId>, AssetId<Mesh>);
+    type BufferData = MeshUniform;
+
+    fn get_buffer_data(&(.., mesh_transforms): &QueryItem<Self::Query>) -> Self::BufferData {
+        mesh_transforms.into()
+    }
+
+    fn get_compare_data(
+        &(material_bind_group_id, mesh_handle, ..): &QueryItem<Self::Query>,
+    ) -> Self::CompareData {
+        (material_bind_group_id.copied(), mesh_handle.id())
+    }
+}
+
 bitflags::bitflags! {
    #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
    #[repr(transparent)]
@ -1364,16 +1308,15 @@ impl<P: PhaseItem, const I: usize> RenderCommand<P> for SetMeshBindGroup<I> {
    type ViewWorldQuery = ();
    type ItemWorldQuery = (
        Read<Handle<Mesh>>,
-        Read<GpuArrayBufferIndex<MeshUniform>>,
        Option<Read<SkinnedMeshJoints>>,
        Option<Read<MorphIndex>>,
    );

    #[inline]
    fn render<'w>(
-        _item: &P,
+        item: &P,
        _view: (),
-        (mesh, batch_indices, skin_index, morph_index): ROQueryItem<Self::ItemWorldQuery>,
+        (mesh, skin_index, morph_index): ROQueryItem<Self::ItemWorldQuery>,
        bind_groups: SystemParamItem<'w, '_, Self::Param>,
        pass: &mut TrackedRenderPass<'w>,
    ) -> RenderCommandResult {
@ -1391,20 +1334,20 @@ impl<P: PhaseItem, const I: usize> RenderCommand<P> for SetMeshBindGroup<I> {
        };

        let mut dynamic_offsets: [u32; 3] = Default::default();
-        let mut index_count = 0;
-        if let Some(mesh_index) = batch_indices.dynamic_offset {
-            dynamic_offsets[index_count] = mesh_index;
-            index_count += 1;
+        let mut offset_count = 0;
+        if let Some(dynamic_offset) = item.dynamic_offset() {
+            dynamic_offsets[offset_count] = dynamic_offset.get();
+            offset_count += 1;
        }
        if let Some(skin_index) = skin_index {
-            dynamic_offsets[index_count] = skin_index.index;
-            index_count += 1;
+            dynamic_offsets[offset_count] = skin_index.index;
+            offset_count += 1;
        }
        if let Some(morph_index) = morph_index {
-            dynamic_offsets[index_count] = morph_index.index;
-            index_count += 1;
+            dynamic_offsets[offset_count] = morph_index.index;
+            offset_count += 1;
        }
-        pass.set_bind_group(I, bind_group, &dynamic_offsets[0..index_count]);
+        pass.set_bind_group(I, bind_group, &dynamic_offsets[0..offset_count]);

        RenderCommandResult::Success
    }
@ -1414,22 +1357,23 @@ pub struct DrawMesh;
 impl<P: PhaseItem> RenderCommand<P> for DrawMesh {
    type Param = SRes<RenderAssets<Mesh>>;
    type ViewWorldQuery = ();
-    type ItemWorldQuery = (Read<GpuArrayBufferIndex<MeshUniform>>, Read<Handle<Mesh>>);
+    type ItemWorldQuery = Read<Handle<Mesh>>;
    #[inline]
    fn render<'w>(
-        _item: &P,
+        item: &P,
        _view: (),
-        (batch_indices, mesh_handle): ROQueryItem<'_, Self::ItemWorldQuery>,
+        mesh_handle: ROQueryItem<'_, Self::ItemWorldQuery>,
        meshes: SystemParamItem<'w, '_, Self::Param>,
        pass: &mut TrackedRenderPass<'w>,
    ) -> RenderCommandResult {
        if let Some(gpu_mesh) = meshes.into_inner().get(mesh_handle) {
+            let batch_range = item.batch_range();
            pass.set_vertex_buffer(0, gpu_mesh.vertex_buffer.slice(..));
            #[cfg(all(feature = "webgl", target_arch = "wasm32"))]
            pass.set_push_constants(
                ShaderStages::VERTEX,
                0,
-                &(batch_indices.index as i32).to_le_bytes(),
+                &(batch_range.start as i32).to_le_bytes(),
            );
            match &gpu_mesh.buffer_info {
                GpuBufferInfo::Indexed {
@ -1438,13 +1382,10 @@ impl<P: PhaseItem> RenderCommand<P> for DrawMesh {
                    count,
                } => {
                    pass.set_index_buffer(buffer.slice(..), 0, *index_format);
-                    pass.draw_indexed(0..*count, 0, batch_indices.index..batch_indices.index + 1);
+                    pass.draw_indexed(0..*count, 0, batch_range.clone());
                }
                GpuBufferInfo::NonIndexed => {
-                    pass.draw(
-                        0..gpu_mesh.vertex_count,
-                        batch_indices.index..batch_indices.index + 1,
-                    );
+                    pass.draw(0..gpu_mesh.vertex_count, batch_range.clone());
                }
            }
            RenderCommandResult::Success
--- a/crates/bevy_pbr/src/render/morph.rs
+++ b/crates/bevy_pbr/src/render/morph.rs
@ -2,6 +2,7 @@ use std::{iter, mem};

 use bevy_ecs::prelude::*;
 use bevy_render::{
+    batching::NoAutomaticBatching,
    mesh::morph::{MeshMorphWeights, MAX_MORPH_WEIGHTS},
    render_resource::{BufferUsages, BufferVec},
    renderer::{RenderDevice, RenderQueue},
@ -89,7 +90,9 @@ pub fn extract_morphs(
        add_to_alignment::<f32>(&mut uniform.buffer);

        let index = (start * mem::size_of::<f32>()) as u32;
-        values.push((entity, MorphIndex { index }));
+        // NOTE: Because morph targets require per-morph target texture bindings, they cannot
+        // currently be batched.
+        values.push((entity, (MorphIndex { index }, NoAutomaticBatching)));
    }
    *previous_len = values.len();
    commands.insert_or_spawn_batch(values);
--- a/crates/bevy_pbr/src/wireframe.rs
+++ b/crates/bevy_pbr/src/wireframe.rs
@ -152,7 +152,8 @@ fn queue_wireframes(
                pipeline: pipeline_id,
                draw_function: draw_custom,
                distance: rangefinder.distance_translation(&mesh_transforms.transform.translation),
-                batch_size: 1,
+                batch_range: 0..1,
+                dynamic_offset: None,
            });
        };

--- a/crates/bevy_render/src/batching/mod.rs
+++ b/crates/bevy_render/src/batching/mod.rs
@ -0,0 +1,123 @@
+use bevy_ecs::{
+    component::Component,
+    prelude::Res,
+    query::{Has, QueryItem, ReadOnlyWorldQuery},
+    system::{Query, ResMut},
+};
+use bevy_utils::nonmax::NonMaxU32;
+
+use crate::{
+    render_phase::{CachedRenderPipelinePhaseItem, DrawFunctionId, RenderPhase},
+    render_resource::{CachedRenderPipelineId, GpuArrayBuffer, GpuArrayBufferable},
+    renderer::{RenderDevice, RenderQueue},
+};
+
+/// Add this component to mesh entities to disable automatic batching
+#[derive(Component)]
+pub struct NoAutomaticBatching;
+
+/// Data that must be equal for two draw commands to be merged into one batch
+///
+/// This is based on the following assumptions:
+/// - Only entities with prepared assets (pipelines, materials, meshes) are
+///   queued to phases
+/// - View bindings are constant across a phase for a given draw function as
+///   phases are per-view
+/// - `batch_and_prepare_render_phase` is the only system that performs this
+///   batching and has sole responsibility for preparing the per-object data.
+///   As such the mesh binding and dynamic offsets are assumed to vary only
+///   as a result of the `batch_and_prepare_render_phase` system, e.g. when
+///   data has to be split across separate uniform bindings within the same
+///   buffer because of the maximum uniform buffer binding size.
+#[derive(PartialEq)]
+struct BatchMeta<T: PartialEq> {
+    /// The pipeline id encompasses all pipeline configuration including vertex
+    /// buffers and layouts, shaders and their specializations, bind group
+    /// layouts, etc.
+    pipeline_id: CachedRenderPipelineId,
+    /// The draw function id defines the RenderCommands that are called to
+    /// set the pipeline and bindings, and make the draw command
+    draw_function_id: DrawFunctionId,
+    dynamic_offset: Option<NonMaxU32>,
+    user_data: T,
+}
+
+impl<T: PartialEq> BatchMeta<T> {
+    fn new(item: &impl CachedRenderPipelinePhaseItem, user_data: T) -> Self {
+        BatchMeta {
+            pipeline_id: item.cached_pipeline(),
+            draw_function_id: item.draw_function(),
+            dynamic_offset: item.dynamic_offset(),
+            user_data,
+        }
+    }
+}
+
+/// A trait to support getting data used for batching draw commands via phase
+/// items.
+pub trait GetBatchData {
+    type Query: ReadOnlyWorldQuery;
+    /// Data used for comparison between phase items. If the pipeline id, draw
+    /// function id, per-instance data buffer dynamic offset and this data
+    /// match, the draws can be batched.
+    type CompareData: PartialEq;
+    /// The per-instance data to be inserted into the [`GpuArrayBuffer`]
+    /// containing these data for all instances.
+    type BufferData: GpuArrayBufferable + Sync + Send + 'static;
+    /// Get the per-instance data to be inserted into the [`GpuArrayBuffer`].
+    fn get_buffer_data(query_item: &QueryItem<Self::Query>) -> Self::BufferData;
+    /// Get the data used for comparison when deciding whether draws can be
+    /// batched.
+    fn get_compare_data(query_item: &QueryItem<Self::Query>) -> Self::CompareData;
+}
+
+/// Batch the items in a render phase. This means comparing metadata needed to draw each phase item
+/// and trying to combine the draws into a batch.
+pub fn batch_and_prepare_render_phase<I: CachedRenderPipelinePhaseItem, F: GetBatchData>(
+    gpu_array_buffer: ResMut<GpuArrayBuffer<F::BufferData>>,
+    mut views: Query<&mut RenderPhase<I>>,
+    query: Query<(Has<NoAutomaticBatching>, F::Query)>,
+) {
+    let gpu_array_buffer = gpu_array_buffer.into_inner();
+
+    let mut process_item = |item: &mut I| {
+        let (no_auto_batching, batch_query_item) = query.get(item.entity()).ok()?;
+
+        let buffer_data = F::get_buffer_data(&batch_query_item);
+        let buffer_index = gpu_array_buffer.push(buffer_data);
+
+        let index = buffer_index.index.get();
+        *item.batch_range_mut() = index..index + 1;
+        *item.dynamic_offset_mut() = buffer_index.dynamic_offset;
+
+        (!no_auto_batching).then(|| {
+            let compare_data = F::get_compare_data(&batch_query_item);
+            BatchMeta::new(item, compare_data)
+        })
+    };
+
+    for mut phase in &mut views {
+        let items = phase.items.iter_mut().map(|item| {
+            let batch_data = process_item(item);
+            (item.batch_range_mut(), batch_data)
+        });
+        items.reduce(|(start_range, prev_batch_meta), (range, batch_meta)| {
+            if batch_meta.is_some() && prev_batch_meta == batch_meta {
+                start_range.end = range.end;
+                (start_range, prev_batch_meta)
+            } else {
+                (range, batch_meta)
+            }
+        });
+    }
+}
+
+pub fn write_batched_instance_buffer<F: GetBatchData>(
+    render_device: Res<RenderDevice>,
+    render_queue: Res<RenderQueue>,
+    gpu_array_buffer: ResMut<GpuArrayBuffer<F::BufferData>>,
+) {
+    let gpu_array_buffer = gpu_array_buffer.into_inner();
+    gpu_array_buffer.write_buffer(&render_device, &render_queue);
+    gpu_array_buffer.clear();
+}
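
The adjacent-merge step in `batch_and_prepare_render_phase` above can be illustrated in isolation. The following is a minimal sketch, not the Bevy implementation: `Item`, `batch_adjacent`, and `batched_ranges` are hypothetical names, and a plain `Option<u32>` key stands in for `Option<BatchMeta<T>>` (with `None` playing the role of an entity marked `NoAutomaticBatching`).

```rust
use std::ops::Range;

/// Hypothetical stand-in for a phase item. `key` plays the role of the
/// comparison metadata; `None` means "never batch this item".
struct Item {
    key: Option<u32>,
    batch_range: Range<u32>,
}

/// Merge runs of adjacent items with equal `Some` keys by extending the
/// range of the first item in the run. Later items in the run keep their
/// own single-instance ranges and are skipped when drawing.
fn batch_adjacent(items: &mut [Item]) {
    // Index of the item that starts the current batch.
    let mut head = 0;
    for i in 1..items.len() {
        if items[i].key.is_some() && items[i].key == items[head].key {
            let end = items[i].batch_range.end;
            items[head].batch_range.end = end;
        } else {
            head = i;
        }
    }
}

/// Give item `i` the instance range `i..i + 1`, batch, and then walk the
/// items the way the draw loop does: issue one draw per batch head and
/// advance by the length of its range.
fn batched_ranges(keys: &[Option<u32>]) -> Vec<Range<u32>> {
    let mut items: Vec<Item> = keys
        .iter()
        .enumerate()
        .map(|(i, key)| Item {
            key: *key,
            batch_range: i as u32..i as u32 + 1,
        })
        .collect();
    batch_adjacent(&mut items);

    let mut draws = Vec::new();
    let mut index = 0;
    while index < items.len() {
        let range = items[index].batch_range.clone();
        if range.is_empty() {
            index += 1;
        } else {
            index += range.len();
            draws.push(range);
        }
    }
    draws
}
```

In this sketch, only items that are neighbours in the sorted phase order are merged, so two items with the same key separated by a different item produce separate draws. That mirrors why the sort order of the phase matters for how much batching is achieved.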
--- a/crates/bevy_render/src/lib.rs
+++ b/crates/bevy_render/src/lib.rs
@ -5,6 +5,7 @@ compile_error!("bevy_render cannot compile for a 16-bit platform.");

 extern crate core;

+pub mod batching;
 pub mod camera;
 pub mod color;
 pub mod extract_component;
--- a/crates/bevy_render/src/render_phase/mod.rs
+++ b/crates/bevy_render/src/render_phase/mod.rs
@ -29,6 +29,7 @@ mod draw;
 mod draw_state;
 mod rangefinder;

+use bevy_utils::nonmax::NonMaxU32;
 pub use draw::*;
 pub use draw_state::*;
 pub use rangefinder::*;
@ -38,7 +39,7 @@ use bevy_ecs::{
    prelude::*,
    system::{lifetimeless::SRes, SystemParamItem},
 };
-use std::ops::Range;
+use std::{ops::Range, slice::SliceIndex};

 /// A collection of all rendering instructions, that will be executed by the GPU, for a
 /// single render phase for a single view.
@ -86,22 +87,7 @@ impl<I: PhaseItem> RenderPhase<I> {
        world: &'w World,
        view: Entity,
    ) {
-        let draw_functions = world.resource::<DrawFunctions<I>>();
-        let mut draw_functions = draw_functions.write();
-        draw_functions.prepare(world);
-
-        let mut index = 0;
-        while index < self.items.len() {
-            let item = &self.items[index];
-            let batch_size = item.batch_size();
-            if batch_size > 0 {
-                let draw_function = draw_functions.get_mut(item.draw_function()).unwrap();
-                draw_function.draw(world, render_pass, view, item);
-                index += batch_size;
-            } else {
-                index += 1;
-            }
-        }
+        self.render_range(render_pass, world, view, ..);
    }

    /// Renders all [`PhaseItem`]s in the provided `range` (based on their index in `self.items`) using their corresponding draw functions.
@ -110,27 +96,27 @@ impl<I: PhaseItem> RenderPhase<I> {
        render_pass: &mut TrackedRenderPass<'w>,
        world: &'w World,
        view: Entity,
-        range: Range<usize>,
+        range: impl SliceIndex<[I], Output = [I]>,
    ) {
-        let draw_functions = world.resource::<DrawFunctions<I>>();
-        let mut draw_functions = draw_functions.write();
-        draw_functions.prepare(world);
-
        let items = self
            .items
            .get(range)
            .expect("`Range` provided to `render_range()` is out of bounds");

+        let draw_functions = world.resource::<DrawFunctions<I>>();
+        let mut draw_functions = draw_functions.write();
+        draw_functions.prepare(world);
+
        let mut index = 0;
        while index < items.len() {
            let item = &items[index];
-            let batch_size = item.batch_size();
-            if batch_size > 0 {
+            let batch_range = item.batch_range();
+            if batch_range.is_empty() {
+                index += 1;
+            } else {
                let draw_function = draw_functions.get_mut(item.draw_function()).unwrap();
                draw_function.draw(world, render_pass, view, item);
-                index += batch_size;
-            } else {
-                index += 1;
+                index += batch_range.len();
            }
        }
    }
@ -182,12 +168,14 @@ pub trait PhaseItem: Sized + Send + Sync + 'static {
        items.sort_unstable_by_key(|item| item.sort_key());
    }

-    /// The number of items to skip after rendering this [`PhaseItem`].
-    ///
-    /// Items with a `batch_size` of 0 will not be rendered.
-    fn batch_size(&self) -> usize {
-        1
-    }
+    /// The range of instances that the batch covers. After a batched draw, iteration advances
+    /// by the length of this range, skipping the phase items that the batch already covers.
+    /// This design avoids having to restructure the render phase unnecessarily.
+    fn batch_range(&self) -> &Range<u32>;
+    fn batch_range_mut(&mut self) -> &mut Range<u32>;
+
+    fn dynamic_offset(&self) -> Option<NonMaxU32>;
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32>;
 }

 /// A [`PhaseItem`] item, that automatically sets the appropriate render pipeline,
--- a/crates/bevy_render/src/render_resource/batched_uniform_buffer.rs
+++ b/crates/bevy_render/src/render_resource/batched_uniform_buffer.rs
@ -3,6 +3,7 @@ use crate::{
    render_resource::DynamicUniformBuffer,
    renderer::{RenderDevice, RenderQueue},
 };
+use bevy_utils::nonmax::NonMaxU32;
 use encase::{
    private::{ArrayMetadata, BufferMut, Metadata, RuntimeSizedArray, WriteInto, Writer},
    ShaderType,
@ -76,8 +77,8 @@ impl<T: GpuArrayBufferable> BatchedUniformBuffer<T> {

    pub fn push(&mut self, component: T) -> GpuArrayBufferIndex<T> {
        let result = GpuArrayBufferIndex {
-            index: self.temp.0.len() as u32,
-            dynamic_offset: Some(self.current_offset),
+            index: NonMaxU32::new(self.temp.0.len() as u32).unwrap(),
+            dynamic_offset: NonMaxU32::new(self.current_offset),
            element_type: PhantomData,
        };
        self.temp.0.push(component);
--- a/crates/bevy_render/src/render_resource/gpu_array_buffer.rs
+++ b/crates/bevy_render/src/render_resource/gpu_array_buffer.rs
@ -4,6 +4,7 @@ use crate::{
    renderer::{RenderDevice, RenderQueue},
 };
 use bevy_ecs::{prelude::Component, system::Resource};
+use bevy_utils::nonmax::NonMaxU32;
 use encase::{private::WriteInto, ShaderSize, ShaderType};
 use std::{marker::PhantomData, mem};
 use wgpu::{BindGroupLayoutEntry, BindingResource, BindingType, BufferBindingType, ShaderStages};
@ -52,7 +53,7 @@ impl<T: GpuArrayBufferable> GpuArrayBuffer<T> {
        match self {
            GpuArrayBuffer::Uniform(buffer) => buffer.push(value),
            GpuArrayBuffer::Storage((_, buffer)) => {
-                let index = buffer.len() as u32;
+                let index = NonMaxU32::new(buffer.len() as u32).unwrap();
                buffer.push(value);
                GpuArrayBufferIndex {
                    index,
@ -118,12 +119,12 @@ impl<T: GpuArrayBufferable> GpuArrayBuffer<T> {
 }

 /// An index into a [`GpuArrayBuffer`] for a given element.
-#[derive(Component)]
+#[derive(Component, Clone)]
 pub struct GpuArrayBufferIndex<T: GpuArrayBufferable> {
    /// The index to use in a shader into the array.
-    pub index: u32,
+    pub index: NonMaxU32,
    /// The dynamic offset to use when setting the bind group in a pass.
    /// Only used on platforms that don't support storage buffers.
-    pub dynamic_offset: Option<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
    pub element_type: PhantomData<T>,
 }
--- a/crates/bevy_sprite/src/mesh2d/material.rs
+++ b/crates/bevy_sprite/src/mesh2d/material.rs
@ -15,7 +15,6 @@ use bevy_ecs::{
 };
 use bevy_log::error;
 use bevy_render::{
-    extract_component::ExtractComponentPlugin,
    mesh::{Mesh, MeshVertexBufferLayout},
    prelude::Image,
    render_asset::{prepare_assets, RenderAssets},
@ -24,9 +23,9 @@ use bevy_render::{
        RenderPhase, SetItemPipeline, TrackedRenderPass,
    },
    render_resource::{
-        AsBindGroup, AsBindGroupError, BindGroup, BindGroupLayout, OwnedBindingResource,
-        PipelineCache, RenderPipelineDescriptor, Shader, ShaderRef, SpecializedMeshPipeline,
-        SpecializedMeshPipelineError, SpecializedMeshPipelines,
+        AsBindGroup, AsBindGroupError, BindGroup, BindGroupId, BindGroupLayout,
+        OwnedBindingResource, PipelineCache, RenderPipelineDescriptor, Shader, ShaderRef,
+        SpecializedMeshPipeline, SpecializedMeshPipelineError, SpecializedMeshPipelines,
    },
    renderer::RenderDevice,
    texture::FallbackImage,
@ -39,8 +38,8 @@ use std::hash::Hash;
 use std::marker::PhantomData;

 use crate::{
-    DrawMesh2d, Mesh2dHandle, Mesh2dPipeline, Mesh2dPipelineKey, Mesh2dUniform, SetMesh2dBindGroup,
-    SetMesh2dViewBindGroup,
+    DrawMesh2d, Mesh2dHandle, Mesh2dPipeline, Mesh2dPipelineKey, Mesh2dTransforms,
+    SetMesh2dBindGroup, SetMesh2dViewBindGroup,
 };

 /// Materials are used alongside [`Material2dPlugin`] and [`MaterialMesh2dBundle`]
@ -144,8 +143,7 @@ where
    M::Data: PartialEq + Eq + Hash + Clone,
 {
    fn build(&self, app: &mut App) {
-        app.init_asset::<M>()
-            .add_plugins(ExtractComponentPlugin::<Handle<M>>::extract_visible());
+        app.init_asset::<M>();

        if let Ok(render_app) = app.get_sub_app_mut(RenderApp) {
            render_app
@ -153,7 +151,10 @@ where
                .init_resource::<ExtractedMaterials2d<M>>()
                .init_resource::<RenderMaterials2d<M>>()
                .init_resource::<SpecializedMeshPipelines<Material2dPipeline<M>>>()
-                .add_systems(ExtractSchedule, extract_materials_2d::<M>)
+                .add_systems(
+                    ExtractSchedule,
+                    (extract_materials_2d::<M>, extract_material_meshes_2d::<M>),
+                )
                .add_systems(
                    Render,
                    (
@ -175,6 +176,26 @@ where
    }
 }

+fn extract_material_meshes_2d<M: Material2d>(
+    mut commands: Commands,
+    mut previous_len: Local<usize>,
+    query: Extract<Query<(Entity, &ViewVisibility, &Handle<M>)>>,
+) {
+    let mut values = Vec::with_capacity(*previous_len);
+    for (entity, view_visibility, material) in &query {
+        if view_visibility.get() {
+            // NOTE: Material2dBindGroupId is inserted here to avoid a table move. Upcoming changes
+            // to use SparseSet for render world entity storage will do this automatically.
+            values.push((
+                entity,
+                (material.clone_weak(), Material2dBindGroupId::default()),
+            ));
+        }
+    }
+    *previous_len = values.len();
+    commands.insert_or_spawn_batch(values);
+}
+
 /// Render pipeline data for a given [`Material2d`]
 #[derive(Resource)]
 pub struct Material2dPipeline<M: Material2d> {
@ -343,7 +364,12 @@ pub fn queue_material2d_meshes<M: Material2d>(
    msaa: Res<Msaa>,
    render_meshes: Res<RenderAssets<Mesh>>,
    render_materials: Res<RenderMaterials2d<M>>,
-    material2d_meshes: Query<(&Handle<M>, &Mesh2dHandle, &Mesh2dUniform)>,
+    mut material2d_meshes: Query<(
+        &Handle<M>,
+        &mut Material2dBindGroupId,
+        &Mesh2dHandle,
+        &Mesh2dTransforms,
+    )>,
    mut views: Query<(
        &ExtractedView,
        &VisibleEntities,
@ -374,8 +400,12 @@ pub fn queue_material2d_meshes<M: Material2d>(
            }
        }
        for visible_entity in &visible_entities.entities {
-            let Ok((material2d_handle, mesh2d_handle, mesh2d_uniform)) =
-                material2d_meshes.get(*visible_entity)
+            let Ok((
+                material2d_handle,
+                mut material2d_bind_group_id,
+                mesh2d_handle,
+                mesh2d_uniform,
+            )) = material2d_meshes.get_mut(*visible_entity)
            else {
                continue;
            };
@ -406,7 +436,8 @@ pub fn queue_material2d_meshes<M: Material2d>(
                }
            };

-            let mesh_z = mesh2d_uniform.transform.w_axis.z;
+            *material2d_bind_group_id = material2d.get_bind_group_id();
+            let mesh_z = mesh2d_uniform.transform.translation.z;
            transparent_phase.add(Transparent2d {
                entity: *visible_entity,
                draw_function: draw_transparent_pbr,
@ -416,13 +447,17 @@ pub fn queue_material2d_meshes<M: Material2d>(
                // -z in front of the camera, the largest distance is -far with values increasing toward the
                // camera. As such we can just use mesh_z as the distance
                sort_key: FloatOrd(mesh_z),
-                // This material is not batched
-                batch_size: 1,
+                // Batching is done in batch_and_prepare_render_phase
+                batch_range: 0..1,
+                dynamic_offset: None,
            });
        }
    }
 }

+#[derive(Component, Clone, Copy, Default, PartialEq, Eq, Deref, DerefMut)]
+pub struct Material2dBindGroupId(Option<BindGroupId>);
+
 /// Data prepared for a [`Material2d`] instance.
 pub struct PreparedMaterial2d<T: Material2d> {
    pub bindings: Vec<OwnedBindingResource>,
@ -430,6 +465,12 @@ pub struct PreparedMaterial2d<T: Material2d> {
    pub key: T::Data,
 }

+impl<T: Material2d> PreparedMaterial2d<T> {
+    pub fn get_bind_group_id(&self) -> Material2dBindGroupId {
+        Material2dBindGroupId(Some(self.bind_group.id()))
+    }
+}
+
 #[derive(Resource)]
 pub struct ExtractedMaterials2d<M: Material2d> {
    extracted: Vec<(AssetId<M>, M)>,
--- a/crates/bevy_sprite/src/mesh2d/mesh.rs
+++ b/crates/bevy_sprite/src/mesh2d/mesh.rs
@ -1,15 +1,16 @@
 use bevy_app::Plugin;
-use bevy_asset::{load_internal_asset, Handle};
+use bevy_asset::{load_internal_asset, AssetId, Handle};

+use bevy_core_pipeline::core_2d::Transparent2d;
 use bevy_ecs::{
    prelude::*,
-    query::ROQueryItem,
+    query::{QueryItem, ROQueryItem},
    system::{lifetimeless::*, SystemParamItem, SystemState},
 };
-use bevy_math::{Mat4, Vec2};
+use bevy_math::{Affine3, Vec2, Vec4};
 use bevy_reflect::Reflect;
 use bevy_render::{
-    extract_component::{ComponentUniforms, DynamicUniformIndex, UniformComponentPlugin},
+    batching::{batch_and_prepare_render_phase, write_batched_instance_buffer, GetBatchData},
    globals::{GlobalsBuffer, GlobalsUniform},
    mesh::{GpuBufferInfo, Mesh, MeshVertexBufferLayout},
    render_asset::RenderAssets,
@ -26,10 +27,12 @@ use bevy_render::{
 };
 use bevy_transform::components::GlobalTransform;

+use crate::Material2dBindGroupId;
+
 /// Component for rendering with meshes in the 2d pipeline, usually with a [2d material](crate::Material2d) such as [`ColorMaterial`](crate::ColorMaterial).
 ///
 /// It wraps a [`Handle<Mesh>`] to differentiate from the 3d pipelines which use the handles directly as components
-#[derive(Default, Clone, Component, Debug, Reflect)]
+#[derive(Default, Clone, Component, Debug, Reflect, PartialEq, Eq)]
 #[reflect(Component)]
 pub struct Mesh2dHandle(pub Handle<Mesh>);

@ -76,12 +79,6 @@ impl Plugin for Mesh2dRenderPlugin {
            "mesh2d_types.wgsl",
            Shader::from_wgsl
        );
-        load_internal_asset!(
-            app,
-            MESH2D_BINDINGS_HANDLE,
-            "mesh2d_bindings.wgsl",
-            Shader::from_wgsl
-        );
        load_internal_asset!(
            app,
            MESH2D_FUNCTIONS_HANDLE,
@ -90,8 +87,6 @@ impl Plugin for Mesh2dRenderPlugin {
        );
        load_internal_asset!(app, MESH2D_SHADER_HANDLE, "mesh2d.wgsl", Shader::from_wgsl);

-        app.add_plugins(UniformComponentPlugin::<Mesh2dUniform>::default());
-
        if let Ok(render_app) = app.get_sub_app_mut(RenderApp) {
            render_app
                .init_resource::<SpecializedMeshPipelines<Mesh2dPipeline>>()
@ -99,6 +94,10 @@ impl Plugin for Mesh2dRenderPlugin {
                .add_systems(
                    Render,
                    (
+                        batch_and_prepare_render_phase::<Transparent2d, Mesh2dPipeline>
+                            .in_set(RenderSet::PrepareResources),
+                        write_batched_instance_buffer::<Mesh2dPipeline>
+                            .in_set(RenderSet::PrepareResourcesFlush),
                        prepare_mesh2d_bind_group.in_set(RenderSet::PrepareBindGroups),
                        prepare_mesh2d_view_bind_groups.in_set(RenderSet::PrepareBindGroups),
                    ),
@ -107,19 +106,69 @@ impl Plugin for Mesh2dRenderPlugin {
    }

    fn finish(&self, app: &mut bevy_app::App) {
+        let mut mesh_bindings_shader_defs = Vec::with_capacity(1);
+
        if let Ok(render_app) = app.get_sub_app_mut(RenderApp) {
-            render_app.init_resource::<Mesh2dPipeline>();
+            if let Some(per_object_buffer_batch_size) = GpuArrayBuffer::<Mesh2dUniform>::batch_size(
+                render_app.world.resource::<RenderDevice>(),
+            ) {
+                mesh_bindings_shader_defs.push(ShaderDefVal::UInt(
+                    "PER_OBJECT_BUFFER_BATCH_SIZE".into(),
+                    per_object_buffer_batch_size,
+                ));
+            }
+
+            render_app
+                .insert_resource(GpuArrayBuffer::<Mesh2dUniform>::new(
+                    render_app.world.resource::<RenderDevice>(),
+                ))
+                .init_resource::<Mesh2dPipeline>();
        }
+
+        // Load the mesh_bindings shader module here as it depends on runtime information about
+        // whether storage buffers are supported, or the maximum uniform buffer binding size.
+        load_internal_asset!(
+            app,
+            MESH2D_BINDINGS_HANDLE,
+            "mesh2d_bindings.wgsl",
+            Shader::from_wgsl_with_defs,
+            mesh_bindings_shader_defs
+        );
    }
 }

-#[derive(Component, ShaderType, Clone)]
-pub struct Mesh2dUniform {
-    pub transform: Mat4,
-    pub inverse_transpose_model: Mat4,
+#[derive(Component)]
+pub struct Mesh2dTransforms {
+    pub transform: Affine3,
    pub flags: u32,
 }

+#[derive(ShaderType, Clone)]
+pub struct Mesh2dUniform {
+    // Affine 4x3 matrix transposed to 3x4
+    pub transform: [Vec4; 3],
+    // 3x3 matrix packed in mat2x4 and f32 as:
+    //   [0].xyz, [1].x,
+    //   [1].yz, [2].xy
+    //   [2].z
+    pub inverse_transpose_model_a: [Vec4; 2],
+    pub inverse_transpose_model_b: f32,
+    pub flags: u32,
+}
+
+impl From<&Mesh2dTransforms> for Mesh2dUniform {
+    fn from(mesh_transforms: &Mesh2dTransforms) -> Self {
+        let (inverse_transpose_model_a, inverse_transpose_model_b) =
+            mesh_transforms.transform.inverse_transpose_3x3();
+        Self {
+            transform: mesh_transforms.transform.to_transpose(),
+            inverse_transpose_model_a,
+            inverse_transpose_model_b,
+            flags: mesh_transforms.flags,
+        }
+    }
+}
+
 // NOTE: These must match the bit flags in bevy_sprite/src/mesh2d/mesh2d.wgsl!
 bitflags::bitflags! {
    #[repr(transparent)]
@ -139,15 +188,13 @@ pub fn extract_mesh2d(
        if !view_visibility.get() {
            continue;
        }
-        let transform = transform.compute_matrix();
        values.push((
            entity,
            (
                Mesh2dHandle(handle.0.clone_weak()),
-                Mesh2dUniform {
+                Mesh2dTransforms {
+                    transform: (&transform.affine()).into(),
                    flags: MeshFlags::empty().bits(),
-                    transform,
-                    inverse_transpose_model: transform.inverse().transpose(),
                },
            ),
        ));
@ -162,13 +209,18 @@ pub struct Mesh2dPipeline {
    pub mesh_layout: BindGroupLayout,
    // This dummy white texture is to be used in place of optional textures
    pub dummy_white_gpu_image: GpuImage,
+    pub per_object_buffer_batch_size: Option<u32>,
 }

 impl FromWorld for Mesh2dPipeline {
    fn from_world(world: &mut World) -> Self {
-        let mut system_state: SystemState<(Res<RenderDevice>, Res<DefaultImageSampler>)> =
-            SystemState::new(world);
-        let (render_device, default_sampler) = system_state.get_mut(world);
+        let mut system_state: SystemState<(
+            Res<RenderDevice>,
+            Res<RenderQueue>,
+            Res<DefaultImageSampler>,
+        )> = SystemState::new(world);
+        let (render_device, render_queue, default_sampler) = system_state.get_mut(world);
+        let render_device = render_device.into_inner();
        let view_layout = render_device.create_bind_group_layout(&BindGroupLayoutDescriptor {
            entries: &[
                // View
@ -197,16 +249,11 @@ impl FromWorld for Mesh2dPipeline {
        });

        let mesh_layout = render_device.create_bind_group_layout(&BindGroupLayoutDescriptor {
-            entries: &[BindGroupLayoutEntry {
-                binding: 0,
-                visibility: ShaderStages::VERTEX | ShaderStages::FRAGMENT,
-                ty: BindingType::Buffer {
-                    ty: BufferBindingType::Uniform,
-                    has_dynamic_offset: true,
-                    min_binding_size: Some(Mesh2dUniform::min_size()),
-                },
-                count: None,
-            }],
+            entries: &[GpuArrayBuffer::<Mesh2dUniform>::binding_layout(
+                0,
+                ShaderStages::VERTEX_FRAGMENT,
+                render_device,
+            )],
            label: Some("mesh2d_layout"),
        });
        // A 1x1x1 'all 1.0' texture to use as a dummy texture to use in place of optional StandardMaterial textures
@ -219,7 +266,6 @@ impl FromWorld for Mesh2dPipeline {
            };

            let format_size = image.texture_descriptor.format.pixel_size();
-            let render_queue = world.resource_mut::<RenderQueue>();
            render_queue.write_texture(
                ImageCopyTexture {
                    texture: &texture,
@ -253,6 +299,9 @@ impl FromWorld for Mesh2dPipeline {
            view_layout,
            mesh_layout,
            dummy_white_gpu_image,
+            per_object_buffer_batch_size: GpuArrayBuffer::<Mesh2dUniform>::batch_size(
+                render_device,
+            ),
        }
    }
 }
@ -275,6 +324,26 @@ impl Mesh2dPipeline {
    }
 }

+impl GetBatchData for Mesh2dPipeline {
+    type Query = (
+        Option<&'static Material2dBindGroupId>,
+        &'static Mesh2dHandle,
+        &'static Mesh2dTransforms,
+    );
+    type CompareData = (Option<Material2dBindGroupId>, AssetId<Mesh>);
+    type BufferData = Mesh2dUniform;
+
+    fn get_buffer_data(&(.., mesh_transforms): &QueryItem<Self::Query>) -> Self::BufferData {
+        mesh_transforms.into()
+    }
+
+    fn get_compare_data(
+        &(material_bind_group_id, mesh_handle, ..): &QueryItem<Self::Query>,
+    ) -> Self::CompareData {
+        (material_bind_group_id.copied(), mesh_handle.0.id())
+    }
+}
+
 bitflags::bitflags! {
    #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
    #[repr(transparent)]
@ -477,9 +546,9 @@ pub fn prepare_mesh2d_bind_group(
    mut commands: Commands,
    mesh2d_pipeline: Res<Mesh2dPipeline>,
    render_device: Res<RenderDevice>,
-    mesh2d_uniforms: Res<ComponentUniforms<Mesh2dUniform>>,
+    mesh2d_uniforms: Res<GpuArrayBuffer<Mesh2dUniform>>,
 ) {
-    if let Some(binding) = mesh2d_uniforms.uniforms().binding() {
+    if let Some(binding) = mesh2d_uniforms.binding() {
        commands.insert_resource(Mesh2dBindGroup {
            value: render_device.create_bind_group(&BindGroupDescriptor {
                entries: &[BindGroupEntry {
@ -557,20 +626,26 @@ pub struct SetMesh2dBindGroup<const I: usize>;
 impl<P: PhaseItem, const I: usize> RenderCommand<P> for SetMesh2dBindGroup<I> {
    type Param = SRes<Mesh2dBindGroup>;
    type ViewWorldQuery = ();
-    type ItemWorldQuery = Read<DynamicUniformIndex<Mesh2dUniform>>;
+    type ItemWorldQuery = ();

    #[inline]
    fn render<'w>(
-        _item: &P,
+        item: &P,
        _view: (),
-        mesh2d_index: &'_ DynamicUniformIndex<Mesh2dUniform>,
+        _item_query: (),
        mesh2d_bind_group: SystemParamItem<'w, '_, Self::Param>,
        pass: &mut TrackedRenderPass<'w>,
    ) -> RenderCommandResult {
+        let mut dynamic_offsets: [u32; 1] = Default::default();
+        let mut offset_count = 0;
+        if let Some(dynamic_offset) = item.dynamic_offset() {
+            dynamic_offsets[offset_count] = dynamic_offset.get();
+            offset_count += 1;
+        }
        pass.set_bind_group(
            I,
            &mesh2d_bind_group.into_inner().value,
-            &[mesh2d_index.index()],
+            &dynamic_offsets[..offset_count],
        );
        RenderCommandResult::Success
    }
@ -584,14 +659,21 @@ impl<P: PhaseItem> RenderCommand<P> for DrawMesh2d {

    #[inline]
    fn render<'w>(
-        _item: &P,
+        item: &P,
        _view: (),
        mesh_handle: ROQueryItem<'w, Self::ItemWorldQuery>,
        meshes: SystemParamItem<'w, '_, Self::Param>,
        pass: &mut TrackedRenderPass<'w>,
    ) -> RenderCommandResult {
+        let batch_range = item.batch_range();
        if let Some(gpu_mesh) = meshes.into_inner().get(&mesh_handle.0) {
            pass.set_vertex_buffer(0, gpu_mesh.vertex_buffer.slice(..));
+            #[cfg(all(feature = "webgl", target_arch = "wasm32"))]
+            pass.set_push_constants(
+                ShaderStages::VERTEX,
+                0,
+                &(batch_range.start as i32).to_le_bytes(),
+            );
            match &gpu_mesh.buffer_info {
                GpuBufferInfo::Indexed {
                    buffer,
@ -599,10 +681,10 @@ impl<P: PhaseItem> RenderCommand<P> for DrawMesh2d {
                    count,
                } => {
                    pass.set_index_buffer(buffer.slice(..), 0, *index_format);
-                    pass.draw_indexed(0..*count, 0, 0..1);
+                    pass.draw_indexed(0..*count, 0, batch_range.clone());
                }
                GpuBufferInfo::NonIndexed => {
-                    pass.draw(0..gpu_mesh.vertex_count, 0..1);
+                    pass.draw(0..gpu_mesh.vertex_count, batch_range.clone());
                }
            }
            RenderCommandResult::Success
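
Note on the `SetMesh2dBindGroup` change above: the dynamic offset is now optional, so the command fills a fixed-size array and passes only the populated prefix to `set_bind_group`. A minimal standalone sketch of that pattern follows. The function name `collect_dynamic_offsets` is hypothetical, and a plain `u32` stands in for the `NonMaxU32` used in the diff.

```rust
/// Hypothetical helper mirroring the pattern in `SetMesh2dBindGroup::render`.
/// Returns a fixed-size backing array plus the number of valid entries, so the
/// caller can slice it as `&offsets[..count]` without allocating.
fn collect_dynamic_offsets(dynamic_offset: Option<u32>) -> ([u32; 1], usize) {
    let mut offsets = [0u32; 1];
    let mut count = 0;
    if let Some(offset) = dynamic_offset {
        // Only written when the binding actually uses a dynamic offset,
        // e.g. when the data lives in batched uniform buffer bindings.
        offsets[count] = offset;
        count += 1;
    }
    (offsets, count)
}
```

With `None`, the resulting slice is empty, which is what a binding without a dynamic offset (for example a storage buffer) would be expected to receive.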
--- a/crates/bevy_sprite/src/mesh2d/mesh2d.wgsl
+++ b/crates/bevy_sprite/src/mesh2d/mesh2d.wgsl
@ -8,6 +8,7 @@
 #endif

 struct Vertex {
+    @builtin(instance_index) instance_index: u32,
 #ifdef VERTEX_POSITIONS
    @location(0) position: vec3<f32>,
 #endif
@ -33,20 +34,21 @@ fn vertex(vertex: Vertex) -> MeshVertexOutput {
 #endif

 #ifdef VERTEX_POSITIONS
+    var model = mesh_functions::get_model_matrix(vertex.instance_index);
    out.world_position = mesh_functions::mesh2d_position_local_to_world(
-        mesh.model, 
+        model,
        vec4<f32>(vertex.position, 1.0)
    );
    out.position = mesh_functions::mesh2d_position_world_to_clip(out.world_position);
 #endif

 #ifdef VERTEX_NORMALS
-    out.world_normal = mesh_functions::mesh2d_normal_local_to_world(vertex.normal);
+    out.world_normal = mesh_functions::mesh2d_normal_local_to_world(vertex.normal, vertex.instance_index);
 #endif

 #ifdef VERTEX_TANGENTS
    out.world_tangent = mesh_functions::mesh2d_tangent_local_to_world(
-        mesh.model, 
+        model,
        vertex.tangent
    );
 #endif
--- a/crates/bevy_sprite/src/mesh2d/mesh2d_bindings.wgsl
+++ b/crates/bevy_sprite/src/mesh2d/mesh2d_bindings.wgsl
@ -1,5 +1,21 @@
 #define_import_path bevy_sprite::mesh2d_bindings

-#import bevy_sprite::mesh2d_types
+#import bevy_sprite::mesh2d_types Mesh2d

-@group(2) @binding(0) var<uniform> mesh: bevy_sprite::mesh2d_types::Mesh2d;
+#ifdef MESH_BINDGROUP_1
+
+#ifdef PER_OBJECT_BUFFER_BATCH_SIZE
+@group(1) @binding(0) var<uniform> mesh: array<Mesh2d, #{PER_OBJECT_BUFFER_BATCH_SIZE}u>;
+#else
+@group(1) @binding(0) var<storage> mesh: array<Mesh2d>;
+#endif // PER_OBJECT_BUFFER_BATCH_SIZE
+
+#else // MESH_BINDGROUP_1
+
+#ifdef PER_OBJECT_BUFFER_BATCH_SIZE
+@group(2) @binding(0) var<uniform> mesh: array<Mesh2d, #{PER_OBJECT_BUFFER_BATCH_SIZE}u>;
+#else
+@group(2) @binding(0) var<storage> mesh: array<Mesh2d>;
+#endif // PER_OBJECT_BUFFER_BATCH_SIZE
+
+#endif // MESH_BINDGROUP_1
--- a/crates/bevy_sprite/src/mesh2d/mesh2d_functions.wgsl
+++ b/crates/bevy_sprite/src/mesh2d/mesh2d_functions.wgsl
@ -2,6 +2,12 @@

 #import bevy_sprite::mesh2d_view_bindings  view
 #import bevy_sprite::mesh2d_bindings       mesh
+#import bevy_render::instance_index        get_instance_index
+#import bevy_render::maths                 affine_to_square, mat2x4_f32_to_mat3x3_unpack
+
+fn get_model_matrix(instance_index: u32) -> mat4x4<f32> {
+    return affine_to_square(mesh[get_instance_index(instance_index)].model);
+}

 fn mesh2d_position_local_to_world(model: mat4x4<f32>, vertex_position: vec4<f32>) -> vec4<f32> {
    return model * vertex_position;
@ -19,11 +25,10 @@ fn mesh2d_position_local_to_clip(model: mat4x4<f32>, vertex_position: vec4<f32>)
    return mesh2d_position_world_to_clip(world_position);
 }

-fn mesh2d_normal_local_to_world(vertex_normal: vec3<f32>) -> vec3<f32> {
-    return mat3x3<f32>(
-        mesh.inverse_transpose_model[0].xyz,
-        mesh.inverse_transpose_model[1].xyz,
-        mesh.inverse_transpose_model[2].xyz
+fn mesh2d_normal_local_to_world(vertex_normal: vec3<f32>, instance_index: u32) -> vec3<f32> {
+    return mat2x4_f32_to_mat3x3_unpack(
+        mesh[get_instance_index(instance_index)].inverse_transpose_model_a,
+        mesh[get_instance_index(instance_index)].inverse_transpose_model_b,
    ) * vertex_normal;
 }

--- a/crates/bevy_sprite/src/mesh2d/mesh2d_types.wgsl
+++ b/crates/bevy_sprite/src/mesh2d/mesh2d_types.wgsl
@ -1,8 +1,16 @@
 #define_import_path bevy_sprite::mesh2d_types

 struct Mesh2d {
-    model: mat4x4<f32>,
-    inverse_transpose_model: mat4x4<f32>,
+    // Affine 4x3 matrix transposed to 3x4
+    // Use bevy_render::maths::affine_to_square to unpack
+    model: mat3x4<f32>,
+    // 3x3 matrix packed in mat2x4 and f32 as:
+    // [0].xyz, [1].x,
+    // [1].yz, [2].xy
+    // [2].z
+    // Use bevy_render::maths::mat2x4_f32_to_mat3x3_unpack to unpack
+    inverse_transpose_model_a: mat2x4<f32>,
+    inverse_transpose_model_b: f32,
    // 'flags' is a bit field indicating various options. u32 is 32 bits so we have up to 32 options.
    flags: u32,
 };
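
Note on the packed layout above: the comment in `Mesh2d` describes how the nine floats of the 3x3 inverse transpose are spread over a `mat2x4<f32>` and one `f32`. The sketch below illustrates that layout with plain arrays. The names `pack_mat3` and `unpack_mat3` are hypothetical and are not the `bevy_render::maths` functions; the matrix is assumed to be column-major, indexed as `m[col][row]`.

```rust
/// Column-major 3x3 matrix, indexed as `m[col][row]` (an assumption of this sketch).
type Mat3 = [[f32; 3]; 3];

/// Packs a 3x3 matrix into two 4-float rows plus one trailing scalar:
///   a[0] = [0].xyz, [1].x
///   a[1] = [1].yz,  [2].xy
///   b    = [2].z
fn pack_mat3(m: &Mat3) -> ([[f32; 4]; 2], f32) {
    (
        [
            [m[0][0], m[0][1], m[0][2], m[1][0]],
            [m[1][1], m[1][2], m[2][0], m[2][1]],
        ],
        m[2][2],
    )
}

/// Inverse of `pack_mat3`: rebuilds the three columns from the packed form.
fn unpack_mat3(a: &[[f32; 4]; 2], b: f32) -> Mat3 {
    [
        [a[0][0], a[0][1], a[0][2]],
        [a[0][3], a[1][0], a[1][1]],
        [a[1][2], a[1][3], b],
    ]
}
```

Packing this way stores the matrix in 36 bytes of payload instead of the 64 bytes a `mat4x4<f32>` would occupy, at the cost of a small unpack step in the shader.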
--- a/crates/bevy_sprite/src/render/mod.rs
+++ b/crates/bevy_sprite/src/render/mod.rs
@ -565,8 +565,9 @@ pub fn queue_sprites(
                    pipeline: colored_pipeline,
                    entity: *entity,
                    sort_key,
-                    // batch_size will be calculated in prepare_sprites
-                    batch_size: 0,
+                    // batch_range and dynamic_offset will be calculated in prepare_sprites
+                    batch_range: 0..0,
+                    dynamic_offset: None,
                });
            } else {
                transparent_phase.add(Transparent2d {
@ -574,8 +575,9 @@ pub fn queue_sprites(
                    pipeline,
                    entity: *entity,
                    sort_key,
-                    // batch_size will be calculated in prepare_sprites
-                    batch_size: 0,
+                    // batch_range and dynamic_offset will be calculated in prepare_sprites
+                    batch_range: 0..0,
+                    dynamic_offset: None,
                });
            }
        }
@ -739,7 +741,9 @@ pub fn prepare_sprites(
                    ));
                }

-                transparent_phase.items[batch_item_index].batch_size += 1;
+                transparent_phase.items[batch_item_index]
+                    .batch_range_mut()
+                    .end += 1;
                batches.last_mut().unwrap().1.range.end += 1;
                index += 1;
            }
--- a/crates/bevy_ui/src/render/mod.rs
+++ b/crates/bevy_ui/src/render/mod.rs
@ -4,6 +4,7 @@ mod render_pass;
 use bevy_core_pipeline::{core_2d::Camera2d, core_3d::Camera3d};
 use bevy_ecs::storage::SparseSet;
 use bevy_hierarchy::Parent;
+use bevy_render::render_phase::PhaseItem;
 use bevy_render::view::ViewVisibility;
 use bevy_render::{ExtractSchedule, Render};
 use bevy_window::{PrimaryWindow, Window};
@ -665,8 +666,9 @@ pub fn queue_uinodes(
                pipeline,
                entity: *entity,
                sort_key: FloatOrd(extracted_uinode.stack_index as f32),
-                // batch_size will be calculated in prepare_uinodes
-                batch_size: 0,
+                // batch_range will be calculated in prepare_uinodes
+                batch_range: 0..0,
+                dynamic_offset: None,
            });
        }
    }
@ -892,7 +894,7 @@ pub fn prepare_uinodes(
                    }
                    index += QUAD_INDICES.len() as u32;
                    existing_batch.unwrap().1.range.end = index;
-                    ui_phase.items[batch_item_index].batch_size += 1;
+                    ui_phase.items[batch_item_index].batch_range_mut().end += 1;
                } else {
                    batch_image_handle = AssetId::invalid();
                }
--- a/crates/bevy_ui/src/render/render_pass.rs
+++ b/crates/bevy_ui/src/render/render_pass.rs
@ -1,3 +1,5 @@
+use std::ops::Range;
+
 use super::{UiBatch, UiImageBindGroups, UiMeta};
 use crate::{prelude::UiCameraConfig, DefaultCameraView};
 use bevy_ecs::{
@ -11,7 +13,7 @@ use bevy_render::{
    renderer::*,
    view::*,
 };
-use bevy_utils::FloatOrd;
+use bevy_utils::{nonmax::NonMaxU32, FloatOrd};

 pub struct UiPassNode {
    ui_view_query: QueryState<
@ -90,7 +92,8 @@ pub struct TransparentUi {
    pub entity: Entity,
    pub pipeline: CachedRenderPipelineId,
    pub draw_function: DrawFunctionId,
-    pub batch_size: usize,
+    pub batch_range: Range<u32>,
+    pub dynamic_offset: Option<NonMaxU32>,
 }

 impl PhaseItem for TransparentUi {
@ -117,8 +120,23 @@ impl PhaseItem for TransparentUi {
    }

    #[inline]
-    fn batch_size(&self) -> usize {
-        self.batch_size
+    fn batch_range(&self) -> &Range<u32> {
+        &self.batch_range
+    }
+
+    #[inline]
+    fn batch_range_mut(&mut self) -> &mut Range<u32> {
+        &mut self.batch_range
+    }
+
+    #[inline]
+    fn dynamic_offset(&self) -> Option<NonMaxU32> {
+        self.dynamic_offset
+    }
+
+    #[inline]
+    fn dynamic_offset_mut(&mut self) -> &mut Option<NonMaxU32> {
+        &mut self.dynamic_offset
    }
 }

--- a/crates/bevy_utils/Cargo.toml
+++ b/crates/bevy_utils/Cargo.toml
@ -20,6 +20,7 @@ hashbrown = { version = "0.14", features = ["serde"] }
 bevy_utils_proc_macros = {version = "0.12.0-dev", path = "macros"}
 petgraph = "0.6"
 thiserror = "1.0"
+nonmax = "0.5"

 [target.'cfg(target_arch = "wasm32")'.dependencies]
 getrandom = {version = "0.2.0", features = ["js"]}
--- a/crates/bevy_utils/src/lib.rs
+++ b/crates/bevy_utils/src/lib.rs
@ -34,6 +34,11 @@ pub use thiserror;
 pub use tracing;
 pub use uuid::Uuid;

+#[allow(missing_docs)]
+pub mod nonmax {
+    pub use nonmax::*;
+}
+
 use hashbrown::hash_map::RawEntryMut;
 use std::{
    fmt::Debug,
--- a/examples/2d/mesh2d_manual.rs
+++ b/examples/2d/mesh2d_manual.rs
@ -21,7 +21,7 @@ use bevy::{
        Extract, Render, RenderApp, RenderSet,
    },
    sprite::{
-        DrawMesh2d, Mesh2dHandle, Mesh2dPipeline, Mesh2dPipelineKey, Mesh2dUniform,
+        DrawMesh2d, Mesh2dHandle, Mesh2dPipeline, Mesh2dPipelineKey, Mesh2dTransforms,
        SetMesh2dBindGroup, SetMesh2dViewBindGroup,
    },
    utils::FloatOrd,
@ -148,19 +148,24 @@ impl SpecializedRenderPipeline for ColoredMesh2dPipeline {
            false => TextureFormat::bevy_default(),
        };

+        // Meshes typically live in bind group 2. Because we are using bind group 1
+        // we need to add the MESH_BINDGROUP_1 shader def so that the bindings are correctly
+        // linked in the shader.
+        let shader_defs = vec!["MESH_BINDGROUP_1".into()];
+
        RenderPipelineDescriptor {
            vertex: VertexState {
                // Use our custom shader
                shader: COLORED_MESH2D_SHADER_HANDLE,
                entry_point: "vertex".into(),
-                shader_defs: Vec::new(),
+                shader_defs: shader_defs.clone(),
                // Use our custom vertex buffer
                buffers: vec![vertex_layout],
            },
            fragment: Some(FragmentState {
                // Use our custom shader
                shader: COLORED_MESH2D_SHADER_HANDLE,
-                shader_defs: Vec::new(),
+                shader_defs,
                entry_point: "fragment".into(),
                targets: vec![Some(ColorTargetState {
                    format,
@ -212,13 +217,12 @@ type DrawColoredMesh2d = (
 // using `include_str!()`, or loaded like any other asset with `asset_server.load()`.
 const COLORED_MESH2D_SHADER: &str = r"
 // Import the standard 2d mesh uniforms and set their bind groups
-#import bevy_sprite::mesh2d_types as MeshTypes
+#import bevy_sprite::mesh2d_bindings mesh
 #import bevy_sprite::mesh2d_functions as MeshFunctions

-@group(1) @binding(0) var<uniform> mesh: MeshTypes::Mesh2d;
-
 // The structure of the vertex buffer is as specified in `specialize()`
 struct Vertex {
+    @builtin(instance_index) instance_index: u32,
    @location(0) position: vec3<f32>,
    @location(1) color: u32,
 };
@ -235,7 +239,8 @@ struct VertexOutput {
 fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    // Project the world position of the mesh into screen position
-    out.clip_position = MeshFunctions::mesh2d_position_local_to_clip(mesh.model, vec4<f32>(vertex.position, 1.0));
+    let model = MeshFunctions::get_model_matrix(vertex.instance_index);
+    out.clip_position = MeshFunctions::mesh2d_position_local_to_clip(model, vec4<f32>(vertex.position, 1.0));
    // Unpack the `u32` from the vertex buffer into the `vec4<f32>` used by the fragment shader
    out.color = vec4<f32>((vec4<u32>(vertex.color) >> vec4<u32>(0u, 8u, 16u, 24u)) & vec4<u32>(255u)) / 255.0;
    return out;
@ -315,7 +320,7 @@ pub fn queue_colored_mesh2d(
    pipeline_cache: Res<PipelineCache>,
    msaa: Res<Msaa>,
    render_meshes: Res<RenderAssets<Mesh>>,
-    colored_mesh2d: Query<(&Mesh2dHandle, &Mesh2dUniform), With<ColoredMesh2d>>,
+    colored_mesh2d: Query<(&Mesh2dHandle, &Mesh2dTransforms), With<ColoredMesh2d>>,
    mut views: Query<(
        &VisibleEntities,
        &mut RenderPhase<Transparent2d>,
@ -334,7 +339,7 @@ pub fn queue_colored_mesh2d(

        // Queue all entities visible to that view
        for visible_entity in &visible_entities.entities {
-            if let Ok((mesh2d_handle, mesh2d_uniform)) = colored_mesh2d.get(*visible_entity) {
+            if let Ok((mesh2d_handle, mesh2d_transforms)) = colored_mesh2d.get(*visible_entity) {
                // Get our specialized pipeline
                let mut mesh2d_key = mesh_key;
                if let Some(mesh) = render_meshes.get(&mesh2d_handle.0) {
@ -345,7 +350,7 @@ pub fn queue_colored_mesh2d(
                let pipeline_id =
                    pipelines.specialize(&pipeline_cache, &colored_mesh2d_pipeline, mesh2d_key);

-                let mesh_z = mesh2d_uniform.transform.w_axis.z;
+                let mesh_z = mesh2d_transforms.transform.translation.z;
                transparent_phase.add(Transparent2d {
                    entity: *visible_entity,
                    draw_function: draw_colored_mesh2d,
@ -354,7 +359,8 @@ pub fn queue_colored_mesh2d(
                    // in order to get correct transparency
                    sort_key: FloatOrd(mesh_z),
                    // This material is not batched
-                    batch_size: 1,
+                    batch_range: 0..1,
+                    dynamic_offset: None,
                });
            }
        }
--- a/examples/shader/shader_instancing.rs
+++ b/examples/shader/shader_instancing.rs
@ -136,7 +136,8 @@ fn queue_custom(
                    draw_function: draw_custom,
                    distance: rangefinder
                        .distance_translation(&mesh_transforms.transform.translation),
-                    batch_size: 1,
+                    batch_range: 0..1,
+                    dynamic_offset: None,
                });
            }
        }