Big big big buffers and how to work with them #6130
How are we to deal with really, really big data? Is there a way to make a series of bindings appear as one contiguous buffer (shader-side)? I've been unable to get it working; a toy example to demonstrate follows, and the whole code is here.

Let's say our goal is to turn 1GB of 0.0s (f32s) into 1.0s with a simple shader that adds 1.0 to every element in the array. Let's start with 1GB of f32s, all initialized to 0.0:

```rust
let numbers = gigs_of_zeroed_f32s(1.0);
```

The usual boilerplate to get a device and adapter:

```rust
let instance = wgpu::Instance::default();
let adapter = instance
    .request_adapter(&wgpu::RequestAdapterOptions::default())
    .await?;
let (device, queue) = adapter
    .request_device(
        &wgpu::DeviceDescriptor {
            label: None,
            required_features: Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::BUFFER_BINDING_ARRAY,
            memory_hints: wgpu::MemoryHints::Performance,
            ..Default::default()
        },
        None,
    )
    .await?;
```

1GB will be too large for a single buffer (on my GPU, and probably most), so we split the data up into chunks.
We'll chunk to at most half my card's max buffer size:

```rust
// My GPU's max buffer size is actually big enough to do this in ~250MB
// chunks, but I appreciate it's a rare and special one, so we'll assume
// half of that.
let chunks = calculate_chunks(numbers, RTX_TITAN_MAX_BUFFER_SIZE);

// Staging buffers we can map and read back on the CPU, one per chunk.
let staging_buffers: Vec<wgpu::Buffer> = (0..chunks.len())
    .map(|e| {
        let size = std::mem::size_of_val(chunks[e]) as u64;
        device.create_buffer(&wgpu::BufferDescriptor {
            label: Some(&format!("staging buffer-{}", e)),
            size,
            usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
            mapped_at_creation: false,
        })
    })
    .collect();

// Storage buffers holding the actual data, one per chunk.
let storage_buffers: Vec<wgpu::Buffer> = chunks
    .iter()
    .enumerate()
    .map(|(e, seg)| {
        device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
            label: Some(&format!("Storage Buffer-{}", e)),
            contents: bytemuck::cast_slice(seg),
            usage: wgpu::BufferUsages::STORAGE
                | wgpu::BufferUsages::COPY_DST
                | wgpu::BufferUsages::COPY_SRC,
        })
    })
    .collect();
```

I'll omit the layout and pipeline creation for brevity, and because we'll talk about them after this next part. Now, at this stage we can (unfortunately) get everything working with a shader that does this:

```wgsl
@group(0)
@binding(0)
var<storage, read_write> flat_buffer0: array<f32>;
@group(0)
@binding(1)
var<storage, read_write> flat_buffer1: array<f32>;
@group(0)
@binding(2)
var<storage, read_write> flat_buffer2: array<f32>;
@group(0)
@binding(3)
var<storage, read_write> flat_buffer3: array<f32>;
@group(0)
@binding(4)
var<storage, read_write> flat_buffer4: array<f32>;
@group(0)
@binding(5)
var<storage, read_write> flat_buffer5: array<f32>;
@group(0)
@binding(6)
var<storage, read_write> flat_buffer6: array<f32>;
@group(0)
@binding(7)
var<storage, read_write> flat_buffer7: array<f32>;

// Add one to a given value.
fn add_one(n: f32) -> f32 {
    return n + 1.0;
}

const OFFSET: u32 = 256u;

@compute
@workgroup_size(256, 1, 1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let base_index = global_id.x * OFFSET;
    // Loop over the OFFSET indices that this thread is responsible for.
    for (var i = 0u; i < OFFSET; i++) {
        let index = base_index + i;
        if (index < arrayLength(&flat_buffer0)) {
            flat_buffer0[index] = add_one(flat_buffer0[index]);
            flat_buffer1[index] = add_one(flat_buffer1[index]);
            flat_buffer2[index] = add_one(flat_buffer2[index]);
            flat_buffer3[index] = add_one(flat_buffer3[index]);
            flat_buffer4[index] = add_one(flat_buffer4[index]);
            flat_buffer5[index] = add_one(flat_buffer5[index]);
            flat_buffer6[index] = add_one(flat_buffer6[index]);
            flat_buffer7[index] = add_one(flat_buffer7[index]);
        }
    }
}
```

Which we can probably all agree is neither the code we wish to see nor write, even ignoring the max dispatch size and offset handling, which obviously limit how high we can index. What we'd like instead is to treat the chunks as one logical array, something like:

```wgsl
struct OurBuffer {
    inner: array<f32, BUFF_LENGTH>,
}

@group(0) @binding(0)
var<storage, read_write> all_buffers: array<OurBuffer, NUM_BUFFERS>;
```

Treating them all as one, provided we supply the matching bind group layout entries:
```rust
let bind_group_layout_entries: Vec<wgpu::BindGroupLayoutEntry> = (0..storage_buffers.len())
    .map(|bind_idx| wgpu::BindGroupLayoutEntry {
        binding: bind_idx as u32,
        visibility: wgpu::ShaderStages::COMPUTE,
        ty: wgpu::BindingType::Buffer {
            ty: wgpu::BufferBindingType::Storage { read_only: false },
            has_dynamic_offset: false,
            min_binding_size: None,
        },
        count: Some(NonZero::new(1)?),
    })
    .collect();
```

However, if we do that:

```
❯ naga src/shader.wgsl
Validation successful
```

The shader 'seems' fine; however, when you run it, the shader expects the binding at (0,0) to be as large as the whole array, and it's only being given a single chunk:

```
ERROR wgpu::backend::wgpu_core > Handling wgpu errors as fatal by default
thread 'main' panicked at /home/jer/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-22.1.0/src/backend/wgpu_core.rs:3411:5:
wgpu error: Validation Error

Caused by:
    In ComputePass::end
      In a dispatch command, indirect:false
        Buffer is bound with size 134217728 where the shader expects 1073741824 in group[0] compact index 0
```

So my question: is it possible to do this? And if so, can anyone point me in the right direction? I have scoured the source of whisper/burn/bevy, and of course the examples in this repo, and have been unable to find any examples or documentation to support this (very normal?) use case.

For perhaps sideways awareness, you can do this in CUDA pretty trivially, with something like the following:
```cpp
#include <cuda_runtime.h>
#include <iostream>
#include <cstdlib>

__global__ void addOne(float *d_data, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        d_data[idx] += 1.0f;
    }
}

int main() {
    size_t size = 1 << 28; // 2^28 f32s = 1GB
    // calloc rather than malloc so we upload zeroed data, matching the wgpu example.
    float *h_data = (float*)calloc(size, sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, size * sizeof(float));
    cudaMemcpy(d_data, h_data, size * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int numBlocks = (size + blockSize - 1) / blockSize;
    addOne<<<numBlocks, blockSize>>>(d_data, size);

    cudaMemcpy(h_data, d_data, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    std::cout << "Done" << std::endl;
    return 0;
}
```

Any help would be greatly appreciated, and to any wgpu maintainers: once I get this working, expect a PR for a new example!
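As a sanity check on the numbers in the chunked approach above, here is the dispatch arithmetic (a sketch; it assumes the 128 MiB chunk size, the `workgroup_size(256)`, and the `OFFSET = 256` stride used in the WGSL earlier):

```rust
// How many workgroups one dispatch needs to cover a single 128 MiB chunk,
// given that each invocation strides over OFFSET elements and each
// workgroup runs 256 invocations (matching the WGSL shader above).
fn main() {
    const OFFSET: u64 = 256;
    const WORKGROUP_SIZE: u64 = 256;
    // One 128 MiB chunk holds 2^25 f32s.
    let elems_per_buffer: u64 = (128 * 1024 * 1024) / 4;
    let elems_per_workgroup = OFFSET * WORKGROUP_SIZE; // 65536 elements
    // Ceiling division: workgroups needed to cover the chunk.
    let workgroups = (elems_per_buffer + elems_per_workgroup - 1) / elems_per_workgroup;
    assert_eq!(workgroups, 512);
    println!("dispatch_workgroups({}, 1, 1)", workgroups);
}
```

This stays comfortably under the per-dimension dispatch limit, which is part of why the per-thread `OFFSET` loop exists at all.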
Replies: 1 comment
Edit: I solved this, thank you @jasmine for your help. You need a few things. One, the following features:

```rust
let (device, queue) = adapter
    .request_device(
        &wgpu::DeviceDescriptor {
            label: None,
            required_features: Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING
                | Features::BUFFER_BINDING_ARRAY,
            memory_hints: wgpu::MemoryHints::Performance,
            ..Default::default()
        },
        None,
    )
    .await
    .unwrap();
```

And two, you need to use the `binding_array` keyword. Then in our shader:

```wgsl
// NOTE: binding_array will not work on WebGPU or dawn, it's wgpu(naga) only.
@group(0) @binding(0)
var<storage, read_write> all_buffers: binding_array<array<f32, NUM_BUFFERS>>;
```

One contiguous buffer :) Working source is here; I will make a PR for the examples later this week.
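Indexing into the `binding_array` amounts to splitting a flat element index into a (buffer, offset-within-buffer) pair. A host-side sketch of that addressing, assuming 8 equal 128 MiB chunks (these sizes are the running example's assumptions, not anything wgpu mandates):

```rust
// Elements per 128 MiB chunk of f32s.
const ELEMS_PER_BUFFER: u64 = (128 * 1024 * 1024) / 4;

// Split a flat element index into (which buffer, index within buffer),
// mirroring `all_buffers[i / LEN][i % LEN]` style indexing in the shader.
fn split_index(flat: u64) -> (u64, u64) {
    (flat / ELEMS_PER_BUFFER, flat % ELEMS_PER_BUFFER)
}

fn main() {
    assert_eq!(split_index(0), (0, 0));
    // The last element of 1GB of f32s lands at the end of buffer 7.
    assert_eq!(split_index((1u64 << 28) - 1), (7, ELEMS_PER_BUFFER - 1));
}
```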