Big big big buffers and how to work with them #6130
How are we to deal with really, really big data? Is there a way to make a series of bindings appear as one contiguous buffer (shader-side)? I've been unable to get it working; a toy example to demonstrate follows, and the whole code is here.

Let's say our goal is to turn 1GB of 0.0s (f32s) into 1.0s with a simple shader that adds 1.0 to every element in the array. Let's start with 1GB of f32s, all initialized to 0.0:

```rust
let numbers = gigs_of_zeroed_f32s(1.0);
```

The usual boilerplate to get a device and adapter:

```rust
let instance = wgpu::Instance::default();
let adapter = instance
    .request_adapter(&wgpu::RequestAdapterOptions::default())
    .await?;
let (device, queue) = adapter
    .request_device(
        &wgpu::DeviceDescriptor {
            label: None,
            required_features: Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::BUFFER_BINDING_ARRAY,
            memory_hints: wgpu::MemoryHints::Performance,
            ..Default::default()
        },
        None,
    )
    .await?;
```

1GB will be too large for a single buffer (on my GPU, and probably most), so we split the data up into chunks.
We'll chunk to at most half my card's max buffer size:

```rust
// My GPU's max buffer size is actually big enough to do this in ~250MB
// chunks, but I appreciate it's a rare and special one, so we'll assume
// half of that.
let chunks = calculate_chunks(numbers, RTX_TITAN_MAX_BUFFER_SIZE);

// Staging buffers we can map and read back on the CPU, one per chunk.
let staging_buffers: Vec<wgpu::Buffer> = (0..chunks.len())
    .map(|e| {
        let size = std::mem::size_of_val(chunks[e]) as u64;
        device.create_buffer(&wgpu::BufferDescriptor {
            label: Some(&format!("staging buffer-{}", e)),
            size,
            usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
            mapped_at_creation: false,
        })
    })
    .collect();

// Storage buffers holding the actual data, one per chunk.
let storage_buffers: Vec<wgpu::Buffer> = chunks
    .iter()
    .enumerate()
    .map(|(e, seg)| {
        device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
            label: Some(&format!("Storage Buffer-{}", e)),
            contents: bytemuck::cast_slice(seg),
            usage: wgpu::BufferUsages::STORAGE
                | wgpu::BufferUsages::COPY_DST
                | wgpu::BufferUsages::COPY_SRC,
        })
    })
    .collect();
```

I'll omit the layout and pipeline creation for brevity, and because we'll talk about them after this next part. Now, at this stage we can (unfortunately) get everything working with a shader that does this:

```wgsl
@group(0)
@binding(0)
var<storage, read_write> flat_buffer0: array<f32>;
@group(0)
@binding(1)
var<storage, read_write> flat_buffer1: array<f32>;
@group(0)
@binding(2)
var<storage, read_write> flat_buffer2: array<f32>;
@group(0)
@binding(3)
var<storage, read_write> flat_buffer3: array<f32>;
@group(0)
@binding(4)
var<storage, read_write> flat_buffer4: array<f32>;
@group(0)
@binding(5)
var<storage, read_write> flat_buffer5: array<f32>;
@group(0)
@binding(6)
var<storage, read_write> flat_buffer6: array<f32>;
@group(0)
@binding(7)
var<storage, read_write> flat_buffer7: array<f32>;

// Add one to a given value.
fn add_one(n: f32) -> f32 {
    return n + 1.0;
}

const OFFSET: u32 = 256u;

@compute
@workgroup_size(256, 1, 1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let base_index = global_id.x * OFFSET;
    // Loop over the OFFSET indices that this thread is responsible for.
    for (var i = 0u; i < OFFSET; i++) {
        let index = base_index + i;
        if (index < arrayLength(&flat_buffer0)) {
            flat_buffer0[index] = add_one(flat_buffer0[index]);
            flat_buffer1[index] = add_one(flat_buffer1[index]);
            flat_buffer2[index] = add_one(flat_buffer2[index]);
            flat_buffer3[index] = add_one(flat_buffer3[index]);
            flat_buffer4[index] = add_one(flat_buffer4[index]);
            flat_buffer5[index] = add_one(flat_buffer5[index]);
            flat_buffer6[index] = add_one(flat_buffer6[index]);
            flat_buffer7[index] = add_one(flat_buffer7[index]);
        }
    }
}
```

Which we can probably all agree is neither the code we wish to see nor write, even ignoring the max dispatch size and offset handling, which obviously limit how high we can index. What we'd like instead is to treat the chunks as one logical array, something like:

```wgsl
struct OurBuffer {
    inner: array<f32, BUFF_LENGTH>,
}

@group(0) @binding(0)
var<storage, read_write> all_buffers: array<OurBuffer, NUM_BUFFERS>;
```

Treating them all as one, provided we supply the matching bind group layout entries:
```rust
let bind_group_layout_entries: Vec<wgpu::BindGroupLayoutEntry> = (0..storage_buffers.len())
    .map(|bind_idx| wgpu::BindGroupLayoutEntry {
        binding: bind_idx as u32,
        visibility: wgpu::ShaderStages::COMPUTE,
        ty: wgpu::BindingType::Buffer {
            ty: wgpu::BufferBindingType::Storage { read_only: false },
            has_dynamic_offset: false,
            min_binding_size: None,
        },
        count: Some(NonZero::new(1)?),
    })
    .collect();
```

However, if we do that:

```
❯ naga src/shader.wgsl
Validation successful
```

The shader 'seems' fine; however, when you run it, the shader expects the binding at (0,0) to be as large as the whole array, and it's only being given a single chunk:

```
ERROR wgpu::backend::wgpu_core > Handling wgpu errors as fatal by default
thread 'main' panicked at /home/jer/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-22.1.0/src/backend/wgpu_core.rs:3411:5:
wgpu error: Validation Error

Caused by:
    In ComputePass::end
      In a dispatch command, indirect:false
        Buffer is bound with size 134217728 where the shader expects 1073741824 in group[0] compact index 0
```

So my question: is it possible to do this? And if so, can anyone point me in the right direction? I have scoured the source of whisper/burn/bevy, and of course the examples in this repo, and have been unable to find any examples or documentation to support this (very normal?) use case.

For perhaps sideways awareness, you can do this in CUDA pretty trivially, with something like the following:
```cpp
#include <cuda_runtime.h>
#include <iostream>
#include <cstdlib>

__global__ void addOne(float *d_data, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        d_data[idx] += 1.0f;
    }
}

int main() {
    size_t size = 1 << 28; // 2^28 f32s = 1GB
    // calloc rather than malloc so we upload zeroed data, matching the wgpu example.
    float *h_data = (float*)calloc(size, sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, size * sizeof(float));
    cudaMemcpy(d_data, h_data, size * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int numBlocks = (size + blockSize - 1) / blockSize;
    addOne<<<numBlocks, blockSize>>>(d_data, size);

    cudaMemcpy(h_data, d_data, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    std::cout << "Done" << std::endl;
    return 0;
}
```

Any help would be greatly appreciated, and to any wgpu maintainers: once I get this working, expect a PR for a new example!
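As a sanity check on the numbers in the chunked approach above, here is the dispatch arithmetic (a sketch; it assumes the 128 MiB chunk size, the `workgroup_size(256)`, and the `OFFSET = 256` stride used in the WGSL earlier):

```rust
// How many workgroups one dispatch needs to cover a single 128 MiB chunk,
// given that each invocation strides over OFFSET elements and each
// workgroup runs 256 invocations (matching the WGSL shader above).
fn main() {
    const OFFSET: u64 = 256;
    const WORKGROUP_SIZE: u64 = 256;
    // One 128 MiB chunk holds 2^25 f32s.
    let elems_per_buffer: u64 = (128 * 1024 * 1024) / 4;
    let elems_per_workgroup = OFFSET * WORKGROUP_SIZE; // 65536 elements
    // Ceiling division: workgroups needed to cover the chunk.
    let workgroups = (elems_per_buffer + elems_per_workgroup - 1) / elems_per_workgroup;
    assert_eq!(workgroups, 512);
    println!("dispatch_workgroups({}, 1, 1)", workgroups);
}
```

This stays comfortably under the per-dimension dispatch limit, which is part of why the per-thread `OFFSET` loop exists at all.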
Replies: 1 comment
Edit: I solved this, thank you @jasmine for your help. You need a few things. One, the following features:

```rust
let (device, queue) = adapter
    .request_device(
        &wgpu::DeviceDescriptor {
            label: None,
            required_features: Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING
                | Features::BUFFER_BINDING_ARRAY,
            memory_hints: wgpu::MemoryHints::Performance,
            ..Default::default()
        },
        None,
    )
    .await
    .unwrap();
```

And two, you need to use the `binding_array` keyword. Then in our shader:

```wgsl
// NOTE: binding_array will not work on WebGPU or dawn, it's wgpu(naga) only.
@group(0) @binding(0)
var<storage, read_write> all_buffers: binding_array<array<f32, NUM_BUFFERS>>;
```

One contiguous buffer :) Working source is here; I will make a PR for the examples later this week.
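Indexing into the `binding_array` amounts to splitting a flat element index into a (buffer, offset-within-buffer) pair. A host-side sketch of that addressing, assuming 8 equal 128 MiB chunks (these sizes are the running example's assumptions, not anything wgpu mandates):

```rust
// Elements per 128 MiB chunk of f32s.
const ELEMS_PER_BUFFER: u64 = (128 * 1024 * 1024) / 4;

// Split a flat element index into (which buffer, index within buffer),
// mirroring `all_buffers[i / LEN][i % LEN]` style indexing in the shader.
fn split_index(flat: u64) -> (u64, u64) {
    (flat / ELEMS_PER_BUFFER, flat % ELEMS_PER_BUFFER)
}

fn main() {
    assert_eq!(split_index(0), (0, 0));
    // The last element of 1GB of f32s lands at the end of buffer 7.
    assert_eq!(split_index((1u64 << 28) - 1), (7, ELEMS_PER_BUFFER - 1));
}
```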