[Feature Request]: Implement block storage partitioning for scalable volume distribution #4100

Open
aWN4Y25pa2EK opened this issue Feb 8, 2025 · 3 comments · May be fixed by #4128
Labels: enhancement (New feature or request), external (Issues created by non node team members)

Comments

aWN4Y25pa2EK commented Feb 8, 2025

Implementation ideas

Overview

Currently, all *.ods and *.q4 datasets are stored under a single blocks/ path without filesystem partitioning. This monolithic storage approach creates scalability challenges and potential performance bottlenecks.

Risks and Challenges

  • Storage Capacity Constraints

    • Since all datasets share the same root path, the ever-growing number of blocks becomes a major challenge for storage distribution and for scaling the underlying volumes.
  • Performance Bottlenecks

    • A single directory containing all blocks degrades file lookup performance
    • Potential I/O contention when multiple processes access the same directory
    • Limited ability to optimize for specific storage hardware characteristics

Proposed Enhancement

Implement a partitioning strategy that would:

  • Introduce a two-level hierarchical layout based on block hash prefixes
  • Create 256 primary partitions keyed on the first two hexadecimal characters of the hash
  • Enable flexible volume distribution across storage resources

Partitioning Example

Existing structure (no partitioning, all files are stored on the same root path blocks/):

├── 00E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── 10E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods

Partitioning on the first two hexadecimal characters of the hash (00->FF) would create a structure of 256 indexes:

blocks/
├── 0X/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── 1X/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── XX/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
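
For illustration, here is a minimal Go sketch of how a block file name could be mapped onto the proposed prefix directories. This is not existing celestia-node code; the `partitionedPath` helper and the exact layout are assumptions based on the tree above.

```go
// Hypothetical sketch of the proposed layout: block files are assumed to be
// named by their hex-encoded hash, and the first two hex characters (00-FF)
// select one of 256 partition directories under blocks/.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// partitionedPath returns root/<prefix>/<remainder>, where prefix is the
// first two hex characters of the file name, e.g.
// "00E15...441.ods" -> "blocks/00/E15...441.ods".
func partitionedPath(root, fileName string) (string, error) {
	name := strings.ToUpper(fileName)
	if len(name) < 3 {
		return "", fmt.Errorf("%q is too short to carry a hash prefix", fileName)
	}
	prefix, rest := name[:2], fileName[2:] // normalize the prefix dir, keep the rest as-is
	return filepath.Join(root, prefix, rest), nil
}

func main() {
	p, _ := partitionedPath("blocks",
		"00E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods")
	fmt.Println(p)
}
```

Whether the two-character prefix is also stripped from the stored file name, as in the tree above, or kept in full is a layout detail left open here.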

Example of volume distribution with partitioning enabled:

Volume 0: 0x/ (Blocks starting with 0)
Volume 1: 1x/ (Blocks starting with 1)
Volume 2: 2x/ (Blocks starting with 2)
Volume 3: 3x/ (Blocks starting with 3)
Volume 4: 4x/ (Blocks starting with 4)
Volume 5: 5x/ (Blocks starting with 5)
Volume 6: 6x/ (Blocks starting with 6)
Volume 7: 7x/ (Blocks starting with 7)
Volume 8: 8x/ (Blocks starting with 8)
Volume 9: 9x/ (Blocks starting with 9)
Volume 10: Ax/ (Blocks starting with A)
Volume 11: Bx/ (Blocks starting with B)
Volume 12: Cx/ (Blocks starting with C)
Volume 13: Dx/ (Blocks starting with D)
Volume 14: Ex/ (Blocks starting with E)
Volume 15: Fx/ (Blocks starting with F)
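
As a rough illustration of the mapping above, a hedged Go sketch follows; the `volumeFor` helper and the `/mnt/blocks-volume-N` mount paths are hypothetical, not an existing configuration.

```go
// Hypothetical sketch: the first hex character of the block hash (0-F)
// selects one of up to 16 volume mount points.
package main

import (
	"fmt"
	"strconv"
)

// volumeFor returns the index (0-15) of the volume responsible for a block,
// derived from the leading hex character of its hash.
func volumeFor(hash string) (int, error) {
	if hash == "" {
		return 0, fmt.Errorf("empty hash")
	}
	v, err := strconv.ParseUint(hash[:1], 16, 8)
	if err != nil {
		return 0, fmt.Errorf("invalid hex prefix in %q: %w", hash, err)
	}
	return int(v), nil
}

func main() {
	hash := "A7C3584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441"
	idx, _ := volumeFor(hash)
	// Each index would correspond to a separate mount, e.g. /mnt/blocks-volume-10.
	fmt.Printf("Volume %d: /mnt/blocks-volume-%d\n", idx, idx)
}
```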

How would this fix existing limitations?

When deploying a Data Availability (DA) node on cloud infrastructure, service providers face a critical limitation: cloud platforms typically impose a hard storage limit per volume. Since DA nodes currently store all *.ods datasets in a single root path, this creates an absolute ceiling that cannot be bypassed.

The proposed partitioning strategy provides a robust solution to these storage constraints.

Volume Distribution:

  • Creates 256 distinct indexes (00-FF) based on block hash prefixes
  • Distributes these indexes across up to 16 separate volumes (0-F)
  • Each volume handles blocks with specific prefix ranges

Example of storage capacity benefits, assuming a 10TB limit per block storage volume:

  • Number of volumes: 16 (one per leading hex character)
  • Total theoretical capacity: 160TB per DA node
  • Scalability factor: 16x increase from baseline
@aWN4Y25pa2EK added the enhancement label Feb 8, 2025
@github-actions bot added the external label Feb 8, 2025
@Wondertan (Member) commented:

While it's true that the most popular filesystem, ext4, limits the number of files per directory, that limit can be bypassed via the large_dir option. It's also true that keeping all files in a single directory may lead to contention on ext4. However, ext4 has further limitations that make it a suboptimal choice for node storage.

The point here is that we shouldn't discuss storage improvements without considering filesystems and their properties, e.g. ext4 limits directory size, while XFS and ZFS do not. Similarly, XFS shouldn't have performance issues with many files in a directory; it is recognized for its robust performance in parallel random reads of big files and has demonstrated superior performance in random-read tests.

Another point is that we can't afford to optimize for all the filesystems out there and should pick a standard recommended one for our node runners to optimize for. I believe we should first pick one, analyze/benchmark its properties and scalability, and then optimize/improve the existing storage subsystem for it, until we get to building a new Celestia-optimized one.

@walldiss (Member) commented:

Hey @Wondertan, I think the main concern raised here isn't just about the file system's ability to handle a large number of files in a single directory. The core issue is that when all block data resides on a single volume (regardless of file system), you hit practical limits. Splitting the data across multiple volumes would let us:

  • Bypass per-volume size limits (important on many cloud platforms)
  • Improve total I/O throughput (spreading load across volumes)
  • Enable more flexible hardware resource allocation
  • Simplify volume migration or replacement

Partitioning the dataset into multiple directories—each potentially mapped to different volumes—acts like an internal sharding mechanism. This way, we can scale horizontally across multiple volumes and avoid the single-volume bottlenecks. By distributing the blocks over multiple storage backends, we can achieve higher aggregate storage capacity and IOPS.

While it’s true that file systems like XFS or ZFS may mitigate single-directory performance issues, the proposal here specifically targets the broader limits of a single volume rather than directory-related overhead alone. We could certainly choose and optimize for one recommended file system in the future, but that still wouldn’t fully address constraints like maximum volume size on cloud providers.

So, in short:

  • The limitation is not about the directory file count per se; it's about how a single volume can become a bottleneck.
  • The proposed partitioning ensures we can spread block data across multiple volumes for higher total capacity and performance.

Hope this clarifies where the suggestion is coming from! Let me know what you think.

@Wondertan (Member) commented:

@walldiss, for both the volume and the partitioning issues you can use an overlay filesystem like MergerFS. It creates a unified logical filesystem over multiple physical filesystems/volumes/partitions. The single "monolithic" directory we create can then be spread across multiple disks.

@sysrex linked a pull request Feb 18, 2025 that will close this issue