[Feature Request]: Implement block storage partitioning for scalable volume distribution #4100

Open
aWN4Y25pa2EK opened this issue Feb 8, 2025 · 3 comments · May be fixed by #4128
Labels: enhancement (New feature or request), external (Issues created by non node team members)

Comments

aWN4Y25pa2EK commented Feb 8, 2025

Implementation ideas

Overview

Currently, all *.ods and *.q4 datasets are stored under a single blocks/ path without filesystem partitioning. This monolithic storage approach creates scalability challenges and potential performance bottlenecks.

Risks and Challenges

  • Storage Capacity Constraints

    • Since all datasets share the same root path, the ever-growing number of blocks becomes a major challenge for storage distribution and for scaling the underlying volumes.
  • Performance Bottlenecks

    • A single directory containing all blocks degrades file lookup performance
    • Potential I/O contention when multiple processes access the same directory
    • Limited ability to optimize for specific storage hardware characteristics

Proposed Enhancement

Implement a partitioning strategy that would:

  • Introduce a two-level hierarchical layout based on block hash prefixes
  • Create 256 primary partitions keyed on the first two hexadecimal characters of the hash
  • Enable flexible volume distribution across storage resources

Partitioning Example

Existing structure (no partitioning, all files are stored on the same root path blocks/):

├── 00E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── 10E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods

Partitioning on the first two hexadecimal characters of the hash (00->FF) would create a structure of 256 indexes:

blocks/
├── 0X/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── 1X/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
├── XX/
│   ├── E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
│   ├── E2584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods
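
For illustration, here is a minimal Go sketch of how a block file name could be mapped onto the proposed prefix directories. This is not existing celestia-node code; the `partitionedPath` helper and the exact layout are assumptions based on the tree above.

```go
// Hypothetical sketch of the proposed layout: block files are assumed to be
// named by their hex-encoded hash, and the first two hex characters (00-FF)
// select one of 256 partition directories under blocks/.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// partitionedPath returns root/<prefix>/<remainder>, where prefix is the
// first two hex characters of the file name, e.g.
// "00E15...441.ods" -> "blocks/00/E15...441.ods".
func partitionedPath(root, fileName string) (string, error) {
	name := strings.ToUpper(fileName)
	if len(name) < 3 {
		return "", fmt.Errorf("%q is too short to carry a hash prefix", fileName)
	}
	prefix, rest := name[:2], fileName[2:] // normalize the prefix dir, keep the rest as-is
	return filepath.Join(root, prefix, rest), nil
}

func main() {
	p, _ := partitionedPath("blocks",
		"00E1584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441.ods")
	fmt.Println(p)
}
```

Whether the two-character prefix is also stripped from the stored file name, as in the tree above, or kept in full is a layout detail left open here.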

Example of volume distribution with partitioning enabled:

Volume 0: 0x/ (Blocks starting with 0)
Volume 1: 1x/ (Blocks starting with 1)
Volume 2: 2x/ (Blocks starting with 2)
Volume 3: 3x/ (Blocks starting with 3)
Volume 4: 4x/ (Blocks starting with 4)
Volume 5: 5x/ (Blocks starting with 5)
Volume 6: 6x/ (Blocks starting with 6)
Volume 7: 7x/ (Blocks starting with 7)
Volume 8: 8x/ (Blocks starting with 8)
Volume 9: 9x/ (Blocks starting with 9)
Volume 10: Ax/ (Blocks starting with A)
Volume 11: Bx/ (Blocks starting with B)
Volume 12: Cx/ (Blocks starting with C)
Volume 13: Dx/ (Blocks starting with D)
Volume 14: Ex/ (Blocks starting with E)
Volume 15: Fx/ (Blocks starting with F)
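
As a rough illustration of the mapping above, a hedged Go sketch follows; the `volumeFor` helper and the `/mnt/blocks-volume-N` mount paths are hypothetical, not an existing configuration.

```go
// Hypothetical sketch: the first hex character of the block hash (0-F)
// selects one of up to 16 volume mount points.
package main

import (
	"fmt"
	"strconv"
)

// volumeFor returns the index (0-15) of the volume responsible for a block,
// derived from the leading hex character of its hash.
func volumeFor(hash string) (int, error) {
	if hash == "" {
		return 0, fmt.Errorf("empty hash")
	}
	v, err := strconv.ParseUint(hash[:1], 16, 8)
	if err != nil {
		return 0, fmt.Errorf("invalid hex prefix in %q: %w", hash, err)
	}
	return int(v), nil
}

func main() {
	hash := "A7C3584FF07A13371E6A293EAC970EF42F753C474E0737D93EF1430944227441"
	idx, _ := volumeFor(hash)
	// Each index would correspond to a separate mount, e.g. /mnt/blocks-volume-10.
	fmt.Printf("Volume %d: /mnt/blocks-volume-%d\n", idx, idx)
}
```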

How would this fix existing limitations?

When deploying a Data Availability (DA) node on cloud infrastructure, service providers face a critical limitation: cloud platforms typically impose a hard storage limit per volume. Since DA nodes currently store all *.ods datasets in a single root path, this creates an absolute ceiling that cannot be bypassed.

The proposed partitioning strategy provides a robust solution to these storage constraints.

Volume Distribution:

  • Creates 256 distinct indexes (00-FF) based on block hash prefixes
  • Distributes these indexes across up to 16 separate volumes (0-F)
  • Each volume handles blocks with specific prefix ranges

Example of storage capacity benefits, assuming a 10TB limit per block storage volume:

  • Number of volumes: 16 (one per leading hex character)
  • Total theoretical capacity: 160TB per DA node
  • Scalability factor: 16x increase from baseline
@aWN4Y25pa2EK added the enhancement label Feb 8, 2025
@github-actions bot added the external label Feb 8, 2025
@Wondertan (Member) commented:

While it's true that the most popular filesystem, ext4, limits the number of files per directory, that limit can be bypassed via the large_dir option. It's also true that keeping all files in a single directory may lead to contention on ext4. However, ext4 has further limitations that make it a suboptimal choice for node storage.

The point here is that we shouldn't discuss storage improvements without considering filesystems and their properties, e.g. ext4 limits directory size, while XFS and ZFS do not. Similarly, XFS shouldn't have performance issues with many files in a directory; it is recognized for its robust performance in parallel random reads of big files and has demonstrated superior performance in random-read tests.

Another point is that we can't afford to optimize for all the filesystems out there and should pick a standard recommended one for our node runners to optimize for. I believe we should first pick one, analyze/benchmark its properties and scalability, and then optimize/improve the existing storage subsystem for it, until we get to building a new Celestia-optimized one.

@walldiss (Member) commented:

Hey @Wondertan, I think the main concern raised here isn't just about the file system's ability to handle a large number of files in a single directory. The core issue is that when all block data resides on a single volume (regardless of file system), you hit practical limits. Splitting the data across multiple volumes would let us:

  • Bypass per-volume size limits (important on many cloud platforms)
  • Improve total I/O throughput (spreading load across volumes)
  • Enable more flexible hardware resource allocation
  • Simplify volume migration or replacement

Partitioning the dataset into multiple directories—each potentially mapped to different volumes—acts like an internal sharding mechanism. This way, we can scale horizontally across multiple volumes and avoid the single-volume bottlenecks. By distributing the blocks over multiple storage backends, we can achieve higher aggregate storage capacity and IOPS.

While it’s true that file systems like XFS or ZFS may mitigate single-directory performance issues, the proposal here specifically targets the broader limits of a single volume rather than directory-related overhead alone. We could certainly choose and optimize for one recommended file system in the future, but that still wouldn’t fully address constraints like maximum volume size on cloud providers.

So, in short:

  • The limitation is not about the directory file count per se; it's about how a single volume can become a bottleneck.
  • The proposed partitioning ensures we can spread block data across multiple volumes for higher total capacity and performance.

Hope this clarifies where the suggestion is coming from! Let me know what you think.

@Wondertan (Member) commented:

@walldiss, for both the volume and the partitioning issues you can use an overlay filesystem like MergerFS. It creates a unified logical filesystem over multiple physical filesystems/volumes/partitions. The single "monolithic" directory we create can then be spread across multiple disks.

@sysrex linked a pull request Feb 18, 2025 that will close this issue