Fragmentation Prevention #147

Merged
ylow merged 7 commits into main from ylow/fragmentation_prevention on Jan 30, 2025
Conversation

@ylow ylow commented Jan 23, 2025

For https://linear.app/xet/issue/XET-246/fragmentation-prevention. We use average chunks per range as a fragmentation estimator, targeting an average of 16 chunks per range, which roughly equates to 1MB per range (implying an average chunk size of about 64KB). The average is computed over a sliding window of the last 32 ranges. If it drops below the target, dedupe is disabled until the average rises above the target again.

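As a rough illustration, here is a minimal sketch of such a gate in Rust, with hypothetical names and constants (not the PR's actual types):

```rust
// Sketch of the fragmentation estimator described above
// (hypothetical names, not the actual types from this PR).
use std::collections::VecDeque;

const TARGET_CHUNKS_PER_RANGE: f64 = 16.0; // ~1MB per range at ~64KB chunks
const WINDOW_SIZE: usize = 32;             // averaged over the last 32 ranges

struct FragmentationGate {
    window: VecDeque<usize>, // chunk counts of the most recent ranges
    dedupe_enabled: bool,
}

impl FragmentationGate {
    fn new() -> Self {
        Self {
            window: VecDeque::with_capacity(WINDOW_SIZE),
            dedupe_enabled: true,
        }
    }

    /// Record a newly emitted range and update the dedupe decision.
    fn record_range(&mut self, chunks_in_range: usize) {
        if self.window.len() == WINDOW_SIZE {
            self.window.pop_front();
        }
        self.window.push_back(chunks_in_range);

        let avg = self.window.iter().sum::<usize>() as f64 / self.window.len() as f64;
        // Below target => ranges are too short (fragmented): stop deduping.
        // At or above target => safe to dedupe again.
        self.dedupe_enabled = avg >= TARGET_CHUNKS_PER_RANGE;
    }

    fn should_dedupe(&self) -> bool {
        self.dedupe_enabled
    }
}
```

Each time a range is emitted, `record_range` would be called with its chunk count, and `should_dedupe` consulted before attempting the next match.
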
Running on the first 1GB of a highly fragmented file (a few hundred KB of an existing file, followed by a hundred KB of zeros, repeated) we see the following:

  • Baseline: 1000000001 bytes -> 726845953 bytes, 2975 ranges, 336134 average bytes per range
  • 512KB target (anti-fragmentation goal of 8 chunks per range): 1000000001 bytes -> 873515521 bytes, 1465 ranges, 682594 average bytes per range
  • 1MB target (anti-fragmentation goal of 16 chunks per range): 1000000001 bytes -> 932235777 bytes, 829 ranges, 1206273 average bytes per range

This also includes a hysteresis implementation (a sketch follows the numbers below):

  • 512KB target (anti-fragmentation goal of 8 chunks per range): 1000000001 bytes -> 873515521 bytes, 1657 ranges, 603500 average bytes per range.

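Concretely, the hysteresis can be pictured as a low and a high water mark around the target, with the gate holding its previous state inside the band. A minimal sketch, assuming hypothetical threshold constants (the PR's actual thresholds and names may differ):

```rust
// Hysteresis variant of the gate sketched earlier (illustrative only).
// Dedupe turns off only when the windowed average falls below the low
// water mark, and turns back on only once it climbs above the high water
// mark; inside the band the previous decision is kept, so the gate does
// not flip back and forth on every range near the target.
const LOW_WATER_CHUNKS_PER_RANGE: f64 = 8.0;   // hypothetical, e.g. the 512KB goal
const HIGH_WATER_CHUNKS_PER_RANGE: f64 = 16.0; // hypothetical, e.g. the 1MB goal

fn update_with_hysteresis(dedupe_enabled: &mut bool, avg_chunks_per_range: f64) {
    if *dedupe_enabled && avg_chunks_per_range < LOW_WATER_CHUNKS_PER_RANGE {
        *dedupe_enabled = false; // too fragmented: stop deduping
    } else if !*dedupe_enabled && avg_chunks_per_range > HIGH_WATER_CHUNKS_PER_RANGE {
        *dedupe_enabled = true; // ranges long enough again: resume dedupe
    }
}
```
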
The hysteresis turned out to be pretty important for deduping a content-defined-chunked variant of Parquet.
Without hysteresis (the only concern is how v2 dedupes against v1):

parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 4544391399 bytes (11.14 chunks per range)

With hysteresis

parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 3568275084 bytes (8.11 chunks per range)

So with the hysteresis implementation we are closer to the target chunks per range while still deduping pretty well: v2 shrinks by about 37.7%, versus about 40.6% with no fragmentation prevention at all. For comparison, without any fragmentation prevention:

parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 3402767500 bytes (6.89 chunks per range)

@ylow ylow requested review from hoytak and seanses January 23, 2025 00:07
@ylow ylow commented Jan 24, 2025

Other ideas which may or may not improve things are:

  • have a high-water and low-water mark so if we are near the 1MB boundary we don't keep jumping back and forth between dedupe and no-dedupe.
  • look ahead a bunch of chunks (this is somewhat complicated)

@ylow ylow marked this pull request as ready for review January 24, 2025 18:44
data/src/constants.rs: review thread (outdated, resolved)
@seanses seanses (Collaborator) left a comment

LGTM!

@ylow ylow merged commit 5cf29c1 into main Jan 30, 2025
2 checks passed
@ylow ylow deleted the ylow/fragmentation_prevention branch January 30, 2025 21:40