-
Notifications
You must be signed in to change notification settings - Fork 40
Garbage collection
Purpose of garbage collection includes removing old versions and removing unreferenced pages. The GC is the most "dangerous" component in the system, as its functionality to permanently delete data.
The GC is designed to be correct with no assumptions on external dependencies. Things like the system clock are used as heuristics to prevent transaction interruptions, but consistency and correctness are never sacrificed.
- Pages that have been committed must not be deleted
- Pages that have been deleted must not be committed
-
Fetch read version RAW-RV.
-
Collect inconsistent snapshot of hashes.
- Scan the page index incrementally and collect all hashes.
- Scan the delta referrer index incrementally and collect all hashes.
-
Scan the CAM index incrementally. For each batch (scanned with TXN1), transaction TXN2:
- Filter out those hashes seen in step 2.
- Filter out those hashes added within a time duration.
- Filter out those CAM entries modified after RAW-RV.
- Set TXN2.RV to TXN1.RV.
- Add the CAM index of the remaining pages to the conflict set.
- Delete the remaining pages from the CAM.
- Delete COMMIT-TOKEN.
-
Transaction:
- Check the existence of each page. If any page is missing, abort.
- Insert COMMIT-TOKEN.
-
Transaction:
- Check COMMIT-TOKEN. If not matching step 1, abort.
- Insert each hash to the CAM index.
- Insert each hash to the page index.
- Insert metadata, if any.
Note: In a previous version step 2c and 2d was split into their own transaction (with commit-token check). This is incorrect. Consider this order:
- CT, step 1.
- CT, step 2a-2b.
- GT, step 1 and 2.
- CT, step 2c-3d.
- GT, step 3.
In this case, the newly committed page index entries will be deleted.