[FEATURE] Option to disable auto commit after data ingestion #2521

HyunggyuJang · 2023-08-07T03:34:59Z

Description

Currently, a new version is created on every data ingestion by the following code:

deeplake/deeplake/core/vectorstore/deeplake_vectorstore.py

Line 294 in 2ad84c1

self.dataset.commit(allow_empty=True)

It seems that every commit captures the full current state of the dataset as a new version. So, if the user commits often, the storage consumed by the versions grows rapidly.

This becomes a problem when the user ingests small amounts of data incrementally: the dataset is almost identical between versions, so the space is used inefficiently.

The canonical solution would be to capture only the diff for each version, but as I'm not acquainted with the codebase, I don't know whether that is feasible.
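To put rough numbers on the concern, here is a small back-of-the-envelope sketch. It assumes, as guessed above and not verified against the Deep Lake codebase, that each commit stores a full copy of the dataset, and compares that with storing only the newly added rows. The function names are made up for illustration.

```python
def snapshot_storage(batch_sizes):
    """Total rows stored if every commit keeps a full copy of the dataset.

    Assumption for illustration only; this is not a statement about how
    Deep Lake actually stores versions.
    """
    total = 0
    current = 0
    for n in batch_sizes:
        current += n      # dataset grows by this batch
        total += current  # the commit stores the whole dataset again
    return total


def diff_storage(batch_sizes):
    """Total rows stored if every commit keeps only the newly added rows."""
    return sum(batch_sizes)


# 100 small ingestions of 10 rows each, one commit per ingestion
batches = [10] * 100
full_copy_rows = snapshot_storage(batches)
diff_only_rows = diff_storage(batches)
print(full_copy_rows, diff_only_rows)
```

Under that assumption, full-copy versions grow quadratically with the number of small ingestions, while diff-only versions grow linearly.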

So, instead, I suggest offering an option that lets users choose whether they want "auto-commit" when ingesting data.
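A minimal sketch of what I have in mind, using a toy stand-in class rather than the real vector store. `ToyVectorStore` and the `auto_commit` flag are hypothetical names for illustration; they are not part of Deep Lake's current API.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vector store; NOT Deep Lake's VectorStore."""

    def __init__(self, auto_commit=True):
        # auto_commit=True mirrors the current behavior: commit after every add
        self.auto_commit = auto_commit
        self.rows = []
        self.versions = []  # records the dataset size at each commit

    def commit(self):
        """Record a version of the current state."""
        self.versions.append(len(self.rows))

    def add(self, texts):
        """Ingest data, committing only if auto_commit is enabled."""
        self.rows.extend(texts)
        if self.auto_commit:
            self.commit()


# Current behavior: one version per ingestion
eager = ToyVectorStore()
for i in range(5):
    eager.add([f"doc-{i}"])

# Proposed option: ingest incrementally, commit once when ready
lazy = ToyVectorStore(auto_commit=False)
for i in range(5):
    lazy.add([f"doc-{i}"])
lazy.commit()

print(len(eager.versions), len(lazy.versions))
```

Keeping the default at `True` would preserve the existing behavior for users who rely on it.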

Use Cases

No response

FayazRahman · 2023-08-07T18:37:21Z

Hey @HyunggyuJang, thanks a lot for raising the issue. We're already working on this, and I'll be sure to let you know when the updates are released.

HyunggyuJang added the enhancement (New feature or request) label Aug 7, 2023

tatevikh assigned FayazRahman Aug 7, 2023
