Description
Currently, a new version is created on every data ingestion by the following code:
deeplake/deeplake/core/vectorstore/deeplake_vectorstore.py
Line 294 in 2ad84c1
It seems that every commit captures the full current state of the dataset as a new version. If so, the storage consumed by versions grows rapidly when the user commits often.
This is especially problematic when the user ingests small amounts of data incrementally: consecutive versions are almost identical, so most of the space goes to redundant copies.
The canonical solution would be to store only the diff for each version, but since I'm not familiar with the codebase, I don't know whether that is feasible.
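To make the concern concrete, here is a rough back-of-the-envelope sketch. It assumes, as guessed above, that each commit stores a full copy of the dataset, and compares that with storing only the newly added rows. The functions `snapshot_cost` and `delta_cost` are hypothetical helpers written for this illustration and are not part of deeplake.

```python
def snapshot_cost(batch_sizes):
    """Rows stored in total if every commit copies the full dataset state (assumption)."""
    rows_so_far = 0
    stored = 0
    for n in batch_sizes:
        rows_so_far += n       # dataset grows by one batch
        stored += rows_so_far  # the commit stores everything ingested so far
    return stored


def delta_cost(batch_sizes):
    """Rows stored in total if every commit keeps only the newly added rows."""
    return sum(batch_sizes)


# 100 small ingestions of 10 rows each, with a commit after every one
batches = [10] * 100
print(snapshot_cost(batches))  # 50500
print(delta_cost(batches))     # 1000
```

Under that assumption, full-snapshot storage grows quadratically with the number of commits, while diff-based storage grows linearly.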
So, as an alternative, I suggest adding an option that lets users choose whether to "auto-commit" when ingesting data.
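A minimal sketch of what I have in mind, using a toy in-memory store rather than the real deeplake classes. The class `ToyVersionedStore` and the `auto_commit` parameter name are hypothetical and only illustrate the requested behavior.

```python
class ToyVersionedStore:
    """Toy stand-in for a versioned store; not the deeplake API."""

    def __init__(self):
        self.rows = []
        self.versions = []  # one entry per commit: row count at commit time

    def commit(self):
        # Record a version of the current state
        self.versions.append(len(self.rows))

    def add(self, new_rows, auto_commit=True):
        # Ingest data; create a version only if the caller asks for it
        self.rows.extend(new_rows)
        if auto_commit:
            self.commit()


store = ToyVersionedStore()
for i in range(5):
    store.add([f"doc-{i}"], auto_commit=False)  # many small ingestions, no versions yet
store.commit()  # one explicit commit at the end

print(len(store.versions))  # 1
```

Keeping `auto_commit=True` as the default would preserve the current behavior, while users who ingest in small increments could opt out and commit once at the end.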
Use Cases
No response