feature: import only additional documents for datalake #2073

Open
cforce opened this issue Oct 23, 2024 · 0 comments
cforce commented Oct 23, 2024

For local files, MD5 checks are conducted using pre-generated .md5 files. However, no such check is implemented when working with the data lake, so each file is reprocessed on every run even if it is unchanged.

The data-lake file strategy should skip files whose MD5 hash already matches in both the data lake and the content storage/index. To support this, the data-lake ingestion and the blob content storage (when the --skipblobs option is not used) should store an MD5 value as blob metadata. Before adding or downloading a file, this metadata can be verified by retrieving the blob's MD5 metadata and performing an indexed search. Ideally, all chunks of a file should carry the same MD5 in the index, so that a single match suffices to confirm the file is already known. Alternatively, the MD5 can be stored as metadata on the "copied" blob.
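The skip check described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual API: `file_md5` and `should_import` are hypothetical helpers, and the `blob_metadata` dict stands in for the metadata a storage client (e.g. `BlobClient.get_blob_properties().metadata` in the Azure SDK) would return; the `"md5"` metadata key is an assumption.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Compute the MD5 hex digest of a file's contents, reading in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def should_import(path: Path, blob_metadata: dict) -> bool:
    """Return True if the file should be (re)imported.

    blob_metadata simulates the metadata stored alongside the blob;
    if its "md5" entry matches the file's current hash, the file is
    already known and can be skipped.
    """
    return blob_metadata.get("md5") != file_md5(path)
```

The same comparison could be run against the index instead of blob metadata: since all chunks of a file would share one MD5, a single indexed lookup on that value answers the "already imported?" question.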

Additionally, an optional prepdocs parameter that processes only files with a source modification (touch) date newer than a given threshold would be a useful feature for testing. This would allow queries against the data lake to be driven by the timestamp persisted from the last import or job run.
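Such a threshold filter could look roughly like this for a local directory tree (a sketch only; the function name and the idea of passing the last run's persisted timestamp are illustrative, and a data-lake implementation would read the equivalent `last_modified` property from the storage client instead of `stat()`):

```python
from pathlib import Path
from typing import Iterator


def files_newer_than(root: Path, threshold: float) -> Iterator[Path]:
    """Yield files under root whose modification time is newer than threshold.

    threshold is a POSIX timestamp, e.g. the time persisted by the
    previous prepdocs run; anything modified before it is skipped.
    """
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_mtime > threshold:
            yield path
```

Combined with the MD5 check, this would let a run touch only files that are both recently modified and not yet present in the index.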
