feature: import only additional documents for datalake #2073

Open
cforce opened this issue Oct 23, 2024 · 0 comments
cforce commented Oct 23, 2024

For local files, MD5 checks are conducted using pre-generated .md5 files. However, no such check is implemented when working with the data lake, so each file is reprocessed on every run even if it is unchanged.

The data-lake file strategy should skip files whose MD5 hash already matches in both the data lake and the content storage/index. To support this, the data-lake ingestion and the blob content storage (when the --skipblobs option is not used) should store an MD5 value as blob metadata. Before adding or downloading a file, this metadata can be verified by retrieving the blob's MD5 metadata and performing an indexed search. Ideally, all chunks of a file should carry the same MD5 in the index, so that a single match suffices to confirm the file is already known. Alternatively, the MD5 can be stored as metadata on the "copied" blob.
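The skip check described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual API: `file_md5` and `should_import` are hypothetical helpers, and the `blob_metadata` dict stands in for the metadata a storage client (e.g. `BlobClient.get_blob_properties().metadata` in the Azure SDK) would return; the `"md5"` metadata key is an assumption.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Compute the MD5 hex digest of a file's contents, reading in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def should_import(path: Path, blob_metadata: dict) -> bool:
    """Return True if the file should be (re)imported.

    blob_metadata simulates the metadata stored alongside the blob;
    if its "md5" entry matches the file's current hash, the file is
    already known and can be skipped.
    """
    return blob_metadata.get("md5") != file_md5(path)
```

The same comparison could be run against the index instead of blob metadata: since all chunks of a file would share one MD5, a single indexed lookup on that value answers the "already imported?" question.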

Additionally, an optional prepdocs parameter that processes only files with a source modification (touch) date newer than a given threshold would be a useful feature for testing. This would allow queries against the data lake to be driven by the timestamp persisted from the last import or job run.
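Such a threshold filter could look roughly like this for a local directory tree (a sketch only; the function name and the idea of passing the last run's persisted timestamp are illustrative, and a data-lake implementation would read the equivalent `last_modified` property from the storage client instead of `stat()`):

```python
from pathlib import Path
from typing import Iterator


def files_newer_than(root: Path, threshold: float) -> Iterator[Path]:
    """Yield files under root whose modification time is newer than threshold.

    threshold is a POSIX timestamp, e.g. the time persisted by the
    previous prepdocs run; anything modified before it is skipped.
    """
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_mtime > threshold:
            yield path
```

Combined with the MD5 check, this would let a run touch only files that are both recently modified and not yet present in the index.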
