For local files, MD5 checks are performed using pre-generated .md5 files. For the data lake, however, no such check is implemented, so every file is reprocessed even if it has not changed.
The file strategy for the data lake should skip any file whose MD5 hash is already present in the content storage or the index. To support this, the data lake and blob content storage (unless the --skipblobs option is used) should store an MD5 value as blob metadata. Before adding or downloading a file, that metadata MD5 can be retrieved and checked with an indexed search. Ideally, all chunks of a file share the same MD5 in the index, so a single match is enough to confirm that the file is already known. Alternatively, the MD5 can be stored in the blob's metadata alongside the "copied" blob.
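To make the proposed check concrete, here is a minimal sketch in plain Python. It is not the prepdocs implementation: `RemoteFile`, `md5_hex`, `files_to_process`, the `"md5"` metadata key, and the `known_md5s` set (standing in for the result of an index lookup) are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RemoteFile:
    """Hypothetical stand-in for a data lake file and its blob metadata."""
    path: str
    content: bytes
    metadata: dict = field(default_factory=dict)


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def files_to_process(files, known_md5s):
    """Return only the files whose MD5 is not already known.

    known_md5s is assumed to hold the MD5 values already stored in the
    index or content storage (for example, collected by a search query).
    """
    pending = []
    for f in files:
        # Prefer the MD5 stored as metadata; hash the content only as a fallback.
        digest = f.metadata.get("md5") or md5_hex(f.content)
        if digest in known_md5s:
            continue  # unchanged file, skip reprocessing
        pending.append(f)
    return pending
```

In a real run the metadata lookup would happen before the file is downloaded, which is where the saving comes from; the fallback hash in this sketch only covers files that have no stored MD5 yet.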
Additionally, an optional prepdocs parameter that processes only files whose source update (touch) date is newer than a specified threshold would be useful for testing. It would also make it possible to query the data lake based on data persisted from the last import or job run.
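The date threshold could be sketched as a simple filter. This assumes each file is available as a `(path, last_modified)` pair with a timezone-aware timestamp; the function name `modified_since` is hypothetical, and no such prepdocs parameter exists yet.

```python
from datetime import datetime, timezone


def modified_since(files, threshold):
    """Return paths of files modified strictly after the threshold.

    files is an iterable of (path, last_modified) pairs; both the
    timestamps and the threshold are assumed to be timezone-aware.
    """
    return [path for path, last_modified in files if last_modified > threshold]


# Example: only files touched after the last job run are selected.
last_run = datetime(2024, 1, 15, tzinfo=timezone.utc)
candidates = [
    ("old.pdf", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    ("new.pdf", datetime(2024, 1, 20, tzinfo=timezone.utc)),
]
selected = modified_since(candidates, last_run)
```

Persisting the timestamp of the last successful run and passing it as the threshold would give the "since last import" behavior described above.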