Update Zstd decompression for "unknown decompressed size" when streaming API was used for compression #116

Open
3 tasks
mkitti opened this issue May 18, 2024 · 0 comments
Assignees
Labels
Filter - ZSTD Priority - 1. High 🔼 These are important issues that should be resolved in the next release Type - Improvement

mkitti commented May 18, 2024

Introduction

The Zstandard plugin for HDF5 should be modified to allow for an unknown decompressed size in the frame header.

Currently, the Zstd decompression scheme, following the original implementation, uses ZSTD_getDecompressedSize to obtain the size of the decompressed buffer. The returned value is not validated and is passed directly to malloc.

size_t decompSize = ZSTD_getDecompressedSize(*buf, origSize);
if (NULL == (outbuf = malloc(decompSize)))

ZSTD_getDecompressedSize returns 0 if the decompressed content is empty, if its size is unknown, or if an error has occurred. When malloc is asked to allocate 0 bytes, the result is implementation-defined and it may return NULL, which the filter then reports as an error condition. This is an incorrect result if the decompressed content is actually empty or its size is unknown and there is no actual error. ZSTD_getDecompressedSize is also obsolete.

ZSTD_getFrameContentSize should replace ZSTD_getDecompressedSize. ZSTD_getFrameContentSize distinguishes between the empty, unknown, and error states: a return value of 0 means the frame is empty, while the unknown and error states are indicated by ZSTD_CONTENTSIZE_UNKNOWN and ZSTD_CONTENTSIZE_ERROR, respectively.

The unknown-size state is common. It occurs when compression is done with the streaming API, via ZSTD_compressStream or ZSTD_compressStream2. ZSTD_compressStream2 in particular only stores the frame content size when either ZSTD_e_end is provided on the initial call or ZSTD_CCtx_setPledgedSrcSize is used.

Tasks

  • Use ZSTD_getFrameContentSize instead of the obsolete ZSTD_getDecompressedSize to correctly distinguish between the empty, unknown, and error states when determining the decompressed size.
  • Recognize the unknown-size state by checking the return value of ZSTD_getFrameContentSize against ZSTD_CONTENTSIZE_UNKNOWN.
  • Handle the unknown-size state by growing the output buffer as needed or by using the streaming decompression API via ZSTD_decompressStream.
    • The HDF5 library should know the expected number of bytes for a chunk. We should not be relying on the filter to figure this out.

References

[1] https://facebook.github.io/zstd/zstd_manual.html

@derobins derobins added this to the HDF5 1.14.5 Release milestone Aug 15, 2024
@derobins derobins added Type - Improvement Priority - 1. High 🔼 These are important issues that should be resolved in the next release labels Aug 15, 2024