ORC-1361: InvalidProtocolBufferException when reading large stripe statistics

### What changes were proposed in this pull request?

Catch `InvalidProtocolBufferException` when parsing stripe statistics, log a warning message, and return an empty list.

### Why are the changes needed?

In some cases ORC files may be created with a very large Metadata section due to stripe statistics. The bigger the ORC file gets, the easier it is to end up with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files, and `TestOrcWithLargeStripeStatistics` demonstrates some extremes.

Any attempt to read back the stripe statistics from such a file will fail with `InvalidProtocolBufferException`. The exact exception may differ slightly depending on: a) the size of the stripe statistics; b) the protobuf size limit.

Stripe statistics less than 2GB, and protobuf limit less than stats size:
```
com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
```

Stripe statistics greater than 2GB, and protobuf limit 2GB (`Integer.MAX_VALUE`):
```
com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
```

The `Protocol message was too large` problem could be alleviated by increasing the limit, as was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB), and bumping it further would cause more memory to be used, potentially triggering `OutOfMemoryError`, the OOM Killer, GC pauses, and other problems that are usually harder to debug and trace to a root cause.

When the stripe statistics exceed 2GB, nothing can be done to parse them back, since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes, and that can only happen in newer versions.

On the other hand, stripe statistics are important but not vital for readers. In situations where the limits are breached, it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller.

### How was this patch tested?

Existing unit tests plus new tests added in `TestOrcWithLargeStripeStatistics`.

This closes apache#1402
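
For illustration only, the catch, warn, and return-empty behaviour described above can be sketched roughly as follows. This is not the code from the patch. To stay free of the protobuf dependency, the sketch uses a hypothetical `MetadataParseException` in place of `InvalidProtocolBufferException` (both extend `IOException`), a hypothetical toy parser that treats the metadata as a sequence of 8-byte values, and an assumed `sizeLimit` parameter standing in for the protobuf size limit.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

final class StripeStatsSketch {
    private static final Logger LOG =
        Logger.getLogger(StripeStatsSketch.class.getName());

    // Hypothetical stand-in for protobuf's InvalidProtocolBufferException.
    static class MetadataParseException extends IOException {
        MetadataParseException(String message) {
            super(message);
        }
    }

    // Hypothetical toy parser: the "metadata" is a run of 8-byte long values.
    // It fails when the payload exceeds the limit or ends mid-value, mimicking
    // the two failure modes quoted in the commit message.
    static List<Long> parseStripeStatistics(byte[] metadata, int sizeLimit)
            throws MetadataParseException {
        if (metadata.length > sizeLimit) {
            throw new MetadataParseException("Protocol message was too large.");
        }
        if (metadata.length % Long.BYTES != 0) {
            throw new MetadataParseException("Input ended unexpectedly.");
        }
        ByteBuffer buffer = ByteBuffer.wrap(metadata);
        List<Long> stats = new ArrayList<>();
        while (buffer.hasRemaining()) {
            stats.add(buffer.getLong());
        }
        return stats;
    }

    // The pattern itself: a parse failure is logged as a warning and the
    // caller receives an empty list instead of a fatal error.
    static List<Long> readStripeStatistics(byte[] metadata, int sizeLimit) {
        try {
            return parseStripeStatistics(metadata, sizeLimit);
        } catch (MetadataParseException e) {
            LOG.warning("Failed to parse stripe statistics, returning none: "
                + e.getMessage());
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        byte[] metadata = ByteBuffer.allocate(16).putLong(1L).putLong(2L).array();
        // Limit large enough: statistics are returned.
        System.out.println(readStripeStatistics(metadata, 1024).size());
        // Limit too small: a warning is logged and the list is empty.
        System.out.println(readStripeStatistics(metadata, 8).size());
    }
}
```

With this shape, callers have to treat an empty list as "statistics unavailable" rather than "file has no stripes", which is the trade-off the commit message accepts.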