-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support read large statistics exceed 2GB #1404
Comments
In fact, it is not reasonable to have such large statistics for a single ORC file, it requires too much memory. My suggestion is to limit writing such huge statistics in writer side. What do you think? @zabetak |
At first glance 2GB of metadata seems big especially when considering the toy example that I made in ORC-1361. However, if you have a 500GB ORC file then 2GB of metadata does not appear too big anymore so things are relative. Are there limitations on the maximum size of an ORC file? Do we support such kind of use-cases? If we add a limit while writing (which by the way I also suggested in protocolbuffers/protobuf#11729) then we should define what happens when the limit is exceeded:
|
@zabetak Could you provide a scenario for a 500GB ORC file? In my experience, columnar storage generally serves big data engines, and each of these big data files is generally about 512M/256M. |
Yeah, I've seen ORC files upto 2gb or so, but @deshanxiao 's point is accurate that you'll probably get better performance out of Spark & Trino if you keep them smaller than 1-2gb. That said, can you share how many rows & columns are in the the 500gb file? How big did you configure the stripe buffers and how large are your actual stripes? You should bump up your stripe buffer to get your stripe size to be 256mb or so. That will still give you 2000 stripes, which unless you have a crazy number of columns should be under 2gb. My guess is that you have a huge number of very small stripes. That will hurt performance, especially if you are using the C++ reader. |
I am rather positive on the idea of enforcing limits on the writer but I would expect this to be on the protobuf layer (protocolbuffers/protobuf#11729). The limitation comes from protobuf so it seems natural to add checks there and not in ORC code. The reason that I brought up the question about maximum size is because as the file increases so does the metadata and clearly here we have a hard limitation on till where it can go. If there is a compelling use-case to support arbitrary big files (with arbitrary big metadata) then investing on a new design would make sense. To be clear, I am not pushing for a new design since I fully agree with both @deshanxiao and @omalley in that with proper schema design + configuration the chances of hitting the problem are rather small. From the ORC perspective, it seems acceptable to settle with the 1/2GB limit on the metadata section. I didn't mean to imply that having 500GB is a good/normal thing but if nothing prevents user to do it, eventually they will get there. :) Speaking about actual use-cases, in Hive, I have recently seen Summing up, I don't think a new metadata design is worth it at the moment and limiting the writer seems more appropriate to be done in the protobuf layer. From my perspective, the only actionable item regarding this issue at this point would be to add a brief mention about the metadata size limitation on the website (e.g., https://orc.apache.org/specification/ORCv1/). |
@zabetak Could you explain in detail why we can write but not read in protobuf layer? |
@deshanxiao In protocolbuffers/protobuf#11729 I shared a very simple project reproducing/explaining the write/read problem. Please have a look and if you have questions I will be happy to elaborate more. |
In Java and C++ reader, we cannot read the orc file with statistics exceed 2GB. We should find a new way or design to support read these files.
The text was updated successfully, but these errors were encountered: