Improve reader options #81
I've had a bit of a poke around in the course of the streams/object store experiments with this, and it works pretty well so far. At the moment I've got a JSON object producing a schema repr (some kind of schema summary felt necessary vs. just guessing column names), and a Vec<String> -> ProjectionMask conversion plugged in. So, design questions/opinions:
```js
const instance = await new AsyncParquetFile("<targetFile>");
const stream = await instance.select([
  'STATE', 'foo_struct.STATE_FIPS'
]).stream(4);
for await (const chunk of stream) {
  console.log(chunk);
  /* rows (after passing into parseRecordBatch) look like
   * {"STATE": "ME", "foo_struct": {"STATE_FIPS": "23"}}
   */
}
```

Tangent: exposing the row-filter machinery would be fascinating to see in the context of some of the less expensive boolean spatial predicates from geoarrow. Provided they can be squeezed into the constraints imposed by ArrowPredicate/Fn (which it looks like they can), that would get you full-blown spatial predicate pushdown for... <<10MB of wasm (more or less instantly paid off by dint of all the saved bandwidth).
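For what it's worth, the Vec<String> -> ProjectionMask step mentioned at the top could be sketched in plain Rust roughly like this. Everything here is illustrative: `select_leaves` and the flattened `leaves` list are stand-ins for what the parquet crate's schema descriptor exposes, and the resulting indices are what you'd hand to a ProjectionMask.

```rust
// Map dotted selectors like "foo_struct.STATE_FIPS" onto leaf-column
// indices, erroring on unknown names rather than silently dropping them.
fn select_leaves(leaves: &[&str], selectors: &[&str]) -> Result<Vec<usize>, String> {
    let mut out = Vec::new();
    for sel in selectors {
        let prefix = format!("{sel}.");
        let mut matched = false;
        for (i, leaf) in leaves.iter().enumerate() {
            // A selector matches a leaf exactly, or as a dotted prefix
            // (so "foo_struct" selects every leaf under that struct).
            if leaf == sel || leaf.starts_with(prefix.as_str()) {
                out.push(i);
                matched = true;
            }
        }
        if !matched {
            // Error on a typo instead of returning no data for that column.
            return Err(format!("unknown column selector: {sel}"));
        }
    }
    out.sort_unstable();
    out.dedup();
    Ok(out)
}

fn main() {
    let leaves = ["STATE", "foo_struct.STATE_FIPS", "foo_struct.NAME", "POP"];
    // Exact leaf matches:
    assert_eq!(
        select_leaves(&leaves, &["STATE", "foo_struct.STATE_FIPS"]).unwrap(),
        vec![0, 1]
    );
    // Selecting a struct pulls in all of its leaves:
    assert_eq!(select_leaves(&leaves, &["foo_struct"]).unwrap(), vec![1, 2]);
    // A typo'd name errors rather than silently selecting nothing:
    assert!(select_leaves(&leaves, &["STAET"]).is_err());
}
```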
I'd prefer to throw. The usual workflow I'd expect would be fetching the Parquet metadata first, letting the user pick which columns, and then fetching the rest of the data. It's too easy to be off by one character and miss a ton of data.
Do you think we could reuse arrow JS's repr? I.e. not make our own and only direct users to inspect the repr from an arrow JS schema object? You should be able to get a schema across FFI by treating it as a struct of fields.
That all seems reasonable. It's risky to pull columns up to the top level when the leaf names could collide with something else at the root. I tend to like dot separators, although I was reminded recently that there's no restriction preventing a dot in a column name, right? Would it be better to have something like
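One option along those lines: selectors as lists of path segments instead of dotted strings, so a literal dot inside a column name stays unambiguous. A minimal sketch (names illustrative):

```rust
// A selector is a list of path segments, compared segment-by-segment
// against a leaf's path. No string splitting, so "foo.bar" as a single
// segment names a top-level column literally called "foo.bar", while
// ["foo", "bar"] names field "bar" nested inside "foo".
fn path_eq(selector: &[&str], leaf_path: &[&str]) -> bool {
    selector == leaf_path
}

fn main() {
    // A column whose name contains a dot is distinguishable from nesting:
    assert!(path_eq(&["foo.bar"], &["foo.bar"]));
    assert!(!path_eq(&["foo.bar"], &["foo", "bar"]));
    assert!(path_eq(&["foo", "bar"], &["foo", "bar"]));
}
```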
IIRC the row filter only happens after the data is materialized in memory, right? In that case I'd tend to think that parquet-wasm should handle row-group filtering but leave individual row filtering to other tools, or another function (even in the same memory space). I wrote down some ideas on composition of wasm libraries here.
Nope, the row filter occurs just before that - the flow is more or less:
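Roughly (as I understand arrow-rs): decode only the predicate's columns, evaluate the predicate to produce a row selection, then decode the remaining projected columns for selected rows only, so filtered-out rows are never fully materialized. A plain-Rust simulation of that flow, with illustrative names rather than the real arrow-rs types (which work on Arrow arrays via ArrowPredicate and a RowSelection):

```rust
// Simulates the row-filter flow over plain slices:
// step 1 (implicit here): only the predicate's column has been "decoded";
// step 2: evaluate the predicate against that column to get a selection;
// step 3: "decode" the remaining projected column for selected rows only.
fn filtered_read(
    state: &[&str],     // stand-in for the decoded predicate column
    population: &[u64], // stand-in for an expensive projected column
    predicate: impl Fn(&str) -> bool,
) -> Vec<(String, u64)> {
    // Step 2: build the row selection from the predicate column alone.
    let mut selection = Vec::new();
    for (i, s) in state.iter().enumerate() {
        if predicate(s) {
            selection.push(i);
        }
    }
    // Step 3: materialize the other column only for selected rows.
    selection
        .into_iter()
        .map(|i| (state[i].to_string(), population[i]))
        .collect()
}

fn main() {
    let state = ["ME", "TX", "ME"];
    let population = [1u64, 30, 2];
    let rows = filtered_read(&state, &population, |s| s == "ME");
    assert_eq!(rows.len(), 2);
    assert!(rows.iter().all(|(s, _)| s.as_str() == "ME"));
}
```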
A simple hardcoded filter I ran (…). I agree that specifying the construction of the row filters should be external, but it would have to be provided in some form to the record batch stream builder. The host (by that I mean the code passing in the row filter) would have to be in the same memory space too, likely written in Rust. Other than geoarrow-wasm[full], geoparquet-wasm makes a lot of sense as a consumer of this extension point; there are several quite high-value, low-cost (in terms of bundle size) hardcoded IO-integrated filters that anyone using the module would want (or not care about paying for):
Outside of that, if someone wants more elaborate expressions in the IO phase, they can build their own wrapper module easily enough.
I need to read through that a couple more times to get it, but as a quick note:
Yes, I definitely think geoparquet-wasm should have native support for bounding box filtering (ref opengeospatial/geoparquet#191); I think I just wasn't sure whether this filtering should happen at the row group or row level.
Just a note, from what I've learned in geoarrow-rs: the RowFilter itself doesn't filter out row groups, it only skips decoding of pages. So you need to both filter specific row groups and then also manage the row filter.
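That two-level scheme could be sketched like this, assuming per-row-group min/max column statistics of the kind Parquet metadata exposes (the types and names here are illustrative, not parquet-wasm's or arrow-rs's):

```rust
// Illustrative stand-in for a row group plus its column statistics.
struct RowGroup {
    min_x: f64,
    max_x: f64,
    xs: Vec<f64>, // the decoded values; stand-in for a real column chunk
}

fn read_in_range(groups: &[RowGroup], lo: f64, hi: f64) -> Vec<f64> {
    let mut out = Vec::new();
    for g in groups {
        // Level 1: prune whole row groups from metadata alone -- no
        // page decoding happens for a pruned group.
        if g.max_x < lo || g.min_x > hi {
            continue;
        }
        // Level 2: a surviving row group can still contain rows outside
        // the range, so the row-level filter must run as well.
        for &x in &g.xs {
            if x >= lo && x <= hi {
                out.push(x);
            }
        }
    }
    out
}

fn main() {
    let groups = [
        RowGroup { min_x: 0.0, max_x: 10.0, xs: vec![1.0, 9.0] },
        RowGroup { min_x: 20.0, max_x: 30.0, xs: vec![25.0] },
    ];
    // Group 2 is pruned by metadata; the row filter then drops 1.0.
    assert_eq!(read_in_range(&groups, 5.0, 15.0), vec![9.0]);
    // Nothing intersects: every group is pruned without touching rows.
    assert!(read_in_range(&groups, 11.0, 19.0).is_empty());
}
```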
I think we could reimplement something like the pyarrow filters, e.g. you can pass a filter to the dataset object. That would allow serializing a minimal function definition from JS to Wasm, which could then be added on the Rust side to a RowFilter. But still, I think this would be entirely separate from geoparquet-wasm, which would automatically construct all of this from a single bounding box.
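A minimal sketch of what such a serializable filter language might look like on the Rust side. This is entirely hypothetical (pyarrow's real filters support more operators and typed literals); the point is just that a small expression tree is easy to ship across the JS/Wasm boundary and evaluate row-wise:

```rust
use std::collections::HashMap;

// Hypothetical serializable filter expression, in the spirit of
// pyarrow's (column, op, literal) filters.
enum Filter {
    Eq(String, String),
    And(Box<Filter>, Box<Filter>),
    Or(Box<Filter>, Box<Filter>),
}

impl Filter {
    // Evaluate against one row, modeled here as a column-name -> value map.
    fn eval(&self, row: &HashMap<&str, &str>) -> bool {
        match self {
            Filter::Eq(col, lit) => row.get(col.as_str()) == Some(&lit.as_str()),
            Filter::And(a, b) => a.eval(row) && b.eval(row),
            Filter::Or(a, b) => a.eval(row) || b.eval(row),
        }
    }
}

fn main() {
    // (STATE == "ME") AND (COUNTY == "Knox"), as it might arrive from JS.
    let f = Filter::And(
        Box::new(Filter::Eq("STATE".into(), "ME".into())),
        Box::new(Filter::Eq("COUNTY".into(), "Knox".into())),
    );
    let row: HashMap<&str, &str> =
        [("STATE", "ME"), ("COUNTY", "Knox")].into_iter().collect();
    assert!(f.eval(&row));

    let other: HashMap<&str, &str> =
        [("STATE", "TX"), ("COUNTY", "Knox")].into_iter().collect();
    assert!(!f.eval(&other));

    // (STATE == "ME") OR (STATE == "TX") matches both rows.
    let g = Filter::Or(
        Box::new(Filter::Eq("STATE".into(), "ME".into())),
        Box::new(Filter::Eq("STATE".into(), "TX".into())),
    );
    assert!(g.eval(&row) && g.eval(&other));
}
```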
We do have a …