Don't copy arrow object when calling batchApply() on a dplyr query #53

Open
schuemie opened this issue Mar 29, 2023 · 0 comments
A common pattern in HADES is to run a dplyr query on an Andromeda table and then call batchApply() on the resulting query object (e.g. by Cyclops). However, in this scenario the current Andromeda implementation always first copies the result of the query into a new Andromeda object before batching. My guess is that this is because arrow::ScannerBuilder$create() does not accept an arrow_dplyr_query object.

But I did find that the arrow::as_record_batch_reader() function works fine with arrow_dplyr_query objects:

library(Andromeda)

a <- andromeda(cars = cars)
dplyrQuery <- dplyr::filter(a$cars, speed > 10)
reader <- arrow::as_record_batch_reader(dplyrQuery)
head(as.data.frame(reader$read_next_batch()), 5)
# speed dist
# 1    11   17
# 2    11   28
# 3    12   14
# 4    12   20
# 5    12   24

The only downside is that you can't set the batch size, although it definitely does batching.

I would propose using this to avoid having to copy the query result into an arrow object, which may consume a lot of resources, and on Windows has the added issue that we can't delete the temp Andromeda object.
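As a rough sketch of what this could look like inside batchApply(), the reader can be drained batch by batch without ever materializing the query result. This is only an illustration, assuming the arrow package is available; the function name batchApplyOnQuery and its arguments are hypothetical, not part of the Andromeda API:

```r
# Hypothetical sketch of a batchApply() variant that streams batches
# directly from a dplyr query via arrow::as_record_batch_reader(),
# instead of first copying the result into a new Andromeda object.
batchApplyOnQuery <- function(query, fun, ...) {
  # as_record_batch_reader() accepts arrow_dplyr_query objects directly,
  # so no intermediate Andromeda object is created.
  reader <- arrow::as_record_batch_reader(query)
  repeat {
    batch <- reader$read_next_batch()
    if (is.null(batch)) break  # reader is exhausted
    # Apply the user-supplied function to each batch as a data frame
    fun(as.data.frame(batch), ...)
  }
  invisible(NULL)
}
```

For example, batchApplyOnQuery(dplyrQuery, print) would print each batch of the filtered cars table as it is read. The batch size is still whatever the reader chooses, since as_record_batch_reader() does not expose a way to set it.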
