Don't copy arrow object when calling batchApply() on a dplyr query #53

Open
schuemie opened this issue Mar 29, 2023 · 0 comments
A common pattern in HADES is to run a dplyr query on an Andromeda table and then call batchApply() on the resulting query object (e.g. by Cyclops). However, in this scenario the current Andromeda implementation always first copies the result of the query into a new Andromeda object before batching. My guess is that this is because arrow::ScannerBuilder$create() does not accept an arrow_dplyr_query object.

But I did find that the arrow::as_record_batch_reader() function works fine with arrow_dplyr_query objects:

library(Andromeda)

a <- andromeda(cars = cars)
dplyrQuery <- dplyr::filter(a$cars, speed > 10)
reader <- arrow::as_record_batch_reader(dplyrQuery)
head(as.data.frame(reader$read_next_batch()), 5)
# speed dist
# 1    11   17
# 2    11   28
# 3    12   14
# 4    12   20
# 5    12   24

The only downside is that you can't set the batch size, although it definitely does batching.

I would propose using this to avoid having to copy the query result into an arrow object, which may consume a lot of resources, and on Windows has the added issue that we can't delete the temp Andromeda object.
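As a rough sketch of what this could look like inside batchApply(), the reader can be drained batch by batch without ever materializing the query result. This is only an illustration, assuming the arrow package is available; the function name batchApplyOnQuery and its arguments are hypothetical, not part of the Andromeda API:

```r
# Hypothetical sketch of a batchApply() variant that streams batches
# directly from a dplyr query via arrow::as_record_batch_reader(),
# instead of first copying the result into a new Andromeda object.
batchApplyOnQuery <- function(query, fun, ...) {
  # as_record_batch_reader() accepts arrow_dplyr_query objects directly,
  # so no intermediate Andromeda object is created.
  reader <- arrow::as_record_batch_reader(query)
  repeat {
    batch <- reader$read_next_batch()
    if (is.null(batch)) break  # reader is exhausted
    # Apply the user-supplied function to each batch as a data frame
    fun(as.data.frame(batch), ...)
  }
  invisible(NULL)
}
```

For example, batchApplyOnQuery(dplyrQuery, print) would print each batch of the filtered cars table as it is read. The batch size is still whatever the reader chooses, since as_record_batch_reader() does not expose a way to set it.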
