-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Date is changed to integer on selecting using dplyr #28
Comments
Still trying to figure out why this is happening. library(Andromeda)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# this works
anAndromedaObject <- andromeda()
x <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate
#> # Source: table<dataFrameWithDate> [?? x 2]
#> # Database: sqlite 3.35.5
#> # [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#> todayDate anotherDate
#> <date> <date>
#> 1 2021-10-05 2021-10-05
# this does not
anAndromedaObject$dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>%
dplyr::select(anotherDate, todayDate)
anAndromedaObject$dataFrameWithDate2
#> # Source: table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> # [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#> anotherDate todayDate
#> <int> <int>
#> 1 18905 18905
# but this does
dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>%
dplyr::select(anotherDate, todayDate) %>%
collect()
dataFrameWithDate2
#> # A tibble: 1 x 2
#> anotherDate todayDate
#> <date> <date>
#> 1 2021-10-05 2021-10-05
anAndromedaObject$dataFrameWithDate2 <- dataFrameWithDate2
anAndromedaObject$dataFrameWithDate2
#> # Source: table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> # [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#> anotherDate todayDate
#> <date> <date>
#> 1 2021-10-05 2021-10-05 Created on 2021-10-05 by the reprex package (v2.0.1) |
dateTable <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:", extended_types = TRUE)
RSQLite::dbWriteTable(con, "table1", dateTable)
RSQLite::dbGetQuery(con, "select * from table1")
#> todayDate anotherDate
#> 1 2021-10-05 2021-10-05
RSQLite::dbExecute(con, "create table table2 as select * from table1")
#> [1] 0
RSQLite::dbGetQuery(con, "select * from table2")
#> todayDate anotherDate
#> 1 18905 18905
RSQLite::dbDisconnect(con) Created on 2021-10-05 by the reprex package (v2.0.1) Seem like an RSQLite issue. Made a note here r-dbi/RSQLite#379. I'll try to come up with a temporary fix but I'm really not sure how to do this without bringing the data into R. I'm assuming "create a modified copy" must all happen in SQLite. |
I'm pretty tempted to swap out SQLite for duckdb which has real date support. dateTable <- dplyr::tibble(todayDate = wakefield::date_stamp(10, T), anotherDate = wakefield::date_stamp(10, T))
con <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbWriteTable(con, "table1", dateTable)
DBI::dbGetQuery(con, "select * from table1")
#> todayDate anotherDate
#> 1 2021-03-05 2020-12-05
#> 2 2021-01-05 2021-02-05
#> 3 2021-10-05 2021-08-05
#> 4 2021-03-05 2021-09-05
#> 5 2020-12-05 2021-10-05
#> 6 2021-07-05 2021-01-05
#> 7 2021-05-05 2021-09-05
#> 8 2020-12-05 2021-10-05
#> 9 2020-11-05 2021-07-05
#> 10 2021-07-05 2021-10-05
DBI::dbExecute(con, "create table table2 as select * from table1")
#> [1] 10
DBI::dbGetQuery(con, "select * from table2")
#> todayDate anotherDate
#> 1 2021-03-05 2020-12-05
#> 2 2021-01-05 2021-02-05
#> 3 2021-10-05 2021-08-05
#> 4 2021-03-05 2021-09-05
#> 5 2020-12-05 2021-10-05
#> 6 2021-07-05 2021-01-05
#> 7 2021-05-05 2021-09-05
#> 8 2020-12-05 2021-10-05
#> 9 2020-11-05 2021-07-05
#> 10 2021-07-05 2021-10-05
DBI::dbDisconnect(con) Created on 2021-10-05 by the reprex package (v2.0.1) |
Have you compared performance (mostly speed) between SQLite and DuckDB? |
I'm working on creating a duckdb version of Andromeda. Then I'll run some comparison tests. I don't understand the purpose for these tests. Is it important that we can append with mismatched column names. duckdb throws an error when I try to do this. test_that("Append from other andromeda with switched column order", {
andromeda <- andromeda()
andromeda$cars <- cars
andromeda2 <- andromeda()
andromeda2$cars <- cars
appendToTable(andromeda$cars, andromeda2$cars %>% filter(speed > 10) %>% select(dist, speed))
carsPlus2 <- andromeda$cars %>% collect()
carsPlus <- rbind(cars, cars[cars$speed > 10, ])
expect_equivalent(carsPlus2, carsPlus)
close(andromeda)
close(andromeda2)
})
test_that("Append from data frame with switched column order", {
andromeda <- andromeda()
andromeda$cars <- cars
appendToTable(andromeda$cars, cars[cars$speed > 10, c("dist", "speed")])
carsPlus2 <- andromeda$cars %>% collect()
carsPlus <- rbind(cars, cars[cars$speed > 10, ])
expect_equivalent(carsPlus2, carsPlus)
close(andromeda)
}) |
Test using SQLite version library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
db <- andromeda(survey = r_data_theme(1e6, "survey"))
db$survey %>%
group_by(Item_1) %>%
summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}
microbenchmark::microbenchmark(
summarizeSomeData(),
times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#> expr min lq mean median uq max
#> summarizeSomeData() 2.585798 2.605132 2.822737 2.642374 2.674033 3.606348
#> neval
#> 5 Test using duckDB version library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
db <- andromeda(survey = r_data_theme(1e6, "survey"))
db$survey %>%
group_by(Item_1) %>%
summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}
microbenchmark::microbenchmark(
summarizeSomeData(),
times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#> expr min lq mean median uq max
#> summarizeSomeData() 2.685436 2.692851 2.992721 2.711518 2.725933 4.147866
#> neval
#> 5 Created on 2021-10-06 by the reprex package (v2.0.1) So they were basically the same on this very simple test that included loading data in and querying. I still have a work to do to get all the andromeda functions working on duckdb. Is there a test you'd like me try? Perhaps a particularly slow step of your normal workflow. duckdb branch is at https://github.com/OHDSI/Andromeda/tree/duckdb |
Could you test joins between large tables? That is often slow. |
About the test with different column orders: this is guarantee that I think all tidyverse functions also make (e.g. |
I did a few simple tests on joins both with and without indexes and posted one result here. SQLite was faster in all my tests. According the FAQ duckdb has not been fully optimized for performance yet. I'll keep working on implementing a duckdb Andromeda version on the duckdb branch and we can see if it gets better at some point. You could use the Any ideas for how to make date handling better in Andromeda? Maybe RSQLite should handle this by keeping track of date columns. This seems like a hard thing to do since RSQLite would have to infer the type of new columns from SQL. |
This seems like a SQLite issue, right? Apparently the 'extended types' extension only applies to columns of physical tables (permanent or temporary), but don't exist in any intermediate results like (nested) queries. |
Yes I guess RSQLite should handle this, but it's not clear to me if it will be possible to propagate date types in more complex queries. I posted an example on the RSQLite issue here. |
This will be addressed in the next release (arrow_S4 branch) library(Andromeda)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
anotherDate = Sys.Date())
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#> todayDate anotherDate
#> <date> <date>
#> 1 2022-11-20 2022-11-20
anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>%
dplyr::select(.data$anotherDate, .data$todayDate)
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#> anotherDate todayDate
#> <date> <date>
#> 1 2022-11-20 2022-11-20 Created on 2022-11-20 with reprex v2.0.2 |
Observed
The text was updated successfully, but these errors were encountered: