Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Date is changed to integer on selecting using dplyr #28

Open
gowthamrao opened this issue Aug 10, 2021 · 12 comments
Open

Date is changed to integer on selecting using dplyr #28

gowthamrao opened this issue Aug 10, 2021 · 12 comments
Assignees

Comments

@gowthamrao
Copy link
Member

Observed

anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate

todayDate anotherDate

1 2021-08-10 2021-08-10

anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
anAndromedaObject$dataFrameWithDate

anotherDate todayDate

1 18849 18849

@ablack3 ablack3 self-assigned this Oct 4, 2021
@ablack3
Copy link
Collaborator

ablack3 commented Oct 5, 2021

Still trying to figure out why this is happening.

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# this works
anAndromedaObject <- andromeda()
x <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate
#> # Source:   table<dataFrameWithDate> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2021-10-05 2021-10-05


# this does not
anAndromedaObject$dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate)
  
anAndromedaObject$dataFrameWithDate2 
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate
#>         <int>     <int>
#> 1       18905     18905


# but this does
dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate) %>% 
  collect()

dataFrameWithDate2
#> # A tibble: 1 x 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05


anAndromedaObject$dataFrameWithDate2 <- dataFrameWithDate2

anAndromedaObject$dataFrameWithDate2
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05

Created on 2021-10-05 by the reprex package (v2.0.1)

@ablack3
Copy link
Collaborator

ablack3 commented Oct 6, 2021

dateTable <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())

con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:", extended_types = TRUE)

RSQLite::dbWriteTable(con, "table1", dateTable)

RSQLite::dbGetQuery(con, "select * from table1")
#>    todayDate anotherDate
#> 1 2021-10-05  2021-10-05

RSQLite::dbExecute(con, "create table table2 as select * from table1")
#> [1] 0

RSQLite::dbGetQuery(con, "select * from table2")
#>   todayDate anotherDate
#> 1     18905       18905

RSQLite::dbDisconnect(con)

Created on 2021-10-05 by the reprex package (v2.0.1)

Seem like an RSQLite issue. Made a note here r-dbi/RSQLite#379. I'll try to come up with a temporary fix but I'm really not sure how to do this without bringing the data into R. I'm assuming "create a modified copy" must all happen in SQLite.

@ablack3
Copy link
Collaborator

ablack3 commented Oct 6, 2021

I'm pretty tempted to swap out SQLite for duckdb which has real date support.

dateTable <- dplyr::tibble(todayDate = wakefield::date_stamp(10, T), anotherDate = wakefield::date_stamp(10, T))

con <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

DBI::dbWriteTable(con, "table1", dateTable)

DBI::dbGetQuery(con, "select * from table1")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbExecute(con, "create table table2 as select * from table1")
#> [1] 10

DBI::dbGetQuery(con, "select * from table2")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbDisconnect(con)

Created on 2021-10-05 by the reprex package (v2.0.1)

@schuemie
Copy link
Member

schuemie commented Oct 6, 2021

Have you compared performance (mostly speed) between SQLite and DuckDB?

@ablack3
Copy link
Collaborator

ablack3 commented Oct 6, 2021

I'm working on creating a duckdb version of Andromeda. Then I'll run some comparison tests.

I don't understand the purpose for these tests. Is it important that we can append with mismatched column names. duckdb throws an error when I try to do this.

test_that("Append from other andromeda with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  andromeda2 <- andromeda()
  andromeda2$cars <- cars
  appendToTable(andromeda$cars, andromeda2$cars %>% filter(speed > 10) %>% select(dist, speed))

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
  close(andromeda2)
})

test_that("Append from data frame with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  appendToTable(andromeda$cars, cars[cars$speed > 10, c("dist", "speed")])

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
})

@ablack3
Copy link
Collaborator

ablack3 commented Oct 7, 2021

Test using SQLite version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))
  
  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.585798 2.605132 2.822737 2.642374 2.674033 3.606348
#>  neval
#>      5

Test using duckDB version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))
  
  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.685436 2.692851 2.992721 2.711518 2.725933 4.147866
#>  neval
#>      5

Created on 2021-10-06 by the reprex package (v2.0.1)

So they were basically the same on this very simple test that included loading data in and querying. I still have a work to do to get all the andromeda functions working on duckdb. Is there a test you'd like me try? Perhaps a particularly slow step of your normal workflow.

duckdb branch is at https://github.com/OHDSI/Andromeda/tree/duckdb

@schuemie
Copy link
Member

schuemie commented Oct 7, 2021

Could you test joins between large tables? That is often slow.

@schuemie
Copy link
Member

schuemie commented Oct 7, 2021

About the test with different column orders: this is guarantee that I think all tidyverse functions also make (e.g. bind_rows()). If I have data with columns conceptId, count, and I append to that data with columns count, conceptId, it will match the data on column name, not column order, so at the end the conceptId column still only has concept IDs, and the count column only has counts.

@ablack3
Copy link
Collaborator

ablack3 commented Oct 14, 2021

I did a few simple tests on joins both with and without indexes and posted one result here. SQLite was faster in all my tests. According the FAQ duckdb has not been fully optimized for performance yet. I'll keep working on implementing a duckdb Andromeda version on the duckdb branch and we can see if it gets better at some point.

You could use the restoreDate() function (which I should un-deprecate) but that requires bringing the data into R which is probably not what we want.

Any ideas for how to make date handling better in Andromeda? Maybe RSQLite should handle this by keeping track of date columns. This seems like a hard thing to do since RSQLite would have to infer the type of new columns from SQL.

@schuemie
Copy link
Member

This seems like a SQLite issue, right? Apparently the 'extended types' extension only applies to columns of physical tables (permanent or temporary), but don't exist in any intermediate results like (nested) queries.

@ablack3
Copy link
Collaborator

ablack3 commented Oct 14, 2021

Yes I guess RSQLite should handle this, but it's not clear to me if it will be possible to propagate date types in more complex queries. I posted an example on the RSQLite issue here.

@ablack3
Copy link
Collaborator

ablack3 commented Nov 20, 2022

This will be addressed in the next release (arrow_S4 branch)

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2022-11-20 2022-11-20


anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2022-11-20  2022-11-20

Created on 2022-11-20 with reprex v2.0.2

This was referenced Nov 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants