Date is changed to integer on selecting using dplyr #28

gowthamrao · 2021-08-10T20:54:45Z

Observed

anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate

todayDate anotherDate

1 2021-08-10 2021-08-10

anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
anAndromedaObject$dataFrameWithDate

anotherDate todayDate

1 18849 18849

The text was updated successfully, but these errors were encountered:

ablack3 · 2021-10-05T22:16:31Z

Still trying to figure out why this is happening.

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# this works
anAndromedaObject <- andromeda()
x <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate
#> # Source:   table<dataFrameWithDate> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2021-10-05 2021-10-05


# this does not
anAndromedaObject$dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate)
  
anAndromedaObject$dataFrameWithDate2 
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate
#>         <int>     <int>
#> 1       18905     18905


# but this does
dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate) %>% 
  collect()

dataFrameWithDate2
#> # A tibble: 1 x 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05


anAndromedaObject$dataFrameWithDate2 <- dataFrameWithDate2

anAndromedaObject$dataFrameWithDate2
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05

^{Created on 2021-10-05 by the reprex package (v2.0.1)}

ablack3 · 2021-10-06T01:12:38Z

dateTable <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())

con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:", extended_types = TRUE)

RSQLite::dbWriteTable(con, "table1", dateTable)

RSQLite::dbGetQuery(con, "select * from table1")
#>    todayDate anotherDate
#> 1 2021-10-05  2021-10-05

RSQLite::dbExecute(con, "create table table2 as select * from table1")
#> [1] 0

RSQLite::dbGetQuery(con, "select * from table2")
#>   todayDate anotherDate
#> 1     18905       18905

RSQLite::dbDisconnect(con)

^{Created on 2021-10-05 by the reprex package (v2.0.1)}

Seem like an RSQLite issue. Made a note here r-dbi/RSQLite#379. I'll try to come up with a temporary fix but I'm really not sure how to do this without bringing the data into R. I'm assuming "create a modified copy" must all happen in SQLite.

ablack3 · 2021-10-06T02:24:47Z

I'm pretty tempted to swap out SQLite for duckdb which has real date support.

dateTable <- dplyr::tibble(todayDate = wakefield::date_stamp(10, T), anotherDate = wakefield::date_stamp(10, T))

con <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

DBI::dbWriteTable(con, "table1", dateTable)

DBI::dbGetQuery(con, "select * from table1")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbExecute(con, "create table table2 as select * from table1")
#> [1] 10

DBI::dbGetQuery(con, "select * from table2")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbDisconnect(con)

^{Created on 2021-10-05 by the reprex package (v2.0.1)}

schuemie · 2021-10-06T06:27:44Z

Have you compared performance (mostly speed) between SQLite and DuckDB?

ablack3 · 2021-10-06T23:21:35Z

I'm working on creating a duckdb version of Andromeda. Then I'll run some comparison tests.

I don't understand the purpose for these tests. Is it important that we can append with mismatched column names. duckdb throws an error when I try to do this.

test_that("Append from other andromeda with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  andromeda2 <- andromeda()
  andromeda2$cars <- cars
  appendToTable(andromeda$cars, andromeda2$cars %>% filter(speed > 10) %>% select(dist, speed))

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
  close(andromeda2)
})

test_that("Append from data frame with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  appendToTable(andromeda$cars, cars[cars$speed > 10, c("dist", "speed")])

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
})

ablack3 · 2021-10-07T00:20:05Z

Test using SQLite version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))
  
  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.585798 2.605132 2.822737 2.642374 2.674033 3.606348
#>  neval
#>      5

Test using duckDB version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))
  
  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.685436 2.692851 2.992721 2.711518 2.725933 4.147866
#>  neval
#>      5

^{Created on 2021-10-06 by the reprex package (v2.0.1)}

So they were basically the same on this very simple test that included loading data in and querying. I still have a work to do to get all the andromeda functions working on duckdb. Is there a test you'd like me try? Perhaps a particularly slow step of your normal workflow.

duckdb branch is at https://github.com/OHDSI/Andromeda/tree/duckdb

schuemie · 2021-10-07T04:52:54Z

Could you test joins between large tables? That is often slow.

schuemie · 2021-10-07T04:59:04Z

About the test with different column orders: this is guarantee that I think all tidyverse functions also make (e.g. bind_rows()). If I have data with columns conceptId, count, and I append to that data with columns count, conceptId, it will match the data on column name, not column order, so at the end the conceptId column still only has concept IDs, and the count column only has counts.

ablack3 · 2021-10-14T02:58:07Z

I did a few simple tests on joins both with and without indexes and posted one result here. SQLite was faster in all my tests. According the FAQ duckdb has not been fully optimized for performance yet. I'll keep working on implementing a duckdb Andromeda version on the duckdb branch and we can see if it gets better at some point.

You could use the restoreDate() function (which I should un-deprecate) but that requires bringing the data into R which is probably not what we want.

Any ideas for how to make date handling better in Andromeda? Maybe RSQLite should handle this by keeping track of date columns. This seems like a hard thing to do since RSQLite would have to infer the type of new columns from SQL.

schuemie · 2021-10-14T04:06:39Z

This seems like a SQLite issue, right? Apparently the 'extended types' extension only applies to columns of physical tables (permanent or temporary), but don't exist in any intermediate results like (nested) queries.

ablack3 · 2021-10-14T16:24:34Z

Yes I guess RSQLite should handle this, but it's not clear to me if it will be possible to propagate date types in more complex queries. I posted an example on the RSQLite issue here.

ablack3 · 2022-11-20T23:16:24Z

This will be addressed in the next release (arrow_S4 branch)

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2022-11-20 2022-11-20


anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2022-11-20  2022-11-20

^{Created on 2022-11-20 with reprex v2.0.2}

ablack3 self-assigned this Oct 4, 2021

schuemie mentioned this issue Oct 6, 2021

Support for DuckDb OHDSI/DatabaseConnector#153

Closed

peterdutey mentioned this issue Jul 12, 2022

Date handling in SQLite ramses-antibiotics/ramses-package#31

Closed

This was referenced Nov 20, 2022

Switch to Arrow backend #40

Merged

v1.0 release candidate #42

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Date is changed to integer on selecting using dplyr #28

Date is changed to integer on selecting using dplyr #28

gowthamrao commented Aug 10, 2021

ablack3 commented Oct 5, 2021

ablack3 commented Oct 6, 2021 •

edited

Loading

ablack3 commented Oct 6, 2021

schuemie commented Oct 6, 2021

ablack3 commented Oct 6, 2021

ablack3 commented Oct 7, 2021 •

edited

Loading

schuemie commented Oct 7, 2021

schuemie commented Oct 7, 2021

ablack3 commented Oct 14, 2021 •

edited

Loading

schuemie commented Oct 14, 2021

ablack3 commented Oct 14, 2021

ablack3 commented Nov 20, 2022

Date is changed to integer on selecting using dplyr #28

Date is changed to integer on selecting using dplyr #28

Comments

gowthamrao commented Aug 10, 2021

ablack3 commented Oct 5, 2021

ablack3 commented Oct 6, 2021 • edited Loading

ablack3 commented Oct 6, 2021

schuemie commented Oct 6, 2021

ablack3 commented Oct 6, 2021

ablack3 commented Oct 7, 2021 • edited Loading

schuemie commented Oct 7, 2021

schuemie commented Oct 7, 2021

ablack3 commented Oct 14, 2021 • edited Loading

schuemie commented Oct 14, 2021

ablack3 commented Oct 14, 2021

ablack3 commented Nov 20, 2022

ablack3 commented Oct 6, 2021 •

edited

Loading

ablack3 commented Oct 7, 2021 •

edited

Loading

ablack3 commented Oct 14, 2021 •

edited

Loading