cohorts converted to data.table/tidytable.... #1

Open
coforfe opened this issue Nov 15, 2021 · 6 comments

@coforfe

coforfe commented Nov 15, 2021

Hello Peer,

Thanks for your package. I find it very useful and convenient.

Since I am working with a large customer base, computing the monthly or daily cohorts takes a little while, so I converted your code to data.table (with a little bit of tidytable) and I already see an improvement. Even for the small online_cohorts dataset included in the package, I see a 2x improvement.

If you are interested, I can send you the code so you can extend your package with it.

Thanks,
Carlos.

@PeerChristensen
Owner

Hi Carlos

I'm very pleased that you find it useful! Converting to data.table is a great idea, and I don't remember why I didn't do this in the first place. Please send your code. I'd be more than happy to try it and perhaps make a new release.

Best
Peer

@coforfe
Author

coforfe commented Nov 15, 2021

Hi Peer,

Yes, I have been playing with cohort_table_month() and cohort_table_day(), but I have not wrapped the code in functions yet.

Here is the code for each case:

  • cohort_table_month():
library(data.table)
library(tidytable)
library(cohorts)
library(zoo)

online_cohorts %>%
  as.data.table() %>%
  setnames( old = c('CustomerID', 'InvoiceDate'), new = c('id_var', 'date')) %>%
  .[ , month  := as.yearmon(date), by =  .(id_var)] %>%
  .[ , cohort := min(month), by =.(id_var)] %>%
  .[ , users := .N, by = .(cohort, month) ] %>%
  .[ , .(cohort, month, users)] %>%
  unique() %>%
  pivot_wider.( names_from = month, values_from = users) %>%
  .[ , cohort := 1:uniqueN(cohort)]

  • cohort_table_day():
library(data.table)
library(tidytable)
library(cohorts)
library(zoo)

 online_cohorts %>%
    as.data.table() %>%
    setnames( old = c('CustomerID', 'InvoiceDate'), new = c('id_var', 'date')) %>%
    .[ , cohort := min(date), by = .(id_var)] %>%
    .[ , users := .N, by = .(cohort, date) ] %>%
    .[ , .(cohort, date, users)] %>%
    unique() %>%
    pivot_wider.( names_from = date, values_from = users) %>%
    .[ , cohort := 1:uniqueN(cohort)]
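
For readers who want to see what the two pipelines above produce, here is a minimal sketch on a small made-up table. The `orders` data and its values are assumptions for illustration and do not come from the package. The sketch uses only data.table, swaps `as.yearmon()` for base R `format()` so zoo is not needed, and adds a `uniqueN()` column next to `.N` to show the difference between counting rows and counting distinct customers. Which of the two the package intends is not stated in this thread.

```r
library(data.table)

# Hypothetical toy orders: customer 1 buys twice in January.
orders <- data.table(
  CustomerID  = c(1, 1, 1, 2, 2, 3),
  InvoiceDate = as.Date(c("2021-01-05", "2021-01-20", "2021-02-10",
                          "2021-02-03", "2021-03-15", "2021-03-20"))
)

# Month of each order as a "YYYY-MM" string (sorts correctly as text).
orders[, month := format(InvoiceDate, "%Y-%m")]

# Cohort = first month in which the customer appears.
orders[, cohort := min(month), by = CustomerID]

# .N counts rows; uniqueN() counts distinct customers.
counts <- orders[, .(users = .N, customers = uniqueN(CustomerID)),
                 by = .(cohort, month)]

# One row per cohort, one column per month; missing combinations become NA.
wide <- dcast(counts, cohort ~ month, value.var = "users")
print(counts)
print(wide)
```

In this toy table the January cohort has two rows in January but only one distinct customer, so the two counting rules give different results for that cell.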

Thanks!
Carlos.

@coforfe
Author

coforfe commented Nov 15, 2021

Hello,

Just in case, here is the benchmarking code, which compares:

  • the data.table approach
  • the equivalent with dplyr
  • your function

library(cohorts)
library(data.table)
library(magrittr)
library(zoo)
library(tidytable)
library(dplyr)
library(microbenchmark)
library(tidyr)


microbenchmark(

#------------------------
# cohort_table_month()
#------------------------
data.table = online_cohorts %>%
  as.data.table() %>%
  setnames( old = c('CustomerID', 'InvoiceDate'), new = c('id_var', 'date')) %>%
  .[ , month  := as.yearmon(date), by = .(id_var)] %>%
  .[ , cohort := min(month), by = .(id_var)] %>%
  .[ , users := .N, by = .(cohort, month) ] %>%
  .[ , .(cohort, month, users)] %>%
  unique() %>%
  pivot_wider.( names_from = month, values_from = users) %>%
  .[ , cohort := 1:uniqueN(cohort)]
, 

dplyr =
online_cohorts %>% 
  rename( id_var = CustomerID) %>%
  rename( date = InvoiceDate) %>%
  group_by( id_var ) %>% 
  mutate(month = zoo::as.yearmon( date )) %>%
  mutate(cohort = min(month)) %>% 
  group_by(cohort, month) %>% 
  summarise(users = dplyr::n()) %>% 
  pivot_wider(names_from = month,  values_from = users) %>%
  ungroup() %>%
  mutate(cohort = 1:dplyr::n_distinct(cohort)) %>% 
  tibble::as_tibble()

,

funcion = online_cohorts %>%
  cohort_table_month(CustomerID, InvoiceDate)

, times = 10
)


#---------------------
# cohort_table_day()
#---------------------
microbenchmark(
  data.table = online_cohorts %>%
    as.data.table() %>%
    setnames( old = c('CustomerID', 'InvoiceDate'), new = c('id_var', 'date')) %>%
    .[ , cohort := min(date), by = .(id_var)] %>%
    .[ , users := .N, by = .(cohort, date) ] %>%
    .[ , .(cohort, date, users)] %>%
    unique() %>%
    pivot_wider.( names_from = date, values_from = users) %>%
    .[ , cohort := 1:uniqueN(cohort)]
  , 
  
  dplyr =
    online_cohorts %>% 
    rename( id_var = CustomerID) %>%
    rename( date = InvoiceDate) %>%
    group_by( id_var ) %>% 
    mutate(cohort = min(date)) %>% 
    group_by(cohort, date) %>% 
    summarise(users = n()) %>% 
    pivot_wider(names_from = date,  values_from = users) %>%
    ungroup() %>%
    mutate(cohort = 1:dplyr::n_distinct(cohort)) %>% 
    tibble::as_tibble()
  
  ,
  
  funcion = online_cohorts %>%
    cohort_table_day(CustomerID, InvoiceDate)
  
  , times = 10
)
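
Timings are only comparable if the alternatives return the same counts, so it can help to check that first on a small input. The following is a minimal sketch of such a check, assuming a made-up `orders` table (not from the package): it compares a data.table count with a base R `table()` cross-tabulation of the same data.

```r
library(data.table)

# Hypothetical toy data: three customers, day-level dates.
orders <- data.table(
  id   = c(1, 1, 2, 2, 3),
  date = as.Date(c("2021-01-05", "2021-01-07", "2021-01-05",
                   "2021-01-08", "2021-01-07"))
)

# Approach A: data.table, following the day-level pipeline above.
a <- copy(orders)
a[, cohort := min(date), by = id]
a_counts <- a[, .(users = .N), by = .(cohort, date)]
a_wide   <- dcast(a_counts, cohort ~ date, value.var = "users", fill = 0L)
a_mat    <- as.matrix(a_wide[, !"cohort"])

# Approach B: base R. ave() gives each row its customer's first date,
# and table() cross-tabulates cohort against date.
cohort_b <- ave(as.numeric(orders$date), orders$id, FUN = min)
b_mat    <- unclass(table(cohort_b, as.numeric(orders$date)))

# Same shape and same counts in every cell?
same <- all(dim(a_mat) == dim(b_mat)) && all(a_mat == b_mat)
print(same)
```

Here missing cohort/date combinations are filled with 0 instead of NA so that the two matrices can be compared cell by cell.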

Thanks,
Carlos.

@coforfe
Author

coforfe commented Nov 17, 2021

Hi,

Below you can find the previous transformations converted to functions, using pure data.table syntax (tidytable is no longer needed). I also tuned the cohort_table_pct() function a little.

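#-- Packages used below: data.table, magrittr (for %>%) and zoo (for as.yearmon).
library(data.table)
library(magrittr)
library(zoo)

#----------------------  MONTH  --------------------
cohort_table_month_fast <- function(dt , customer, date) {
  #-- customer: name of the customer id column, as a string.
  #-- date: name of a date-class column, as a string.
  #-- Note: get(date) and get(customer) look inside dt first, so a column
  #-- literally named "date" or "customer" would shadow the argument.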
  dt_out <- dt %>%
  as.data.table() %>%
    .[ , month  := as.yearmon(get(date)), by =  .(get(customer))] %>%
    .[ , cohort := min(month), by =.(get(customer))] %>%
    .[ , users := .N, by = .(cohort, month) ] %>%
    .[ , .(cohort, month, users)] %>%
    unique() %>%
    dcast( cohort ~ month, value.var = "users" ) %>%
    .[ , cohort := 1:uniqueN(cohort) ] %>%
    as.data.table()
  
  return(dt_out)
}



#----------------------   PCT --------------------
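cohort_table_pct_fast <- function( dt, decimals = 1) {
  #-- dt: a cohort table as returned by the *_fast functions in this comment.
  #-- Assumes a square table in which the k-th cohort first appears in the
  #-- k-th period column, so the diagonal holds each cohort's starting size.
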
  diagonal <- dt %>%
    .[ , -"cohort", with = FALSE] %>%
    as.matrix() %>%
    diag()
  
  res_pct <- round(dt*100/diagonal, decimals) %>%
    .[ , cohort := 1:nrow(dt)] %>%
    as.data.table()
  
  return(res_pct)
}




#-------------------   DAY  -----------------
cohort_table_day_fast <- function(dt , customer, date) {
  #-- customer: name of the customer id column, as a string.
  #-- date: name of a date-class column, as a string.
  dt_out <- dt %>%
    as.data.table() %>%
    .[ , datetr := as.IDate(get(date))] %>%
    .[ , cohort := min(datetr), by =.(get(customer))] %>%
    .[ , users := .N, by = .(cohort, datetr) ] %>%
    .[ , .(cohort, datetr, users)] %>%
    unique() %>%
    dcast( cohort ~ datetr, value.var = "users") %>%
    .[ , cohort := 1:uniqueN(cohort)] %>%
    as.data.table()
  
  return(dt_out)
}
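
The percentage step relies on R recycling a vector down the columns of a table, so that every row ends up divided by its own diagonal value. Below is a minimal base R sketch of that idea on a made-up 3 x 3 count matrix. The numbers are assumptions for illustration, and the approach only works when the table is square and the k-th cohort starts in the k-th column.

```r
# Hypothetical cohort counts: rows are cohorts, columns are periods.
counts <- matrix(
  c(10,  5,  2,
    NA,  8,  4,
    NA, NA,  6),
  nrow = 3, byrow = TRUE,
  dimnames = list(cohort = c("1", "2", "3"), period = c("p1", "p2", "p3"))
)

# Starting size of each cohort: 10, 8, 6.
start_size <- diag(counts)

# start_size has one entry per row, so recycling divides row i by start_size[i].
pct <- round(counts * 100 / start_size, 1)
print(pct)
```

Each diagonal cell becomes 100, and the cells to its right show the share of the starting cohort that is still active in later periods.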

Hope that it helps.

Thanks!
Carlos.

@PeerChristensen
Owner

Hi Carlos

Thank you so much for your great work.
I'd love to use your improvements in an upcoming release, and I will make sure to give you credit.

Best regards
Peer

@coforfe
Author

coforfe commented Nov 18, 2021

Thanks, Peer!
