
Commit

tried w hash, makes it slower than %in%
your_username committed Apr 1, 2019
1 parent ef40b39 commit 45e3139
Showing 5 changed files with 23 additions and 3 deletions.
2 changes: 2 additions & 0 deletions README.Rmd
@@ -13,6 +13,8 @@ This shows one how to scrape data directly from a website `r emo::ji("spider_web
 library(tidyverse)
 library(ggthemes)
 library(lubridate)
+library(hash)
+library(tictoc)
 ```
 
 >Loads functionality to decode and plot CRAN data `r emo::ji("computer")`
2 changes: 2 additions & 0 deletions README.md
@@ -10,6 +10,8 @@ This shows one how to scrape data directly from a website 🕸 (an html table of
 library(tidyverse)
 library(ggthemes)
 library(lubridate)
+library(hash)
+library(tictoc)
 ```
 
 > Loads functionality to decode and plot CRAN data 💻
Binary file modified README_files/figure-markdown_github/unnamed-chunk-4-1.png
Binary file modified README_files/figure-markdown_github/unnamed-chunk-5-1.png
22 changes: 19 additions & 3 deletions count_cran.R
@@ -44,24 +44,40 @@ last_char <- function(s) s%>%str_sub(start=-1)
 all_but_last <- function(s) s%>%str_sub(end=-2)
 # poor man's stemming
 # convert plurals to singular if latter in supplied word list
-stem_plurals <- function(ws) {
+stem_plurals_in <- function(ws) {
   lc <- last_char(ws)
   abl <- all_but_last(ws)
   if_else(lc=="s"&(abl%in%ws),abl,ws)
 }
 
+# hash::has.key could be faster than %in%
+stem_plurals_hash <- function(ws) {
+  ws_h <- hash(ws,1)
+  lc <- last_char(ws)
+  abl <- all_but_last(ws)
+  if_else(lc=="s"&has.key(abl,ws_h),abl,ws)
+}
+
+stem_plurals <- function(ws,hash) {
+  if (hash)
+    stem_plurals_hash(ws)
+  else
+    stem_plurals_in(ws)
+}
+
 get_word_vector <- function(titles) titles %>%
   str_squish() %>%
   str_to_lower() %>%
   str_remove_all("[^[:alpha:] ]") %>%
   str_split(" ") %>%
   unlist
 
-get_top_words <- function(df,how_many) df %>%
+get_top_words <- function(df,how_many,hash=F) df %>%
   pull(Title) %>%
   get_word_vector %>%
   remove_short_and_stop %>%
-  stem_plurals %>%
+  #{tic();sp<-stem_plurals(.,hash);toc();sp} %>%
+  stem_plurals(.,hash) %>%
   tibble(word=.) %>% # just to use count
   count(word,sort=T) %>%
   head(how_many)
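
The commit message reports that the hash-based lookup turned out slower than `%in%`. The following is a minimal base-R sketch of that kind of comparison, not the author's benchmark. It makes several assumptions: `stem_in` and `stem_env` are hypothetical names, a plain environment stands in for the `hash` package, `system.time` stands in for `tictoc`, the word list is synthetic, and all words are non-empty. Timings depend on the machine and input, so none are quoted here.

```r
# Base-R helpers mirroring last_char / all_but_last from count_cran.R.
last_char    <- function(s) substr(s, nchar(s), nchar(s))
all_but_last <- function(s) substr(s, 1, nchar(s) - 1)

# Variant 1: vectorized membership test with %in%.
stem_in <- function(ws) {
  abl <- all_but_last(ws)
  ifelse(last_char(ws) == "s" & abl %in% ws, abl, ws)
}

# Variant 2: one lookup per word in an environment used as a hash table
# (an assumed stand-in for hash::has.key, not the package itself).
stem_env <- function(ws) {
  e <- new.env(hash = TRUE)
  for (w in unique(ws)) assign(w, TRUE, envir = e)
  abl <- all_but_last(ws)
  found <- vapply(
    abl,
    function(k) nzchar(k) && exists(k, envir = e, inherits = FALSE),
    logical(1),
    USE.NAMES = FALSE
  )
  ifelse(last_char(ws) == "s" & found, abl, ws)
}

# Synthetic input: random five-letter "words" plus their plurals.
set.seed(1)
base_words <- replicate(2000, paste(sample(letters, 5, replace = TRUE), collapse = ""))
words <- sample(c(base_words, paste0(base_words, "s")), 20000, replace = TRUE)

t_in  <- system.time(r_in  <- stem_in(words))
t_env <- system.time(r_env <- stem_env(words))

# Both variants must agree before the timings mean anything.
stopifnot(identical(r_in, r_env))
print(c(in_elapsed = t_in[["elapsed"]], env_elapsed = t_env[["elapsed"]]))
```

One plausible reason for the result in the commit message is that `%in%` is a wrapper around `match()`, which handles the whole vector in a single call, whereas a per-key lookup pays R-level function-call overhead for every word.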