Commit

top 10 words
your_username committed Mar 31, 2019
1 parent a48b298 commit 558f9f1
Showing 11 changed files with 79 additions and 4 deletions.
20 changes: 18 additions & 2 deletions README.Rmd
@@ -7,7 +7,7 @@ output: github_document

This shows how to scrape data directly from a website `r emo::ji("spider_web")` (an HTML table of CRAN packages), structure it as a dataframe, and plot it.

- >Loads packages `r emo::ji("package")`
+ >Loads R packages `r emo::ji("package")`
```{r setup,message=F}
library(tidyverse)
@@ -23,12 +23,28 @@ source("count_cran.R")

>Reads the package table (as HTML) from CRAN and decodes it into a dataframe: `r emo::ji("man_technologist")`
- ```{r}
+ ```{r,cache=T}
url_cran <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
df_cran <- read_file(url_cran) %>% cran_html_to_df
nrow(df_cran)
```
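
As a rough sketch of what decoding an HTML table into a dataframe can look like, the example below parses a small inline HTML string with `rvest`. This is not the `cran_html_to_df` helper from `count_cran.R`, whose body is not part of this commit. The function name `html_table_to_df`, the sample rows, and the choice of `rvest` (version 1.0 or later, for `html_element()`) are all assumptions made for illustration.

```r
library(rvest)

# Hypothetical stand-in for the CRAN page: a tiny table with the same
# three columns (Date, Package, Title) that df_cran has.
sample_html <- '
<table>
  <tr><th>Date</th><th>Package</th><th>Title</th></tr>
  <tr><td>2019-03-31</td><td>AmigaFFH</td><td>Commodore Amiga File Format Handler</td></tr>
  <tr><td>2019-03-30</td><td>examplepkg</td><td>An Example Package</td></tr>
</table>'

# Hypothetical helper: parse the first <table> in the HTML text and
# convert the Date column from character to Date.
html_table_to_df <- function(html_text) {
  df <- read_html(html_text) %>%
    html_element("table") %>%
    html_table()
  df$Date <- as.Date(df$Date)
  df
}

df <- html_table_to_df(sample_html)
```

For the real page, the same helper could presumably be fed the text returned by `read_file(url_cran)`, provided the page holds a single table with these columns.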

What's in it?

```{r}
glimpse(df_cran)
```

>What are the top 10 words used in package titles?
```{r,cache=T}
df_cran %>%
get_top_words(10) %>%
knitr::kable()
```



>Plots it! `r emo::ji("chart")`
```{r}
37 changes: 35 additions & 2 deletions README.md
@@ -4,7 +4,7 @@ How many packages are there on CRAN?
<!-- README.md is generated from README.Rmd. Please edit that file -->
This shows how to scrape data directly from a website 🕸 (an HTML table of CRAN packages), structure it as a dataframe, and plot it.

- > Loads packages 📦
+ > Loads R packages 📦
``` r
library(tidyverse)
@@ -28,6 +28,39 @@ nrow(df_cran)

## [1] 11175

What's in it?

``` r
glimpse(df_cran)
```

## Observations: 11,175
## Variables: 3
## $ Date <date> 2019-03-31, 2019-03-31, 2019-03-31, 2019-03-31, 2019-03…
## $ Package <chr> "AmigaFFH", "AzureGraph", "bnviewer", "bysykkel", "fastN…
## $ Title <chr> "Commodore Amiga File Format Handler", "Simple Interface…

> What are the top 10 words used in package titles?
``` r
df_cran %>%
get_top_words(10) %>%
knitr::kable()
```

| word | n|
|:-----------|-----:|
| data | 1584|
| analysis | 1141|
| models | 683|
| functions | 446|
| tools | 434|
| regression | 426|
| using | 389|
| estimation | 364|
| model | 354|
| interface | 330|

> Plots it! 💹
``` r
@@ -36,6 +69,6 @@ df_cran %>%
theme_economist()
```

- ![](README_files/figure-markdown_github/unnamed-chunk-3-1.png)
+ ![](README_files/figure-markdown_github/unnamed-chunk-5-1.png)

> There you have it! R is growing exponentially!😄
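
A minimal sketch of how a running package total could be derived from the `Date` column and plotted. It is an illustration under assumptions, not the body of `plot_cran_df`, which is only partly visible in this commit. The toy data and the name `cumulative_counts` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same Date/Package columns as df_cran.
toy <- tibble(
  Date = as.Date(c("2019-03-29", "2019-03-30", "2019-03-30", "2019-03-31")),
  Package = c("pkgA", "pkgB", "pkgC", "pkgD")
)

# Count packages per date, then accumulate the counts in date order.
cumulative_counts <- function(df) {
  df %>%
    count(Date) %>%
    arrange(Date) %>%
    mutate(total = cumsum(n))
}

cum <- cumulative_counts(toy)

# Line plot of the running total, titled with the overall count.
p <- ggplot(cum, aes(x = Date, y = total)) +
  geom_line() +
  labs(title = sprintf("%d CRAN packages", max(cum$total)))
```

Printing `p` draws the chart; a theme such as `ggthemes::theme_economist()` could be added to it in the same way as in the README chunk.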
12 changes: 12 additions & 0 deletions README_cache/markdown_github/__packages
@@ -0,0 +1,12 @@
base
tidyverse
ggplot2
tibble
tidyr
readr
purrr
dplyr
stringr
forcats
ggthemes
lubridate
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Empty file.
Binary file not shown.
14 changes: 14 additions & 0 deletions count_cran.R
@@ -33,3 +33,17 @@ plot_cran_df <- function(df_cran,brks=1000) {
labs(title=sprintf("%d CRAN packages",total_max),
subtitle=today()%>%as.character)
}

stop_words <- c("for","and","the","with","from")

get_top_words <- function(df,how_many) df %>%
pull(Title) %>%
str_squish() %>%
str_to_lower() %>%
str_remove_all("[^[:alpha:] ]") %>%
str_split(" ") %>%
unlist %>%
keep(~str_length(.x)>2&!(.x%in%stop_words)) %>%
tibble(word=.) %>%
count(word,sort=T) %>%
head(how_many)
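
To see what this pipeline does on a small input, the sketch below restates `get_top_words` so the block runs on its own and applies it to three made-up titles. The `toy` data is hypothetical. The filtering step uses plain logical indexing instead of `keep()`, which should behave the same for this purpose.

```r
library(tidyverse)

stop_words <- c("for", "and", "the", "with", "from")

# Restated from count_cran.R so that this example is self-contained.
get_top_words <- function(df, how_many) {
  words <- df %>%
    pull(Title) %>%
    str_squish() %>%                        # collapse repeated whitespace
    str_to_lower() %>%                      # case-insensitive counting
    str_remove_all("[^[:alpha:] ]") %>%     # drop digits and punctuation
    str_split(" ") %>%
    unlist()
  # Keep words longer than two characters that are not stop words.
  words <- words[str_length(words) > 2 & !(words %in% stop_words)]
  tibble(word = words) %>%
    count(word, sort = TRUE) %>%
    head(how_many)
}

# Hypothetical titles, not taken from CRAN.
toy <- tibble(Title = c(
  "Tools for Data Analysis",
  "Bayesian Data Models",
  "Analysis of Data with R"
))

top <- get_top_words(toy, 2)
```

Here "for" and "with" are removed as stop words, while "of" and "r" are removed by the length filter, so `top` holds "data" (3 occurrences) followed by "analysis" (2).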
