---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# textrecipes <a href='https://textrecipes.tidymodels.org'><img src='man/figures/logo.png' align="right" height="139" /></a>
<!-- badges: start -->
[R-CMD-check](https://github.com/tidymodels/textrecipes/actions/workflows/R-CMD-check.yaml)
[Codecov test coverage](https://app.codecov.io/gh/tidymodels/textrecipes?branch=main)
[CRAN status](https://CRAN.R-project.org/package=textrecipes)
[Lifecycle](https://lifecycle.r-lib.org/articles/stages.html)
<!-- badges: end -->
## Introduction
**textrecipes** contains extra steps for the [`recipes`](https://CRAN.R-project.org/package=recipes) package for preprocessing text data.
## Installation
You can install the released version of textrecipes from [CRAN](https://CRAN.R-project.org) with:
```{r, eval=FALSE}
install.packages("textrecipes")
```
Install the development version from GitHub with:
```{r installation, eval=FALSE}
# install.packages("pak")
pak::pak("tidymodels/textrecipes")
```
## Example
The following example walks through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stop words and keeping only the 10 most frequently used tokens. The preprocessing is applied to the variables `medium` and `artist`.
```{r, message=FALSE}
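library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

# Tokenize, drop stop words, keep the 10 most frequent tokens, then compute TF-IDF
tate_rec <- recipe(~ medium + artist, data = tate_text) %>%
  step_tokenize(medium, artist) %>%
  step_stopwords(medium, artist) %>%
  step_tokenfilter(medium, artist, max_tokens = 10) %>%
  step_tfidf(medium, artist)

# Estimate the preprocessing from the data
tate_obj <- tate_rec %>%
  prep()

# Apply the prepared recipe and inspect the result
str(bake(tate_obj, tate_text))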
```
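To make the four steps above more concrete, here is a minimal base R sketch of the same idea on toy data. It is only an illustration under stated assumptions: the documents, the stop word list, and the cutoff of 3 tokens are made up, and the TF-IDF shown is a plain textbook formulation (term frequency times log inverse document frequency), which is not necessarily what `step_tfidf()` computes with its default settings.

```r
# Toy documents standing in for a text column (hypothetical data).
docs <- c(
  "oil paint on canvas",
  "graphite on paper",
  "oil paint on paper"
)

# 1. Tokenize: lower-case and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Remove stop words (a tiny hand-picked list, for illustration only).
stop_words <- c("on", "the", "and")
tokens <- lapply(tokens, function(tok) tok[!tok %in% stop_words])

# 3. Keep only the `max_tokens` most frequent tokens across all documents.
max_tokens <- 3
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(counts)[seq_len(min(max_tokens, length(counts)))]
tokens <- lapply(tokens, function(tok) tok[tok %in% vocab])

# 4. TF-IDF: one row per document, one column per retained token.
tf <- do.call(rbind, lapply(tokens, function(tok) {
  as.numeric(table(factor(tok, levels = vocab))) / max(length(tok), 1)
}))
colnames(tf) <- vocab
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, "*")

tfidf
```

The result has one row per document and one column per retained token, which mirrors the shape of the baked data: one numeric column per token for each text variable.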
## Breaking changes
As of version 0.4.0, `step_lda()` no longer accepts character variables and instead takes tokenlist variables.
The following recipe
```{r, eval=FALSE}
recipe(~text_var, data = data) %>%
  step_lda(text_var)
```
can be replaced with the following recipe to achieve the same results:
```{r, eval=FALSE}
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))

recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)
```
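As far as the `step_tokenize()` documentation describes it, a function passed to `custom_token` should take a character vector and return a list of character vectors, one element per input string. A minimal base R sketch of a function with that shape is shown below; `simple_tokenizer` is a hypothetical name used only for illustration and is not part of textrecipes or text2vec, and its splitting rule is an assumption rather than a reproduction of `text2vec::word_tokenizer()`.

```r
# Hypothetical tokenizer: lower-case the text, then split on runs of
# non-alphanumeric characters.
simple_tokenizer <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")
  # Drop empty strings, which appear when a string starts with a separator.
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

# Returns a list with one character vector of tokens per input string.
simple_tokenizer(c("Oil paint, on canvas", "Graphite on paper"))
```

A function like this could be supplied in place of `lda_tokenizer` when text2vec's tokenizer is not wanted, at the cost of producing different tokens and therefore different results.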
## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/textrecipes/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).