Cannot use options when using spacyr as step_tokenize engine #217

gary-mu · 2023-03-12T19:49:02Z

The problem

I'm having trouble passing options to step_tokenize() when using the spacyr engine.

Reproducible example

library(textrecipes)

short_data <- data.frame(text = c(
  "This is a short tale,",
  "With many cats and ladies.",
  'I have 2 very important news for you. The 1st one is this!'
))

rec_spec <- recipe(~text, data = short_data) %>%
  step_tokenize(text, 
                engine = 'spacyr', 
                options = list(
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_separators = FALSE,
                  remove_symbols = TRUE
                )) %>%
  step_lemma(text) %>%
  step_tf(text) %>% prep()

Error in token(x = data[, 1, drop = TRUE], remove_punct = TRUE, remove_numbers = TRUE,  : 
  unused arguments (remove_punct = TRUE, remove_numbers = TRUE, remove_separators = FALSE, remove_symbols = TRUE)

EmilHvitfeldt · 2023-03-13T18:18:58Z

This is not possible because step_tokenize() uses spacyr::spacy_parse() instead of spacyr::spacy_tokenize(), so that it can also return part-of-speech and lemma information. spacyr::spacy_parse() doesn't have the arguments remove_punct, remove_numbers, remove_separators, or remove_symbols.
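
A quick way to see the difference is to compare the formal arguments of the two spacyr functions. This is only a sketch: it assumes spacyr is installed (it does not need a running spaCy backend, since nothing is parsed), and the variable names are illustrative.

```r
# The four options from the original question
opts <- c("remove_punct", "remove_numbers", "remove_separators", "remove_symbols")

# Names of the formal arguments of each spacyr function
tokenize_args <- names(formals(spacyr::spacy_tokenize))
parse_args <- names(formals(spacyr::spacy_parse))

# Which of the requested options does each function declare?
data.frame(
  option = opts,
  in_spacy_tokenize = opts %in% tokenize_args,
  in_spacy_parse = opts %in% parse_args
)
```

If the options appear only in the spacy_tokenize column, that matches the explanation above: they belong to a function that step_tokenize() does not call for this engine.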

Since you have part-of-speech information, you can use step_pos_filter() to keep only the parts of speech you want, which mimics removing the others.

Below is just an example of how you would do this. You would need to expand the vector passed to keep_tags. See https://textrecipes.tidymodels.org/reference/step_pos_filter.html#details for more information.

library(textrecipes)
short_data <- data.frame(text = c(
  "This is a short tale,",
  "With many cats and ladies.",
  'I have 2 very important news for you. The 1st one is this!'
))

rec_spec <- recipe(~text, data = short_data) %>%
  step_tokenize(text, 
                engine = 'spacyr') %>%
  step_pos_filter(text, keep_tags = c("ADJ", "ADV", "NOUN", "VERB")) %>%
  step_lemma(text) %>%
  step_tf(text) %>%
  prep()

rec_spec %>%
  bake(new_data = NULL) %>%
  glimpse()
#> Rows: 3
#> Columns: 11
#> $ tf_text_1st       <int> 0, 0, 1
#> $ tf_text_cat       <int> 0, 1, 0
#> $ tf_text_have      <int> 0, 0, 1
#> $ tf_text_important <int> 0, 0, 1
#> $ tf_text_lady      <int> 0, 1, 0
#> $ tf_text_many      <int> 0, 1, 0
#> $ tf_text_news      <int> 0, 0, 1
#> $ tf_text_one       <int> 0, 0, 1
#> $ tf_text_short     <int> 1, 0, 0
#> $ tf_text_tale      <int> 1, 0, 0
#> $ tf_text_very      <int> 0, 0, 1
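
To get closer to the options in the original question, one possible approach is to start from the full tag set and drop the unwanted tags, rather than listing the ones to keep by hand. This is a sketch, not a tested recipe: it assumes the tag set is the one listed on the step_pos_filter() details page linked above, and the names all_tags, drop_tags, and keep_tags are illustrative.

```r
# Universal part-of-speech tags, assumed to match the list on the
# step_pos_filter() details page
all_tags <- c(
  "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
  "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "SPACE"
)

# Rough equivalents of remove_punct, remove_numbers and remove_symbols;
# "SPACE" is kept to mirror remove_separators = FALSE
drop_tags <- c("PUNCT", "NUM", "SYM")

keep_tags <- setdiff(all_tags, drop_tags)

# The result can then be passed to the step, e.g.
# step_pos_filter(text, keep_tags = keep_tags)
```

Note that filtering on tags is not identical to the tokenizer options: a token such as "1st" is removed only if spaCy tags it as NUM.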

Seeing this problem, I am opening an issue to make it easier to remove unwanted tags: #218

github-actions · 2023-03-28T01:20:02Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

EmilHvitfeldt closed this as completed Mar 13, 2023

github-actions bot locked and limited conversation to collaborators Mar 28, 2023
