Example needed for tidy approach for stm modeling with covariates #173
Comments
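The main problem you are having is that when you remove stop words, you remove some entire documents. Then the covariate data you pass to stm() no longer has one row per document in the sparse matrix, so the two do not line up. One option is to keep the stop words:
library(tidytext)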
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
gadarian_sparse <- gadarian %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, open.ended.response) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
topic_model <- stm(
  gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian,
  verbose = FALSE
)
summary(topic_model)
#> A topic model with 3 topics, 341 documents and a 1512 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: the, to, of, people, is, in, country
#> FREX: from, come, coming, if, entering, illegally, united
#> Lift: afraid, if, mean, unsecured, been, entering, from
#> Score: the, to, from, coming, people, come, it
#> Topic 2 Top Words:
#> Highest Prob: that, and, a, i, they, not, our
#> FREX: that, they, we, have, pay, so, usa
#> Lift: asians, east, indians, usa, bums, contibution, goverment
#> Score: that, we, they, not, our, have, here
#> Topic 3 Top Words:
#> Highest Prob: for, immigrants, illegal, of, and, jobs, our
#> FREX: security, social, job, health, mexico, workers, loss
#> Lift: caused, ducation, hospitals, lowering, quality, bombings, killing
#> Score: illegal, for, security, jobs, immigrants, loss, our
Created on 2020-05-04 by the reprex package (v0.3.0)
Another option is to create a new dataframe for covariates that only contains the observations still present in the sparse matrix.
I think a good option would be to rewrite / expand the topic modeling vignette to use stm throughout and add a section for document-level covariates. It needs some updating anyway.
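To see the mismatch concretely, here is a minimal sketch that lists which responses disappear once stop words are removed. The names gadarian_ids, kept, and dropped are made up for this illustration; everything else comes from the packages already loaded in this thread. The model summaries in this thread report 341 documents without stop-word removal and 335 with it, so a handful of dropped rows is what one would expect to see.

```r
library(dplyr)
library(tidytext)
library(stm)

# Give every response an explicit id so it can be tracked after tokenizing
# (gadarian_ids is a hypothetical name used only in this sketch)
gadarian_ids <- gadarian %>%
  mutate(document = as.character(row_number()))

# Documents that still have at least one token after removing stop words
kept <- gadarian_ids %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words, by = "word") %>%
  distinct(document)

# Documents that vanished entirely: these rows have no counterpart
# in the sparse matrix, which is why the covariate data stops lining up
dropped <- gadarian_ids %>%
  anti_join(kept, by = "document")

nrow(gadarian_ids)
nrow(kept)
dropped$open.ended.response
```

Looking at dropped$open.ended.response can help decide whether losing those responses is acceptable or whether keeping stop words is the better trade-off.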
Thank you very much for your kind explanation, @juliasilge! Following your advice, I got it to work. What do you think about my approach below?
library(tidytext)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))
gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2)
#> Joining, by = "document"
topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = covariate_df,
  verbose = FALSE
)
summary(topic_model)
#> A topic model with 3 topics, 335 documents and a 1160 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: taxes, security, illegals, immigrants, english, language, social
#> FREX: 1, law, taxes, terrorists, due, lost, 3
#> Lift: extent, fined, fullest, ileagles, on't, sneack, buttons
#> Score: 1, assimilate, security, english, law, 3, recieve
#> Topic 2 Top Words:
#> Highest Prob: jobs, illegal, immigration, welfare, country, care, americans
#> FREX: healthcare, cost, hospitals, strain, welfare, lack, im
#> Lift: crowding, hospitals, cheap, draining, allowing, immigrates, sealing
#> Score: jobs, im, cost, loss, welfare, capitalist, question
#> Topic 3 Top Words:
#> Highest Prob: people, immigrants, illegal, country, immigration, coming, border
#> FREX: people, coming, live, illegally, process, means, support
#> Lift: live, coming, term, false, process, required, people
#> Score: people, coming, process, illegally, stop, businesses, suffering
Created on 2020-05-04 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.0 (2020-04-24)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2020-05-04
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.6 2020-04-05 [1] CRAN (R 4.0.0)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr * 0.8.5 2020-03-07 [1] CRAN (R 4.0.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> glue 1.4.0 2020-04-03 [1] CRAN (R 4.0.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 4.0.0)
#> janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 4.0.0)
#> knitr 1.28.5 2020-04-28 [1] Github (yihui/knitr@93b46ba)
#> lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0)
#> matrixStats 0.56.0 2020-03-13 [1] CRAN (R 4.0.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 4.0.0)
#> pkgbuild 1.0.7 2020-04-25 [1] CRAN (R 4.0.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0)
#> rmarkdown 2.1.3 2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.0.0)
#> stm * 1.3.5 2020-04-28 [1] Github (bstewart/stm@c95ef0b)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0)
#> tidyselect 1.0.0 2020-01-27 [1] CRAN (R 4.0.0)
#> tidytext * 0.2.4 2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#> tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.0.0)
#> usethis 1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)
#> vctrs 0.2.4 2020-03-10 [1] CRAN (R 4.0.0)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> xfun 0.13.1 2020-04-30 [1] Github (yihui/xfun@bf8afdd)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Program Files/R/R-4.0.0/library
Yep, that is what I would do! 🙌
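The covariate approach above depends on the rows of covariate_df being in the same order as the rows of the sparse matrix, since the covariates are matched to documents by position. As a hedged sketch (not something stated in this thread), a quick check before fitting could look like the following. It rebuilds the objects from the comment above and passes by = explicitly to silence the join messages.

```r
library(dplyr)
library(tidytext)
library(stm)

gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# inner_join() keeps the row order of its left-hand table, so starting
# from the matrix row names keeps covariates in matrix order
covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2, by = "document")

# Fail early if documents and covariates are not aligned one-to-one
stopifnot(
  nrow(covariate_df) == nrow(gadarian_sparse),
  identical(covariate_df$document, rownames(gadarian_sparse))
)
```

If either condition fails, stopifnot() raises an error before stm() is called, which is easier to debug than a dimension error from inside the model fit.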
In the current tidytext documentation on the tidy approach to stm objects, there is no specific example of how to add covariates. I wanted to try that out with the stm::gadarian data using the prevalence = ~treatment + s(pid_rep) covariate formula; however, I ran into some errors.
Would you mind adding an example of how to fit this kind of model to the tidytext package documentation?
Created on 2020-05-03 by the reprex package (v0.3.0)