Datasets of keywords and URLs censored on the Chinese internet, taken from greatfire.org
Although it is not publicly documented, greatfire.org has a publicly available API that is not difficult to uncover. However, because I very much appreciate their project, this package does not provide direct access to that API and instead ships two datasets that I pledge to keep updated. If the data is too outdated, please open an issue.
Last updated: 2020-02-22 20:26:52
You can install greatfire from GitHub using the remotes or devtools package.
# install.packages("remotes")
remotes::install_github("news-r/greatfire")
Both datasets are rather large and therefore not lazily loaded, so you have to load them explicitly with the data function.
library(greatfire)
data(censored_keywords)
nrow(censored_keywords)
#> [1] 27387
head(censored_keywords)
#> title blocked_last_30_days changed
#> 1 www.pinterest.com 6 2012-09-23 20:13:39
#> 2 占领 6 2012-09-23 22:27:46
#> 3 王悦 33 2012-09-23 11:34:48
#> 4 "f**k" in china 33 2019-07-20 05:25:09
#> 5 "Github" 33 2019-06-19 16:58:47
#> 6 "harmony" high-speed train 33 2019-07-03 13:53:26
data(censored_urls)
nrow(censored_urls)
#> [1] 89322
head(censored_urls)
#> title blocked_last_30_days
#> 1 ftp://creatorsinpack.dynalias.com 100
#> 2 ftp://jinshu.myftp.org/book/400books/%E7%8E%8B%E5%8A%9B%E9%9B%84-------------%E6%88%91%E7%9A%84%E8%A5%BF%E5%9F%9F+%E4%BD%A0%E7%9A%84%E4%B8%9C%E5%9C%9F.txt 100
#> 3 ftp://jinshu.myftp.org/gfw/new.html 100
#> 4 ftp://jinshu.myftp.org:20021/gfw/new.html 100
#> 5 http://0.facebook.com 100
#> 6 http://000.1024gc.com 100
#> changed
#> 1 2019-05-07 19:22:32
#> 2 2019-05-18 10:16:53
#> 3 2019-05-06 20:36:55
#> 4 2019-05-11 03:40:58
#> 5 2019-07-05 16:15:49
#> 6 2019-06-21 07:58:36
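Once loaded, the datasets print like ordinary data frames, so base R should be enough for a first look. The sketch below works on a small stand-in built from three rows of the head(censored_urls) output rather than on the real data. It assumes that changed is stored as text and that the timestamps are in UTC; neither is confirmed by the package, so check with str(censored_urls) before relying on it.

```r
# Stand-in for censored_urls: three rows copied from the printed output.
# The real dataset may store `changed` differently (assumption: character).
sample_urls <- data.frame(
  title = c(
    "ftp://jinshu.myftp.org/gfw/new.html",
    "http://0.facebook.com",
    "http://000.1024gc.com"
  ),
  blocked_last_30_days = c(100, 100, 100),
  changed = c(
    "2019-05-06 20:36:55",
    "2019-07-05 16:15:49",
    "2019-06-21 07:58:36"
  ),
  stringsAsFactors = FALSE
)

# Parse the timestamps (assumption: UTC) so they sort chronologically.
sample_urls$changed <- as.POSIXct(
  sample_urls$changed,
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# Most recently changed entries first.
latest_first <- sample_urls[order(sample_urls$changed, decreasing = TRUE), ]
latest_first$title[1]
```

The same two steps should carry over to the full dataset by swapping sample_urls for censored_urls, provided the column holds text in this format.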
There are two convenience search_* functions for searching through the data.
gh <- search_urls("github.com")
knitr::kable(gh)
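How the search_* functions match their argument is not documented here. As a rough idea of what such a search could look like in base R, the sketch below does a literal substring match on the title column. The helper find_in_title is hypothetical and not part of greatfire, and the toy data frame is built from rows shown in the head(censored_urls) output.

```r
# Toy stand-in for censored_urls, using rows from the printed output.
toy_urls <- data.frame(
  title = c(
    "ftp://creatorsinpack.dynalias.com",
    "http://0.facebook.com",
    "http://000.1024gc.com"
  ),
  blocked_last_30_days = c(100, 100, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not the package's implementation): keep rows whose
# title contains `pattern` as a literal string, not a regular expression.
find_in_title <- function(data, pattern) {
  data[grepl(pattern, data$title, fixed = TRUE), , drop = FALSE]
}

fb <- find_in_title(toy_urls, "facebook.com")
fb$title
```

Using fixed = TRUE keeps the dots in a domain name from being treated as regular-expression wildcards; whether the package's own functions behave this way is an assumption.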