-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
122 lines (93 loc) · 3.98 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "sentimentalk: Sentiment Analysis of Talk Pages as Service"
output:
github_document:
toc: yes
toc_depth: 4
md_extensions: -autolink_bare_uris
includes:
in_header: header.md
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, eval=FALSE)
library(printr) # install.packages("printr")
set.seed(0)
```
## Usage
### Input
The API endpoint accepts the following parameters:
- **page_name** (starts with "Talk:")
- **api** is used to specify the URL for the MediaWiki api.php to query
- **project** and **language**
- a Wikimedia **project** such as "wikipedia" or "wikiquote"
- a **language** code such as "en" (English) or "de" (German)
- **format**:
- "condensed" (default)
- "pretty" makes the output more human-friendly
**Note**: currently, classic (non-Flow) talk pages are unsupported because they are difficult to parse. As a result, if you try to supply a classic talk page, the API will yield an error message.
### Output
If the talk page has been successfully parsed and analyzed, the JSON-formatted output will be:
- `status`: "success" or "error"
- `message`: helpful message in case of error
- `results`:
- An array of `topics`
- For each topic, an array of shuffled `posts`
(1st post in output is not necessarily the 1st post within the topic)
- For each post, the following fields:
- `post`: identifier of the post
- `participant`: salted hash of the username (for anonymity)
- `total non-stopwords`: number of words within the post that were not [stop words](https://en.wikipedia.org/wiki/Stop_words) such as `r paste0('"', paste0(dplyr::sample_n(tidytext::stop_words, 5)$word, collapse = '", "'), '"')`
- `sentiment expression`: proportion of words that express a particular sentiment / total non-stopwords
(may sum up to greater than 1)
## Setup
```R
# install.packages("devtools")
devtools::install_github("bearloga/wmf-wmhack17/api")
```
## Example
To analyze [Talk:Cross-wiki Search Result Improvements on MediaWiki](https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements) we need to provide the API url "www.mediawiki.org/w/api.php"
### R
`sentimentalk::process` (used by the endpoint) outputs a tidy tibble by default, which can be used to perform additional analyses in R:
```{r example_r, eval=TRUE}
sentiment_breakdown <- sentimentalk::process(
page_name = "Talk:Cross-wiki Search Result Improvements",
api = "www.mediawiki.org/w/api.php"
)
head(sentiment_breakdown)
```
### Endpoint
We start the endpoint for local use via **endpoint.R** and [pm2](https://plumber.trestletech.com/docs/hosting/)
#### GET
```bash
curl -s -G \
--data-urlencode "page_name=Talk:Cross-wiki Search Result Improvements" \
--data-urlencode "api=www.mediawiki.org/w/api.php" \
"https://sentimentalk.wmflabs.org/analyze"
```
#### POST
```bash
curl -s \
--data-urlencode "page_name=Talk:Cross-wiki Search Result Improvements" \
--data-urlencode "api=www.mediawiki.org/w/api.php" \
"https://sentimentalk.wmflabs.org/analyze"
```
#### POST with input data as JSON
```bash
curl -s \
--data '{"page_name":"Talk:Cross-wiki Search Result Improvements", "api":"www.mediawiki.org/w/api.php"}' \
"https://sentimentalk.wmflabs.org/analyze"
```
#### JSON output
**Note**: the output has been truncated.
```{r example_json, eval=TRUE, echo=FALSE, results='asis'}
sentiment_breakdown <- sentimentalk::process(
page_name = "Talk:Cross-wiki Search Result Improvements",
api = "www.mediawiki.org/w/api.php",
.format = "json", .json = "pretty"
)
cat("```json\n")
cat(paste0(strsplit(sentiment_breakdown, "\n")[[1]][1:43], collapse = "\n"), "\n ...\n")
cat("```")
```
## Additional Information
The sentiment analysis is performed using the National Research Council Canada (NRC) Word-Emotion Association Lexicon from Saif Mohammad and Peter Turney ([link](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)).