Merge pull request #18 from ComputerMaestro/master
Adding IMDB, Twitter and Stanford  Sentiment datasets
oxinabox authored Jul 5, 2019
2 parents bbe594a + b7e19da commit 5d12a44
Showing 18 changed files with 569 additions and 8 deletions.
7 changes: 4 additions & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
language: julia
os:
- linux
# - osx
- osx
env:
- DATADEPS_ALWAYS_ACCEPT=true
julia:
@@ -12,12 +12,13 @@ matrix:
allow_failures:
- julia: nightly
notifications:
email: false
email: false

# uncomment the following lines to override the default test script
#script:
# - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
# - julia -e 'Pkg.clone(pwd()); Pkg.build("CorpusLoaders"); Pkg.test("CorpusLoaders"; coverage=true)'

after_success:
# Push Documentation
- julia -e 'Pkg.add("Documenter")'
3 changes: 3 additions & 0 deletions README.md
@@ -37,3 +37,6 @@ Follow the links below for full docs on the usage of the corpora.
- [SemCor](docs/src/SemCor.md)
- [Senseval3](docs/src/Senseval3.md)
- [CoNLL](docs/src/CoNLL.md)
- [IMDB movie reviews](docs/src/IMDB.md)
- [Twitter sentiment dataset](docs/src/Twitter.md)
- [Stanford Sentiment Treebank](docs/src/SST.md)
1 change: 1 addition & 0 deletions REQUIRE
@@ -6,3 +6,4 @@ InternedStrings
StringEncodings
WordTokenizers
MultiResolutionIterators
CSV 0.4.3
69 changes: 69 additions & 0 deletions docs/src/IMDB.md
@@ -0,0 +1,69 @@
### IMDB

The IMDB movie reviews dataset is a standard collection for the binary sentiment analysis task and is widely used for benchmarking sentiment analysis algorithms. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, along with additional unlabeled data. Both raw text and an already-processed bag-of-words format are provided.

The reviews are structured at several levels:<br>
documents, sentences, words/tokens, characters

The data is divided into five parts, which can be accessed by passing one of the following keywords: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>
`train_unsup` : unlabeled examples

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.
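As a plain-Julia illustration of what removing a level means (a toy sketch, no CorpusLoaders needed): flattening away the document level of a documents→sentences→words nesting amounts to concatenating the per-document sentence lists, which is roughly what `flatten_levels` does for one level.

```julia
# Toy nested structure: 2 documents, each a list of tokenised sentences.
docs = [
    [["Good", "film", "."]],
    [["Bad", "film", "."], ["Avoid", "."]],
]

# Dropping the document level concatenates the sentence lists.
sentences = reduce(vcat, docs)
length(sentences)  # 3
```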


Example:

Using the `"test_neg"` keyword to get negative test set examples:

```
julia> using Base.Iterators
julia> dataset_test_neg = load(IMDB("test_neg"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> docs = collect(take(dataset_test_neg, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Once", "again", "Mr.", "Costner", "has", "dragged", "out", "a", "movie", "for", "far", "longer", "than", "necessary", "."], ["Aside", "from", "the", "terrific", "sea", "rescue", "sequences", ",", "of", "which" … "just", "did", "not", "care", "about", "any", "of", "the", "characters", "."], ["Most", "of", "us", "have", "ghosts", "in", "the", "closet", ",", "and" … "later", ",", "by", "which", "time", "I", "did", "not", "care", "."], ["The", "character", "we", "should", "really", "care", "about", "is", "a", "very", "cocky", ",", "overconfident", "Ashton", "Kutcher", "."], ["The", "problem", "is", "he", "comes", "off", "as", "kid", "who", "thinks" … "him", "and", "shows", "no", "signs", "of", "a", "cluttered", "closet", "."], ["His", "only", "obstacle", "appears", "to", "be", "winning", "over", "Costner", "."], ["Finally", "when", "we", "are", "well", "past", "the", "half", "way", "point" … ",", "Costner", "tells", "us", "all", "about", "Kutcher", "'s", "ghosts", "."], ["We", "are", "told", "why", "Kutcher", "is", "driven", "to", "be", "the", "best", "with", "no", "prior", "inkling", "or", "foreshadowing", "."], ["No", "magic", "here", ",", "it", "was", "all", "I", "could", "do", "to", "keep", "from", "turning", "it", "off", "an", "hour", "in", "."]]
[["This", "is", "an", "example", "of", "why", "the", "majority", "of", "action", "films", "are", "the", "same", "."], ["Generic", "and", "boring", ",", "there", "'s", "really", "nothing", "worth", "watching", "here", "."], ["A", "complete", "waste", "of", "the", "then", "barely-tapped", "talents", "of", "Ice-T" … "they", "are", "capable", "of", "acting", ",", "and", "acting", "well", "."], ["Do", "n't", "bother", "with", "this", "one", ",", "go", "see", "New" … "Friday", "for", "Ice", "Cube", "and", "see", "the", "real", "deal", "."], ["Ice-T", "'s", "horribly", "cliched", "dialogue", "alone", "makes", "this", "film", "grate" … "the", "heck", "Bill", "Paxton", "was", "doing", "in", "this", "film", "?"], ["And", "why", "the", "heck", "does", "he", "always", "play", "the", "exact", "same", "character", "?"], ["From", "Aliens", "onward", ",", "every", "film", "I", "'ve", "seen", "with" … "><br", "/", ">Overall", ",", "this", "is", "second-rate", "action", "trash", "."], ["There", "are", "countless", "better", "films", "to", "see", ",", "and", "if" … "copy", "but", "has", "better", "acting", "and", "a", "better", "script", "."], ["The", "only", "thing", "that", "made", "this", "at", "all", "worth", "watching" … "for", "the", "horrible", "film", "itself", "-", "but", "not", "quite", "."], ["4", "/", "10", "."]]
```

Using the `"train_pos"` keyword to get positive train set examples:

```
julia> dataset_train_pos = load(IMDB()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> docs = collect(take(dataset_train_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Bromwell", "High", "is", "a", "cartoon", "comedy", "."], ["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."], ["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."], ["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."], ["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."], ["High", "."], ["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."], ["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."], ["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."], ["What", "a", "pity", "that", "it", "is", "n't", "!"]]
[["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."], ["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"], ["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."], ["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."], ["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."], ["They", "'re", "survivors", "."], ["Bolt", "is", "n't", "."], ["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."], ["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]]
julia> flatten_levels(docs, lvls(IMDB, :documents))|>full_consolidate
19-element Array{Array{String,1},1}:
["Bromwell", "High", "is", "a", "cartoon", "comedy", "."]
["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."]
["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."]
["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."]
["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."]
["High", "."]
["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."]
["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."]
["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."]
["What", "a", "pity", "that", "it", "is", "n't", "!"]
["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."]
["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"]
["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."]
["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."]
["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."]
["They", "'re", "survivors", "."]
["Bolt", "is", "n't", "."]
["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."]
["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]
```
112 changes: 112 additions & 0 deletions docs/src/StanfordSentimentTreebank.md
@@ -0,0 +1,112 @@
### StanfordSentimentTreebank
This contains the sentiment part of the Stanford Sentiment Treebank V1.0 dataset, released with the paper [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) by Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts.
The dataset gives phrases with sentiment labels between 0 and 1. It can be used for binary or fine-grained sentiment classification problems.
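One way to turn the continuous scores into classes is the five-bucket scheme from the Socher et al. paper, with cutoffs at 0.2, 0.4, 0.6 and 0.8. A minimal sketch (the function name is illustrative, not part of the CorpusLoaders API):

```julia
# Illustrative sketch: map a sentiment score in [0, 1] to one of the
# five fine-grained classes using the standard 0.2/0.4/0.6/0.8 cutoffs.
function fine_grained_label(score::Real)
    cutoffs = (0.2, 0.4, 0.6, 0.8)
    labels = ("very negative", "negative", "neutral", "positive", "very positive")
    return labels[1 + count(c -> score > c, cutoffs)]
end

fine_grained_label(0.86111)  # "very positive"
```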

Structure of the dataset:
documents/phrases, sentences, words, characters

To get the desired levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.

## Usage:

The loaded dataset is a 2-dimensional `Array` whose first column holds `Vector`s of tokenized sentences and whose second column holds the corresponding sentiment scores.

```
julia> dataset = load(StanfordSentimentTreebank())
239232×2 Array{Any,2}:
 Array{String,1}[["!"]]                                                                                                0.5
 Array{String,1}[["!"], ["'"]]                                                                                         0.52778
 Array{String,1}[["!"], ["'", "'"]]                                                                                    0.5
 Array{String,1}[["!"], ["Alas"]]                                                                                      0.44444
 Array{String,1}[["!"], ["Brilliant"]]                                                                                 0.86111
 Array{String,1}[["!"], ["Brilliant", "!"]]                                                                            0.93056
 Array{String,1}[["!"], ["Brilliant", "!"], ["'"]]                                                                     1.0
 Array{String,1}[["!"], ["C", "'", "mon"]]                                                                             0.47222
 Array{String,1}[["!"], ["Gollum", "'", "s", "`", "performance", "'", "is", "incredible"]]                             0.76389
 Array{String,1}[["!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.27778
 Array{String,1}[["!"], ["Romething"]]                                                                                 0.5
 Array{String,1}[["!"], ["Run"]]                                                                                       0.43056
 Array{String,1}[["!"], ["The", "Movie"]]                                                                              0.5
 Array{String,1}[["!"], ["The", "camera", "twirls", "!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.22222
 Array{String,1}[["!"], ["True", "Hollywood", "Story"]]                                                                0.55556
 Array{String,1}[["!"], ["Wow"]]                                                                                       0.77778
 ⋮
```
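The two-column layout above can be mimicked with toy data to show one way of binarising the fine-grained scores by thresholding (made-up values and hypothetical variable names, not the real dataset):

```julia
# Toy 2×2 array mimicking the layout above: column 1 holds tokenised
# phrases, column 2 their scores (made-up values).
dataset = Array{Any}(undef, 2, 2)
dataset[1, :] = [[["!"]], 0.5]
dataset[2, :] = [[["Brilliant"]], 0.86111]

scores = dataset[:, 2]
# One way to binarise the fine-grained scores: threshold at 0.5.
positive = dataset[scores .> 0.5, :]
size(positive)  # (1, 2)
```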

To get the phrases from the dataset:

```
julia> phrases = dataset[1:5, 1] # "dataset" is the 2-D Array loaded above
5-element Array{Any,1}:
Array{String,1}[["!"]]
Array{String,1}[["!"], ["'"]]
Array{String,1}[["!"], ["'", "'"]]
Array{String,1}[["!"], ["Alas"]]
Array{String,1}[["!"], ["Brilliant"]]
```

To get the sentiment values:

```
julia> values = dataset[1:5, 2] # "dataset" is the same 2-D Array
5-element Array{Any,1}:
0.5
0.52778
0.5
0.44444
0.86111
```

Using `flatten_levels`:

To get an `Array` of all sentences from all the `phrases` (since each phrase can contain more than one sentence):

```
julia> sentences = flatten_levels(phrases, lvls(StanfordSentimentTreebank, :documents)) |> full_consolidate
9-element Array{Array{String,1},1}:
["!"]
["!"]
["'"]
["!"]
["'", "'"]
["!"]
["Alas"]
["!"]
["Brilliant"]
```

To get an `Array` of all the words from `phrases`:

```
julia> words = flatten_levels(phrases, (!lvls)(StanfordSentimentTreebank, :words)) |> full_consolidate
10-element Array{String,1}:
"!"
"!"
"'"
"!"
"'"
"'"
"!"
"Alas"
"!"
"Brilliant"
```

Similarly, other manipulations of the levels can be performed using `flatten_levels`.
115 changes: 115 additions & 0 deletions docs/src/Twitter.md
@@ -0,0 +1,115 @@
## Twitter

Twitter sentiment dataset by Nick Sanders, downloaded from the [Sentiment140 site](http://help.sentiment140.com/for-students).
It is a large dataset for the sentiment analysis task. Every tweet falls into one of three categories: positive (4), negative (0), or neutral (2). It contains 1,600,000 training examples and 498 test examples.
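The numeric label codes above can be decoded with a small helper (an illustrative sketch, not part of the CorpusLoaders API):

```julia
# Sentiment140 label codes, as described above.
const SENTIMENT_CODES = Dict(0 => "negative", 2 => "neutral", 4 => "positive")

decode_sentiment(code::Integer) = get(SENTIMENT_CODES, code, "unknown")

decode_sentiment(4)  # "positive"
```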

Structure of dataset:
documents/tweets, sentences, words, characters

This dataset is divided into four categories, which can be accessed by passing the corresponding keyword: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.

Example:

Using the `"test_pos"` keyword to get positive polarity sentiment examples:

```
julia> dataset_test_pos = load(Twitter("test_pos"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_test_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["@", "stellargirl", "I", "loooooooovvvvvveee", "my", "Kindle", "2", "."], ["Not", "that", "the", "DX", "is", "cool", ",", "but", "the", "2", "is", "fantastic", "in", "its", "own", "right", "."]]
[["Reading", "my", "kindle", "2", "..", "."], ["Love", "it..", "."], ["Lee", "childs", "is", "good", "read", "."]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
40-element Array{String,1}:
"@"
"stellargirl"
"I"
"loooooooovvvvvveee"
"my"
"Kindle"
"2"
"."
"Not"
"that"
"the"
"DX"
"is"
"cool"
","
"but"
"Reading"
"my"
"kindle"
"2"
".."
"."
"Love"
"it.."
"."
"Lee"
"childs"
"is"
"good"
"read"
"."
```

Using the `"train_pos"` category to get positive polarity sentiment examples:

```
julia> dataset_train_pos = load(Twitter()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_train_pos, 4))
4-element Array{Array{Array{String,1},1},1}:
[["I", "LOVE", "@", "Health", "4", "UandPets", "u", "guys", "r", "the", "best", "!", "!"]]
[["im", "meeting", "up", "with", "one", "of", "my", "besties", "tonight", "!"], ["Cant", "wait", "!", "!"], ["-", "GIRL", "TALK", "!", "!"]]
[["@", "DaRealSunisaKim", "Thanks", "for", "the", "Twitter", "add", ",", "Sunisa", "!"], ["I", "got", "to", "meet", "you", "once", "at", "a", "HIN", "show" … "in", "the", "DC", "area", "and", "you", "were", "a", "sweetheart", "."]]
[["Being", "sick", "can", "be", "really", "cheap", "when", "it", "hurts", "too" … "eat", "real", "food", "Plus", ",", "your", "friends", "make", "you", "soup"]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
85-element Array{String,1}:
 "I"
 "LOVE"
 "@"
 "Health"
"4"
"UandPets"
"u"
"guys"
"r"
"the"
"best"
"!"
"!"
"im"
"meeting"
"up"
"it"
"hurts"
"too"
"much"
"to"
"eat"
"real"
"food"
"Plus"
","
"your"
"friends"
"make"
"you"
"soup"
```
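Once flattened to a token list like the one above, word frequencies can be counted with a plain dictionary. A toy sketch using a hard-coded token list (not the real dataset):

```julia
# Toy token list standing in for a flattened tweet corpus.
tokens = ["I", "LOVE", "my", "Kindle", "my", "Kindle"]

# Tally how often each token occurs.
freqs = Dict{String,Int}()
for t in tokens
    freqs[t] = get(freqs, t, 0) + 1
end

freqs["my"]  # 2
```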
6 changes: 3 additions & 3 deletions docs/src/WikiCorpus.md
@@ -3,12 +3,12 @@

Very commonly used corpus in general.
The loader (and default datadep) is for [Samuel Reese's 2006 based corpus](http://www.lsi.upc.edu/~nlp/wikicorpus/).
The only real feature we rely on the input having is the `<doc title="DocTitle"..>` tags seperating the documents.
The only real feature we rely on the input having is the `<doc title="DocTitle"..>` tags separating the documents.
So any corpus following close-enough to that should work.

We capture a lot of structure.
Document, section, paragaph/line, sentence, word.
Note that paragaph/line level does not differnetiate between a paragraph of prose, vs a line in a list.
Document, section, paragraph/line, sentence, word.
Note that paragraph/line level does not differentiate between a paragraph of prose, vs a line in a list.

Most users are not going to be wanting that level of structure,
so should use `flatten_levels` (from MultiResolutionIterators.jl) to get rid of levels they don't want.
10 changes: 8 additions & 2 deletions src/CorpusLoaders.jl
@@ -5,20 +5,23 @@ using MultiResolutionIterators
using InternedStrings
using Glob
using StringEncodings
using CSV

export Document, TaggedWord, SenseAnnotatedWord, PosTaggedWord, CoNLL2003TaggedWord
export title, sensekey, word
export load

export WikiCorpus, SemCor, Senseval3, CoNLL

export WikiCorpus, SemCor, Senseval3, CoNLL, IMDB, Twitter, StanfordSentimentTreebank

function __init__()
include(joinpath(@__DIR__, "WikiCorpus_DataDeps.jl"))
include(joinpath(@__DIR__, "SemCor_DataDeps.jl"))
include(joinpath(@__DIR__, "SemEval2007Task7_DataDeps.jl"))
include(joinpath(@__DIR__, "Senseval3_DataDeps.jl"))
include(joinpath(@__DIR__, "CoNLL_DataDeps.jl"))
include(joinpath(@__DIR__, "IMDB_DataDeps.jl"))
include(joinpath(@__DIR__, "Twitter_DataDeps.jl"))
include(joinpath(@__DIR__, "StanfordSentimentTreebank_DataDeps.jl"))
end

include("types.jl")
@@ -28,5 +31,8 @@ include("SemCor.jl")
include("SemEval2007Task7.jl")
include("Senseval3.jl")
include("CoNLL.jl")
include("IMDB.jl")
include("Twitter.jl")
include("StanfordSentimentTreebank.jl")

end