Merge pull request #18 from ComputerMaestro/master
Adding IMDB, Twitter and Stanford  Sentiment datasets
oxinabox authored Jul 5, 2019
2 parents bbe594a + b7e19da commit 5d12a44
Showing 18 changed files with 569 additions and 8 deletions.
7 changes: 4 additions & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
language: julia
os:
- linux
# - osx
- osx
env:
- DATADEPS_ALWAYS_ACCEPT=true
julia:
@@ -12,12 +12,13 @@ matrix:
allow_failures:
- julia: nightly
notifications:
email: false
email: false

# uncomment the following lines to override the default test script
#script:
# - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
# - julia -e 'Pkg.clone(pwd()); Pkg.build("CorpusLoaders"); Pkg.test("CorpusLoaders"; coverage=true)'

after_success:
# Push Documentation
- julia -e 'Pkg.add("Documenter")'
3 changes: 3 additions & 0 deletions README.md
@@ -37,3 +37,6 @@ Follow the links below for full docs on the usage of the corpora.
- [SemCor](docs/src/SemCor.md)
- [Senseval3](docs/src/Senseval3.md)
- [CoNLL](docs/src/CoNLL.md)
- [IMDB movie reviews](docs/src/IMDB.md)
- [Twitter sentiment dataset](docs/src/Twitter.md)
- [Stanford Sentiment Treebank](docs/src/SST.md)
1 change: 1 addition & 0 deletions REQUIRE
@@ -6,3 +6,4 @@ InternedStrings
StringEncodings
WordTokenizers
MultiResolutionIterators
CSV 0.4.3
69 changes: 69 additions & 0 deletions docs/src/IMDB.md
@@ -0,0 +1,69 @@
### IMDB

The IMDB movie reviews dataset is a standard collection for the binary sentiment analysis task and is widely used for benchmarking sentiment analysis algorithms. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, along with additional unlabeled data. Both raw text and an already-processed bag-of-words format are provided.

The reviews are structured at several levels:<br>
documents, sentences, words/tokens, characters

The data is divided into five parts, which can be accessed by passing one of the following keywords: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>
`train_unsup` : unlabeled examples

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.
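As a plain-Julia illustration of what removing a level means (a toy sketch, no CorpusLoaders needed): flattening away the document level of a documents→sentences→words nesting amounts to concatenating the per-document sentence lists, which is roughly what `flatten_levels` does for one level.

```julia
# Toy nested structure: 2 documents, each a list of tokenised sentences.
docs = [
    [["Good", "film", "."]],
    [["Bad", "film", "."], ["Avoid", "."]],
]

# Dropping the document level concatenates the sentence lists.
sentences = reduce(vcat, docs)
length(sentences)  # 3
```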


Example:

Using the `"test_neg"` keyword to get negative test set examples:

```
julia> using Base.Iterators
julia> dataset_test_neg = load(IMDB("test_neg"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> docs = collect(take(dataset_test_neg, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Once", "again", "Mr.", "Costner", "has", "dragged", "out", "a", "movie", "for", "far", "longer", "than", "necessary", "."], ["Aside", "from", "the", "terrific", "sea", "rescue", "sequences", ",", "of", "which" … "just", "did", "not", "care", "about", "any", "of", "the", "characters", "."], ["Most", "of", "us", "have", "ghosts", "in", "the", "closet", ",", "and" … "later", ",", "by", "which", "time", "I", "did", "not", "care", "."], ["The", "character", "we", "should", "really", "care", "about", "is", "a", "very", "cocky", ",", "overconfident", "Ashton", "Kutcher", "."], ["The", "problem", "is", "he", "comes", "off", "as", "kid", "who", "thinks" … "him", "and", "shows", "no", "signs", "of", "a", "cluttered", "closet", "."], ["His", "only", "obstacle", "appears", "to", "be", "winning", "over", "Costner", "."], ["Finally", "when", "we", "are", "well", "past", "the", "half", "way", "point" … ",", "Costner", "tells", "us", "all", "about", "Kutcher", "'s", "ghosts", "."], ["We", "are", "told", "why", "Kutcher", "is", "driven", "to", "be", "the", "best", "with", "no", "prior", "inkling", "or", "foreshadowing", "."], ["No", "magic", "here", ",", "it", "was", "all", "I", "could", "do", "to", "keep", "from", "turning", "it", "off", "an", "hour", "in", "."]]
[["This", "is", "an", "example", "of", "why", "the", "majority", "of", "action", "films", "are", "the", "same", "."], ["Generic", "and", "boring", ",", "there", "'s", "really", "nothing", "worth", "watching", "here", "."], ["A", "complete", "waste", "of", "the", "then", "barely-tapped", "talents", "of", "Ice-T" … "they", "are", "capable", "of", "acting", ",", "and", "acting", "well", "."], ["Do", "n't", "bother", "with", "this", "one", ",", "go", "see", "New" … "Friday", "for", "Ice", "Cube", "and", "see", "the", "real", "deal", "."], ["Ice-T", "'s", "horribly", "cliched", "dialogue", "alone", "makes", "this", "film", "grate" … "the", "heck", "Bill", "Paxton", "was", "doing", "in", "this", "film", "?"], ["And", "why", "the", "heck", "does", "he", "always", "play", "the", "exact", "same", "character", "?"], ["From", "Aliens", "onward", ",", "every", "film", "I", "'ve", "seen", "with" … "><br", "/", ">Overall", ",", "this", "is", "second-rate", "action", "trash", "."], ["There", "are", "countless", "better", "films", "to", "see", ",", "and", "if" … "copy", "but", "has", "better", "acting", "and", "a", "better", "script", "."], ["The", "only", "thing", "that", "made", "this", "at", "all", "worth", "watching" … "for", "the", "horrible", "film", "itself", "-", "but", "not", "quite", "."], ["4", "/", "10", "."]]
```

Using the `"train_pos"` keyword to get positive train set examples:

```
julia> dataset_train_pos = load(IMDB()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> docs = collect(take(dataset_train_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Bromwell", "High", "is", "a", "cartoon", "comedy", "."], ["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."], ["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."], ["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."], ["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."], ["High", "."], ["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."], ["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."], ["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."], ["What", "a", "pity", "that", "it", "is", "n't", "!"]]
[["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."], ["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"], ["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."], ["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."], ["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."], ["They", "'re", "survivors", "."], ["Bolt", "is", "n't", "."], ["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."], ["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]]
julia> flatten_levels(docs, lvls(IMDB, :documents))|>full_consolidate
19-element Array{Array{String,1},1}:
["Bromwell", "High", "is", "a", "cartoon", "comedy", "."]
["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."]
["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."]
["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."]
["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."]
["High", "."]
["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."]
["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."]
["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."]
["What", "a", "pity", "that", "it", "is", "n't", "!"]
["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."]
["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"]
["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."]
["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."]
["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."]
["They", "'re", "survivors", "."]
["Bolt", "is", "n't", "."]
["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."]
["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]
```
112 changes: 112 additions & 0 deletions docs/src/StanfordSentimentTreebank.md
@@ -0,0 +1,112 @@
### StanfordSentimentTreebank
This contains the sentiment part of the Stanford Sentiment Treebank V1.0 dataset, released with the paper [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) by Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts.
The dataset gives phrases with sentiment labels between 0 and 1. It can be used for binary or fine-grained sentiment classification problems.
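One way to turn the continuous scores into classes is the five-bucket scheme from the Socher et al. paper, with cutoffs at 0.2, 0.4, 0.6 and 0.8. A minimal sketch (the function name is illustrative, not part of the CorpusLoaders API):

```julia
# Illustrative sketch: map a sentiment score in [0, 1] to one of the
# five fine-grained classes using the standard 0.2/0.4/0.6/0.8 cutoffs.
function fine_grained_label(score::Real)
    cutoffs = (0.2, 0.4, 0.6, 0.8)
    labels = ("very negative", "negative", "neutral", "positive", "very positive")
    return labels[1 + count(c -> score > c, cutoffs)]
end

fine_grained_label(0.86111)  # "very positive"
```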

Structure of the dataset:
documents/phrases, sentences, words, characters

To get the desired levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.

## Usage:

The loaded dataset is a 2-dimensional `Array` whose first column holds `Vector`s of tokenized sentences and whose second column holds the corresponding sentiment scores.

```
julia> dataset = load(StanfordSentimentTreebank())
239232×2 Array{Any,2}:
 Array{String,1}[["!"]]                                                                                                0.5
 Array{String,1}[["!"], ["'"]]                                                                                         0.52778
 Array{String,1}[["!"], ["'", "'"]]                                                                                    0.5
 Array{String,1}[["!"], ["Alas"]]                                                                                      0.44444
 Array{String,1}[["!"], ["Brilliant"]]                                                                                 0.86111
 Array{String,1}[["!"], ["Brilliant", "!"]]                                                                            0.93056
 Array{String,1}[["!"], ["Brilliant", "!"], ["'"]]                                                                     1.0
 Array{String,1}[["!"], ["C", "'", "mon"]]                                                                             0.47222
 Array{String,1}[["!"], ["Gollum", "'", "s", "`", "performance", "'", "is", "incredible"]]                             0.76389
 Array{String,1}[["!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.27778
 Array{String,1}[["!"], ["Romething"]]                                                                                 0.5
 Array{String,1}[["!"], ["Run"]]                                                                                       0.43056
 Array{String,1}[["!"], ["The", "Movie"]]                                                                              0.5
 Array{String,1}[["!"], ["The", "camera", "twirls", "!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.22222
 Array{String,1}[["!"], ["True", "Hollywood", "Story"]]                                                                0.55556
 Array{String,1}[["!"], ["Wow"]]                                                                                       0.77778
 ⋮
```
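The two-column layout above can be mimicked with toy data to show one way of binarising the fine-grained scores by thresholding (made-up values and hypothetical variable names, not the real dataset):

```julia
# Toy 2×2 array mimicking the layout above: column 1 holds tokenised
# phrases, column 2 their scores (made-up values).
dataset = Array{Any}(undef, 2, 2)
dataset[1, :] = [[["!"]], 0.5]
dataset[2, :] = [[["Brilliant"]], 0.86111]

scores = dataset[:, 2]
# One way to binarise the fine-grained scores: threshold at 0.5.
positive = dataset[scores .> 0.5, :]
size(positive)  # (1, 2)
```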

To get the phrases from the dataset:

```
julia> phrases = dataset[1:5, 1] # "dataset" is the 2-D Array loaded above
5-element Array{Any,1}:
Array{String,1}[["!"]]
Array{String,1}[["!"], ["'"]]
Array{String,1}[["!"], ["'", "'"]]
Array{String,1}[["!"], ["Alas"]]
Array{String,1}[["!"], ["Brilliant"]]
```

To get the sentiment values:

```
julia> values = dataset[1:5, 2] # "dataset" is the same 2-D Array
5-element Array{Any,1}:
0.5
0.52778
0.5
0.44444
0.86111
```

Using `flatten_levels`:

To get an `Array` of all sentences from all the `phrases` (since each phrase can contain more than one sentence):

```
julia> sentences = flatten_levels(phrases, lvls(StanfordSentimentTreebank, :documents)) |> full_consolidate
9-element Array{Array{String,1},1}:
["!"]
["!"]
["'"]
["!"]
["'", "'"]
["!"]
["Alas"]
["!"]
["Brilliant"]
```

To get an `Array` of all the words from `phrases`:

```
julia> words = flatten_levels(phrases, (!lvls)(StanfordSentimentTreebank, :words)) |> full_consolidate
10-element Array{String,1}:
"!"
"!"
"'"
"!"
"'"
"'"
"!"
"Alas"
"!"
"Brilliant"
```

Similarly, other manipulations of the levels can be performed using `flatten_levels`.
115 changes: 115 additions & 0 deletions docs/src/Twitter.md
@@ -0,0 +1,115 @@
## Twitter

Twitter sentiment dataset by Nick Sanders, downloaded from the [Sentiment140 site](http://help.sentiment140.com/for-students).
It is a large dataset for the sentiment analysis task. Every tweet falls into one of three categories: positive (4), negative (0), or neutral (2). It contains 1,600,000 training examples and 498 test examples.
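The numeric label codes above can be decoded with a small helper (an illustrative sketch, not part of the CorpusLoaders API):

```julia
# Sentiment140 label codes, as described above.
const SENTIMENT_CODES = Dict(0 => "negative", 2 => "neutral", 4 => "positive")

decode_sentiment(code::Integer) = get(SENTIMENT_CODES, code, "unknown")

decode_sentiment(4)  # "positive"
```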

Structure of dataset:
documents/tweets, sentences, words, characters

This dataset is divided into four categories, which can be accessed by passing the corresponding keyword: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.

Example:

Using the `"test_pos"` keyword to get positive polarity sentiment examples:

```
julia> dataset_test_pos = load(Twitter("test_pos"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_test_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["@", "stellargirl", "I", "loooooooovvvvvveee", "my", "Kindle", "2", "."], ["Not", "that", "the", "DX", "is", "cool", ",", "but", "the", "2", "is", "fantastic", "in", "its", "own", "right", "."]]
[["Reading", "my", "kindle", "2", "..", "."], ["Love", "it..", "."], ["Lee", "childs", "is", "good", "read", "."]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
40-element Array{String,1}:
"@"
"stellargirl"
"I"
"loooooooovvvvvveee"
"my"
"Kindle"
"2"
"."
"Not"
"that"
"the"
"DX"
"is"
"cool"
","
"but"
"Reading"
"my"
"kindle"
"2"
".."
"."
"Love"
"it.."
"."
"Lee"
"childs"
"is"
"good"
"read"
"."
```

Using the `"train_pos"` category to get positive polarity sentiment examples:

```
julia> dataset_train_pos = load(Twitter()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_train_pos, 4))
4-element Array{Array{Array{String,1},1},1}:
[["I", "LOVE", "@", "Health", "4", "UandPets", "u", "guys", "r", "the", "best", "!", "!"]]
[["im", "meeting", "up", "with", "one", "of", "my", "besties", "tonight", "!"], ["Cant", "wait", "!", "!"], ["-", "GIRL", "TALK", "!", "!"]]
[["@", "DaRealSunisaKim", "Thanks", "for", "the", "Twitter", "add", ",", "Sunisa", "!"], ["I", "got", "to", "meet", "you", "once", "at", "a", "HIN", "show" … "in", "the", "DC", "area", "and", "you", "were", "a", "sweetheart", "."]]
[["Being", "sick", "can", "be", "really", "cheap", "when", "it", "hurts", "too" … "eat", "real", "food", "Plus", ",", "your", "friends", "make", "you", "soup"]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
85-element Array{String,1}:
 "I"
 "LOVE"
 "@"
 "Health"
"4"
"UandPets"
"u"
"guys"
"r"
"the"
"best"
"!"
"!"
"im"
"meeting"
"up"
"it"
"hurts"
"too"
"much"
"to"
"eat"
"real"
"food"
"Plus"
","
"your"
"friends"
"make"
"you"
"soup"
```
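Once flattened to a token list like the one above, word frequencies can be counted with a plain dictionary. A toy sketch using a hard-coded token list (not the real dataset):

```julia
# Toy token list standing in for a flattened tweet corpus.
tokens = ["I", "LOVE", "my", "Kindle", "my", "Kindle"]

# Tally how often each token occurs.
freqs = Dict{String,Int}()
for t in tokens
    freqs[t] = get(freqs, t, 0) + 1
end

freqs["my"]  # 2
```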
6 changes: 3 additions & 3 deletions docs/src/WikiCorpus.md
@@ -3,12 +3,12 @@

Very commonly used corpus in general.
The loader (and default datadep) is for [Samuel Reese's 2006 based corpus](http://www.lsi.upc.edu/~nlp/wikicorpus/).
The only real feature we rely on the input having is the `<doc title="DocTitle"..>` tags seperating the documents.
The only real feature we rely on the input having is the `<doc title="DocTitle"..>` tags separating the documents.
So any corpus following close-enough to that should work.

We capture a lot of structure.
Document, section, paragaph/line, sentence, word.
Note that paragaph/line level does not differnetiate between a paragraph of prose, vs a line in a list.
Document, section, paragraph/line, sentence, word.
Note that paragraph/line level does not differentiate between a paragraph of prose, vs a line in a list.

Most users are not going to be wanting that level of structure,
so should use `flatten_levels` (from MultiResolutionIterators.jl) to get rid of levels they don't want.
10 changes: 8 additions & 2 deletions src/CorpusLoaders.jl
@@ -5,20 +5,23 @@ using MultiResolutionIterators
using InternedStrings
using Glob
using StringEncodings
using CSV

export Document, TaggedWord, SenseAnnotatedWord, PosTaggedWord, CoNLL2003TaggedWord
export title, sensekey, word
export load

export WikiCorpus, SemCor, Senseval3, CoNLL

export WikiCorpus, SemCor, Senseval3, CoNLL, IMDB, Twitter, StanfordSentimentTreebank

function __init__()
include(joinpath(@__DIR__, "WikiCorpus_DataDeps.jl"))
include(joinpath(@__DIR__, "SemCor_DataDeps.jl"))
include(joinpath(@__DIR__, "SemEval2007Task7_DataDeps.jl"))
include(joinpath(@__DIR__, "Senseval3_DataDeps.jl"))
include(joinpath(@__DIR__, "CoNLL_DataDeps.jl"))
include(joinpath(@__DIR__, "IMDB_DataDeps.jl"))
include(joinpath(@__DIR__, "Twitter_DataDeps.jl"))
include(joinpath(@__DIR__, "StanfordSentimentTreebank_DataDeps.jl"))
end

include("types.jl")
@@ -28,5 +31,8 @@ include("SemCor.jl")
include("SemEval2007Task7.jl")
include("Senseval3.jl")
include("CoNLL.jl")
include("IMDB.jl")
include("Twitter.jl")
include("StanfordSentimentTreebank.jl")

end