Merge pull request #18 from ComputerMaestro/master

Adding IMDB, Twitter and Stanford Sentiment datasets

Showing 18 changed files with 569 additions and 8 deletions.
```
@@ -6,3 +6,4 @@ InternedStrings
 StringEncodings
 WordTokenizers
 MultiResolutionIterators
+CSV 0.4.3
```
### IMDB

The IMDB movie reviews dataset is a standard collection for the binary sentiment analysis task and is widely used for benchmarking sentiment analysis algorithms. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data. Both raw text and an already processed bag-of-words format are provided.

The reviews are structured at several levels:<br>
documents, sentences, words/tokens, characters

The whole dataset is divided into 5 parts, which can be accessed by passing one of the following keywords: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>
`train_unsup` : unlabeled examples

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.
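For instance, combining the calls demonstrated elsewhere in this README, the document level can be collapsed so the data reads as one flat list of sentences. A minimal sketch (same `load`, `lvls`, and `full_consolidate` calls as in the worked examples):

```
julia> using Base.Iterators

julia> docs = collect(take(load(IMDB("train_pos")), 2));

julia> sentences = flatten_levels(docs, lvls(IMDB, :documents)) |> full_consolidate;
```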
Example:

Using the `"test_neg"` keyword for negative test set examples:
```
julia> dataset_test_neg = load(IMDB("test_neg"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> docs = collect(take(dataset_test_neg, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Once", "again", "Mr.", "Costner", "has", "dragged", "out", "a", "movie", "for", "far", "longer", "than", "necessary", "."], ["Aside", "from", "the", "terrific", "sea", "rescue", "sequences", ",", "of", "which" … "just", "did", "not", "care", "about", "any", "of", "the", "characters", "."], ["Most", "of", "us", "have", "ghosts", "in", "the", "closet", ",", "and" … "later", ",", "by", "which", "time", "I", "did", "not", "care", "."], ["The", "character", "we", "should", "really", "care", "about", "is", "a", "very", "cocky", ",", "overconfident", "Ashton", "Kutcher", "."], ["The", "problem", "is", "he", "comes", "off", "as", "kid", "who", "thinks" … "him", "and", "shows", "no", "signs", "of", "a", "cluttered", "closet", "."], ["His", "only", "obstacle", "appears", "to", "be", "winning", "over", "Costner", "."], ["Finally", "when", "we", "are", "well", "past", "the", "half", "way", "point" … ",", "Costner", "tells", "us", "all", "about", "Kutcher", "'s", "ghosts", "."], ["We", "are", "told", "why", "Kutcher", "is", "driven", "to", "be", "the", "best", "with", "no", "prior", "inkling", "or", "foreshadowing", "."], ["No", "magic", "here", ",", "it", "was", "all", "I", "could", "do", "to", "keep", "from", "turning", "it", "off", "an", "hour", "in", "."]]
[["This", "is", "an", "example", "of", "why", "the", "majority", "of", "action", "films", "are", "the", "same", "."], ["Generic", "and", "boring", ",", "there", "'s", "really", "nothing", "worth", "watching", "here", "."], ["A", "complete", "waste", "of", "the", "then", "barely-tapped", "talents", "of", "Ice-T" … "they", "are", "capable", "of", "acting", ",", "and", "acting", "well", "."], ["Do", "n't", "bother", "with", "this", "one", ",", "go", "see", "New" … "Friday", "for", "Ice", "Cube", "and", "see", "the", "real", "deal", "."], ["Ice-T", "'s", "horribly", "cliched", "dialogue", "alone", "makes", "this", "film", "grate" … "the", "heck", "Bill", "Paxton", "was", "doing", "in", "this", "film", "?"], ["And", "why", "the", "heck", "does", "he", "always", "play", "the", "exact", "same", "character", "?"], ["From", "Aliens", "onward", ",", "every", "film", "I", "'ve", "seen", "with" … "><br", "/", ">Overall", ",", "this", "is", "second-rate", "action", "trash", "."], ["There", "are", "countless", "better", "films", "to", "see", ",", "and", "if" … "copy", "but", "has", "better", "acting", "and", "a", "better", "script", "."], ["The", "only", "thing", "that", "made", "this", "at", "all", "worth", "watching" … "for", "the", "horrible", "film", "itself", "-", "but", "not", "quite", "."], ["4", "/", "10", "."]]
```
Using the `"train_pos"` keyword for positive train set examples:
```
julia> dataset_train_pos = load(IMDB()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> docs = collect(take(dataset_train_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Bromwell", "High", "is", "a", "cartoon", "comedy", "."], ["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."], ["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."], ["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."], ["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."], ["High", "."], ["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."], ["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."], ["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."], ["What", "a", "pity", "that", "it", "is", "n't", "!"]]
[["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."], ["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"], ["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."], ["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."], ["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."], ["They", "'re", "survivors", "."], ["Bolt", "is", "n't", "."], ["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."], ["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]]
julia> flatten_levels(docs, lvls(IMDB, :documents))|>full_consolidate
19-element Array{Array{String,1},1}:
["Bromwell", "High", "is", "a", "cartoon", "comedy", "."]
["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."]
["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."]
["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."]
["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."]
["High", "."]
["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."]
["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."]
["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."]
["What", "a", "pity", "that", "it", "is", "n't", "!"]
["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."]
["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"]
["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."]
["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."]
["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."]
["They", "'re", "survivors", "."]
["Bolt", "is", "n't", "."]
["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."]
["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]
```
### StanfordSentimentTreebank

This contains the sentiment part of the well-known Stanford Sentiment Treebank V1.0 dataset from the paper [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) by Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts.
The dataset gives phrases with their sentiment labels between 0 and 1. It can be used for binary or fine-grained sentiment classification problems.
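For the binary formulation, the continuous scores can be thresholded. A minimal sketch, assuming a 0.5 cutoff (the cutoff is our choice for illustration, not part of the dataset):

```
# Scores taken from the sample output in this section;
# the 0.5 threshold is illustrative.
scores = [0.5, 0.52778, 0.44444, 0.86111]
labels = scores .>= 0.5    # element-wise comparison gives Bool labels
```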
Structure of the dataset:
documents/tweets, sentences, words, characters

To get the desired levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.

## Usage:

The loaded dataset is a 2-dimensional `Array` whose first column holds `Vector`s of sentences as tokens and whose second column holds their respective sentiment scores.
```
julia> dataset = load(StanfordSentimentTreebank())
239232×2 Array{Any,2}:
 Array{String,1}[["!"]]                                             …  0.5
 Array{String,1}[["!"], ["'"]]                                         0.52778
 Array{String,1}[["!"], ["'", "'"]]                                    0.5
 Array{String,1}[["!"], ["Alas"]]                                      0.44444
 Array{String,1}[["!"], ["Brilliant"]]                                 0.86111
 Array{String,1}[["!"], ["Brilliant", "!"]]                         …  0.93056
 Array{String,1}[["!"], ["Brilliant", "!"], ["'"]]                     1.0
 Array{String,1}[["!"], ["C", "'", "mon"]]                             0.47222
 Array{String,1}[["!"], ["Gollum", "'", "s", "`", "performance", "'", "is", "incredible"]]  0.76389
 Array{String,1}[["!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.27778
 Array{String,1}[["!"], ["Romething"]]                              …  0.5
 Array{String,1}[["!"], ["Run"]]                                       0.43056
 Array{String,1}[["!"], ["The", "Movie"]]                              0.5
 Array{String,1}[["!"], ["The", "camera", "twirls", "!"], ["Oh", ",", "look", "at", "that", "clever", "angle", "!"], ["Wow", ",", "a", "jump", "cut", "!"]]  0.22222
 Array{String,1}[["!"], ["True", "Hollywood", "Story"]]                0.55556
 Array{String,1}[["!"], ["Wow"]]                                    …  0.77778
 ⋮                                                                  …
```
To get phrases from `dataset`:

```
julia> phrases = dataset[1:5, 1]    # `dataset` is a 2-D Array
5-element Array{Any,1}:
Array{String,1}[["!"]]
Array{String,1}[["!"], ["'"]]
Array{String,1}[["!"], ["'", "'"]]
Array{String,1}[["!"], ["Alas"]]
Array{String,1}[["!"], ["Brilliant"]]
```
To get sentiment values:

```
julia> values = dataset[1:5, 2]    # `dataset` is a 2-D Array
5-element Array{Any,1}:
0.5
0.52778
0.5
0.44444
0.86111
```
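The two columns can then be combined as needed; a small sketch (the five scores are copied from the output above) that picks out the most positive of the first five phrases:

```
# `argmax` returns the index of the largest element.
values = [0.5, 0.52778, 0.5, 0.44444, 0.86111]
best = argmax(values)    # index of the highest-scoring phrase (here: 5)
```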
Using `flatten_levels`:

To get an `Array` of all sentences from all the `phrases` (since each phrase can contain more than one sentence):

```
julia> sentences = flatten_levels(phrases, lvls(StanfordSentimentTreebank, :documents))|>full_consolidate
9-element Array{Array{String,1},1}:
["!"]
["!"]
["'"]
["!"]
["'", "'"]
["!"]
["Alas"]
["!"]
["Brilliant"]
```
To get an `Array` of all the words from `phrases`:

```
julia> words = flatten_levels(phrases, (!lvls)(StanfordSentimentTreebank, :words))|>full_consolidate
10-element Array{String,1}:
"!"
"!"
"'"
"!"
"'"
"'"
"!"
"Alas"
"!"
"Brilliant"
```
Similarly, the other levels can be manipulated as desired using `flatten_levels`.
### Twitter
Twitter sentiment dataset by Nick Sanders. Downloaded from the [Sentiment140 site](http://help.sentiment140.com/for-students).
It is a large dataset for the sentiment analysis task. Every tweet falls into one of three categories: positive (4), negative (0) or neutral (2). It contains 1,600,000 training examples and 498 testing examples.
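The numeric codes above can be mapped to readable labels; a minimal sketch:

```
# Sentiment140's numeric sentiment codes, as described above.
label_name = Dict(0 => "negative", 2 => "neutral", 4 => "positive")
label_name[4]    # "positive"
```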
Structure of the dataset:
documents/tweets, sentences, words, characters

The whole dataset is divided into four categories, which can be accessed by giving the corresponding keywords: <br>
`train_pos` : positive polarity sentiment train set examples (default) <br>
`train_neg` : negative polarity sentiment train set examples <br>
`test_pos` : positive polarity sentiment test set examples <br>
`test_neg` : negative polarity sentiment test set examples <br>

To get rid of unwanted levels, the `flatten_levels` function from [MultiResolutionIterators.jl](https://github.com/oxinabox/MultiResolutionIterators.jl) can be used.
Example:

Using the `"test_pos"` keyword for getting positive polarity sentiment examples:
```
julia> dataset_test_pos = load(Twitter("test_pos"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_test_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["@", "stellargirl", "I", "loooooooovvvvvveee", "my", "Kindle", "2", "."], ["Not", "that", "the", "DX", "is", "cool", ",", "but", "the", "2", "is", "fantastic", "in", "its", "own", "right", "."]]
[["Reading", "my", "kindle", "2", "..", "."], ["Love", "it..", "."], ["Lee", "childs", "is", "good", "read", "."]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
40-element Array{String,1}:
"@"
"stellargirl"
"I"
"loooooooovvvvvveee"
"my"
"Kindle"
"2"
"."
"Not"
"that"
"the"
"DX"
"is"
"cool"
","
"but"
⋮
"Reading"
"my"
"kindle"
"2"
".."
"."
"Love"
"it.."
"."
"Lee"
"childs"
"is"
"good"
"read"
"."
```
Using the `"train_pos"` category (the default) to get positive polarity sentiment examples:
```
julia> dataset_train_pos = load(Twitter()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> tweets = collect(take(dataset_train_pos, 4))
4-element Array{Array{Array{String,1},1},1}:
[["I", "LOVE", "@", "Health", "4", "UandPets", "u", "guys", "r", "the", "best", "!", "!"]]
[["im", "meeting", "up", "with", "one", "of", "my", "besties", "tonight", "!"], ["Cant", "wait", "!", "!"], ["-", "GIRL", "TALK", "!", "!"]]
[["@", "DaRealSunisaKim", "Thanks", "for", "the", "Twitter", "add", ",", "Sunisa", "!"], ["I", "got", "to", "meet", "you", "once", "at", "a", "HIN", "show" … "in", "the", "DC", "area", "and", "you", "were", "a", "sweetheart", "."]]
[["Being", "sick", "can", "be", "really", "cheap", "when", "it", "hurts", "too" … "eat", "real", "food", "Plus", ",", "your", "friends", "make", "you", "soup"]]
julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
85-element Array{String,1}:
"I"
"LOVE"
"@"
"Health"
"4"
"UandPets"
"u"
"guys"
"r"
"the"
"best"
"!"
"!"
"im"
"meeting"
"up"
⋮
"it"
"hurts"
"too"
"much"
"to"
"eat"
"real"
"food"
"Plus"
","
"your"
"friends"
"make"
"you"
"soup"
```