diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index f8387dc..ad1132f 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2024-02-07T22:32:52","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2024-09-06T16:09:44","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/APIReference/index.html b/dev/APIReference/index.html index df38a1a..d8f6ef4 100644 --- a/dev/APIReference/index.html +++ b/dev/APIReference/index.html @@ -1,13 +1,13 @@ -API References · TextAnalysis

API References

Base.argmaxMethod
argmax(scores::Vector{Score})::Score

Returns the Score with the maximum precision field.

source
Base.merge!Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.

source
TextAnalysis.bleu_scoreMethod
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation length and reference length.

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. (2004) smoothing.

Example:

one_doc_references = [
+API References · TextAnalysis

API References

Base.argmaxMethod
argmax(scores::Vector{Score})::Score

Returns the Score with the maximum precision field.

source
Base.merge!Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.

source
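Example (illustrative sketch; assumes each corpus lexicon is built with update_lexicon! before constructing its matrix):

julia> crps1 = Corpus([StringDocument("To be or not to be")]); update_lexicon!(crps1)
julia> crps2 = Corpus([StringDocument("To become or not to become")]); update_lexicon!(crps2)
julia> m1 = DocumentTermMatrix(crps1); m2 = DocumentTermMatrix(crps2)
julia> merge!(m1, m2)   # m1 now holds both documents, with terms re-sorted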
TextAnalysis.bleu_scoreMethod
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation length and reference length.

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. (2004) smoothing.

Example:

one_doc_references = [
     ["apple", "is", "apple"],
     ["apple", "is", "a", "fruit"]
 ]  
 one_doc_translation = [
     "apple", "is", "appl"
 ]
-bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source
TextAnalysis.coo_matrixMethod
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n where n = length(vocab) with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. This means that if mode is :directional then the co-occurrence matrix will be an n × n matrix where n = length(vocab) and coom[i,j] will be the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default then the co-occurrence matrix will be an n × n matrix where n = length(vocab) and coom[i,j] will be twice the number of times vocab[i] co-occurs with vocab[j] in the document doc (once for each direction, from i to j and from j to i).

Example

julia> using TextAnalysis, DataStructures
+bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source
TextAnalysis.coo_matrixMethod
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n where n = length(vocab) with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. This means that if mode is :directional then the co-occurrence matrix will be an n × n matrix where n = length(vocab) and coom[i,j] will be the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default then the co-occurrence matrix will be an n × n matrix where n = length(vocab) and coom[i,j] will be twice the number of times vocab[i] co-occurs with vocab[j] in the document doc (once for each direction, from i to j and from j to i).

Example

julia> using TextAnalysis, DataStructures
        doc = StringDocument("This is a text about an apple. There are many texts about apples.")
        docv = TextAnalysis.tokenize(language(doc), text(doc))
        vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
@@ -29,7 +29,7 @@
   [2, 1]  =  1.0
   [1, 2]  =  1.0
   [3, 2]  =  0.1999
-  [2, 3]  =  0.1999
source
TextAnalysis.coomMethod
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.

source
TextAnalysis.cos_similarityMethod
function cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).

Example

crps = Corpus( StringDocument.([
+  [2, 3]  =  0.1999
source
TextAnalysis.coomMethod
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.

source
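Example (illustrative sketch; assumes the corpus lexicon has already been built with update_lexicon!):

julia> crps = Corpus([StringDocument("this is a text about an apple there are many texts about apples")])
julia> update_lexicon!(crps)
julia> C = CooMatrix{Float64}(crps, window=3)
julia> coom(C)   # the sparse co-occurrence matrix held by C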
TextAnalysis.cos_similarityMethod
function cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).

Example

crps = Corpus( StringDocument.([
     "to be or not to be",
     "to sing or not to sing",
     "to talk or to silence"]) )
@@ -41,12 +41,12 @@
     # 3×3 Array{Float64,2}:
     #  1.0        0.0329318  0.0
     #  0.0329318  1.0        0.0
-    #  0.0        0.0        1.0
source
TextAnalysis.counter2Method
counter2(
     data,
     min::Integer,
     max::Integer
 ) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}
-

counter2 is used to build a conditional distribution, which is used by the score functions to calculate the conditional frequency distribution.

source
TextAnalysis.dtmMethod
dtm(crps::Corpus)
+

counter2 is used to build a conditional distribution, which is used by the score functions to calculate the conditional frequency distribution.

source
TextAnalysis.dtmMethod
dtm(crps::Corpus)
 dtm(d::DocumentTermMatrix)
 dtm(d::DocumentTermMatrix, density::Symbol)

Creates a simple sparse matrix of DocumentTermMatrix object.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                       StringDocument("To become or not to become")])
@@ -69,14 +69,14 @@
 julia> dtm(DocumentTermMatrix(crps), :dense)
 2×6 Array{Int64,2}:
  1  2  0  1  1  1
- 1  0  2  1  1  1
source
TextAnalysis.dtvMethod
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
+ 1  0  2  1  1  1
source
TextAnalysis.dtvMethod
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
 1×6 Array{Int64,2}:
- 1  2  0  1  1  1
source
TextAnalysis.entropyMethod
entropy(
     m::TextAnalysis.Langmodel,
     lm::DataStructures.DefaultDict,
     text_ngram::AbstractVector
 ) -> Float64
-

Calculate cross-entropy of model for given evaluation text.

Input text must be Vector of ngram of same lengths

source
TextAnalysis.everygramMethod
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)where { T <: AbstractString}

Return all possible ngrams generated from sequence of items, as an Array{String,1}

Example

julia> seq = ["To","be","or","not"]
+

Calculate cross-entropy of model for given evaluation text.

Input text must be Vector of ngram of same lengths

source
TextAnalysis.everygramMethod
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)where { T <: AbstractString}

Return all possible ngrams generated from sequence of items, as an Array{String,1}

Example

julia> seq = ["To","be","or","not"]
 julia> a = everygram(seq,min_len=1, max_len=-1)
  10-element Array{Any,1}:
   "or"          
@@ -87,16 +87,16 @@
   "be or"       
   "be or not"   
   "To be or"    
-  "To be or not"
source
TextAnalysis.extend!Method
extend!(model::NaiveBayesClassifier, dictElement)

Add the dictElement to dictionary of the Classifier model.

source
TextAnalysis.extend!Method
extend!(model::NaiveBayesClassifier, dictElement)

Add the dictElement to dictionary of the Classifier model.

source
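Illustrative sketch (the token "discount" is a made-up example):

julia> m = NaiveBayesClassifier([:spam, :non_spam])
julia> extend!(m, "discount")   # add a token to the classifier's dictionary before fitting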
TextAnalysis.featuresMethod
features(
     fs::AbstractDict,
     dict::AbstractVector
 ) -> Vector{Int64}
-

Compute an Array, mapping the value corresponding to elements of dict to the input AbstractDict.

source
TextAnalysis.fit!Method
fit!(model::NaiveBayesClassifier, str, class)
+

Compute an Array, mapping the value corresponding to elements of dict to the input AbstractDict.

source
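Illustrative sketch (assuming the returned vector holds, for each element of dict, its count in the input dictionary, and 0 when absent):

julia> fs = Dict("a" => 2, "b" => 1)
julia> features(fs, ["a", "b", "c"])
3-element Vector{Int64}:
 2
 1
 0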
TextAnalysis.fit!Method
fit!(model::NaiveBayesClassifier, str, class)
 fit!(model::NaiveBayesClassifier, ::Features, class)
-fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.

source
TextAnalysis.fmeasure_lcsFunction
fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on WLCS.

Arguments

  • RLCS - Recall Factor
  • PLCS - Precision Factor
  • β - Parameter
source
TextAnalysis.frequenciesMethod
frequencies(
+fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.

source
TextAnalysis.fmeasure_lcsFunction
fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on WLCS.

Arguments

  • RLCS - Recall Factor
  • PLCS - Precision Factor
  • β - Parameter
source
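Illustrative sketch (assuming the standard ROUGE F-measure, F = ((1 + β²)·R·P) / (R + β²·P)):

julia> fmeasure_lcs(0.5, 0.8, 1.0)   # ≈ (2 * 0.5 * 0.8) / (0.5 + 0.8) ≈ 0.615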
TextAnalysis.frequenciesMethod
frequencies(
     xs::AbstractArray{T, 1}
 ) -> Dict{_A, Int64} where _A
-

Create a dict that maps elements in input array to their frequencies.

source
TextAnalysis.frequent_termsFunction
frequent_terms(crps, alpha=0.95)

Find the frequent terms from Corpus, occurring more than alpha percentage of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
+

Create a dict that maps elements in input array to their frequencies.

source
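Example (illustrative):

julia> frequencies(["a", "b", "a"])
Dict{String, Int64} with 2 entries:
  "a" => 2
  "b" => 1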
TextAnalysis.frequent_termsFunction
frequent_terms(crps, alpha=0.95)

Find the frequent terms from Corpus, occurring more than alpha percentage of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                       StringDocument("This is Document 2")])
 A Corpus with 2 documents:
  * 2 StringDocument's
@@ -109,8 +109,8 @@
 3-element Array{String,1}:
  "is"
  "This"
- "Document"

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.get_ngramsMethod
get_ngrams(segment, max_order)

Extracts all n-grams up to a given maximum order from an input segment. Returns a counter containing all n-grams up to max_order in segment with a count of how many times each n-gram occurred.

Arguments

  • segment: text segment from which n-grams will be extracted.
  • max_order: maximum length in tokens of the n-grams returned by this method.
source
TextAnalysis.hash_dtmMethod
hash_dtm(crps::Corpus)
-hash_dtm(crps::Corpus, h::TextHashFunction)

Represents a Corpus as a Matrix with N entries.

source
TextAnalysis.get_ngramsMethod
get_ngrams(segment, max_order)

Extracts all n-grams up to a given maximum order from an input segment. Returns a counter containing all n-grams up to max_order in segment with a count of how many times each n-gram occurred.

Arguments

  • segment: text segment from which n-grams will be extracted.
  • max_order: maximum length in tokens of the n-grams returned by this method.
source
TextAnalysis.hash_dtmMethod
hash_dtm(crps::Corpus)
+hash_dtm(crps::Corpus, h::TextHashFunction)

Represents a Corpus as a Matrix with N entries.

source
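Illustrative sketch (mirrors the hash_dtv example below; the cardinality 10 is arbitrary):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
julia> hash_dtm(crps)                        # uses the corpus' default hash function
julia> hash_dtm(crps, TextHashFunction(10))  # 2×10 matrix of hashed term counts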
TextAnalysis.hash_dtvMethod
hash_dtv(d::AbstractDocument)
 hash_dtv(d::AbstractDocument, h::TextHashFunction)

Represents a document as a vector with N entries.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                       StringDocument("To become or not to become")])
 
@@ -123,20 +123,20 @@
 
 julia> hash_dtv(crps[1])
 1×100 Array{Int64,2}:
- 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
source
TextAnalysis.index_hashMethod
index_hash(str, TextHashFunc)

Shows mapping of string to integer.

Parameters: - str = the string to be hashed - TextHashFunc = TextHashFunction type object

julia> h = TextHashFunction(10)
+ 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
source
TextAnalysis.index_hashMethod
index_hash(str, TextHashFunc)

Shows mapping of string to integer.

Parameters: - str = the string to be hashed - TextHashFunc = TextHashFunction type object

julia> h = TextHashFunction(10)
 TextHashFunction(hash, 10)
 
 julia> index_hash("a", h)
 8
 
 julia> index_hash("b", h)
-7
source
TextAnalysis.inverse_indexMethod
inverse_index(crps::Corpus)

Shows the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.

source
TextAnalysis.language!Method
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")
+7
source
TextAnalysis.inverse_indexMethod
inverse_index(crps::Corpus)

Shows the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.

source
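Illustrative sketch (assumes the index is built with update_inverse_index! first):

julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_inverse_index!(crps)
julia> inverse_index(crps)["Name"]   # positions of the documents containing "Name"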
TextAnalysis.language!Method
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")
 
 julia> language!(d, Languages.Spanish())
 
 julia> d.metadata.language
-Languages.Spanish()

See also: language, languages, languages!

source
TextAnalysis.languages!Method
languages!(crps, langs::Vector{Language})
-languages!(crps, lang::Language)

Update languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In that case, the number of documents must equal the length of the vector.

See also: languages, language!, language

source
TextAnalysis.ldaMethod
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
source
TextAnalysis.lexiconMethod
lexicon(crps::Corpus)

Shows the lexicon of the corpus.

Lexicon of a corpus consists of all the terms that occur in any document in the corpus.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
+Languages.Spanish()

See also: language, languages, languages!

source
TextAnalysis.languages!Method
languages!(crps, langs::Vector{Language})
+languages!(crps, lang::Language)

Update languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In that case, the number of documents must equal the length of the vector.

See also: languages, language!, language

source
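Illustrative sketch:

julia> using Languages
julia> crps = Corpus([StringDocument("Hello world"), StringDocument("Hola mundo")])
julia> languages!(crps, [Languages.English(), Languages.Spanish()])  # one language per document
julia> languages!(crps, Languages.English())                         # same language for all documents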
TextAnalysis.ldaMethod
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
source
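Minimal sketch of a call (the corpus, topic count and hyperparameters are arbitrary illustrations):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> ϕ, θ = lda(m, 2, 1000, 0.1, 0.1)   # 2 topics, 1000 Gibbs iterations, α = β = 0.1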
TextAnalysis.lexiconMethod
lexicon(crps::Corpus)

Shows the lexicon of the corpus.

Lexicon of a corpus consists of all the terms that occur in any document in the corpus.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
                           StringDocument("Name Bar")])
 A Corpus with 2 documents:
 * 2 StringDocument's
@@ -148,28 +148,28 @@
 Corpus's index contains 0 tokens
 
 julia> lexicon(crps)
-Dict{String,Int64} with 0 entries
source
TextAnalysis.logscoreMethod
logscore(
     m::TextAnalysis.Langmodel,
     temp_lm::DataStructures.DefaultDict,
     word,
     context
 ) -> Float64
-

Evaluate the log score of this word in this context.

The arguments are the same as for score and maskedscore

source
TextAnalysis.lookupMethod
lookup(
     voc::Vocabulary,
     word::AbstractArray{T<:AbstractString, 1}
 ) -> Vector
-

Look up a sequence of words in the vocabulary.

Return an Array of String

See Vocabulary

source
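Illustrative sketch (reuses the Vocabulary example shown further below; words below the cutoff map to "<unk>"):

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocab = Vocabulary(words, 2)
julia> lookup(vocab, ["a", "b", "c"])   # "b" occurs only once, so it maps to "<unk>"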
TextAnalysis.lsaMethod
lsa(dtm::DocumentTermMatrix)
-lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.

source
TextAnalysis.lsaMethod
lsa(dtm::DocumentTermMatrix)
+lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.

source
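Illustrative sketch (assumes update_lexicon! has been called when starting from a Corpus):

julia> crps = Corpus([StringDocument("this is about apples"),
                      StringDocument("this is about oranges")])
julia> update_lexicon!(crps)
julia> lsa(crps)   # SVD-based decomposition of the weighted document term matrix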
TextAnalysis.maskedscoreMethod
maskedscore(
     m::TextAnalysis.Langmodel,
     temp_lm::DataStructures.DefaultDict,
     word,
     context
 ) -> Float64
-

It is used to evaluate the score with out-of-vocabulary words masked.

The arguments are the same as for score

source
TextAnalysis.ngramizeMethod
ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
+

It is used to evaluate the score with out-of-vocabulary words masked.

The arguments are the same as for score

source
TextAnalysis.ngramizeMethod
ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
 Dict{AbstractString,Int64} with 3 entries:
   "be or not" => 1
   "or not to" => 1
-  "To be or"  => 1
source
TextAnalysis.ngramizenewMethod
ngramizenew( words::Vector{T}, nlist::Integer...) where { T <: AbstractString}

ngramizenew is used to output ngrams of the requested orders from a sequence of words.

Example

julia> seq=["To","be","or","not","To","not","To","not"]
+  "To be or"  => 1
source
TextAnalysis.ngramizenewMethod
ngramizenew( words::Vector{T}, nlist::Integer...) where { T <: AbstractString}

ngramizenew is used to output ngrams of the requested orders from a sequence of words.

Example

julia> seq=["To","be","or","not","To","not","To","not"]
 julia> ngramizenew(seq ,2)
  7-element Array{Any,1}:
   "To be" 
@@ -178,7 +178,7 @@
   "not To"
   "To not"
   "not To"
-  "To not"
source
TextAnalysis.ngramsMethod
ngrams(ngd::NGramDocument, n::Integer)
 ngrams(d::AbstractDocument, n::Integer)
 ngrams(d::NGramDocument)
 ngrams(d::AbstractDocument)

Access the document text as n-gram counts.

Example

julia> sd = StringDocument("To be or not to be...")
@@ -197,13 +197,13 @@
   "To"   => 1
   "be"   => 1
   "be.." => 1
-  "."    => 1
source
TextAnalysis.onegramizeMethod
onegramize(lang, tokens)

Create the unigrams dict for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
+  "."    => 1
source
TextAnalysis.onegramizeMethod
onegramize(lang, tokens)

Create the unigrams dict for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
 Dict{String,Int64} with 5 entries:
   "or"  => 1
   "not" => 1
   "to"  => 1
   "To"  => 1
-  "be"  => 2
source
TextAnalysis.padding_ngramMethod
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol ="</s>") where { T <: AbstractString}

padding_ngram is used to pad both the left and right of a sentence and output ngrams of order n.

It also pads the original input Array of strings.

Example

julia> example = ["1","2","3","4","5"]
+  "be"  => 2
source
TextAnalysis.padding_ngramMethod
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol ="</s>") where { T <: AbstractString}

padding_ngram is used to pad both the left and right of a sentence and output ngrams of order n.

It also pads the original input Array of strings.

Example

julia> example = ["1","2","3","4","5"]
 
 julia> padding_ngram(example,2,pad_left=true,pad_right=true)
  6-element Array{Any,1}:
@@ -212,14 +212,14 @@
   "2 3"   
   "3 4"   
   "4 5"   
-  "5 </s>"
source
TextAnalysis.perplexityMethod
perplexity(
     m::TextAnalysis.Langmodel,
     lm::DataStructures.DefaultDict,
     text_ngram::AbstractVector
 ) -> Float64
-

Calculates the perplexity of the given text.

This is simply 2^cross-entropy for the text, so the arguments are the same as for entropy.

source
TextAnalysis.predictMethod
predict(::NaiveBayesClassifier, str)
+

Calculates the perplexity of the given text.

This is simply 2^cross-entropy for the text, so the arguments are the same as for entropy.

source
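Illustrative sketch (reuses the MLE workflow from the Statistical Language Model page; the evaluation bigrams are space-joined strings, as produced by everygram):

julia> voc   = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]
julia> train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
julia> model = MLE(voc)
julia> fit   = model(train, 2, 2)      # bigram model
julia> test  = ["is my", "my good"]    # evaluation bigrams, all of the same order
julia> entropy(model, fit, test)
julia> perplexity(model, fit, test)    # equals 2^entropy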
TextAnalysis.predictMethod
predict(::NaiveBayesClassifier, str)
 predict(::NaiveBayesClassifier, ::Features)
-predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

source
TextAnalysis.prepare!Method
prepare!(doc, flags)
+predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

source
TextAnalysis.prepare!Method
prepare!(doc, flags)
 prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

  • strip_patterns
  • strip_corrupt_utf8
  • strip_case
  • stem_words
  • tag_part_of_speech
  • strip_whitespace
  • strip_punctuation
  • strip_numbers
  • strip_non_letters
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_articles
  • strip_prepositions
  • strip_pronouns
  • strip_stopwords
  • strip_sparse_terms
  • strip_frequent_terms
  • strip_html_tags

Example

julia> doc = StringDocument("This is a document of mine")
 A StringDocument{String}
  * Language: Languages.English()
@@ -229,7 +229,7 @@
  * Snippet: This is a document of mine
 julia> prepare!(doc, strip_pronouns | strip_articles)
 julia> text(doc)
-"This is   document of "
source
TextAnalysis.probFunction
prob(
     m::TextAnalysis.Langmodel,
     templ_lm::DataStructures.DefaultDict,
     word
@@ -240,7 +240,7 @@
     word,
     context
 ) -> Float64
-

Get the probability of a word given its context.

In other words, for the given context, calculate the frequency distribution of the word.

source
TextAnalysis.prune!Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.

source
TextAnalysis.remove_case!Method
remove_case!(doc)
+

Get the probability of a word given its context.

In other words, for the given context, calculate the frequency distribution of the word.

source
TextAnalysis.prune!Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.

source
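Illustrative sketch:

julia> crps = Corpus([StringDocument("one two three"), StringDocument("two three four")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> prune!(m, [1])   # drop the first document and compact terms that no longer occur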
TextAnalysis.remove_case!Method
remove_case!(doc)
 remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.

Example

julia> str = "The quick brown fox jumps over the lazy dog"
 julia> sd = StringDocument(str)
 A StringDocument{String}
@@ -251,8 +251,8 @@
  * Snippet: The quick brown fox jumps over the lazy dog
 julia> remove_case!(sd)
 julia> sd.text
-"the quick brown fox jumps over the lazy dog"

See also: remove_case

source
TextAnalysis.remove_frequent_terms!Function
remove_frequent_terms!(crps, alpha=0.95)

Remove terms in crps, occurring more than alpha percent of documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
+"the quick brown fox jumps over the lazy dog"

See also: remove_case

source
TextAnalysis.remove_frequent_terms!Function
remove_frequent_terms!(crps, alpha=0.95)

Remove terms in crps, occurring more than alpha percent of documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                       StringDocument("This is Document 2")])
 A Corpus with 2 documents:
 * 2 StringDocument's
@@ -265,7 +265,7 @@
 julia> text(crps[1])
 "     1"
 julia> text(crps[2])
-"     2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.remove_html_tags!Method
remove_html_tags!(doc::StringDocument)
 remove_html_tags!(crps)

Remove html tags from the StringDocument or documents crps. Does not work for documents other than StringDocument.

Example

julia> html_doc = StringDocument(
              "
                <html>
@@ -284,8 +284,8 @@
  * Snippet:  <html> <head><s
 julia> remove_html_tags!(html_doc)
 julia> strip(text(html_doc))
-"Hello world"

See also: remove_html_tags

source
TextAnalysis.remove_patterns!Method
remove_patterns!(doc, rex::Regex)
-remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument. See also: remove_patterns

source
TextAnalysis.remove_patterns!Method
remove_patterns!(doc, rex::Regex)
+remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument. See also: remove_patterns

source
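Illustrative sketch:

julia> sd = StringDocument("Quick Brown Fox 123 Jumps 456")
julia> remove_patterns!(sd, r"\d+")   # strip the digit runs
julia> text(sd)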
TextAnalysis.remove_sparse_terms!Function
remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms in crps, occurring less than alpha percent of documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                       StringDocument("This is Document 2")])
 A Corpus with 2 documents:
  * 2 StringDocument's
@@ -298,29 +298,29 @@
 julia> crps[1].text
 "This is Document "
 julia> crps[2].text
-"This is Document "

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.remove_whitespace!Method
remove_whitespace!(doc)
-remove_whitespace!(crps)

Squash multiple whitespace characters to a single space and remove all leading and trailing whitespace in a document or crps. This is a no-op for FileDocument, TokenDocument or NGramDocument. See also: remove_whitespace

source
TextAnalysis.remove_whitespace!Method
remove_whitespace!(doc)
+remove_whitespace!(crps)

Squash multiple whitespace characters to a single space and remove all leading and trailing whitespace in a document or crps. This is a no-op for FileDocument, TokenDocument or NGramDocument. See also: remove_whitespace

source
TextAnalysis.remove_words!Method
remove_words!(doc, words::Vector{AbstractString})
 remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.

Example

julia> str="The quick brown fox jumps over the lazy dog"
 julia> sd=StringDocument(str);
 julia> remove_words = ["fox", "over"]
 julia> remove_words!(sd, remove_words)
 julia> sd.text
-"the quick brown   jumps   the lazy dog"
source
TextAnalysis.rouge_l_sentenceFunction
rouge_l_sentence(
     references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
     weighted=false, weight_func=sqrt,
     lang=Languages.English()
-)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, contains a power function; the weight_func here defaults to sqrt (a power of 0.5).

See also: rouge_n, rouge_l_summary

source
TextAnalysis.rouge_l_summaryMethod
rouge_l_summary(
+)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, contains a power function; the weight_func here defaults to sqrt (a power of 0.5).

See also: rouge_n, rouge_l_summary

source
TextAnalysis.rouge_nMethod
rouge_n(
     references::Vector{<:AbstractString}, 
     candidate::AbstractString, 
     n::Int; 
     lang::Language
-)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

The function takes the following arguments -

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against reference summaries.
  • n::Integer = Order of NGrams
  • lang::Language = Language of the text, useful while generating N-grams. Default value is Languages.English()

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

source
TextAnalysis.scoreFunction
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in MLE

source
TextAnalysis.scoreFunction
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in InterpolatedLanguageModel

Applies Kneser-Ney or Witten-Bell smoothing depending upon the subtype

source
TextAnalysis.scoreMethod
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context

Add-one smoothing to Lidstone or Laplace(gammamodel) models

source
TextAnalysis.sentence_tokenizeMethod
sentence_tokenize(language, str)

Split str into sentences.

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
+)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

The function takes the following arguments -

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against reference summaries.
  • n::Integer = Order of NGrams
  • lang::Language = Language of the text, useful while generating N-grams. Default value is Languages.English()

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

source
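Illustrative sketch (the candidate and reference strings are made-up; argmax picks the best Score by precision, as documented above):

julia> references = ["the quick brown fox jumped over the lazy dog"]
julia> candidate  = "the quick brown fox jumped"
julia> scores = rouge_n(references, candidate, 2, lang=Languages.English())
julia> argmax(scores)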
TextAnalysis.scoreFunction
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in MLE

source
TextAnalysis.scoreFunction
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in InterpolatedLanguageModel

Applies Kneser-Ney or Witten-Bell smoothing depending upon the subtype

source
TextAnalysis.scoreMethod
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context

Add-one smoothing to Lidstone or Laplace(gammamodel) models

source
TextAnalysis.sentence_tokenizeMethod
sentence_tokenize(language, str)

Split str into sentences.

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
 2-element Array{SubString{String},1}:
  "Here are few words!"
- "I am Foo Bar."

See also: tokenize

source
TextAnalysis.sparse_termsFunction
sparse_terms(crps, alpha=0.05)

Find the sparse terms from Corpus, occurring in less than alpha percentage of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
+ "I am Foo Bar."

See also: tokenize

source
TextAnalysis.sparse_termsFunction
sparse_terms(crps, alpha=0.05)

Find the sparse terms from Corpus, occurring in less than alpha percentage of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                       StringDocument("This is Document 2")])
 A Corpus with 2 documents:
 * 2 StringDocument's
@@ -332,7 +332,7 @@
 julia> sparse_terms(crps, 0.5)
 2-element Array{String,1}:
  "1"
- "2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.standardize!Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
 		              TokenDocument("Document 2"),
 		              NGramDocument("Document 3")])
 A Corpus with 3 documents:
@@ -357,13 +357,13 @@
  * 3 NGramDocument's
 
 Corpus's lexicon contains 0 tokens
-Corpus's index contains 0 tokens
source
TextAnalysis.stem!Method
stem!(doc)
-stem!(crps)

Stems the document or documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument and Corpus made of these type of documents.

source
TextAnalysis.stem!Method
stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first)

source
TextAnalysis.summarizeMethod
summarize(doc [, ns])

Summarizes the document and returns ns number of sentences. It takes 2 arguments:

  • d : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) Mention the number of sentences in the Summary, defaults to 5 sentences.

By default ns is set to the value 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
+Corpus's index contains 0 tokens
source
TextAnalysis.stem!Method
stem!(doc)
+stem!(crps)

Stems the document or documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument and Corpus made of these type of documents.

source
TextAnalysis.stem!Method
stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first)

source
TextAnalysis.summarizeMethod
summarize(doc [, ns])

Summarizes the document and returns ns number of sentences. It takes 2 arguments:

  • d : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) Mention the number of sentences in the Summary, defaults to 5 sentences.

By default ns is set to the value 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
 
 julia> summarize(s, ns=2)
 2-element Array{SubString{String},1}:
  "Assume this Short Document as an example."
- "This has too foo sentences."
source
TextAnalysis.tag_scheme!Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported-

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
+ "This has too foo sentences."
source
TextAnalysis.tag_scheme!Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported-

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
 
 julia> tag_scheme!(tags, "BIO1", "BIOES")
 
@@ -376,7 +376,7 @@
  "E-MISC"
  "B-PER"
  "I-PER"
- "E-PER"
source
TextAnalysis.textMethod
text(fd::FileDocument)
 text(sd::StringDocument)
 text(ngd::NGramDocument)

Access the text of Document as a string.

Example

julia> sd = StringDocument("To be or not to be...")
 A StringDocument{String}
@@ -387,7 +387,7 @@
  * Snippet: To be or not to be...
 
 julia> text(sd)
-"To be or not to be..."
source
TextAnalysis.tf!Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same nonzeros as dtm.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf!Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly if dtm and tf are same matrix.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf!Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same nonzeros as dtm.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf!Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly if dtm and tf are same matrix.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tfMethod
tf(dtm::DocumentTermMatrix)
 tf(dtm::SparseMatrixCSC{Real})
 tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
@@ -407,7 +407,7 @@
   [1, 5]  =  0.166667
   [2, 5]  =  0.166667
   [1, 6]  =  0.166667
-  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have same number of nonzeros.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

dtm and tf-idf must be matrices of same dimensions.

See also: tf, tf! , tf_idf

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have same number of nonzeros.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

dtm and tf-idf must be matrices of same dimensions.

See also: tf, tf! , tf_idf

source
TextAnalysis.tf_idfMethod
tf_idf(dtm::DocumentTermMatrix)
 tf_idf(dtm::SparseMatrixCSC{Real})
 tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
@@ -427,14 +427,14 @@
   [1, 5]  =  0.0
   [2, 5]  =  0.0
   [1, 6]  =  0.0
-  [2, 6]  =  0.0

See also: tf!, tf_idf, tf_idf!

source
TextAnalysis.titles!Method
titles!(crps, vec::Vector{String})
-titles!(crps, str)

Update titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a vector, set title of ith document to corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of vector.

See also: titles, title!, title

source
TextAnalysis.tokenizeMethod
tokenize(language, str)

Split str into words and other tokens such as punctuation.

Example

julia> tokenize(Languages.English(), "Too foo words!")
+  [2, 6]  =  0.0

See also: tf!, tf_idf, tf_idf!

source
TextAnalysis.titles!Method
titles!(crps, vec::Vector{String})
+titles!(crps, str)

Update titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a vector, set title of ith document to corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of vector.

See also: titles, title!, title

source
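Illustrative sketch:

julia> crps = Corpus([StringDocument("Document 1"), StringDocument("Document 2")])
julia> titles!(crps, ["First doc", "Second doc"])   # one title per document
julia> titles!(crps, "Same title")                  # same title for every document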
TextAnalysis.tokenizeMethod
tokenize(language, str)

Split str into words and other tokens such as punctuation.

Example

julia> tokenize(Languages.English(), "Too foo words!")
 4-element Array{String,1}:
  "Too"
  "foo"
  "words"
- "!"

See also: sentence_tokenize

source
TextAnalysis.tokensMethod
tokens(d::TokenDocument)
 tokens(d::(Union{FileDocument, StringDocument}))

Access the document text as a token array.

Example

julia> sd = StringDocument("To be or not to be...")
 A StringDocument{String}
  * Language: Languages.English()
@@ -451,8 +451,8 @@
     "not"
     "to"
     "be.."
-    "."
source
TextAnalysis.weighted_lcsFunction
weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)

Compute the Weighted Longest Common Subsequence of X and Y.

source
TextAnalysis.CooMatrixType

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent

co-occurrences of two terms within a given window

  • terms::Vector{String} a list of terms that represent the lexicon of

the document or corpus

  • column_indices::OrderedDict{String, Int} a map between the terms and the

columns of the co-occurrence matrix

source
TextAnalysis.CooMatrixMethod
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.

source
TextAnalysis.CorpusMethod
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
+    "."
source
TextAnalysis.weighted_lcsFunction
weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)

Compute the Weighted Longest Common Subsequence of X and Y.

source
TextAnalysis.CooMatrixType

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent

co-occurrences of two terms within a given window

  • terms::Vector{String} a list of terms that represent the lexicon of

the document or corpus

  • column_indices::OrderedDict{String, Int} a map between the terms and the

columns of the co-occurrence matrix

source
TextAnalysis.CooMatrixMethod
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.

source
TextAnalysis.CorpusMethod
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
 		              StringDocument("Document 2")])
 A Corpus with 2 documents:
  * 2 StringDocument's
@@ -461,13 +461,13 @@
  * 0 NGramDocument's
 
 Corpus's lexicon contains 0 tokens
-Corpus's index contains 0 tokens
source
TextAnalysis.DocumentMetadataType
DocumentMetadata(
     language::Language,
     title::String,
     author::String,
     timestamp::String,
     custom::Any
-)

Stores basic metadata about Document.

...

Arguments

  • language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
  • title::String : What is the title of the document? Defaults to "Untitled Document".
  • author::String : Who wrote the document? Defaults to "Unknown Author".
  • timestamp::String : When was the document written? Defaults to "Unknown Time".
  • custom : user specific data field. Defaults to nothing.

...

source
TextAnalysis.DocumentTermMatrixMethod
DocumentTermMatrix(crps::Corpus)
+)

Stores basic metadata about Document.

...

Arguments

  • language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
  • title::String : What is the title of the document? Defaults to "Untitled Document".
  • author::String : Who wrote the document? Defaults to "Unknown Author".
  • timestamp::String : When was the document written? Defaults to "Unknown Time".
  • custom : user specific data field. Defaults to nothing.

...

source
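Illustrative sketch (the metadata values are made-up; StringDocument is assumed to accept a DocumentMetadata, as TokenDocument does below):

julia> using Languages
julia> md = DocumentMetadata(Languages.English(), "Hamlet", "William Shakespeare", "1600", nothing)
julia> sd = StringDocument("To be or not to be", md)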
TextAnalysis.DocumentTermMatrixMethod
DocumentTermMatrix(crps::Corpus)
 DocumentTermMatrix(crps::Corpus, terms::Vector{String})
 DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
 DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int},terms::Vector{String})

Represent documents as a matrix of word counts.

Allow us to apply linear algebra operations and statistical techniques. Need to update lexicon before use.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
@@ -489,7 +489,7 @@
   [1, 5]  =  1
   [2, 5]  =  1
   [1, 6]  =  1
-  [2, 6]  =  1
source
TextAnalysis.FileDocumentMethod
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
+  [2, 6]  =  1
source
TextAnalysis.FileDocumentMethod
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
 "/usr/share/dict/words"
 
 julia> fd = FileDocument(pathname)
@@ -498,7 +498,7 @@
  * Title: /usr/share/dict/words
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
source
TextAnalysis.KneserNeyInterpolatedMethod
KneserNeyInterpolated(word::Vector{T}, discount:: Float64,unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing KneserNey Interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.

source
TextAnalysis.LaplaceType
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Laplace) for providing Laplace-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma = 1.

source
TextAnalysis.LidstoneMethod
Lidstone(word::Vector{T}, gamma:: Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Lidstone) for providing Lidstone-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma.

source
TextAnalysis.MLEMethod
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing MLE ngram model scores.

Implementation of Base Ngram Model.

source
TextAnalysis.KneserNeyInterpolatedMethod
KneserNeyInterpolated(word::Vector{T}, discount:: Float64,unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing KneserNey Interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.

source
TextAnalysis.LaplaceType
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Laplace) for providing Laplace-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma = 1.

source
TextAnalysis.LidstoneMethod
Lidstone(word::Vector{T}, gamma:: Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Lidstone) for providing Lidstone-smoothed scores.

In addition to initialization arguments from BaseNgramModel also requires a number by which to increase the counts, gamma.

source
TextAnalysis.MLEMethod
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing MLE ngram model scores.

Implementation of Base Ngram Model.

source
TextAnalysis.NGramDocumentMethod
NGramDocument(txt::AbstractString, n::Integer=1)
 NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
 NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of n-grams, which are UTF8 n-grams and map to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                      "or" => 1, "not" => 1,
@@ -517,7 +517,7 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.NaiveBayesClassifierMethod
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

It takes two arguments:

  • classes: An array of possible classes that the concerned data could belong to.
  • dict:(Optional Argument) An Array of possible tokens (words). This is automatically updated if a new token is detected in the Step 2) or 3)

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
+ * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.NaiveBayesClassifierMethod
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

It takes two arguments:

  • classes: An array of possible classes that the concerned data could belong to.
  • dict:(Optional Argument) An Array of possible tokens (words). This is automatically updated if a new token is detected in the Step 2) or 3)

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
 
 julia> m = NaiveBayesClassifier([:spam, :non_spam])
 NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
@@ -531,13 +531,13 @@
 julia> predict(m, "is this a spam")
 Dict{Symbol, Float64} with 2 entries:
   :spam     => 0.59883
-  :non_spam => 0.40117
source
TextAnalysis.ScoreMethod
Score(
     precision::AbstractFloat,
     recall::AbstractFloat,
     fmeasure::AbstractFloat
 ) -> Score
-

Stores a result of evaluation

source
TextAnalysis.StringDocumentMethod
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
+

Stores a result of evaluation

source
TextAnalysis.StringDocumentMethod
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
 "To be or not to be..."
 
 julia> sd = StringDocument(str)
@@ -546,9 +546,9 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: To be or not to be...
source
TextAnalysis.TextHashFunctionMethod
TextHashFunction(cardinality)
 TextHashFunction(hash_function, cardinality)

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the Hash Trick in which we replace terms with their hashed value using a hash function that outputs integers from 1 to N.

Parameters: - cardinality = Max index used for hashing (default 100) - hash_function = function used for hashing process (default function present, see code-base)

julia> h = TextHashFunction(10)
-TextHashFunction(hash, 10)
source
TextAnalysis.TokenDocumentMethod
TokenDocument(txt::AbstractString)
 TokenDocument(txt::AbstractString, dm::DocumentMetadata)
 TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represents a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
 6-element Array{String,1}:
@@ -565,7 +565,7 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.VocabularyType
Vocabulary(word,unk_cutoff =1 ,unk_label = "<unk>")

Stores language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items

by comparing their counts to a cutoff value. Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
+ * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.VocabularyType
Vocabulary(word,unk_cutoff =1 ,unk_label = "<unk>")

Stores language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items

by comparing their counts to a cutoff value. Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
 julia> vocabulary = Vocabulary(words, 2) 
   Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>") 
 
@@ -622,7 +622,7 @@
  1
 
 julia> vocabulary.vocab["b"]
- 1
source
TextAnalysis.VocabularyMethod
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
 Vocabulary(
     word::Array{T<:AbstractString, 1},
     unk_cutoff
@@ -632,4 +632,4 @@
     unk_cutoff,
     unk_label
 ) -> Vocabulary
-
source
TextAnalysis.WittenBellInterpolatedMethod
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where { T <: AbstractString}

Initiate Type for providing Interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.

source
+
source
TextAnalysis.WittenBellInterpolatedMethod
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where { T <: AbstractString}

Initiate Type for providing Interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.

source
diff --git a/dev/LM/index.html b/dev/LM/index.html index dcad227..0fa24ca 100644 --- a/dev/LM/index.html +++ b/dev/LM/index.html @@ -1,5 +1,5 @@ -Statistical Language Model · TextAnalysis

Statistical Language Model

TextAnalysis provides the following Language Models:

  • MLE - Base Ngram model.
  • Lidstone - Base Ngram model with Lidstone smoothing.
  • Laplace - Base Ngram language model with Laplace smoothing.
  • WittenBellInterpolated - Interpolated version of the Witten-Bell algorithm.
  • KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.

APIs

To use the API, we first instantiate the desired model and then fit it to a training set:

MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where { T <: AbstractString}
+Statistical Language Model · TextAnalysis

Statistical Language Model

TextAnalysis provides the following Language Models:

  • MLE - Base Ngram model.
  • Lidstone - Base Ngram model with Lidstone smoothing.
  • Laplace - Base Ngram language model with Laplace smoothing.
  • WittenBellInterpolated - Interpolated version of the Witten-Bell algorithm.
  • KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.

APIs

To use the API, we first instantiate the desired model and then fit it to a training set:

MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where { T <: AbstractString}
         
 Lidstone(word::Vector{T}, gamma:: Float64, unk_cutoff=1, unk_label="<unk>") where { T <: AbstractString}
         
@@ -40,29 +40,29 @@
 julia> masked_score = maskedscore(model,fit,"is","alien")
 0.3333333333333333
 #as expected maskedscore is equivalent to unmaskedscore with context replaced with "<unk>"
-
Note

When you call MLE(voc) for the first time, it will update your vocabulary set as well.

Evaluation Method

score

used to evaluate the probability of word given context (P(word | context))

TextAnalysis.scoreFunction
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context

Applies additive (add-γ / add-one) smoothing for Lidstone or Laplace (gammamodel) models

source
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in MLE

source
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in InterpolatedLanguageModel

Applies Kneser-Ney or Witten-Bell smoothing depending upon the subtype

source

Arguments:

  1. m : an instance of a Langmodel struct.
  2. temp_lm: the output of a function call on the Langmodel instance.
  3. word: the word whose probability is being evaluated.
  4. context: the context of the given word.
  • In the case of Lidstone and Laplace models it applies smoothing, and

  • in the case of an Interpolated language model it provides Kneser-Ney or Witten-Bell smoothing

maskedscore

TextAnalysis.maskedscoreFunction
maskedscore(
+
Note

When you call MLE(voc) for the first time, it will update your vocabulary set as well.

Evaluation Method

score

used to evaluate the probability of word given context (P(word | context))

TextAnalysis.scoreFunction
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context

Applies additive (add-γ / add-one) smoothing for Lidstone or Laplace (gammamodel) models

source
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in MLE

source
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output probability of word given that context in InterpolatedLanguageModel

Applies Kneser-Ney or Witten-Bell smoothing depending upon the subtype

source

Arguments:

  1. m : an instance of a Langmodel struct.
  2. temp_lm: the output of a function call on the Langmodel instance.
  3. word: the word whose probability is being evaluated.
  4. context: the context of the given word.
  • In the case of Lidstone and Laplace models it applies smoothing, and

  • in the case of an Interpolated language model it provides Kneser-Ney or Witten-Bell smoothing (see the sketch below)
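
A minimal, self-contained sketch of the call pattern (the data is illustrative and the printed values are omitted, since they depend on the fitted model):

julia> using TextAnalysis

julia> voc = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]

julia> model = MLE(voc)            # instantiate the model (this also updates the vocabulary)

julia> fit = model(voc, 2, 2)      # evaluate the model on bigrams only

julia> score(model, fit, "is", "<unk>")    # P("is" | "<unk>")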

maskedscore

TextAnalysis.maskedscoreFunction
maskedscore(
     m::TextAnalysis.Langmodel,
     temp_lm::DataStructures.DefaultDict,
     word,
     context
 ) -> Float64
-

It is used to evaluate the score while masking out-of-vocabulary words with the unknown-word label

The arguments are the same as for score

source

logscore

logscore

TextAnalysis.logscoreFunction
logscore(
     m::TextAnalysis.Langmodel,
     temp_lm::DataStructures.DefaultDict,
     word,
     context
 ) -> Float64
-

Evaluate the log score of this word in this context.

The arguments are the same as for score and maskedscore

source
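
Continuing the MLE sketch above, logscore takes the same arguments as score and maskedscore (the comment is an assumption based on the entropy/perplexity definitions below, not doctest output):

julia> logscore(model, fit, "is", "<unk>")   # assumed to be the base-2 log of the corresponding masked score
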

entropy

entropy

TextAnalysis.entropyFunction
entropy(
     m::TextAnalysis.Langmodel,
     lm::DataStructures.DefaultDict,
     text_ngram::AbstractVector
 ) -> Float64
-

Calculate the cross-entropy of the model for the given evaluation text.

The input text must be a Vector of n-grams, all of the same order.

source

perplexity

TextAnalysis.perplexityFunction
perplexity(
+

Calculate the cross-entropy of the model for the given evaluation text.

The input text must be a Vector of n-grams, all of the same order.

source

perplexity

TextAnalysis.perplexityFunction
perplexity(
     m::TextAnalysis.Langmodel,
     lm::DataStructures.DefaultDict,
     text_ngram::AbstractVector
 ) -> Float64
-

Calculates the perplexity of the given text.

This is simply 2 raised to the power of the cross-entropy (see entropy) of the text, so the arguments are the same as for entropy

source

Preprocessing

For preprocessing, the following functions are provided:

TextAnalysis.everygramFunction
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)where { T <: AbstractString}

Return all possible ngrams generated from a sequence of items, as an Array{String,1}

Example

julia> seq = ["To","be","or","not"]
+

Calculates the perplexity of the given text.

This is simply 2 raised to the power of the cross-entropy (see entropy) of the text, so the arguments are the same as for entropy

source
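
A short sketch tying entropy and perplexity together, reusing the model and fit objects from the MLE sketch above (the evaluation bigrams are illustrative):

julia> test_bigrams = ["my name", "name is"]     # evaluation text as n-grams of equal order

julia> H = entropy(model, fit, test_bigrams)

julia> perplexity(model, fit, test_bigrams) ≈ 2^H
true
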

Preprocessing

For preprocessing, the following functions are provided:

TextAnalysis.everygramFunction
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)where { T <: AbstractString}

Return all possible ngrams generated from a sequence of items, as an Array{String,1}

Example

julia> seq = ["To","be","or","not"]
 julia> a = everygram(seq,min_len=1, max_len=-1)
  10-element Array{Any,1}:
   "or"          
@@ -73,7 +73,7 @@
   "be or"       
   "be or not"   
   "To be or"    
-  "To be or not"
source
TextAnalysis.padding_ngramFunction
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol ="</s>") where { T <: AbstractString}

padding_ngram is used to pad both the left and right of a sentence, outputting ngrams of order n

It also pads the original input Array of strings

Example

julia> example = ["1","2","3","4","5"]
+  "To be or not"
source
TextAnalysis.padding_ngramFunction
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol ="</s>") where { T <: AbstractString}

padding_ngram is used to pad both the left and right of a sentence, outputting ngrams of order n

It also pads the original input Array of strings

Example

julia> example = ["1","2","3","4","5"]
 
 julia> padding_ngram(example,2,pad_left=true,pad_right=true)
  6-element Array{Any,1}:
@@ -82,7 +82,7 @@
   "2 3"   
   "3 4"   
   "4 5"   
-  "5 </s>"
source

Vocabulary

Struct to store a language model's vocabulary

It checks membership and filters items by comparing their counts to a cutoff value

It also adds a special "unknown" token to which unseen words are mapped

julia> using TextAnalysis
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]11-element Vector{String}: + "5 </s>"
source

Vocabulary

Struct to store a language model's vocabulary

It checks membership and filters items by comparing their counts to a cutoff value

It also adds a special "unknown" token to which unseen words are mapped

julia> using TextAnalysis
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]11-element Vector{String}: "a" "c" "-" @@ -105,4 +105,4 @@ "<unk>" "d" "c" - "a"
+ "a"
diff --git a/dev/assets/documenter.js b/dev/assets/documenter.js index f531160..82252a1 100644 --- a/dev/assets/documenter.js +++ b/dev/assets/documenter.js @@ -4,7 +4,6 @@ requirejs.config({ 'highlight-julia': 'https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.8.0/languages/julia.min', 'headroom': 'https://cdnjs.cloudflare.com/ajax/libs/headroom/0.12.0/headroom.min', 'jqueryui': 'https://cdnjs.cloudflare.com/ajax/libs/jqueryui/1.13.2/jquery-ui.min', - 'minisearch': 'https://cdn.jsdelivr.net/npm/minisearch@6.1.0/dist/umd/index.min', 'katex-auto-render': 'https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.16.8/contrib/auto-render.min', 'jquery': 'https://cdnjs.cloudflare.com/ajax/libs/jquery/3.7.0/jquery.min', 'headroom-jquery': 'https://cdnjs.cloudflare.com/ajax/libs/headroom/0.12.0/jQuery.headroom.min', @@ -78,48 +77,54 @@ require(['jquery'], function($) { let timer = 0; var isExpanded = true; -$(document).on("click", ".docstring header", function () { - let articleToggleTitle = "Expand docstring"; - - debounce(() => { - if ($(this).siblings("section").is(":visible")) { - $(this) - .find(".docstring-article-toggle-button") - .removeClass("fa-chevron-down") - .addClass("fa-chevron-right"); - } else { - $(this) - .find(".docstring-article-toggle-button") - .removeClass("fa-chevron-right") - .addClass("fa-chevron-down"); +$(document).on( + "click", + ".docstring .docstring-article-toggle-button", + function () { + let articleToggleTitle = "Expand docstring"; + const parent = $(this).parent(); + + debounce(() => { + if (parent.siblings("section").is(":visible")) { + parent + .find("a.docstring-article-toggle-button") + .removeClass("fa-chevron-down") + .addClass("fa-chevron-right"); + } else { + parent + .find("a.docstring-article-toggle-button") + .removeClass("fa-chevron-right") + .addClass("fa-chevron-down"); - articleToggleTitle = "Collapse docstring"; - } + articleToggleTitle = "Collapse docstring"; + } - $(this) - .find(".docstring-article-toggle-button") - .prop("title", articleToggleTitle); - $(this).siblings("section").slideToggle(); - }); -}); + parent + .children(".docstring-article-toggle-button") + .prop("title", articleToggleTitle); + parent.siblings("section").slideToggle(); + }); + } +); -$(document).on("click", ".docs-article-toggle-button", function () { +$(document).on("click", ".docs-article-toggle-button", function (event) { let articleToggleTitle = "Expand docstring"; let navArticleToggleTitle = "Expand all docstrings"; + let animationSpeed = event.noToggleAnimation ? 
0 : 400; debounce(() => { if (isExpanded) { $(this).removeClass("fa-chevron-up").addClass("fa-chevron-down"); - $(".docstring-article-toggle-button") + $("a.docstring-article-toggle-button") .removeClass("fa-chevron-down") .addClass("fa-chevron-right"); isExpanded = false; - $(".docstring section").slideUp(); + $(".docstring section").slideUp(animationSpeed); } else { $(this).removeClass("fa-chevron-down").addClass("fa-chevron-up"); - $(".docstring-article-toggle-button") + $("a.docstring-article-toggle-button") .removeClass("fa-chevron-right") .addClass("fa-chevron-down"); @@ -127,7 +132,7 @@ $(document).on("click", ".docs-article-toggle-button", function () { articleToggleTitle = "Collapse docstring"; navArticleToggleTitle = "Collapse all docstrings"; - $(".docstring section").slideDown(); + $(".docstring section").slideDown(animationSpeed); } $(this).prop("title", navArticleToggleTitle); @@ -224,224 +229,474 @@ $(document).ready(function () { }) //////////////////////////////////////////////////////////////////////////////// -require(['jquery', 'minisearch'], function($, minisearch) { - -// In general, most search related things will have "search" as a prefix. -// To get an in-depth about the thought process you can refer: https://hetarth02.hashnode.dev/series/gsoc +require(['jquery'], function($) { -let results = []; -let timer = undefined; +$(document).ready(function () { + let meta = $("div[data-docstringscollapsed]").data(); -let data = documenterSearchIndex["docs"].map((x, key) => { - x["id"] = key; // minisearch requires a unique for each object - return x; + if (meta?.docstringscollapsed) { + $("#documenter-article-toggle-button").trigger({ + type: "click", + noToggleAnimation: true, + }); + } }); -// list below is the lunr 2.1.3 list minus the intersect with names(Base) -// (all, any, get, in, is, only, which) and (do, else, for, let, where, while, with) -// ideally we'd just filter the original list but it's not available as a variable -const stopWords = new Set([ - "a", - "able", - "about", - "across", - "after", - "almost", - "also", - "am", - "among", - "an", - "and", - "are", - "as", - "at", - "be", - "because", - "been", - "but", - "by", - "can", - "cannot", - "could", - "dear", - "did", - "does", - "either", - "ever", - "every", - "from", - "got", - "had", - "has", - "have", - "he", - "her", - "hers", - "him", - "his", - "how", - "however", - "i", - "if", - "into", - "it", - "its", - "just", - "least", - "like", - "likely", - "may", - "me", - "might", - "most", - "must", - "my", - "neither", - "no", - "nor", - "not", - "of", - "off", - "often", - "on", - "or", - "other", - "our", - "own", - "rather", - "said", - "say", - "says", - "she", - "should", - "since", - "so", - "some", - "than", - "that", - "the", - "their", - "them", - "then", - "there", - "these", - "they", - "this", - "tis", - "to", - "too", - "twas", - "us", - "wants", - "was", - "we", - "were", - "what", - "when", - "who", - "whom", - "why", - "will", - "would", - "yet", - "you", - "your", -]); - -let index = new minisearch({ - fields: ["title", "text"], // fields to index for full-text search - storeFields: ["location", "title", "text", "category", "page"], // fields to return with search results - processTerm: (term) => { - let word = stopWords.has(term) ? 
null : term; - if (word) { - // custom trimmer that doesn't strip @ and !, which are used in julia macro and function names - word = word - .replace(/^[^a-zA-Z0-9@!]+/, "") - .replace(/[^a-zA-Z0-9@!]+$/, ""); - } +}) +//////////////////////////////////////////////////////////////////////////////// +require(['jquery'], function($) { - return word ?? null; - }, - // add . as a separator, because otherwise "title": "Documenter.Anchors.add!", would not find anything if searching for "add!", only for the entire qualification - tokenize: (string) => string.split(/[\s\-\.]+/), - // options which will be applied during the search - searchOptions: { - boost: { title: 100 }, - fuzzy: 2, +/* +To get an in-depth about the thought process you can refer: https://hetarth02.hashnode.dev/series/gsoc + +PSEUDOCODE: + +Searching happens automatically as the user types or adjusts the selected filters. +To preserve responsiveness, as much as possible of the slow parts of the search are done +in a web worker. Searching and result generation are done in the worker, and filtering and +DOM updates are done in the main thread. The filters are in the main thread as they should +be very quick to apply. This lets filters be changed without re-searching with minisearch +(which is possible even if filtering is on the worker thread) and also lets filters be +changed _while_ the worker is searching and without message passing (neither of which are +possible if filtering is on the worker thread) + +SEARCH WORKER: + +Import minisearch + +Build index + +On message from main thread + run search + find the first 200 unique results from each category, and compute their divs for display + note that this is necessary and sufficient information for the main thread to find the + first 200 unique results from any given filter set + post results to main thread + +MAIN: + +Launch worker + +Declare nonconstant globals (worker_is_running, last_search_text, unfiltered_results) + +On text update + if worker is not running, launch_search() + +launch_search + set worker_is_running to true, set last_search_text to the search text + post the search query to worker + +on message from worker + if last_search_text is not the same as the text in the search field, + the latest search result is not reflective of the latest search query, so update again + launch_search() + otherwise + set worker_is_running to false + + regardless, display the new search results to the user + save the unfiltered_results as a global + update_search() + +on filter click + adjust the filter selection + update_search() + +update_search + apply search filters by looping through the unfiltered_results and finding the first 200 + unique results that match the filters + + Update the DOM +*/ + +/////// SEARCH WORKER /////// + +function worker_function(documenterSearchIndex, documenterBaseURL, filters) { + importScripts( + "https://cdn.jsdelivr.net/npm/minisearch@6.1.0/dist/umd/index.min.js" + ); + + let data = documenterSearchIndex.map((x, key) => { + x["id"] = key; // minisearch requires a unique for each object + return x; + }); + + // list below is the lunr 2.1.3 list minus the intersect with names(Base) + // (all, any, get, in, is, only, which) and (do, else, for, let, where, while, with) + // ideally we'd just filter the original list but it's not available as a variable + const stopWords = new Set([ + "a", + "able", + "about", + "across", + "after", + "almost", + "also", + "am", + "among", + "an", + "and", + "are", + "as", + "at", + "be", + "because", + "been", + "but", 
+ "by", + "can", + "cannot", + "could", + "dear", + "did", + "does", + "either", + "ever", + "every", + "from", + "got", + "had", + "has", + "have", + "he", + "her", + "hers", + "him", + "his", + "how", + "however", + "i", + "if", + "into", + "it", + "its", + "just", + "least", + "like", + "likely", + "may", + "me", + "might", + "most", + "must", + "my", + "neither", + "no", + "nor", + "not", + "of", + "off", + "often", + "on", + "or", + "other", + "our", + "own", + "rather", + "said", + "say", + "says", + "she", + "should", + "since", + "so", + "some", + "than", + "that", + "the", + "their", + "them", + "then", + "there", + "these", + "they", + "this", + "tis", + "to", + "too", + "twas", + "us", + "wants", + "was", + "we", + "were", + "what", + "when", + "who", + "whom", + "why", + "will", + "would", + "yet", + "you", + "your", + ]); + + let index = new MiniSearch({ + fields: ["title", "text"], // fields to index for full-text search + storeFields: ["location", "title", "text", "category", "page"], // fields to return with results processTerm: (term) => { let word = stopWords.has(term) ? null : term; if (word) { + // custom trimmer that doesn't strip @ and !, which are used in julia macro and function names word = word .replace(/^[^a-zA-Z0-9@!]+/, "") .replace(/[^a-zA-Z0-9@!]+$/, ""); + + word = word.toLowerCase(); } return word ?? null; }, + // add . as a separator, because otherwise "title": "Documenter.Anchors.add!", would not + // find anything if searching for "add!", only for the entire qualification tokenize: (string) => string.split(/[\s\-\.]+/), - }, -}); + // options which will be applied during the search + searchOptions: { + prefix: true, + boost: { title: 100 }, + fuzzy: 2, + }, + }); + + index.addAll(data); + + /** + * Used to map characters to HTML entities. + * Refer: https://github.com/lodash/lodash/blob/main/src/escape.ts + */ + const htmlEscapes = { + "&": "&", + "<": "<", + ">": ">", + '"': """, + "'": "'", + }; + + /** + * Used to match HTML entities and HTML characters. + * Refer: https://github.com/lodash/lodash/blob/main/src/escape.ts + */ + const reUnescapedHtml = /[&<>"']/g; + const reHasUnescapedHtml = RegExp(reUnescapedHtml.source); + + /** + * Escape function from lodash + * Refer: https://github.com/lodash/lodash/blob/main/src/escape.ts + */ + function escape(string) { + return string && reHasUnescapedHtml.test(string) + ? string.replace(reUnescapedHtml, (chr) => htmlEscapes[chr]) + : string || ""; + } -index.addAll(data); + /** + * RegX escape function from MDN + * Refer: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#escaping + */ + function escapeRegExp(string) { + return string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // $& means the whole matched string + } -let filters = [...new Set(data.map((x) => x.category))]; -var modal_filters = make_modal_body_filters(filters); -var filter_results = []; + /** + * Make the result component given a minisearch result data object and the value + * of the search input as queryString. To view the result object structure, refer: + * https://lucaong.github.io/minisearch/modules/_minisearch_.html#searchresult + * + * @param {object} result + * @param {string} querystring + * @returns string + */ + function make_search_result(result, querystring) { + let search_divider = `
`; + let display_link = + result.location.slice(Math.max(0), Math.min(50, result.location.length)) + + (result.location.length > 30 ? "..." : ""); // To cut-off the link because it messes with the overflow of the whole div -$(document).on("keyup", ".documenter-search-input", function (event) { - // Adding a debounce to prevent disruptions from super-speed typing! - debounce(() => update_search(filter_results), 300); + if (result.page !== "") { + display_link += ` (${result.page})`; + } + searchstring = escapeRegExp(querystring); + let textindex = new RegExp(`${searchstring}`, "i").exec(result.text); + let text = + textindex !== null + ? result.text.slice( + Math.max(textindex.index - 100, 0), + Math.min( + textindex.index + querystring.length + 100, + result.text.length + ) + ) + : ""; // cut-off text before and after from the match + + text = text.length ? escape(text) : ""; + + let display_result = text.length + ? "..." + + text.replace( + new RegExp(`${escape(searchstring)}`, "i"), // For first occurrence + '$&' + ) + + "..." + : ""; // highlights the match + + let in_code = false; + if (!["page", "section"].includes(result.category.toLowerCase())) { + in_code = true; + } + + // We encode the full url to escape some special characters which can lead to broken links + let result_div = ` + +
+
${escape(result.title)}
+
${result.category}
+
+

+ ${display_result} +

+
+ ${display_link} +
+
+ ${search_divider} + `; + + return result_div; + } + + self.onmessage = function (e) { + let query = e.data; + let results = index.search(query, { + filter: (result) => { + // Only return relevant results + return result.score >= 1; + }, + combineWith: "AND", + }); + + // Pre-filter to deduplicate and limit to 200 per category to the extent + // possible without knowing what the filters are. + let filtered_results = []; + let counts = {}; + for (let filter of filters) { + counts[filter] = 0; + } + let present = {}; + + for (let result of results) { + cat = result.category; + cnt = counts[cat]; + if (cnt < 200) { + id = cat + "---" + result.location; + if (present[id]) { + continue; + } + present[id] = true; + filtered_results.push({ + location: result.location, + category: cat, + div: make_search_result(result, query), + }); + } + } + + postMessage(filtered_results); + }; +} + +// `worker = Threads.@spawn worker_function(documenterSearchIndex)`, but in JavaScript! +const filters = [ + ...new Set(documenterSearchIndex["docs"].map((x) => x.category)), +]; +const worker_str = + "(" + + worker_function.toString() + + ")(" + + JSON.stringify(documenterSearchIndex["docs"]) + + "," + + JSON.stringify(documenterBaseURL) + + "," + + JSON.stringify(filters) + + ")"; +const worker_blob = new Blob([worker_str], { type: "text/javascript" }); +const worker = new Worker(URL.createObjectURL(worker_blob)); + +/////// SEARCH MAIN /////// + +// Whether the worker is currently handling a search. This is a boolean +// as the worker only ever handles 1 or 0 searches at a time. +var worker_is_running = false; + +// The last search text that was sent to the worker. This is used to determine +// if the worker should be launched again when it reports back results. +var last_search_text = ""; + +// The results of the last search. This, in combination with the state of the filters +// in the DOM, is used compute the results to display on calls to update_search. +var unfiltered_results = []; + +// Which filter is currently selected +var selected_filter = ""; + +$(document).on("input", ".documenter-search-input", function (event) { + if (!worker_is_running) { + launch_search(); + } }); +function launch_search() { + worker_is_running = true; + last_search_text = $(".documenter-search-input").val(); + worker.postMessage(last_search_text); +} + +worker.onmessage = function (e) { + if (last_search_text !== $(".documenter-search-input").val()) { + launch_search(); + } else { + worker_is_running = false; + } + + unfiltered_results = e.data; + update_search(); +}; + $(document).on("click", ".search-filter", function () { if ($(this).hasClass("search-filter-selected")) { - $(this).removeClass("search-filter-selected"); + selected_filter = ""; } else { - $(this).addClass("search-filter-selected"); + selected_filter = $(this).text().toLowerCase(); } - // Adding a debounce to prevent disruptions from crazy clicking! - debounce(() => get_filters(), 300); + // This updates search results and toggles classes for UI: + update_search(); }); -/** - * A debounce function, takes a function and an optional timeout in milliseconds - * - * @function callback - * @param {number} timeout - */ -function debounce(callback, timeout = 300) { - clearTimeout(timer); - timer = setTimeout(callback, timeout); -} - /** * Make/Update the search component - * - * @param {string[]} selected_filters */ -function update_search(selected_filters = []) { - let initial_search_body = ` -
Type something to get started!
- `; - +function update_search() { let querystring = $(".documenter-search-input").val(); if (querystring.trim()) { - results = index.search(querystring, { - filter: (result) => { - // Filtering results - if (selected_filters.length === 0) { - return result.score >= 1; - } else { - return ( - result.score >= 1 && selected_filters.includes(result.category) - ); - } - }, - }); + if (selected_filter == "") { + results = unfiltered_results; + } else { + results = unfiltered_results.filter((result) => { + return selected_filter == result.category.toLowerCase(); + }); + } let search_result_container = ``; + let modal_filters = make_modal_body_filters(); let search_divider = `
`; if (results.length) { @@ -449,19 +704,23 @@ function update_search(selected_filters = []) { let count = 0; let search_results = ""; - results.forEach(function (result) { - if (result.location) { - // Checking for duplication of results for the same page - if (!links.includes(result.location)) { - search_results += make_search_result(result, querystring); - count++; - } - + for (var i = 0, n = results.length; i < n && count < 200; ++i) { + let result = results[i]; + if (result.location && !links.includes(result.location)) { + search_results += result.div; + count++; links.push(result.location); } - }); + } - let result_count = `
${count} result(s)
`; + if (count == 1) { + count_str = "1 result"; + } else if (count == 200) { + count_str = "200+ results"; + } else { + count_str = count + " results"; + } + let result_count = `
${count_str}
`; search_result_container = `
@@ -490,125 +749,37 @@ function update_search(selected_filters = []) { $(".search-modal-card-body").html(search_result_container); } else { - filter_results = []; - modal_filters = make_modal_body_filters(filters, filter_results); - if (!$(".search-modal-card-body").hasClass("is-justify-content-center")) { $(".search-modal-card-body").addClass("is-justify-content-center"); } - $(".search-modal-card-body").html(initial_search_body); + $(".search-modal-card-body").html(` +
Type something to get started!
+ `); } } /** * Make the modal filter html * - * @param {string[]} filters - * @param {string[]} selected_filters * @returns string */ -function make_modal_body_filters(filters, selected_filters = []) { - let str = ``; - - filters.forEach((val) => { - if (selected_filters.includes(val)) { - str += `${val}`; - } else { - str += `${val}`; - } - }); +function make_modal_body_filters() { + let str = filters + .map((val) => { + if (selected_filter == val.toLowerCase()) { + return `${val}`; + } else { + return `${val}`; + } + }) + .join(""); - let filter_html = ` + return `
Filters: ${str} -
- `; - - return filter_html; -} - -/** - * Make the result component given a minisearch result data object and the value of the search input as queryString. - * To view the result object structure, refer: https://lucaong.github.io/minisearch/modules/_minisearch_.html#searchresult - * - * @param {object} result - * @param {string} querystring - * @returns string - */ -function make_search_result(result, querystring) { - let search_divider = `
`; - let display_link = - result.location.slice(Math.max(0), Math.min(50, result.location.length)) + - (result.location.length > 30 ? "..." : ""); // To cut-off the link because it messes with the overflow of the whole div - - if (result.page !== "") { - display_link += ` (${result.page})`; - } - - let textindex = new RegExp(`\\b${querystring}\\b`, "i").exec(result.text); - let text = - textindex !== null - ? result.text.slice( - Math.max(textindex.index - 100, 0), - Math.min( - textindex.index + querystring.length + 100, - result.text.length - ) - ) - : ""; // cut-off text before and after from the match - - let display_result = text.length - ? "..." + - text.replace( - new RegExp(`\\b${querystring}\\b`, "i"), // For first occurrence - '$&' - ) + - "..." - : ""; // highlights the match - - let in_code = false; - if (!["page", "section"].includes(result.category.toLowerCase())) { - in_code = true; - } - - // We encode the full url to escape some special characters which can lead to broken links - let result_div = ` - -
-
${result.title}
-
${result.category}
-
-

- ${display_result} -

-
- ${display_link} -
-
- ${search_divider} - `; - - return result_div; -} - -/** - * Get selected filters, remake the filter html and lastly update the search modal - */ -function get_filters() { - let ele = $(".search-filters .search-filter-selected").get(); - filter_results = ele.map((x) => $(x).text().toLowerCase()); - modal_filters = make_modal_body_filters(filters, filter_results); - update_search(filter_results); +
`; } }) @@ -635,103 +806,107 @@ $(document).ready(function () { //////////////////////////////////////////////////////////////////////////////// require(['jquery'], function($) { -let search_modal_header = ` - -`; - -let initial_search_body = ` -
Type something to get started!
-`; - -let search_modal_footer = ` - -`; - -$(document.body).append( - ` - diff --git a/dev/corpus/index.html b/dev/corpus/index.html index f41844b..93c734c 100644 --- a/dev/corpus/index.html +++ b/dev/corpus/index.html @@ -1,5 +1,5 @@ -Corpus · TextAnalysis

Creating a Corpus

Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:

TextAnalysis.CorpusType
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
+Corpus · TextAnalysis

Creating a Corpus

Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:

TextAnalysis.CorpusType
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
 		              StringDocument("Document 2")])
 A Corpus with 2 documents:
  * 2 StringDocument's
@@ -8,7 +8,7 @@
  * 0 NGramDocument's
 
 Corpus's lexicon contains 0 tokens
-Corpus's index contains 0 tokens
source

Standardizing a Corpus

A Corpus may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the standardize! function:

TextAnalysis.standardize!Function
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
+Corpus's index contains 0 tokens
source

Standardizing a Corpus

A Corpus may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the standardize! function:

TextAnalysis.standardize!Function
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
 		              TokenDocument("Document 2"),
 		              NGramDocument("Document 3")])
 A Corpus with 3 documents:
@@ -33,7 +33,7 @@
  * 3 NGramDocument's
 
 Corpus's lexicon contains 0 tokens
-Corpus's index contains 0 tokens
source

Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("Document ..!!"), +Corpus's index contains 0 tokens
source

Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("Document ..!!"), StringDocument("Document ..!!")])A Corpus with 2 documents: * 2 StringDocument's * 0 FileDocument's @@ -99,4 +99,4 @@ julia> timestamps!(crps, "Now")

Additionally, you can specify the metadata fields for each document in a Corpus individually:

julia> languages!(crps, [Languages.German(), Languages.English()])
 julia> titles!(crps, ["", "Untitled"])
 julia> authors!(crps, ["Ich", "You"])
-julia> timestamps!(crps, ["Unbekannt", "2018"])
+julia> timestamps!(crps, ["Unbekannt", "2018"]) diff --git a/dev/documents/index.html b/dev/documents/index.html index 805f8d5..a6081e7 100644 --- a/dev/documents/index.html +++ b/dev/documents/index.html @@ -1,5 +1,5 @@ -Documents · TextAnalysis

Creating Documents

The basic unit of text analysis is a document. The TextAnalysis package allows one to work with documents stored in a variety of formats:

  • FileDocument : A document represented using a plain text file on disk
  • StringDocument : A document represented using a UTF8 String stored in RAM
  • TokenDocument : A document represented as a sequence of UTF8 tokens
  • NGramDocument : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts
Note

These formats represent a hierarchy: you can always move down the hierarchy, but can generally not move up the hierarchy. A FileDocument can easily become a StringDocument, but an NGramDocument cannot easily become a FileDocument.

Creating any of the four basic types of documents is very easy:

TextAnalysis.StringDocumentType
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
+Documents · TextAnalysis

Creating Documents

The basic unit of text analysis is a document. The TextAnalysis package allows one to work with documents stored in a variety of formats:

  • FileDocument : A document represented using a plain text file on disk
  • StringDocument : A document represented using a UTF8 String stored in RAM
  • TokenDocument : A document represented as a sequence of UTF8 tokens
  • NGramDocument : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts
Note

These formats represent a hierarchy: you can always move down the hierarchy, but can generally not move up the hierarchy. A FileDocument can easily become a StringDocument, but an NGramDocument cannot easily become a FileDocument.
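
For instance, moving down the hierarchy can look like this (a sketch; it assumes /usr/share/dict/words exists on your system, as in the FileDocument example below):

julia> using TextAnalysis

julia> fd = FileDocument("/usr/share/dict/words")

julia> sd = StringDocument(text(fd))     # file -> string

julia> td = TokenDocument(text(sd))      # string -> sequence of tokens

julia> nd = NGramDocument(text(sd))      # string -> bag of n-grams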

Creating any of the four basic types of documents is very easy:

TextAnalysis.StringDocumentType
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
 "To be or not to be..."
 
 julia> sd = StringDocument(str)
@@ -8,7 +8,7 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: To be or not to be...
source
TextAnalysis.FileDocumentType
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
+ * Snippet: To be or not to be...
source
TextAnalysis.FileDocumentType
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
 "/usr/share/dict/words"
 
 julia> fd = FileDocument(pathname)
@@ -17,7 +17,7 @@
  * Title: /usr/share/dict/words
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
source
TextAnalysis.TokenDocumentType
TokenDocument(txt::AbstractString)
 TokenDocument(txt::AbstractString, dm::DocumentMetadata)
 TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represents a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
 6-element Array{String,1}:
@@ -34,7 +34,7 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.NGramDocumentType
NGramDocument(txt::AbstractString, n::Integer=1)
 NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
 NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of n-grams, which are UTF8 n-grams and map to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                      "or" => 1, "not" => 1,
@@ -53,7 +53,7 @@
  * Title: Untitled Document
  * Author: Unknown Author
  * Timestamp: Unknown Time
- * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source

An NGramDocument consisting of bigrams or any higher order representation N can be easily created by passing the parameter N to NGramDocument

julia> using TextAnalysis
julia> NGramDocument("To be or not to be ...", 2)A NGramDocument{AbstractString} + * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source

An NGramDocument consisting of bigrams or any higher order representation N can be easily created by passing the parameter N to NGramDocument

julia> using TextAnalysis
julia> NGramDocument("To be or not to be ...", 2)A NGramDocument{AbstractString} * Language: Languages.English() * Title: Untitled Document * Author: Unknown Author @@ -138,8 +138,8 @@ * Title: Untitled Document * Author: Unknown Author * Timestamp: Unknown Time - * Snippet: This document has too foo words
julia> language!(sd, Languages.Spanish())ERROR: UndefVarError: `Languages` not defined
julia> title!(sd, "El Cid")"El Cid"
julia> author!(sd, "Desconocido")"Desconocido"
julia> timestamp!(sd, "Desconocido")"Desconocido"

Preprocessing Documents

Having easy access to the text of a document and its metadata is very important, but most text analysis tasks require some amount of preprocessing.

At a minimum, your text source may contain corrupt characters. You can remove these using the remove_corrupt_utf8!() function:
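
For example, on the sd StringDocument created earlier, the call is simply (a sketch; with clean input it leaves the text unchanged):

julia> remove_corrupt_utf8!(sd)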

Alternatively, you may want to edit the text to remove items that are hard to process automatically. For example, our sample text sentence taken from Hamlet has three periods that we might like to discard. We can remove this kind of punctuation using the prepare!() function:

julia> using TextAnalysis
julia> str = StringDocument("here are some punctuations !!!...")A StringDocument{String} + * Snippet: This document has too foo words
julia> language!(sd, Languages.Spanish())ERROR: UndefVarError: `Languages` not defined
julia> title!(sd, "El Cid")"El Cid"
julia> author!(sd, "Desconocido")"Desconocido"
julia> timestamp!(sd, "Desconocido")"Desconocido"
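
The UndefVarError above only means that the Languages package was not loaded in that session; importing it first makes the call work (a sketch):

julia> using Languages

julia> language!(sd, Languages.Spanish())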

Preprocessing Documents

Having easy access to the text of a document and its metadata is very important, but most text analysis tasks require some amount of preprocessing.

At a minimum, your text source may contain corrupt characters. You can remove these using the remove_corrupt_utf8!() function:

Alternatively, you may want to edit the text to remove items that are hard to process automatically. For example, our sample text sentence taken from Hamlet has three periods that we might like to discard. We can remove this kind of punctuation using the prepare!() function:

julia> using TextAnalysis
julia> str = StringDocument("here are some punctuations !!!...")A StringDocument{String} * Language: Languages.English() * Title: Untitled Document * Author: Unknown Author @@ -154,4 +154,4 @@ * Title: Untitled Document * Author: Unknown Author * Timestamp: Unknown Time - * Snippet: They write, it writes
julia> stem!(sd)
julia> text(sd)"They write , it write"
+ * Snippet: They write, it writes
julia> stem!(sd)
julia> text(sd)"They write , it write" diff --git a/dev/evaluation_metrics/index.html b/dev/evaluation_metrics/index.html index 00e783b..d9594af 100644 --- a/dev/evaluation_metrics/index.html +++ b/dev/evaluation_metrics/index.html @@ -1,17 +1,17 @@ -Evaluation Metrics · TextAnalysis

Evaluation Metrics

Natural Language Processing tasks require certain Evaluation Metrics. As of now TextAnalysis provides the following evaluation metrics.

ROUGE-N, ROUGE-L, ROUGE-L-Summary

These metrics are evaluated based on the overlap of N-grams between the system and reference summaries.

Base.argmaxFunction
argmax(scores::Vector{Score})::Score

Returns the maximum of the given Scores, compared by their precision field

source
TextAnalysis.rouge_nFunction
rouge_n(
+Evaluation Metrics · TextAnalysis

Evaluation Metrics

Natural Language Processing tasks require certain Evaluation Metrics. As of now TextAnalysis provides the following evaluation metrics.

ROUGE-N, ROUGE-L, ROUGE-L-Summary

These metrics are evaluated based on the overlap of N-grams between the system and reference summaries.

Base.argmaxFunction
argmax(scores::Vector{Score})::Score

Returns the maximum of the given Scores, compared by their precision field

source
TextAnalysis.rouge_nFunction
rouge_n(
     references::Vector{<:AbstractString}, 
     candidate::AbstractString, 
     n::Int; 
     lang::Language
-)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

The function takes the following arguments -

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against reference summaries.
  • n::Integer = Order of NGrams
  • lang::Language = Language of the text, useful while generating N-grams. Default value is Languages.English()

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

source
TextAnalysis.rouge_l_sentenceFunction
rouge_l_sentence(
+)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

The function takes the following arguments -

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against reference summaries.
  • n::Integer = Order of NGrams
  • lang::Language = Language of the text, useful while generating N-grams. Default value is Languages.English()

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

source
TextAnalysis.rouge_l_sentenceFunction
rouge_l_sentence(
     references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
     weighted=false, weight_func=sqrt,
     lang=Languages.English()
-)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, uses a power function; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

source
TextAnalysis.rouge_l_summaryFunction
rouge_l_summary(
+)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, uses a power function; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

source
using TextAnalysis
+)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at summary level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence(), rouge_n

source
using TextAnalysis
 
 candidate_summary =  "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
 reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the  BRIC(S) and have been invited to the G20 summit."]
@@ -21,14 +21,14 @@
     rouge_n(reference_summaries, candidate_summary, 1)
 ] .|> argmax
2-element Vector{Score}:
  Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
- Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)

BLEU (bilingual evaluation understudy)

TextAnalysis.bleu_scoreFunction
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length and reference_length

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. 2004 smoothing.

Example:

one_doc_references = [
+ Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)

BLEU (bilingual evaluation understudy)

TextAnalysis.bleu_scoreFunction
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length and reference_length

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. 2004 smoothing.

Example:

one_doc_references = [
     ["apple", "is", "apple"],
     ["apple", "is", "a", "fruit"]
 ]  
 one_doc_translation = [
     "apple", "is", "appl"
 ]
-bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source

NLTK sample

    using TextAnalysis
+bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source

NLTK sample

    using TextAnalysis
 
     reference1 = [
         "It", "is", "a", "guide", "to", "action", "that",
@@ -53,4 +53,4 @@
         "obeys", "the", "commands", "of", "the", "party"
     ]
 
-    score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
+ score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
diff --git a/dev/example/index.html b/dev/example/index.html index 962c7f3..917982d 100644 --- a/dev/example/index.html +++ b/dev/example/index.html @@ -1,5 +1,5 @@ -Extended Example · TextAnalysis

Extended Usage Example

To show you how text analysis might work in practice, we're going to work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition.

    using TextAnalysis, MultivariateStats, Clustering
+Extended Example · TextAnalysis

Extended Usage Example

To show you how text analysis might work in practice, we're going to work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition.

    using TextAnalysis, MultivariateStats, Clustering
 
     crps = DirectoryCorpus("sotu")
 
@@ -21,4 +21,4 @@
 
     T = tf_idf(D)
 
-    cl = kmeans(T, 5)
+ cl = kmeans(T, 5)
diff --git a/dev/features/index.html b/dev/features/index.html index 615fcdc..dfd03e4 100644 --- a/dev/features/index.html +++ b/dev/features/index.html @@ -1,5 +1,5 @@ -Features · TextAnalysis

Creating a Document Term Matrix

Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), +Features · TextAnalysis

Creating a Document Term Matrix

Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), StringDocument("To become or not to become")])A Corpus with 2 documents: * 2 StringDocument's * 0 FileDocument's @@ -38,7 +38,7 @@ 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you:

julia> hash_dtv(crps[1])
 1×100 Array{Int64,2}:
- 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

TF (Term Frequency)

Often we need to find out what proportion of a document is contributed by each term. This can be done using the term frequency function:

TextAnalysis.tfFunction
tf(dtm::DocumentTermMatrix)
+ 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

TF (Term Frequency)

Often we need to find out what proportion of a document is contributed by each term. This can be done using the term frequency function:

TextAnalysis.tfFunction
tf(dtm::DocumentTermMatrix)
 tf(dtm::SparseMatrixCSC{Real})
 tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
@@ -58,7 +58,7 @@
   [1, 5]  =  0.166667
   [2, 5]  =  0.166667
   [1, 6]  =  0.166667
-  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

source

The parameter dtm can be of type DocumentTermMatrix, SparseMatrixCSC or Matrix

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), + [2, 6] = 0.166667

See also: tf!, tf_idf, tf_idf!

source

The parameter dtm can be of type DocumentTermMatrix, SparseMatrixCSC or Matrix

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), StringDocument("To become or not to become")])A Corpus with 2 documents: * 2 StringDocument's * 0 FileDocument's @@ -68,7 +68,7 @@ Corpus's lexicon contains 0 tokens Corpus's index contains 0 tokens
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)A 2 X 6 DocumentTermMatrix
julia> tf(m)2×6 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries: 0.166667 0.333333 ⋅ 0.166667 0.166667 0.166667 - 0.166667 ⋅ 0.333333 0.166667 0.166667 0.166667

TF-IDF (Term Frequency - Inverse Document Frequency)

TextAnalysis.tf_idfFunction
tf_idf(dtm::DocumentTermMatrix)
+ 0.166667   ⋅        0.333333  0.166667  0.166667  0.166667

TF-IDF (Term Frequency - Inverse Document Frequency)

TextAnalysis.tf_idfFunction
tf_idf(dtm::DocumentTermMatrix)
 tf_idf(dtm::SparseMatrixCSC{Real})
 tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
@@ -88,7 +88,7 @@
   [1, 5]  =  0.0
   [2, 5]  =  0.0
   [1, 6]  =  0.0
-  [2, 6]  =  0.0

See also: tf!, tf_idf, tf_idf!

source

In many cases, raw word counts are not appropriate for use because:

  • (A) Some documents are longer than other documents
  • (B) Some words are more frequent than other words

You can work around this by performing TF-IDF on a DocumentTermMatrix:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), + [2, 6] = 0.0

See also: tf!, tf_idf, tf_idf!

source

In many cases, raw word counts are not appropriate for use because:

  • (A) Some documents are longer than other documents
  • (B) Some words are more frequent than other words

You can work around this by performing TF-IDF on a DocumentTermMatrix:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"), StringDocument("To become or not to become")])A Corpus with 2 documents: * 2 StringDocument's * 0 FileDocument's @@ -144,9 +144,9 @@ CooMatrix{T}(doc; window, normalize) where T <: AbstractFloat CooMatrix{T}(crps, terms; window, normalize) where T <: AbstractFloat CooMatrix{T}(doc, terms; window, normalize) where T <: AbstractFloat

Remarks:

  • The sliding window used to count co-occurrences does not take sentence stops into consideration; it does, however, respect document boundaries, i.e. it does not span across documents
  • The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus (see the sketch below)
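
A minimal sketch of building a co-occurrence matrix for a single document (the keyword values are illustrative, and coom(C) returning the underlying sparse matrix is assumed here):

julia> using TextAnalysis

julia> doc = StringDocument("this is a text about an apple. there are trees near the house.")

julia> C = CooMatrix(doc, window=5, normalize=false)

julia> coom(C)    # sparse matrix of co-occurrence counts, one row and column per term
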
Note

The co-occurrence matrix does not work for an NGramDocument, or a Corpus containing an NGramDocument.

julia> C = CooMatrix(NGramDocument("A document"), window=1, normalize=false) # fails, documents are NGramDocument
-ERROR: The tokens of an NGramDocument cannot be reconstructed

Summarizer

TextAnalysis offers a simple text-rank based summarizer for its various document types.

TextAnalysis.summarizeFunction
summarize(doc [, ns])

Summarizes the document and returns ns number of sentences. It takes 2 arguments:

  • d : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) The number of sentences in the summary; defaults to 5 sentences.

By default ns is set to the value 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
+ERROR: The tokens of an NGramDocument cannot be reconstructed

Summarizer

TextAnalysis offers a simple text-rank based summarizer for its various document types.

TextAnalysis.summarizeFunction
summarize(doc [, ns])

Summarizes the document and returns ns number of sentences. It takes 2 arguments:

  • d : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) The number of sentences in the summary; defaults to 5 sentences.

By default ns is set to the value 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
 
 julia> summarize(s, ns=2)
 2-element Array{SubString{String},1}:
  "Assume this Short Document as an example."
- "This has too foo sentences."
source
+ "This has too foo sentences."
source
diff --git a/dev/index.html b/dev/index.html index bec05e1..c186f0e 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · TextAnalysis

Preface

This manual is designed to get you started doing text analysis in Julia. It assumes that you are already familiar with the basic methods of text analysis.

Installation

The TextAnalysis package can be installed using Julia's package manager:

using Pkg
Pkg.add("TextAnalysis")
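
Equivalently, from the Pkg REPL mode (entered by typing ] at the julia> prompt):

pkg> add TextAnalysis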

Loading

In all of the examples that follow, we'll assume that you have the TextAnalysis package fully loaded. This means we assume that you have typed

using TextAnalysis

before every snippet of code.

TextModels

The TextModels package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies.
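
TextModels is a separate registered package, so if you want those models you install it the same way (shown here only as the counterpart of the installation step above):

Pkg.add("TextModels")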

diff --git a/dev/objects.inv b/dev/objects.inv
Binary files /dev/null and b/dev/objects.inv differ

diff --git a/dev/semantic/index.html b/dev/semantic/index.html

LSA: Latent Semantic Analysis

Often we want to think about documents from the perspective of semantic content. One standard approach is to perform Latent Semantic Analysis (LSA) on the corpus.

TextAnalysis.lsa — Function
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.

source

lsa uses tf_idf for statistics.

julia> using TextAnalysis
julia> crps = Corpus([
           StringDocument("this is a string document"),
           TokenDocument("this is a token document")
       ])
A Corpus with 2 documents:
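
The diff elides the rest of the corpus summary and the intermediate steps of this example; only the tail of the printed SVD appears below. A sketch of the elided calls, assuming the usual update_lexicon! step and the lsa(crps::Corpus) method documented above:

julia> update_lexicon!(crps)

julia> F = lsa(crps)   # returns an SVD factorization; its U, S and Vt factors are printed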
julia> crps = Corpus([ StringDocument("this is a string document"), TokenDocument("this is a token document") ])A Corpus with 2 documents: @@ -37,11 +37,11 @@ Vt factor: 2×6 Matrix{Float64}: 0.0 0.0 0.0 1.0 0.0 0.0 - 0.0 0.0 0.0 0.0 0.0 1.0

LDA: Latent Dirichlet Allocation

Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:

First we need to produce the DocumentTermMatrix:

TextAnalysis.lda — Function
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
source
julia> using TextAnalysis
julia> crps = Corpus([ StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words") ]);
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix
julia> k = 2 # number of topics
2
julia> iterations = 1000 # number of gibbs sampling iterations
1000
julia> α = 0.1 # hyper parameter
0.1
julia> β = 0.1 # hyper parameter
0.1
julia> ϕ, θ = lda(m, k, iterations, α, β);
julia> ϕ
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.125  0.125  0.25  0.25   ⋅     ⋅    0.125   ⋅     ⋅    0.125
  ⋅      ⋅      ⋅     ⋅    0.25  0.25   ⋅     0.25  0.25   ⋅
julia> θ
2×2 Matrix{Float64}:
 0.833333  0.5
 0.166667  0.5

See ?lda for more help.
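
To make the topics readable, it helps to map the columns of ϕ back to the vocabulary of the DocumentTermMatrix. A sketch, assuming m.terms holds the term ordering used by the matrix (the variable names below are illustrative):

julia> vocab = m.terms;

julia> top_words = [vocab[sortperm(Vector(ϕ[t, :]), rev=true)[1:3]] for t in 1:k]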
