Detection improvements and documentation
aviks committed Aug 4, 2018
1 parent 2c37a31 commit eb4e6ff
Showing 4 changed files with 46 additions and 14 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
42 changes: 38 additions & 4 deletions README.md
@@ -5,7 +5,7 @@ Languages.jl
[![Languages](http://pkg.julialang.org/badges/Languages_0.6.svg)](http://pkg.julialang.org/?pkg=Languages)


# Introduction
## Introduction

Languages.jl is a Julia package for working with human languages. It provides:

@@ -17,11 +17,45 @@ Languages.jl is a Julia package for working with human languages. It provides:
* Pronouns
* Stopwords

# Usage
These methods are supported only for English and German currently.

This package also detects the script and language of written text in a wide variety of languages.

## Usage

using Languages

articles(EnglishLanguage)
stopwords(EnglishLanguage)
articles(Languages.English())
stopwords(Languages.English())

All word lists are returned as vectors of UTF-8 strings.

## Script detection

The script detection model works by checking which Unicode character ranges are present
in the input text.

Languages.detect_script("To be or not to be") # => Languages.LatinScript()
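
The range check described above can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the name `sketch_detect_script`, the `SCRIPT_RANGES` table, and the three simplified ranges are assumptions made for illustration, and the real detector covers many more scripts.

```julia
# Hypothetical, simplified script detector; NOT the Languages.jl implementation.
# Each entry pairs an illustrative script name with a predicate over characters.
const SCRIPT_RANGES = [
    (:Latin,      c -> 'A' <= c <= 'Z' || 'a' <= c <= 'z' || '\u00c0' <= c <= '\u024f'),
    (:Cyrillic,   c -> '\u0400' <= c <= '\u04ff'),
    (:Devanagari, c -> '\u0900' <= c <= '\u097f'),
]

function sketch_detect_script(text::AbstractString)
    counts = zeros(Int, length(SCRIPT_RANGES))
    for c in text
        for (i, (_, in_range)) in enumerate(SCRIPT_RANGES)
            if in_range(c)
                counts[i] += 1
                break   # the ranges above do not overlap, so stop at the first hit
            end
        end
    end
    # Pick the script with the most matching characters, if any matched at all.
    best_count, best_index = findmax(counts)
    return best_count > 0 ? SCRIPT_RANGES[best_index][1] : nothing
end
```

Characters outside every range (digits, punctuation, spaces) are ignored, so a string made only of such characters yields `nothing` in this sketch.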

## Language detection

A trigram-based model is used to detect the language of the text. The model is
filtered based on the detected script, so only languages written in that script are considered.
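
As a rough illustration of how trigram matching can work, the sketch below ranks the character trigrams of a text and compares them with per-language ranked profiles. Everything in it is an assumption for illustration: the names `ranked_trigrams`, `rank_distance`, and `closest_profile`, the fixed penalty of 300 for unseen trigrams, and the toy profiles built in the usage note. It does not reproduce the package's `calculate_distance` or its trained data.

```julia
# Hypothetical sketch of trigram ranking; NOT the Languages.jl implementation.

# Count overlapping character trigrams and return them most-frequent first.
function ranked_trigrams(text::AbstractString)
    chars = collect(lowercase(text))
    counts = Dict{String,Int}()
    for i in 1:length(chars)-2
        tri = String(chars[i:i+2])
        counts[tri] = get(counts, tri, 0) + 1
    end
    # Descending count; ties are broken alphabetically so the order is deterministic.
    return sort(collect(keys(counts)), by = t -> (-counts[t], t))
end

# Sum of rank differences between the text and a language profile.
# A trigram absent from the profile costs a fixed penalty (assumed value).
function rank_distance(profile::Vector{String}, text_trigrams::Vector{String}; penalty::Int = 300)
    positions = Dict(t => i for (i, t) in enumerate(profile))
    total = 0
    for (i, t) in enumerate(text_trigrams)
        total += haskey(positions, t) ? abs(positions[t] - i) : penalty
    end
    return total
end

# Return the key of the profile with the smallest distance to the text.
function closest_profile(profiles::Dict{Symbol,Vector{String}}, text::AbstractString)
    trigrams = ranked_trigrams(text)
    best, best_dist = nothing, typemax(Int)
    for (lang, profile) in profiles
        d = rank_distance(profile, trigrams)
        if d < best_dist
            best, best_dist = lang, d
        end
    end
    return best
end
```

For example, with toy profiles built as `Dict(:en => ranked_trigrams("the quick brown fox jumps over the lazy dog and then the other the"), :de => ranked_trigrams("der schnelle braune fuchs springt über den faulen hund und dann"))`, the call `closest_profile(profiles, "the dog and the fox")` picks `:en`, because most of its trigrams appear in the English profile and incur only small rank differences.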

We detect 84 of the most common languages spoken around the world. This covers
most languages with more than 10 million native speakers.

Languages.detect("To be or not to be")
# (Languages.English(), Languages.LatinScript(), 1.0)

The `detect` function returns the language, the script, and the confidence.

The language and script detection code in this package is ported from the Rust package [whatlang-rs](https://github.com/greyblake/whatlang-rs). That package is in turn derived from [franc](https://github.com/wooorm/franc). See `LICENSE.whatlang-rs` for details.

## Deprecations

The API of this package was recently reworked. If you used an earlier version,
please be aware of these changes.

* The language names have been shortened: use `English` instead of `EnglishLanguage`. The language names are also no longer exported, so they must be qualified with the package name: `Languages.English`.
* Every language is a type. However, all functions now accept and return instances of these types rather than the types themselves.
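
The move from types to instances can be sketched as follows. This is a self-contained illustration in Julia 1.x syntax, not the package source: the `Language` and `English` definitions and the hard-coded word list are stand-ins, and only `Base.depwarn` is a real Base function.

```julia
# Hypothetical stand-ins for the package's types and word lists.
abstract type Language end
struct English <: Language end

# New-style method: takes an instance of the language type.
articles(::English) = ["a", "an", "the"]

# Deprecated method: takes the type itself, warns, and forwards to the instance method.
function articles(::Type{T}) where {T<:Language}
    Base.depwarn("Use of Languages as types is deprecated. Use instances.", :articles)
    return articles(T())
end
```

Under this pattern `articles(English)` still returns the same list as `articles(English())`, so existing callers keep working while being steered toward instances. Recent Julia versions hide deprecation warnings by default; running with `--depwarn=yes` displays them.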
8 changes: 5 additions & 3 deletions src/whatlang.jl
@@ -62,9 +62,9 @@ function detect_script(text::AbstractString)
end
end

sort(script_counters, lt=(x,y)->x[2]<y[2])
if script_counters[2] > 0
return script_counters[1]
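# Sort by count, highest first, so the most frequent script sits at index 1.
sort!(script_counters, by=x->x[2], rev=true)
if script_counters[1][2] > 0
return script_counters[1][1]  # return the script, not its count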
else
return nothing
end
@@ -410,7 +410,9 @@ function calculate_distance(lang_trigrams, text_trigrams)
end

function detect(text::AbstractString, options=default_options())
if isempty(text); throw(ArgumentError("Cannot detect language for empty text")); end
script = detect_script(text)
if script === nothing; return (nothing, nothing, 0); end
lang, conf = detect_lang_based_on_script(text, script, options)
return (from_code(lang), script, conf)
end
9 changes: 2 additions & 7 deletions src/word_lists.jl
@@ -5,17 +5,12 @@ for f in ("articles", "indefinite_articles", "definite_articles", "prepositions"
return fetch_word_list(filename)
end
end
end


for f in ("articles", "indefinite_articles", "definite_articles", "prepositions", "pronouns", "stopwords")
# Deprecations
@eval begin
function $(Symbol(f)){T <: Language}(l::Type{T})
filename = Pkg.dir("Languages", "data", $f, string(string(l.name.name), ".txt"))
Base.depwarn("Use of Languages as types is deprecated. Use instances.", Symbol(T))
return fetch_word_list(filename)
$(Symbol(f))(l())
end

$(Symbol(f))(l::T) where T <: Language = $(Symbol(f))(T)
end
end
