The symspell package provides a Go implementation of SymSpell, a fast and memory-efficient algorithm
for spelling correction, word segmentation, and fuzzy string matching. It supports both unigram and bigram
dictionaries for context-aware correction.
- Fast lookup for single-word corrections
- Compound word corrections
- Customizable edit distance and prefix length
- Support for unigram and bigram dictionaries
- Configurable thresholds for performance tuning
Install the package using go get:

    go get github.com/snapp-incubator/go-symspell
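Import the Package

    import "github.com/snapp-incubator/go-symspell"

Initialize SymSpell

With a unigram dictionary only:

    package main

    import "github.com/snapp-incubator/go-symspell"

    func main() {
        // Load the unigram dictionary and apply the tuning options.
        symSpell := symspell.NewSymSpellWithLoadDictionary("path/to/vocab.txt", 0, 1,
            symspell.WithCountThreshold(10),
            symspell.WithMaxDictionaryEditDistance(3),
            symspell.WithPrefixLength(5),
        )
        _ = symSpell // ready for Lookup and LookupCompound calls
    }

With unigram and bigram dictionaries:

    package main

    import "github.com/snapp-incubator/go-symspell"

    func main() {
        // Load the unigram and bigram dictionaries and apply the tuning options.
        symSpell := symspell.NewSymSpellWithLoadBigramDictionary("path/to/vocab.txt", "path/to/vocab_bigram.txt", 0, 1,
            symspell.WithCountThreshold(1),
            symspell.WithMaxDictionaryEditDistance(3),
            symspell.WithPrefixLength(7),
        )
        _ = symSpell // ready for Lookup and LookupCompound calls
    }

Simple Lookup

    // Assumes symSpell was initialized as shown above and that fmt and log are imported.
    suggestions, err := symSpell.Lookup("حیابان", symspell.Top, 3)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(suggestions[0].Term) // Output: خیابان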
Compound Word Lookup

    suggestion := symSpell.LookupCompound("حیابان ملاصدزا", 3)
    fmt.Println(suggestion.Term) // Output: خیابان ملاصدرا
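
Each correction in the examples above differs from its input by a single substituted letter, so a maximum edit distance of 3 leaves plenty of headroom. The sketch below makes that concrete by computing plain Levenshtein distance over runes. It uses only the standard library, the helper names `levenshtein` and `min3` are hypothetical, and it is not taken from this package (SymSpell implementations typically use a Damerau-Levenshtein variant that also counts transpositions).

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the number of single-rune insertions, deletions,
// and substitutions needed to turn a into b. Working on runes (not bytes)
// keeps multi-byte scripts such as Persian correct.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("حیابان", "خیابان"))
	fmt.Println(levenshtein("ملاصدزا", "ملاصدرا"))
}
```

Both calls report a distance of 1, so these inputs would still be corrected with a smaller maximum edit distance.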
The repository includes comprehensive unit tests. Run the tests with:

    go test ./...

Example test cases include single-word corrections, compound word corrections, and edge cases.
- WithMaxDictionaryEditDistance: Sets the maximum edit distance for corrections.
- WithPrefixLength: Sets the prefix length for index optimization.
- WithCountThreshold: Sets the minimum frequency a dictionary entry needs; entries below it are filtered out.
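
The With... helpers listed above read like Go's functional options pattern. The following is a minimal sketch of how such options are commonly wired up, not the package's actual source: the `settings` type, the `newSettings` constructor, and the default values are all assumptions made for illustration, and the With... functions here are local stand-ins that only share their names with the real ones.

```go
package main

import "fmt"

// settings holds the three tunables from the list above.
// The type and its fields are illustrative assumptions.
type settings struct {
	maxEditDistance int
	prefixLength    int
	countThreshold  int
}

// Option mutates settings; each With... helper returns one.
type Option func(*settings)

// WithMaxDictionaryEditDistance is a local stand-in for the package option.
func WithMaxDictionaryEditDistance(d int) Option {
	return func(s *settings) { s.maxEditDistance = d }
}

// WithPrefixLength is a local stand-in for the package option.
func WithPrefixLength(n int) Option {
	return func(s *settings) { s.prefixLength = n }
}

// WithCountThreshold is a local stand-in for the package option.
func WithCountThreshold(c int) Option {
	return func(s *settings) { s.countThreshold = c }
}

// newSettings starts from assumed defaults and applies the options in order.
func newSettings(opts ...Option) settings {
	s := settings{maxEditDistance: 2, prefixLength: 7, countThreshold: 1}
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

func main() {
	s := newSettings(WithMaxDictionaryEditDistance(3), WithPrefixLength(5))
	fmt.Printf("%+v\n", s)
}
```

Options that are not passed keep their defaults, which is why callers can set only the values they care about.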
Dictionaries
The dictionaries should be formatted as plain text files:
- Unigram file: Each line should contain a term and its frequency, separated by a space (a custom separator can also be used).
- Bigram file: Each line should contain two terms and their frequency, separated by a space.
Unigram (vocab.txt):
خیابان 1000
میدان 800
Bigram (vocab_bigram.txt):
خیابان کارگر 500
میدان آزادی 300
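
To make the two file formats above concrete, here is a small sketch that parses one dictionary line into a term and a frequency. It treats the last field as the count and everything before it as the term, so the same helper covers unigram and bigram lines. The `parseEntry` function is hypothetical and uses only the standard library; the package ships its own loaders, which may behave differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEntry splits a dictionary line on sep. The last field is parsed as
// the frequency; the remaining fields are joined with a space to form the
// term (one word for unigrams, two for bigrams).
func parseEntry(line, sep string) (string, int, error) {
	fields := strings.Split(strings.TrimSpace(line), sep)
	if len(fields) < 2 {
		return "", 0, fmt.Errorf("expected a term and a count, got %q", line)
	}
	last := len(fields) - 1
	count, err := strconv.Atoi(fields[last])
	if err != nil {
		return "", 0, fmt.Errorf("invalid count in %q: %w", line, err)
	}
	return strings.Join(fields[:last], " "), count, nil
}

func main() {
	for _, line := range []string{"خیابان 1000", "خیابان کارگر 500"} {
		term, count, err := parseEntry(line, " ")
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %d\n", term, count)
	}
}
```

Passing a different `sep`, such as a tab, is one way a custom separator could be handled.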
SymSpell is optimized for speed and memory efficiency. For large vocabularies, tune maxEditDistance, prefixLength, and countThreshold to balance performance and accuracy.
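
To see why these settings matter, recall that SymSpell-style indexes precompute delete variants of each dictionary term. The sketch below counts those variants for one word at several edit distances and prefix lengths. It uses only the standard library, the `deletes` helper and the sample word are hypothetical, and it illustrates the general technique instead of reproducing this package's indexing code.

```go
package main

import "fmt"

// deletes returns every distinct string obtained by removing between 1 and
// maxDist runes from the first prefixLen runes of word. It stops before
// producing the empty string.
func deletes(word string, maxDist, prefixLen int) map[string]struct{} {
	r := []rune(word)
	if len(r) > prefixLen {
		r = r[:prefixLen] // only the prefix is indexed
	}
	out := map[string]struct{}{}
	var walk func(cur []rune, dist int)
	walk = func(cur []rune, dist int) {
		if dist == maxDist || len(cur) <= 1 {
			return
		}
		for i := range cur {
			next := make([]rune, 0, len(cur)-1)
			next = append(next, cur[:i]...)
			next = append(next, cur[i+1:]...)
			key := string(next)
			if _, seen := out[key]; !seen {
				out[key] = struct{}{}
				walk(next, dist+1)
			}
		}
	}
	walk(r, 0)
	return out
}

func main() {
	for _, prefixLen := range []int{5, 9} {
		for maxDist := 1; maxDist <= 3; maxDist++ {
			n := len(deletes("boulevard", maxDist, prefixLen))
			fmt.Printf("prefixLength=%d maxEditDistance=%d deletes=%d\n", prefixLen, maxDist, n)
		}
	}
}
```

For this nine-letter word the variant count grows from 9 to 45 to 129 as the distance goes from 1 to 3, while a prefix length of 5 caps it at 5, 15, and 25. That growth, multiplied across the vocabulary, is the memory and build-time cost being traded against recall.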