BNLP 4.0.0: Redesign of BNLP version 3 with proper OOP for model reuse, a separate training module, and more
Highlights
BNLP v4.0.0 is a redesign built on proper object-oriented programming. In earlier versions, the pre-trained model was loaded every time a text was tokenized or embedded. In this version, the model is loaded only once and reused for tokenization, embedding, and other tasks. Automatic model downloading has also been added: if no pre-trained model path is passed, a pre-trained model is loaded automatically from the hub. Earlier versions also combined training and prediction in the same module, which made it hard to add separate functionality for each. Training is now a separate module for every task, such as tokenization and embeddings. The Corpus module is now a class, which makes it easier to reuse and extend.
API Changes
Model loading changes: Previously, the model was loaded every time a result was generated. In 4.0.0:
- The model is loaded once, when the class is instantiated.
- If no model path is passed, a pre-trained model is loaded automatically from the hub.
3.3.2:

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

4.0.0:

from bnlp import BengaliWord2Vec
model_path = "path/mymodel.model"
bwv = BengaliWord2Vec(model_path=model_path)
word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
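
The load-once behaviour described above can be illustrated with a minimal sketch. This is not BNLP's implementation: `LoadOnceEmbedder`, `load_model`, and `DEFAULT_MODEL_PATH` are hypothetical names, and the "model" is a small dictionary standing in for a real pre-trained file or a hub download.

```python
from typing import Dict, List, Optional

# Hypothetical name: stands in for a pre-trained model fetched from the hub.
DEFAULT_MODEL_PATH = "default_pretrained.model"

# Counts how many times the (simulated) expensive load runs.
load_calls = 0


def load_model(path: str) -> Dict[str, List[float]]:
    """Simulated expensive load; a real loader would read `path` from disk."""
    global load_calls
    load_calls += 1
    return {"গ্রাম": [0.1, 0.2, 0.3]}


class LoadOnceEmbedder:
    """Toy wrapper (not a BNLP class) that loads its model in __init__."""

    def __init__(self, model_path: Optional[str] = None) -> None:
        # Fall back to the default model when no path is given.
        self.model_path = model_path or DEFAULT_MODEL_PATH
        self.model = load_model(self.model_path)

    def get_word_vector(self, word: str) -> List[float]:
        # Reuses the model loaded in __init__; nothing is reloaded per call.
        return self.model[word]


embedder = LoadOnceEmbedder()
for _ in range(3):
    vector = embedder.get_word_vector("গ্রাম")

print(load_calls)  # 1: the model was loaded once, not once per call
```

Loading in the constructor means the cost is paid once per object, so repeated calls on the same instance stay cheap.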
Training module changes
The training module is now separate from the main module, and relevant features have been added to it.
3.3.2:

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

4.0.0:

from bnlp import Word2VecTraining
trainer = Word2VecTraining()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
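
The split between training and prediction can be sketched in a few lines. The classes below (`ToyTrainer`, `ToyPredictor`) are hypothetical and unrelated to BNLP's internals; the "model" is just a word-count table saved as JSON, chosen so the example runs without any dependencies.

```python
import json
import os
import tempfile
from collections import Counter


class ToyTrainer:
    """Hypothetical trainer: only builds a model and saves it to disk."""

    def train(self, sentences, model_path):
        # Count whitespace-separated words across all sentences.
        counts = Counter(word for s in sentences for word in s.split())
        with open(model_path, "w", encoding="utf-8") as f:
            json.dump(counts, f, ensure_ascii=False)


class ToyPredictor:
    """Hypothetical predictor: only loads a saved model and queries it."""

    def __init__(self, model_path):
        with open(model_path, encoding="utf-8") as f:
            self.counts = json.load(f)

    def frequency(self, word):
        return self.counts.get(word, 0)


model_path = os.path.join(tempfile.mkdtemp(), "toy_model.json")
ToyTrainer().train(["আমি ভাত খাই", "আমি বই পড়ি"], model_path)

predictor = ToyPredictor(model_path)
print(predictor.frequency("আমি"))  # 2
```

With the two roles in separate classes, training options can grow without touching the prediction interface, which is the motivation the release notes give for the change.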
Corpus is now a class
3.3.2:

from bnlp.corpus import stopwords, punctuations, letters, digits
print(stopwords)
print(punctuations)
print(letters)
print(digits)

4.0.0:

from bnlp import BengaliCorpus as corpus
print(corpus.stopwords)
print(corpus.punctuations)
print(corpus.letters)
print(corpus.digits)
print(corpus.vowels)
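
As a rough illustration of how class-level corpus attributes might be used, the sketch below filters stopwords and punctuation out of a token list. `ToyCorpus` is a hypothetical stand-in with a tiny sample of entries, not the real `BengaliCorpus` data, and the helper function is an assumption for the example only.

```python
class ToyCorpus:
    """Hypothetical stand-in for a corpus class; entries are a tiny sample."""

    stopwords = ["এবং", "ও", "কিন্তু"]
    punctuations = ["।", ",", "?", "!"]


def remove_stopwords_and_punctuation(tokens, corpus=ToyCorpus):
    # Build the lookup set once so each membership test is cheap.
    blocked = set(corpus.stopwords) | set(corpus.punctuations)
    return [token for token in tokens if token not in blocked]


tokens = ["আমি", "এবং", "তুমি", "।"]
filtered = remove_stopwords_and_punctuation(tokens)
print(filtered)  # ['আমি', 'তুমি']
```

Because the data lives on a class, the same function would accept any other object exposing `stopwords` and `punctuations` attributes.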
Contributors
- Ibrahim (automatic model downloading, fixing glove vector loading)