BNLP 4.0.0: Redesign of BNLP version 3 with proper OOP methods for model reuse, a separate training module, and more

Highlights

BNLP v4.0.0 is a redesign built on proper object-oriented programming. In earlier versions, the pre-trained model was loaded every time a text was tokenized or embedded. In this version, the model is loaded only once and reused for tokenization, embedding, and other tasks. Automatic model downloading has also been added: if no pre-trained model path is passed, a pre-trained model is loaded automatically from the hub. In earlier versions, training was embedded in the same module as prediction, which made it difficult to add separate functionality for training and prediction. The training module is therefore now separate for every task, such as tokenization and embeddings. The Corpus module is now a class, so it can be reused and extended with new features.

API Changes

Model loading changes

Previously, the model was loaded every time a result was generated.
The model is now loaded once, when a class is instantiated.
If no model path is passed, a pre-trained model is loaded automatically from the hub.

3.3.2

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

4.0.0

from bnlp import BengaliWord2Vec

model_path = "path/mymodel.model"
bwv = BengaliWord2Vec(model_path=model_path)

word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
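
The load-once behaviour described above can be illustrated without BNLP itself. The following is a minimal sketch, not BNLP's actual implementation: `load_model`, `fetch_default_model`, `PerCallEmbedder`, and `LoadOnceEmbedder` are hypothetical names, and the "model" is a plain dictionary. It contrasts loading on every call (the 3.x pattern) with loading in the constructor and falling back to a default model when no path is given (the 4.0.0 pattern).

```python
from typing import Dict, List, Optional

# Records every (hypothetical) expensive load so the two patterns can be compared.
LOAD_CALLS: List[str] = []


def load_model(model_path: str) -> Dict[str, List[float]]:
    """Stand-in for an expensive model load; returns a toy word-vector table."""
    LOAD_CALLS.append(model_path)
    return {"গ্রাম": [0.1, 0.2, 0.3]}


def fetch_default_model() -> str:
    """Stand-in for an automatic download; returns a hypothetical local path."""
    return "default_pretrained.model"


class PerCallEmbedder:
    """3.x-style pattern: the path is passed, and the model loaded, on every call."""

    def get_word_vector(self, model_path: str, word: str) -> List[float]:
        return load_model(model_path)[word]


class LoadOnceEmbedder:
    """4.0.0-style pattern: the model is loaded once in the constructor and reused."""

    def __init__(self, model_path: Optional[str] = None) -> None:
        if model_path is None:
            # Assumed fallback: resolve a default pre-trained model.
            model_path = fetch_default_model()
        self.model_path = model_path
        self.model = load_model(model_path)

    def get_word_vector(self, word: str) -> List[float]:
        return self.model[word]
```

In this sketch, two lookups through `PerCallEmbedder` trigger two loads, while `LoadOnceEmbedder` loads once in its constructor however many lookups follow.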

Training module changes

The training module has been separated from the main module, and relevant features have been added to it.

3.3.2

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

4.0.0

from bnlp import Word2VecTraining

trainer = Word2VecTraining()

data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
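
The reason for the split can be sketched with a toy example. This is an illustration under stated assumptions, not BNLP code: `ToyTrainer` and `ToyModel` are hypothetical classes, and the "model" is only a word-frequency table saved as JSON rather than Word2Vec. The point is the shape of the design: the trainer only writes a model file, and the inference class only reads one.

```python
import json
from collections import Counter
from typing import Dict


class ToyTrainer:
    """Hypothetical training-only class: reads raw text and writes a model file."""

    def train(self, data_file: str, model_name: str) -> None:
        with open(data_file, encoding="utf-8") as f:
            counts = Counter(f.read().split())
        with open(model_name, "w", encoding="utf-8") as f:
            json.dump(counts, f, ensure_ascii=False)


class ToyModel:
    """Hypothetical inference-only class: loads a saved model once and never trains."""

    def __init__(self, model_path: str) -> None:
        with open(model_path, encoding="utf-8") as f:
            self.counts: Dict[str, int] = json.load(f)

    def frequency(self, word: str) -> int:
        # Unknown words have a frequency of zero.
        return self.counts.get(word, 0)
```

With the two roles separated, training options can be added to the trainer without touching the class used for prediction.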

Corpus is now a class

3.3.2

from bnlp.corpus import stopwords, punctuations, letters, digits

print(stopwords)
print(punctuations)
print(letters)
print(digits)

4.0.0

from bnlp import BengaliCorpus as corpus

print(corpus.stopwords)
print(corpus.punctuations)
print(corpus.letters)
print(corpus.digits)
print(corpus.vowels)
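
Because the 4.0.0 example reads attributes directly from the class, one plausible way to picture the design is a class with class-level attributes. The sketch below is an assumption for illustration only: `ToyCorpus`, `ExtendedCorpus`, and `is_digit` are hypothetical, and the values are small samples rather than BNLP's actual data.

```python
class ToyCorpus:
    """Hypothetical corpus class; attributes are readable without instantiation."""

    digits = tuple("০১২৩৪৫৬৭৮৯")      # the ten Bengali digits
    vowels = tuple("অআইঈউঊঋএঐওঔ")    # independent Bengali vowels

    @classmethod
    def is_digit(cls, ch: str) -> bool:
        """Return True if ch is one of the class's digit characters."""
        return ch in cls.digits


class ExtendedCorpus(ToyCorpus):
    """Hypothetical extension: a subclass adds data without changing the base class."""

    punctuations = ("।", "?", "!")
```

Here `ToyCorpus.is_digit("৫")` is true and `ExtendedCorpus` inherits `digits` and `vowels` while adding its own attribute, which is the kind of reuse and extension a class makes straightforward.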

Contributors

Ibrahim (automatic model downloading, fixing glove vector loading)
