BNLP 4.0.0

@sagorbrur sagorbrur released this 14 Aug 14:34

BNLP 4.0.0: A redesign of BNLP version 3 with proper OOP methods for model reuse, a separate training module, and more

Highlights

BNLP v4.0.0 is a redesign built on proper object-oriented programming. In earlier versions, the pre-trained model was loaded every time we tokenized or embedded a text. In this version, the model is loaded only once and reused for tokenization, embedding, and other tasks. We also added automatic model downloading: if no pre-trained model path is passed, a pre-trained model is loaded automatically from the hub. In earlier versions, training was embedded in the same module as prediction, which made it hard to add separate functionality for training and prediction. So we separated out the training module for every task, such as tokenization and embeddings. The Corpus module is now a class, so it can be reused and extended with new features.

API Changes

Model loading changes: Previously, the model was loaded every time a result was generated. In 4.0.0:

  • The model is loaded once, when a class is initialized.
  • If no model path is passed, a pre-trained model is loaded automatically from the hub.
3.3.2:

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

4.0.0:

from bnlp import BengaliWord2Vec

model_path = "path/mymodel.model"
bwv = BengaliWord2Vec(model_path=model_path)

word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
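
The load-once behaviour described in this section can be sketched without BNLP at all. The following is a minimal illustration, not BNLP's implementation: `ToyEmbedder`, `DEFAULT_MODEL`, `load_count`, and the fake `_load` step are all hypothetical names introduced here, and the "model" is just a dictionary. It only shows the shape of the pattern: load in the constructor, fall back to a default when no path is given, and reuse the loaded object on every call.

```python
class ToyEmbedder:
    """Hypothetical stand-in for a model wrapper; not part of BNLP."""

    # Assumed placeholder for "a pre-trained model from the hub".
    DEFAULT_MODEL = "default-pretrained"

    def __init__(self, model_path=None):
        # Fall back to the default model when no path is passed.
        self.model_path = model_path or self.DEFAULT_MODEL
        self.load_count = 0
        self._model = self._load(self.model_path)

    def _load(self, path):
        # Stands in for an expensive disk or network load.
        self.load_count += 1
        return {"source": path}

    def get_word_vector(self, word):
        # Reuses the object loaded in __init__; nothing is reloaded here.
        return [float(len(word)), float(len(self._model["source"]))]


embedder = ToyEmbedder()
for w in ["গ্রাম", "শহর"]:
    embedder.get_word_vector(w)

# The load step ran once, no matter how many words were embedded.
print(embedder.load_count)
```

Because the load happens in `__init__`, the cost is paid once per object rather than once per call, which is the difference between the 3.3.2 and 4.0.0 call styles.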

Training module changes

The training module is now separate from the main module, with relevant features added to it.

3.3.2:

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

4.0.0:

from bnlp import Word2VecTraining

trainer = Word2VecTraining()

data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
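
To illustrate why splitting training from prediction helps, here is a small self-contained sketch. It is not how BNLP trains Word2Vec: `ToyTrainer` and `ToyModel` are hypothetical classes, and the "model" is only a word-count table saved as JSON. The point is the separation itself: the trainer writes a file, and the inference class only reads it.

```python
import json
import os
import tempfile
from collections import Counter


class ToyTrainer:
    """Hypothetical training-only class; not BNLP's Word2VecTraining."""

    def train(self, sentences, model_name):
        # "Training" here is just counting whitespace-separated words.
        counts = Counter(word for s in sentences for word in s.split())
        with open(model_name, "w", encoding="utf-8") as f:
            json.dump(counts, f, ensure_ascii=False)


class ToyModel:
    """Hypothetical inference-only class; loads the saved file once."""

    def __init__(self, model_path):
        with open(model_path, encoding="utf-8") as f:
            self.counts = json.load(f)

    def frequency(self, word):
        # Unknown words get a count of zero.
        return self.counts.get(word, 0)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.json")
    ToyTrainer().train(["আমি ভাত খাই", "আমি বই পড়ি"], path)
    model = ToyModel(path)
    freq = model.frequency("আমি")

print(freq)  # 2
```

With this layout, training options can grow on the trainer without touching the class that users call for predictions.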

Corpus is now a class

3.3.2:

from bnlp.corpus import stopwords, punctuations, letters, digits

print(stopwords)
print(punctuations)
print(letters)
print(digits)

4.0.0:

from bnlp import BengaliCorpus as corpus

print(corpus.stopwords)
print(corpus.punctuations)
print(corpus.letters)
print(corpus.digits)
print(corpus.vowels)
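
One way to picture a corpus exposed as a class is a set of class-level attributes that can be read without creating an instance. The sketch below is an assumption about the general pattern, not BNLP's actual `BengaliCorpus` code: `ToyCorpus` and `is_digit` are hypothetical names, and only the digit list is filled in (built from the Unicode code points U+09E6 to U+09EF).

```python
class ToyCorpus:
    """Hypothetical corpus holder; not BNLP's BengaliCorpus."""

    # Bengali digits zero through nine, U+09E6..U+09EF.
    digits = "".join(chr(0x09E6 + i) for i in range(10))

    @classmethod
    def is_digit(cls, ch):
        # A helper like this is easy to add once the data lives on a class.
        return ch in cls.digits


# Class attributes are read directly, with no instantiation.
corpus = ToyCorpus
print(corpus.digits)
print(len(corpus.digits))  # 10
```

Keeping the data on a class gives one place to attach new attributes or helper methods later.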

Contributors

  • Ibrahim (automatic model downloading, fixing GloVe vector loading)