This repository holds two simple class wrappers implementing several word embedding models.
- MongoWordEmbedding implements a MongoDB-based wrapper that consumes embeddings from a MongoDB database; this is memory efficient but requires a MongoDB instance.
- WordEmbedding implements an in-memory wrapper that loads the models into memory; this is memory inefficient but does not require a MongoDB instance.
- Word2Vec: TODO insert link
- FastText (en): TODO insert link
- FastText (Aligned): https://fasttext.cc/docs/en/aligned-vectors.html
  a. Spanish: https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.es.align.vec
  b. English: https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec
  c. Portuguese: https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.pt.align.vec
- LSA:
MongoWordEmbedding uses MongoDB; in order to use that version, a running connection to a MongoDB instance with read/write permissions is required.
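Before using the Mongo-backed wrapper, you may want to confirm the database is reachable. A minimal sketch using pymongo, assuming a default local installation (the host and port are assumptions, not values from this repo):

```python
from pymongo import MongoClient

# Assumed defaults for a local MongoDB installation; adjust to your setup
client = MongoClient(host='localhost', port=27017, serverSelectionTimeoutMS=2000)
client.admin.command('ping')  # raises ServerSelectionTimeoutError if unreachable
print('MongoDB is reachable')
```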
- Install MongoDB if you intend to use MongoWordEmbedding.
- Clone this repo.
- Enter the repo root dir from a console.
- Run `python setup.py install`.
- Configure settings.json.
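After installation, a quick sanity check is to confirm the package imports (class names taken from the usage examples below):

```python
# If installation succeeded, these imports should not raise
from MultiModelWordEmbedding.WordEmbedding import WordEmbedding
from MultiModelWordEmbedding.MongoWordEmbedding import MongoWordEmbedding
```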
The settings.json file stores the following configuration:
- `embeddings_folder`: path to the folder where the embedding model files are stored, with the structure shown in (1).
- `mongo_client`: a dictionary with the parameters for the pymongo.MongoClient(**parameters) database connection method.
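For example, a minimal settings.json, assuming a local MongoDB on the default port (the folder path is a placeholder):

```json
{
    "embeddings_folder": "/path/to/word_embeddings_folder",
    "mongo_client": {"host": "localhost", "port": 27017}
}
```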
(1) The word embedding folder and file structure is the following:
```
word_embeddings_folder
|
├── word2vec
|   └── GoogleNews-vectors-negative300.bin
|
├── fasttext
|   └── cc.en.300.bin
|
├── glove
|   └── glove.840B.300d.txt
|
└── LSA
    └── tasa_300
        └── matrix.npy
```
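A small sketch to verify the files are in place before loading any model (the folder path is a placeholder for your `embeddings_folder` setting; the file list mirrors the tree above):

```python
import os

folder = '/path/to/word_embeddings_folder'  # placeholder: your "embeddings_folder" setting
expected = [
    'word2vec/GoogleNews-vectors-negative300.bin',
    'fasttext/cc.en.300.bin',
    'glove/glove.840B.300d.txt',
    'LSA/tasa_300/matrix.npy',
]
for rel_path in expected:
    # Report which model files are present and which are missing
    status = 'found' if os.path.exists(os.path.join(folder, rel_path)) else 'MISSING'
    print(rel_path, status)
```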
To use the in-memory wrapper:

```python
from MultiModelWordEmbedding.WordEmbedding import WordEmbedding

# Load the Word2Vec model into memory
w2v = WordEmbedding('Word2Vec')
# Look up the embedding vector for a word
w2v['cat']
```
To use the MongoDB-backed wrapper:

```python
from MultiModelWordEmbedding.MongoWordEmbedding import download_embedding_models, MongoWordEmbedding

# One-time step: load the embedding model files into MongoDB
download_embedding_models('word_embeddings_folder')
# Queries now read embeddings from the database instead of memory
w2v = MongoWordEmbedding('Word2Vec')
w2v['cat']
```
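Both wrappers expose the same dictionary-style lookup, so switching between the in-memory and MongoDB-backed versions should only require changing the constructor.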