A custom implementation of Word2Vec from the original paper, "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013).
It uses a minimum of third-party packages; most of the functionality is implemented using basic PyTorch features.
- https://towardsdatascience.com/word2vec-with-pytorch-implementing-original-paper-2cd7040120b0
- https://muhark.github.io/python/ml/nlp/2021/10/21/word2vec-from-scratch.html
- https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
- There are 2 model architectures implemented in this project (see the sketch below):
- Continuous Bag-of-Words Model (CBOW), which predicts a word based on its context
- Continuous Skip-gram Model (Skip-Gram), which predicts the context for a given word
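As a rough illustration, here is a minimal sketch of the two architectures in PyTorch. The class and parameter names (`CBOW`, `SkipGram`, `vocab_size`, `embedding_size`) are assumptions for this sketch and may differ from the actual code in src/custom_word2vec.py:

```python
import torch.nn as nn

class CBOW(nn.Module):
    """Predicts the central word from the mean of its context embeddings."""
    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, context):           # context: (batch, 2 * window)
        embedded = self.embeddings(context).mean(dim=1)
        return self.linear(embedded)      # logits over the vocabulary

class SkipGram(nn.Module):
    """Predicts context words from the embedding of the central word."""
    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, word):              # word: (batch,)
        return self.linear(self.embeddings(word))
```

Both variants can be trained with a standard cross-entropy loss over the vocabulary logits; Skip-Gram is applied once per context word.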
- Models are trained on the text8 corpus, the first 10⁸ bytes (100 MB) of a cleaned English Wikipedia dump from Mar. 3, 2006
- The context for both models consists of the 5 words before and the 5 words after the central word
- The AdamW optimizer is used
- Models are trained for 5 epochs
- Vocabulary size is limited to 5,000 words
- Results can be compared with the reference Gensim Word2Vec implementation (see the sketch below)
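For the reference comparison, Gensim's Word2Vec can be trained on the same corpus with matching hyperparameters. The snippet below is a minimal sketch; the exact arguments used in src/gensim_word2vec.py may differ:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Text8Corpus chunks the single-line text8 file into sentence-sized pieces.
sentences = Text8Corpus("dataset/text8.txt")

model = Word2Vec(
    sentences,
    vector_size=100,        # embedding size
    window=5,               # 5 words before and 5 after the center word
    sg=0,                   # 0 = CBOW, 1 = Skip-gram
    epochs=5,
    max_final_vocab=5000,   # cap the vocabulary size
)

print(model.wv.most_similar("king", topn=5))
```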
- https://github.com/lacteolus/word2vec.git
- https://gitlab.atp-fivt.org/nlp2023/kosmachevdm-word2vec.git
.
├── dataset
│ └── text8.txt
├── imgs
│ ├── cbow.png
│ └── gensim.png
├── notebooks
│ ├── evaluation.ipynb
│ └── training.ipynb
├── results
│ ├── cbow
│ └── skipgram
├── src
│ ├── custom_word2vec.py
│ ├── dataloader.py
│ ├── gensim_word2vec.py
│ ├── metric_monitor.py
│ ├── trainer.py
│ └── vocab.py
├── main.py
├── README.md
└── requirements.txt
- dataset/text8.txt - text8 corpus file
- imgs/ - images for documentation
- notebooks/training.ipynb - demo of the training procedure
- notebooks/evaluation.ipynb - demo for visually evaluating models
- results/ - folder for storing results
- src/custom_word2vec.py - custom Word2Vec model
- src/dataloader.py - dataloader-related classes and functions (see the sketch after this list)
- src/gensim_word2vec.py - Gensim Word2Vec model
- src/metric_monitor.py - metric monitor class
- src/trainer.py - training loop code
- src/vocab.py - vocabulary class
- main.py - main script for training
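To illustrate what the vocabulary and dataloader code is responsible for, here is a minimal sketch of building a capped vocabulary and generating (context, target) pairs. The function names and details are assumptions for this sketch, not the actual API of src/vocab.py or src/dataloader.py:

```python
from collections import Counter

def build_vocab(tokens, max_vocab_size=5000):
    """Keep the most frequent words; everything else maps to <unk>."""
    counts = Counter(tokens).most_common(max_vocab_size - 1)
    word2idx = {"<unk>": 0}
    for word, _ in counts:
        word2idx[word] = len(word2idx)
    return word2idx

def make_cbow_pairs(token_ids, window=5):
    """Yield (context, target) pairs: 5 words before and 5 after the center."""
    for i in range(window, len(token_ids) - window):
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        yield context, token_ids[i]
```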
python main.py
Before running the command, the following parameters can be changed in the main.py file:
- MAX_VOCAB_SIZE - Max vocabulary size
- EPOCHS - Number of epochs
- MODEL_TYPE - Model type to be used: "cbow" or "skipgram"
- EMBEDDING_SIZE - Embedding size
- SAVE_PATH - Path for saving results
By default, the parameters are similar to those used in Gensim.
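As a rough illustration, the configuration block at the top of main.py might look like the following; the values shown here are placeholders except where stated elsewhere in this README (5 epochs, 5,000-word vocabulary):

```python
# Illustrative defaults; see main.py for the actual values.
MAX_VOCAB_SIZE = 5000        # max vocabulary size
EPOCHS = 5                   # number of training epochs
MODEL_TYPE = "cbow"          # "cbow" or "skipgram"
EMBEDDING_SIZE = 100         # dimensionality of word vectors
SAVE_PATH = "results"        # path for saving results
```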
- The notebooks/training.ipynb notebook can be used to run the training process in Colab or Kaggle environments
- The notebooks/evaluation.ipynb notebook can be used to evaluate different models, e.g. to display scatterplots or find similar words (see the sketch below)
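To show what "finding similar words" involves, here is a minimal sketch that ranks words by cosine similarity against a trained embedding matrix. The names `embeddings`, `word2idx`, and `idx2word` are assumptions about the model's internals, not the notebook's actual API:

```python
import torch
import torch.nn.functional as F

def most_similar(word, embeddings, word2idx, idx2word, top_k=5):
    """Rank vocabulary words by cosine similarity to the query word.

    embeddings: detached (vocab_size, embedding_size) tensor,
    e.g. model.embeddings.weight.detach().
    """
    query = embeddings[word2idx[word]]
    sims = F.cosine_similarity(query.unsqueeze(0), embeddings, dim=1)
    sims[word2idx[word]] = -1.0  # exclude the query word itself
    values, indices = sims.topk(top_k)
    return [(idx2word[i.item()], v.item()) for i, v in zip(indices, values)]
```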
Here are two examples of word groupings (see imgs/cbow.png and imgs/gensim.png):
This project is licensed under the terms of the MIT license.