Adjusting Word Embeddings and Downstream Tasks

Introduction

We adjust the pre-trained word embeddings through a self-attention mechanism so that the word embeddings can preserve the relationship between synonyms and antonyms in the lexicon, and evaluate the adjusted word embeddings the through NLP downstream tasks.

Thesis Links: 以同反義詞典調整的詞向量對下游自然語言任務影響之實證研究

Run

Environment suggestion

The library bcolz is difficult to install. We suggest using Conda Python 3.7 and installing library bcolz via conda install for an easier setup.

Adjusting Word Embeddings

main.py

Train self-attention model and adjust pre-trained word embeddings.
Refer to the chapter 3.1.3 in the paper for the input of loss1/loss2/alpha

Usage:

$python main.py -ep <filepath of pre-trained embeddings> \
                -en <filename of pre-trained embeddings> \
                -lp <filepath of lexicons> \
                -vp <filepath of vocabulary> \
                -op <filepath to save model> \
                -loss1 <method to use for the first target of the loss function> ('Cosine' or 'Euclidean') \
                -loss2 <method to use for the second target of the loss function> ('MSE' or 'Euclidean') \
                -alpha <hyperparameter for the loss function>

Example:

$python main.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt \
                -en glove.6B.300d \
                -lp ../data/lexicons/wordnet_syn_ant.txt \
                -vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl \
                -op ../output/model/listwise.model \
                -loss1 Euclidean \
                -loss2 Euclidean \
                -alpha 0.3

evaluation.py

Compare the performance of un-adjusted and adjusted word embeddings on test data.
Transfer the raw output to GloVe format.

Usage:

$python evaluation.py -ep <filepath of the original pre-trained embeddings> \
                      -vp <filepath of vocabulary> \
                      -tp <filepath of test data for evaluation> \
                      -op <filepath for saving embeddings>

Example:

$python evaluation.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt \
                      -vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl \
                      -tp ../data/lexicons/wordnet_syn_ant_test_pair.txt \
                      -op ../output/embeddings/Listwise_Vectors.txt

NLP Downstream Tasks

IMDb Task: LSTM_IMDb.py

Compare the performance of un-adjusted and adjusted word embeddings on IMDb.

Usage:

$python LSTM_IMDb.py -en <filename of embeddings> \
                     -op <filepath for saving the result>

Example:

$python LSTM_IMDb.py -en ../output/embeddings/Listwise_Vectors.txt \
                     -op ../output/IMDb/IMDb_result.txt

US Airline Task: LSTM_USairline.py

Compare the performance of un-adjusted and adjusted word embeddings on US Airline.

Usage:

$python LSTM_USairline.py -en <filename of embeddings> \
                          -op <filepath for saving the result>

Example:

$python LSTM_USairline.py -en ../output/embeddings/Listwise_Vectors.txt \
                          -op ../output/USairline/USairline_result.txt

Datasets

Embeddings

Pretrained word embeddings filtered by 50K frequent words in GloVe format.
Files end with '_train.txt' are for training, while files end with '_test_pair.txt' are for testing.

Data format:

word1 -0.09611 -0.25788 ... -0.092774  0.39058
word2 -0.24837 -0.45461 ...  0.15458  -0.38053

Lexicons

Synonyms and antonyms retrieved from dictionary.

Data format:

word1 syn1 ... synn \t ant1 ... antn 
word2 syn1 ... synn \t ant1 ... antn

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
code		code
data		data
output		output
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Adjusting Word Embeddings and Downstream Tasks

Introduction

Run

Environment suggestion

Adjusting Word Embeddings

main.py

evaluation.py

NLP Downstream Tasks

IMDb Task: LSTM_IMDb.py

US Airline Task: LSTM_USairline.py

Datasets

Embeddings

Lexicons

Reference

About

Releases

Packages

Contributors 2

Languages

ncu-dart/Adjusting-Word-Embeddings-and-Downstream-tasks

Folders and files

Latest commit

History

Repository files navigation

Adjusting Word Embeddings and Downstream Tasks

Introduction

Run

Environment suggestion

Adjusting Word Embeddings

main.py

evaluation.py

NLP Downstream Tasks

IMDb Task: LSTM_IMDb.py

US Airline Task: LSTM_USairline.py

Datasets

Embeddings

Lexicons

Reference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages