Initial Commit
maruker committed Nov 1, 2024
1 parent c8f86a7 commit a7f81bd
Showing 41 changed files with 3,692 additions and 4 deletions.
15 changes: 15 additions & 0 deletions .gitignore
@@ -0,0 +1,15 @@
# python
__pycache__/
*.py[cod]
*$py.class
venv

# training and test data
data/*

# temporary outputs
src/vec2sent/evaluation/bleu/bleu-hypothesis
src/vec2sent/evaluation/bleu/bleu-reference

# IDE
.idea
12 changes: 12 additions & 0 deletions .gitmodules
@@ -0,0 +1,12 @@
[submodule "src/external/mos/mos"]
path = src/external/mos/mos
url = https://github.com/zihangdai/mos.git
[submodule "src/external/InferSent"]
path = src/external/InferSent
url = https://github.com/facebookresearch/InferSent.git
[submodule "src/external/geometric_embedding/geometric_embedding"]
path = src/external/geometric_embedding/geometric_embedding
url = https://github.com/fursovia/geometric_embedding.git
[submodule "src/external/quick_thought/S2V"]
path = src/external/quick_thought/S2V
url = https://github.com/lajanugen/S2V.git
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.7
106 changes: 102 additions & 4 deletions README.md
@@ -1,11 +1,105 @@
# Vec2Sent: Probing Sentence Embeddings with Natural Language Generation
# Vec2Sent<br><sub><sup>Probing Sentence Embeddings with Natural Language Generation</sup></sub>
[![arXiv](https://img.shields.io/badge/View%20on%20arXiv-B31B1B?logo=arxiv&labelColor=gray)](https://arxiv.org/abs/2011.00592)

**Coming soon**
We introspect black-box sentence embeddings by conditionally generating from them with the
objective to retrieve the underlying discrete sentence. We perceive of this as a new unsupervised
probing task and show that it correlates well with downstream task performance. We also illustrate
how the language generated from different encoders differs. We apply our approach to generate
sentence analogies from sentence embeddings.

This repository contains the code needed to reproduce the results from our Coling paper [Vec2Sent: Probing Sentence Embeddings with Natural Language Generation](https://arxiv.org/abs/2011.00592).
## Quickstart

### Reference
You can quickly install Vec2Sent using pip:

```shell
pip install "vec2sent @ git+https://github.com/maruker/vec2sent.git"
```

There are three entry points to **generate** and **evaluate** sentences, and to perform **arithmetic** in the vector space.

### Vector Arithmetic

```shell
vec2sent_arithmetic -s infersent -c maruker/vec2sent-infersent
```

```text
Please enter sentence a (Or nothing if done):his name is robert
Please enter sentence b (Or nothing if done):he is a doctor
Please enter sentence c (Or nothing if done):her name is julia
Please enter sentence d (Or nothing if done):
Please enter an arithmetic expression (e.g. (a + b) * c / 2):b-a+c
she is a doctor
```
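
To illustrate what an expression such as `b-a+c` means in the vector space, here is a minimal sketch that evaluates an arithmetic expression over named vectors. It is not the repository's implementation: `evaluate_expression`, the toy three-dimensional vectors, and the use of plain Python lists are assumptions made for illustration. The real tool works on sentence embeddings and decodes the resulting vector back into a sentence, which this sketch does not attempt.

```python
import ast
import operator

# Binary operators supported in expressions such as "(a + b) * c / 2".
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _apply(op, left, right):
    """Apply op elementwise; a scalar operand is broadcast over the vector."""
    if isinstance(left, list) and isinstance(right, list):
        return [op(x, y) for x, y in zip(left, right)]
    if isinstance(left, list):
        return [op(x, right) for x in left]
    if isinstance(right, list):
        return [op(left, y) for y in right]
    return op(left, right)


def evaluate_expression(expression, vectors):
    """Evaluate an arithmetic expression over named vectors (hypothetical helper)."""

    def _eval(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            op = _OPERATORS[type(node.op)]
            return _apply(op, _eval(node.left), _eval(node.right))
        if isinstance(node, ast.Name) and node.id in vectors:
            return vectors[node.id]
        # Anything else must be a numeric literal; literal_eval raises
        # ValueError for unknown names, function calls, and similar nodes.
        value = ast.literal_eval(node)
        if isinstance(value, (int, float)):
            return value
        raise ValueError("unsupported element in expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Toy stand-ins for the embeddings of sentences a, b and c.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
analogy = evaluate_expression("b-a+c", vectors)
```

Parsing the expression with `ast` instead of calling `eval` keeps the evaluation restricted to the four arithmetic operators and the entered sentence names.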

### Sentence Generation

For example, generate outputs using the hierarchical sentence embedding:

```shell
vec2sent_generate -s hier -c maruker/vec2sent-hier -d data/test.en.2008 -o hier.txt
```

### Evaluation

The outputs from the previous step can now be evaluated. For example, the following command computes the BLEU score:

```shell
vec2sent_evaluate --metric BLEU --file hier.txt
```

The following metrics are available:

| Parameter | Explanation |
|-----------|-----------------------------------------------------------------------------------|
| ID | Fraction of all sentences where the output is identical to the input |
| PERM | Fraction of all output sentences that can be formed as a permutation of the input |
| ID_PERM | Fraction of all permutations that are identical to the input |
| BLEU | Document BLEU score |
| MOVER | Average Mover Score between input and output sentences |
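
As a rough illustration of the first three metrics, the sketch below computes them for a small list of input and output sentences. It is not the code behind `vec2sent_evaluate`: the function name `reconstruction_metrics`, the whitespace tokenization, and the choice to count identical outputs as permutations are assumptions made for this example.

```python
from collections import Counter


def reconstruction_metrics(inputs, outputs):
    """Compute ID, PERM and ID_PERM for paired input/output sentences (hypothetical helper)."""
    pairs = list(zip(inputs, outputs))
    total = len(pairs)

    # ID: the output reproduces the input token for token.
    identical = sum(1 for src, out in pairs if src.split() == out.split())

    # PERM: the output uses exactly the same tokens, in any order.
    # Assumption: an identical output also counts as a permutation.
    permutations = sum(
        1 for src, out in pairs if Counter(src.split()) == Counter(out.split())
    )

    return {
        "ID": identical / total if total else 0.0,
        "PERM": permutations / total if total else 0.0,
        "ID_PERM": identical / permutations if permutations else 0.0,
    }


inputs = ["she is a doctor", "his name is robert", "the cat sat"]
outputs = ["she is a doctor", "robert is his name", "the dog sat"]
metrics = reconstruction_metrics(inputs, outputs)
```

In this toy example one of three outputs is identical and two of three are permutations, so `ID` is 1/3, `PERM` is 2/3 and `ID_PERM` is 1/2.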


> [!TIP]
> Vec2Sent needs to download several gigabytes of sentence embedding models. Those files can be deleted using the command `vec2sent_cleanup`.
## Available Models

We upload our models to the Hugging Face Hub. The following table shows which parameters to set in order to load the sentence embeddings and the corresponding Vec2Sent models.

| Sentence embedding name `-s` | Checkpoint `-c` | Explanation |
|------------------------------|------------------------------|-----------------------------------------------------------------------------------|
| avg | maruker/vec2sent-avg | Average pooling on [BPEmb](https://github.com/bheinzerling/bpemb) word embeddings |
| hier | maruker/vec2sent-hier | Hierarchical pooling on [BPEmb](https://github.com/bheinzerling/bpemb) |
| gem | maruker/vec2sent-gem | [Geometric Embeddings](https://github.com/fursovia/geometric_embedding) |
| sent2vec | maruker/vec2sent-sent2vec | [Sent2Vec](https://github.com/epfml/sent2vec) |
| infersent | maruker/vec2sent-infersent | [InferSent](https://github.com/facebookresearch/InferSent) |
| sbert-large | maruker/vec2sent-sbert-large | [SBERT](https://github.com/UKPLab/sentence-transformers) |

Additional sentence embeddings can be used by extending the class ``vec2sent.sentence_embeddings.abstract_sentence_embedding.AbstractEmbedding``.

## Installation

#### (Optional) Setup Virtual Environment

```shell
python -m venv venv
source venv/bin/activate
```

#### Download requirements
```shell
# Download git submodules (MoS model and some sentence embeddings)
git submodule update --init
```

#### Install

```shell
pip install .
```

## Citation
If you find Vec2Sent useful in your academic work, please consider citing:
```
@inproceedings{kerscher-eger-2020-vec2sent,
title = "{V}ec2{S}ent: Probing Sentence Embeddings with Natural Language Generation",
@@ -21,3 +115,7 @@ This repository contains the code needed to reproduce the results from our Colin
abstract = "We introspect black-box sentence embeddings by conditionally generating from them with the objective to retrieve the underlying discrete sentence. We perceive of this as a new unsupervised probing task and show that it correlates well with downstream task performance. We also illustrate how the language generated from different encoders differs. We apply our approach to generate sentence analogies from sentence embeddings.",
}
```

## Acknowledgments

The models are based on [Mixture of Softmaxes](https://github.com/zihangdai/mos) with a context vector added to the inputs.
56 changes: 56 additions & 0 deletions pyproject.toml
@@ -0,0 +1,56 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/vec2sent"]

# This moves the contents of src/external into src/vec2sent during build
[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel.force-include]
"src/external" = "src/vec2sent"

[project]
name = "vec2sent"
version = "0.1.0"
description = "Generate sentences from embeddings and evaluate the results."
authors = [
{ name="Martin Kerscher" },
]
requires-python = "<3.8"
dependencies = [
"bpemb>=0.3.5",
"fastBPE>=0.0.0",
"gensim==3.4.0",
"nltk==3.4.1",
"numpy>=1.17.1",
"pytorch-transformers==1.1.0",
"laserembeddings",
"scikit-learn==1.0.2",
"sentence-transformers==0.2.2",
"torch==1.1.0",
"tensorflow==1.15.0",
"protobuf==3.14.0",
"pandas==1.2.0",
"tqdm==4.42.1",
"huggingface-hub>=0.16.4",
"sent2vec @ git+https://github.com/epfml/sent2vec.git",
"pyemd==0.5.1",
"pytorch-pretrained-bert==0.6.2",
"platformdirs",
"gdown==4.7.3",
"requests",
]

[project.optional-dependencies]
evaluate_linguistic_features = [
"spacy"
]

[project.scripts]
vec2sent_cleanup = "vec2sent.sentence_embeddings.cache_utils:cleanup"
vec2sent_generate = "vec2sent.lstm.generate:main"
vec2sent_evaluate = "vec2sent.evaluation.__main__:main"
vec2sent_arithmetic = "vec2sent.scripts.vector_arithmetic:main"
1 change: 1 addition & 0 deletions src/external/InferSent
Submodule InferSent added at 31eede
5 changes: 5 additions & 0 deletions src/external/geometric_embedding/__init__.py
@@ -0,0 +1,5 @@
from vec2sent.sys_path_hack import add_to_path

add_to_path(["geometric_embedding", "geometric_embedding"])

from .geometric_embedding.gem import SentenceEmbedder
1 change: 1 addition & 0 deletions src/external/geometric_embedding/geometric_embedding
Submodule geometric_embedding added at 7a84ef
7 changes: 7 additions & 0 deletions src/external/mos/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
from vec2sent.sys_path_hack import add_to_path

add_to_path(["mos", "mos"])

from vec2sent.mos.mos.embed_regularize import embedded_dropout
from vec2sent.mos.mos.model import RNNModel
from vec2sent.mos.mos.weight_drop import WeightDrop
1 change: 1 addition & 0 deletions src/external/mos/mos
Submodule mos added at 9c0c60
1 change: 1 addition & 0 deletions src/external/quick_thought/S2V
Submodule S2V added at 397b8b
5 changes: 5 additions & 0 deletions src/external/quick_thought/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
from vec2sent.sys_path_hack import add_to_path

add_to_path(["quick_thought", "S2V", "src"])

from .S2V.src import encoder_manager, configuration
14 changes: 14 additions & 0 deletions src/external/sys_path_hack.py
@@ -0,0 +1,14 @@
import sys
from pathlib import Path
from typing import List

def add_to_path(path: List[str]) -> None:
    """
    This is unfortunately necessary because I am importing from a lot of old code that is not organized as a python module.
    A lot of old code uses absolute imports that assume the script is executed from the root folder of the repository.
    The only solution is to add the root folder to pythonpath.
    @param path: list of folders inside src/external containing the code (i.e. ["mos", "mos"] -> src/external/mos/mos)
    """
    current_dir = Path(__file__).resolve().parent
    sys.path.append(str(current_dir.joinpath(*path)))
65 changes: 65 additions & 0 deletions src/vec2sent/dataset/__init__.py
@@ -0,0 +1,65 @@
import logging
from tqdm import tqdm
from nltk.tokenize import word_tokenize

from vec2sent.sentence_embeddings.abstract_sentence_embedding import AbstractEmbedding
from vec2sent.util.embedding_wrapper import EmbeddingWrapper
from vec2sent.dataset.sentence_dataset import SortedSentenceDataset

import torch


def determine_batch_size(sentence_embeddings: AbstractEmbedding) -> int:
    if sentence_embeddings.get_name() in ['randomLSTM', 'borep', 'gem'] or sentence_embeddings.input_strings():
        return 200
    return 1


def load_dataset(
        path: str,
        word_embeddings: EmbeddingWrapper,
        sentence_embeddings: AbstractEmbedding,
        device: torch.device,
        leave_order: bool,
        batch_size: int,
        max_len: int,
        num_sentences: int = 0,
        start_token: str = None
) -> SortedSentenceDataset:
    """
    Loads a dataset for training or evaluation.
    @param path: path to the file containing the data
    @param word_embeddings: word embeddings
    @param sentence_embeddings: sentence embeddings
    @param device: where to initially load the dataset (might not fit in VRAM)
    @param leave_order: whether to leave the dataset in order, or sort it into batches of even length
    @param batch_size: batch size
    @param max_len: Maximum sentence length in dataset
    @param num_sentences: if set to a number > 0, the dataset will be cut off after said number of sentences
    @param start_token: String added to each line of the dataset (after tokenization)
    """

    dataset = SortedSentenceDataset(word_embeddings, sentence_embeddings, device, batch_size, max_len)
    oov = 0

    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(tqdm(f, total=num_sentences, desc="Loading dataset {}".format(path))):
            # Apply basic tokenization to punctuation first
            line = " ".join(word_tokenize(line))
            line = line.replace("''", '"').replace("``", '"')

            oov += dataset.add(line, max_len, start_token)

            if i == num_sentences - 1:
                break

    logger = logging.getLogger(__name__)
    logger.info('Loaded {} sentences'.format(len(dataset)))
    dataset.finish_init()

    logger.info("Out of vocabulary: {}".format(oov))

    pin_memory = device.type != 'cpu'
    dataset.create_data_loader(batch_size, leave_order, pin_memory)
    return dataset