Initial Commit

maruker · Nov 1, 2024 · a7f81bd
1 parent c8f86a7
commit a7f81bd
Showing 41 changed files with 3,692 additions and 4 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,15 @@
+# python
+__pycache__/
+*.py[cod]
+*$py.class
+venv
+
+# training and test data
+data/*
+
+# temporary outputs
+src/vec2sent/evaluation/bleu/bleu-hypothesis
+src/vec2sent/evaluation/bleu/bleu-reference
+
+# IDE
+.idea
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,12 @@
+[submodule "src/external/mos/mos"]
+	path = src/external/mos/mos
+	url = https://github.com/zihangdai/mos.git
+[submodule "src/external/InferSent"]
+	path = src/external/InferSent
+	url = https://github.com/facebookresearch/InferSent.git
+[submodule "src/external/geometric_embedding/geometric_embedding"]
+	path = src/external/geometric_embedding/geometric_embedding
+	url = https://github.com/fursovia/geometric_embedding.git
+[submodule "src/external/quick_thought/S2V"]
+	path = src/external/quick_thought/S2V
+	url = https://github.com/lajanugen/S2V.git
diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.7
diff --git a/README.md b/README.md
@@ -1,11 +1,105 @@
-# Vec2Sent: Probing Sentence Embeddings with Natural Language Generation
+# Vec2Sent<br><sub><sup>Probing Sentence Embeddings with Natural Language Generation</sup></sub>
+[![arXiv](https://img.shields.io/badge/View%20on%20arXiv-B31B1B?logo=arxiv&labelColor=gray)](https://arxiv.org/abs/2011.00592)
 
-**Coming soon**
+We introspect black-box sentence embeddings by conditionally generating from them with the
+objective to retrieve the underlying discrete sentence. We perceive of this as a new unsupervised
+probing task and show that it correlates well with downstream task performance. We also illustrate
+how the language generated from different encoders differs. We apply our approach to generate
+sentence analogies from sentence embeddings.
 
-This repository contains the code needed to reproduce the results from our Coling paper [Vec2Sent: Probing Sentence Embeddings with Natural Language Generation](https://arxiv.org/abs/2011.00592).
+## Quickstart
 
-### Reference
+You can quickly install Vec2Sent using pip:
 
+```shell
+pip install "vec2sent @ git+https://github.com/maruker/vec2sent.git"
+```
+
+There are three entry points to **generate** and **evaluate** sentences, and to perform **arithmetic** in the vector space.
+
+### Vector Arithmetic
+
+```shell
+vec2sent_arithmetic -s infersent -c maruker/vec2sent-infersent
+```
+
+```text
+Please enter sentence a (Or nothing if done):his name is robert
+Please enter sentence b (Or nothing if done):he is a doctor
+Please enter sentence c (Or nothing if done):her name is julia
+Please enter sentence d (Or nothing if done):
+Please enter an arithmetic expression (e.g. (a + b) * c / 2):b-a+c
+ she is a doctor
+```
+
+### Sentence Generation
+
+For example, generate outputs using the hierarchical sentence embedding:
+
+```shell
+vec2sent_generate -s hier -c maruker/vec2sent-hier -d data/test.en.2008 -o hier.txt
+```
+
+### Evaluation
+
+The outputs from the previous step can now be evaluated. For example, the following command computes the BLEU score:
+
+```shell
+vec2sent_evaluate --metric BLEU --file hier.txt
+```
+
+The following metrics are available:
+
+| Parameter | Explanation                                                                       |
+|-----------|-----------------------------------------------------------------------------------|
+| ID        | Fraction of all sentences where the output is identical to the input              |
+| PERM      | Fraction of all output sentences that can be formed as a permutation of the input |
+| ID_PERM   | Fraction of all permutations that are identical to the input                      |
+| BLEU      | Document BLEU score                                                               |
+| MOVER     | Average Mover Score between input and output sentences                            |
+
+
+> [!TIP]
+> Vec2Sent needs to download several gigabytes of sentence embedding models. Those files can be deleted using the command `vec2sent_cleanup`.
+
+## Available Models
+
+We upload our models to the Hugging Face Hub. The following table shows which parameters to set to load the sentence embeddings and the corresponding Vec2Sent models.
+
+| Sentence embedding name `-s` | Checkpoint `-c`              | Explanation                                                                       |
+|------------------------------|------------------------------|-----------------------------------------------------------------------------------|
+| avg                          | maruker/vec2sent-avg         | Average pooling on [BPEmb](https://github.com/bheinzerling/bpemb) word embeddings |
+| hier                         | maruker/vec2sent-hier        | Hierarchical pooling on [BPEmb](https://github.com/bheinzerling/bpemb)            |
+| gem                          | maruker/vec2sent-gem         | [Geometric Embeddings](https://github.com/fursovia/geometric_embedding)           |
+| sent2vec                     | maruker/vec2sent-sent2vec    | [Sent2Vec](https://github.com/epfml/sent2vec)                                     |
+| infersent                    | maruker/vec2sent-infersent   | [InferSent](https://github.com/facebookresearch/InferSent)                        |
+| sbert-large                  | maruker/vec2sent-sbert-large | [SBERT](https://github.com/UKPLab/sentence-transformers)                          |
+
+Additional sentence embeddings can be used by extending the class ``vec2sent.sentence_embeddings.abstract_sentence_embedding.AbstractEmbedding``.
+
+## Installation
+
+#### (Optional) Setup Virtual Environment
+
+```shell
+python -m venv venv
+source venv/bin/activate
+```
+
+#### Download requirements
+```shell
+# Download git submodules (MoS model and some sentence embeddings)
+git submodule update --init
+```
+
+#### Install
+
+```shell
+pip install .
+```
+
+## Citation
+If you find Vec2Sent useful in your academic work, please consider citing
 ```
 @inproceedings{kerscher-eger-2020-vec2sent,
     title = "{V}ec2{S}ent: Probing Sentence Embeddings with Natural Language Generation",
@@ -21,3 +115,7 @@ This repository contains the code needed to reproduce the results from our Colin
     abstract = "We introspect black-box sentence embeddings by conditionally generating from them with the objective to retrieve the underlying discrete sentence. We perceive of this as a new unsupervised probing task and show that it correlates well with downstream task performance. We also illustrate how the language generated from different encoders differs. We apply our approach to generate sentence analogies from sentence embeddings.",
 }
 ```
+
+## Acknowledgments
+
+The models are based on [Mixture of Softmaxes](https://github.com/zihangdai/mos) with a context vector added to the inputs.
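
Reviewer note: the ID, PERM and ID_PERM rows of the README metrics table can be read as the minimal sketch below. This is not the code from `vec2sent.evaluation`; the names `is_permutation` and `reconstruction_metrics` are hypothetical, and two assumptions are made: sentences are split on whitespace, and an output identical to its input also counts as a permutation of it.

```python
from collections import Counter
from typing import Dict, List


def is_permutation(source: str, output: str) -> bool:
    # Assumption: tokens are whitespace-separated; two sentences are
    # permutations of each other if they contain the same multiset of tokens.
    return Counter(source.split()) == Counter(output.split())


def reconstruction_metrics(inputs: List[str], outputs: List[str]) -> Dict[str, float]:
    # Hypothetical helper, not part of the vec2sent package.
    pairs = list(zip(inputs, outputs))
    identical = sum(src == out for src, out in pairs)
    permuted = sum(is_permutation(src, out) for src, out in pairs)
    return {
        # Fraction of all sentences reproduced exactly.
        "ID": identical / len(pairs),
        # Fraction of outputs that reorder the input tokens (including exact copies).
        "PERM": permuted / len(pairs),
        # Among the permutations, the fraction that are exact copies.
        "ID_PERM": identical / permuted if permuted else 0.0,
    }
```

For instance, with three sentence pairs of which one is reproduced exactly and one is only reordered, ID is 1/3, PERM is 2/3 and ID_PERM is 1/2.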
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,56 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/vec2sent"]
+
+[tool.hatch.metadata]
+allow-direct-references = true
+
+# This moves the contents of src/external into src/vec2sent during build
+[tool.hatch.build.targets.wheel.force-include]
+"src/external" = "src/vec2sent"
+
+[project]
+name = "vec2sent"
+version = "0.1.0"
+description = "Generate sentences from embeddings and evaluate the results."
+authors = [
+    { name="Martin Kerscher" },
+]
+requires-python = "<3.8"
+dependencies = [
+    "bpemb>=0.3.5",
+    "fastBPE>=0.0.0",
+    "gensim==3.4.0",
+    "nltk==3.4.1",
+    "numpy>=1.17.1",
+    "pytorch-transformers==1.1.0",
+    "laserembeddings",
+    "scikit-learn==1.0.2",
+    "sentence-transformers==0.2.2",
+    "torch==1.1.0",
+    "tensorflow==1.15.0",
+    "protobuf==3.14.0",
+    "pandas==1.2.0",
+    "tqdm==4.42.1",
+    "huggingface-hub>=0.16.4",
+    "sent2vec @ git+https://github.com/epfml/sent2vec.git",
+    "pyemd==0.5.1",
+    "pytorch-pretrained-bert==0.6.2",
+    "platformdirs",
+    "gdown==4.7.3",
+    "requests",
+]
+
+[project.optional-dependencies]
+evaluate_linguistic_features = [
+    "spacy"
+]
+
+[project.scripts]
+vec2sent_cleanup = "vec2sent.sentence_embeddings.cache_utils:cleanup"
+vec2sent_generate = "vec2sent.lstm.generate:main"
+vec2sent_evaluate = "vec2sent.evaluation.__main__:main"
+vec2sent_arithmetic = "vec2sent.scripts.vector_arithmetic:main"
diff --git a/src/external/InferSent b/src/external/InferSent
diff --git a/src/external/geometric_embedding/__init__.py b/src/external/geometric_embedding/__init__.py
@@ -0,0 +1,5 @@
+from vec2sent.sys_path_hack import add_to_path
+
+add_to_path(["geometric_embedding", "geometric_embedding"])
+
+from .geometric_embedding.gem import SentenceEmbedder
diff --git a/src/external/geometric_embedding/geometric_embedding b/src/external/geometric_embedding/geometric_embedding
diff --git a/src/external/mos/__init__.py b/src/external/mos/__init__.py
@@ -0,0 +1,7 @@
+from vec2sent.sys_path_hack import add_to_path
+
+add_to_path(["mos", "mos"])
+
+from vec2sent.mos.mos.embed_regularize import embedded_dropout
+from vec2sent.mos.mos.model import RNNModel
+from vec2sent.mos.mos.weight_drop import WeightDrop
diff --git a/src/external/mos/mos b/src/external/mos/mos
diff --git a/src/external/quick_thought/S2V b/src/external/quick_thought/S2V
diff --git a/src/external/quick_thought/__init__.py b/src/external/quick_thought/__init__.py
@@ -0,0 +1,5 @@
+from vec2sent.sys_path_hack import add_to_path
+
+add_to_path(["quick_thought", "S2V", "src"])
+
+from .S2V.src import encoder_manager, configuration
diff --git a/src/external/sys_path_hack.py b/src/external/sys_path_hack.py
@@ -0,0 +1,14 @@
+import sys
+from pathlib import Path
+from typing import List
+
+def add_to_path(path: List[str]) -> None:
+    """
+    This is unfortunately necessary because I am importing from a lot of old code that is not organized as a Python package.
+    A lot of old code uses absolute imports that assume the script is executed from the root folder of its repository.
+    The only solution is to add that root folder to sys.path.
+
+    @param path: list of folders inside src/external containing the code (e.g. ["mos", "mos"] -> src/external/mos/mos)
+    """
+    current_dir = Path(__file__).resolve().parent
+    sys.path.append(str(current_dir.joinpath(*path)))
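
Reviewer note: the effect of `add_to_path` can be illustrated with a self-contained sketch. It builds a throwaway "legacy" folder in which one file imports its sibling with an absolute import, then shows that the import only resolves once the folder is on `sys.path`. The module names `legacy_helper_demo` and `legacy_model_demo`, the extra `base_dir` parameter and the `demo` function are hypothetical and exist only for this illustration; the committed function derives the base directory from `__file__` instead.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import List


def add_to_path(base_dir: Path, path: List[str]) -> str:
    # Same idea as the committed helper, but with an explicit base directory
    # (an assumption made so that the sketch can run anywhere).
    entry = str(base_dir.joinpath(*path))
    sys.path.append(entry)
    return entry


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        legacy_root = Path(tmp, "mos", "mos")
        legacy_root.mkdir(parents=True)
        # A sibling module and a module that imports it absolutely,
        # the way script-style legacy code tends to do.
        (legacy_root / "legacy_helper_demo.py").write_text("NAME = 'legacy'\n")
        (legacy_root / "legacy_model_demo.py").write_text(
            "import legacy_helper_demo\nNAME = legacy_helper_demo.NAME\n"
        )
        entry = add_to_path(Path(tmp), ["mos", "mos"])
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("legacy_model_demo")
            return module.NAME
        finally:
            # Remove the temporary entry again; the real helper keeps it.
            sys.path.remove(entry)
```

One trade-off of this approach is that every folder added this way shares a single global namespace, so two external repositories that both ship a top-level `model.py` would shadow each other.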
diff --git a/src/vec2sent/dataset/__init__.py b/src/vec2sent/dataset/__init__.py
@@ -0,0 +1,65 @@
+import logging
+from tqdm import tqdm
+from nltk.tokenize import word_tokenize
+
+from vec2sent.sentence_embeddings.abstract_sentence_embedding import AbstractEmbedding
+from vec2sent.util.embedding_wrapper import EmbeddingWrapper
+from vec2sent.dataset.sentence_dataset import SortedSentenceDataset
+
+import torch
+
+
+def determine_batch_size(sentence_embeddings: AbstractEmbedding) -> int:
+    if sentence_embeddings.get_name() in ['randomLSTM', 'borep', 'gem'] or sentence_embeddings.input_strings():
+        return 200
+    return 1
+
+
+def load_dataset(
+        path: str,
+        word_embeddings: EmbeddingWrapper,
+        sentence_embeddings: AbstractEmbedding,
+        device: torch.device,
+        leave_order: bool,
+        batch_size: int,
+        max_len: int,
+        num_sentences: int = 0,
+        start_token: str = None
+) -> SortedSentenceDataset:
+    """
+    Loads a dataset for training or evaluation.
+
+    @param path: path to the file containing the data
+    @param word_embeddings: word embeddings
+    @param sentence_embeddings: sentence embeddings
+    @param device: where to initially load the dataset (might not fit in VRAM)
+    @param leave_order: whether to leave the dataset in order, or sort it into batches of even length
+    @param batch_size: batch size
+    @param max_len: maximum sentence length in dataset
+    @param num_sentences: if set to a number > 0, the dataset will be cut off after said number of sentences
+    @param start_token: string added to each line of the dataset (after tokenization)
+    """
+
+    dataset = SortedSentenceDataset(word_embeddings, sentence_embeddings, device, batch_size, max_len)
+    oov = 0
+
+    with open(path, "r", encoding="utf-8") as f:
+        for i, line in enumerate(tqdm(f, total=num_sentences, desc="Loading dataset {}".format(path))):
+            # Apply basic tokenization to punctuation first
+            line = " ".join(word_tokenize(line))
+            line = line.replace("''", '"').replace("``", '"')
+
+            oov += dataset.add(line, max_len, start_token)
+
+            if i == num_sentences - 1:
+                break
+
+    logger = logging.getLogger(__name__)
+    logger.info('Loaded {} sentences'.format(len(dataset)))
+    dataset.finish_init()
+
+    logger.info("Out of vocabulary: {}".format(oov))
+
+    pin_memory = device.type != 'cpu'
+    dataset.create_data_loader(batch_size, leave_order, pin_memory)
+    return dataset
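
Reviewer note: two details of the loading loop in `load_dataset` are easy to miss. `num_sentences = 0` means "no limit", because `i == num_sentences - 1` never holds for a non-negative `i`. Separately, the two `replace` calls turn the tokenizer's Penn Treebank style quote tokens back into plain double quotes. The sketch below expresses both with hypothetical helpers (`limit_lines`, `normalize_quotes`) that are not part of the commit; it leaves out `word_tokenize` itself so that it runs without NLTK data.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def limit_lines(lines: Iterable[str], num_sentences: int = 0) -> Iterator[str]:
    # A value <= 0 is treated as "read everything", mirroring the loop in the diff.
    stop: Optional[int] = num_sentences if num_sentences > 0 else None
    return islice(lines, stop)


def normalize_quotes(tokenized: str) -> str:
    # Map the `` and '' quote tokens to a plain double quote character.
    return tokenized.replace("''", '"').replace("``", '"')
```

Using `islice` keeps the cut-off in one place instead of relying on an index comparison inside the loop body.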