This repository contains the code used in the research paper titled "Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction" authored by Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz. The paper was accepted for ACL 2024 (findings).
- ACL - TBA
- [arXiv](https://arxiv.org/abs/2408.02337)

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```
The PUGG dataset is available in the following repositories:
- General - contains all tasks (KBQA, MRC, IR*)
For more straightforward usage, the tasks are also available in separate repositories:
The knowledge graph for KBQA task is available in the following repository:
Note: If you want to utilize the IR task in the BEIR format (qrels in `.tsv` format), please download the IR repository.
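For orientation, BEIR-style qrels files are tab-separated with a header row of `query-id`, `corpus-id`, and `score` columns. A minimal loader sketch under that assumption (the function name is illustrative, not part of this codebase):

```python
import csv
from collections import defaultdict


def load_qrels(path):
    """Load a BEIR-style qrels .tsv into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row: query-id, corpus-id, score
        for query_id, corpus_id, score in reader:
            qrels[query_id][corpus_id] = int(score)
    return dict(qrels)
```

The nested-dict layout matches what BEIR's evaluation utilities expect as their `qrels` argument.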
- Configured Python 3.10 environment.
- Installed Poetry.
- Installed Docker (for the `search_results_acquisition` and `rerank` stages (read more)).
To install all dependencies, run:

```shell
poetry install
```
The repository uses DVC to manage the data. To download the data, run:

```shell
dvc pull
```
The repository uses DVC to manage the dataset construction pipeline. `dvc.yaml` contains all of the stages (except `run_search_results_acquisition.py`). Any data that are external or generated by external tools (e.g. Inforex, a spreadsheet) are associated with `*.dvc` files stored in the `data` directory.
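For readers unfamiliar with DVC, a stage in `dvc.yaml` ties a command to its data dependencies and outputs, which is how `dvc repro` knows what to rerun. A hypothetical sketch (the script path and file names below are illustrative, not taken from this repository):

```yaml
stages:
  acquire_suggestions:
    cmd: poetry run python run_acquire_suggestions.py  # hypothetical script path
    deps:
      - run_acquire_suggestions.py
      - data/input/prefixes.json      # hypothetical input file
    outs:
      - data/suggestions/             # hypothetical output directory
```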
To reproduce all DVC stages, run:

```shell
dvc repro
```
The `run_search_results_acquisition.py` script acquires data from the Google Search API. It should be run using Docker, not DVC, because it uses a database. For a full reproduction, it should be run after the `acquire_suggestions` stage.
Credentials should be passed using the following environment variables:

```
CUSTOM_SEARCH_ID="..."
GOOGLE_API_KEYS='["...", "..."]'
```
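Note that `GOOGLE_API_KEYS` holds a JSON list rather than a single value, so consuming code can parse it with the standard `json` module. A minimal sketch (the function name is illustrative, not from this codebase):

```python
import json
import os


def load_google_api_keys():
    """Parse the GOOGLE_API_KEYS env var, which holds a JSON list of API keys."""
    raw = os.environ.get("GOOGLE_API_KEYS", "[]")
    keys = json.loads(raw)
    if not isinstance(keys, list):
        raise ValueError("GOOGLE_API_KEYS must be a JSON list of strings")
    return keys
```

Supplying several keys this way allows a script to rotate between them when one hits its API quota.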
To run the script:

```shell
docker build -f docker/search_results_acquisition_runner/dockerfile -t search_results_acquisition_runner .
docker run -v "$(pwd)"/data:/google-query-qa-dataset/data --env-file credentials.env search_results_acquisition_runner
```
Below is an overview of the project structure along with descriptions of the most important modules, directories and files.
- `gqqd/` - a Python module containing the code for creating the KBQA (natural), MRC, and IR datasets.
- `sqqd/` - a Python module containing the code for creating the KBQA (template-based) dataset.
- `tools/` - tools that were used in the project but not integrated directly into the main codebase.
- `baselines/` - implementations of baseline models that are used for evaluation on the constructed datasets.
- `data/` - the data used in the project, including input data, intermediate data, and the final datasets.
- `tests/` - unit tests for the codebase.
- `.gitignore`, `.dockerignore`, `.dvcignore` - specify patterns for files or directories that should be ignored by Git, Docker, or DVC respectively.
- `.env`, `credentials.env` - contain environment variables and credentials required for the project. They are not tracked by Git because they contain sensitive information. To reproduce the whole pipeline, create them with the following content:
  - `.env`
    ```
    SPARQL_USER_AGENT= # user agent for SPARQL queries
    ```
  - `credentials.env`
    ```
    CUSTOM_SEARCH_ID= # Google custom search ID
    GOOGLE_API_KEYS='["key1", "key2"]' # list of Google API keys to use for custom search
    OPENAI_API_KEY= # OpenAI API key
    ```
- `dvc.yaml`, `dvc.lock` - related to DVC. `dvc.yaml` contains all the stages, specifying the data dependencies, while `dvc.lock` locks the exact versions of the data files.
- `pyproject.toml`, `poetry.lock` - related to Poetry.