This repository contains the code used in the research paper titled "Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction" authored by Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz. The paper was accepted for ACL 2024 (findings).
- ACL - TBA
- [arXiv](https://arxiv.org/abs/2408.02337)

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```
The PUGG dataset is available in the following repositories:
- General - contains all tasks (KBQA, MRC, IR*)
For more straightforward usage, the tasks are also available in separate repositories:
The knowledge graph for KBQA task is available in the following repository:
Note: If you want to utilize the IR task in the BEIR format (qrels in `.tsv` format), please download the IR repository.
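For orientation, BEIR-style qrels files are tab-separated with a header row of `query-id`, `corpus-id`, and `score` columns. A minimal loader sketch under that assumption (the function name is illustrative, not part of this codebase):

```python
import csv
from collections import defaultdict


def load_qrels(path):
    """Load a BEIR-style qrels .tsv into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row: query-id, corpus-id, score
        for query_id, corpus_id, score in reader:
            qrels[query_id][corpus_id] = int(score)
    return dict(qrels)
```

The nested-dict layout matches what BEIR's evaluation utilities expect as their `qrels` argument.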
- Configured Python 3.10 environment.
- Installed Poetry.
- Installed Docker (for the `search_results_acquisition` and `rerank` stages (read more)).
To install all dependencies, run:

```shell
poetry install
```
The repository uses DVC to manage the data. To download the data, run:

```shell
dvc pull
```
The repository uses DVC to manage the dataset construction pipeline. `dvc.yaml` contains all of the stages (except `run_search_results_acquisition.py`). Any data that are external or generated by external tools (e.g. Inforex, a spreadsheet) are associated with `*.dvc` files stored in the `data` directory.
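For readers unfamiliar with DVC, a stage in `dvc.yaml` ties a command to its data dependencies and outputs, which is how `dvc repro` knows what to rerun. A hypothetical sketch (the script path and file names below are illustrative, not taken from this repository):

```yaml
stages:
  acquire_suggestions:
    cmd: poetry run python run_acquire_suggestions.py  # hypothetical script path
    deps:
      - run_acquire_suggestions.py
      - data/input/prefixes.json      # hypothetical input file
    outs:
      - data/suggestions/             # hypothetical output directory
```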
To reproduce all DVC stages, run:

```shell
dvc repro
```
The `run_search_results_acquisition.py` script acquires data from the Google Search API. It should be run using Docker, not DVC, because it uses a database. For a full reproduction, it should be run after the `acquire_suggestions` stage.
Credentials should be passed using the following environment variables:

```
CUSTOM_SEARCH_ID="..."
GOOGLE_API_KEYS='["...", "..."]'
```
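Note that `GOOGLE_API_KEYS` holds a JSON list rather than a single value, so consuming code can parse it with the standard `json` module. A minimal sketch (the function name is illustrative, not from this codebase):

```python
import json
import os


def load_google_api_keys():
    """Parse the GOOGLE_API_KEYS env var, which holds a JSON list of API keys."""
    raw = os.environ.get("GOOGLE_API_KEYS", "[]")
    keys = json.loads(raw)
    if not isinstance(keys, list):
        raise ValueError("GOOGLE_API_KEYS must be a JSON list of strings")
    return keys
```

Supplying several keys this way allows a script to rotate between them when one hits its API quota.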
To run the script:

```shell
docker build -f docker/search_results_acquisition_runner/dockerfile -t search_results_acquisition_runner .
docker run -v "$(pwd)"/data:/google-query-qa-dataset/data --env-file credentials.env search_results_acquisition_runner
```
Below is an overview of the project structure along with descriptions of the most important modules, directories and files.
- `gqqd/` - a Python module containing the code for creating the KBQA (natural), MRC, and IR datasets.
- `sqqd/` - a Python module containing the code for creating the KBQA (template-based) dataset.
- `tools/` - tools that were used in the project but not integrated directly into the main codebase.
- `baselines/` - implementations of baseline models that are used for evaluation on the constructed datasets.
- `data/` - the data used in the project, including input data, intermediate data, and the final datasets.
- `tests/` - unit tests for the codebase.
- `.gitignore`, `.dockerignore`, `.dvcignore` - specify patterns for files or directories that should be ignored by Git, Docker, or DVC respectively.
- `.env`, `credentials.env` - contain environment variables and credentials required for the project. They are not tracked by Git because they contain sensitive information. To reproduce the whole pipeline, create them with the following content:
  - `.env`
    ```
    SPARQL_USER_AGENT= # user agent for SPARQL queries
    ```
  - `credentials.env`
    ```
    CUSTOM_SEARCH_ID= # Google custom search ID
    GOOGLE_API_KEYS='["key1", "key2"]' # list of Google API keys to use for custom search
    OPENAI_API_KEY= # OpenAI API key
    ```
- `dvc.yaml`, `dvc.lock` - related to DVC. `dvc.yaml` contains all the stages, specifying the data dependencies, while `dvc.lock` locks the exact versions of the data files.
- `pyproject.toml`, `poetry.lock` - related to Poetry.