BRIGHT #270

robro612 · 2024-08-18T09:49:52Z

Dataset Information:

BRIGHT: "A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval ... 1,398 real-world queries collected from [12] diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data."

Links to Resources:

Website: https://brightbenchmark.github.io/
Data: https://huggingface.co/datasets/xlangai/BRIGHT
Paper: https://arxiv.org/abs/2407.12883

Dataset ID(s) & supported entities:

bright/biology
bright/earth_science
bright/economics
bright/psychology
bright/robotics
bright/stackoverflow
bright/sustainable_living
bright/pony
bright/leetcode
bright/aops
bright/theoremqa_theorems
bright/theoremqa_questions

Each dataset would expose its own queries, passages, and qrels. The StackExchange datasets (biology through pony) include both passage-level and document-level labels, covering the passage and long-document retrieval settings. These could therefore also be exposed as bright/{domain}/long_documents, inheriting queries from the base dataset and implicitly treating passage retrieval as the default setting, as the paper does.
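To make the proposed naming concrete, here is a rough sketch that enumerates the IDs implied by the list and paragraph above. The `proposed_dataset_ids` helper and the constants are hypothetical names for illustration only, not part of ir_datasets, and the long-document range (biology through pony) is taken from this issue's description rather than verified against the data.

```python
# Domains exactly as listed in this issue, in the same order.
DOMAINS = [
    "biology", "earth_science", "economics", "psychology", "robotics",
    "stackoverflow", "sustainable_living", "pony", "leetcode", "aops",
    "theoremqa_theorems", "theoremqa_questions",
]

# Assumption from the issue text: "biology - pony" carry document-level labels.
LONG_DOC_DOMAINS = DOMAINS[: DOMAINS.index("pony") + 1]


def proposed_dataset_ids():
    """Return the base IDs plus the suggested long_documents variants."""
    ids = []
    for domain in DOMAINS:
        ids.append(f"bright/{domain}")
        if domain in LONG_DOC_DOMAINS:
            # Would inherit queries from the base dataset, with its own docs/qrels.
            ids.append(f"bright/{domain}/long_documents")
    return ids


if __name__ == "__main__":
    for dataset_id in proposed_dataset_ids():
        print(dataset_id)
```

Under these assumptions, that gives 12 base datasets plus 8 long-document variants.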

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Dataset definition (in ir_datasets/datasets/[topid].py)
Tests (in tests/integration/[topid].py)
Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
Documentation (in ir_datasets/etc/[topid].yaml)
- Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
Downloadable content (in ir_datasets/etc/downloads.json)
- Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

Queries, qrels (in the same file as the queries), and docs are all stored as single .parquet files on HF (largest file: leetcode-00000-of-00001.parquet, 211 MB), unlike the other dataset sources in downloads.json.
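Since queries and qrels share a file, a loader would need to split each row into a query record and one or more qrel records. A minimal sketch of that step follows, operating on rows already decoded into dicts (reading the .parquet file itself is left out). The column names `id`, `query`, and `gold_ids` are assumptions about the HF schema and should be checked against the actual files; `Query`, `Qrel`, and `split_rows` are hypothetical names, not ir_datasets APIs.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def split_rows(
    rows: Iterable[dict], gold_field: str = "gold_ids"
) -> Tuple[List[Query], List[Qrel]]:
    """Split combined query/qrel rows into separate query and qrel records.

    Assumes each row has "id", "query", and a list of relevant doc IDs
    under `gold_field` (column names are assumptions, not verified).
    """
    queries: List[Query] = []
    qrels: List[Qrel] = []
    for row in rows:
        query_id = str(row["id"])
        queries.append(Query(query_id, row["query"]))
        for doc_id in row[gold_field]:
            # Binary relevance: every listed document is marked relevant.
            qrels.append(Qrel(query_id, str(doc_id), 1))
    return queries, qrels


if __name__ == "__main__":
    sample = [{"id": "0", "query": "example question", "gold_ids": ["d1", "d2"]}]
    print(split_rows(sample))
```

Passing a different `gold_field` would let the same function build qrels for the long-document setting, if the document-level labels live in a separate column of the same row.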


robro612 added the add-dataset label Aug 18, 2024
