BRIGHT #270

Open
8 tasks
robro612 opened this issue Aug 18, 2024 · 0 comments


Dataset Information:

BRIGHT: "A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval ... 1,398 real-world queries collected from [12] diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data."

Links to Resources:

Dataset ID(s) & supported entities:

  • bright/biology
  • bright/earth_science
  • bright/economics
  • bright/psychology
  • bright/robotics
  • bright/stackoverflow
  • bright/sustainable_living
  • bright/pony
  • bright/leetcode
  • bright/aops
  • bright/theoremqa_theorems
  • bright/theoremqa_questions

Each dataset would provide its own queries, passages, and qrels. The StackExchange datasets (biology through pony) include both passage-level and document-level labels for the passage and long-document retrieval settings, so there would probably also be bright/{domain}/long_documents variants that inherit queries from the base task. This implicitly treats the passage retrieval setting as the default, as in the paper.
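To make the proposed ID layout concrete, here is a minimal sketch that enumerates the IDs described above. The helper name `proposed_dataset_ids` and the constants are hypothetical and not part of ir_datasets; the set of domains with long-document variants is taken from the "biology through pony" range in this issue and should be checked against the actual data.

```python
# Hypothetical sketch: enumerate the dataset IDs proposed in this issue.
# Nothing here is an existing ir_datasets API.

DOMAINS = [
    "biology", "earth_science", "economics", "psychology", "robotics",
    "stackoverflow", "sustainable_living", "pony", "leetcode", "aops",
    "theoremqa_theorems", "theoremqa_questions",
]

# Assumption from the issue text: the domains from biology through pony
# carry document-level labels and so get a long_documents variant.
LONG_DOC_DOMAINS = DOMAINS[: DOMAINS.index("pony") + 1]


def proposed_dataset_ids():
    """Return base IDs plus long_documents variants where applicable."""
    ids = []
    for domain in DOMAINS:
        ids.append(f"bright/{domain}")  # passage retrieval (default setting)
        if domain in LONG_DOC_DOMAINS:
            ids.append(f"bright/{domain}/long_documents")
    return ids


if __name__ == "__main__":
    for dataset_id in proposed_dataset_ids():
        print(dataset_id)
```

Keeping the passage setting at the base ID mirrors the paper's default, while the long_documents variants would only need their own docs and qrels.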

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

  • Queries, qrels (stored in the same file as the queries), and docs are all stored in single .parquet files on HF (largest file: leetcode-00000-of-00001.parquet, 211 MB), unlike the other dataset sources in downloads.json.
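
Because queries and qrels share one file, a loader would need to split each row into a query record and zero or more qrel records. The sketch below shows one way to do that on rows already read into plain dicts. The field names `id`, `query`, and `gold_ids` are assumptions about the parquet schema and are not confirmed by this issue; the function and type names are hypothetical, and binary relevance of 1 is likewise assumed.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def split_queries_and_qrels(
    rows: Iterable[Dict],
    gold_field: str = "gold_ids",  # assumed column holding relevant doc IDs
) -> Tuple[List[Query], List[Qrel]]:
    """Split combined rows into separate query and qrel lists."""
    queries: List[Query] = []
    qrels: List[Qrel] = []
    for row in rows:
        query_id = str(row["id"])  # assumed query ID column
        queries.append(Query(query_id, row["query"]))  # assumed text column
        for doc_id in row[gold_field]:
            # Assumption: every listed document is relevant with grade 1.
            qrels.append(Qrel(query_id, str(doc_id), 1))
    return queries, qrels


if __name__ == "__main__":
    sample_rows = [
        {"id": "0", "query": "example question", "gold_ids": ["d1", "d2"]},
        {"id": "1", "query": "another question", "gold_ids": []},
    ]
    qs, rels = split_queries_and_qrels(sample_rows)
    print(len(qs), len(rels))
```

Passing a different `gold_field` would let the same function build qrels for a long-document variant, provided the file carries a separate column for document-level labels.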