
[FEATURE]: Improvements in Dataset provision #846

Open · 1 task done · Labels: enhancement (New feature or request), epic

jularase (Collaborator) opened this issue Feb 11, 2025 · 1 comment

Motivation

We have received some very useful user feedback about dataset provision:

  • Users struggle to provide their own datasets – many don’t have datasets ready and aren’t sure what a “good dataset” looks like.
  • Lack of example datasets – users expect Lumigator to provide sample datasets to reduce friction in getting started.

User Story
As a developer or technical user evaluating models in Lumigator, I want ready-to-use example datasets and clear guidance on preparing my own data so that I can start evaluating models quickly without getting stuck.

Problem Statement
Users who are new to Lumigator often don’t have their own datasets and struggle to understand what a good dataset looks like. This creates a major onboarding hurdle and slows down their ability to test models. By providing preloaded example datasets and making dataset preparation easier, we can significantly reduce friction and improve adoption.

Key Capabilities

  • Preload example datasets in Lumigator, so users can run evaluations immediately without needing to upload their own data.
  • Provide clear dataset preparation guidelines in the UI, helping users understand the ideal format, structure, and quality of datasets (a minimal format sketch follows this list).
  • Improve dataset discoverability by adding a dedicated section in documentation and GitHub with discussion around dataset examples.
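
To make the format guidance concrete, here is a minimal sketch of seeding one preloaded example dataset. The flat CSV layout, the column names ("examples", "ground_truth"), and the `example_datasets/` seed directory are assumptions for illustration, not a documented Lumigator contract:

```python
# Minimal sketch of seeding one preloaded example dataset.
# ASSUMPTIONS: the "examples"/"ground_truth" column names and the
# example_datasets/ seed directory are illustrative, not a Lumigator contract.
from pathlib import Path

import pandas as pd

SEED_DIR = Path("example_datasets")  # hypothetical location for preloaded data


def write_example_dataset() -> Path:
    """Write a tiny summarization-style dataset users can evaluate against."""
    df = pd.DataFrame(
        {
            "examples": [
                "The quick brown fox jumps over the lazy dog near the river bank.",
                "Lumigator lets developers compare several language models side by side.",
            ],
            "ground_truth": [
                "A fox jumps over a dog.",
                "Lumigator compares language models.",
            ],
        }
    )
    SEED_DIR.mkdir(parents=True, exist_ok=True)
    out_path = SEED_DIR / "sample_summarization.csv"
    df.to_csv(out_path, index=False)
    return out_path


if __name__ == "__main__":
    print(f"Wrote {write_example_dataset()}")
```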

Alternatives

No response

Contribution

No response

Have you searched for similar issues before submitting this one?

  • Yes, I have searched for similar issues
jbeliao commented Feb 12, 2025

@jularase adding here ideas discussed with @agpituk and @ividal this morning:

Enhancing Dataset Selection for Users Without Pre-existing Data

Not all users uploading datasets for evaluation will have their own high-quality dataset ready. We want to ensure that they can still interact with Lumigator meaningfully, explore evaluation workflows, and test models in realistic conditions. To achieve this, I think we should:

  • Offer a selection of preloaded datasets so users can experiment without uploading their own.
  • Ensure multilingual support by including diverse datasets.
  • Explore synthetic data generation and transcription-based datasets to cover more use cases.

Suggested Datasets to offer:

  • The Pile v2 (EleutherAI) – A large-scale dataset optimized for language modeling, useful for evaluating general LLM performance.
  • PLEIAS (Multilingual Dataset) – High-quality multilingual dataset covering diverse languages and use cases.
  • Common Voice Transcription Data – Leveraging Mozilla’s Common Voice project for evaluating models on transcribed speech-to-text data.
  • WMT Translation Datasets – Standard benchmark datasets used for evaluating machine translation systems.
  • XSum or CNN/DailyMail – Classic summarization datasets for benchmarking.

This will reduce friction for users, allowing them to test model evaluation workflows without having to source their own datasets first. It should, however, be coupled with #834, since it is always better for users to evaluate models against data from their own use case.
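
Several of the suggested corpora are already on the Hugging Face Hub, so preloading could start from there. Below is a minimal sketch that pulls a small CNN/DailyMail slice and flattens it into a CSV; the output column names ("examples", "ground_truth") are the same illustrative assumption as in the sketch above, not a documented Lumigator schema:

```python
# Sketch: fetch a small slice of CNN/DailyMail from the Hugging Face Hub
# and flatten it into a CSV suitable for use as an evaluation dataset.
# ASSUMPTION: the "examples"/"ground_truth" output columns are illustrative.
from datasets import load_dataset


def export_cnn_dailymail_sample(n: int = 100, out_path: str = "cnn_dailymail_sample.csv") -> None:
    # "3.0.0" is the standard config of the cnn_dailymail dataset on the Hub.
    ds = load_dataset("cnn_dailymail", "3.0.0", split=f"test[:{n}]")
    # Map the dataset's native fields (article/highlights) onto the flat schema.
    ds = ds.map(
        lambda row: {"examples": row["article"], "ground_truth": row["highlights"]},
        remove_columns=ds.column_names,
    )
    ds.to_csv(out_path)


if __name__ == "__main__":
    export_cnn_dailymail_sample()
```

A similar mapping would work for XSum (whose native fields are document/summary) or for the WMT pairs, which store source and target text under a translation field.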
