Motivation
We have received some very useful user feedback about dataset provision:
Users struggle to provide their own datasets – Many don’t have datasets ready and aren’t sure what a “good dataset” looks like.
Lack of example datasets – Users expect Lumigator to provide sample datasets to reduce friction in getting started.
User Story
As a developer or technical user evaluating models in Lumigator, I want ready-to-use example datasets and clear guidance on preparing my own data so that I can start evaluating models quickly without getting stuck.
Problem Statement
Users who are new to Lumigator often don’t have their own datasets and struggle to understand what a good dataset looks like. This creates a major onboarding hurdle and slows down their ability to test models. By providing preloaded example datasets and making dataset preparation easier, we can significantly reduce friction and improve adoption.
Key Capabilities
Preload example datasets in Lumigator, so users can run evaluations immediately without needing to upload their own data.
Provide clear dataset preparation guidelines in the UI, helping users understand the ideal format, structure, and quality of datasets.
Improve dataset discoverability by adding a dedicated section to the documentation and a GitHub discussion space covering example datasets.
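To make the preparation guidelines concrete, here is a minimal sketch of what a well-formed evaluation dataset file might look like. Note that the column names `examples` and `ground_truth` are assumptions for illustration only; the exact schema Lumigator expects should be confirmed against its documentation.

```python
import csv
import io

# Hypothetical rows for a summarization evaluation dataset: one input
# text per row plus a reference output. The "examples"/"ground_truth"
# column names are assumptions, not a confirmed Lumigator contract.
rows = [
    {
        "examples": "The city council met on Tuesday to approve the new budget...",
        "ground_truth": "City council approves new budget.",
    },
    {
        "examples": "Researchers announced a breakthrough in battery chemistry...",
        "ground_truth": "Battery chemistry breakthrough announced.",
    },
]


def write_sample_dataset(rows, fieldnames=("examples", "ground_truth")):
    """Serialize rows to CSV text in the assumed upload format."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = write_sample_dataset(rows)
print(csv_text.splitlines()[0])  # header row: examples,ground_truth
```

A guideline section in the UI could show exactly such a two-column example alongside quality advice (non-empty inputs, consistent reference style, no duplicated rows).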
Alternatives
No response
Contribution
No response
Have you searched for similar issues before submitting this one?
Yes, I have searched for similar issues
@jularase adding here the ideas discussed with @agpituk and @ividal this morning about enhancing dataset selection for users without pre-existing data:
Not all users uploading datasets for evaluation will have their own high-quality dataset ready. We want to ensure that they can still interact with Lumigator meaningfully, explore evaluation workflows, and test models in realistic conditions. To achieve this, I think we should:
Offer a selection of preloaded datasets so users can experiment without uploading their own.
Ensure multilingual support by including diverse datasets.
Explore synthetic data generation and transcription-based datasets to cover more use cases.
Suggested Datasets to offer:
The Pile v2 (EleutherAI) – A large-scale dataset optimized for language modeling, useful for evaluating general LLM performance.
PLEIAS (Multilingual Dataset) – High-quality multilingual dataset covering diverse languages and use cases.
Common Voice Transcription Data – Leveraging Mozilla’s Common Voice project for evaluating models on transcribed speech-to-text data.
WMT Translation Datasets – Standard benchmark datasets used for evaluating machine translation systems.
XSum or CNN/DailyMail – Classic summarization datasets for benchmarking.
This will reduce friction for users, allowing them to test model evaluation workflows without having to source their own datasets first. However, it should be coupled with #834, as it is always better for users to evaluate models against data from their own use case.
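As a sketch of how one of the suggested corpora could be normalized into a preloaded example, the helper below maps CNN/DailyMail's native fields (`article`, `highlights`) onto a generic input/reference pair. The target column names and the idea of fetching via the Hugging Face `datasets` package are assumptions about a possible implementation, not a committed design.

```python
def to_example_row(record: dict) -> dict:
    """Map a CNN/DailyMail record to an assumed (input, reference) schema.

    CNN/DailyMail's own fields are "article" and "highlights"; the
    "examples"/"ground_truth" names on the output side are hypothetical.
    """
    return {
        "examples": record["article"],
        "ground_truth": record["highlights"],
    }


# With the Hugging Face `datasets` package (not run here; requires a
# network download):
#
#   from datasets import load_dataset
#   ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
#   rows = [to_example_row(r) for r in ds]

sample = {"article": "Some long news article...", "highlights": "Short summary."}
print(to_example_row(sample)["ground_truth"])  # Short summary.
```

A similar one-line mapping per corpus would let each suggested dataset ship in a single consistent format, whatever schema is ultimately chosen.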