
[FEATURE]: Improvements in Dataset provision #846

Open · 1 task done · Labels: enhancement (New feature or request), epic

jularase (Collaborator) opened this issue Feb 11, 2025 · 1 comment

Motivation

We have received some very useful user feedback about dataset provision:

  • Users struggle to provide their own datasets – many don’t have datasets ready and aren’t sure what a “good dataset” looks like.
  • Lack of example datasets – users expect Lumigator to provide sample datasets to reduce friction in getting started.

User Story
As a developer or technical user evaluating models in Lumigator, I want ready-to-use example datasets and clear guidance on preparing my own data so that I can start evaluating models quickly without getting stuck.

Problem Statement
Users who are new to Lumigator often don’t have their own datasets and struggle to understand what a good dataset looks like. This creates a major onboarding hurdle and slows down their ability to test models. By providing preloaded example datasets and making dataset preparation easier, we can significantly reduce friction and improve adoption.

Key Capabilities

  • Preload example datasets in Lumigator, so users can run evaluations immediately without needing to upload their own data.
  • Provide clear dataset preparation guidelines in the UI, helping users understand the ideal format, structure, and quality of datasets (a minimal format sketch follows this list).
  • Improve dataset discoverability by adding a dedicated section in documentation and GitHub with discussion around dataset examples.
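
To make the format guidance concrete, here is a minimal sketch of seeding one preloaded example dataset. The flat CSV layout, the column names ("examples", "ground_truth"), and the `example_datasets/` seed directory are assumptions for illustration, not a documented Lumigator contract:

```python
# Minimal sketch of seeding one preloaded example dataset.
# ASSUMPTIONS: the "examples"/"ground_truth" column names and the
# example_datasets/ seed directory are illustrative, not a Lumigator contract.
from pathlib import Path

import pandas as pd

SEED_DIR = Path("example_datasets")  # hypothetical location for preloaded data


def write_example_dataset() -> Path:
    """Write a tiny summarization-style dataset users can evaluate against."""
    df = pd.DataFrame(
        {
            "examples": [
                "The quick brown fox jumps over the lazy dog near the river bank.",
                "Lumigator lets developers compare several language models side by side.",
            ],
            "ground_truth": [
                "A fox jumps over a dog.",
                "Lumigator compares language models.",
            ],
        }
    )
    SEED_DIR.mkdir(parents=True, exist_ok=True)
    out_path = SEED_DIR / "sample_summarization.csv"
    df.to_csv(out_path, index=False)
    return out_path


if __name__ == "__main__":
    print(f"Wrote {write_example_dataset()}")
```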

Alternatives

No response

Contribution

No response

Have you searched for similar issues before submitting this one?

  • Yes, I have searched for similar issues
jbeliao commented Feb 12, 2025

@jularase adding here ideas discussed with @agpituk and @ividal this morning:

Enhancing Dataset Selection for Users Without Pre-existing Data

Not all users uploading datasets for evaluation will have their own high-quality dataset ready. We want to ensure that they can still interact with Lumigator meaningfully, explore evaluation workflows, and test models in realistic conditions. To achieve this, I think we should:

  • Offer a selection of preloaded datasets so users can experiment without uploading their own.
  • Ensure multilingual support by including diverse datasets.
  • Explore synthetic data generation and transcription-based datasets to cover more use cases.

Suggested Datasets to offer:

  • The Pile v2 (EleutherAI) – A large-scale dataset optimized for language modeling, useful for evaluating general LLM performance.
  • PLEIAS (Multilingual Dataset) – High-quality multilingual dataset covering diverse languages and use cases.
  • Common Voice Transcription Data – Leveraging Mozilla’s Common Voice project for evaluating models on transcribed speech-to-text data.
  • WMT Translation Datasets – Standard benchmark datasets used for evaluating machine translation systems.
  • XSum or CNN/DailyMail – Classic summarization datasets for benchmarking.

This will reduce friction for users, allowing them to test model evaluation workflows without having to source their own datasets first. It should, however, be coupled with #834, since it is always better for users to evaluate models against data from their own use case.
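
Several of the suggested corpora are already on the Hugging Face Hub, so preloading could start from there. Below is a minimal sketch that pulls a small CNN/DailyMail slice and flattens it into a CSV; the output column names ("examples", "ground_truth") are the same illustrative assumption as in the sketch above, not a documented Lumigator schema:

```python
# Sketch: fetch a small slice of CNN/DailyMail from the Hugging Face Hub
# and flatten it into a CSV suitable for use as an evaluation dataset.
# ASSUMPTION: the "examples"/"ground_truth" output columns are illustrative.
from datasets import load_dataset


def export_cnn_dailymail_sample(n: int = 100, out_path: str = "cnn_dailymail_sample.csv") -> None:
    # "3.0.0" is the standard config of the cnn_dailymail dataset on the Hub.
    ds = load_dataset("cnn_dailymail", "3.0.0", split=f"test[:{n}]")
    # Map the dataset's native fields (article/highlights) onto the flat schema.
    ds = ds.map(
        lambda row: {"examples": row["article"], "ground_truth": row["highlights"]},
        remove_columns=ds.column_names,
    )
    ds.to_csv(out_path)


if __name__ == "__main__":
    export_cnn_dailymail_sample()
```

A similar mapping would work for XSum (whose native fields are document/summary) or for the WMT pairs, which store source and target text under a translation field.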
