The MC Speech Dataset

This is a public domain speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish. A transcription is provided for each clip. The clips have a total length of more than 22 hours.

The texts are in the public domain. The audio was recorded in 2021-22 as part of my master's thesis and is also in the public domain.

The dataset is available at:

HuggingFace
Kaggle
OpenSLR

If you use this dataset, please cite:

@mastersthesis{mcspeech,
  title={Analiza porównawcza korpusów nagrań mowy dla celów syntezy mowy w języku polskim},
  author={Czyżnikiewicz, Mateusz},
  year={2022},
  month={December},
  school={Warsaw University of Technology},
  type={Master's thesis},
  doi={10.13140/RG.2.2.26293.24800},
  note={Available at \url{http://dx.doi.org/10.13140/RG.2.2.26293.24800}},
}

Also, if you find this resource helpful, kindly consider leaving a ⭐.
