Corpus

The main source of data is corpora. A corpus is a computer-readable collection of linguistic productions, such as text, speech, or gestures. The singular is "corpus"; the plural is "corpora".

Variants of Corpus

Corpora are structured data. They are compiled by different people, which is one reason they vary so much.

Who produced it?
- Age, gender, education, ethnicity, ...
What language(s) is it?
When was the text produced?
- A synchronic corpus contains no time-period information.
- A diachronic corpus has a time dimension.
For which goal or function was the corpus produced?
- Genre, medium

The answers to these questions are what make corpora differ from one another.
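
The questions above can be recorded as metadata attached to each item in a corpus. The following is a minimal sketch in Python. The `Document` class, its field names, and the rule used in `is_diachronic` (every document has a year, and at least two different years occur) are assumptions made for illustration and do not come from any particular corpus toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """One linguistic production plus its metadata (hypothetical fields)."""
    text: str
    language: str                      # What language is it?
    author_age: Optional[int] = None   # Who produced it?
    year: Optional[int] = None         # When was it produced?
    genre: Optional[str] = None        # For which goal or function?


def is_diachronic(corpus: List[Document]) -> bool:
    """Assumed rule: all documents carry a year and the years differ."""
    years = {doc.year for doc in corpus}
    return None not in years and len(years) > 1


dated = [
    Document("thou art", "en", year=1600),
    Document("you are", "en", year=2020),
]
undated = [Document("you are", "en")]

print(is_diachronic(dated))    # the corpus has a time dimension
print(is_diachronic(undated))  # no time-period information
```

Under this assumed rule, a corpus without year information is treated as synchronic.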

Choosing a corpus based on this is important.

The number of possible sentences in a language is effectively infinite, so no corpus can contain them all. This is why it is very important to select a corpus that is representative of the phenomenon you want to model. Your model will only be as good as the data you base it on.
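
One rough way to see whether a corpus fits the language you want to model is to measure how many words of a target sample never occur in the corpus. The sketch below is an illustration only: the function name `oov_rate`, the whitespace tokenization, and the two toy texts are assumptions, and a real check would use a proper tokenizer and far more data.

```python
def oov_rate(corpus_tokens, target_tokens):
    """Fraction of target tokens that never occur in the corpus."""
    vocabulary = set(corpus_tokens)
    if not target_tokens:
        return 0.0
    unseen = sum(1 for token in target_tokens if token not in vocabulary)
    return unseen / len(target_tokens)


# Toy data: a news-like corpus and a chat-like target sample.
news = "the minister said the budget will rise".split()
chat = "lol the budget is gone".split()

print(oov_rate(news, chat))  # share of chat tokens unseen in the news corpus
print(oov_rate(news, news))  # a corpus always covers itself fully
```

A high rate suggests that the corpus is a poor match for the target, so a model built on it is likely to struggle there.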

Properties of a corpus

Crawled/manually curated
- A crawled corpus is fetched automatically from the internet.
Balanced/imbalanced
Single author/multiple authors
Diachronic/Synchronic
- A diachronic corpus is organized along the time dimension.
- A synchronic corpus is not organized along the time dimension.
Written/spoken/mixed/video (the modality)
Single language/multi-language/parallel
- In a parallel corpus, the texts in the different languages cover the same content.
...
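
As an illustration of the balanced/imbalanced property, the sketch below checks whether each category (here, genre) makes up a roughly equal share of a corpus. This is one common reading of "balanced" and is an assumption here, as are the function names and the `tolerance` threshold.

```python
from collections import Counter


def category_shares(labels):
    """Map each category label to its share of all documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_balanced(labels, tolerance=0.1):
    """Assumed rule: every share lies within `tolerance` of an even split."""
    shares = category_shares(labels)
    if not shares:
        return True
    even = 1 / len(shares)
    return all(abs(share - even) <= tolerance for share in shares.values())


even_genres = ["news", "news", "fiction", "fiction", "speech", "speech"]
skewed_genres = ["news"] * 8 + ["fiction"] * 2

print(is_balanced(even_genres))    # each genre is one third of the corpus
print(is_balanced(skewed_genres))  # news dominates with 80 percent
```

The same check works for any categorical metadata field, such as language or author.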
