Corpus

The main source of data is the corpus. A corpus is a computer-readable collection of linguistic productions: text, speech, or gestures. The singular is "corpus"; the plural is "corpora".

Variants of Corpus

Corpora are structured data. Because different people compile them, they vary a lot, along dimensions such as:

  • Who produced it?
    • Age, gender, education, ethnicity, ...
  • What language(s) is it?
  • When was the text produced?
    • A synchronic corpus contains no time-period information.
    • A diachronic corpus has a time dimension.
  • For which goal or function was the corpus produced?
    • Genre, medium

The answers to these questions determine how corpora differ, so it is important to choose a corpus with them in mind.

The set of possible sentences in a language is effectively infinite, so no corpus can contain them all. This is why it is very important to select a corpus that is representative of the phenomenon you want to model. Your model will only be as good as the data you base it on.
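
As a rough illustration of choosing data that matches the phenomenon you want to model, the sketch below filters a tiny corpus by its metadata. The `Document` class, its fields, the `select` helper, and the example sentences are all hypothetical and made up for this note; real corpora store their metadata in their own formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """One linguistic production plus hypothetical metadata fields."""
    text: str
    language: str
    genre: str
    year: Optional[int] = None  # None when no time information is recorded


def select(documents, language, genre):
    """Keep only the documents that match the target language and genre."""
    return [d for d in documents if d.language == language and d.genre == genre]


# Invented example documents, for illustration only.
corpus = [
    Document("The court finds for the plaintiff.", "en", "legal", 1998),
    Document("lol see you tmrw", "en", "chat", 2015),
    Document("De rechter wijst de eis af.", "nl", "legal", 2003),
]

# A model of English legal language would be built on this subset only.
english_legal = select(corpus, language="en", genre="legal")
print(len(english_legal))  # 1
```

Filtering on metadata like this only works if the corpus records the answers to the questions listed earlier (who, which language, when, which genre).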

Properties of a corpus

  • Crawled/manually curated
    • A crawled corpus is fetched automatically from the internet
  • Balanced/imbalanced
  • Single author/more authors
  • Diachronic/Synchronic
    • Diachronic is organized along the time dimension
    • Synchronic is not organized along the time dimension
  • Written/spoken/mixed/video (the modality)
  • Single language/multi-language/parallel
    • A parallel corpus is one in which the texts in the different languages cover the same content.
  • ...
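
Two of these properties can be checked mechanically if the metadata is available. The sketch below is one possible way to do so, under simplifying assumptions that are not from the notes: a corpus counts as diachronic here when every document has a year and more than one year occurs, and as balanced when each category's share is within a chosen `tolerance` of an equal split. The function names and the threshold are hypothetical.

```python
from collections import Counter


def is_diachronic(years):
    """True if every document has a year and at least two distinct years occur."""
    return all(y is not None for y in years) and len(set(years)) > 1


def is_balanced(labels, tolerance=0.1):
    """True if each category's share is within `tolerance` of an equal split."""
    if not labels:
        raise ValueError("cannot judge the balance of an empty corpus")
    counts = Counter(labels)
    equal_share = 1 / len(counts)  # share each category would have if equal
    total = len(labels)
    return all(abs(n / total - equal_share) <= tolerance for n in counts.values())


# Invented metadata, one entry per document.
years = [1990, 2000, 2010, 2020]
genres = ["news", "news", "fiction", "fiction"]

print(is_diachronic(years))                               # True
print(is_diachronic([None, None]))                        # False
print(is_balanced(genres))                                # True
print(is_balanced(["news", "news", "news", "fiction"]))   # False
```

What counts as balanced depends on the task, so the tolerance is a choice to make per project rather than a fixed rule.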