The main source of data is the corpus (plural: corpora). A corpus is a computer-readable collection of linguistic productions: text, speech, or gestures.
In other words, a corpus is structured data. Corpora are compiled by different people, which is one reason they vary a lot. Key questions to ask about a corpus:
- Who produced it?
  - Age, gender, education, ethnicity, ...
- What language(s) is it?
- When was the text produced?
  - A synchronic corpus has no time dimension: it represents the language at a single point or period.
  - A diachronic corpus has a time dimension: its texts span several periods.
- For which goal or function was the corpus produced?
  - Genre, medium
The answers to these questions differ from corpus to corpus, which is why corpora vary so much.
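A minimal sketch of how the answers to these questions could be recorded next to a corpus. The class name `CorpusMetadata`, its field names, and the example values are all hypothetical, chosen only for illustration; they do not come from any particular corpus format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CorpusMetadata:
    """Hypothetical record: one field per question in the list above."""
    name: str
    producers: str                       # who produced it (age, gender, education, ...)
    languages: List[str]                 # which language(s) it contains
    period: Optional[Tuple[int, int]]    # (first_year, last_year); None if not recorded
    purpose: str                         # goal or function it was produced for
    genre: str
    medium: str

    def summary(self) -> str:
        """One-line description built only from the fields above."""
        if self.period is None:
            when = "undated"
        else:
            when = f"{self.period[0]}-{self.period[1]}"
        langs = ", ".join(self.languages)
        return f"{self.name}: {self.genre} ({self.medium}), {langs}, {when}"


# Invented example values, not a real corpus.
example = CorpusMetadata(
    name="toy-news",
    producers="professional journalists",
    languages=["en"],
    period=(2010, 2015),
    purpose="training a topic classifier",
    genre="news",
    medium="written",
)
print(example.summary())
```

Writing the answers down explicitly makes it easier to compare two corpora and to notice when a corpus does not match the phenomenon you want to model.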
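The set of possible sentences in a language is effectively infinite, so no corpus can contain all of them. This is why it is very important to select a corpus that is representative of the phenomenon you want to model: your model will only be as good as the data you base it on. Corpora can be characterized along several dimensions: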
- Crawled/manually curated
  - A crawled corpus is collected automatically from the internet.
- Balanced/imbalanced
- Single author/more authors
- Diachronic/Synchronic
  - Diachronic: organized along the time dimension
  - Synchronic: not organized along the time dimension
- Written/spoken/mixed/video (the modality)
- Single language/multi-language/parallel
  - A parallel corpus contains the same content in each language, for example translations of the same texts.
- ...
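
Two of these properties can be checked automatically if each document carries some metadata. The sketch below assumes every document is a dict with a `year` and a `genre` key (hypothetical names). It also assumes that "balanced" means roughly equal genre shares within a tolerance, which is only one possible working definition and is not taken from the notes above.

```python
from collections import Counter


def is_diachronic(docs):
    """True if the documents cover at least two distinct years."""
    years = {d["year"] for d in docs if d.get("year") is not None}
    return len(years) > 1


def genre_shares(docs):
    """Fraction of documents per genre."""
    counts = Counter(d["genre"] for d in docs)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


def is_balanced(docs, tolerance=0.1):
    """Assumed definition: every genre share is within `tolerance` of an equal split."""
    shares = genre_shares(docs)
    if not shares:
        return False  # an empty corpus is treated as not balanced
    target = 1 / len(shares)
    return all(abs(share - target) <= tolerance for share in shares.values())


# Invented toy documents for illustration.
docs = [
    {"year": 1900, "genre": "news"},
    {"year": 2000, "genre": "fiction"},
]
print(is_diachronic(docs), is_balanced(docs))
```

In practice the balance criterion depends on the task: a corpus can be balanced by genre but imbalanced by author, period, or language.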