In CL we use a lot of statistics and probability theory. Why?
- A: They give us several distributions to characterise how people use language.
- B: Natural languages are ambiguous.
- C: We need to perform statistical tests to compare models.
- D: They offer a set of efficient algorithms for automating processes.
Which of the following features does not apply to natural languages?
- A: Subject to change
- B: Unambiguous
- C: Conventional
- D: Context-dependent
Which level of linguistic analysis deals with the meaning of morphemes and words?
- A: Lexical semantics
- B: Syntax
- C: Morphology
- D: Compositional semantics
Which of the following is a diachronic corpus?
- A: CHILDES
- B: SubtLex
- C: Corpus of Historical American English
- D: TASA
Which of the following resources is not a lexicon?
- A: Words with proportion of native speakers who know the word meaning
- B: Words with concreteness ratings
- C: Words with age of acquisition estimates
- D: Words with their meaning definition
What is a valid hyponym of dog in WordNet?
- A: Dalmatian
- B: Animal
- C: Canine
- D: Cat
- B: Natural languages are ambiguous.
- B: Unambiguous
- A: Lexical semantics
- C: Corpus of Historical American English
- A: Dalmatian
Which of the following is not an example of text classification?
- A: Essay grading (pass/fail)
- B: Text simplification
- C: Sentiment analysis
- D: Cyberbullying detection
Which of the following is an advantage of rule systems?
- A: Robust to rare events
- B: Cheap to write
- C: Cannot incorporate domain knowledge
- D: Can deal with ambiguity effortlessly
Which of the following statements about discriminative classifiers is wrong?
- A: They can only address binary classification problems
- B: They learn the hidden process which yielded the data sample
- C: They can only learn linear boundaries
- D: They are non-deterministic classifiers
Which of the following is an example of extrinsic evaluation?
- A: Run a t-test between precision scores in automatic grading
- B: Compute the difference in accuracy between two classifiers
- C: Measure the customer satisfaction when interacting with two different bots
- D: Compare translation quality between two machine translation models
What does the likelihood capture in Bayes Rule?
- A: The probability of the class given the input
- B: The probability of the input
- C: The probability of the input given the class
- D: The probability of the class
What does the conditional independence assumption entail in NBC?
- A: An NBC doesn't track feature co-presence
- B: An NBC doesn't consider the probability of the document given the class
- C: An NBC doesn't track sequential information
- D: An NBC doesn't consider all classes when classifying
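For reference, the two definitions behind this pair of questions, in standard textbook form: Bayes' rule, and the conditional independence (naive Bayes) factorisation that drops feature co-presence:

$$
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \qquad\qquad P(x_1, \ldots, x_n \mid c) \approx \prod_{i=1}^{n} P(x_i \mid c)
$$

Here $P(x \mid c)$ is the likelihood, $P(c)$ the prior, and $P(x)$ the evidence.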
Which of the following is not a stop word?
- A: I
- B: Child
- C: Do
- D: Because
In a dataset consisting of 100 tweets, 20 contain instances of cyberbullying. For the sake of argument, we pretend to be dealing with two binary features: whether the tweet contains at least one curse word, and whether the tweet contains non-alphabetic characters. The likelihood of containing at least one curse word given that a tweet is an instance of cyberbullying is 0.8, while the likelihood of containing non-alphabetic characters given that a tweet is not an instance of cyberbullying is 0.7.
What is the prior of the cyberbullying class?
- A: 0.1
- B: 0.8
- C: cannot tell
- D: 0.2
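The prior is just the relative class frequency in the training data:

$$
P(\text{cyberbullying}) = \frac{20}{100} = 0.2
$$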
In a dataset consisting of 100 tweets, 20 contain instances of cyberbullying. For the sake of argument, we pretend to be dealing with two binary features: whether the tweet contains at least one curse word, and whether the tweet contains non-alphabetic characters. The likelihood of containing at least one curse word given that a tweet is an instance of cyberbullying is 0.8, while the likelihood of containing non-alphabetic characters given that a tweet is not an instance of cyberbullying is 0.3.
Consider a test tweet containing at least one curse word and only alphabetic characters. What is the probability of the test tweet being an instance of cyberbullying?
- A: $0.8 \times 0.8 \times 0.3$
- B: $0.8 \times 0.2 \times 0.7$
- C: $0.2 \times 0.8 \times 0.3$
- D: $0.2 \times 0.2 \times 0.7$
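Spelled out, the keyed product is the prior times one likelihood per feature (curse word present, non-alphabetic characters absent):

$$
P(\text{cyber} \mid x) \propto P(\text{cyber}) \times P(\text{curse} \mid \text{cyber}) \times P(\neg\text{non-alph} \mid \text{cyber}) = 0.2 \times 0.8 \times 0.3
$$

where the final factor assumes, as the keyed answer does, $P(\text{non-alph} \mid \text{cyber}) = 1 - 0.3 = 0.7$, hence $P(\neg\text{non-alph} \mid \text{cyber}) = 0.3$.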
In a dataset consisting of 100 tweets, 20 contain instances of cyberbullying. For the sake of argument, we pretend to be dealing with two binary features: whether the tweet contains at least one curse word, and whether the tweet contains non-alphabetic characters. The likelihood of containing at least one curse word given that a tweet is an instance of cyberbullying is 0.8, while the likelihood of containing non-alphabetic characters given that a tweet is not an instance of cyberbullying is 0.3.
Consider a test tweet containing at least one curse word and only alphabetic characters. Would an NBC using these features classify it as an instance of cyberbullying?
- A: Yes
- B: Not enough information given
- C: It'd be a tie
- D: No
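A minimal numeric check of the decision. The question only states two likelihoods, so this sketch assumes the unstated ones are their complements ($P(\text{curse} \mid \neg\text{cyber}) = 0.2$, $P(\text{non-alph} \mid \text{cyber}) = 0.7$), which is the reading under which the keyed answers come out:

```python
# Naive Bayes scores for the test tweet: curse word present, non-alphabetic
# characters absent. The "not stated" likelihoods are taken as complements of
# the stated ones -- an assumption of this sketch, not a theorem.
prior = {"cyber": 0.2, "not_cyber": 0.8}
p_curse = {"cyber": 0.8, "not_cyber": 0.2}      # P(curse word | class)
p_nonalpha = {"cyber": 0.7, "not_cyber": 0.3}   # P(non-alphabetic | class)

scores = {
    c: prior[c] * p_curse[c] * (1 - p_nonalpha[c])  # second feature is absent
    for c in prior
}
print(scores)                       # cyber ~ 0.048, not_cyber ~ 0.112
print(max(scores, key=scores.get))  # not_cyber, so the keyed answer is "No"
```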
- B: Text simplification
- A: Robust to rare events
- B: They learn the hidden process which yielded the data sample
- C: Measure the customer satisfaction when interacting with two different bots
- C: The probability of the input given the class
- A: An NBC doesn't track feature co-presence
- B: Child
- D: 0.2
- C: $0.2 \times 0.8 \times 0.3$
- D: No
How many lemmas are there in the sentence:
"The children were curious about whether there would be a surprise at home or whether there had been enough surprises already."
Punctuation doesn't count.
- A: 21
- B: 19
- C: 18
- D: 16
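A sanity check of the count with a hand-written lemma map (a sketch; it assumes *would* lemmatises to *will* and *were*/*had*/*been* to *be*/*have*, the conventional dictionary lemmas):

```python
# Hand-written lemma map for the example sentence (illustrative; not the
# output of any particular lemmatiser).
lemma = {
    "the": "the", "children": "child", "were": "be", "curious": "curious",
    "about": "about", "whether": "whether", "there": "there", "would": "will",
    "be": "be", "a": "a", "surprise": "surprise", "at": "at", "home": "home",
    "or": "or", "had": "have", "been": "be", "enough": "enough",
    "surprises": "surprise", "already": "already",
}
tokens = ("the children were curious about whether there would be a surprise "
          "at home or whether there had been enough surprises already").split()
print(len(tokens))                      # 21 tokens (option A is the token count)
print(len({lemma[t] for t in tokens}))  # 16 distinct lemmas
```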
How many affixes are there in the word untrustworthy?
- A: 3
- B: 2
- C: 0
- D: 1
Which of the following words is inflected?
- A: Touchstone
- B: Colourful
- C: Children
- D: Professor
Which normalisation technique would you use before doing language identification?
- A: Lemmatisation
- B: Case folding
- C: None of them
- D: Tokenisation
Consider the regular expression /^[a-zA-Z]{2,6}\b/. What does it match?
- A: alphabetic strings between two and six characters at the beginning of a line followed by a word boundary
- B: any string but alphabetic strings between two and six characters
- C: Lines containing alphabetic strings between two and six characters
- D: any string between three and six characters
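A quick way to probe the pattern in Python (the test strings are mine):

```python
import re

pattern = re.compile(r"^[a-zA-Z]{2,6}\b")

# "seventy" fails: seven letters, and the {2,6} window cannot end at a word
# boundary inside the word. "hello123" fails: digits are word characters,
# so there is no boundary after the alphabetic prefix.
for text in ["hi", "hello world", "a", "seventy", "hello123"]:
    m = pattern.search(text)
    print(repr(text), "->", m.group(0) if m else "no match")
```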
What is the minimum edit distance between glowing and growling?
- A: 4
- B: 1
- C: 2
- D: 3
- D: 16
- B: 2
- C: Children
- C: None of them
- A: alphabetic strings between two and six characters at the beginning of a line followed by a word boundary
- C: 2
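A standard dynamic-programming sketch to verify the distance of 2; it assumes unit cost for substitution as well as insertion and deletion (under the Levenshtein variant where a substitution costs 2, the answer would be 3):

```python
def min_edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(min_edit_distance("glowing", "growling"))  # 2: substitute l->r, insert l
```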
How do we use language modelling in machine translation?
- A: To make sure the translation has the same meaning as the source
- B: To predict the next sentence in the translation
- C: To pick the most fluent candidate translation
- D: To pick the best word among possible candidate translations for a word in the source
Why do we care about the chain rule of probability?
- A: It tells us how to compute the probability of a sentence
- B: It deals with the infinite nature of language
- C: It tells us how to use limited context to approximate larger contexts
- D: It tells us how to deal with underflowing problems
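The rule itself, applied to a word sequence:

$$
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
$$

It decomposes the probability of a sentence exactly; it is the Markov assumption, not the chain rule, that later truncates the histories.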
How do we get ML estimates for bigram transition probabilities?
- A: Get co-occurrence counts and normalise by row marginals
- B: Get co-occurrence counts and take the log
- C: Get co-occurrence counts and normalise by column marginals
- D: Get co-occurrence counts and normalise by the matrix total
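Row-marginal normalisation written out: each bigram count is divided by the total count of its history word.

$$
P_{\mathrm{ML}}(w_j \mid w_i) = \frac{C(w_i, w_j)}{\sum_{w} C(w_i, w)} = \frac{C(w_i, w_j)}{C(w_i)}
$$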
Which of the following is NOT a component of Markov Chains?
- A: Transition counts
- B: Initial probability distribution
- C: Accepting state
- D: History states
If we fit a 4-gram language model, how many BoS symbols do we need to prepend to the sentences?
- A: 1
- B: 4
- C: 2
- D: 3
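Concretely, the first word of a sentence must already see a full three-word history,

$$
P(w_1 \mid \langle s \rangle, \langle s \rangle, \langle s \rangle)
$$

so an $n$-gram model needs $n-1$ BoS symbols.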
In linear interpolation, lambdas have to meet a strict requirement. Which one?
- A: Their sum must equal 1
- B: They must be lower than 1
- C: The highest value matches the largest n-gram available
- D: Their algebraic sum must be 0
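The trigram case written out; the constraint guarantees the interpolated estimates still form a probability distribution:

$$
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
$$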
- C: To pick the most fluent candidate translation
- A: It tells us how to compute the probability of a sentence
- A: Get co-occurrence counts and normalise by row marginals
- C: Accepting state
- D: 3
- A: Their sum must equal 1
Which of the following lexical categories is an example of open class words?
- A: Auxiliaries
- B: Possessive pronouns
- C: Adverbs
- D: Conjunctions
In terms of PoS tag ambiguity, types tend to be ... tokens.
- A: More ambiguous than
- B: As unambiguous as
- C: Less ambiguous than
- D: As ambiguous as
What does the emission probability matrix encode in a bigram HMM?
- A: The probability of a word given a word
- B: The probability of a tag given a word
- C: The probability of a tag given a tag
- D: The probability of a word given a tag
Which component of the HMM encodes the Markov assumption?
- A: The sequence of observations
- B: The observation likelihood matrix
- C: The initial distribution
- D: The state transition probability matrix
The Viterbi algorithm is an example of which of the following?
- A: A classifier
- B: A dynamic programming algorithm
- C: A rule-based system
- D: A vector space
What is the complexity of the Viterbi algorithm? $Q$ is the number of states, $t$ is the length of the sentence, $n$ is the order of the model, and $V$ is the vocabulary size.
- A: $O(Q^t)$
- B: $O(V^t \cdot n)$
- C: $O(Q^n \cdot t)$
- D: $O(Q \cdot V)$
Which of the following quantities does not contribute to the computation of the new posterior probability of observing each tag given the sequence of observed events up to that point?
- A: Transition probabilities from state $q_i$ to state $q_j$
- B: Emission probability for observation $o_j$ given state $q_j$
- C: Posterior probability up to the previous observed event
- D: The likelihood of state $q_j$ given observed event $o_j$
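For reference, the Viterbi recursion these questions revolve around (Jurafsky & Martin notation: $a_{ij}$ transition probabilities, $b_j(o_t)$ emission probabilities):

$$
v_t(j) = \max_{i=1}^{|Q|} v_{t-1}(i)\, a_{ij}\, b_j(o_t)
$$

Options A, B, and C each name one factor in this product; the reversed quantity $P(q_j \mid o_j)$ does not appear.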
In the Viterbi algorithm, we apply the argmax function. What is the input?
- A: The last column in the trellis
- B: The product of the last column in the trellis, the transition probability matrix and the emission probability matrix
- C: The product of the last column in the trellis and the transition probability matrix
- D: The product of the transition and emission probability matrices
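A compact sketch of the algorithm over dictionaries, in log space to avoid underflow (the toy tagset, words, and probabilities are made up for illustration). Note the sketch omits an explicit end state, so its final argmax is over the last trellis column alone; with an end state you would also multiply in the transition into it, as the keyed answer to the argmax question notes:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state sequence for obs (log space avoids underflow)."""
    # Initial column: initial distribution times the first emission.
    trellis = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
                for s in states}]
    for o in obs[1:]:
        column = {}
        for s in states:
            # Previous posterior * transition, maximised over predecessors;
            # then multiply in the emission probability for this observation.
            best, back = max((trellis[-1][r][0] + math.log(trans_p[r][s]), r)
                             for r in states)
            column[s] = (best + math.log(emit_p[s][o]), back)
        trellis.append(column)
    # argmax over the last column, then follow the backpointers.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}
print(viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p))  # ['N', 'V']
```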
- C: Adverbs
- C: Less ambiguous than
- D: The probability of a word given a tag
- D: The state transition probability matrix
- B: A dynamic programming algorithm
- C: $O(Q^n \cdot t)$
- D: The likelihood of state $q_j$ given observed event $o_j$
- C: The product of the last column in the trellis and the transition probability matrix
Which of the following is not a component of a grammar?
- A: a finite set of non-terminal states
- B: a distinguished start state
- C: a finite set of terminal states
- D: an infinite set of production rules
Which of the following rules is in CNF?
- A: S → he ran
- B: S → NP VP NP
- C: S → you VP
- D: S → NP VP
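Chomsky Normal Form allows exactly two rule shapes,

$$
A \rightarrow B\ C \qquad \text{or} \qquad A \rightarrow a
$$

with $A$, $B$, $C$ non-terminals and $a$ a terminal, which is why S → NP VP qualifies while the three-symbol and mixed right-hand sides do not.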
What kind of corpus do we need to estimate a CFG?
- A: Plain corpus
- B: Parallel corpus
- C: Treebank
- D: Corpus with PoS annotations
What is the defining feature of an S constituent?
- A: Its main verb has all its arguments
- B: Can only occur as the LHS of rules
- C: Cannot be coordinated with other S constituents
- D: Always contains at least one NP
What is the relation between CFGs and dynamic programming?
- A: We can use dynamic programming because context changes our sub-parses
- B: We cannot use dynamic programming because context will change our sub-parses
- C: We can use dynamic programming because context cannot change sub-parses
- D: They're unrelated
In order to initialise the table for the CKY, what do we need to know?
- A: The number of possible rules in the grammar
- B: The number of non-terminals in the grammar
- C: The number of terminals in the grammar
- D: The symbols in the target sentence
How do we use the CKY to know if a string is grammatical?
- A: We check whether the final cell contains the S state
- B: We check whether the final cell is not empty
- C: We check whether the S state happens anywhere in the table
- D: We check whether all terminals appear as the RHS in at least one rule
Which of the following is a pre-terminal symbol?
- A: N
- B: NP
- C: PP
- D: do
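A minimal CKY recogniser sketch over a toy CNF grammar (the grammar and sentence are invented for illustration). It makes the two keyed answers concrete: the table is initialised from the symbols of the target sentence, and the string is grammatical iff the final cell contains S:

```python
def cky_recognise(words, lexical, binary):
    """CKY recognition: table[i][j] holds the non-terminals spanning words[i:j]."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):          # initialisation needs the words
        table[i][i + 1] = {lhs for lhs, rhs in lexical if rhs == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # every split point of the span
                for lhs, (b, c) in binary:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(lhs)
    return "S" in table[0][n]              # grammatical iff final cell has S

lexical = [("NP", "she"), ("V", "eats"), ("NP", "fish")]
binary = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]
print(cky_recognise(["she", "eats", "fish"], lexical, binary))  # True
```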
- D: an infinite set of production rules
- D: S → NP VP
- C: Treebank
- A: Its main verb has all its arguments
- C: We can use dynamic programming because context cannot change sub-parses
- D: The symbols in the target sentence
- A: We check whether the final cell contains the S state
- A: N