Language model guidelines
Which parts of a sentence should become a Concept, a Relation, PathRelevant or NonRelevant? Which 'lexical representations' (lexreps) serve as attribute markers, and what is their scope? The answers to these questions vary per language, but there are some basic guidelines.
Not all word categories occur in all languages, and the use of a given part of speech can differ by language. Typical examples are:
- Concept: noun, adjective, numeral - e.g. tree, nice, three
- Relation: preposition, conjunction, interrogative pronoun, relative pronoun, verb - e.g. in, because, where, which, is
- PathRelevant: personal pronoun, possessive pronoun, indefinite pronoun, time adverb (vague time expression) - e.g. we, our, something, soon
- NonRelevant: article, adverbs that refer to the author's viewpoint or to the structure of the text - e.g. the, hence, however
For some categories, the role depends on how the word is used: an adverb that modifies a Concept will be part of that Concept, whereas an adverb that modifies a Relation will become part of that Relation. Adverbs expressing time or location will be PathRelevant entities, and words like 'also' are NonRelevant. Demonstrative pronouns are NonRelevant when they act as an article ('this book') and PathRelevant when they are used in a non-attributive context ('this is a new book').
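The assignments above can be sketched as a small lookup. This is only an illustration of the guideline, not how the iKnow engine is implemented: the part-of-speech names, the `role_for` function and its parameters are hypothetical, and the precedence between the adverb rules (time/location first, then the modified entity) is an assumption.

```python
from enum import Enum


class Role(Enum):
    CONCEPT = "Concept"
    RELATION = "Relation"
    PATH_RELEVANT = "PathRelevant"
    NON_RELEVANT = "NonRelevant"


# Default role per part of speech, following the typical examples above.
# The tag names are hypothetical labels, not iKnow identifiers.
DEFAULT_ROLE = {
    "noun": Role.CONCEPT,
    "adjective": Role.CONCEPT,
    "numeral": Role.CONCEPT,
    "preposition": Role.RELATION,
    "conjunction": Role.RELATION,
    "interrogative_pronoun": Role.RELATION,
    "relative_pronoun": Role.RELATION,
    "verb": Role.RELATION,
    "personal_pronoun": Role.PATH_RELEVANT,
    "possessive_pronoun": Role.PATH_RELEVANT,
    "indefinite_pronoun": Role.PATH_RELEVANT,
    "article": Role.NON_RELEVANT,
}


def role_for(pos, modifies=None, kind=None, attributive=False):
    """Return the role of a word, applying the usage-dependent rules."""
    if pos == "adverb":
        # Assumed precedence: time/location adverbs are checked first.
        if kind in ("time", "location"):
            return Role.PATH_RELEVANT
        # An adverb joins the Concept or Relation it modifies.
        if modifies in (Role.CONCEPT, Role.RELATION):
            return modifies
        # Remaining adverbs ('also', 'hence', 'however') are NonRelevant.
        return Role.NON_RELEVANT
    if pos == "demonstrative_pronoun":
        # 'this book' (article-like) vs. 'this is a new book'.
        return Role.NON_RELEVANT if attributive else Role.PATH_RELEVANT
    return DEFAULT_ROLE[pos]
```

For example, `role_for("demonstrative_pronoun", attributive=True)` yields `Role.NON_RELEVANT`, while the same pronoun used independently yields `Role.PATH_RELEVANT`.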
Currently implemented attributes are Negation, Sentiment, Time (covering Time, Frequency and Duration), Measurement, Certainty and 3 Generic attributes. Not all attributes are supported in all languages.
Overview:
X: built-in
UD: user dictionary support
Language | NEGATION | SENTIMENT | TIME | MEASUREMENT | CERTAINTY | GENERIC |
---|---|---|---|---|---|---|
English | X + UD | X + UD | X (Time, Frequency, Duration) + UD | X + UD | X + UD | X + UD |
Czech | X + UD | UD | X + UD | - | - | - |
Dutch | X + UD | UD | X (only vague time) + UD | - | - | - |
French | X + UD | UD | X (only vague time) + UD | - | - | - |
German | X + UD | UD | X (only vague time) + UD | - | - | - |
Japanese | X | - | X (Time (includes Duration) and Frequency) | X | - | - |
Portuguese | X + UD | UD | X (only vague time) + UD | - | - | - |
Russian | X + UD | UD | X + UD | - | - | - |
Spanish | X + UD | UD | X (only vague time) + UD | - | - | - |
Swedish | X + UD | UD | X (only vague time) + UD | - | - | - |
Ukrainian | X + UD | UD | X + UD | - | - | - |
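
For readers who want to query the overview programmatically, the table can be transcribed into a plain data structure. The sketch below is a hand-made copy of the table, not something exposed by iKnow itself; the names `SUPPORT`, `support` and `languages_with_built_in` are hypothetical, and the qualifiers on the TIME column (such as "only vague time") are deliberately left out.

```python
# Support matrix transcribed from the overview table.
# "X" = built-in markers, "UD" = user dictionary support.
BOTH = frozenset({"X", "UD"})
UD_ONLY = frozenset({"UD"})
X_ONLY = frozenset({"X"})
UNSUPPORTED = frozenset()

ATTRIBUTES = ("NEGATION", "SENTIMENT", "TIME",
              "MEASUREMENT", "CERTAINTY", "GENERIC")

# Row shared by most languages in the table.
COMMON_ROW = (BOTH, UD_ONLY, BOTH, UNSUPPORTED, UNSUPPORTED, UNSUPPORTED)

SUPPORT = {
    "English": (BOTH, BOTH, BOTH, BOTH, BOTH, BOTH),
    "Czech": COMMON_ROW,
    "Dutch": COMMON_ROW,
    "French": COMMON_ROW,
    "German": COMMON_ROW,
    "Japanese": (X_ONLY, UNSUPPORTED, X_ONLY, X_ONLY, UNSUPPORTED, UNSUPPORTED),
    "Portuguese": COMMON_ROW,
    "Russian": COMMON_ROW,
    "Spanish": COMMON_ROW,
    "Swedish": COMMON_ROW,
    "Ukrainian": COMMON_ROW,
}


def support(language, attribute):
    """Return the set of support kinds ("X", "UD") for one table cell."""
    return SUPPORT[language][ATTRIBUTES.index(attribute)]


def languages_with_built_in(attribute):
    """List the languages that ship built-in markers for an attribute."""
    return sorted(lang for lang in SUPPORT if "X" in support(lang, attribute))
```

With this transcription, `languages_with_built_in("MEASUREMENT")` returns `["English", "Japanese"]`.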
The language models contain built-in markers for Negation and Time. The English language model contains more built-in markers than the other languages, also covering Frequency, Duration, Measurement and Certainty. Sentiment detection requires a user dictionary for the markers; only English has a limited set of built-in Sentiment markers. English also supports 3 Generic attributes, which can be defined through the user dictionary.
- Negation: no, not, never,...
- Time/Frequency/Duration: ago, tomorrow, next week, every year, 5-hour,...
- Measurement: kg, $, £, mmHg,...
- Positive Sentiment: awesome, enjoy, lovely,...
- Negative Sentiment: awful, disgust, worried,...
- Low certainty: chance of, doubtfully, may be,...
- High certainty: certainly, beyond doubt,...
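
As a rough illustration of what a marker lexicon looks like, the examples above can be put in a dictionary and matched against a sentence. This is a simplified sketch and not the iKnow matching algorithm: `MARKERS` holds only a few of the listed examples, `find_markers` is a hypothetical helper, tokenization is naive whitespace splitting, and the longest-match strategy for multi-word markers is an assumption.

```python
# A few of the example markers listed above, keyed in lower case.
# Time, Frequency and Duration are all grouped under "Time" here.
MARKERS = {
    "no": "Negation",
    "not": "Negation",
    "never": "Negation",
    "ago": "Time",
    "tomorrow": "Time",
    "next week": "Time",
    "every year": "Time",
    "kg": "Measurement",
    "awesome": "PositiveSentiment",
    "lovely": "PositiveSentiment",
    "awful": "NegativeSentiment",
    "worried": "NegativeSentiment",
    "chance of": "LowCertainty",
    "may be": "LowCertainty",
    "certainly": "HighCertainty",
    "beyond doubt": "HighCertainty",
}


def find_markers(sentence, max_len=2):
    """Return (marker, attribute) pairs found in the sentence, in order."""
    tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so 'next week' wins over 'next'.
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in MARKERS:
                found.append((candidate, MARKERS[candidate]))
                i += n
                break
        else:
            i += 1
    return found
```

For the sentence "She will certainly not arrive next week." this returns `[("certainly", "HighCertainty"), ("not", "Negation"), ("next week", "Time")]`.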
A general guideline for attributes is that iKnow - unlike certain other NLP tools - always includes the attribute marker in the 'scope' or 'span'. Whether the span includes other words that precede or follow the marker depends on the marker.
- If the marker is or modifies a verb, its span will ideally include the subject and object connected to the marker, e.g. [NEGATION: I don't remember his name].
- If the marker is or modifies a noun, its span will mostly include the adjectives and prepositional phrases that belong to the noun, e.g. They provided [PositiveSentiment: excellent detailed instructions on the routes].
- If the marker is an independent pronoun, the guideline has changed over time. Previously, the marker did not get a span. According to the current guideline, the span expands from the marker to the related verb and object. This revised approach has yet to be implemented in most language models. In the English language model the pronoun gets a span, e.g. [NEGATION: Nothing has changed recently].
Note that these are just general guidelines. Exceptions occur in all languages, depending on the attribute type and the actual marker.
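
The one rule that holds for every span, namely that the marker itself is always inside it, can be expressed as a simple invariant. The sketch below is a minimal model under stated assumptions: `AttributeSpan` and its fields are hypothetical names, spans are modeled as token index ranges, and none of this reflects the engine's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeSpan:
    attribute: str  # e.g. "NEGATION"
    marker: int     # token index of the attribute marker
    start: int      # first token index in the span (inclusive)
    stop: int       # end of the span (exclusive)

    def __post_init__(self):
        # The guideline: the span always includes the marker.
        if not (self.start <= self.marker < self.stop):
            raise ValueError("span must include the attribute marker")

    def text(self, tokens):
        """Return the words covered by the span."""
        return " ".join(tokens[self.start:self.stop])


# [NEGATION: I don't remember his name]: the marker "don't" modifies
# a verb, so the span covers the subject and the object as well.
tokens = "I don't remember his name".split()
span = AttributeSpan("NEGATION", marker=1, start=0, stop=len(tokens))
print(span.text(tokens))  # prints "I don't remember his name"
```

Constructing a span that leaves out its own marker, such as `AttributeSpan("NEGATION", marker=4, start=0, stop=3)`, raises a `ValueError`.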