Language model guidelines

ISC-SDE edited this page Feb 4, 2020 · 27 revisions

Which parts of a sentence should become Concept? Relation? PathRelevant? NonRelevant? Which 'lexical representations' (lexreps) serve as attribute markers and what is their scope? The answers to these questions vary per language, but there are some basic guidelines.

Entities

Not all categories of words occur in all languages, and the use of a given part of speech can differ by language. Typical assignments are:

  • Concept: noun, adjective, numeral - e.g. tree, nice, three
  • Relation: preposition, conjunction, interrogative pronoun, relative pronoun, verb - e.g. in, because, where, which, is
  • PathRelevant: personal pronoun, possessive pronoun, indefinite pronoun, time adverb (vague time expression) - e.g. we, our, something, soon
  • NonRelevant: article, adverbs that refer to the author's viewpoint or to the structure of the text - e.g. the, hence, however

For some categories, the role depends on how the word is used: an adverb that modifies a Concept will be part of that Concept, whereas an adverb that modifies a Relation will become part of that Relation. Adverbs expressing time or location will be PathRelevant entities, and words like 'also' are NonRelevant. Demonstrative pronouns are NonRelevant when they act as an article ('this book') and PathRelevant when they are used in a non-attributive context ('this is a new book').
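The typical assignments and the two context-dependent cases above can be sketched as a small lookup. This is an illustration only, not how iKnow's language models are implemented: the part-of-speech labels, the `typical_role` function and its parameters are hypothetical names introduced for this example.

```python
from enum import Enum


class EntityType(Enum):
    CONCEPT = "Concept"
    RELATION = "Relation"
    PATH_RELEVANT = "PathRelevant"
    NON_RELEVANT = "NonRelevant"


# Typical roles taken from the list above. The part-of-speech labels are
# hypothetical names chosen for this sketch, not iKnow identifiers.
TYPICAL_ROLE = {
    "noun": EntityType.CONCEPT,
    "adjective": EntityType.CONCEPT,
    "numeral": EntityType.CONCEPT,
    "preposition": EntityType.RELATION,
    "conjunction": EntityType.RELATION,
    "interrogative_pronoun": EntityType.RELATION,
    "relative_pronoun": EntityType.RELATION,
    "verb": EntityType.RELATION,
    "personal_pronoun": EntityType.PATH_RELEVANT,
    "possessive_pronoun": EntityType.PATH_RELEVANT,
    "indefinite_pronoun": EntityType.PATH_RELEVANT,
    "time_adverb": EntityType.PATH_RELEVANT,
    "location_adverb": EntityType.PATH_RELEVANT,
    "article": EntityType.NON_RELEVANT,
    "viewpoint_adverb": EntityType.NON_RELEVANT,
    "structure_adverb": EntityType.NON_RELEVANT,
}


def typical_role(pos, modifies=None, attributive=False):
    """Return the typical entity type for a part-of-speech label.

    modifies: for a plain "adverb", the EntityType of the word it modifies.
    attributive: for a demonstrative pronoun, True when it acts as an article.
    """
    if pos == "adverb":
        # An adverb joins the Concept or Relation it modifies.
        if modifies not in (EntityType.CONCEPT, EntityType.RELATION):
            raise ValueError("adverb needs modifies=CONCEPT or RELATION")
        return modifies
    if pos == "demonstrative_pronoun":
        # 'this book' (attributive) vs. 'this is a new book'.
        if attributive:
            return EntityType.NON_RELEVANT
        return EntityType.PATH_RELEVANT
    return TYPICAL_ROLE[pos]
```

For example, `typical_role("adverb", modifies=EntityType.RELATION)` yields `EntityType.RELATION`, while `typical_role("demonstrative_pronoun", attributive=True)` yields `EntityType.NON_RELEVANT`. A real language model would need language-specific exceptions on top of such defaults.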

Attributes

Currently implemented attributes are Negation, Sentiment, Time (covering Time, Frequency and Duration) and Measurement. Certainty will be added in the near future. Not all attributes are supported in all languages.

Overview:

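Language   | NEGATION     | SENTIMENT | TIME                                                     | MEASUREMENT
-----------|--------------|-----------|----------------------------------------------------------|------------
English    | X            | X         | 3 attributes: Time, Frequency, Duration                  | X
Czech      | X            | X         | X                                                        | -
Dutch      | X            | X         | only vague time expressions                              | -
French     | X            | X         | -                                                        | -
German     | X            | X         | -                                                        | -
Japanese   | only markers | -         | 2 attributes: Time (includes Duration) and Frequency     | -
Portuguese | X            | X         | X                                                        | -
Russian    | X            | X         | only vague time expressions                              | -
Spanish    | X            | X         | -                                                        | -
Swedish    | X            | X         | only vague time expressions                              | -
Ukrainian  | X            | X         | X                                                        | -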

Markers

The language models contain built-in markers for Negation, Time/Frequency/Duration and Measurement. Sentiment detection requires a user dictionary for the markers.

  • Negation: no, not, never,...
  • Time/Frequency/Duration: ago, tomorrow, next week, every year, 5-hour,...
  • Measurement: kg, $, £, mmHg,...
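To make the idea of a marker concrete, here is a minimal sketch that looks up the example markers listed above in a sentence. It is not iKnow's matching logic, and the marker lists in the language models are far larger: `SAMPLE_MARKERS` and `find_markers` are hypothetical names, and the word-boundary handling is an assumption made for this example.

```python
import re

# Only the example markers from the list above; not the real marker sets.
SAMPLE_MARKERS = {
    "Negation": ["no", "not", "never"],
    "Time/Frequency/Duration": ["ago", "tomorrow", "next week",
                                "every year", "5-hour"],
    "Measurement": ["kg", "$", "£", "mmHg"],
}


def _marker_pattern(marker):
    # Require word boundaries only on sides where the marker starts or
    # ends with a letter or digit, so that '$' still matches in '$5'.
    left = r"(?<!\w)" if marker[0].isalnum() else ""
    right = r"(?!\w)" if marker[-1].isalnum() else ""
    return re.compile(left + re.escape(marker) + right, re.IGNORECASE)


def find_markers(sentence):
    """Return (start offset, attribute, matched text) tuples, sorted by offset."""
    hits = []
    for attribute, markers in SAMPLE_MARKERS.items():
        for marker in markers:
            for match in _marker_pattern(marker).finditer(sentence):
                hits.append((match.start(), attribute, match.group(0)))
    return sorted(hits)
```

Calling `find_markers("She did not pay $5 for 2 kg tomorrow")` reports one Negation marker, two Measurement markers and one Time/Frequency/Duration marker, in sentence order. Sentiment is left out on purpose, since its markers come from a user dictionary rather than from the language model.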

Scope

A general guideline for attributes is that iKnow - unlike certain other NLP tools - always includes the attribute marker in the 'scope' or 'span'. Whether the span includes other words that precede or follow the marker depends on the marker.

  • If the marker is or modifies a verb, its span will ideally include the subject and object immediately preceding and/or following the marker.
  • If the marker is or modifies a noun, its span will mostly include the adjectives and prepositional phrases that belong to the noun.
  • If the marker is an independent pronoun, its span does not extend beyond the marker itself. However, we are planning to change this in the future so that the span expands to the related verb and object.

Note that these are just general guidelines. Exceptions occur in all languages, depending on the attribute type and the actual marker.
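The three span guidelines can be sketched over a sentence that has already been split into chunks. This is a simplified illustration under stated assumptions, not iKnow's span algorithm: the chunk kinds (`noun_phrase`, `verb`, `preposition`, `pronoun`) and the `attribute_span` function are hypothetical, noun phrases are assumed to already contain their adjectives, and every preposition plus noun phrase directly following the noun is assumed to belong to it.

```python
def attribute_span(chunks, marker):
    """Return (first, last) chunk indices covered by the attribute.

    chunks: list of (text, kind) tuples, kind being one of the
            hypothetical labels 'noun_phrase', 'verb', 'preposition',
            'pronoun'.
    marker: index of the chunk that contains the attribute marker.
    """
    kind = chunks[marker][1]
    first = last = marker  # the marker itself is always inside the span

    if kind == "verb":
        # Include the subject and object directly next to the verb.
        neighbours = ("noun_phrase", "pronoun")
        if marker > 0 and chunks[marker - 1][1] in neighbours:
            first = marker - 1
        if marker + 1 < len(chunks) and chunks[marker + 1][1] in neighbours:
            last = marker + 1
    elif kind == "noun_phrase":
        # Include prepositional phrases that directly follow the noun.
        while (last + 2 < len(chunks)
               and chunks[last + 1][1] == "preposition"
               and chunks[last + 2][1] == "noun_phrase"):
            last += 2
    # kind == "pronoun": independent pronoun, span stays on the marker.
    return first, last


verb_case = [("the patient", "noun_phrase"),
             ("did not report", "verb"),
             ("chest pain", "noun_phrase")]
noun_case = [("we", "pronoun"),
             ("found", "verb"),
             ("no clear signs", "noun_phrase"),
             ("of", "preposition"),
             ("infection", "noun_phrase")]
pronoun_case = [("nobody", "pronoun"), ("called", "verb")]
```

With these inputs, `attribute_span(verb_case, 1)` covers the whole sentence, `attribute_span(noun_case, 2)` covers 'no clear signs of infection', and `attribute_span(pronoun_case, 0)` stays on 'nobody'. Actual spans differ per language, attribute type and marker, as noted above.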