odanoburu edited this page Nov 29, 2019 · 10 revisions

This page documents the possible mill export formats. We describe their release versions, but similar development versions are also available.

TEXT -------hs-mill-----> JSON
  ^ \                       |
  |  \                      |
  |   ---hs-mill-> WNDB     |
  |                         |
py-mill                   py-mill
   \                       /
    \                     /
     \------- RDF <------/

Introduction

Here we briefly describe general aspects of all export formats (except the WNDB output, which should be exactly as Princeton WordNet specifies it).

IDs

In many WordNets the primary ID for synsets is the synset offset (see PWN documentation), which is often called the synset ID. This ID is very unstable, since almost any change to the source files will change several synset offsets in the WNDB output. A stable ID that is very often used for wordsenses (and can thus serve as a non-unique ID for synsets) is the sense key (see PWN documentation).

In mill’s export options (except the WNDB format) we use a version of sense keys: together, the wordnet name (when using multiple wordnets), the lexicographer file name, the lexical form and the lexical id form a unique ID for each wordsense, and a non-unique ID for a synset. This ID scheme is in one-to-one correspondence with sense keys, since we have created a separate lexicographer file for adjective satellites. It is usually clear from context whether an ID refers to a wordsense or a synset; if not, we will explicitly indicate which one we mean.

The mill ‘sense keys’ are even more stable than PWN sense keys. Both identification schemes depend on the user not changing or reusing lexical ids, but PWN sense keys additionally depend on the lexicographers not changing the head word of an adjective cluster, since the head word is included in the sense keys of satellite adjectives.
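As a rough illustration of the ID scheme described above, the following Python sketch models an ID as a tuple of lexicographer file name, lexical form and lexical id (the wordnet name is left out, assuming a single wordnet). The `SenseId` type, the synset labels and the sample entries are hypothetical and only serve to show that each ID picks out exactly one wordsense, while several IDs can point to the same synset.

```python
from typing import NamedTuple


class SenseId(NamedTuple):
    """Hypothetical model of a mill ID (single-wordnet case)."""
    lexfile: str        # lexicographer file name, e.g. "noun.animal"
    lexical_form: str   # e.g. "dog"
    lexical_id: int     # disambiguates equal forms within one file


# Made-up synsets, each listed by the IDs of its wordsenses.
synsets = {
    "synset-A": [
        SenseId("noun.animal", "dog", 0),
        SenseId("noun.animal", "domestic_dog", 0),
    ],
    "synset-B": [SenseId("noun.animal", "canine", 0)],
}

# Each ID identifies exactly one wordsense, hence exactly one synset...
sense_to_synset = {
    sense_id: label
    for label, members in synsets.items()
    for sense_id in members
}

# ...but a synset can be reached through any of its wordsense IDs,
# which is why the ID is non-unique for synsets.
dog = SenseId("noun.animal", "dog", 0)
domestic_dog = SenseId("noun.animal", "domestic_dog", 0)
same_synset = sense_to_synset[dog] == sense_to_synset[domestic_dog]
print(same_synset)
```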

Relations

Synset and wordsense relations and their properties are specified by the configuration file relations.tsv.

Frames

Synset and wordsense frames are specified by the configuration file frames.tsv.

Lexicographer file names

Lexicographer file names are specified by the configuration file lexnames.tsv. Export formats usually exclude the PoS part from the name and include it separately.

JSON

mill outputs JSON in JSON-lines format, one synset per line.

How-to

mill export --help

Specification

A synset is an object with the following keys:

id
an array of four elements: POS (string), lexicographer file name (string), lexical form (string), and lexical id (integer). This same ID array is used for synsets and wordsenses. (array of four elements)
position
a two element array where the first element is the beginning position of the synset in the source file, and the second element is the end position. (array of two integers)
definition
the synset’s definition (string)
examples
the synset’s examples (array of strings)
wordsenses
the synset’s wordsenses (array of objects; see below for specification)
relations
the synset’s relations (array of objects; see below for specification)
frames
the synset’s frames (array of integers)
comments
the synset’s comment lines (stripped of comment character)

A wordsense is an object with the following keys:

lexicalForm
the wordsense lexical form (string)
lexicalId
the wordsense lexical id (string)
frames
the wordsense’s frames (array of integers)
syntacticMarker
the wordsense’s syntactic marker (adjectival position)
pointers
the wordsense’s relations (array of objects; see below for specification)
senseKey
the wordsense’s sense key (string)

Both synset relations and wordsense relations/pointers are encoded as objects with the following keys:

name
the name of the relation (string)
id
the array of four elements specified above
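The specification above can be consumed with a few lines of Python. This is a minimal sketch, not part of mill: the helper names `load_synsets` and `index_by_sense_id`, the sample synsets, the POS value `"n"` and the relation name `hypernym` are all assumptions made for illustration (actual relation names come from relations.tsv), and the sample objects are trimmed to the keys the code needs. It indexes every wordsense by its four-element ID and then resolves each synset relation to its target synset.

```python
import json


def load_synsets(lines):
    """Parse JSON-lines input: one synset object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def index_by_sense_id(synsets):
    """Map each wordsense ID (as a tuple) to the synset that contains it."""
    index = {}
    for synset in synsets:
        pos, lexname, _, _ = synset["id"]
        for sense in synset["wordsenses"]:
            # int() tolerates the lexical id arriving as a string or a number.
            key = (pos, lexname, sense["lexicalForm"], int(sense["lexicalId"]))
            index[key] = synset
    return index


# Made-up sample data in the shape described above (trimmed).
sample_lines = [
    json.dumps({
        "id": ["n", "animal", "dog", 0],
        "definition": "made-up gloss for dog",
        "wordsenses": [
            {"lexicalForm": "dog", "lexicalId": 0},
            {"lexicalForm": "domestic_dog", "lexicalId": 0},
        ],
        "relations": [{"name": "hypernym", "id": ["n", "animal", "canine", 0]}],
    }),
    json.dumps({
        "id": ["n", "animal", "canine", 0],
        "definition": "made-up gloss for canine",
        "wordsenses": [{"lexicalForm": "canine", "lexicalId": 0}],
        "relations": [],
    }),
]

synsets = load_synsets(sample_lines)
index = index_by_sense_id(synsets)

for synset in synsets:
    for relation in synset["relations"]:
        target = index[tuple(relation["id"])]
        print(synset["id"][2], relation["name"], target["id"][2])
```

Because the same ID array is used for synsets and wordsenses, indexing by wordsense ID lets a relation target be found regardless of which member wordsense the ID names.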

jq scripts

A collection of useful jq scripts:

  • create table of |lexical_form|number_senses|senses... in TSV format:
    jq -sr 'map(.wordsenses|map([.lexicalForm, .senseKey]))|add|group_by(.[0])|map([(.[0]|.[0]), (. | length)] + map(.[1]))|.[]|@tsv'  mill-wn.json > sense.index
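For readers less familiar with jq, the same table can be approximated in Python. This is a sketch under assumptions: the function name `sense_index_rows` and the placeholder sense keys are invented for illustration, and the tab-joined output does not reproduce the escaping that jq's `@tsv` applies to special characters.

```python
import json
from collections import defaultdict


def sense_index_rows(lines):
    """Build rows of [lexical_form, number_senses, sense keys...]."""
    keys_by_form = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        synset = json.loads(line)
        for sense in synset["wordsenses"]:
            keys_by_form[sense["lexicalForm"]].append(sense["senseKey"])
    # Sort by lexical form, as jq's group_by does.
    return [
        [form, str(len(keys)), *keys]
        for form, keys in sorted(keys_by_form.items())
    ]


# Made-up input; the sense keys are placeholders, not real PWN keys.
sample_lines = [
    json.dumps({"wordsenses": [{"lexicalForm": "bank", "senseKey": "key-1"}]}),
    json.dumps({"wordsenses": [{"lexicalForm": "shore", "senseKey": "key-2"},
                               {"lexicalForm": "bank", "senseKey": "key-3"}]}),
]

rows = sense_index_rows(sample_lines)
for row in rows:
    print("\t".join(row))
```

To run it on real output, the lines of the exported file (mill-wn.json in the command above) can be passed in place of `sample_lines`.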
        

RDF

The RDF produced by mill (actually, temporarily by a Python script from mill’s JSON output) contains basically three kinds of objects: synsets, wordsenses, and literals (strings and numbers).

How-to

  • Install Python 3
  • (optional) Create and enter a virtual environment
  • Install the libraries in the requirements.txt file:
    pip install -r python/requirements.txt
        

Run

python python/mill.py --help

for further instructions.

Specification

Synset

lexicographerFile
points to a synset’s lexicographer file name
containsWordSense
points to a synset wordsense
definition
points to a synset’s definition (= gloss minus examples)
example
points to a synset example
frame
points to a synset frame (verb synsets only)
<relation>
points to another synset which relates to this one by <relation>
comment
synset comment text
sourceBegin
beginning position of synset in source file
sourceEnd
ending position of synset in source file

WordSense

lexicalForm
points to the wordsense lexical form (spaces are substituted by underscores; capital letters are allowed)
lexicalId
points to the wordsense lexical id
senseKey
points to the wordsense sense key
<relation>
points to another wordsense which relates to this one by <relation>
frame
points to a wordsense frame (verb wordsenses only)

WNDB

See the official Princeton WordNet documentation. Any deviation from it is considered a bug.

TEXT

You can export mill RDF files back to text using python/mill.py.

Run

python python/mill.py --help

for further instructions.
