Export
This page documents the possible mill export formats. We describe their release versions, but similar development versions are also available.
    TEXT -------hs-mill-----> JSON
     ^  \                      |
     |   \                     |
     |    ---hs-mill-> WNDB    |
     |                         |
    py-mill                 py-mill
      \                       /
       \                     /
        \------- RDF <------/
Here we briefly describe general aspects of all export formats (except the WNDB output, which should be exactly as Princeton WordNet specifies it).
In many WordNets the primary ID for synsets is the synset offset (see PWN documentation), which is often called the synset ID. This is a very unstable ID, since almost any change to the source files will change several synset offsets in the WNDB output. A stable alternative, very often used for wordsenses (and thus usable as a non-unique ID for synsets), is the sense key (see PWN documentation).
In mill’s export options (except the WNDB format) we use a version of sense keys: together, the wordnet name (when using multiple wordnets), the lexicographer file name, the lexical form and the lexical id form a unique ID for each wordsense, and a non-unique ID for a synset. This ID scheme is in one-to-one correspondence with sense keys, since we have created a separate lexicographer file for adjective satellites. The context usually makes clear whether an ID refers to a wordsense or a synset; if not, we will explicitly indicate which one we mean.
The mill ‘sense keys’ are even more stable than PWN sense keys. Both identification schemes depend on the user not changing or reusing lexical ids, but PWN sense keys additionally depend on the lexicographers not changing the head word of an adjective cluster, since the head word is included in the sense keys of satellite adjectives.
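To make the stability argument concrete, the following Python sketch contrasts the two schemes. The PWN sense key layout (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`) follows the PWN documentation; the function names, the tuple layout of the mill-style ID, the lexicographer file name `adjs.all`, and all values are assumptions made for illustration only.

```python
def pwn_sense_key(lemma, ss_type, lex_filenum, lex_id, head_word="", head_id=None):
    """Build a PWN-style sense key; head_word/head_id are only set for satellites."""
    head = "" if head_id is None else f"{head_id:02d}"
    return f"{lemma.lower()}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head_word}:{head}"


def mill_sense_id(lexfile, lexical_form, lexical_id, wordnet=None):
    """Hypothetical tuple form of a mill-style ID (wordnet name only when given)."""
    parts = (lexfile, lexical_form, lexical_id)
    return parts if wordnet is None else (wordnet, *parts)


# Illustrative values: a satellite adjective whose cluster head word is changed.
before = pwn_sense_key("torrid", 5, 0, 0, head_word="hot", head_id=1)
after = pwn_sense_key("torrid", 5, 0, 0, head_word="scorching", head_id=1)
print(before, after, before != after)

# The mill-style ID has no head-word component, so the change does not affect it.
print(mill_sense_id("adjs.all", "torrid", 0))
```

Changing the head word alters the PWN-style key of every satellite in the cluster, while the mill-style tuple stays the same as long as the lexical id is not changed or reused.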
Synset and wordsense relations and their properties are specified by the configuration file relations.tsv.
Synset and wordsense frames are specified by the configuration file frames.tsv.
Lexicographer file names are specified by the configuration file lexnames.tsv. Export formats usually exclude the PoS part from the name and include it separately.
mill outputs JSON in JSON Lines format, one synset per line. Run
mill export --help
for further instructions.
A synset is an object with the following keys:
- id: an array of four elements: POS (string), lexicographer file name (string), lexical form (string), and lexical id (integer). This same ID array is used for synsets and wordsenses.
- position: a two-element array where the first element is the beginning position of the synset in the source file, and the second element is the end position (array of two integers)
- definition: the synset’s definition (string)
- examples: the synset’s examples (array of strings)
- wordsenses: the synset’s wordsenses (array of objects; see below for specification)
- relations: the synset’s relations (array of objects; see below for specification)
- frames: the synset’s frames (array of integers)
- comments: the synset’s comment lines (stripped of the comment character)
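As a minimal sketch of consuming this output, the Python snippet below reads a JSON Lines stream and unpacks the top-level synset keys listed above. The function name `read_synsets` and every value in the sample record are hypothetical; the `wordsenses` and `relations` arrays are left empty to keep the sample short.

```python
import io
import json


def read_synsets(lines):
    """Yield one synset object per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical one-line export; every value here is made up for illustration.
sample = io.StringIO(json.dumps({
    "id": ["n", "animal", "dog", 0],
    "position": [120, 245],
    "definition": "a domesticated canid",
    "examples": ["the dog barked all night"],
    "wordsenses": [],
    "relations": [],
    "frames": [],
    "comments": [],
}) + "\n")

synsets = list(read_synsets(sample))
for synset in synsets:
    pos, lexfile, lexical_form, lexical_id = synset["id"]
    begin, end = synset["position"]
    print(pos, lexfile, lexical_form, lexical_id, begin, end, synset["definition"])
```

With a real export, the same generator can be fed an open file handle instead of the in-memory sample, so the file never has to be loaded whole.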
A wordsense is an object with the following keys:
- lexicalForm: the wordsense lexical form (string)
- lexicalId: the wordsense lexical id (string)
- frames: the wordsense’s frames (array of integers)
- syntacticMarker: the wordsense’s syntactic marker (adjectival position)
- pointers: the wordsense’s relations (array of objects; see below for specification)
- senseKey: the wordsense’s sense key (string)
Both synset relations and wordsense relations/pointers are encoded as objects with the following keys:
- name: the name of the relation (string)
- id: the array of four elements specified above
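Because relation targets are given as ID arrays, following a relation means looking the target up by ID. The Python sketch below shows one way to do that. It assumes that a wordsense's full ID is the POS and lexicographer file name of its synset combined with its own `lexicalForm` and `lexicalId`, and that lexical forms are written the same way in both places; the helper names, the relation name `hypernym`, and all data are hypothetical. Lexical ids are normalised to strings because they are an integer in the ID array and a string in the wordsense object.

```python
def sense_key_tuple(id_array):
    """Turn a four-element ID array into a hashable key (lexical id as str)."""
    pos, lexfile, lexical_form, lexical_id = id_array
    return (pos, lexfile, lexical_form, str(lexical_id))


def index_synsets(synsets):
    """Map the ID of every wordsense to the synset that contains it."""
    index = {}
    for synset in synsets:
        pos, lexfile, _, _ = synset["id"]
        for ws in synset["wordsenses"]:
            key = sense_key_tuple([pos, lexfile, ws["lexicalForm"], ws["lexicalId"]])
            index[key] = synset
    return index


def related_synsets(synset, index):
    """Return (relation name, target synset or None) for each synset relation."""
    return [(rel["name"], index.get(sense_key_tuple(rel["id"])))
            for rel in synset["relations"]]


# Hypothetical data: the relation name and all values are made up.
dog = {"id": ["n", "animal", "dog", 0],
       "wordsenses": [{"lexicalForm": "dog", "lexicalId": "0"},
                      {"lexicalForm": "domestic dog", "lexicalId": "0"}],
       "relations": [{"name": "hypernym", "id": ["n", "animal", "canine", 0]}]}
canine = {"id": ["n", "animal", "canine", 0],
          "wordsenses": [{"lexicalForm": "canine", "lexicalId": "0"}],
          "relations": []}

index = index_synsets([dog, canine])
for name, target in related_synsets(dog, index):
    print(name, "->", target["id"] if target else "unresolved")
```

Indexing by every wordsense ID reflects the point that an ID is unique for a wordsense but non-unique for a synset: several keys can lead to the same synset object.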
A collection of useful jq scripts:
- create a table of lexical_form|number_senses|senses... in TSV format:
jq -sr 'map(.wordsenses|map([.lexicalForm, .senseKey]))|add|group_by(.[0])|map([(.[0]|.[0]), (. | length)] + map(.[1]))|.[]|@tsv' mill-wn.json > sense.index
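For readers less familiar with jq, the same grouping can be sketched in Python. This is an approximate equivalent rather than a drop-in replacement: unlike jq's `@tsv`, it does not escape tabs or backslashes inside values. The function name and the placeholder sense keys are hypothetical.

```python
from collections import defaultdict


def sense_index_rows(synsets):
    """Return TSV rows: lexical form, number of senses, then the sense keys."""
    keys_by_form = defaultdict(list)
    for synset in synsets:
        for ws in synset["wordsenses"]:
            keys_by_form[ws["lexicalForm"]].append(ws["senseKey"])
    # jq's group_by sorts groups by key; sorted() mirrors that for strings.
    return ["\t".join([form, str(len(keys)), *keys])
            for form, keys in sorted(keys_by_form.items())]


# Hypothetical wordsenses; the sense keys are placeholders, not real PWN keys.
synsets = [
    {"wordsenses": [{"lexicalForm": "bank", "senseKey": "key-a"},
                    {"lexicalForm": "depository", "senseKey": "key-b"}]},
    {"wordsenses": [{"lexicalForm": "bank", "senseKey": "key-c"}]},
]
for row in sense_index_rows(synsets):
    print(row)
```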
The RDF produced by mill (actually, for the time being, by a Python script from mill’s JSON output) contains basically three kinds of objects: synsets, wordsenses, and literals (strings and numbers).
- Install Python 3
- (optional) Create and enter a virtual environment
- Install the libraries in the requirements.txt file: pip install -r python/requirements.txt
Run python python/mill.py --help for further instructions.
Synsets can have the following properties:
- lexicographerFile: points to a synset’s lexicographer file name
- containsWordSense: points to a synset wordsense
- definition: points to a synset’s definition (= gloss minus examples)
- example: points to a synset example
- frame: points to a synset frame (verb synsets only)
- <relation>: points to another synset which relates to this one by <relation>
- comment: synset comment text
- sourceBegin: beginning position of synset in source file
- sourceEnd: ending position of synset in source file
Wordsenses can have the following properties:
- lexicalForm: points to the wordsense lexical form (spaces are substituted by underscores; capital letters are allowed)
- lexicalId: points to the wordsense lexical id
- senseKey: points to the wordsense sense key
- <relation>: points to another wordsense which relates to this one by <relation>
- frame: points to a wordsense frame (verb wordsenses only)
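The relationship between the JSON keys and these RDF properties can be illustrated with a small Python sketch that turns one JSON synset object into plain (subject, property, object) tuples. This is not the conversion performed by python/mill.py: the node naming scheme, the function name, and the sample values are all hypothetical, and relations, frames, and comments are omitted for brevity.

```python
def synset_triples(synset):
    """Return (subject, property, object) tuples for one JSON synset object."""
    pos, lexfile, lexical_form, lexical_id = synset["id"]
    # Hypothetical node names; the identifiers used by the real export may differ.
    node = f"synset-{pos}-{lexfile}-{lexical_form}-{lexical_id}"
    begin, end = synset["position"]
    triples = [
        (node, "lexicographerFile", lexfile),
        (node, "definition", synset["definition"]),
        (node, "sourceBegin", begin),
        (node, "sourceEnd", end),
    ]
    triples += [(node, "example", example) for example in synset["examples"]]
    for ws in synset["wordsenses"]:
        ws_node = f"wordsense-{ws['senseKey']}"
        triples += [
            (node, "containsWordSense", ws_node),
            # Spaces in lexical forms are replaced by underscores.
            (ws_node, "lexicalForm", ws["lexicalForm"].replace(" ", "_")),
            (ws_node, "lexicalId", ws["lexicalId"]),
            (ws_node, "senseKey", ws["senseKey"]),
        ]
    return triples


# Hypothetical synset; the sense key is a placeholder, not a real one.
example_synset = {
    "id": ["n", "animal", "domestic dog", 0],
    "position": [120, 245],
    "definition": "a domesticated canid",
    "examples": ["the dog barked all night"],
    "wordsenses": [{"lexicalForm": "domestic dog", "lexicalId": "0",
                    "senseKey": "placeholder-key"}],
}
for triple in synset_triples(example_synset):
    print(triple)
```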
For the WNDB format, see the official Princeton WordNet documentation. Any deviation from it is considered a bug.
You can export mill RDF files back to text using python/mill.py. Run python python/mill.py --help for further instructions.