Export

This page documents the possible mill export formats. We describe their release versions, but similar development versions are also available.

TEXT -------hs-mill-----> JSON
  ^ \                       |
  |  \                      |
  |   ---hs-mill-> WNDB     |
  |                         |
py-mill                   py-mill
   \                       /
    \                     /
     \------- RDF <------/

Introduction

Here we briefly describe general aspects of all export formats (except the WNDB output, which should be exactly as Princeton WordNet specifies it).

IDs

In many WordNets the primary ID for synsets is the synset offset (see PWN documentation), which is often called the synset ID. This is a very unstable ID, since almost any change to the source files will change several synset offsets in the WNDB output. A stable ID that is very often used for wordsenses (and can thus be used as a non-unique ID for synsets) is the sense key (see PWN documentation).

In mill’s export options (except the WNDB format) we use a variant of sense keys: together, the wordnet name (when using multiple wordnets), the lexicographer file name, the lexical form and the lexical id form a unique ID for each wordsense, and a non-unique ID for a synset. This ID scheme is in one-to-one correspondence with sense keys, since we have created a separate lexicographer file for adjective satellites. The context usually makes clear whether an ID refers to a wordsense or a synset; if not, we will explicitly indicate which one we mean.

The mill ‘sense keys’ are even more stable than PWN sense keys. Both identification schemes depend on the user not changing or reusing lexical ids, but PWN additionally depends on the lexicographers not changing the head word of an adjective cluster, since the head word is included in the sense keys of satellite adjectives.
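
As a rough illustration of the ID scheme described above, the sketch below models an ID as a small Python tuple type. The `SenseId` class, its field names, and the example values are assumptions made for this illustration only; they are not part of mill.

```python
from typing import NamedTuple, Optional


class SenseId(NamedTuple):
    """Hypothetical container for the components of a mill-style ID."""
    lexfile: str                   # lexicographer file name
    lexical_form: str
    lexical_id: int
    wordnet: Optional[str] = None  # only relevant with multiple wordnets


# Illustrative values only: two wordsenses assumed to share one synset.
dog = SenseId("noun.animal", "dog", 0)
domestic_dog = SenseId("noun.animal", "domestic_dog", 0)

# Each wordsense has its own ID, and every one of them can stand for the synset,
# which is why the ID is unique per wordsense but non-unique per synset.
synset_of = {dog: "synset A", domestic_dog: "synset A"}
```

Looking up either ID in `synset_of` leads to the same synset, while the two IDs themselves remain distinct.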

Relations

Synset and wordsense relations and their properties are specified by the configuration file relations.tsv.

Frames

Synset and wordsense frames are specified by the configuration file frames.tsv.

Lexicographer file names

Lexicographer file names are specified by the configuration file lexnames.tsv. Export formats usually exclude the PoS part from the name and include it separately.
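
A minimal sketch of separating the PoS part from the rest of a lexicographer file name follows. It assumes names of the PWN-style form `<pos>.<name>` (for example `noun.animal`); the helper `split_lexname` is hypothetical and the layout of lexnames.tsv itself is not covered here.

```python
from typing import Tuple


def split_lexname(lexname: str) -> Tuple[str, str]:
    """Split a '<pos>.<name>' lexicographer file name into (pos, name)."""
    # Assumption: the PoS part is everything before the first dot.
    pos, sep, name = lexname.partition(".")
    if not sep or not pos or not name:
        raise ValueError(f"expected '<pos>.<name>', got {lexname!r}")
    return pos, name


pos, name = split_lexname("noun.animal")
```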

JSON

mill outputs JSON in JSON Lines format, one synset per line.
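
Because each line is a complete JSON document, the output can be consumed one synset at a time. The following is a minimal sketch using only the Python standard library; the function name `iter_synsets` and the two sample records are made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def iter_synsets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed synset per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two minimal, made-up records standing in for real mill output.
sample = io.StringIO('{"definition": "first"}\n{"definition": "second"}\n')
definitions = [synset["definition"] for synset in iter_synsets(sample)]
```

With a real export, an open file handle (for instance on mill-wn.json) can be passed instead of the `StringIO` object.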

How-to

mill export --help

Specification

A synset is an object with the following keys:

id: an array of four elements: POS (string), lexicographer file name (string), lexical form (string), and lexical id (integer). This same ID array is used for synsets and wordsenses.
position: a two element array where the first element is the beginning position of the synset in the source file, and the second element is the end position. (array of two integers)
definition: the synset’s definition (string)
examples: the synset’s examples (array of strings)
wordsenses: the synset’s wordsenses (array of objects; see below for specification)
relations: the synset’s relations (array of objects; see below for specification)
frames: the synset’s frames (array of integers)
comments: the synset’s comment lines (stripped of comment character)

A wordsense is an object with the following keys:

lexicalForm: the wordsense lexical form (string)
lexicalId: the wordsense lexical id (string)
frames: the wordsense’s frames (array of integers)
syntacticMarker: the wordsense’s syntactic marker (adjectival position)
pointers: the wordsense’s relations (array of objects; see below for specification)
senseKey: the wordsense’s sense key (string)

Both synset relations and wordsense relations/pointers are encoded as objects with the following keys:

name: the name of the relation (string)
id: the array of four elements specified above
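
To make the structure above more concrete, here is a small sketch that builds two made-up records with the listed keys and follows a relation from one synset to another. All values (the POS tag `n`, the relation name `hypernym`, the definitions) are invented for illustration. The sketch also assumes that all wordsenses of a synset share its POS and lexicographer file, and that `lexicalId` can be converted to an integer to match the ID array.

```python
import json

# Two made-up records shaped after the key list above; values are illustrative.
lines = [
    json.dumps({
        "id": ["n", "animal", "dog", 0],
        "definition": "a domesticated canid",
        "wordsenses": [
            {"lexicalForm": "dog", "lexicalId": "0", "pointers": []},
            {"lexicalForm": "domestic_dog", "lexicalId": "0", "pointers": []},
        ],
        "relations": [{"name": "hypernym", "id": ["n", "animal", "canine", 0]}],
    }),
    json.dumps({
        "id": ["n", "animal", "canine", 0],
        "definition": "a member of the dog family",
        "wordsenses": [{"lexicalForm": "canine", "lexicalId": "0", "pointers": []}],
        "relations": [],
    }),
]
synsets = [json.loads(line) for line in lines]

# Index every wordsense ID to its synset, since any of them may name the synset.
by_id = {}
for synset in synsets:
    pos, lexfile, _, _ = synset["id"]
    for ws in synset["wordsenses"]:
        # Assumption: lexicalId is coerced to int to match the ID array.
        by_id[(pos, lexfile, ws["lexicalForm"], int(ws["lexicalId"]))] = synset

# Look the synset up through one of its wordsenses and resolve its relations.
dog = by_id[("n", "animal", "domestic_dog", 0)]
targets = [(rel["name"], by_id[tuple(rel["id"])]["definition"])
           for rel in dog["relations"]]
```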

`jq` scripts

A collection of useful jq scripts:

Create a table of |lexical_form|number_senses|senses... in TSV format:

jq -sr 'map(.wordsenses|map([.lexicalForm, .senseKey]))|add|group_by(.[0])|map([(.[0]|.[0]), (. | length)] + map(.[1]))|.[]|@tsv'  mill-wn.json > sense.index
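
For readers who prefer Python, the sketch below produces roughly the same table as the jq command above. It is not part of mill; the function name `sense_index` is hypothetical, the sample sense keys are PWN-style placeholders, and the escaping performed by jq's `@tsv` is not replicated.

```python
import json
from collections import defaultdict


def sense_index(lines):
    """Return TSV rows: lexical form, number of senses, then the sense keys."""
    keys_by_form = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        for ws in json.loads(line)["wordsenses"]:
            keys_by_form[ws["lexicalForm"]].append(ws["senseKey"])
    # Sort by lexical form, mirroring the ordering that group_by imposes.
    return [
        "\t".join([form, str(len(keys)), *keys])
        for form, keys in sorted(keys_by_form.items())
    ]


# Made-up input lines containing only the keys this script needs.
sample = [
    '{"wordsenses": [{"lexicalForm": "dog", "senseKey": "dog%1:05:00::"}]}',
    '{"wordsenses": [{"lexicalForm": "dog", "senseKey": "dog%1:18:01::"},'
    ' {"lexicalForm": "cad", "senseKey": "cad%1:18:00::"}]}',
]
rows = sense_index(sample)
```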

RDF

The RDF produced by mill (actually, temporarily by a Python script from mill’s JSON output) contains basically three kinds of objects: synsets, wordsenses, and literals (strings and numbers).

How-to

Install Python 3
(optional) Create and enter a virtual environment
Install the libraries in the requirements.txt file:
```
pip install -r python/requirements.txt
```

Run

python python/mill.py --help

for further instructions.

Specification

Synset

lexicographerFile: points to a synset’s lexicographer file name
containsWordSense: points to a synset wordsense
definition: points to a synset’s definition (= gloss minus examples)
example: points to a synset example
frame: points to a synset frame (verb synsets only)
<relation>: points to another synset which relates to this one by <relation>
comment: synset comment text
sourceBegin: beginning position of synset in source file
sourceEnd: ending position of synset in source file

WordSense

lexicalForm: points to the wordsense lexical form (spaces are substituted by underscores; capital letters are allowed)
lexicalId: points to the wordsense lexical id
senseKey: points to the wordsense sense key
<relation>: points to another wordsense which relates to this one by <relation>
frame: points to a wordsense frame (verb wordsenses only)
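
As a rough picture of how a JSON synset could map onto the properties listed above, the sketch below emits plain (subject, predicate, object) tuples. It is not the conversion performed by python/mill.py: the function `synset_triples`, the subject label, and the sample record are assumptions, and a real RDF graph would use URIs rather than bare strings.

```python
def synset_triples(subject, synset):
    """Yield (subject, predicate, object) tuples for a few synset properties."""
    yield (subject, "definition", synset["definition"])
    for example in synset.get("examples", []):
        yield (subject, "example", example)
    # The JSON position array is split into the two source properties.
    begin, end = synset["position"]
    yield (subject, "sourceBegin", begin)
    yield (subject, "sourceEnd", end)


# Made-up record with only the keys used here.
record = {
    "definition": "a domesticated canid",
    "examples": ["the dog barked"],
    "position": [120, 185],
}
triples = list(synset_triples("synset-dog", record))
```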

WNDB

See the official Princeton WordNet documentation. Any deviation from it is considered a bug.

TEXT

You can export mill RDF files back to text using python/mill.py.

Run

python python/mill.py --help

for further instructions.
