odanoburu edited this page Nov 29, 2019 · 5 revisions

A mill lexicographer file has more or less the same components as a Princeton WordNet lexicographer file (which see), but with nicer syntax.

Here is an EBNF grammar:

<lexfile> ::= <synset>+
<synset> ::= (<comment> "\n")*
             (<wordsense> "\n")+
             <definition> "\n"
             (<example> "\n")*
             <frames> "\n"
             (<relation> "\n")*
             "\n"+
<comment> ::= "#" LINE
<wordsense> ::= "w:" <lexical_form>(<lexical_id>?)
                <wordsense-frames>? <syntactic-marker>?
                <word-relation>*
<definition> ::= "d:" TEXT
<example> ::= "e:" TEXT
<frames> ::= "fs:" INTEGER+
<relation> ::= <relation-name> ":" ID
<lexical_form> ::= TOKEN
<lexical_id> ::= "[" INTEGER "]"
<wordsense-frames> ::= "fs" INTEGER+
<syntactic-marker> ::= "marker" TOKEN
<word-relation> ::= <relation-name> ID
<relation-name> ::= TOKEN

where:

TOKEN
is a sequence of characters stopping at the first whitespace character
LINE
is a sequence of tokens stopping at the first newline character
TEXT
is any indented sequence of tokens
INTEGER
is a decimal integer
ID
is a WordNet identifier, composed of a WordNet name (e.g., a language name), a lexicographer file, a lexical form, and a lexical identifier (see WordNet Identifier explanation). In the source files, these get translated to
@<wn-name>:<lexicographer-file>:<lexical-form> <lexical-id>
    

but some of these elements may be omitted:

  • if <wn-name> is not specified, the object being identified is assumed to be in the same WordNet; if it is specified, the lexicographer file must be specified too.
  • if <lexicographer-file> is omitted, it is assumed to be the current lexicographer file.
  • if <lexical-id> is omitted, it is taken to be zero.

<lexical-form> admits no whitespace, so if your word is composed of multiple tokens you must join them with underscores (_). There is currently no provision for escaping, but it may be added if deemed necessary.
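The defaulting rules for identifiers can be sketched in Python as follows. This is only an illustration, not mill's implementation (which is written in Haskell). It assumes the surface form seen in the example synset on this page, where the lexical id is written in square brackets right after the lexical form; `WordNetId`, `resolve_id`, and the WordNet names `en` and `pt` are hypothetical names introduced for this sketch.

```python
import re
from typing import NamedTuple


class WordNetId(NamedTuple):
    wn_name: str
    lexfile: str
    lexical_form: str
    lexical_id: int


# Assumed surface form (not taken from mill's parser):
#   [@<wn-name>:][<lexicographer-file>:]<lexical-form>[[<lexical-id>]]
_ID_RE = re.compile(
    r"^(?:@(?P<wn>[^:\s]+):)?"       # optional WordNet name, introduced by '@'
    r"(?:(?P<lexfile>[^:\s]+):)?"    # optional lexicographer file
    r"(?P<form>[^:\s\[\]]+)"         # lexical form: no whitespace allowed
    r"(?:\[(?P<lexid>\d+)\])?$"      # optional lexical id in square brackets
)


def resolve_id(text: str, current_wn: str, current_lexfile: str) -> WordNetId:
    """Fill in the omitted parts of an identifier from the current context."""
    match = _ID_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid identifier: {text!r}")
    wn, lexfile = match.group("wn"), match.group("lexfile")
    if wn is not None and lexfile is None:
        # A WordNet name may only be given together with a lexicographer file.
        raise ValueError("a WordNet name requires an explicit lexicographer file")
    return WordNetId(
        wn or current_wn,                 # default: the same WordNet
        lexfile or current_lexfile,       # default: the current lexicographer file
        match.group("form"),
        int(match.group("lexid") or 0),   # default: zero
    )
```

For instance, `resolve_id("adj.all:compatible[2]", "en", "noun.artifact")` keeps the explicit lexicographer file and lexical id and only fills in the WordNet name, while `resolve_id("throughput", "en", "noun.artifact")` takes everything except the lexical form from the context.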

A relation between two WordNet objects is only valid if it is in accordance with its specification in the configuration file (see Configuration).

Note that order matters in mill’s lexicographer files (this is not encoded by the illustrative grammar above). We require:

  • wordsenses in a synset
  • relations in a synset/wordsense

to be lexicographically sorted. mill will warn you when anything is unsorted and show you how to fix it. Editor support for doing this automatically is planned.
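As a rough illustration of the ordering requirement, the following Python sketch reports the first out-of-order pair in a list of wordsenses or relation names. It assumes plain string comparison; the exact sort key mill uses is not stated here, so treat the comparison as an assumption. `first_unsorted` is a hypothetical helper, not part of mill.

```python
from typing import List, Optional, Tuple


def first_unsorted(items: List[str]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that is out of order, or None if sorted."""
    for previous, current in zip(items, items[1:]):
        if current < previous:  # assumption: ordinary string comparison
            return previous, current
    return None


def fixed_order(items: List[str]) -> List[str]:
    """Return the items in the order the check above would accept."""
    return sorted(items)
```

Applied to the relation names `dt`, `hp`, `hyper`, `hypo`, `mp`, `mt`, `first_unsorted` finds no offending pair; swapping `hp` and `hyper` makes it report that pair.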

Note that comments are only allowed at the beginning of lines, before the start of a synset. This is so one can serialize reliably to the text format, allowing one to programmatically alter the WordNet in one of its export (machine-readable) formats and then convert it back to the (human-readable) text format.

Note that this specification is informal; the authoritative source is mill’s Parser module (if you read Haskell), or the behaviour can be ascertained by trial and error: write a file, export it, check the output.

Here is an example synset:

# this is comment
w: computer drf verb.cognition:compute drf verb.creation:computerise drf verb.possession:computerise[1]
w: computing_device
w: computing_machine
d: a machine for performing calculations automatically
e: When unprocessed data is sent to the computer with the help of input devices,
   the data is processed and sent to output devices.
dt: adj.all:compatible[2]
dt: throughput
dt: noun.cognition:alpha_test
hp: platform[3]
hyper: Turing_machine
hyper: noun.communication:internet_site
hypo: machine
mp: monitor[2]
mt: noun.cognition:computer_science
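
To make the structure of the example above more concrete, here is a small Python sketch that splits one synset into its comment lines and prefixed fields, and decomposes a `w:` line along the lines of the `<wordsense>` production. It is not mill's parser. It assumes that an indented line continues the previous field's text and that a wordsense without an explicit lexical id gets id zero; `split_synset` and `parse_wordsense` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple


def split_synset(block: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split one synset into comments and (prefix, value) pairs, in file order."""
    comments: List[str] = []
    fields: List[Tuple[str, str]] = []
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines only separate synsets
        if line.startswith("#"):
            comments.append(line[1:].strip())
        elif line[0].isspace() and fields:
            # Assumption: an indented line continues the previous value.
            prefix, value = fields[-1]
            fields[-1] = (prefix, value + " " + line.strip())
        else:
            # Split at the first colon only; identifiers may contain colons too.
            prefix, _, value = line.partition(":")
            fields.append((prefix.strip(), value.strip()))
    return comments, fields


def parse_wordsense(value: str) -> Dict[str, object]:
    """Decompose the part of a `w:` line that follows the prefix."""
    head, *rest = value.split()
    match = re.fullmatch(r"(.+?)(?:\[(\d+)\])?", head)
    form, lexical_id = match.group(1), int(match.group(2) or 0)
    frames: List[int] = []
    marker = None
    if rest and rest[0] == "fs":
        rest = rest[1:]
        while rest and rest[0].isdigit():
            frames.append(int(rest.pop(0)))
    if len(rest) > 1 and rest[0] == "marker":
        marker, rest = rest[1], rest[2:]
    # What remains is read as (relation-name, ID) pairs.
    relations = list(zip(rest[0::2], rest[1::2]))
    return {"form": form, "id": lexical_id, "frames": frames,
            "marker": marker, "relations": relations}
```

Running `split_synset` on the example yields one comment and the fields in file order (`w`, `w`, `w`, `d`, `e`, `dt`, ...), with the two-line `e:` example joined into a single value.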