Syntax
A mill lexicographer file has more or less the same components as a
Princeton WordNet lexicographer file (which see), but with a nicer
syntax.
Here is an EBNF grammar:
    <lexfile>          ::= <synset>+
    <synset>           ::= (<comment> "\n")*
                           (<wordsense> "\n")+
                           <definition> "\n"
                           (<example> "\n")*
                           <frames> "\n"
                           (<relation> "\n")*
                           "\n"+
    <comment>          ::= "#" LINE
    <wordsense>        ::= "w:" <lexical_form>(<lexical_id>?) <wordsense-frames>? <syntactic-marker>? <word-relation>*
    <definition>       ::= "d:" TEXT
    <example>          ::= "e:" TEXT
    <frames>           ::= "fs:" INTEGER+
    <relation>         ::= <relation-name> ":" ID
    <lexical_form>     ::= TOKEN
    <lexical_id>       ::= "[" INTEGER "]"
    <wordsense-frames> ::= "fs" INTEGER+
    <syntactic-marker> ::= "marker" TOKEN
    <word-relation>    ::= <relation-name> ID
    <relation-name>    ::= TOKEN
where:
- TOKEN: a sequence of characters stopping at the first whitespace character
- LINE: a sequence of tokens stopping at the first newline character
- TEXT: any indented sequence of tokens
- INTEGER: a decimal integer
- ID: a WordNet identifier, composed of a WordNet name (e.g., a
  language name), a lexicographer file, a lexical form, and a
  lexical identifier (see WordNet Identifier explanation). In
  the source files, these get translated to
      @<wn-name>:<lexicographer-file>:<lexical-form> <lexical-id>
but some of these elements may be omitted; if <wn-name> is not specified, the object being identified is assumed to be in the same WordNet; if it is specified, so must be the lexicographer file. If <lexicographer-file> is omitted, it is assumed to be the current lexicographer file. Finally, when <lexical-id> is omitted it is taken to be zero.

<lexical-form> admits no whitespace in it, so if your word is composed of multiple tokens you must join them with underscores (_). There is currently no provision for escaping, but it may be added if deemed necessary.
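The defaulting rules for identifiers can be illustrated with a small Python sketch. This is not mill's implementation (mill is written in Haskell); `resolve_id`, `WordNetId`, and the names `wn30` and `noun.artifact` are hypothetical, and the sketch assumes references written as in the example synset on this page (`lexfile:form[id]`), with the lexical identifier in square brackets.

```python
import re
from typing import NamedTuple


class WordNetId(NamedTuple):
    """A fully resolved identifier (hypothetical type, for illustration only)."""
    wn_name: str
    lexfile: str
    lexical_form: str
    lexical_id: int


# Splits a reference into its body and an optional trailing "[N]".
_ID_SUFFIX = re.compile(r"^(?P<body>.*?)(?:\[(?P<lex_id>\d+)\])?$")


def resolve_id(ref: str, current_wn: str, current_lexfile: str) -> WordNetId:
    """Fill in the parts of a reference that were omitted."""
    if any(ch.isspace() for ch in ref):
        raise ValueError("no whitespace allowed; join tokens with '_'")
    match = _ID_SUFFIX.match(ref)
    lex_id = int(match.group("lex_id") or 0)  # omitted lexical id means zero
    parts = match.group("body").split(":")
    if len(parts) == 1:
        # Only the lexical form: same WordNet, current lexicographer file.
        wn, lexfile, form = current_wn, current_lexfile, parts[0]
    elif len(parts) == 2:
        # Lexicographer file given, WordNet name omitted.
        wn, lexfile, form = current_wn, parts[0], parts[1]
    elif len(parts) == 3:
        # A WordNet name requires the lexicographer file to be given too.
        wn, lexfile, form = parts
    else:
        raise ValueError(f"too many ':'-separated parts in {ref!r}")
    if not form:
        raise ValueError(f"missing lexical form in {ref!r}")
    return WordNetId(wn, lexfile, form, lex_id)


# Assumed context: we are editing "noun.artifact" in a WordNet called "wn30".
print(resolve_id("platform[3]", "wn30", "noun.artifact"))
print(resolve_id("verb.cognition:compute", "wn30", "noun.artifact"))
```

Because a WordNet name may only appear together with a lexicographer file, the number of `:`-separated parts is enough to tell the three cases apart in this sketch.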
A relation between two WordNet objects is only valid if it is in accordance with its specification in the configuration file (see Configuration).
Note that order matters for mill’s lexicographer files. (This is not
encoded by the illustrative grammar above.) We require:
- wordsenses in a synset
- relations in a synset/wordsense
to be lexicographically sorted. mill will warn you when anything is
unsorted and show you how to fix it. Editor support for doing this
automatically is planned.
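As a rough illustration of such a check, the Python sketch below reports the first adjacent pair that is out of order. It is not how mill does it: `first_unsorted` is a hypothetical helper, and plain string comparison is an assumption, since the exact sort key mill uses is not stated here (pass a different `key` to experiment).

```python
from typing import Callable, Optional, Sequence, Tuple


def first_unsorted(
    items: Sequence[str],
    key: Callable[[str], str] = lambda s: s,  # assumed: plain string order
) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that is out of order, or None."""
    for previous, current in zip(items, items[1:]):
        if key(previous) > key(current):
            return previous, current
    return None


# Wordsenses in the order they appear in the example synset.
print(first_unsorted(["computer", "computing_device", "computing_machine"]))
# The same words in a wrong order.
print(first_unsorted(["computing_machine", "computer"]))
```

The first call prints `None`; the second prints the offending pair, which is the kind of information a warning would need in order to tell you what to swap.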
Note that comments are only allowed at the beginning of lines preceding the start of a synset. This makes it possible to serialize reliably to the text format: one can programmatically alter the WordNet in one of its export (machine-readable) formats and then convert it back to the (human-readable) text format.
Note that this specification is informal; the authoritative source can
be seen at mill’s Parser module (if you read Haskell), or can be
ascertained by trial and error: write a file, export it, check the output.
Here is an example synset:
    # this is comment
    w: computer drf verb.cognition:compute drf verb.creation:computerise drf verb.possession:computerise[1]
    w: computing_device
    w: computing_machine
    d: a machine for performing calculations automatically
    e: When unprocessed data is sent to the computer with the help of input devices, the data is processed and sent to output devices.
    dt: adj.all:compatible[2]
    dt: throughput
    dt: noun.cognition:alpha_test
    hp: platform[3]
    hyper: Turing_machine
    hyper: noun.communication:internet_site
    hypo: machine
    mp: monitor[2]
    mt: noun.cognition:computer_science
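To make the line structure concrete, here is a minimal Python sketch that splits one synset into its components by line prefix. It is an illustration only, not mill's Parser module: `parse_synset` and `parse_wordsense` are hypothetical names, each field is assumed to fit on one line (no indented continuation of TEXT), and relation targets are kept as unresolved strings.

```python
from typing import Any, Dict, List


def parse_wordsense(tokens: List[str]) -> Dict[str, Any]:
    """Parse the tokens that follow "w:" on a wordsense line."""
    sense: Dict[str, Any] = {"form": tokens[0], "frames": [], "marker": None}
    i = 1
    if i < len(tokens) and tokens[i] == "fs":        # optional frames
        i += 1
        while i < len(tokens) and tokens[i].isdigit():
            sense["frames"].append(int(tokens[i]))
            i += 1
    if i + 1 < len(tokens) and tokens[i] == "marker":  # optional marker
        sense["marker"] = tokens[i + 1]
        i += 2
    rest = tokens[i:]
    # Remaining tokens come in (relation-name, ID) pairs.
    sense["relations"] = list(zip(rest[0::2], rest[1::2]))
    return sense


def parse_synset(text: str) -> Dict[str, Any]:
    """Split one synset block into comments, wordsenses, and so on."""
    synset: Dict[str, Any] = {
        "comments": [], "wordsenses": [], "definition": None,
        "examples": [], "frames": [], "relations": [],
    }
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            synset["comments"].append(line[1:].strip())
            continue
        name, sep, rest = line.partition(":")
        if not sep:
            raise ValueError(f"expected '<name>: ...' but got {line!r}")
        rest = rest.strip()
        if name == "w":
            synset["wordsenses"].append(parse_wordsense(rest.split()))
        elif name == "d":
            synset["definition"] = rest
        elif name == "e":
            synset["examples"].append(rest)
        elif name == "fs":
            synset["frames"] = [int(n) for n in rest.split()]
        else:
            # Anything else is treated as a synset relation.
            synset["relations"].append((name, rest))
    return synset


sample = """\
# this is comment
w: computer drf verb.cognition:compute drf verb.creation:computerise
w: computing_device
d: a machine for performing calculations automatically
dt: throughput
hp: platform[3]
"""
synset = parse_synset(sample)
print([sense["form"] for sense in synset["wordsenses"]])
print(synset["relations"])
```

The sketch does not validate relation names against the configuration file or check sort order; it only shows how the `w:`, `d:`, `e:`, `fs:` and relation lines map onto the grammar's components.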