User Dictionary
The User Dictionary currently serves two purposes: suppressing or forcing sentence end conditions, and providing extra semantic information.
- Sentence end condition: iKnow uses simple heuristics to detect sentence endings. Each language model contains a list of generic abbreviations (e.g. English acronyms) to prevent unnatural sentence splitting. For finer control, specific terms can be added to the user dictionary.
- User-defined semantics: iKnow tags lexreps using labels (e.g. English labels). Besides these language-specific labels, a large set of language-independent labels is used; a subset of these are the User Dictionary (UD*) labels, which can be used to assign extra user-defined semantics. User dictionary labels are assigned before lexrep lookup and override the lexrep (e.g. English lexreps) labels. Beware: the language rules (e.g. English rules) need to pick up the UD labels to make them effective. If the language model does not support a specific label, it will not be taken into account. For an overview of the current state of UD label support, see the following table:
Label | en | cs | de | es | fr | ja | nl | pt | ru | sv | uk |
---|---|---|---|---|---|---|---|---|---|---|---|
UDConcept | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDRelation | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDNonrelevant | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDNegation | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDPosSentiment | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDNegSentiment | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDNumber | ✔️ | ✔️ | ✔️ | | | | | | | | |
UDTime | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
UDUnit | ✔️ | ✔️ | ✔️ | | | | | | | | |
UDCertainty | ✔️ | | | | | | | | | | |
UDGeneric1 | ✔️ | | | | | | | | | | |
UDGeneric2 | ✔️ | | | | | | | | | | |
UDGeneric3 | ✔️ | | | | | | | | | | |
UDIgnore* | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
The user dictionary is supported as of version 1.0. UDCertainty is supported as of version 1.0.11; it carries user-defined semantics with an extra parameter, the certainty level. UDGeneric1/2/3 are supported as of version 1.1.0. The UDIgnore labels (UDIgnoreCertainty, UDIgnoreNegation, UDIgnoreSentiment, UDIgnoreNegSentiment, UDIgnorePosSentiment, UDIgnoreTime, UDIgnoreNumber, UDIgnoreUnit) are supported as of version 1.5.0 (more information).
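Because an unsupported label is silently ignored by the language model, it can help to check the support table before adding a label. The sketch below transcribes the table into a plain Python mapping; the names `UD_LABEL_SUPPORT` and `is_label_supported` are made up for this illustration and are not part of iknowpy. It assumes that the check marks in the sparse rows (UDNumber, UDUnit, UDCertainty, UDGeneric1/2/3) belong to the leftmost language columns, as the table is printed here.

```python
# Languages that have a check mark in the support table ("uk" has none).
TEN_LANGUAGES = frozenset({"en", "cs", "de", "es", "fr", "ja", "nl", "pt", "ru", "sv"})

# Hypothetical helper data: the support table, transcribed by hand.
UD_LABEL_SUPPORT = {
    "UDConcept": TEN_LANGUAGES,
    "UDRelation": TEN_LANGUAGES,
    "UDNonrelevant": TEN_LANGUAGES,
    "UDNegation": TEN_LANGUAGES,
    "UDPosSentiment": TEN_LANGUAGES,
    "UDNegSentiment": TEN_LANGUAGES,
    "UDNumber": frozenset({"en", "cs", "de"}),
    "UDTime": TEN_LANGUAGES,
    "UDUnit": frozenset({"en", "cs", "de"}),
    "UDCertainty": frozenset({"en"}),
    "UDGeneric1": frozenset({"en"}),
    "UDGeneric2": frozenset({"en"}),
    "UDGeneric3": frozenset({"en"}),
}


def is_label_supported(label: str, language: str) -> bool:
    """Return True if the table lists `label` as supported for `language`."""
    if label.startswith("UDIgnore"):
        # The table has a single row for the whole UDIgnore* family.
        return language in TEN_LANGUAGES
    return language in UD_LABEL_SUPPORT.get(label, frozenset())


print(is_label_supported("UDCertainty", "en"))  # True
print(is_label_supported("UDCertainty", "fr"))  # False
```

The mapping has to be updated by hand whenever the table changes, so treat it as a convenience check rather than a source of truth.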
Both functions of the user dictionary have a corresponding method:
- influence the sentence boundary detection by defining abbreviations and sentence-ending strings
import iknowpy
engine = iknowpy.iKnowEngine()
user_dictionary = iknowpy.UserDictionary()
user_dictionary.add_sent_end_condition("Fr.", False) # suppress 'Fr.' as a sentence terminator
engine.load_user_dictionary(user_dictionary)
engine.index("some text Fr. and following.", "en")
# Normally 'Fr.' would split the sentence, but because of the 'False' argument to add_sent_end_condition(), this remains one sentence.
- Use a user dictionary label to tag a specific term
user_dictionary = iknowpy.UserDictionary()
user_dictionary.add_label("some text", "UDUnit") # "some text" will be labeled "UDUnit", before lexrep lookup
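To make the effect of a sentence end condition more tangible, here is a deliberately simplified toy splitter. It is not iKnow's algorithm; it only mimics the idea that a token ending in '.', '!' or '?' normally closes a sentence, that some tokens can be exempted, and that other tokens can force a break. The function name `split_sentences` and its parameters `no_end` and `force_end` are invented for this sketch.

```python
import re


def split_sentences(text, no_end=(), force_end=()):
    """Toy splitter: a token ending in '.', '!' or '?' closes the sentence,
    unless it is listed in `no_end`; tokens in `force_end` always close it."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        looks_like_end = re.search(r"[.!?]$", token) is not None
        if token in force_end or (looks_like_end and token not in no_end):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a terminator
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("some text Fr. and following."))
# ['some text Fr.', 'and following.']
print(split_sentences("some text Fr. and following.", no_end={"Fr."}))
# ['some text Fr. and following.']
```

In this toy model, `no_end` plays the role of calling add_sent_end_condition() with False, and `force_end` the role of calling it with True.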
To ease manual labeling, every available user label has a corresponding shortcut method, which makes code more readable and prevents typos in label names:
- enforce words or sequences of words to get a specified role (Concept - Relation - PathRelevant - NonRelevant)
user_dictionary.add_concept("one concept") # mark as a concept
user_dictionary.add_relation("one relation") # mark as a relation
user_dictionary.add_non_relevant("crap") # mark as non relevant
- define additional Negation markers
user_dictionary.add_negation("w/o") # mark w/o as a negation
- define Sentiment markers
user_dictionary.add_positive_sentiment("great") # mark as a positive sentiment
user_dictionary.add_negative_sentiment("awfull") # mark as a negative sentiment
- define Time markers
user_dictionary.add_time("future") # mark as a time attribute
- define units and numbers for Measurements
user_dictionary.add_unit("Hg") # mark as a unit
user_dictionary.add_number("magic number") # mark as a number
- define certainty levels
user_dictionary.add_certainty_level("suggests",4) # mark as a certainty with level 4
user_dictionary.add_certainty_level("maybe",2) # mark as a certainty with level 2
user_dictionary.add_certainty_level("certain",9) # mark as a certainty with level 9
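When label names arrive as data (for example from a configuration file), the shortcut methods listed above can still be used through a small dispatch table. This is a sketch: `SHORTCUT_METHODS` and `add_via_shortcut` are names invented here, and the pairing of label names with methods follows the label spellings in the support table. The helper only relies on the shortcut methods shown in this section and accepts any object that provides them.

```python
# Hypothetical dispatch table: UD label name -> shortcut method name.
SHORTCUT_METHODS = {
    "UDConcept": "add_concept",
    "UDRelation": "add_relation",
    "UDNonrelevant": "add_non_relevant",
    "UDNegation": "add_negation",
    "UDPosSentiment": "add_positive_sentiment",
    "UDNegSentiment": "add_negative_sentiment",
    "UDTime": "add_time",
    "UDUnit": "add_unit",
    "UDNumber": "add_number",
}


def add_via_shortcut(user_dictionary, literal, label):
    """Call the shortcut method that corresponds to `label` on `user_dictionary`."""
    try:
        method_name = SHORTCUT_METHODS[label]
    except KeyError:
        raise ValueError(f"no shortcut method known for label {label!r}") from None
    getattr(user_dictionary, method_name)(literal)


# Usage with a real dictionary (requires iknowpy):
#   user_dictionary = iknowpy.UserDictionary()
#   add_via_shortcut(user_dictionary, "w/o", "UDNegation")
```

UDCertainty is left out on purpose, since add_certainty_level() takes an extra level argument.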
A complete working scenario:
import iknowpy
engine = iknowpy.iKnowEngine() # the iknow engine object
user_dictionary = iknowpy.UserDictionary() # the user dictionary object
user_dictionary.add_label("some text", "UDUnit") # add label UDUnit
user_dictionary.add_sent_end_condition("Fr.", False) # suppress 'Fr.' as sentence end
user_dictionary.add_concept("one concept") # short version, adds UDConcept
user_dictionary.add_relation("one relation") # adds UDRelation
user_dictionary.add_non_relevant("crap") # adds UDNonrelevant
user_dictionary.add_negation("w/o") # adds UDNegation
user_dictionary.add_positive_sentiment("great") # adds UDPosSentiment
user_dictionary.add_negative_sentiment("awfull") # adds UDNegSentiment
user_dictionary.add_unit("Hg") # adds UDUnit
user_dictionary.add_number("magic number") # adds UDNumber
user_dictionary.add_time("future") # adds UDTime
engine.load_user_dictionary(user_dictionary) # load the user dictionary into the engine; this activates the dictionary
engine.index("some text Fr. w/o one concept and crap one relation that's great and awfull, magic number 3 Hg from future", "en", True) # index the text and generate traces for inspection
for trace in engine.m_traces:
    key, value = trace.split(':', 1)
    if key == 'UserDictionaryMatch': # a User Dictionary match is traced
        print(value)
engine.unload_user_dictionary() # unload the user dictionary; this deactivates the dictionary
engine.index("some text Fr. w/o one concept and crap one relation that's great and awfull, magic number 3 Hg from future", "en", True) # index the text again and generate traces
for trace in engine.m_traces:
    key, value = trace.split(':', 1)
    if key == 'LexrepIdentified': # no User Dictionary match anymore
        print(value)
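The trace loops above split each trace on its first colon and filter on one key. If several keys are of interest, the traces can be grouped once and looked up afterwards. The helper below is a sketch that assumes only what the scenario shows, namely that traces are strings of the form key:value; the name `group_traces` is invented for this illustration.

```python
from collections import defaultdict


def group_traces(traces):
    """Group 'key:value' trace strings by key, keeping the values in order.

    Only the first colon separates key and value, so values may contain colons.
    Strings without a colon are skipped instead of raising an error.
    """
    grouped = defaultdict(list)
    for trace in traces:
        key, separator, value = trace.partition(":")
        if not separator:
            continue
        grouped[key].append(value)
    return dict(grouped)


# Usage with a real engine (requires iknowpy and an indexed text with tracing on):
#   matches = group_traces(engine.m_traces).get("UserDictionaryMatch", [])
#   for match in matches:
#       print(match)
```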
A few remarks:
- engine.load_user_dictionary(user_dictionary): loads and activates 'user_dictionary'. If a previous dictionary is active, it will be unloaded.
- engine.unload_user_dictionary(): unloads the active user dictionary.
- user_dictionary.clear(): a user_dictionary object can be reused after calling its 'clear()' method.
- user_dictionary.add_label("some text", "UDUnit"): if the label name does not exist, this will throw an exception. Prefer the shortcut methods.
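Since a loaded dictionary stays active until it is unloaded or replaced, it is easy to forget the unload step when indexing fails halfway. One possible way to pair the two calls is a small context manager. This is a sketch built only on load_user_dictionary() and unload_user_dictionary() as described in the remarks above; `active_user_dictionary` is a name invented here, not an iknowpy feature.

```python
from contextlib import contextmanager


@contextmanager
def active_user_dictionary(engine, user_dictionary):
    """Load `user_dictionary` into `engine` and always unload it afterwards."""
    engine.load_user_dictionary(user_dictionary)
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises an exception.
        engine.unload_user_dictionary()


# Usage with real objects (requires iknowpy):
#   with active_user_dictionary(engine, user_dictionary):
#       engine.index("some text Fr. and following.", "en")
```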
Since version 1.0.1, all labels can be added at once by calling the add_all() method. labels.py has a collection of supported labels. For sentence end conditions, two special labels have been created, SENTENCE_END and SENTENCE_NO_END; they translate internally to the .add_sent_end_condition() method.
import iknowpy
udct_entries = [
    { 'literal': "some text", 'label': "UDConcept;UDUnit" },
    { 'literal': "Fr.", 'label': iknowpy.Labels.SENTENCE_NO_END },
    { 'literal': "one concept", 'label': iknowpy.Labels.CONCEPT },
    { 'literal': "one relation", 'label': iknowpy.Labels.RELATION },
    { 'literal': "crap", 'label': iknowpy.Labels.NONRELEVANT },
    { 'literal': "w/o", 'label': iknowpy.Labels.NEGATION },
    { 'literal': "great", 'label': iknowpy.Labels.POS_SENTIMENT },
    { 'literal': "awfull", 'label': iknowpy.Labels.NEG_SENTIMENT },
    { 'literal': "Hg", 'label': iknowpy.Labels.UNIT },
    { 'literal': "magic number", 'label': iknowpy.Labels.NUMBER },
    { 'literal': "future", 'label': iknowpy.Labels.TIME }
]
#
# test with add_all
#
engine = iknowpy.iKnowEngine() # the iknow engine object
user_dictionary = iknowpy.UserDictionary() # the user dictionary object
user_dictionary.add_all(udct_entries)
if len(user_dictionary.entries) != 11:
    print("ERROR: UD not fully loaded!")
# sample sentence, reused from the complete scenario above
text = "some text Fr. w/o one concept and crap one relation that's great and awfull, magic number 3 Hg from future"
engine.load_user_dictionary(user_dictionary)
engine.index(text, "en") # index text
If you pass a label list to the user dictionary constructor, the .add_all() method is called automatically. This makes it possible to construct, fill and load the dictionary in a single line of code, as in the following example.
#
# the shortest way: with the user dictionary constructor
#
engine.load_user_dictionary(iknowpy.UserDictionary(udct_entries))
engine.index(text, "en") # index text
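The entry list is plain Python data: each entry is a dict with a 'literal' and a 'label' key, and several labels can be combined with a semicolon, as in "UDConcept;UDUnit" above. A list like that can be generated from a more compact mapping. The sketch below assumes that plain label names are accepted in the same way as the "UDConcept;UDUnit" string in the example; `make_entries` is a name invented for this illustration.

```python
def make_entries(labels_by_literal):
    """Build a list of {'literal', 'label'} dicts from a literal -> label(s) mapping.

    A value may be a single label name or a list of label names; lists are
    joined with ';', following the "UDConcept;UDUnit" entry in the example.
    """
    entries = []
    for literal, labels in labels_by_literal.items():
        if isinstance(labels, str):
            labels = [labels]
        entries.append({"literal": literal, "label": ";".join(labels)})
    return entries


entries = make_entries({
    "some text": ["UDConcept", "UDUnit"],
    "w/o": "UDNegation",
})
print(entries)
# [{'literal': 'some text', 'label': 'UDConcept;UDUnit'}, {'literal': 'w/o', 'label': 'UDNegation'}]
```

The resulting list can then be passed to add_all() or to the UserDictionary constructor.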
For some extra information on sentiment analysis, see this article on the IRIS-embedded version: Sentiment markers in IRIS.
As of v1.0.11, a new udct label, "UDCertainty", has been added. This label is somewhat special, since it needs an extra parameter: the certainty level. You can specify it with a new user_dictionary method:
import iknowpy
engine = iknowpy.iKnowEngine() # the iknow engine object
user_dictionary = iknowpy.UserDictionary() # the user dictionary object
user_dictionary.add_certainty_level("suggests",4) # mark as a certainty with level 4
user_dictionary.add_certainty_level("maybe",2) # mark as a certainty with level 2
user_dictionary.add_certainty_level("certain",9) # mark as a certainty with level 9
engine.load_user_dictionary(user_dictionary) # load the user dictionary into the engine; this activates the dictionary
engine.index("he suggests that maybe we will be certain.", "en", True) # index the text and generate traces for inspection
for trace in engine.m_traces:
    key, value = trace.split(':', 1)
    if key == 'UserDictionaryMatch': # a User Dictionary match is traced
        print(value)
engine.unload_user_dictionary() # unload the user dictionary; this deactivates the dictionary
UDCertainty entry in the dct file: [image: udct file entry]
Python code to convert it to a user dictionary method: [image: convert to method]
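When many certainty markers have to be registered, the calls to add_certainty_level() can be driven from a mapping. The sketch below validates every level before adding anything, so a bad value does not leave the dictionary half filled. The accepted range of 0 to 9 is an assumption made for this illustration (the examples above only use 2, 4 and 9), and `add_certainty_levels` is an invented name.

```python
def add_certainty_levels(user_dictionary, levels):
    """Register every term -> level pair through add_certainty_level().

    Assumption: levels are integers from 0 to 9. All values are checked
    first, so nothing is added if any level is invalid.
    """
    for term, level in levels.items():
        is_int = isinstance(level, int) and not isinstance(level, bool)
        if not is_int or not 0 <= level <= 9:
            raise ValueError(
                f"certainty level for {term!r} must be an integer from 0 to 9, got {level!r}"
            )
    for term, level in levels.items():
        user_dictionary.add_certainty_level(term, level)


# Usage with a real dictionary (requires iknowpy):
#   add_certainty_levels(user_dictionary, {"suggests": 4, "maybe": 2, "certain": 9})
```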
As of v1.1.1, a new udct label type, "UDGeneric", has been added. It has three instances: UDGeneric1, UDGeneric2 and UDGeneric3. As its name implies, these can be used for generic purposes. They can be added using the common .add_label() method:
user_dictionary = iknowpy.UserDictionary()
user_dictionary.add_label("gen1", "UDGeneric1") # "gen1" will be labeled "UDGeneric1", before lexrep lookup
user_dictionary.add_label("gen2", "UDGeneric2") # "gen2" will be labeled "UDGeneric2", before lexrep lookup
user_dictionary.add_label("gen3", "UDGeneric3") # "gen3" will be labeled "UDGeneric3", before lexrep lookup
Shortcut methods are also available; see the udct.py example:
user_dictionary.clear()
user_dictionary.add_generic1("gen1")
user_dictionary.add_generic2("gen2")
user_dictionary.add_generic3("gen3")
engine.load_user_dictionary(user_dictionary)
engine.index("This gen1 followed by gen2 could be gen3.", "en", True)
for trace in engine.m_traces:
    key, value = trace.split(':', 1)
    if key == 'UserDictionaryMatch':
        print(value)
print("\nDone")
UDGeneric entry in the dct file: [image: udct file entry]