Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Title: Comprehensive expansion of Ukrainian lexeme extraction queries
This pull request expands the SPARQL queries in our Ukrainian data extraction pipeline so that they capture more of the morphology of Ukrainian lexemes across several parts of speech. The changes per query:

1. Verbs 🔠 (query_verbs.sparql):
   - Added extraction of finite verb forms:
     * Present tense: 1st, 2nd, 3rd person singular (wd:Q192613 + wd:Q21714344/wd:Q51929049/wd:Q51929074 + wd:Q110786)
     * Past tense: masculine, feminine, neuter singular (wd:Q1240211 + wd:Q499327/wd:Q1775415/wd:Q1775461 + wd:Q110786)
   - Added imperative mood: 2nd person singular (wd:Q22716 + wd:Q51929049 + wd:Q110786)
   - Retained infinitive form extraction (wd:Q179230)

2. Nouns 📚 (query_nouns.sparql):
   - Extended the singular case paradigm:
     * Genitive (wd:Q146233), Dative (wd:Q145599), Accusative (wd:Q146078)
     * Instrumental (wd:Q192997), Locative (wd:Q202142)
   - Kept plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction

3. Adjectives 🏷️ (NEW: query_adjectives.sparql):
   - Added the adjectival paradigm:
     * Singular nominative: masculine (wd:Q499327), feminine (wd:Q1775415), neuter (wd:Q1775461)
     * Plural nominative (wd:Q146786)
   - Included degree forms: comparative (wd:Q14169499) and superlative (wd:Q1817208)

4. Adverbs 🔄 (NEW: query_adverbs.sparql):
   - Added a query for adverb extraction:
     * Base form (lemma)
     * Comparative (wd:Q14169499) and superlative (wd:Q1817208) degrees

5. Prepositions 📍 (query_prepositions.sparql):
   - Optimized the existing query structure
   - Improved case association extraction (wdt:P5713)

6. Proper Nouns 👤 (query_proper_nouns.sparql):
   - Expanded the singular case paradigm:
     * Nominative (lemma), Genitive (wd:Q146233), Dative (wd:Q145599)
     * Accusative (wd:Q146078), Instrumental (wd:Q192997), Locative (wd:Q202142)
   - Added the Vocative case (wd:Q185077), which Ukrainian uses for direct address
   - Kept plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction

Technical implementation details:
- Every non-lemma form sits in its own OPTIONAL clause, so a lexeme that lacks a form is still returned
- Forms are selected consistently through wikibase:grammaticalFeature
- Lexeme IDs are extracted with REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "")
- The wikibase:label service supplies human-readable labels where applicable

The broader morphological coverage gives us a richer Ukrainian dataset for NLP tasks, including but not limited to:
- Morphological analysis and generation
- Named Entity Recognition (NER) with features based on grammatical case
- Machine Translation that uses grammatical information
- Linguistic research on Ukrainian morphosyntax

I have tested these queries on the Wikidata Query Service (https://query.wikidata.org/). Review is welcome, particularly on:
1. Correctness of the Wikidata QIDs for grammatical features
2. Query efficiency and potential for optimization
3. Completeness of the morphological paradigms for each part of speech

I'm happy to discuss any suggestions for further refinements or additions to our linguistic feature set.
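The lexeme ID extraction named in the implementation details can also be mirrored on the client side. The following is a minimal Python sketch, assuming results arrive as full entity IRIs; the helper name `lexeme_id` and the ID `L12345` are hypothetical and not part of the commit. Note that SPARQL's REPLACE treats its pattern as a regular expression, whereas `str.replace` is literal; for this prefix both give the same result.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def lexeme_id(uri: str) -> str:
    """Strip the Wikidata entity prefix, like REPLACE(STR(?lexeme), prefix, "")."""
    return uri.replace(ENTITY_PREFIX, "")


# "L12345" is a made-up lexeme ID used only for illustration.
print(lexeme_id("http://www.wikidata.org/entity/L12345"))
```

Doing the stripping inside the query, as the commit does, keeps downstream code free of IRI handling; the sketch is only useful when a raw `?lexeme` IRI is all that is available.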
1 parent 2f56620, commit c683f06
Showing 5 changed files with 249 additions and 20 deletions.
61 changes: 61 additions & 0 deletions
src/scribe_data/language_data_extraction/Ukrainian/adjectives/query_adjectives.sparql
@@ -0,0 +1,61 @@
+# tool: scribe-data
+# All Ukrainian (Q8798) adjectives and their forms.
+# Enter this query at https://query.wikidata.org/.
+
+SELECT
+  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
+  ?lemma
+  ?masculineSingularNominative
+  ?feminineSingularNominative
+  ?neuterSingularNominative
+  ?pluralNominative
+  ?comparativeForm
+  ?superlativeForm
+
+WHERE {
+  ?lexeme dct:language wd:Q8798 ;
+    wikibase:lexicalCategory wd:Q34698 ;
+    wikibase:lemma ?lemma .
+
+  # Masculine Singular Nominative
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?masculineSingularNominativeForm .
+    ?masculineSingularNominativeForm ontolex:representation ?masculineSingularNominative ;
+      wikibase:grammaticalFeature wd:Q499327, wd:Q110786, wd:Q131105 .
+  }
+
+  # Feminine Singular Nominative
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?feminineSingularNominativeForm .
+    ?feminineSingularNominativeForm ontolex:representation ?feminineSingularNominative ;
+      wikibase:grammaticalFeature wd:Q1775415, wd:Q110786, wd:Q131105 .
+  }
+
+  # Neuter Singular Nominative
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?neuterSingularNominativeForm .
+    ?neuterSingularNominativeForm ontolex:representation ?neuterSingularNominative ;
+      wikibase:grammaticalFeature wd:Q1775461, wd:Q110786, wd:Q131105 .
+  }
+
+  # Plural Nominative
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?pluralNominativeForm .
+    ?pluralNominativeForm ontolex:representation ?pluralNominative ;
+      wikibase:grammaticalFeature wd:Q146786, wd:Q131105 .
+  }
+
+  # Comparative Form
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?comparativeFormForm .
+    ?comparativeFormForm ontolex:representation ?comparativeForm ;
+      wikibase:grammaticalFeature wd:Q14169499 .
+  }
+
+  # Superlative Form
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?superlativeFormForm .
+    ?superlativeFormForm ontolex:representation ?superlativeForm ;
+      wikibase:grammaticalFeature wd:Q1817208 .
+  }
+}
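Every OPTIONAL block in the adjectives query has the same shape: bind a form, read its representation, and require a set of grammatical features. A minimal Python sketch of that pattern follows, assuming one wanted to generate the blocks rather than write them by hand. The helper `optional_form_block` and the `ADJECTIVE_FORMS` mapping are hypothetical and not part of scribe-data; the QIDs are copied from the query.

```python
# Hypothetical helper, not part of scribe-data: renders one OPTIONAL block
# in the shape used by query_adjectives.sparql.
def optional_form_block(var: str, features: list[str]) -> str:
    feature_list = ", ".join(f"wd:{qid}" for qid in features)
    return "\n".join(
        [
            "OPTIONAL {",
            f"  ?lexeme ontolex:lexicalForm ?{var}Form .",
            f"  ?{var}Form ontolex:representation ?{var} ;",
            f"    wikibase:grammaticalFeature {feature_list} .",
            "}",
        ]
    )


# Variable names and QIDs as they appear in the query.
ADJECTIVE_FORMS = {
    "masculineSingularNominative": ["Q499327", "Q110786", "Q131105"],
    "feminineSingularNominative": ["Q1775415", "Q110786", "Q131105"],
    "neuterSingularNominative": ["Q1775461", "Q110786", "Q131105"],
    "pluralNominative": ["Q146786", "Q131105"],
    "comparativeForm": ["Q14169499"],
    "superlativeForm": ["Q1817208"],
}

blocks = [optional_form_block(name, qids) for name, qids in ADJECTIVE_FORMS.items()]
print("\n\n".join(blocks))
```

A comma-separated feature list matches any form that carries at least those features, so a form tagged with additional features would also match.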
29 changes: 29 additions & 0 deletions
src/scribe_data/language_data_extraction/Ukrainian/adverbs/query_adverbs.sparql
@@ -0,0 +1,29 @@
+# tool: scribe-data
+# All Ukrainian (Q8798) adverbs and their forms.
+# Enter this query at https://query.wikidata.org/.
+
+SELECT
+  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
+  ?lemma
+  ?comparativeForm
+  ?superlativeForm
+
+WHERE {
+  ?lexeme dct:language wd:Q8798 ;
+    wikibase:lexicalCategory wd:Q380057 ;
+    wikibase:lemma ?lemma .
+
+  # Comparative Form
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?comparativeFormForm .
+    ?comparativeFormForm ontolex:representation ?comparativeForm ;
+      wikibase:grammaticalFeature wd:Q14169499 .
+  }
+
+  # Superlative Form
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?superlativeFormForm .
+    ?superlativeFormForm ontolex:representation ?superlativeForm ;
+      wikibase:grammaticalFeature wd:Q1817208 .
+  }
+}
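Because the comparative and superlative forms are OPTIONAL, a result row may lack those variables entirely. In the standard SPARQL JSON results format an unbound variable is simply absent from the binding, so consumers need to handle missing keys. The sketch below shows one way to flatten such results in Python; the function name `flatten_bindings` and the sample payload (including the lexeme IDs) are hand-written illustrations, not real query output.

```python
def flatten_bindings(results: dict) -> list[dict]:
    """Turn SPARQL JSON results into plain dicts, using None for unbound variables."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None for var in variables}
        for binding in results["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results layout (IDs are made up).
sample = {
    "head": {"vars": ["lexemeID", "lemma", "comparativeForm", "superlativeForm"]},
    "results": {
        "bindings": [
            {
                "lexemeID": {"type": "literal", "value": "L1"},
                "lemma": {"type": "literal", "xml:lang": "uk", "value": "швидко"},
                "comparativeForm": {"type": "literal", "xml:lang": "uk", "value": "швидше"},
            },
            {
                "lexemeID": {"type": "literal", "value": "L2"},
                "lemma": {"type": "literal", "xml:lang": "uk", "value": "тут"},
            },
        ]
    },
}

rows = flatten_bindings(sample)
for row in rows:
    print(row)
```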
50 changes: 44 additions & 6 deletions
src/scribe_data/language_data_extraction/Ukrainian/nouns/query_nouns.sparql
@@ -1,34 +1,72 @@
 # tool: scribe-data
-# All Ukrainian (Q8798) nouns, their plurals and the given forms.s for the given cases.
+# All Ukrainian (Q8798) nouns and their forms.
 # Enter this query at https://query.wikidata.org/.

 SELECT
   (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
   ?nomSingular
   ?nomPlural
   ?gender
+  ?genitiveSingular
+  ?dativeSingular
+  ?accusativeSingular
+  ?instrumentalSingular
+  ?locativeSingular

 WHERE {
   ?lexeme dct:language wd:Q8798 ;
     wikibase:lexicalCategory wd:Q1084 ;
     wikibase:lemma ?nomSingular .

-  # MARK: Nominative Plural
-
+  # Nominative Plural
   OPTIONAL {
     ?lexeme ontolex:lexicalForm ?nomPluralForm .
     ?nomPluralForm ontolex:representation ?nomPlural ;
       wikibase:grammaticalFeature wd:Q131105, wd:Q146786 .
   }

-  # MARK: Gender(s)
-
+  # Gender(s)
   OPTIONAL {
     ?lexeme wdt:P5185 ?nounGender .
   }

+  # Genitive Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?genitiveSingularForm .
+    ?genitiveSingularForm ontolex:representation ?genitiveSingular ;
+      wikibase:grammaticalFeature wd:Q146233, wd:Q110786 .
+  }
+
+  # Dative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?dativeSingularForm .
+    ?dativeSingularForm ontolex:representation ?dativeSingular ;
+      wikibase:grammaticalFeature wd:Q145599, wd:Q110786 .
+  }
+
+  # Accusative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?accusativeSingularForm .
+    ?accusativeSingularForm ontolex:representation ?accusativeSingular ;
+      wikibase:grammaticalFeature wd:Q146078, wd:Q110786 .
+  }
+
+  # Instrumental Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?instrumentalSingularForm .
+    ?instrumentalSingularForm ontolex:representation ?instrumentalSingular ;
+      wikibase:grammaticalFeature wd:Q192997, wd:Q110786 .
+  }
+
+  # Locative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?locativeSingularForm .
+    ?locativeSingularForm ontolex:representation ?locativeSingular ;
+      wikibase:grammaticalFeature wd:Q202142, wd:Q110786 .
+  }
+
   SERVICE wikibase:label {
     bd:serviceParam wikibase:language "[AUTO_LANGUAGE]".
     ?nounGender rdfs:label ?gender .
   }
-}
+}
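Each OPTIONAL block in the nouns query binds independently, so a lexeme with two forms that match the same features (for example, two genitive singular variants) comes back as several rows. A minimal Python sketch of merging such rows per lexeme follows. The function `merge_rows` is hypothetical, and the sample rows are hand-written for illustration rather than taken from Wikidata.

```python
from collections import defaultdict


def merge_rows(rows: list[dict]) -> dict:
    """Group rows by lexemeID and collect the distinct values seen in each column."""
    merged = defaultdict(lambda: defaultdict(list))
    for row in rows:
        columns = merged[row["lexemeID"]]
        for key, value in row.items():
            if key == "lexemeID" or value is None:
                continue  # skip the grouping key and unbound OPTIONAL variables
            if value not in columns[key]:
                columns[key].append(value)
    return {lexeme: dict(columns) for lexeme, columns in merged.items()}


# Hand-written sample: one lexeme returned twice because of two genitive variants.
rows = [
    {"lexemeID": "L1", "nomSingular": "стіл", "nomPlural": "столи",
     "gender": "masculine", "genitiveSingular": "стола", "locativeSingular": None},
    {"lexemeID": "L1", "nomSingular": "стіл", "nomPlural": "столи",
     "gender": "masculine", "genitiveSingular": "столу", "locativeSingular": None},
]

merged = merge_rows(rows)
print(merged["L1"]["genitiveSingular"])
```

Keeping every variant in a list avoids silently dropping forms; a consumer that needs exactly one value per case would have to pick a policy explicitly.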
64 changes: 55 additions & 9 deletions
src/scribe_data/language_data_extraction/Ukrainian/proper_nouns/query_proper_nouns.sparql
@@ -1,34 +1,80 @@
 # tool: scribe-data
-# All Ukrainian (Q8798) nouns, their plurals and the given forms.s for the given cases.
+# All Ukrainian (Q8798) proper nouns and their forms.
 # Enter this query at https://query.wikidata.org/.

 SELECT
   (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
   ?nomSingular
   ?nomPlural
   ?gender
+  ?genitiveSingular
+  ?dativeSingular
+  ?accusativeSingular
+  ?instrumentalSingular
+  ?locativeSingular
+  ?vocativeSingular

 WHERE {
   ?lexeme dct:language wd:Q8798 ;
     wikibase:lexicalCategory wd:Q147276 ;
     wikibase:lemma ?nomSingular .

-  # MARK: Nominative Plural
-
+  # Nominative Plural
   OPTIONAL {
     ?lexeme ontolex:lexicalForm ?nomPluralForm .
     ?nomPluralForm ontolex:representation ?nomPlural ;
-      wikibase:grammaticalFeature wd:Q131105 , wd:Q146786 ;
-  } .
-
-  # MARK: Gender(s)
+      wikibase:grammaticalFeature wd:Q131105, wd:Q146786 .
+  }

+  # Gender(s)
   OPTIONAL {
     ?lexeme wdt:P5185 ?nounGender .
-  } .
+  }

+  # Genitive Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?genitiveSingularForm .
+    ?genitiveSingularForm ontolex:representation ?genitiveSingular ;
+      wikibase:grammaticalFeature wd:Q146233, wd:Q110786 .
+  }
+
+  # Dative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?dativeSingularForm .
+    ?dativeSingularForm ontolex:representation ?dativeSingular ;
+      wikibase:grammaticalFeature wd:Q145599, wd:Q110786 .
+  }
+
+  # Accusative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?accusativeSingularForm .
+    ?accusativeSingularForm ontolex:representation ?accusativeSingular ;
+      wikibase:grammaticalFeature wd:Q146078, wd:Q110786 .
+  }
+
+  # Instrumental Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?instrumentalSingularForm .
+    ?instrumentalSingularForm ontolex:representation ?instrumentalSingular ;
+      wikibase:grammaticalFeature wd:Q192997, wd:Q110786 .
+  }
+
+  # Locative Singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?locativeSingularForm .
+    ?locativeSingularForm ontolex:representation ?locativeSingular ;
+      wikibase:grammaticalFeature wd:Q202142, wd:Q110786 .
+  }
+
+  # Vocative Singular (often used for proper nouns)
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?vocativeSingularForm .
+    ?vocativeSingularForm ontolex:representation ?vocativeSingular ;
+      wikibase:grammaticalFeature wd:Q185077, wd:Q110786 .
+  }
+
   SERVICE wikibase:label {
     bd:serviceParam wikibase:language "[AUTO_LANGUAGE]".
     ?nounGender rdfs:label ?gender .
   }
-}
+}
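As the SELECT list grows (this query now projects nine variables), it becomes easy to project a variable that no pattern in WHERE ever binds. The following is a rough, regex-based Python sketch of a check for that; `unbound_select_vars` is a hypothetical helper, it assumes a single top-level WHERE keyword and comment lines that start with `#`, and it is not a SPARQL parser.

```python
import re

VAR = re.compile(r"\?[A-Za-z_]\w*")
ALIAS = re.compile(r"AS\s+(\?[A-Za-z_]\w*)")


def unbound_select_vars(query: str) -> set[str]:
    """Return SELECT variables that never occur in the WHERE clause."""
    # Drop whole-line comments so their text cannot confuse the split below.
    code = "\n".join(
        line for line in query.splitlines() if not line.lstrip().startswith("#")
    )
    select_part, _, where_part = code.partition("WHERE")
    aliases = set(ALIAS.findall(select_part))  # e.g. ?lexemeID, defined via AS
    selected = set(VAR.findall(select_part)) - aliases
    return selected - set(VAR.findall(where_part))


# Shortened example query: ?vocativeSingular is projected but never bound.
example = """
SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?nomSingular
  ?vocativeSingular
WHERE {
  ?lexeme dct:language wd:Q8798 ;
    wikibase:lexicalCategory wd:Q147276 ;
    wikibase:lemma ?nomSingular .
}
"""

print(unbound_select_vars(example))
```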
65 changes: 60 additions & 5 deletions
src/scribe_data/language_data_extraction/Ukrainian/verbs/query_verbs.sparql
@@ -1,18 +1,73 @@
 # tool: scribe-data
-# All Ukrainian (Q8798) verbs and the given forms.
+# All Ukrainian (Q8798) verbs and their forms.
 # Enter this query at https://query.wikidata.org/.

 SELECT
   (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
   ?infinitive
+  ?presentFirstSingular
+  ?presentSecondSingular
+  ?presentThirdSingular
+  ?pastMasculineSingular
+  ?pastFeminineSingular
+  ?pastNeuterSingular
+  ?imperativeSecondSingular

 WHERE {
   ?lexeme dct:language wd:Q8798 ;
     wikibase:lexicalCategory wd:Q24905 .

-  # MARK: Infinitive
-
+  # Infinitive
   ?lexeme ontolex:lexicalForm ?infinitiveForm .
   ?infinitiveForm ontolex:representation ?infinitive ;
-    wikibase:grammaticalFeature wd:Q179230 ;
-}
+    wikibase:grammaticalFeature wd:Q179230 .
+
+  # Present tense, first person singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?presentFirstSingularForm .
+    ?presentFirstSingularForm ontolex:representation ?presentFirstSingular ;
+      wikibase:grammaticalFeature wd:Q192613, wd:Q21714344, wd:Q110786 .
+  }
+
+  # Present tense, second person singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?presentSecondSingularForm .
+    ?presentSecondSingularForm ontolex:representation ?presentSecondSingular ;
+      wikibase:grammaticalFeature wd:Q192613, wd:Q51929049, wd:Q110786 .
+  }
+
+  # Present tense, third person singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?presentThirdSingularForm .
+    ?presentThirdSingularForm ontolex:representation ?presentThirdSingular ;
+      wikibase:grammaticalFeature wd:Q192613, wd:Q51929074, wd:Q110786 .
+  }
+
+  # Past tense, masculine singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?pastMasculineSingularForm .
+    ?pastMasculineSingularForm ontolex:representation ?pastMasculineSingular ;
+      wikibase:grammaticalFeature wd:Q1240211, wd:Q499327, wd:Q110786 .
+  }
+
+  # Past tense, feminine singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?pastFeminineSingularForm .
+    ?pastFeminineSingularForm ontolex:representation ?pastFeminineSingular ;
+      wikibase:grammaticalFeature wd:Q1240211, wd:Q1775415, wd:Q110786 .
+  }
+
+  # Past tense, neuter singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?pastNeuterSingularForm .
+    ?pastNeuterSingularForm ontolex:representation ?pastNeuterSingular ;
+      wikibase:grammaticalFeature wd:Q1240211, wd:Q1775461, wd:Q110786 .
+  }
+
+  # Imperative, second person singular
+  OPTIONAL {
+    ?lexeme ontolex:lexicalForm ?imperativeSecondSingularForm .
+    ?imperativeSecondSingularForm ontolex:representation ?imperativeSecondSingular ;
+      wikibase:grammaticalFeature wd:Q22716, wd:Q51929049, wd:Q110786 .
+  }
+}
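The query header says to enter it at https://query.wikidata.org/, but it can also be submitted programmatically to the service's SPARQL endpoint. The sketch below only builds the HTTP request with the Python standard library and does not send it. The function name `build_wdqs_request`, the shortened query, and the User-Agent string are illustrative assumptions, not part of the commit.

```python
from urllib.parse import urlencode
from urllib.request import Request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_wdqs_request(query: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request for the Wikidata Query Service."""
    params = urlencode({"query": query, "format": "json"})
    return Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={
            "User-Agent": user_agent,  # a descriptive value identifying the client
            "Accept": "application/sparql-results+json",
        },
    )


# Shortened version of the verbs query, limited to a few rows for a quick check.
QUERY = """
SELECT ?lexeme ?infinitive WHERE {
  ?lexeme dct:language wd:Q8798 ;
    wikibase:lexicalCategory wd:Q24905 ;
    ontolex:lexicalForm ?infinitiveForm .
  ?infinitiveForm ontolex:representation ?infinitive ;
    wikibase:grammaticalFeature wd:Q179230 .
}
LIMIT 5
"""

request = build_wdqs_request(QUERY, "example-script/0.1 (placeholder contact)")
print(request.full_url[:60])
```

Passing `request` to `urllib.request.urlopen` would execute the query. The full verbs query, with seven OPTIONAL blocks over all Ukrainian verbs, may take noticeably longer than this shortened one.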