This tool creates a dictionary set for a language, composed of two JSON files: one for nouns and one for adjectives.
If you want to download pre-generated word sets, check out the Releases page. Each release includes a dictionaries.zip, and the languages it contains are listed in the release's description.
# Clone the repo
$ git clone https://github.com/ZEBAS204/kaikki-adj-noun-parser
# Change working directory
$ cd kaikki-adj-noun-parser
# (optional) Use virtual environment
$ python3 -m pip install virtualenv
$ python3 -m venv venv
$ source venv/bin/activate
# Install the requirements
$ python3 -m pip install -r requirements.txt
# After downloading the wordsets (.kds), run the
# build_data script to generate the wordsets.json
$ python3 build_data.py
# Now, you can parse the sets:
$ python3 parse_data.py
- Kaikki dictionary (Wiktextract can be used to generate the same dictionaries if you prefer).
- Wordfreq (used to tokenize words)
Note: Some languages, like Japanese, need additional dependencies to be downloaded. Please check Wordfreq's Additional CJK installations.
If you want to use the CLI, some additional dependencies are needed:
- Pycountry (used to get the country code of languages)
- beautifulsoup4 (used to parse Kaikki's web page for all available dictionaries)
Before we start: "word sets" refers to the nouns and adjectives of a language. Here, the term is mostly used for the JSON files that contain the nouns and the adjectives separately.
- Get the word sets of the desired language(s) (see Getting the word sets of a language)
- Remove duplicates
- Remove spaced words (most likely sayings)
- Remove words with blacklisted tags
- Remove words with blacklisted characters
- Use Wordfreq to tokenize and remove any hyphenated word with more than two hyphens
- Extract all filtered words into a JSON file as an array of words
- Sort words by length
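The filtering steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `filter_entries`, the tag blacklist, and the character blacklist are all assumptions made for the example, and the Wordfreq tokenization step is left out so the sketch only needs the standard library.

```python
def filter_entries(
    entries,
    blacklisted_tags=frozenset({"vulgar", "offensive"}),  # assumed example tags
    blacklisted_chars=frozenset("0123456789_.,"),         # assumed example characters
    max_hyphens=2,
):
    """Hypothetical filter over (word, tags) pairs that mirrors the listed steps."""
    seen = set()
    kept = []
    for word, tags in entries:
        if word in seen:                          # remove duplicates
            continue
        seen.add(word)
        if " " in word:                           # remove spaced words
            continue
        if blacklisted_tags.intersection(tags):   # remove blacklisted tags
            continue
        if blacklisted_chars.intersection(word):  # remove blacklisted characters
            continue
        if word.count("-") > max_hyphens:         # more than two hyphens
            continue
        kept.append(word)
    return sorted(kept, key=len)                  # sort words by length


entries = [
    ("cat", []),
    ("cat", []),
    ("ice cream", []),
    ("well-being", []),
    ("jack-in-the-box", []),
    ("damn", ["vulgar"]),
    ("r2d2", []),
    ("ox", []),
]
print(filter_entries(entries))  # ['ox', 'cat', 'well-being']
```

The resulting list can then be written out with `json.dump` as a plain array of words.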
To download the word sets of the languages you want to parse, you can use the function fetch_set in utils/fetch_sets.py. It automatically downloads them from the official website and stores them (by default, the download directory is ./sets).
Alternatively, use it directly from the console through the CLI, for example:
$ python3 utils/CLI.py --lang=spanish english german --location sets
Successfully downloaded en nouns
Successfully downloaded en adjectives
Successfully downloaded de nouns
Successfully downloaded de adjectives
For more information, use the --help option.
See all supported languages here or use the CLI:
$ python3 utils/CLI.py [-s | --supported-languages]
Supported Languages:
* English
* Latin
* Spanish
* Italian
...
To keep the scope of the tool simple, you will have to manually download your desired language's "Senses with part-of-speech" data for the nouns and adjectives. To do that:
- On the "List of kaikki.org machine-readable dictionaries", select the desired language(s) from the "Available languages" list.
- Inside the language's dictionaries, look under "Word sense lists" for the "Senses with part-of-speech Noun" and "Senses with part-of-speech Adjectives".
- Inside each of them, under the list of all word senses, you will see a "Download JSON data for these senses (xx.xMB)" link. Download it.
- Rename the noun file to [lang]_noun.kds and save it inside the sets folder.
- Rename the adjectives file to [lang]_adj.kds and save it inside the sets folder.
(The extension .kds stands for Kaikki Dictionary Set, and the file should be treated as a JSON file.)
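As a rough illustration of treating a .kds file as JSON, the sketch below reads the words out of one. It assumes the downloaded data holds one JSON object per line, each with a "word" key, which is how Kaikki downloads are commonly structured. Verify this against your own file. The function name `read_kds_words` is hypothetical and not part of this repository.

```python
import json


def read_kds_words(path):
    """Hypothetical reader: collect the "word" field of every entry in a .kds file.

    Assumes one JSON object per line (JSON Lines); blank lines are skipped.
    """
    words = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            word = entry.get("word")
            if word:
                words.append(word)
    return words
```

For example, `read_kds_words("sets/en_noun.kds")` would return the list of noun entries in file order, duplicates included.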
As an example, here is how to manually download the word sets for the English language, step by step:
- Go to the "List of kaikki.org machine-readable dictionaries" and select "English (1397862 senses)".
- Inside the "Word sense lists", download the JSON data for all the senses from the "Senses with part-of-speech Noun (778087)" and "Senses with part-of-speech Adjective (187458)".
- Rename the nouns JSON file to en_noun.kds and the adjectives file to en_adj.kds.
- After that, manually move those two files into the root of the sets directory.
- Run the script or do whatever you wish with them.
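The rename-and-move steps above could also be scripted. The helper below is only a sketch: `install_set` is a hypothetical name, not something this repository provides. It follows the [lang]_noun.kds / [lang]_adj.kds naming convention described earlier.

```python
import shutil
from pathlib import Path


def install_set(downloaded, lang, kind, sets_dir="sets"):
    """Hypothetical helper: move a downloaded file to <sets_dir>/<lang>_<kind>.kds."""
    if kind not in ("noun", "adj"):
        raise ValueError("kind must be 'noun' or 'adj'")
    target_dir = Path(sets_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the sets folder if missing
    target = target_dir / f"{lang}_{kind}.kds"
    shutil.move(str(downloaded), str(target))      # rename and move in one step
    return target
```

Calling `install_set(path_to_download, "en", "noun")` with the path of the downloaded nouns file would leave it at sets/en_noun.kds.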
- Profanity and other bad words are not filtered reliably (to work around this, you can use "Bad words list" or "List of Dirty, Naughty, Obscene, and Otherwise Bad Words" to filter the output afterwards)
- Does not differentiate dialects (e.g. American and British English)
- Sometimes single letters cannot be filtered because they "mean" something (e.g. the meaning of "a" is "letter A")
- Some words are wrongly categorized (e.g. a noun that is a suffix or a verb)
- Filtering removes any spaced words
- Filtering removes any word containing uncommon Unicode symbols. This was not tested with all languages, so you might need to tweak it
- Filtering is highly dependent on tags (if a word has no tags, the filter simply lets it through)
- Filtering removes any hyphenated word with more than two hyphens (which also causes the next issue).
- May not work properly on syllabary-based languages
- Not tested with "Dictionaries for historical languages"
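For the first limitation in the list above, a post-processing pass against an external bad-word list could look like the sketch below. It assumes both the generated words and the bad-word list are available as plain lists of strings. `remove_bad_words` is a hypothetical helper, not part of this project.

```python
def remove_bad_words(words, bad_words):
    """Hypothetical post-filter: drop every word found in an external bad-word list.

    The comparison is case-insensitive; the order of the remaining words is kept.
    """
    blocked = {w.casefold() for w in bad_words}
    return [w for w in words if w.casefold() not in blocked]


print(remove_bad_words(["apple", "Damn", "tree"], ["damn"]))  # ['apple', 'tree']
```

Only exact matches are removed, so inflected or misspelled variants would still need to be listed explicitly.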
This project is licensed under the MIT license. However, because it uses content extracted with Wiktextract from the Wikimedia Foundation, the licenses of that content may differ. See the LICENSE file for more details.