Merge branch 'main' into convert-CamelCase-to-snake_case

scribe-org · Nov 11, 2024 · 05bca42
2 parents 1f0313f + 04e0955
commit 05bca42
Showing 174 changed files with 2,707,221 additions and 2,706,529 deletions.
diff --git a/.github/workflows/python_package_ci.yaml b/.github/workflows/python_package_ci.yaml
@@ -14,11 +14,16 @@ jobs:
         os:
           - macos-latest
           - ubuntu-latest
+          - windows-latest
         python-version:
           - "3.9"
 
     runs-on: ${{ matrix.os }}
 
+    defaults:
+      run:
+        shell: bash
+
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
@@ -28,17 +33,24 @@ jobs:
 
       - name: Create and Activate Virtual Environment
         run: |
-          python3 -m venv venv
-          source venv/bin/activate
+          if [ "$RUNNER_OS" == "Windows" ]; then
+            python -m venv venv
+            source venv/Scripts/activate
+          else
+            python3 -m venv venv
+            source venv/bin/activate
+          fi
 
       - name: Set up Homebrew
+        if: matrix.os == 'macos-latest'
         uses: Homebrew/actions/setup-homebrew@master
 
       - name: Install PyICU dependencies
+        if: matrix.os == 'macos-latest'
         run: |
           brew bundle install --file=Brewfile
           # configure PATH & PKG_CONFIG_PATH as per
-          # https://gitlab.pyicu.org/main/pyicu#installing-pyicu
+          # https://gitlab.pyicu.org/main/pyicu
           echo "/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" >> $GITHUB_PATH
           echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig" >> $GITHUB_ENV
 

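The workflow hunk above picks the virtual environment's activate script based on `$RUNNER_OS`. The same decision can be sketched in Python as follows. This is a minimal illustration only: `venv_activate_script` is a hypothetical helper and is not part of the Scribe-Data repository.

```python
from pathlib import PurePosixPath


def venv_activate_script(venv_dir: str, runner_os: str) -> str:
    """Return the path of a venv's activate script for a given runner OS.

    `runner_os` mirrors the RUNNER_OS values set by GitHub Actions
    ("Windows", "Linux" or "macOS"). Hypothetical helper for illustration.
    """
    # Windows venvs keep their scripts in Scripts/, POSIX venvs in bin/.
    scripts_dir = "Scripts" if runner_os == "Windows" else "bin"
    # Forward slashes are used because the workflow runs every step in bash.
    return str(PurePosixPath(venv_dir) / scripts_dir / "activate")


if __name__ == "__main__":
    for runner_os in ("Windows", "Linux", "macOS"):
        print(runner_os, venv_activate_script("venv", runner_os))
```

Setting `shell: bash` as the job default is what lets a single `source` call work on all three runners; without it the Windows runner would use PowerShell, which has no `source` command.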
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -13,3 +13,9 @@ repos:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
       - id: ruff-format
+
+  - repo: https://github.com/tcort/markdown-link-check
+    rev: v3.13.6
+    hooks:
+      - id: markdown-link-check
+        args: [-q]
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,16 +14,19 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 
 ### ✨ Features
 
+- Queries for countless data types for countless languages were expanded and added ❤️
 - Scribe-Data is now a fully functional CLI.
   - Querying Wikidata lexicographical data can be done via the `--query` command ([#159](https://github.com/scribe-org/Scribe-Data/issues/159)).
   - The output type of queries can be in JSON, CSV, TSV and SQLite, with conversions output types also being possible ([#145](https://github.com/scribe-org/Scribe-Data/issues/145), [#146](https://github.com/scribe-org/Scribe-Data/issues/146))
   - Output paths can be set for query results ([#144](https://github.com/scribe-org/Scribe-Data/issues/144)).
   - The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself ([#186](https://github.com/scribe-org/Scribe-Data/issues/186), [#157 ](https://github.com/scribe-org/Scribe-Data/issues/157)).
   - Total Wikidata lexemes for languages and data types can be derived with the `--total` command ([#147](https://github.com/scribe-org/Scribe-Data/issues/147)).
-  - Commands can be used via an interactive mode with the `--interactive` command ([#158](https://github.com/scribe-org/Scribe-Data/issues/158)).
-- Articles are removed from machine translations so they're more directly useful in Scribe applications ([#96](https://github.com/scribe-org/Scribe-Data/issues/96)).
-- Queries for Basque verbs and adjectives were expanded and added respectively ([#222](https://github.com/scribe-org/Scribe-Data/issues/222)).
-- The query for Danish verbs was expanded ([#225](https://github.com/scribe-org/Scribe-Data/issues/225)).
+  - Commands can be used via an interactive mode with the `--interactive` command ([#158](https://github.com/scribe-org/Scribe-Data/issues/158), [#203](https://github.com/scribe-org/Scribe-Data/issues/203)).
+    - Interactive mode works for `get` and `total` commands
+  - Outputs were standardized to assure that the CLI experience is consistent
+- The machine translation process has been removed to make way for the Wiktionary based implementation ([#292](https://github.com/scribe-org/Scribe-Data/issues/292)).
+- Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
+- CLI commands have an argument check that can suggest correct languages and data types ([#341](https://github.com/scribe-org/Scribe-Data/issues/341)).
 
 ### 🐞 Bug Fixes
 
@@ -32,10 +35,13 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 ### ✅ Tests
 
 - Tests have been written for the CLI to assure that it's functionality remains consistent.
+- Workflows were created to check that the Wikidata queries and project structure are consistent, which helps assure package functionality ([#339](https://github.com/scribe-org/Scribe-Data/issues/339), [#357](https://github.com/scribe-org/Scribe-Data/issues/357)).
+  - Project queries and the project structure have been updated to match the rules developed for the checks.
 
 ### 📝 Documentation
 
-- The CLI's functionality has been fully documented ([#152](https://github.com/scribe-org/Scribe-Data/issues/152)).
+- The CLI's functionality has been fully documented ([#152](https://github.com/scribe-org/Scribe-Data/issues/152), [#208](https://github.com/scribe-org/Scribe-Data/issues/208)).
+- Documentation was created to show how to write Scribe-Data queries ([#395](https://github.com/scribe-org/Scribe-Data/issues/395)).
 
 ### ♻️ Code Refactoring
 
@@ -47,6 +53,9 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 - Paths within the package have been updated to work for all operating systems via `pathlib` ([#125](https://github.com/scribe-org/Scribe-Data/issues/125)).
 - The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
 - The `update_files` directory was removed in preparation of other means of showing data totals.
+- The `language_data_extraction` directory was moved under the Wikidata directory as it's only used for those processes now ([#446](https://github.com/scribe-org/Scribe-Data/issues/446)).
+- The emoji keyword process was centralized to simplify project maintenance ([#359](https://github.com/scribe-org/Scribe-Data/issues/359)).
+- PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user ([#196](https://github.com/scribe-org/Scribe-Data/issues/196)).
 
 ## Scribe-Data 3.3.0
 

diff --git a/README.md b/README.md
@@ -224,6 +224,14 @@ The following table shows the supported languages and the amount of data availab
 
 <strong>2024</strong>
 
+- October: [Blog post on Medium](https://medium.com/@arpita151103/scribe-an-open-source-solution-for-language-learning-and-data-accessibility-092dab026fd6) discussing the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) development process, community and features
+- October: [Blog post on Medium](https://medium.com/@mhmohona/ins-and-outs-of-scribe-data-cli-bd51202aa7c6) describing the main features of [Scribe-Data](https://github.com/scribe-org/Scribe-Data)
+- September: [Final Google Summer of Code report](https://medium.com/@mhmohona/the-final-stretch-gsoc-journey-with-scribe-data-1740084c958d) on the creation of the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
+- August: [Final Google Summer of Code report](https://jagmarcel.hashnode.dev/gsoc-2024-final-report) on the creation of Scribe's cross-language translation functionality
+- July: [Blog post on Medium](https://medium.com/@mhmohona/halfway-there-my-gsoc-adventure-with-scribe-data-cli-2ffe6d727ecb) about the progress on creating the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
+- July: [Blog post on Hashnode](https://jagmarcel.hashnode.dev/gsoc-2024-midterm-report) providing a midterm report on the localization and translation expansion for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS)
+- July: [Blog post on Hashnode](https://jagmarcel.hashnode.dev/my-first-experiences-with-gsoc) about the initial steps towards the localization of [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS)
+- June: [Blog post on Medium](https://medium.com/@mhmohona/first-month-as-a-gsoc-intern-building-scribe-data-cli-d0c12c9e8371) about the planned [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
 - April: [Blog post on Medium](https://medium.com/@mhmohona/scribe-data-a-guide-to-open-source-language-data-a801c59db4c9) about [Scribe-Data](https://github.com/scribe-org/Scribe-Data) and its functionalities
 - February: [Presentation slides](https://docs.google.com/presentation/d/1lMhYiQx1R99SVGhbikUGjOVaFgPPASvbzM2Bsu3NXSg/edit?usp=sharing) for Scribe's participation at the [Wikimedia Tech Safari Program](https://www.mediawiki.org/wiki/Wikimedia_Tech_Safari_Program)
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -9,7 +9,7 @@
     :target: https://github.com/scribe-org/Scribe-Data
 
 .. |rtd| image:: https://img.shields.io/readthedocs/scribe-data.svg?label=%20&logo=read-the-docs&logoColor=ffffff
-    :target: http://scribe-datareadthedocs.io/en/latest/
+    :target: http://scribe-data.readthedocs.io/en/latest/
 
 .. |issues| image:: https://img.shields.io/github/issues/scribe-org/Scribe-Data?label=%20&logo=github
     :target: https://github.com/scribe-org/Scribe-Data/issues

diff --git a/docs/source/scribe_data/cli.rst b/docs/source/scribe_data/cli.rst
@@ -143,15 +143,32 @@ Options:
 - ``-ot, --output-type {json,csv,tsv}``: The output file type.
 - ``-ope, --outputs-per-entry OUTPUTS_PER_ENTRY``: How many outputs should be generated per data entry.
 - ``-o, --overwrite``: Whether to overwrite existing files (default: False).
-- ``-a, --all ALL``: Get all languages and data types.
+- ``-a, --all``: Get all languages and data types. Can be combined with ``-dt`` to get all languages for a specific data type, or with ``-lang`` to get all data types for a specific language.
 - ``-i, --interactive``: Run in interactive mode.
 - ``-ic, --identifier-case``: The case format for identifiers in the output data (default: camel).
 
-Example:
+Examples:
+
+.. code-block:: bash
+
+    $ scribe-data get --all
+    Getting data for all languages and all data types...
+
+.. code-block:: bash
+
+    $ scribe-data get --all -dt nouns
+    Getting all nouns for all languages...
+
+.. code-block:: bash
+
+    $ scribe-data get --all -lang English
+    Getting all data types for English...
 
 .. code-block:: bash
 
     $ scribe-data get -l English --data-type verbs -od ~/path/for/output
+    Getting and formatting English verbs
+    Data updated: 100%|████████████████████████| 1/1 [00:XY<00:00, XY.Zs/process]
 
 Behavior and Output:
 ^^^^^^^^^^^^^^^^^^^^
@@ -181,7 +198,7 @@ Behavior and Output:
     .. code-block:: text
 
         Getting and formatting English verbs
-        Data updated: 100%|████████████████████████| 1/1 [00:29<00:00, 29.73s/process]
+        Data updated: 100%|████████████████████████| 1/1 [00:XY<00:00, XY.Zs/process]
 
 4. If no data is found, you'll see a warning:
 
@@ -243,30 +260,63 @@ Usage:
 Options:
 ^^^^^^^^
 
-- ``-lang, --language LANGUAGE``: The language(s) to check totals for.
+- ``-lang, --language LANGUAGE``: The language(s) to check totals for. Can be a language name or QID.
 - ``-dt, --data-type DATA_TYPE``: The data type(s) to check totals for.
-- ``-a, --all ALL``: Get totals for all languages and data types.
+- ``-a, --all``: Get totals for all languages and data types.
 
 Examples:
 
 .. code-block:: text
 
-    $scribe-data total -dt nouns  # verbs, adjectives, etc
-    Data type: nouns
-    Total number of lexemes: 123456
+    $ scribe-data total --all
+    Total lexemes for all languages and data types:
+    ==============================================
+    Language     Data Type     Total Lexemes
+    ==============================================
+    English      nouns         123,456
+                 verbs         234,567
+    ...
 
 .. code-block:: text
 
-    $scribe-data total -lang English
-    Language: English
-    Total number of lexemes: 123456
+    $ scribe-data total --language English
+    Returning total counts for English data types...
+
+    Language        Data Type                 Total Wikidata Lexemes
+    ================================================================
+    English         adjectives                12,345
+                    adverbs                   23,456
+                    nouns                     34,567
+    ...
 
 .. code-block:: text
 
-    $scribe-data total -lang English -dt nouns  # verbs, adjectives, etc
+    $ scribe-data total --language Q1860
+    Wikidata QID Q1860 passed. Checking all data types.
+
+    Language        Data Type                 Total Wikidata Lexemes
+    ================================================================
+    Q1860           adjectives                12,345
+                    adverbs                   23,456
+                    articles                  30
+                    conjunctions              40
+                    nouns                     56,789
+                    personal pronouns         60
+    ...
+
+.. code-block:: text
+
+    $ scribe-data total --language English -dt nouns
     Language: English
     Data type: nouns
-    Total number of lexemes: 12345
+    Total number of lexemes: 12,345
+
+.. code-block:: text
+
+    $ scribe-data total --language Q1860 -dt verbs
+    Language: Q1860
+    Data type: verbs
+    Total number of lexemes: 23,456
 
 Convert Command
 ~~~~~~~~~~~~~~~

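The `cli.rst` hunk above documents that `--language` accepts either a language name or a Wikidata QID such as `Q1860`. One way to tell the two apart is sketched below. The names `is_qid` and `describe_language_argument` are hypothetical, and this is an assumption about how such a check could look, not the check Scribe-Data actually performs.

```python
import re

# A Wikidata item identifier is the letter Q followed by one or more digits.
QID_PATTERN = re.compile(r"Q\d+")


def is_qid(language: str) -> bool:
    """Return True if the argument looks like a Wikidata QID (e.g. Q1860)."""
    return QID_PATTERN.fullmatch(language.strip()) is not None


def describe_language_argument(language: str) -> str:
    """Classify a --language value as a QID or a language name (illustrative)."""
    kind = "QID" if is_qid(language) else "language name"
    return f"{language.strip()}: {kind}"


if __name__ == "__main__":
    for value in ("English", "Q1860"):
        print(describe_language_argument(value))
```

Using `fullmatch` rather than `match` keeps values such as `Q1860abc` from being treated as QIDs.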
diff --git a/docs/source/scribe_data/load/index.rst b/docs/source/scribe_data/load/index.rst
@@ -3,11 +3,6 @@ load/
 
 `View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load>`_
 
-.. toctree::
-    :maxdepth: 2
-
-    update_files/index
-
 .. toctree::
     :maxdepth: 1
 

diff --git a/docs/source/scribe_data/load/update_files/index.rst b/docs/source/scribe_data/load/update_files/index.rst
diff --git a/docs/source/scribe_data/unicode/index.rst b/docs/source/scribe_data/unicode/index.rst
@@ -5,7 +5,7 @@ unicode/
 
 The Scribe-Data Unicode process is powered by `cldr-json <https://github.com/unicode-org/cldr-json>`_ data from the `Unicode Consortium <https://home.unicode.org/>`_ and `PyICU <https://gitlab.pyicu.org/main/pyicu>`_, a Python extension that wraps the Unicode Consortium's `International Components for Unicode (ICU) <https://github.com/unicode-org/icu>`_ C++ project.
 
-Please see the `installation guide for PyICU <https://gitlab.pyicu.org/main/pyicu#installing-pyicu>`_ as the extension must be linked to ICU on your machine to work properly.
+Please see the `installation guide for PyICU <https://gitlab.pyicu.org/main/pyicu>`_ as the extension must be linked to ICU on your machine to work properly.
 
 .. toctree::
     :maxdepth: 1

diff --git a/docs/source/scribe_data/wikidata/query_profanity.rst b/docs/source/scribe_data/wikidata/query_profanity.rst
@@ -13,8 +13,8 @@ Queries all profane words from a given language to be removed from autosuggest o
 
     WHERE {
         ?lexemeId dct:language wd:LANGUAGE_QID; # replace language qid here
-                    wikibase:lemma ?lemma;
-                    ontolex:sense ?sense.
+            wikibase:lemma ?lemma;
+            ontolex:sense ?sense.
 
         VALUES ?filter {
             wd:Q8102