Merge branch 'main' into convert-CamelCase-to-snake_case
OmarAI2003 authored Nov 11, 2024
2 parents 1f0313f + 04e0955 commit 05bca42
Showing 174 changed files with 2,707,221 additions and 2,706,529 deletions.
18 changes: 15 additions & 3 deletions .github/workflows/python_package_ci.yaml
@@ -14,11 +14,16 @@ jobs:
os:
- macos-latest
- ubuntu-latest
- windows-latest
python-version:
- "3.9"

runs-on: ${{ matrix.os }}

defaults:
run:
shell: bash

steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
@@ -28,17 +33,24 @@ jobs:

- name: Create and Activate Virtual Environment
run: |
python3 -m venv venv
source venv/bin/activate
if [ "$RUNNER_OS" == "Windows" ]; then
python -m venv venv
source venv/Scripts/activate
else
python3 -m venv venv
source venv/bin/activate
fi
- name: Set up Homebrew
if: matrix.os == 'macos-latest'
uses: Homebrew/actions/setup-homebrew@master

- name: Install PyICU dependencies
if: matrix.os == 'macos-latest'
run: |
brew bundle install --file=Brewfile
# configure PATH & PKG_CONFIG_PATH as per
# https://gitlab.pyicu.org/main/pyicu#installing-pyicu
# https://gitlab.pyicu.org/main/pyicu
echo "/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" >> $GITHUB_PATH
echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig" >> $GITHUB_ENV
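The Windows branch added to the virtual environment step above relies on `venv` placing its activation script under `Scripts` on Windows and under `bin` on other platforms. A minimal Python sketch of that same decision follows; the helper name `venv_activate_script` is hypothetical and is not part of this repository.

```python
import sys
from pathlib import Path


def venv_activate_script(venv_dir: str, platform: str = sys.platform) -> Path:
    """Return the activate script that a bash shell would source.

    Hypothetical helper for illustration only.
    """
    # venv creates a "Scripts" directory on Windows and "bin" elsewhere.
    scripts_dir = "Scripts" if platform.startswith("win") else "bin"
    return Path(venv_dir) / scripts_dir / "activate"


if __name__ == "__main__":
    # sys.platform is "win32" on Windows and "darwin" on macOS.
    print(venv_activate_script("venv", "win32").as_posix())
    print(venv_activate_script("venv", "darwin").as_posix())
```

Because the workflow sets `shell: bash` as the default, sourcing the script from `venv/Scripts/activate` works on the Windows runner in the same way as `venv/bin/activate` does on macOS and Ubuntu.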
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -13,3 +13,9 @@ repos:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- id: ruff-format

- repo: https://github.com/tcort/markdown-link-check
rev: v3.13.6
hooks:
- id: markdown-link-check
args: [-q]
19 changes: 14 additions & 5 deletions CHANGELOG.md
@@ -14,16 +14,19 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).

### ✨ Features

- Queries for countless data types for countless languages were expanded and added ❤️
- Scribe-Data is now a fully functional CLI.
- Querying Wikidata lexicographical data can be done via the `--query` command ([#159](https://github.com/scribe-org/Scribe-Data/issues/159)).
- Query outputs can be JSON, CSV, TSV or SQLite, and conversions between output types are also possible ([#145](https://github.com/scribe-org/Scribe-Data/issues/145), [#146](https://github.com/scribe-org/Scribe-Data/issues/146)).
- Output paths can be set for query results ([#144](https://github.com/scribe-org/Scribe-Data/issues/144)).
- The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself ([#186](https://github.com/scribe-org/Scribe-Data/issues/186), [#157](https://github.com/scribe-org/Scribe-Data/issues/157)).
- Total Wikidata lexemes for languages and data types can be derived with the `--total` command ([#147](https://github.com/scribe-org/Scribe-Data/issues/147)).
- Commands can be used via an interactive mode with the `--interactive` command ([#158](https://github.com/scribe-org/Scribe-Data/issues/158)).
- Articles are removed from machine translations so they're more directly useful in Scribe applications ([#96](https://github.com/scribe-org/Scribe-Data/issues/96)).
- Queries for Basque verbs and adjectives were expanded and added respectively ([#222](https://github.com/scribe-org/Scribe-Data/issues/222)).
- The query for Danish verbs was expanded ([#225](https://github.com/scribe-org/Scribe-Data/issues/225)).
- Commands can be used via an interactive mode with the `--interactive` command ([#158](https://github.com/scribe-org/Scribe-Data/issues/158), [#203](https://github.com/scribe-org/Scribe-Data/issues/203)).
- Interactive mode works for `get` and `total` commands
- Outputs were standardized to assure that the CLI experience is consistent
- The machine translation process has been removed to make way for the Wiktionary based implementation ([#292](https://github.com/scribe-org/Scribe-Data/issues/292)).
- Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
- CLI commands have an argument check that can suggest correct languages and data types ([#341](https://github.com/scribe-org/Scribe-Data/issues/341)).

### 🐞 Bug Fixes

@@ -32,10 +35,13 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
### ✅ Tests

- Tests have been written for the CLI to assure that its functionality remains consistent.
- Workflows were created to assure that the Wikidata queries and project structure stay consistent so that package functionality is preserved ([#339](https://github.com/scribe-org/Scribe-Data/issues/339), [#357](https://github.com/scribe-org/Scribe-Data/issues/357)).
- The project's queries and structure have been updated to match the rules developed for the checks.

### 📝 Documentation

- The CLI's functionality has been fully documented ([#152](https://github.com/scribe-org/Scribe-Data/issues/152)).
- The CLI's functionality has been fully documented ([#152](https://github.com/scribe-org/Scribe-Data/issues/152), [#208](https://github.com/scribe-org/Scribe-Data/issues/208)).
- Documentation was created to show how to write Scribe-Data queries ([#395](https://github.com/scribe-org/Scribe-Data/issues/395)).

### ♻️ Code Refactoring

@@ -47,6 +53,9 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
- Paths within the package have been updated to work for all operating systems via `pathlib` ([#125](https://github.com/scribe-org/Scribe-Data/issues/125)).
- The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
- The `update_files` directory was removed in preparation of other means of showing data totals.
- The `language_data_extraction` directory was moved under the Wikidata directory as it's only used for those processes now ([#446](https://github.com/scribe-org/Scribe-Data/issues/446)).
- The emoji keyword process was centralized to simplify project maintenance ([#359](https://github.com/scribe-org/Scribe-Data/issues/359)).
- PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user ([#196](https://github.com/scribe-org/Scribe-Data/issues/196)).

## Scribe-Data 3.3.0

8 changes: 8 additions & 0 deletions README.md
@@ -224,6 +224,14 @@ The following table shows the supported languages and the amount of data availab

<strong>2024</strong>

- October: [Blog post on Medium](https://medium.com/@arpita151103/scribe-an-open-source-solution-for-language-learning-and-data-accessibility-092dab026fd6) discussing the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) development process, community and features
- October: [Blog post on Medium](https://medium.com/@mhmohona/ins-and-outs-of-scribe-data-cli-bd51202aa7c6) describing the main features of [Scribe-Data](https://github.com/scribe-org/Scribe-Data)
- September: [Final Google Summer of Code report](https://medium.com/@mhmohona/the-final-stretch-gsoc-journey-with-scribe-data-1740084c958d) on the creation of the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
- August: [Final Google Summer of Code report](https://jagmarcel.hashnode.dev/gsoc-2024-final-report) on the creation of Scribe's cross-language translation functionality
- July: [Blog post on Medium](https://medium.com/@mhmohona/halfway-there-my-gsoc-adventure-with-scribe-data-cli-2ffe6d727ecb) about the progress on creating the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
- July: [Blog post on Hashnode](https://jagmarcel.hashnode.dev/gsoc-2024-midterm-report) providing a midterm report on the localization and translation expansion for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS)
- July: [Blog post on Hashnode](https://jagmarcel.hashnode.dev/my-first-experiences-with-gsoc) about the initial steps towards the localization of [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS)
- June: [Blog post on Medium](https://medium.com/@mhmohona/first-month-as-a-gsoc-intern-building-scribe-data-cli-d0c12c9e8371) about the planned [Scribe-Data](https://github.com/scribe-org/Scribe-Data) CLI
- April: [Blog post on Medium](https://medium.com/@mhmohona/scribe-data-a-guide-to-open-source-language-data-a801c59db4c9) about [Scribe-Data](https://github.com/scribe-org/Scribe-Data) and its functionalities
- February: [Presentation slides](https://docs.google.com/presentation/d/1lMhYiQx1R99SVGhbikUGjOVaFgPPASvbzM2Bsu3NXSg/edit?usp=sharing) for Scribe's participation at the [Wikimedia Tech Safari Program](https://www.mediawiki.org/wiki/Wikimedia_Tech_Safari_Program)

2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -9,7 +9,7 @@
:target: https://github.com/scribe-org/Scribe-Data

.. |rtd| image:: https://img.shields.io/readthedocs/scribe-data.svg?label=%20&logo=read-the-docs&logoColor=ffffff
:target: http://scribe-datareadthedocs.io/en/latest/
:target: http://scribe-data.readthedocs.io/en/latest/

.. |issues| image:: https://img.shields.io/github/issues/scribe-org/Scribe-Data?label=%20&logo=github
:target: https://github.com/scribe-org/Scribe-Data/issues
76 changes: 63 additions & 13 deletions docs/source/scribe_data/cli.rst
@@ -143,15 +143,32 @@ Options:
- ``-ot, --output-type {json,csv,tsv}``: The output file type.
- ``-ope, --outputs-per-entry OUTPUTS_PER_ENTRY``: How many outputs should be generated per data entry.
- ``-o, --overwrite``: Whether to overwrite existing files (default: False).
- ``-a, --all ALL``: Get all languages and data types.
- ``-a, --all``: Get all languages and data types. Can be combined with ``-dt`` to get all languages for a specific data type, or with ``-lang`` to get all data types for a specific language.
- ``-i, --interactive``: Run in interactive mode.
- ``-ic, --identifier-case``: The case format for identifiers in the output data (default: camel).

Example:
Examples:

.. code-block:: bash
$ scribe-data get --all
Getting data for all languages and all data types...
.. code-block:: bash
$ scribe-data get --all -dt nouns
Getting all nouns for all languages...
.. code-block:: bash
$ scribe-data get --all -lang English
Getting all data types for English...
.. code-block:: bash
$ scribe-data get -l English --data-type verbs -od ~/path/for/output
Getting and formatting English verbs
Data updated: 100%|████████████████████████| 1/1 [00:XY<00:00, XY.Zs/process]
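The change from ``-a, --all ALL`` to ``-a, --all`` describes a boolean flag that takes no value and can be combined with ``-dt`` or ``-lang``. The following is a minimal `argparse` sketch of such a flag. It is an illustration under stated assumptions, not Scribe-Data's actual parser: the program name `scribe-data-sketch` and the helpers `build_parser` and `describe` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(prog="scribe-data-sketch")
    subparsers = parser.add_subparsers(dest="command")
    get = subparsers.add_parser("get")
    get.add_argument("-lang", "--language")
    get.add_argument("-dt", "--data-type")
    # store_true makes --all a flag with no value (no ALL metavar in usage).
    get.add_argument("-a", "--all", action="store_true")
    return parser


def describe(argv: list) -> str:
    """Summarize which slice of data the arguments select."""
    args = build_parser().parse_args(argv)
    if args.all and args.data_type:
        return f"all languages, data type {args.data_type}"
    if args.all and args.language:
        return f"all data types, language {args.language}"
    if args.all:
        return "all languages, all data types"
    return f"language {args.language}, data type {args.data_type}"


if __name__ == "__main__":
    print(describe(["get", "--all"]))
    print(describe(["get", "--all", "-dt", "nouns"]))
    print(describe(["get", "--all", "-lang", "English"]))
```

With `action="store_true"`, the value of `args.all` is `False` unless the flag is passed, which matches the documented form where ``--all`` no longer takes an argument.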
Behavior and Output:
^^^^^^^^^^^^^^^^^^^^
@@ -181,7 +198,7 @@ Behavior and Output:
.. code-block:: text
Getting and formatting English verbs
Data updated: 100%|████████████████████████| 1/1 [00:29<00:00, 29.73s/process]
Data updated: 100%|████████████████████████| 1/1 [00:XY<00:00, XY.Zs/process]
4. If no data is found, you'll see a warning:

@@ -243,30 +260,63 @@ Usage:
Options:
^^^^^^^^

- ``-lang, --language LANGUAGE``: The language(s) to check totals for.
- ``-lang, --language LANGUAGE``: The language(s) to check totals for. Can be a language name or QID.
- ``-dt, --data-type DATA_TYPE``: The data type(s) to check totals for.
- ``-a, --all ALL``: Get totals for all languages and data types.
- ``-a, --all``: Get totals for all languages and data types.

Examples:

.. code-block:: text
$scribe-data total -dt nouns # verbs, adjectives, etc
Data type: nouns
Total number of lexemes: 123456
$ scribe-data total --all
Total lexemes for all languages and data types:
==============================================
Language Data Type Total Lexemes
==============================================
English nouns 123,456
verbs 234,567
...
.. code-block:: text
$scribe-data total -lang English
Language: English
Total number of lexemes: 123456
$ scribe-data total --language English
Returning total counts for English data types...
Language Data Type Total Wikidata Lexemes
================================================================
English adjectives 12,345
adverbs 23,456
nouns 34,567
...
.. code-block:: text
$scribe-data total -lang English -dt nouns # verbs, adjectives, etc
$ scribe-data total --language Q1860
Wikidata QID Q1860 passed. Checking all data types.
Language Data Type Total Wikidata Lexemes
================================================================
Q1860 adjectives 12,345
adverbs 23,456
articles 30
conjunctions 40
nouns 56,789
personal pronouns 60
...
.. code-block:: text
$ scribe-data total --language English -dt nouns
Language: English
Data type: nouns
Total number of lexemes: 12345
Total number of lexemes: 12,345
.. code-block:: text
$ scribe-data total --language Q1860 -dt verbs
Language: Q1860
Data type: verbs
Total number of lexemes: 23,456
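The examples above show ``--language`` accepting either a language name such as ``English`` or a Wikidata QID such as ``Q1860``. One way to tell the two apart is a simple pattern check, sketched below. The function `classify_language_argument` and its pattern are assumptions made for illustration; the diff does not show how Scribe-Data performs this check.

```python
import re

# Wikidata item identifiers are the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def classify_language_argument(value: str) -> str:
    """Return "qid" for a Wikidata QID, otherwise "name" (hypothetical helper)."""
    return "qid" if QID_PATTERN.fullmatch(value.strip()) else "name"


if __name__ == "__main__":
    for candidate in ("English", "Q1860", "Quechua"):
        print(candidate, classify_language_argument(candidate))
```

Using `fullmatch` rather than `match` keeps language names that happen to start with "Q" from being treated as QIDs.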
Convert Command
~~~~~~~~~~~~~~~
5 changes: 0 additions & 5 deletions docs/source/scribe_data/load/index.rst
@@ -3,11 +3,6 @@ load/

`View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load>`_

.. toctree::
:maxdepth: 2

update_files/index

.. toctree::
:maxdepth: 1

8 changes: 0 additions & 8 deletions docs/source/scribe_data/load/update_files/index.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/scribe_data/unicode/index.rst
@@ -5,7 +5,7 @@ unicode/

The Scribe-Data Unicode process is powered by `cldr-json <https://github.com/unicode-org/cldr-json>`_ data from the `Unicode Consortium <https://home.unicode.org/>`_ and `PyICU <https://gitlab.pyicu.org/main/pyicu>`_, a Python extension that wraps the Unicode Consortium's `International Components for Unicode (ICU) <https://github.com/unicode-org/icu>`_ C++ project.

Please see the `installation guide for PyICU <https://gitlab.pyicu.org/main/pyicu#installing-pyicu>`_ as the extension must be linked to ICU on your machine to work properly.
Please see the `installation guide for PyICU <https://gitlab.pyicu.org/main/pyicu>`_ as the extension must be linked to ICU on your machine to work properly.

.. toctree::
:maxdepth: 1
4 changes: 2 additions & 2 deletions docs/source/scribe_data/wikidata/query_profanity.rst
@@ -13,8 +13,8 @@ Queries all profane words from a given language to be removed from autosuggest o
WHERE {
?lexemeId dct:language wd:LANGUAGE_QID; # replace language qid here
wikibase:lemma ?lemma;
ontolex:sense ?sense.
wikibase:lemma ?lemma;
ontolex:sense ?sense.
VALUES ?filter {
wd:Q8102