This document is intended for developers who want to install, test or contribute to the code.
To start working on the project:
git clone [email protected]:huggingface/datasets-server.git
cd datasets-server
Install docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/)
Run the project locally:
make start
Run the project in development mode:
make dev-start
In development mode, you don't need to rebuild the docker images to apply a change to a worker. Just restart the worker's docker container and your changes will be applied.
To install a single job (in jobs), library (in libs) or service (in services), go to its directory, and install Python 3.9 (consider pyenv) and poetry (don't forget to add poetry to the PATH environment variable).
If you use pyenv:
cd libs/libcommon/
pyenv install 3.9.15
pyenv local 3.9.15
poetry env use python3.9
then:
make install
It will create a virtual environment in a ./.venv/ subdirectory.
If you use VSCode, it might be useful to use the "monorepo" workspace (see a blogpost for more explanations). It is a multi-root workspace, with one folder for each library and service (note that we hide them from the root to avoid editing from there). Each folder has its own Python interpreter, with access to the dependencies installed by Poetry. You might have to manually select the interpreter in every folder on first access; VSCode then stores the information in its local storage.
The repository is structured as a monorepo, with Python libraries and applications in jobs, libs and services:
- jobs contains the one-time jobs run by Helm before deploying the pods. For now, the only job migrates the databases when needed.
- libs contains the Python libraries used by the services and workers. For now, the only library is libcommon, which contains the common code for the services and workers.
- services contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point), the reverse proxy, and the worker that processes the queue asynchronously: it gets a "job" (caution: a job stored in the queue, not a Helm job), computes the expected response for the associated endpoint, and stores the response in the cache.
If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
The application is distributed in several components.
api is a web server that exposes the API endpoints. Apart from some endpoints (valid, is-valid), all the responses are served from pre-computed responses. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
The precomputed responses are stored in a Mongo database called "cache". They are computed by workers which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see libcommon).
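To make this queue-to-cache flow concrete, here is a minimal sketch of a worker loop, written against pymongo directly rather than libcommon. Only the database names ("queue" and "cache") come from this document; the collection names, field names, and helper functions are assumptions for illustration, not the actual implementation.

```python
from datetime import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
jobs = client["queue"]["jobs"]        # hypothetical collection name
cache = client["cache"]["responses"]  # hypothetical collection name


def process_one_job() -> bool:
    # Atomically claim a waiting job so that two workers never pick the same one.
    job = jobs.find_one_and_update(
        {"status": "waiting"},
        {"$set": {"status": "started", "started_at": datetime.utcnow()}},
    )
    if job is None:
        return False
    try:
        content = compute_response(job)
        error = None
    except Exception as err:
        content, error = None, str(err)
    # Store the result (error or valid response) in the "cache" database.
    cache.replace_one(
        {"kind": job["type"], "dataset": job["dataset"]},
        {"kind": job["type"], "dataset": job["dataset"], "content": content, "error": error},
        upsert=True,
    )
    jobs.update_one(
        {"_id": job["_id"]},
        {"$set": {"status": "success" if error is None else "error"}},
    )
    return True


def compute_response(job: dict) -> dict:
    # Placeholder: the real workers compute the response for the endpoint
    # associated with the job type (/splits, /first-rows, /parquet).
    return {"job_type": job["type"], "dataset": job["dataset"]}
```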
The API service exposes the /webhook endpoint, which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
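As a sketch only (not the actual API code), the webhook behavior described above amounts to something like the following; the payload field names ("event", "dataset"), the event values, and the collection names are assumptions.

```python
from datetime import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")


def on_webhook(payload: dict) -> None:
    dataset = payload["dataset"]      # assumed payload field
    if payload["event"] == "remove":  # assumed event name for a deletion
        # On deletion: delete the cached responses for the dataset.
        client["cache"]["responses"].delete_many({"dataset": dataset})
    else:
        # On creation or update: append a new job to the "queue" database.
        client["queue"]["jobs"].insert_one(
            {
                "type": "/splits",
                "dataset": dataset,
                "status": "waiting",
                "created_at": datetime.utcnow(),
            }
        )
```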
Note that every worker has its own job queue:
- /splits: the job is to refresh a dataset, namely to get the list of config and split names, then to create a new job for every split for the workers that depend on it (see the sketch after this list).
- /first-rows: the job is to get the columns and the first 100 rows of the split.
- /parquet: the job is to download the dataset, prepare a parquet version of every split (various sharded parquet files), and upload them to the refs/convert/parquet "branch" of the dataset repository on the Hub.
Note also that the workers create local files when the dataset contains images or audio. A shared directory (ASSETS_STORAGE_DIRECTORY) must therefore be provisioned with sufficient space for the generated files. The /first-rows endpoint responses contain URLs to these files, served by the API under the /assets/ endpoint.
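For illustration only, a /first-rows response for an image dataset might look roughly like this; the exact field names and URL layout are assumptions, not the documented schema.

```python
example_first_rows_response = {
    "dataset": "my-image-dataset",
    "config": "default",
    "split": "train",
    "features": [{"name": "image", "type": {"_type": "Image"}}],
    "rows": [
        {
            "row_idx": 0,
            # The cell value is a URL served by the API under the /assets/ endpoint,
            # backed by files written to ASSETS_STORAGE_DIRECTORY by the worker.
            "row": {
                "image": "https://datasets-server.huggingface.co/assets/my-image-dataset/--/default/train/0/image/image.jpg"
            },
        }
    ],
}
```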
Hence, the working application has:
- one instance of the API service, which exposes a port
- N1 instances of the splits worker, N2 instances of the first-rows worker (N2 should generally be higher than N1), and N3 instances of the parquet worker
- a Mongo server with two databases: "cache" and "queue"
- a shared directory for the assets
The application also has:
- a reverse proxy in front of the API to serve static files and proxy the rest to the API server
- an admin server to serve technical endpoints
The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.
| Environment | URL | Type | How to deploy |
|---|---|---|---|
| Production | https://datasets-server.huggingface.co | Helm / Kubernetes | make upgrade-prod in chart |
| Development | https://datasets-server.us.dev.moon.huggingface.tech | Helm / Kubernetes | make upgrade-dev in chart |
| Local build | http://localhost:8100 | Docker compose | make start (builds docker images) |
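Once the local build is running (make start), you can smoke-test it from Python. The endpoint and query parameter names below follow the endpoints mentioned earlier in this document, but treat them as assumptions.

```python
import requests

# Query the local stack started by `make start`, exposed on port 8100.
response = requests.get(
    "http://localhost:8100/is-valid",
    params={"dataset": "glue"},  # any public dataset name on the Hub
    timeout=10,
)
print(response.status_code, response.json())
```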
The CI checks the quality of the code through a GitHub action. To manually format the code of a job, library, service or worker:
make style
To check the quality (which includes checking the style, but also security vulnerabilities):
make quality
The CI runs the tests through a GitHub action. To manually test a job, library, service or worker:
make test
Note that it requires the resources to be ready, i.e. Mongo and the storage for assets.
To launch the end to end tests:
make e2e
If a service is updated, we don't update its version in the pyproject.toml file. But we have to update the Helm chart with the new image tag, corresponding to the latest docker image published on docker.io by the CI.
All the contributions should go through a pull request. The pull requests must be "squashed" (ie: one commit per pull request).
You can use act to test the GitHub Actions (see .github/workflows/) locally. It shortens the feedback loop when working on the GitHub Actions, avoids polluting the branches with empty pushes only meant to trigger the CI, and allows running only specific actions.
For example, to launch the build and push of the docker images to Docker Hub:
act -j build-and-push-image-to-docker-hub --secret-file my.secrets
where my.secrets is a file containing the secrets:
DOCKERHUB_USERNAME=xxx
DOCKERHUB_PASSWORD=xxx
GITHUB_TOKEN=xxx
Install pyenv:
$ curl https://pyenv.run | bash
Install Python 3.9.15:
$ pyenv install 3.9.15
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.15
Install Poetry:
curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
Set the Python version to use with Poetry:
poetry env use 3.9.15
Install the dependencies:
make install
To install the worker on macOS, you can follow the steps below.
Install brew:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install ICU:
$ brew install icu4c
==> Caveats
icu4c is keg-only, which means it was not symlinked into /opt/homebrew,
because macOS provides libicucore.dylib (but nothing else).
If you need to have icu4c first in your PATH, run:
echo 'export PATH="/opt/homebrew/opt/icu4c/bin:$PATH"' >> ~/.zshrc
echo 'export PATH="/opt/homebrew/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
For compilers to find icu4c you may need to set:
export LDFLAGS="-L/opt/homebrew/opt/icu4c/lib"
export CPPFLAGS="-I/opt/homebrew/opt/icu4c/include"
Add ICU to the path:
$ echo 'export PATH="/opt/homebrew/opt/icu4c/bin:$PATH"' >> ~/.zshrc
$ echo 'export PATH="/opt/homebrew/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
Install pyenv:
$ curl https://pyenv.run | bash
append the following lines to ~/.zshrc:
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
Log out and log in again.
Install Python 3.9.15:
$ pyenv install 3.9.15
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.15
Install poetry:
curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
append the following line to ~/.zshrc:
export PATH="$HOME/.local/bin:$PATH"
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Set the python version to use with poetry:
poetry env use 3.9.15
Avoid an issue with Apache beam (python-poetry/poetry#4888 (comment)):
poetry config experimental.new-installer false
Install the dependencies:
make install