Harvester

Data harvester and search API for finding (open access) higher education "products" such as learning materials and research output.

The project consists of a Django Rest Framework API with interactive documentation at /api/v1/docs/. Harvesting background tasks are handled by Celery. There is also an admin interface available to manage some configuration options and inspect responses from sources or locally stored data.

Prerequisites

This project uses Python 3.12, Docker, Docker Compose V2 and psql. Make sure they are installed on your system before installing the project. On Mac there are two additional requirements: Homebrew and libmagic.

Failing to install these prerequisites will lead to strange errors during the installation process.

Installation

The local setup is made in such a way that you can run the project both inside and outside of containers. Running some code outside of containers can be convenient for inspection, while running the project in containers stays closest to the production environment. External services like the database always run in containers, so Docker is required either way.

Environment configuration

The environment configuration is managed by a combination of Docker and the invoke library. For localhost the Docker and application configuration can be changed through a .env file. The first step of the installation is to create a simple default configuration that can be used during setup.

cp .env.example .env

Feel free to change your local .env file. How this works is explained in the "Surgically adjust environment configuration" section.

macOS Python setup

We recommend installing Python through Conda on Mac, especially for machines with Apple's M-series (Apple Silicon) chips.

brew install miniforge
conda env create -f environment.yml
source activate.sh

Non-Mac Python setup

To install the basic environment and tooling you'll need to set up a local environment on a host machine with:

python3 -m venv venv --copies --upgrade-deps
source activate.sh
pip install -r requirements.txt
pip install git+https://github.com/surfedushare/search-client.git@master

Other important setup

When using VS Code, copy activate.sh to venv/bin so Pylance can find it.
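For example, from the repository root:

cp activate.sh venv/bin/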

If you want to run the project outside of a container you'll need to add the following entries to your hosts file. It's strongly recommended to update your /etc/hosts immediately, to prevent confusing errors if you ever run the project outside of its containers.

127.0.0.1 postgres
127.0.0.1 opensearch
127.0.0.1 harvester
127.0.0.1 redis
127.0.0.1 tika

This way you can reach these containers from outside the container network through their names. This is important for many setup commands as well as for running tests during development.

To finish the setup you can run these commands to build all containers:

invoke aws.sync-repository-state --no-profile
invoke container.prepare-builds
docker compose up --build

After that you can seed the database with data:

invoke db.setup
invoke hrv.load-data localhost -a files -s development
invoke hrv.load-data localhost -a products -s development

The database setup command will have created a superuser called supersurf. On localhost the password is "qwerty". For AWS environments you can find the admin password under the Django secrets in the Secrets Manager. The secret value is named admin_password. You can copy it for each environment to your own password manager. The superuser is unavailable on production; a personal user will be given to you by SURF.

Getting started

The local setup is made in such a way that you can run the components of the project inside and outside of containers. External services like the database always run in containers. Make sure that you're using a terminal that you won't be using for anything else, as any containers will print their output to that terminal, similar to how the Django development server does.

When containers are running you can halt them with CTRL+C. To completely stop containers and release their resources you'll need to run the "stop" or "down" commands.
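For example (stop halts the running containers, down also removes them and releases their resources):

docker compose stop
docker compose down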

With any setup it's always required to use the activate.sh script to load your environment. This takes care of important things like database and (external) API credentials as well as AWS access.

source activate.sh

It's possible to specify for which project you want to activate an environment. Especially when running Django commands outside of Docker containers it's important to specify for which project you intend to run these commands. To do this you can set the APPLICATION_PROJECT environment variable manually, or you can specify a project as an argument to activate.sh like so:

source activate.sh publinova
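Setting the environment variable manually works the same way, for example:

export APPLICATION_PROJECT=publinova
source activate.sh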

After you've loaded your environment you can start all components of the project in containers, and stop them again, with:

docker compose up
docker compose down

Alternatively you can run processes outside of containers. This can be useful for connecting debuggers or diagnosing problems with Docker.
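As a minimal sketch, assuming you've activated your environment and added the hosts file entries from the installation section, you can run the Django development server directly on your host:

cd harvester
python manage.py runserver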

Available apps

Either way the database admin tool becomes available at:

http://localhost:6543/

Resetting your database

Sometimes you want to start fresh. If your database container is not running it's quite easy to throw all data away and create the database from scratch. To irreversibly destroy your local database with all data run:

docker volume rm harvester_postgres_database

Then re-run the database setup and data load commands described above.
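For reference, these are the same commands as in the installation section (with the database container running again):

invoke db.setup
invoke hrv.load-data localhost -a files -s development
invoke hrv.load-data localhost -a products -s development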

Surgically adjust environment configuration

Because we leverage the invoke library inside the Django app it's relatively easy to adjust environment settings without changing code. In this section we demonstrate basic usage, but read the Invoke documentation for a more comprehensive understanding.

Under the environments directory at the root of this repo you'll find directories for environments like localhost and production. Inside each directory sits an invoke.yml file which is the basis for most configuration. If you want to adjust configuration for a single process, without affecting other processes within the same environment, you can use environment variables. For instance the following will put any process into debug mode, even when using the production environment.

export DET_DJANGO_DEBUG=1

Alternatively you can prefix any command with the relevant environment variables, as with any bash command:

cd harvester
DET_DJANGO_DEBUG=1 python manage.py shell

If you want permanent changes for your localhost setup you can edit the .env file and re-run the activate.sh script to load the changes. Note that on localhost processes always run in debug mode, regardless of environment selection, to help with debugging environment differences.
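For example, assuming the .env file uses the same variable names as the exports shown above (check your .env.example for the exact names), a permanent debug setting could look like:

DET_DJANGO_DEBUG=1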

Tests

You can run tests for the harvester by running:

invoke test.run

Deploy

Once your tests pass you can make a new build for the project you want to deploy. This section outlines the most common options for deployment. Use invoke -h <command> to learn more about any invoke command.

When you want to deploy the development image to acceptance, or the acceptance image to production, you can skip the container.build and container.push commands.

Before deploying you'll want to decide on a version number. It's best to talk to the team about which version number to use for a deployment. To see a list of all currently available images for a project and the versions they are tagged with, run the following command:

invoke aws.print-available-images

Make sure that the version inside of harvester/package.py is different from any other version in the AWS registries. Commit a version change if this is not the case.
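To quickly check which version is currently committed you can inspect that file, for example (assuming a conventional version assignment in package.py):

grep -i version harvester/package.py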

You can build the harvester and nginx containers by running the following command:

invoke container.build

After you have created the image you can push it to AWS. This command will push to a registry that's available to all environments on AWS:

invoke container.push --docker-login

When an image is pushed to the registry you need to promote it for the environment you desire:

APPLICATION_MODE=<environment> invoke container.promote

When promoting an existing image (for instance the image already running on development or acceptance), add --version=<version_number> as well as --docker-login. Running a command with --docker-login is only necessary once per day, but when promoting an existing image you often haven't pushed it on the same day.
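For example, when promoting an image that was pushed on an earlier day:

APPLICATION_MODE=<environment> invoke container.promote --version=<version_number> --docker-login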

To change the running containers on AWS you then need to deploy for the environment you have updated images for:

APPLICATION_MODE=<environment> invoke container.deploy <environment>

This last deploy command will wait until all containers in the AWS cluster have been switched to the new version. This may take some time and the command will indicate that it is waiting to complete. If you do not want to wait you can safely press CTRL+C in the terminal. This cancels the waiting, not the deploy itself.

Release

A special case of deploying is releasing. You should take the following steps during a release:

  • There are a few things that you should check in a release PR, because they influence the release steps:
    • Are there any database migrations?
    • Are there changes to OpenSearch indices?
    • Is it changing the public harvester API that the clients are consuming? (Edusources, Publinova or MBO)
    • Is it depending on infrastructure changes?
  • Plan your release according to the questions above. Use common sense for this and take into account that we do rolling updates. For example, if you're deleting things from the database, indices, API or infrastructure, then the code that stops using them should be deployed before the actual deletions take place. If you're adding to the database, indices, API or infrastructure, then the additions should be in place before code runs that expects them. We write down these steps, together with their associated commands if applicable, in GitLab tickets to remember them.
  • With complicated changes we prefer to try them on development and create the release plan when putting the changes on acceptance. When we release to production, following the plan should be sufficient to make a smooth release.
  • When dealing with breaking changes we make a release tag on the default branch. The tag is equal to the release version prefixed with a "v", so for instance: v0.0.1 (see the example after this list). This allows us to easily jump back to a version without these breaking changes through git.
  • Once the release plan has been executed on production, and a tag for the previous release has been created when necessary, we merge the release PR into its branch.
  • Execute the necessary deploy commands described above.
  • Check https://harvester.prod.surfedushare.nl/ for the right version
  • Check https://harvester.prod.publinova.nl/ for the right version
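A minimal sketch of creating such a release tag on the default branch, assuming a standard remote named origin:

git tag v0.0.1
git push origin v0.0.1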

This completes the release. Post a message into Teams if people are waiting for certain features.

As you can see a release may consist of many steps and release plans can become elaborate. The following sections give an overview of commands that are regularly used during a release:

Rollback

To execute a rollback you need to "promote" a previous version and then deploy it. First of all you need to list all versions that are available with the following command.

invoke aws.print-available-images <target-project-name>

You can pick a <rollback-version> from the command output. Then, depending on the <environment> you want to roll back (production, acceptance or development), you can run the following commands to roll back to the version you want.

APPLICATION_MODE=<environment> invoke container.promote --version=<rollback-version>

And after that you need to deploy the containers on AWS Fargate:

APPLICATION_MODE=<environment> invoke container.deploy <environment>

Migrate

To migrate the database on AWS you can run the migration command:

APPLICATION_MODE=<environment> invoke db.migrate <environment>

Provisioning

There are a few commands that can help to provision things like the database on AWS. We're using Fabric for provisioning. You can run fab -h <command> to learn more about a particular Fabric command.

For more details on how to provision things on AWS, see the "provisioning the harvester" documentation.
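To see which provisioning tasks are available you can list them; this is standard Fabric behaviour:

fab --list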

Linting

The Python code uses flake8 as a linter. You can run it with the following command:

flake8 .
