This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Simplify catalog folder structure (#133)
* Simplify catalog folder structure

Signed-off-by: Olga Bulat <[email protected]>

* Move pyspark- requirements to archive

Signed-off-by: Olga Bulat <[email protected]>

* Simplify `common` module imports

Signed-off-by: Olga Bulat <[email protected]>

* Fix import paths

Signed-off-by: Olga Bulat <[email protected]>

* Remove `.github/workflows-disabled` folder

* Move jamendo and stocksnap scripts to the correct folders

Co-authored-by: Krystle Salazar <[email protected]>
obulat and Krystle Salazar authored Aug 19, 2021
1 parent ea3b2b8 commit 6c17203
Showing 434 changed files with 575 additions and 851 deletions.
16 changes: 8 additions & 8 deletions .github/ISSUE_TEMPLATE/image-provider-api-integration-request.md
@@ -36,14 +36,14 @@ No development should be done on a Provider API Script until the following info
## General Recommendations for implementation
<!-- modify this section if necessary -->

-- The script should be in the `src/cc_catalog_airflow/dags/provider_api_scripts/` directory.
+- The script should be in the `openverse_catalog/dags/provider_api_scripts/` directory.
- The script should have a test suite in the same directory.
- The script must use the `ImageStore` class (Import this from
-  `src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py`).
+  `openverse_catalog/dags/provider_api_scripts/common/storage/image.py`).
- The script should use the `DelayedRequester` class (Import this from
-  `src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py`).
+  `openverse_catalog/dags/provider_api_scripts/common/requester.py`).
- The script must not use anything from
-  `src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py`, since
+  `openverse_catalog/dags/provider_api_scripts/modules/etlMods.py`, since
that module is deprecated.
- If the provider API can be queried by 'upload date' or something similar,
  the script should take a `--date` parameter when run as a script, giving the
  date for which the data should be collected.
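The `--date` convention above can be sketched with `argparse`; the helper name and the `YYYY-MM-DD` format here are illustrative assumptions, not the repository's actual CLI:

```python
import argparse
from datetime import datetime


def parse_cli_date(argv=None):
    """Parse a --date option of the form YYYY-MM-DD.

    A hypothetical sketch of the convention described above, not the
    repository's actual command-line handling.
    """
    parser = argparse.ArgumentParser(description="Provider API script")
    parser.add_argument(
        "--date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        help="Collect media uploaded on this date (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)


args = parse_cli_date(["--date", "2021-08-19"])
print(args.date)  # prints 2021-08-19
```

When `--date` is omitted, `args.date` is `None`, and the script can fall back to ingesting the most recent data.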
@@ -66,10 +66,10 @@ No development should be done on a Provider API Script until the following info

For example Provider API Scripts and accompanying test suites, please see

-- `src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py` and
-- `src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py`, or
-- `src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py` and
-- `src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py`.
+- `openverse_catalog/dags/provider_api_scripts/flickr.py` and
+- `openverse_catalog/dags/provider_api_scripts/test_flickr.py`, or
+- `openverse_catalog/dags/provider_api_scripts/wikimedia_commons.py` and
+- `openverse_catalog/dags/provider_api_scripts/test_wikimedia_commons.py`.
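The `DelayedRequester` recommended above throttles outgoing calls so provider APIs are not hammered. A minimal sketch of that idea (not the repository's actual implementation; the injected `fetch` callable is a stand-in for a real HTTP client):

```python
import time


class MinimalDelayedRequester:
    """Enforce a minimum interval between outgoing requests.

    An illustrative sketch of the delayed-request idea, not the
    repository's actual DelayedRequester class.
    """

    def __init__(self, delay_seconds=1.0, fetch=None):
        self._delay = delay_seconds
        self._last_request = 0.0
        # `fetch` is injected so the sketch stays testable without a network.
        self._fetch = fetch or (lambda url: None)

    def get(self, url):
        # Sleep just long enough to honor the configured delay.
        wait = self._delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return self._fetch(url)
```

Injecting the fetch function keeps the throttling logic separate from the HTTP layer, which also makes the class trivial to unit-test.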

## Implementation
<!-- Replace the [ ] with [x] to check the box. -->
3 changes: 0 additions & 3 deletions .github/workflows-disabled/README.md

This file was deleted.

52 changes: 0 additions & 52 deletions .github/workflows-disabled/codeql.yml

This file was deleted.

26 changes: 0 additions & 26 deletions .github/workflows-disabled/main.yml

This file was deleted.

4 changes: 2 additions & 2 deletions .github/workflows/push_pull_request_test.yml
@@ -14,10 +14,10 @@ jobs:
- uses: actions/checkout@v2
- name: Build the stack
run: |
-        cd ./src/cc_catalog_airflow
+        cd ./openverse_catalog
cp env.template .env
docker-compose up -d
- name: Test
run: |
sleep 10
-        docker exec cc_catalog_airflow_webserver_1 /usr/local/airflow/.local/bin/pytest
+        docker exec openverse_catalog_webserver_1 /usr/local/airflow/.local/bin/pytest
78 changes: 44 additions & 34 deletions README.md
@@ -1,6 +1,7 @@
# Openverse Catalog

-This repository contains the methods used to identify over 1.4 billion Creative Commons licensed works. The challenge is that these works are dispersed
+This repository contains the methods used to identify over 1.4 billion Creative
+Commons licensed works. The challenge is that these works are dispersed
throughout the web and identifying them requires a combination of techniques.

Two approaches are currently in use:
@@ -41,7 +42,7 @@ series of parquet files that contain:
The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].

[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
-[ex_cc_links]: src/ExtractCCLinks.py
+[ex_cc_links]: archive/ExtractCCLinks.py

## API Data

@@ -56,10 +57,14 @@ workflows run `provider_api_scripts` to load and extract media data from the API
Below are some of the DAG workflows that run the corresponding
`provider_api_scripts` daily:

-- [Met Museum Workflow](src/cc_catalog_airflow/dags/metropolitan_museum_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/metropolitan_museum_of_art.py) )
-- [PhyloPic Workflow](src/cc_catalog_airflow/dags/phylopic_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/phylopic.py) )
-- [Flickr Workflow](src/cc_catalog_airflow/dags/flickr_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py) )
-- [Wikimedia Commons Workflow](src/cc_catalog_airflow/dags/wikimedia_workflow.py) ( [Commons API script](src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py) )
+- [Met Museum Workflow](openverse_catalog/dags/metropolitan_museum_workflow.py)
+  ( [API script](openverse_catalog/dags/provider_api_scripts/metropolitan_museum_of_art.py) )
+- [PhyloPic Workflow](openverse_catalog/dags/phylopic_workflow.py)
+  ( [API script](openverse_catalog/dags/provider_api_scripts/phylopic.py) )
+- [Flickr Workflow](openverse_catalog/dags/flickr_workflow.py)
+  ( [API script](openverse_catalog/dags/provider_api_scripts/flickr.py) )
+- [Wikimedia Commons Workflow](openverse_catalog/dags/wikimedia_workflow.py)
+  ( [Commons API script](openverse_catalog/dags/provider_api_scripts/wikimedia_commons.py) )

### Monthly Workflow

@@ -68,43 +73,41 @@ month at 16:00 UTC. These workflows are reserved for long-running jobs or
APIs that do not have date filtering capabilities so the data is reprocessed
monthly to keep the catalog updated. The following tasks are performed monthly:

-- [Cleveland Museum of Art](src/cc_catalog_airflow/dags/provider_api_scripts/cleveland_museum_of_art.py)
-- [RawPixel](src/cc_catalog_airflow/dags/provider_api_scripts/raw_pixel.py)
-- [Common Crawl Syncer](src/cc_catalog_airflow/dags/commoncrawl_s3_syncer/SyncImageProviders.py)
+- [Cleveland Museum of Art](openverse_catalog/dags/provider_api_scripts/cleveland_museum_of_art.py)
+- [RawPixel](openverse_catalog/dags/provider_api_scripts/raw_pixel.py)
+- [Common Crawl Syncer](openverse_catalog/dags/commoncrawl_scripts/commoncrawl_s3_syncer/SyncImageProviders.py)
+- [Brooklyn Museum](openverse_catalog/dags/provider_api_scripts/brooklyn_museum.py)
+- [NYPL](openverse_catalog/dags/provider_api_scripts/nypl.py)

### DB_Loader

The Airflow DAG defined in [`loader_workflow.py`][db_loader] runs every minute,
and loads the oldest file which has not been modified in the last 15 minutes
into the upstream database. It includes some data preprocessing steps.

-[db_loader]: src/cc_catalog_airflow/dags/loader_workflow.py
-
-### Other API Jobs
-
-- [Brooklyn Museum](src/cc_catalog_airflow/dags/provider_api_scripts/brooklyn_museum.py)
-- [NYPL](src/cc_catalog_airflow/dags/provider_api_scripts/nypl.py)
+[db_loader]: openverse_catalog/dags/loader_workflow.py
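The loader's selection rule described above (pick the oldest file that has not been modified in the last 15 minutes) can be sketched as follows; `oldest_stable_file` is a hypothetical helper for illustration, not the DAG's actual code:

```python
import time
from pathlib import Path


def oldest_stable_file(directory, min_age_seconds=15 * 60):
    """Return the oldest file in `directory` untouched for `min_age_seconds`.

    A sketch of the selection rule described above, not the workflow's
    actual implementation. Returns None when no file qualifies, which
    lets the caller skip the run without error.
    """
    cutoff = time.time() - min_age_seconds
    stable = [
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
    # Oldest modification time wins; `default=None` handles the empty case.
    return min(stable, key=lambda p: p.stat().st_mtime, default=None)
```

The age threshold guards against loading a file that a provider script is still writing.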

See each provider API script's notes in its respective [handbook][ov-handbook] entry.

[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/

## Development setup for Airflow and API puller scripts

There are a number of scripts in the directory
-[`src/cc_catalog_airflow/dags/provider_api_scripts`][api_scripts] eventually
+[`openverse_catalog/dags/provider_api_scripts`][api_scripts] eventually
loaded into a database to be indexed for searching in the Openverse API. These run in a
different environment than the PySpark portion of the project, and so have their
own dependency requirements.

-[api_scripts]: src/cc_catalog_airflow/dags/provider_api_scripts
+[api_scripts]: openverse_catalog/dags/provider_api_scripts

### Development setup

You'll need `docker` and `docker-compose` installed on your machine, with
versions new enough to use version `3` of Docker Compose `.yml` files.

To set up environment variables, navigate to the
-[`src/cc_catalog_airflow`][cc_airflow] directory, and run
+[`openverse_catalog`][cc_airflow] directory, and run

```shell
cp env.template .env
```

@@ -113,32 +116,32 @@ cp env.template .env
If needed, fill in API keys or other secrets and variables in `.env`. This is
not needed if you only want to run the tests. There is a
[`docker-compose.yml`][dockercompose] provided in the
-[`src/cc_catalog_airflow`][cc_airflow] directory, so from that directory, run
+[`openverse_catalog`][cc_airflow] directory, so from that directory, run

```shell
docker-compose up -d
```

This results, among other things, in the following running containers:

-- `cc_catalog_airflow_webserver_1`
-- `cc_catalog_airflow_postgres_1`
+- `openverse_catalog_webserver_1`
+- `openverse_catalog_postgres_1`

and some networking setup so that they can communicate. Note:

-- `cc_catalog_airflow_webserver_1` is running the Apache Airflow daemon, and also
+- `openverse_catalog_webserver_1` is running the Apache Airflow daemon, and also
has a few development tools (e.g., `pytest`) installed.
-- `cc_catalog_airflow_postgres_1` is running PostgreSQL, and is setup with some
+- `openverse_catalog_postgres_1` is running PostgreSQL, and is set up with some
databases and tables to emulate the production environment. It also provides a
database for Airflow to store its running state.
- The directory containing the DAG files, as well as dependencies will be
mounted to the `usr/local/airflow/dags` directory in the container
-  `cc_catalog_airflow_webserver_1`.
+  `openverse_catalog_webserver_1`.

At this stage, you can run the tests via:

```shell
-docker exec cc_catalog_airflow_webserver_1 /usr/local/airflow/.local/bin/pytest
+docker exec openverse_catalog_webserver_1 /usr/local/airflow/.local/bin/pytest
```

Edits to the source files or tests can be made on your local machine, then tests
@@ -147,14 +150,14 @@ can be run in the container via the above command to see the effects.
If you'd like, it's possible to log in to the webserver container via

```shell
-docker exec -it cc_catalog_airflow_webserver_1 /bin/bash
+docker exec -it openverse_catalog_webserver_1 /bin/bash
```

It's also possible to attach to the running command process of the webserver
container via

```shell
-docker attach --sig-proxy=false cc_catalog_airflow_webserver_1
+docker attach --sig-proxy=false openverse_catalog_webserver_1
```

Attaching in this manner lets you see the output from both the Airflow webserver
@@ -169,7 +172,7 @@ If you'd like to bring down the containers, run
docker-compose down
```

-from the [`src/cc_catalog_airflow`][cc_airflow] directory.
+from the [`openverse_catalog`][cc_airflow] directory.

To reset the test DB (wiping out all databases, schemata, and tables), run

@@ -178,8 +181,8 @@ docker-compose down

```shell
rm -r /tmp/docker_postgres_data/
```
```

-[dockercompose]: src/cc_catalog_airflow/docker-compose.yml
-[cc_airflow]: src/cc_catalog_airflow/
+[dockercompose]: openverse_catalog/docker-compose.yml
+[cc_airflow]: openverse_catalog/

## PySpark development setup

@@ -203,15 +206,22 @@ python -m pytest tests/test_ExtractCCLinks.py

## Contributing

-Pull requests are welcome! Feel free to [join us on Slack](https://make.wordpress.org/chat/) and discuss the project with the engineers and community members on #openverse.
+Pull requests are welcome! Feel free to [join us on Slack][wp_slack] and discuss the
+project with the engineers and community members on #openverse.

## Acknowledgments

-Openverse, previously known as CC Search, was conceived and built at [Creative Commons](https://creativecommons.org). We thank them for their commitment to open source and openly licensed content, with particular thanks to original team members @kgodey, @annatuma, @mathemancer, @aldenstpage, @brenoferreira, and @sclachar, along with their [community of volunteers](https://opensource.creativecommons.org/community/community-team/).
+Openverse, previously known as CC Search, was conceived and built at
+[Creative Commons][cc]. We thank them for their commitment to open source and openly
+licensed content, with particular thanks to original team members @kgodey, @annatuma,
+@mathemancer, @aldenstpage, @brenoferreira, and @sclachar, along with their
+[community of volunteers][cc_community].

## License

- [`LICENSE`](LICENSE) (Expat/[MIT][mit] License)

[mit]: http://www.opensource.org/licenses/MIT "The MIT License | Open Source Initiative"
[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
[wp_slack]: https://make.wordpress.org/chat/
[cc]: https://creativecommons.org
[cc_community]: https://opensource.creativecommons.org/community/community-team/
13 files renamed without changes.
@@ -4,14 +4,10 @@
import unittest
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
from mock import patch, MagicMock
-from src.ExtractCCLinks import CCLinks
+from archive.ExtractCCLinks import CCLinks
import shutil
import os.path
import botocore
import json
from io import StringIO
import types

3 files renamed without changes.
@@ -2,9 +2,9 @@
import logging
from typing import Optional, Dict, Union

-from common.licenses.licenses import LicenseInfo
-from common.storage import columns
-from common.storage.media import MediaStore
+from common import LicenseInfo

logger = logging.getLogger(__name__)

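The shortened `from common import LicenseInfo` import above works because the simplified `common` package re-exports the name from its `__init__.py`. A throwaway sketch of that pattern, using a hypothetical `mypkg` rather than the real `common` package:

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package mimicking the simplified layout:
# mypkg/licenses/licenses.py defines LicenseInfo, and
# mypkg/__init__.py re-exports it for a short import path.
root = Path(tempfile.mkdtemp())
pkg = root / "mypkg"
(pkg / "licenses").mkdir(parents=True)
(pkg / "__init__.py").write_text(
    "from mypkg.licenses.licenses import LicenseInfo\n"
)
(pkg / "licenses" / "__init__.py").write_text("")
(pkg / "licenses" / "licenses.py").write_text(
    "class LicenseInfo:\n    pass\n"
)
sys.path.insert(0, str(root))

# Callers now use the short path instead of the deep module path.
from mypkg import LicenseInfo
```

Re-exporting in `__init__.py` keeps deep module paths as an internal detail, so files can be moved around without touching every import site.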
