This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Rawpixel to use ProviderDataIngester (#795)
* Initial refactor for Rawpixel

* Format JSON

* Move everything into class

* Simplify ID get, add signature w/ notes (it is incomplete)

* Add logic for injecting API signature

* Add NO_LICENSE_FOUND constant

* Improve license capture

* Add function for retrieving direct URL

* Add logic to remove fluff text from title

* Fill out more fields

* Add popularity data

* Add ingestion callable to workflows list

* Update data based on new API response

* Use style_uri to determine image URL

* Remove get_response_json override, add docs

* Update tests

* Update DAGs.md

* Add Rawpixel key to env.template & sort values

* Add thumbnail capture into meta_data

* Remove unnecessary logging line

Co-authored-by: Krystle Salazar <[email protected]>

* Comment out Airflow Variable API key examples

* Add reference to image popularity metrics calculation

* Remove thumbnail extraction for now

* Fine-tune string cleaning a bit more

* More regex fine-tuning

* Revert "Comment out Airflow Variable API key examples"

This reverts commit f2b6a3a.

* Fix tests

* Update documentation

* Rename _get_source to _get_creator

Co-authored-by: Krystle Salazar <[email protected]>
AetherUnbound and krysal authored Oct 27, 2022
1 parent a61bba4 commit 4d01826
Showing 11 changed files with 928 additions and 436 deletions.
20 changes: 19 additions & 1 deletion DAGs.md
@@ -86,7 +86,7 @@ The following are DAGs grouped by their primary tag:
| `museum_victoria_workflow` | `@monthly` | `False` | image |
| `nypl_workflow` | `@monthly` | `False` | image |
| [`phylopic_workflow`](#phylopic_workflow) | `@daily` | `True` | image |
-| `rawpixel_workflow` | `@monthly` | `False` | image |
+| [`rawpixel_workflow`](#rawpixel_workflow) | `@monthly` | `False` | image |
| `science_museum_workflow` | `@monthly` | `False` | image |
| [`smithsonian_workflow`](#smithsonian_workflow) | `@weekly` | `False` | image |
| `smk_workflow` | `@monthly` | `False` | image |
@@ -124,6 +124,7 @@ The following is documentation associated with each DAG (where available):
1. [`oauth2_token_refresh`](#oauth2_token_refresh)
1. [`phylopic_workflow`](#phylopic_workflow)
1. [`pr_review_reminders`](#pr_review_reminders)
+1. [`rawpixel_workflow`](#rawpixel_workflow)
1. [`recreate_audio_popularity_calculation`](#recreate_audio_popularity_calculation)
1. [`recreate_image_popularity_calculation`](#recreate_image_popularity_calculation)
1. [`report_pending_reported_media`](#report_pending_reported_media)
@@ -437,6 +438,23 @@ author of the PR to re-assign review if one of the randomly selected reviewers
is unavailable for the time period during which the PR should be reviewed.


+## `rawpixel_workflow`
+
+
+Content Provider: Rawpixel
+
+ETL Process: Use the API to identify all CC-licensed images.
+
+Output: TSV file containing the image meta-data.
+
+Notes: Rawpixel has given Openverse beta access to their API.
+This API is undocumented, and we will need to contact Rawpixel
+directly if we run into any issues.
+The public API max results range is limited to 100,000 results,
+although the API key we've been given can circumvent this limit.
+https://www.rawpixel.com/api/v1/search?tags=$publicdomain&page=1&pagesize=100
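
The example URL in the notes above implies a page-based search request. The following is a minimal sketch of how such a request URL could be assembled, not the ingester's actual code: the function names `build_query_params` and `build_search_url` are hypothetical, and the API signature that this commit injects into requests is deliberately left out because its details are not shown here.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the example URL in the notes.
SEARCH_ENDPOINT = "https://www.rawpixel.com/api/v1/search"


def build_query_params(page: int, pagesize: int = 100, tags: str = "$publicdomain") -> dict:
    """Return the query parameters for one page of search results (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"tags": tags, "page": page, "pagesize": pagesize}


def build_search_url(page: int) -> str:
    """Build the full search URL for a page (hypothetical helper)."""
    # safe="$" keeps the literal "$" in the tag value, matching the URL in the notes.
    query = urlencode(build_query_params(page), safe="$")
    return f"{SEARCH_ENDPOINT}?{query}"
```

Calling `build_search_url(1)` reproduces the URL quoted in the notes; later pages only change the `page` value.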


## `recreate_audio_popularity_calculation`


7 changes: 6 additions & 1 deletion docker/local_postgres/0004_openledger_image_view.sql
@@ -5,12 +5,17 @@ CREATE TABLE public.image_popularity_metrics (
);


+-- For more information on these values see:
+-- https://github.com/cc-archive/cccatalog/issues/405#issuecomment-629233047
+-- https://github.com/cc-archive/cccatalog/pull/477
INSERT INTO public.image_popularity_metrics (
provider, metric, percentile
) VALUES
('flickr', 'views', 0.85),
('wikimedia', 'global_usage_count', 0.85),
-('stocksnap', 'downloads_raw', 0.85);
+('stocksnap', 'downloads_raw', 0.85),
+('rawpixel', 'download_count', 0.85)
+;


CREATE FUNCTION image_popularity_percentile(
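
The new `('rawpixel', 'download_count', 0.85)` row pairs a raw metric with a percentile. The sketch below illustrates one way such a row could be turned into a normalized popularity score. It rests on assumptions that this diff does not confirm: a discrete percentile in the style of PostgreSQL's `percentile_disc`, and the formula `raw / (raw + constant)` with the constant chosen so that a record sitting exactly at the percentile scores 0.85. All function names are hypothetical.

```python
import math


def percentile_disc(values, fraction):
    """Smallest value whose position in the sorted list reaches `fraction` (assumed behavior)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


def popularity_constant(values, percentile=0.85):
    # Assumed formula: pick the constant so that the value at the
    # percentile maps to a score equal to the percentile itself.
    return ((1 - percentile) / percentile) * percentile_disc(values, percentile)


def standardized_popularity(raw, constant):
    # Assumed formula: maps a non-negative raw count into [0, 1).
    return raw / (raw + constant)
```

With ten hypothetical download counts `10, 20, ..., 100`, the value at the 0.85 percentile is 90, and a record with 90 downloads scores approximately 0.85 under these assumptions.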
4 changes: 3 additions & 1 deletion env.template
@@ -28,10 +28,12 @@ AIRFLOW_VAR_API_KEY_BROOKLYN_MUSEUM=not_set
AIRFLOW_VAR_API_KEY_DATA_GOV=not_set
AIRFLOW_VAR_API_KEY_EUROPEANA=not_set
AIRFLOW_VAR_API_KEY_FLICKR=not_set
+AIRFLOW_VAR_API_KEY_FREESOUND=not_set
AIRFLOW_VAR_API_KEY_JAMENDO=not_set
AIRFLOW_VAR_API_KEY_NYPL=not_set
+AIRFLOW_VAR_API_KEY_RAWPIXEL=not_set
AIRFLOW_VAR_API_KEY_THINGIVERSE=not_set
-AIRFLOW_VAR_API_KEY_FREESOUND=not_set
+AIRFLOW_VAR_API_KEY_WALTERS_ART_MUSEUM=not_set


########################################################################################
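
Airflow resolves a Variable named `API_KEY_RAWPIXEL` from the environment variable `AIRFLOW_VAR_API_KEY_RAWPIXEL`, which is why the template entries carry that prefix. The sketch below mimics that lookup with plain `os.environ` so it runs without Airflow installed. The helper names are hypothetical, and treating `not_set` as "unconfigured" is an assumption based on the placeholder values in this template.

```python
import os

AIRFLOW_VAR_PREFIX = "AIRFLOW_VAR_"


def get_variable_from_env(key, default=None, environ=None):
    """Look up a variable the way the AIRFLOW_VAR_ naming convention implies (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return environ.get(f"{AIRFLOW_VAR_PREFIX}{key.upper()}", default)


def is_configured(value):
    # Assumption: "not_set" is the template placeholder for a missing key.
    return value not in (None, "", "not_set")
```

Inside a DAG the equivalent call would be `Variable.get("API_KEY_RAWPIXEL")` from `airflow.models`; the helper above only exists to show the name mapping.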
3 changes: 3 additions & 0 deletions openverse_catalog/dags/common/licenses/__init__.py
@@ -4,3 +4,6 @@
get_license_info_from_license_pair,
is_valid_license_info,
)


+NO_LICENSE_FOUND = LicenseInfo(None, None, None, None)
Original file line number Diff line number Diff line change
@@ -31,7 +31,7 @@
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.task_group import TaskGroup
from common.constants import POSTGRES_CONN_ID
-from common.licenses import LicenseInfo, get_license_info
+from common.licenses import NO_LICENSE_FOUND, get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester

@@ -87,7 +87,7 @@ def get_record_data(self, data):
             return None
         license_url = data.get("license_url")
         license_info = get_license_info(license_url=license_url)
-        if license_info == LicenseInfo(None, None, None, None):
+        if license_info == NO_LICENSE_FOUND:
             return None
         record_data = {k: data[k] for k in data.keys() if k != "license_url"}
         record_data["license_info"] = license_info
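
The hunk above swaps an inline `LicenseInfo(None, None, None, None)` for the shared `NO_LICENSE_FOUND` constant. The sketch below shows why the comparison still works, assuming `LicenseInfo` is a tuple-like type that compares by value. The field names and the stand-in `get_license_info` are hypothetical; the diff only shows that `LicenseInfo` takes four positional values.

```python
from typing import NamedTuple, Optional


class LicenseInfo(NamedTuple):
    # Field names are assumptions made for this illustration.
    license: Optional[str]
    version: Optional[str]
    url: Optional[str]
    raw_url: Optional[str]


# Sentinel meaning "no usable license could be derived".
NO_LICENSE_FOUND = LicenseInfo(None, None, None, None)


def get_license_info(license_url):
    """Hypothetical stand-in that recognises a single CC0 URL and nothing else."""
    if license_url == "https://creativecommons.org/publicdomain/zero/1.0/":
        return LicenseInfo("cc0", "1.0", license_url, license_url)
    return NO_LICENSE_FOUND


def get_record_data(data):
    license_info = get_license_info(license_url=data.get("license_url"))
    # Value equality: any all-None LicenseInfo equals the sentinel.
    if license_info == NO_LICENSE_FOUND:
        return None
    record_data = {k: v for k, v in data.items() if k != "license_url"}
    record_data["license_info"] = license_info
    return record_data
```

Naming the sentinel keeps every provider script comparing against the same value instead of repeating the four `None` arguments.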