iNaturalist in-SQL loading (#745)
* cleaning and temp table in pg

* sketch of full dag NOT TESTED

* inaturalist dag without tests or reporting (yet)

* complete dag, 25 mill recs in 5.5 hours local test

* Add passwords for s3 testing with new docker

* make temp loading table UNLOGGED to load it faster

* inat with translation 75 million recs in 8 hrs

* using OUTPUT_DIR for API files

* clarify delayed requester vs requester

* DRYer approach to tags TO DO

* comments on taxa transformation

* scientific names not ids for manual translation

* TO DO comment clean-up

* fix name insert syntax

* Merge 'main' into feature/inaturalist-performance

* add clarity on batch limit override

* missing piece of merge from main

* limit to 20 tags per photo

* add option to use alternate dag creation for sql

* adjust tests see issue #898

* slightly faster way to pull medium test sample

* Note another data source for vernacular names

* remove unnecessary test code

* clean and upsert one batch at a time

* log parsing resource doc

* use common.constants.IMAGE instead of MEDIA_TYPE

* add explanation of ancestry joins and taxa tags

* use existing clean_intermediate_table_data

* remove unnecessary env vars from load_to_s3

* declarative doc string for file update check

* update iNaturalist description

* remove message to Staci :)

* use dynamically generated load subtasks

* clarify taxa comments and include languages

* consolidate consolidation code

* add testing for consolidated metrics

* separate ti_mock instances per test

* test get batches

* shorter titles to save space

* add better testing instructions

* dag parameter to manage post-ingestion deletions

* Add kwargs to get_response_json call

* get_media_type can be static method

Co-authored-by: Krystle Salazar <[email protected]>

* link to original inaturalist photo, rather than medium

Co-authored-by: Krystle Salazar <[email protected]>

* prefer creator name over login

* remove unused constants

* add to do for extension cleanup

Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Krystle Salazar <[email protected]>
3 people authored Jan 13, 2023
1 parent e1edcba commit 2a0a1c3
Showing 14 changed files with 1,112 additions and 282 deletions.
29 changes: 13 additions & 16 deletions DAGs.md
@@ -301,22 +301,19 @@ and related PRs:

Provider: iNaturalist

-Output: TSV file containing the media metadata.
-
-Notes: [The iNaturalist API is not intended for data scraping.]
-(https://api.inaturalist.org/v1/docs/)
-
-[But there is a full dump intended for sharing on S3.]
-(https://github.com/inaturalist/inaturalist-open-data/tree/documentation/Metadata)
-
-Because these are very large normalized tables, as opposed to more document
-oriented API responses, we found that bringing the data into postgres first
-was the most effective approach. [More detail in slack here.]
-(https://wordpress.slack.com/archives/C02012JB00N/p1653145643080479?thread_ts=1653082292.714469&cid=C02012JB00N)
-
-We use the table structure defined [here,]
-(https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql)
-except for adding ancestry tags to the taxa table.
+Output: Records loaded to the image catalog table.
+
+Notes: The iNaturalist API is not intended for data scraping.
+https://api.inaturalist.org/v1/docs/ But there is a full dump intended for
+sharing on S3.
+https://github.com/inaturalist/inaturalist-open-data/tree/documentation/Metadata
+Because these are very large normalized tables, as opposed to more document
+oriented API responses, we found that bringing the data into postgres first was
+the most effective approach. More detail in slack here:
+https://wordpress.slack.com/archives/C02012JB00N/p1653145643080479?thread_ts=1653082292.714469&cid=C02012JB00N
+We use the table structure defined here,
+https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql
+except for adding ancestry tags to the taxa table.

## `jamendo_workflow`

3 changes: 2 additions & 1 deletion openverse_catalog/dags/common/loader/sql.py
@@ -71,7 +71,7 @@ def create_loading_table(
columns_definition = f"{create_column_definitions(loading_table_columns)}"
table_creation_query = dedent(
f"""
-CREATE TABLE public.{load_table}(
+CREATE UNLOGGED TABLE public.{load_table}(
{columns_definition});
"""
)
@@ -96,6 +96,7 @@ def create_index(column, btree_column=None):
create_index(col.PROVIDER.db_name, None)
create_index(col.FOREIGN_ID.db_name, "provider")
create_index(col.DIRECT_URL.db_name, "provider")
+return load_table


def load_local_data_to_intermediate_table(
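
The `UNLOGGED` change in the first hunk above can be illustrated in isolation. PostgreSQL does not write data in unlogged tables to the write-ahead log, which speeds up bulk loading at the cost of crash safety, a reasonable trade for a temporary loading table that can be rebuilt. The sketch below is not the code from `sql.py`: the function name `build_loading_table_query`, its `unlogged` flag, and the sample table and column names are hypothetical and only show the shape of the statement.

```python
def build_loading_table_query(load_table, column_definitions, unlogged=True):
    """Build a CREATE TABLE statement for a temporary loading table.

    Hypothetical helper for illustration only. With unlogged=True the
    table skips the write-ahead log, so inserts are faster but the
    contents are lost if the server crashes.
    """
    table_kind = "UNLOGGED TABLE" if unlogged else "TABLE"
    # One column definition per line, comma separated.
    columns = ",\n  ".join(column_definitions)
    return f"CREATE {table_kind} public.{load_table}(\n  {columns});"


# Sample table and column names are assumptions, not taken from the commit.
query = build_loading_table_query(
    "loading_inaturalist",
    ["foreign_identifier text", "url text"],
)
print(query)
```

Because an unlogged table is truncated after a crash, this pattern only suits intermediate data that the DAG can reload from its source files.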