iNaturalist in-SQL loading (#745)

* cleaning and temp table in pg
* sketch of full dag NOT TESTED
* inaturalist dag without tests or reporting (yet)
* complete dag, 25 mill recs in 5.5 hours local test
* Add passwords for s3 testing with new docker
* make temp loading table UNLOGGED to load it faster
* inat with translation 75 million recs in 8 hrs
* using OUTPUT_DIR for API files
* clarify delayed requester vs requester
* DRYer approach to tags TO DO
* comments on taxa transformation
* scientific names not ids for manual translation
* TO DO comment clean-up
* fix name insert syntax
* Merge 'main' into feature/inaturalist-performance
* add clarity on batch limit override
* missing piece of merge from main
* limit to 20 tags per photo
* add option to use alternate dag creation for sql
* adjust tests see issue #898
* slightly faster way to pull medium test sample
* Note another data source for vernacular names
* remove unnecessary test code
* clean and upsert one batch at a time
* log parsing resource doc
* use common.constants.IMAGE instead of MEDIA_TYPE
* add explanation of ancestry joins and taxa tags
* use existing clean_intermediate_table_data
* remove unnecessary env vars from load_to_s3
* declarative doc string for file update check
* update iNaturalist description
* remove message to Staci :)
* use dynamically generated load subtasks
* clarify taxa comments and include languages
* consolidate consolidation code
* add testing for consolidated metrics
* separate ti_mock instances per test
* test get batches
* shorter titles to save space
* add better testing instructions
* dag parameter to manage post-ingestion deletions
* Add kwargs to get_response_json call
* get_media_type can be static method

Co-authored-by: Krystle Salazar <[email protected]>

* link to original inaturalist photo, rather than medium

Co-authored-by: Krystle Salazar <[email protected]>

* prefer creator name over login
* remove unused constants
* add to do for extension cleanup

Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Krystle Salazar <[email protected]>
WordPress · Jan 13, 2023 · 2a0a1c3
1 parent e1edcba
commit 2a0a1c3

Showing 14 changed files with 1,112 additions and 282 deletions.
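
The commit message above lists "make temp loading table UNLOGGED to load it faster". As a rough illustration of that idea, here is a minimal sketch of assembling such a statement. The helper name `build_loading_table_query`, its `unlogged` flag, and the `(name, sql_type)` column pairs are hypothetical and are not the project's API. An unlogged PostgreSQL table skips the write-ahead log, which tends to speed up bulk loads, but its contents are truncated after a crash and are not replicated, so it only suits data that can be reloaded.

```python
def build_loading_table_query(load_table, columns, unlogged=True):
    """Return a CREATE TABLE statement for a temporary loading table.

    Hypothetical helper for illustration only. `columns` is a list of
    (name, sql_type) pairs. Identifiers are interpolated directly, so the
    inputs must come from trusted code, never from user input.
    """
    columns_definition = ",\n".join(
        f"  {name} {sql_type}" for name, sql_type in columns
    )
    # UNLOGGED skips write-ahead logging for this table only.
    table_kind = "UNLOGGED TABLE" if unlogged else "TABLE"
    return f"CREATE {table_kind} public.{load_table}(\n{columns_definition}\n);"


if __name__ == "__main__":
    print(
        build_loading_table_query(
            "load_image_example",
            [("foreign_identifier", "text"), ("provider", "text")],
        )
    )
```

Keeping the `unlogged` flag as a parameter would make it easy to fall back to a regular table if the loading data ever needed to survive a restart.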
diff --git a/DAGs.md b/DAGs.md
@@ -301,22 +301,19 @@ and related PRs:
 
 Provider: iNaturalist
 
-Output: TSV file containing the media metadata.
-
-Notes: [The iNaturalist API is not intended for data scraping.]
-(https://api.inaturalist.org/v1/docs/)
-
-            [But there is a full dump intended for sharing on S3.]
-            (https://github.com/inaturalist/inaturalist-open-data/tree/documentation/Metadata)
-
-            Because these are very large normalized tables, as opposed to more document
-            oriented API responses, we found that bringing the data into postgres first
-            was the most effective approach. [More detail in slack here.]
-            (https://wordpress.slack.com/archives/C02012JB00N/p1653145643080479?thread_ts=1653082292.714469&cid=C02012JB00N)
-
-            We use the table structure defined [here,]
-            (https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql)
-            except for adding ancestry tags to the taxa table.
+Output: Records loaded to the image catalog table.
+
+Notes: The iNaturalist API is not intended for data scraping.
+https://api.inaturalist.org/v1/docs/ But there is a full dump intended for
+sharing on S3.
+https://github.com/inaturalist/inaturalist-open-data/tree/documentation/Metadata
+Because these are very large normalized tables, as opposed to more document
+oriented API responses, we found that bringing the data into postgres first was
+the most effective approach. More detail in slack here:
+https://wordpress.slack.com/archives/C02012JB00N/p1653145643080479?thread_ts=1653082292.714469&cid=C02012JB00N
+We use the table structure defined here,
+https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql
+except for adding ancestry tags to the taxa table.
 
 ## `jamendo_workflow`
 

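The DAGs.md notes above mention adding ancestry tags to the taxa table, and the commit message mentions a limit of 20 tags per photo. The sketch below shows one way such tags could be derived, under stated assumptions: that each taxon carries an `ancestry` string of ancestor ids separated by "/", that tags are the scientific names of those ancestors plus the taxon itself, and that the most specific names are kept when the limit is exceeded. The function name and the in-memory `taxa` mapping are hypothetical; the DAG itself works in SQL inside postgres, so this is an illustration of the idea rather than the project's implementation.

```python
MAX_TAGS = 20  # per-photo tag limit mentioned in the commit message


def ancestry_tags(taxon_id, taxa, max_tags=MAX_TAGS):
    """Return ancestor names followed by the taxon's own name.

    `taxa` maps taxon_id -> {"ancestry": "1/2/3" or None, "name": str}.
    The "/"-separated ancestry format is an assumption about the data.
    `max_tags` is expected to be a positive integer.
    """
    taxon = taxa[taxon_id]
    ancestor_ids = [
        int(part) for part in (taxon["ancestry"] or "").split("/") if part
    ]
    # Skip ancestors that are missing from the mapping instead of failing.
    names = [taxa[a]["name"] for a in ancestor_ids if a in taxa]
    names.append(taxon["name"])
    # Keep the most specific names when the list is too long.
    return names[-max_tags:]


if __name__ == "__main__":
    example_taxa = {
        1: {"ancestry": None, "name": "Animalia"},
        2: {"ancestry": "1", "name": "Chordata"},
        3: {"ancestry": "1/2", "name": "Aves"},
    }
    print(ancestry_tags(3, example_taxa))
```

Doing the equivalent join once on the taxa table, rather than per photo, is presumably what makes the in-database approach practical at this data size.
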
diff --git a/openverse_catalog/dags/common/loader/sql.py b/openverse_catalog/dags/common/loader/sql.py
@@ -71,7 +71,7 @@ def create_loading_table(
     columns_definition = f"{create_column_definitions(loading_table_columns)}"
     table_creation_query = dedent(
         f"""
-    CREATE TABLE public.{load_table}(
+    CREATE UNLOGGED TABLE public.{load_table}(
     {columns_definition});
     """
     )
@@ -96,6 +96,7 @@ def create_index(column, btree_column=None):
     create_index(col.PROVIDER.db_name, None)
     create_index(col.FOREIGN_ID.db_name, "provider")
     create_index(col.DIRECT_URL.db_name, "provider")
+    return load_table
 
 
 def load_local_data_to_intermediate_table(