From 39dce73c94afe488cc3c3e7027baa3b407748281 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 3 Dec 2024 12:02:31 -0900
Subject: [PATCH 1/3] Add parquet download link to data dictionary

---
 docs/templates/resource.rst.jinja | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/templates/resource.rst.jinja b/docs/templates/resource.rst.jinja
index 4eafd0b2fc..556054e53b 100644
--- a/docs/templates/resource.rst.jinja
+++ b/docs/templates/resource.rst.jinja
@@ -13,11 +13,14 @@
 **This table has no primary key.**
 {%- endif %}
 
+**Access methods:**
+
 {% if resource.create_database_schema -%}
-`Browse or query this table in Datasette. `__
+* `Browse or query this table in Datasette. `__
 {% else -%}
-This table is not published to Datasette.
+* This table is not published to Datasette.
 {%- endif %}
+* `Download this table as a Parquet file. `__
 
 .. list-table::
    :widths: auto

From a56b721250d4977af1e8b30846fe90a53658d616 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 11 Dec 2024 14:43:19 -0900
Subject: [PATCH 2/3] Add parquet file access method to data access page

---
 docs/data_access.rst              | 35 +++++++++++--------------------
 docs/templates/resource.rst.jinja |  2 +-
 2 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/docs/data_access.rst b/docs/data_access.rst
index 8a6ac4507e..7f67055673 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -8,10 +8,12 @@ PUDL data, so if you have a suggestion, please `open a GitHub issue
 `__. If you have a question, you
 can `create a GitHub discussion `__.
 
-PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
-tables with the ``out_`` prefix, as these tables contain the most complete and easiest
-to work with data. For more information about the different types
-of tables, read through :ref:`PUDL's naming conventions `.
+PUDL's primary data output is the ``pudl.sqlite`` database. All the tables are also
+distributed as individual Parquet files which are more space efficient, have richer
+data types and are better suited for distributed and large-scale data analysis.
+We recommend working with tables with the ``out_`` prefix, as these tables contain
+the most complete and easiest to work with data. For more information about the
+different types of tables, read through :ref:`PUDL's naming conventions `.
 
 .. _access-modes:
 
@@ -106,8 +108,12 @@ resulting outputs pass all of the data validation tests we've defined, the outpu
 automatically uploaded to the `AWS Open Data Registry
 `__, and used to deploy a new version
 of Datasette (see above). These nightly build outputs can be accessed using the
-AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
-HTTPS using the following links:
+AWS CLI, or programmatically via the S3 API. If you don't want to mess with the API
+or CLI, you can also download the data directly over HTTPS. The download links for
+each table's Parquet file can be found in
+the :doc:`PUDL data dictionary page `.
+
+These are the download links for the PUDL and raw FERC SQLite databases:
 
 Fully Processed SQLite Databases
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -115,23 +121,6 @@ Fully Processed SQLite Databases
 * `Main PUDL Database `__
 * `US Census DP1 Database (2010) `__
 
-Hourly Tables as Parquet
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-Hourly time series take up a lot of space in SQLite and can be slow to query in bulk,
-so we have moved to publishing all our hourly tables using the compressed, columnar
-`Apache Parquet `__ file format.
-
-* `EIA-930 BA Hourly Interchange `__
-* `EIA-930 BA Hourly Net Generation by Energy Source `__
-* `EIA-930 BA Hourly Operations `__
-* `EIA-930 BA Hourly Subregion Demand `__
-* `EPA CEMS Hourly Emissions `__
-* `FERC-714 Hourly Estimated State Demand `__
-* `FERC-714 Hourly Planning Area Demand `__
-* `GridPath RA Toolkit Hourly Available Capacity Factors `__
-* `VCE Resource Adequacy Renewable Energy (RARE) Dataset `__
-
 Raw FERC DBF & XBRL data converted to SQLite
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/docs/templates/resource.rst.jinja b/docs/templates/resource.rst.jinja
index 556054e53b..fd1b45bfe6 100644
--- a/docs/templates/resource.rst.jinja
+++ b/docs/templates/resource.rst.jinja
@@ -20,7 +20,7 @@
 {% else -%}
 * This table is not published to Datasette.
 {%- endif %}
-* `Download this table as a Parquet file. `__
+* `Download this table as a Parquet file. `__
 
 .. list-table::
    :widths: auto

From b20785299864c17a388ced3b103e69ff5398ffe6 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 17 Dec 2024 15:05:05 -0900
Subject: [PATCH 3/3] Make small parquet description changes and add hourly table descriptions back to data access page

---
 docs/data_access.rst | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/docs/data_access.rst b/docs/data_access.rst
index 7f67055673..9a77efc525 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -9,7 +9,8 @@ PUDL data, so if you have a suggestion, please `open a GitHub issue
 can `create a GitHub discussion `__.
 
 PUDL's primary data output is the ``pudl.sqlite`` database. All the tables are also
-distributed as individual Parquet files which are more space efficient, have richer
+distributed as individual `Apache Parquet `__ files
+which are more space efficient, have richer
 data types and are better suited for distributed and large-scale data analysis.
 We recommend working with tables with the ``out_`` prefix, as these tables contain
 the most complete and easiest to work with data. For more information about the
@@ -108,7 +109,9 @@ resulting outputs pass all of the data validation tests we've defined, the outpu
 automatically uploaded to the `AWS Open Data Registry
 `__, and used to deploy a new version
 of Datasette (see above). These nightly build outputs can be accessed using the
-AWS CLI, or programmatically via the S3 API. If you don't want to mess with the API
+AWS CLI, or programmatically via the S3 API.
+
+If you don't want to mess with the API
 or CLI, you can also download the data directly over HTTPS. The download links for
 each table's Parquet file can be found in
 the :doc:`PUDL data dictionary page `.
@@ -121,6 +124,22 @@ Fully Processed SQLite Databases
 * `Main PUDL Database `__
 * `US Census DP1 Database (2010) `__
 
+Hourly Tables as Parquet
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Hourly time series take up a lot of space in SQLite and can be slow to query in bulk,
+so all our hourly tables are only distributed as Parquet files:
+
+* `EIA-930 BA Hourly Interchange `__
+* `EIA-930 BA Hourly Net Generation by Energy Source `__
+* `EIA-930 BA Hourly Operations `__
+* `EIA-930 BA Hourly Subregion Demand `__
+* `EPA CEMS Hourly Emissions `__
+* `FERC-714 Hourly Estimated State Demand `__
+* `FERC-714 Hourly Planning Area Demand `__
+* `GridPath RA Toolkit Hourly Available Capacity Factors `__
+* `VCE Resource Adequacy Renewable Energy (RARE) Dataset `__
+
 Raw FERC DBF & XBRL data converted to SQLite
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^