Add parquet download link to data dictionary #3984

Merged 5 commits on Dec 18, 2024
24 changes: 16 additions & 8 deletions docs/data_access.rst
@@ -8,10 +8,13 @@ PUDL data, so if you have a suggestion, please `open a GitHub issue
<https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question, you
can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.

-PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
-tables with the ``out_`` prefix, as these tables contain the most complete and easiest
-to work with data. For more information about the different types
-of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.
+PUDL's primary data output is the ``pudl.sqlite`` database. All the tables are also
+distributed as individual `Apache Parquet <https://parquet.apache.org/docs/>`__ files
+which are more space efficient, have richer
+data types and are better suited for distributed and large-scale data analysis.
+We recommend working with tables with the ``out_`` prefix, as these tables contain
+the most complete and easiest to work with data. For more information about the
+different types of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.

.. _access-modes:

@@ -106,8 +109,14 @@ resulting outputs pass all of the data validation tests we've defined, the outpu
automatically uploaded to the `AWS Open Data Registry
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__, and used to deploy a new
version of Datasette (see above). These nightly build outputs can be accessed using the
-AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
-HTTPS using the following links:
+AWS CLI, or programmatically via the S3 API.
+
+If you don't want to mess with the API
+or CLI, you can also download the data directly over HTTPS. The download links for
+each table's Parquet file can be found in
+the :doc:`PUDL data dictionary page </data_dictionaries/pudl_db>`.
+
+These are the download links for the PUDL and raw FERC SQLite databases:

Fully Processed SQLite Databases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -119,8 +128,7 @@ Hourly Tables as Parquet
^^^^^^^^^^^^^^^^^^^^^^^^

Hourly time series take up a lot of space in SQLite and can be slow to query in bulk,
-so we have moved to publishing all our hourly tables using the compressed, columnar
-`Apache Parquet <https://parquet.apache.org/docs/>`__ file format.
+so all our hourly tables are only distributed as Parquet files:

* `EIA-930 BA Hourly Interchange <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_interchange.parquet>`__
* `EIA-930 BA Hourly Net Generation by Energy Source <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_net_generation_by_energy_source.parquet>`__
7 changes: 5 additions & 2 deletions docs/templates/resource.rst.jinja
@@ -13,11 +13,14 @@
**This table has no primary key.**
{%- endif %}

+**Access methods:**
+
{% if resource.create_database_schema -%}
-`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
+* `Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
{% else -%}
-This table is not published to Datasette.
+* This table is not published to Datasette.
{%- endif %}
+* `Download this table as a Parquet file. <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{{ resource.name }}.parquet>`__

.. list-table::
:widths: auto
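
For readers of this diff, here is a minimal sketch (not part of the PR) of how the per-table link added to the template maps a table name to a nightly Parquet URL. The helper names `nightly_parquet_url` and `read_table` are hypothetical. The base URL is copied from the template line above. Loading the file assumes pandas and a Parquet engine such as pyarrow are installed.

```python
"""Sketch: build the nightly Parquet URL for a PUDL table.

The helper names below are hypothetical and are not part of PUDL.
"""

# Base URL copied from the link added to docs/templates/resource.rst.jinja.
NIGHTLY_BASE_URL = "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly"


def nightly_parquet_url(table_name: str) -> str:
    """Return the HTTPS download URL for one table's nightly Parquet file."""
    if not table_name or "/" in table_name:
        raise ValueError(f"not a plain table name: {table_name!r}")
    return f"{NIGHTLY_BASE_URL}/{table_name}.parquet"


def read_table(table_name: str, columns=None):
    """Load a table into a DataFrame (assumes pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    return pd.read_parquet(nightly_parquet_url(table_name), columns=columns)


if __name__ == "__main__":
    # Prints the same URL the data_access.rst hunk lists for this table.
    print(nightly_parquet_url("core_eia930__hourly_interchange"))
```

Passing `columns=[...]` to `read_table` limits which columns are loaded, which may matter for the large hourly tables.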