Add parquet download link to data dictionary #3984
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,10 +8,12 @@ PUDL data, so if you have a suggestion, please `open a GitHub issue | |
<https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question, you | ||
can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__. | ||
|
||
PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with | ||
tables with the ``out_`` prefix, as these tables contain the most complete and easiest | ||
to work with data. For more information about the different types | ||
of tables, read through :ref:`PUDL's naming conventions <asset-naming>`. | ||
PUDL's primary data output is the ``pudl.sqlite`` database. All the tables are also | ||
distributed as individual Parquet files, which are more space efficient, have richer | ||
data types, and are better suited for distributed and large-scale data analysis. | ||
We recommend working with tables with the ``out_`` prefix, as these tables contain | ||
the most complete and easiest to work with data. For more information about the | ||
different types of tables, read through :ref:`PUDL's naming conventions <asset-naming>`. | ||
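As a rough illustration of the ``out_`` recommendation above, the sketch below lists the tables in a SQLite database whose names start with that prefix. It uses only the Python standard library. The function name `list_out_tables` is hypothetical, and checking both tables and views is an assumption about how the database is laid out, not something this page states.

```python
import sqlite3


def list_out_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the sorted names of tables or views that start with ``out_``.

    Hypothetical helper: it only inspects SQLite's own catalog
    (``sqlite_master``) and makes no assumptions about PUDL's schema
    beyond the ``out_`` naming prefix described in the docs.
    """
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    # Filter in Python so that "_" is matched literally, not as a LIKE wildcard.
    return [name for (name,) in rows if name.startswith("out_")]
```

With a local, unzipped copy of the database, this could be called as `list_out_tables(sqlite3.connect("pudl.sqlite"))`; the file path is an assumption about where the download was saved.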
|
||
.. _access-modes: | ||
|
||
|
@@ -106,32 +108,19 @@ resulting outputs pass all of the data validation tests we've defined, the outpu | |
automatically uploaded to the `AWS Open Data Registry | ||
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__, and used to deploy a new | ||
version of Datasette (see above). These nightly build outputs can be accessed using the | ||
AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over | ||
HTTPS using the following links: | ||
AWS CLI, or programmatically via the S3 API. If you don't want to mess with the API | ||
or CLI, you can also download the data directly over HTTPS. The download links for | ||
each table's Parquet file can be found in | ||
the :doc:`PUDL data dictionary page </data_dictionaries/pudl_db>`. | ||
|
||
These are the download links for the PUDL and raw FERC SQLite databases: | ||
|
||
Fully Processed SQLite Databases | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
* `Main PUDL Database <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/pudl.sqlite.zip>`__ | ||
* `US Census DP1 Database (2010) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/censusdp1tract.sqlite.zip>`__ | ||
|
||
Hourly Tables as Parquet | ||
Review comment: I removed this section because now all tables are available as Parquet. Reply: This is true, but these tables aren't in SQLite and so I think shouting them out here is still helpful - if people are looking for them they won't be able to find them in the full DB. |
||
^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Hourly time series take up a lot of space in SQLite and can be slow to query in bulk, | ||
so we have moved to publishing all our hourly tables using the compressed, columnar | ||
`Apache Parquet <https://parquet.apache.org/docs/>`__ file format. | ||
|
||
* `EIA-930 BA Hourly Interchange <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_interchange.parquet>`__ | ||
* `EIA-930 BA Hourly Net Generation by Energy Source <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_net_generation_by_energy_source.parquet>`__ | ||
* `EIA-930 BA Hourly Operations <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet>`__ | ||
* `EIA-930 BA Hourly Subregion Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_subregion_demand.parquet>`__ | ||
* `EPA CEMS Hourly Emissions <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_epacems__hourly_emissions.parquet>`__ | ||
* `FERC-714 Hourly Estimated State Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_estimated_state_demand.parquet>`__ | ||
* `FERC-714 Hourly Planning Area Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_planning_area_demand.parquet>`__ | ||
* `GridPath RA Toolkit Hourly Available Capacity Factors <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_gridpathratoolkit__hourly_available_capacity_factor.parquet>`__ | ||
* `VCE Resource Adequacy Renewable Energy (RARE) Dataset <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_vcerare__hourly_available_capacity_factor.parquet>`__ | ||
|
||
Raw FERC DBF & XBRL data converted to SQLite | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
|