Spike: Prove Fast API with DuckDB and Parquet on S3 #102

Ben-Hodgkiss · 2024-10-18T13:47:34Z

Overview
Following the design proposal for an internal API, we would like to prove some of the technology choices, in particular the use of FastAPI with DuckDB to access Parquet files on S3.

This work was identified during the spike on API design.

Tech Approach

Create pipeline-internal-api repository
Use Localstack S3 with Docker Compose to run a FastAPI container locally
Produce test data in Parquet based on existing log and issue data
Push changes to pipeline-internal-api repo
The work has been done so that issue logs are now produced as Parquet files in S3. You can use real data if required. Depending on the approach, you might be able to complete this ticket as part of this spike.
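For anyone picking this up, here is a rough sketch of how DuckDB could be pointed at Localstack rather than real S3 when running under Docker Compose. It only builds the `SET` statements for DuckDB's S3 settings (`s3_endpoint`, `s3_url_style`, `s3_use_ssl`, `s3_region`, `s3_access_key_id`, `s3_secret_access_key`). The endpoint `localhost:4566`, the region and the dummy `test` credentials are assumptions for a default Localstack setup, not values taken from the repo.

```python
from typing import List


def localstack_s3_settings(
    endpoint: str = "localhost:4566",  # assumed default Localstack edge port
    region: str = "us-east-1",         # placeholder region, adjust as needed
) -> List[str]:
    """Return DuckDB SET statements that direct S3 access at Localstack."""
    settings = {
        "s3_endpoint": endpoint,
        "s3_url_style": "path",        # path-style URLs: endpoint/bucket/key
        "s3_use_ssl": False,           # Localstack is plain HTTP by default
        "s3_region": region,
        "s3_access_key_id": "test",    # dummy credentials (assumption)
        "s3_secret_access_key": "test",
    }
    statements = []
    for name, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            # Quote string values, doubling any embedded single quotes.
            rendered = "'" + str(value).replace("'", "''") + "'"
        statements.append(f"SET {name} = {rendered};")
    return statements
```

Each statement would be run on the DuckDB connection before the first `s3://` read. Inside a Compose network the endpoint would more likely be the Localstack service name (for example `localstack:4566`, a hypothetical service name) than `localhost`.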

Acceptance Criteria/Tests

Example code pushed to pipeline-internal-api repo
Repo contains README explaining how to run the example code

Ticket Management - DELETE this section once completed

Complete all relevant tags - make sure Infrastructure is tagged so it is picked up by our filters!
Complete the time estimate field
Make sure you have a PR link in the Overview above.
If relevant, link to the relevant OKR as an attachment.
Link to any tickets in other boards that are dependent on it.

cpcundill · 2024-11-12T07:51:41Z

Code along with README has been pushed to the new pipeline-internal-api repository: https://github.com/digital-land/pipeline-internal-api

cpcundill · 2024-11-12T07:54:01Z

The work completed in this spike contributes to the implementation required for #106. However, it does not cover the testing and deployment into AWS that #106 will also require.

cpcundill · 2024-11-13T06:22:46Z

Reviewed with the team and made one change:

Switched over from explicit path manipulation for dataset and resource to relying upon the automatic hive_partitioning inference built into DuckDB. Dataset and resource are now just WHERE clause parameters, like the other query parameters.
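To illustrate the change, here is a minimal sketch of the idea rather than the code in the repo. It assumes the Parquet files sit under hive-style directories such as `dataset=<name>/resource=<hash>/`, so that DuckDB exposes `dataset` and `resource` as ordinary columns. The function name, the allow-list, the `issue_type` filter and the bucket path in the note below are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical allow-list; "issue_type" stands in for the other query parameters.
ALLOWED_FILTERS = {"dataset", "resource", "issue_type"}


def build_issue_query(
    parquet_glob: str, filters: Dict[str, Optional[str]]
) -> Tuple[str, List[str]]:
    """Build a DuckDB query in which partition keys are plain WHERE filters."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"Unsupported filter(s): {sorted(unknown)}")

    # The glob comes from configuration, not user input; escape quotes anyway.
    escaped_glob = parquet_glob.replace("'", "''")
    sql = f"SELECT * FROM read_parquet('{escaped_glob}', hive_partitioning = true)"

    clauses: List[str] = []
    params: List[str] = []
    for column, value in filters.items():
        if value is None:
            continue  # parameter not supplied by the caller, so no filter
        clauses.append(f"{column} = ?")  # value is bound, never interpolated
        params.append(value)

    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

The result could then be passed to something like `duckdb.connect().execute(sql, params)`, with a glob such as `s3://example-bucket/issues/*/*/*.parquet` (a made-up path). `hive_partitioning = true` is spelled out here for clarity; relying on DuckDB's automatic detection, as described above, should behave the same for this layout. Because `dataset` and `resource` are partition keys, DuckDB can skip files in partitions that do not match the filter instead of reading them.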

Ben-Hodgkiss added this to Infrastructure Oct 18, 2024

Ben-Hodgkiss converted this from a draft issue Oct 18, 2024

Ben-Hodgkiss moved this from Refine, Prioritise & Plan to Sprint Backlog in Infrastructure Nov 6, 2024

cpcundill self-assigned this Nov 6, 2024

Ben-Hodgkiss mentioned this issue Nov 12, 2024

Expose issues in Parquet format via datasette #105

Open

cpcundill closed this as completed by moving to Done - Consider for Weeknotes in Infrastructure Nov 13, 2024

Ben-Hodgkiss moved this from Done - Consider for Weeknotes to Done - This Period in Infrastructure Nov 13, 2024
