Spike: Prove Fast API with DuckDB and Parquet on S3 #102

Closed
Ben-Hodgkiss opened this issue Oct 18, 2024 · 3 comments

Ben-Hodgkiss commented Oct 18, 2024

Overview
Following the design proposal for an internal API, we would like to prove some of the technology choices, including the use of FastAPI with DuckDB to query Parquet files on S3.

This work was identified during the spike on API design.

Tech Approach

  • Create pipeline-internal-api repository
  • Use Localstack S3 with Docker Compose to run a FastAPI container locally
  • Produce test data in Parquet based on existing log and issue data
  • Push changes to pipeline-internal-api repo
  • The work has been done so that issue logs are now produced as Parquet files in S3. You can use real data if required. Depending on the approach, you might be able to complete this ticket as part of this spike.
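
For the Localstack step, a minimal sketch of how DuckDB might be pointed at a local S3 endpoint is shown below. It is an illustration under assumptions, not the code in the repository: the helper names are hypothetical, the endpoint assumes Localstack's default port 4566, and the dummy `test` credentials and region are placeholders.

```python
from __future__ import annotations


def localstack_s3_settings(endpoint: str = "localhost:4566") -> list[str]:
    """Return DuckDB statements that direct S3 access to a Localstack endpoint.

    The endpoint, credentials and region are assumptions for local use only.
    """
    return [
        "INSTALL httpfs",                      # extension that provides s3:// access
        "LOAD httpfs",
        f"SET s3_endpoint = '{endpoint}'",     # host:port, without a scheme
        "SET s3_url_style = 'path'",           # path-style URLs instead of bucket subdomains
        "SET s3_use_ssl = false",              # plain HTTP for the local endpoint
        "SET s3_access_key_id = 'test'",       # dummy credentials (assumption)
        "SET s3_secret_access_key = 'test'",
        "SET s3_region = 'us-east-1'",         # placeholder region (assumption)
    ]


def connect_to_localstack(endpoint: str = "localhost:4566"):
    """Open an in-memory DuckDB connection configured for Localstack S3."""
    import duckdb  # imported lazily so the helper above has no dependencies

    con = duckdb.connect()
    for statement in localstack_s3_settings(endpoint):
        con.execute(statement)
    return con
```

Inside Docker Compose the endpoint would more likely be the Localstack service name, for example `connect_to_localstack("localstack:4566")`, where `localstack` is an assumed service name.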

Acceptance Criteria/Tests

  • Example code pushed to pipeline-internal-api repo
  • Repo contains README explaining how to run the example code

Ticket Management - DELETE this section once completed

  • Complete all relevant tags - make sure Infrastructure is tagged so it is picked up by our filters!
  • Complete the time estimate field
  • Make sure you have a PR link in the Overview above.
  • If relevant, link to the relevant OKR as an attachment.
  • Link to any tickets in other boards that are dependent on it.
@Ben-Hodgkiss Ben-Hodgkiss converted this from a draft issue Oct 18, 2024
@Ben-Hodgkiss Ben-Hodgkiss moved this from Refine, Prioritise & Plan to Sprint Backlog in Infrastructure Nov 6, 2024
@cpcundill cpcundill self-assigned this Nov 6, 2024

cpcundill commented Nov 12, 2024

Code along with README has been pushed to the new pipeline-internal-api repository: https://github.com/digital-land/pipeline-internal-api

@cpcundill

The work completed in this spike certainly contributed to the implementation required for #106. However, it does not cover the testing and deployment into AWS that #106 will also require.

@cpcundill

Reviewed with the team and made one change:

Switched from explicit path manipulation for dataset and resource to relying on the automatic hive_partitioning inference built into DuckDB. Dataset and resource are now just WHERE clause parameters, like the other query parameters.
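
A rough sketch of what this change could look like is below. It is not the repository's actual code. It assumes the Parquet files sit under a hive-style layout such as `dataset=<name>/resource=<id>/`, so that DuckDB exposes `dataset` and `resource` as ordinary columns. The function names, the `field` filter and the bucket path in the usage note are hypothetical.

```python
from __future__ import annotations

# Columns that may be filtered on. "dataset" and "resource" come from the
# hive partition directories; "field" is a hypothetical extra query parameter.
ALLOWED_COLUMNS = ("dataset", "resource", "field")


def build_issue_query(parquet_glob: str, filters: dict) -> tuple[str, list]:
    """Build a parameterised DuckDB query over hive-partitioned Parquet files.

    parquet_glob should come from configuration, not from user input,
    because it is interpolated into the SQL text.
    """
    clauses = []
    params = []
    for column, value in filters.items():
        if value is None:
            continue  # filter not supplied by the caller
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported filter: {column}")
        clauses.append(f"{column} = ?")  # the value is bound, never interpolated
        params.append(value)

    sql = f"SELECT * FROM read_parquet('{parquet_glob}', hive_partitioning = true)"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def run_issue_query(con, parquet_glob: str, filters: dict) -> list:
    """Execute the query on an existing DuckDB connection and return all rows."""
    sql, params = build_issue_query(parquet_glob, filters)
    return con.execute(sql, params).fetchall()
```

With this shape, a call such as `build_issue_query("s3://example-bucket/issue/**/*.parquet", {"dataset": "some-dataset", "resource": None})` yields a single `WHERE dataset = ?` clause, and DuckDB can use the partition directories to skip files that do not match.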

@cpcundill cpcundill closed this as completed by moving to Done - Consider for Weeknotes in Infrastructure Nov 13, 2024
@Ben-Hodgkiss Ben-Hodgkiss moved this from Done - Consider for Weeknotes to Done - This Period in Infrastructure Nov 13, 2024