Support for conditional column writes #3066

Open
panbamid-r opened this issue Jan 8, 2025 · 1 comment
@panbamid-r

Is your feature request related to a problem? Please describe.
I'm working on a project where I need to overwrite a specific subset of columns in an Iceberg table via Athena. The current implementation doesn't support this, and it isn't feasible to load all the data into memory and process it in Python just so that the entire subset can be overwritten.

Describe the solution you'd like
I would like the to_iceberg function in the awswrangler.athena module to support partial column overwrites. This could be an additional argument for the function.

Describe alternatives you've considered
The alternative would be to load everything into memory (as described above).

Additional context


GrumpyCat51 commented Jan 8, 2025

Here is a minimal example of what we are trying to do:

We have an Iceberg table structured like this:

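| id | label | ...(several other columns) |
|----|-------|----------------------------|
| 1 | 0 | ... |
| 2 | 0 | ... |
| 3 | 0 | ... |
| 4 | 0 | ... |
| 5 | 0 | ... |
| 6 | 0 | ... |
| 7 | 0 | ... |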

Then we calculate updated labels for a subset of the rows, e.g.

| id | label |
|----|-------|
| 3 | 1 |
| 4 | 1 |

Now we want to update the original table with these values, without first having to download all the additional columns, so that we end up with:

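| id | label | ...(several other columns) |
|----|-------|----------------------------|
| 1 | 0 | ... |
| 2 | 0 | ... |
| 3 | 1 | ... |
| 4 | 1 | ... |
| 5 | 0 | ... |
| 6 | 0 | ... |
| 7 | 0 | ... |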
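To make the intended semantics concrete, here is a small in-memory pandas sketch of the same example. It only illustrates the result we are after; it is exactly the "load everything" approach we want to avoid on the real table. The column `other` is a made-up stand-in for the several additional columns.

```python
import pandas as pd

# Toy stand-in for the Iceberg table; "other" is a hypothetical placeholder
# for the additional columns that must stay untouched.
table = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7], "label": [0] * 7, "other": list("abcdefg")}
)

# Newly calculated labels for a subset of ids.
updates = pd.DataFrame({"id": [3, 4], "label": [1, 1]})

# Look up the new label by id and overwrite only the matching rows.
new_labels = updates.set_index("id")["label"]
mask = table["id"].isin(new_labels.index)

result = table.copy()
result.loc[mask, "label"] = result.loc[mask, "id"].map(new_labels)

print(result)
```

Only `label` changes for ids 3 and 4; every other column and row is left as it was.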

The current implementation cannot do this: it either tries to change the table structure (with fill_missing_columns_in_df = False) and raises an exceptions.InvalidArgumentCombination error, or it replaces all additional columns with None/NULL.
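For illustration, one way the desired behaviour could be expressed on the Athena side is a MERGE INTO statement that only sets the listed columns. The sketch below just builds such a statement as a string. The helper name `build_partial_update_sql` and the table names are hypothetical; this is not an existing awswrangler function and not necessarily what the fork does. It also assumes the updates are already available as a table Athena can read, and it does no escaping of identifiers.

```python
def build_partial_update_sql(target, source, key_cols, update_cols):
    """Build a MERGE statement that overwrites only `update_cols` on rows
    whose `key_cols` match between target and source (hypothetical helper)."""
    if not key_cols or not update_cols:
        raise ValueError("key_cols and update_cols must be non-empty")
    # Join condition on the key columns.
    on_clause = " AND ".join(f't."{c}" = s."{c}"' for c in key_cols)
    # Only the listed columns appear in SET; all other columns are untouched.
    set_clause = ", ".join(f'"{c}" = s."{c}"' for c in update_cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON ({on_clause}) "
        f"WHEN MATCHED THEN UPDATE SET {set_clause}"
    )


# Hypothetical table names, matching the id/label example above.
sql = build_partial_update_sql(
    "my_db.my_table", "my_db.label_updates", key_cols=["id"], update_cols=["label"]
)
print(sql)
```

Because there is no WHEN NOT MATCHED clause, rows missing from the source are left alone and no new rows are inserted. The resulting string could then be run as an ordinary Athena query.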

I'd be happy to open a PR for this if that is welcome. I've already made a fork here main...GrumpyCat51:aws-sdk-pandas:main that we tested successfully as a suggestion, and I can change, improve, or adapt it as needed.
