Unable to write SUPER type column to Redshift using redshift.copy #3095

Open
duarteocarmo opened this issue Feb 14, 2025 · 6 comments · May be fixed by #3104
Labels
bug Something isn't working

Comments

@duarteocarmo

Describe the bug

Description

I'm having issues writing data to a Redshift table that contains a SUPER type column using awswrangler.redshift.copy. Even with serialize_to_json=True, the SUPER column is not handled correctly.

Environment

  • awswrangler version: 3.11.0
  • Python version: 3.12.4
  • Operating System: Darwin

Table Schema

CREATE TABLE IF NOT EXISTS bss_dv.free_text_translation
(
    md5_hash VARCHAR(32) NOT NULL ENCODE RAW,
    translation SUPER ENCODE RAW,
    translation_date TIMESTAMP WITHOUT TIME ZONE ENCODE az64
)
DISTSTYLE KEY
DISTKEY (md5_hash)
SORTKEY (md5_hash)

My function

import awswrangler as wr
import pandas as pd
import polars as pl


def write_table(
    table: pl.DataFrame,
    config: JobConfig,
    dest_table: str = "free_text_translation",
    dest_schema: str = "bss_dv",
) -> None:
    pdf = table.to_pandas()
    pdf["translation_date"] = pd.to_datetime(pdf["translation_date"])
    # example data from here

    with wr.redshift.connect(
        secret_id=config.REDSHIFT_SECRET_ID,
        dbname=config.REDSHIFT_DB,
        timeout=3600,
    ) as con:
        wr.redshift.copy(
            df=pdf,
            path=config.TEMP_BUCKET,
            con=con,
            table=dest_table,
            schema=dest_schema,
            mode="append",
            serialize_to_json=True,
        )

Example data:

{
    'md5_hash': '00066f9e57abf748924808004fc504a7',
    'translation': '{"text": "...", "source_language": "de", "target_language": "en", "success": true, "position": 0, "should_translate": true, "translated_text": "...", "message": null, "version": "2025-01-06", "text_len_chars": 235, "manual": false}',
    'translation_date': Timestamp('2025-02-14 13:00:59')
}

When I query the data in Redshift, the translation column comes back as a plain string rather than a SUPER JSON value.

How to Reproduce

See above.

Expected behavior

The column in Redshift should contain a SUPER JSON value.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.12.4

AWS SDK for pandas version

3.11.0

Additional context

No response

@duarteocarmo added the bug label Feb 14, 2025
@Rutuja2506

Rutuja2506 commented Feb 24, 2025

Hi @kukushking, could you please try explicitly serializing the 'translation' column as JSON using json.dumps? This ensures the column is in the correct format:

   # Explicitly serialize 'translation' column as JSON
   import json
   pdf["translation"] = pdf["translation"].apply(json.dumps)

@duarteocarmo
Author

@Rutuja2506 - that worked for me!

@kukushking removed their assignment Feb 24, 2025
@misteliy

@kukushking, could you please reopen this issue? The original problem persists as serialize_to_json=True is not functioning correctly.

To address this, please either:

1. Remove the option, if it's not feasible to fix at this time, or
2. Correct the implementation, so that it works as intended.

While we acknowledge the workaround is effective, it's not an ideal long-term solution.

Thanks!

@duarteocarmo reopened this Feb 24, 2025
@kukushking
Contributor

@misteliy @duarteocarmo what is the type of the column in your source data frame: object or a string?

One thing the code below ensures is that the column holds strings serialized to JSON, which can then be written into a SUPER type column.

   pdf["translation"] = pdf["translation"].apply(json.dumps)

@misteliy

Please have a look at the documentation:

[screenshot: serialize_to_json parameter documentation]

So there should be no need to serialize!

@kukushking
Contributor

kukushking commented Feb 27, 2025

@misteliy just to be clear, SERIALIZETOJSON only alters the COPY command so that it can load columns into a SUPER type column in Redshift from Parquet. The data in your dataframe, and in the resulting Parquet, must still be serialized.
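
For example, a minimal end-to-end sketch; the secret, database, and bucket names here are placeholders, not values from the issue:

   import json

   import awswrangler as wr
   import pandas as pd

   # Placeholder data modeled on the issue; the dict is not yet JSON.
   df = pd.DataFrame(
       {
           "md5_hash": ["00066f9e57abf748924808004fc504a7"],
           "translation": [{"success": True, "position": 0}],
           "translation_date": [pd.Timestamp("2025-02-14 13:00:59")],
       }
   )

   # SERIALIZETOJSON does not serialize for you: convert dicts to JSON strings first.
   df["translation"] = df["translation"].apply(json.dumps)

   with wr.redshift.connect(secret_id="my-secret", dbname="my-db") as con:
       wr.redshift.copy(
           df=df,
           path="s3://my-temp-bucket/staging/",
           con=con,
           table="free_text_translation",
           schema="bss_dv",
           mode="append",
           serialize_to_json=True,
       )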

I have added a test case and clarified the docs.
