Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding issues with emojis on MySQL database when using load_data_local_infile #332

Closed
ecoffingould opened this issue May 1, 2024 · 5 comments · Fixed by #333
Closed
Labels

Comments

@ecoffingould
Copy link
Contributor

I am writing a data set with emojis in it to a MySQL database. When the load_data_local_infile flag is set to TRUE (which is necessary for speed purposes), the write fails due to encoding issues. When the load_data_local_infile flag is not called or set to FALSE, the write succeeds, but goes much slower.

I have triple-checked the encoding in all other places — the MySQL database, the connection, the R file itself. All are encoded to utf-8, and can be written normally if the load_data_local_infile flag isn't used. This is also an issue that I didn't have on the old RMySQL package.

I also tried specifying the charset or the encoding in the dbConnect() stage, but that didn't seem to do anything. This is driving me crazy, please help.

df_with_emojis
  db_con _1 <- dbConnect(
    MariaDB(),
    host = Sys.getenv('mysql_url'),
    port = 3306 ,
    user = Sys.getenv('mysql_username') ,
    password = Sys.getenv('mysql_pw') ,
    db = 'my_db',
    load_data_local_infile = TRUE,
    bigint = "integer"
  )

  db_con _2 <- dbConnect(
    MariaDB(),
    host = Sys.getenv('mysql_url'),
    port = 3306 ,
    user = Sys.getenv('mysql_username') ,
    password = Sys.getenv('mysql_pw') ,
    db = 'my_db',
    bigint = "integer"
  )

dbWriteTable(db_con_1,table_name_here,df_with_emojis, overwrite = TRUE,row.names = FALSE)
# This fails with the encoding error "Error executing query: Invalid utf8mb3 character string: xyz problematic emoji line here"

dbWriteTable(db_con_2,table_name_here,df_with_emojis, overwrite = TRUE,row.names = FALSE)
# This works, but slower
@krlmlr
Copy link
Member

krlmlr commented May 1, 2024

Thanks, Eric. Is the following relevant?

https://stackoverflow.com/a/75810281/946850

@ecoffingould
Copy link
Contributor Author

Thank you for the quick response!

My tables are in utf8mb4 — I checked the create statement for table_name_here. The default character set for my database + schema are both utf8mb4. My MySQL server version is 8.0.35

That makes sense with why this write works when load_data_local_infile is set to FALSE. And it's also why the error message is so weird — cause i've double- and triple-checked and the encodings are utf8mb4.

I think that whatever is happening in the load_data_local_infile step is somehow encoding the data as utf8mb3?

@krlmlr krlmlr added the bug label May 1, 2024
@krlmlr
Copy link
Member

krlmlr commented May 1, 2024

Thanks. With https://stackoverflow.com/a/75421637/946850, it becomes apparent that

RMariaDB/R/table.R

Lines 61 to 68 in 24e2520

sql <- paste0(
"LOAD DATA LOCAL INFILE ", dbQuoteString(conn, path), "\n",
"IGNORE\n",
"INTO TABLE ", quoted_name, "\n",
"CHARACTER SET utf8 \n",
"(", paste0(colnames, collapse = ", "), ")",
set
)

is wrong and should be utf8mb4 instead. Would you like to contribute a PR?

@ecoffingould
Copy link
Contributor Author

Here you are!

@krlmlr
Copy link
Member

krlmlr commented May 1, 2024

Thanks for the PR!

@krlmlr krlmlr closed this as completed May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants