utf8 string cut off #16

Keno-Chile · 2018-08-07T16:33:27Z

After running the script, Spanish words like:
Américo, año, agüita
become:
Am, a, ag
The words got sliced. I tried changing the encoding to utf8mb4, but got the same result.

nicjansma · 2018-09-29T23:39:02Z

@Keno-Chile can you share a partial dump of the input data you're trying to convert?

dszyfelb · 2019-10-19T20:11:54Z

I have the same issue.
For example "DO BI¯UTERII" is changed to "DO BI"

nicjansma · 2019-12-07T14:40:25Z

@dszyfelb a shared partial data dump might be helpful to debug!

jseaber · 2021-01-27T03:05:01Z

Thanks again for sharing your work! I've encountered the same issue.

I tested thoroughly on a staging server loaded with a copy of production data, first executing on the structure only (no errors), and then on the full database. All seemed to go well. Knowing many of the records in our 10-year-old MySQL database are non-critical, I executed on the production database after confirming all current/sensitive records were intact. Here's the collation map I used:

// TODO: The collation you want to convert the overall database to
$defaultCollation = 'utf8mb4_0900_ai_ci';

// TODO Convert column collations and table defaults using this mapping
// latin1_swedish_ci is included since that's the MySQL default
$collationMap =
 array(
  'latin1_bin'        => 'utf8_bin',
  'latin1_general_ci' => 'utf8mb4_0900_ai_ci',
  'latin1_swedish_ci' => 'utf8mb4_0900_ai_ci',
 );

All instances of FedEx International Economy® became FedEx International Economy
Most German text was truncated: Schönfliess became Sch. The same happened to words containing ü, ß, etc.
Działdowska remained Działdowska

PHP version: 7.4.3
MySQL: 8.0.22-0ubuntu0.20.04.3 (Ubuntu)

Although the lost (old) data is unimportant to our needs, I have full backups of the original data in latin1_swedish_ci and have pulled a few of the problematic records. I'll be glad to experiment further if you have any ideas.

Edit: Looking through my SQL backups, all backed-up tables were dumped without an explicit declaration of the collation, i.e.:
ENGINE=InnoDB DEFAULT CHARSET=latin1;

After running the script, all tables now show an explicit collation, even though this matches the MySQL 8.0 server default:
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Is it possible the script was unable to read the collation?
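
One possible explanation that fits every example reported in this thread, offered as an unconfirmed hypothesis: the stored bytes are genuine latin1, but the conversion reinterprets them as UTF-8 instead of transcoding them. A latin1 byte such as 0xE9 (é) is not valid UTF-8 on its own, so the string would end at that byte. Text whose bytes already happen to be valid UTF-8 (which would have to be the case for Działdowska, since ł has no latin1 encoding) would pass through unchanged. The Python sketch below only simulates that byte-level behaviour; the function name `reinterpret_as_utf8` is hypothetical, and the sketch does not reproduce the script or MySQL itself.

```python
def reinterpret_as_utf8(raw: bytes) -> str:
    """Read raw bytes as UTF-8 and keep only the prefix before the first invalid byte.

    Assumption: this mimics a conversion that treats stored bytes as UTF-8
    and cuts the value off at the first byte sequence that is not valid UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte
        return raw[:err.start].decode("utf-8")


# Words from this thread, stored as genuine latin1 bytes
for word in ["Américo", "año", "agüita", "Schönfliess"]:
    print(word, "->", reinterpret_as_utf8(word.encode("latin-1")))

# A word that cannot be latin1 at all, stored as UTF-8 bytes
print("Działdowska", "->", reinterpret_as_utf8("Działdowska".encode("utf-8")))
```

Under these assumptions the latin1 inputs come out as Am, a, ag, and Sch, while Działdowska survives intact, which matches the reports above. Comparing `HEX(column)` output for an affected record in the original backup would show whether the stored bytes are single-byte latin1 (for example E9 for é) or multi-byte UTF-8 (C3 A9).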
