utf8 string cut off #16

Open
Keno-Chile opened this issue Aug 7, 2018 · 4 comments

@Keno-Chile

After running the script, Spanish words such as
Américo, año, and agüita become:
Am, a, ag
(The words are cut off at the first accented character.) I tried changing the encoding to utf8mb4 but got the same result.

@nicjansma
Owner

@Keno-Chile can you share a partial dump of the input data you're trying to convert?

@dszyfelb

I have the same issue.
For example "DO BI¯UTERII" is changed to "DO BI"

@nicjansma
Owner

@dszyfelb a shared partial data dump might be helpful to debug!

@jseaber

jseaber commented Jan 27, 2021

Thanks again for sharing your work! I've encountered the same issue.

I tested thoroughly on a staging server loaded with a copy of production data, first running the script on the structure only (no errors) and then on the full database. All seemed to go well. Knowing that many of the records in our 10-year-old MySQL database are non-critical, I ran it on the production database after confirming that all current and sensitive records were intact. Here's the collation map I used:

// TODO: The collation you want to convert the overall database to
$defaultCollation = 'utf8mb4_0900_ai_ci';

// TODO Convert column collations and table defaults using this mapping
// latin1_swedish_ci is included since that's the MySQL default
$collationMap =
 array(
  'latin1_bin'        => 'utf8_bin',
  'latin1_general_ci' => 'utf8mb4_0900_ai_ci',
  'latin1_swedish_ci' => 'utf8mb4_0900_ai_ci',
 );

The results:

  1. All instances of FedEx International Economy® became FedEx International Economy
  2. Most German text was truncated: Schönfliess became Sch. The same happened to words containing ü, ß, and similar characters.
  3. Działdowska remained Działdowska
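
One possible explanation for this pattern, offered as an assumption and not something confirmed in this thread: if the stored bytes are reinterpreted as UTF-8 without being transcoded, genuine latin1 bytes such as 0xF6 (ö) are invalid UTF-8, and a non-strict MySQL server may cut the value at the first invalid byte. The Python sketch below only simulates that behavior on the examples reported here. `reinterpret_as_utf8` is a hypothetical helper, not part of the conversion script.

```python
def reinterpret_as_utf8(stored: bytes) -> str:
    """Decode bytes as UTF-8, keeping only the prefix before the first invalid byte."""
    try:
        return stored.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return stored[:err.start].decode("utf-8")


# Assumption: these values were stored as genuine latin1, one byte per accented letter.
reported = ["Américo", "año", "agüita", "Schönfliess", "FedEx International Economy®"]
for word in reported:
    print(repr(word), "->", repr(reinterpret_as_utf8(word.encode("latin-1"))))

# latin1 cannot represent "ł", so this value must already have been UTF-8 bytes
# sitting in a latin1 column; reinterpreting those bytes leaves it intact.
print(repr(reinterpret_as_utf8("Działdowska".encode("utf-8"))))
```

Under that assumption the simulation reproduces every truncation listed above, including the lost ® and the surviving Działdowska, which suggests the database held a mix of real latin1 values and UTF-8 bytes stored in latin1 columns.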

PHP version: 7.4.3
MySQL: 8.0.22-0ubuntu0.20.04.3 (Ubuntu)

Although the lost (old) data is unimportant to our needs, I have full backups of the original data in latin1_swedish_ci and have pulled a few of the problematic records. I'll be glad to experiment further if you have any ideas.
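
For experiments on the backed-up records, a small check like the following could sort values into ones whose bytes are already valid UTF-8 and ones that are genuine latin1. This is only a sketch: `classify` is a hypothetical helper, and it assumes the raw column bytes can be read from the dump without any client-side re-encoding.

```python
def classify(raw: bytes) -> str:
    """Label raw column bytes as 'ascii', 'utf8', or 'latin1'."""
    if raw.isascii():
        return "ascii"   # identical in both encodings, safe either way
    try:
        raw.decode("utf-8")
        return "utf8"    # UTF-8 bytes stored in a latin1 column
    except UnicodeDecodeError:
        return "latin1"  # needs real transcoding, not a byte reinterpretation


samples = {
    "plain": b"FedEx International Economy",
    "german": "Schönfliess".encode("latin-1"),
    "polish": "Działdowska".encode("utf-8"),
}
for name, raw in samples.items():
    print(name, classify(raw))
```

A table that contains both "utf8" and "latin1" values cannot be fixed by a single conversion strategy, so the two groups would probably need to be handled separately.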

Edit: Looking through my SQL backups, I see that all backed-up tables were dumped without an explicit collation declaration, e.g.:
ENGINE=InnoDB DEFAULT CHARSET=latin1;

After running the script, all tables now show an explicit collation, even though this matches the MySQL 8.0 server default:
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Is it possible the script was unable to read the collation?

4 participants