Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve encoding detection #435

Merged
merged 2 commits into from
Sep 19, 2021
Merged

Improve encoding detection #435

merged 2 commits into from
Sep 19, 2021

Conversation

Ghabry
Copy link
Member

@Ghabry Ghabry commented Sep 7, 2021

The first commit is not really needed for it but it makes it after 0.7 easier to replace StringView with std::string_view (C++17). This is API compatible, no changes needed to consumers.

The second commit changes the Encoding Api from Streams to a Database-handle. Because we do not operate on global objects anymore the database checked by player was always empty. This fixes it (Player loads the DB and passes it in). I also noticed that for whatever reason the language detection is better when system tab is at the beginning (maybe filenames are better than terms because they are usually unaffected by translations?)

@Ghabry Ghabry added the Encoding label Sep 7, 2021
@Ghabry Ghabry added this to the 0.7.0 milestone Sep 7, 2021
Copy link
Member

@carstene1ns carstene1ns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question is, if the encoding is not detected for ascii strings anymore (filtered out), how to know it is an english game at all?

@Ghabry
Copy link
Member Author

Ghabry commented Sep 11, 2021

Thats a good question. ICU seems to detect UTF-8 in that case. This may not be ideal, usually you want western european in that case...

@Ghabry
Copy link
Member Author

Ghabry commented Sep 11, 2021

Checked now the ICU sourcecode: They use ngrams (Split string in pairs of N chars) for language detection (this is an approach that works surprisingly good btw) so removing ASCII strings will actually make the result worse as it messes up the distribution. I will remove it and test again.

(For the ZIP archive encoding detection were I took this from this still makes sense though as all files are known beforehand, so ASCII implies UTF-8)

This reduces the amount of unnecessary database loads.

Check the system tab first as this leads to slightly better detection.
@Ghabry
Copy link
Member Author

Ghabry commented Sep 11, 2021

rechecked this: using the system tab before terms is good enough. Ascii filter was nonsense :)

@fdelapena
Copy link
Contributor

Related (fixes?): #169

@Ghabry
Copy link
Member Author

Ghabry commented Sep 11, 2021

it is still read twice. Not possible until the save data is also using DB String. Maybe 0.7.1 ;)

@carstene1ns carstene1ns merged commit a55a9d4 into EasyRPG:master Sep 19, 2021
@carstene1ns
Copy link
Member

lcftrans is broken now :P

@Ghabry
Copy link
Member Author

Ghabry commented Sep 19, 2021

yeah, everything that does encoding detection needs the API updated. Will provide a fix soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants