Improve encoding detection #435

Ghabry · 2021-09-07T20:25:10Z

The first commit is not strictly needed for this, but it makes it easier to replace StringView with std::string_view (C++17) after 0.7. The change is API compatible; consumers need no changes.
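To illustrate why a view-based parameter needs no changes on the consumer side, here is a minimal sketch. It assumes a C++17 compiler; the alias and the function name `HasDatabaseExtension` are hypothetical and are not taken from the liblcf API.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Assumption: StringView is modelled here as a plain alias of std::string_view.
using StringView = std::string_view;

// Hypothetical helper taking a view instead of a const std::string&.
// Callers can still pass std::string, string literals or other views,
// because all of them convert implicitly to std::string_view.
bool HasDatabaseExtension(StringView filename) {
    constexpr StringView ext = ".ldb";
    return filename.size() >= ext.size() &&
           filename.substr(filename.size() - ext.size()) == ext;
}
```

Existing call sites such as `HasDatabaseExtension(std::string("RPG_RT.ldb"))` or `HasDatabaseExtension("RPG_RT.ldb")` compile unchanged with either parameter type.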

The second commit changes the encoding API from streams to a database handle. Because we no longer operate on global objects, the database checked by Player was always empty. This fixes it (Player loads the DB and passes it in). I also noticed that, for whatever reason, language detection is better when the system tab comes first (maybe filenames work better than terms because they are usually unaffected by translations?).
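A rough sketch of the idea, under stated assumptions: the `Database` struct and `BuildDetectionSample` below are simplified, hypothetical stand-ins and do not mirror the real liblcf types or function names. The point is only that the caller passes in an already loaded database and that the system strings are placed before the terms in the text handed to the detector.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for a loaded database.
struct Database {
    std::vector<std::string> system_strings; // e.g. file names from the system tab
    std::vector<std::string> term_strings;   // vocabulary shown to the player
};

// Builds the sample text for the encoding detector from a database the
// caller has already loaded, so no second load from a stream is needed.
// System strings are appended first, terms afterwards.
std::string BuildDetectionSample(const Database& db) {
    std::string sample;
    for (const auto& s : db.system_strings) {
        sample += s;
        sample += ' ';
    }
    for (const auto& s : db.term_strings) {
        sample += s;
        sample += ' ';
    }
    return sample;
}
```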

carstene1ns

The question is: if the encoding is no longer detected for ASCII strings (they are filtered out), how do we know it is an English game at all?

src/reader_util.cpp

Ghabry · 2021-09-11T18:57:26Z

That's a good question. ICU seems to detect UTF-8 in that case. This may not be ideal; usually you want Western European in that case...

Ghabry · 2021-09-11T19:10:41Z

I have now checked the ICU source code: they use n-grams (the string is split into sequences of N chars) for language detection (an approach that works surprisingly well, by the way), so removing ASCII strings will actually make the result worse because it distorts the distribution. I will remove the filter and test again.
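For readers unfamiliar with the technique, the following is a minimal n-gram counter. It is not ICU's code, and `CountNgrams` is a hypothetical name; it only shows what kind of frequency table such a detector compares against per-language statistics, and therefore why dropping part of the input changes the counts.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

// Counts every sequence of n consecutive bytes in the text using a
// sliding window that advances one byte at a time.
std::map<std::string, int> CountNgrams(std::string_view text, std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || text.size() < n) {
        return counts; // nothing to count
    }
    for (std::size_t i = 0; i + n <= text.size(); ++i) {
        ++counts[std::string(text.substr(i, n))];
    }
    return counts;
}
```

Every string removed from the input removes its n-grams from this table, so filtering out the ASCII-only strings shifts the relative frequencies that the detector sees.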

(For the ZIP archive encoding detection, where I took this from, it still makes sense though, as all files are known beforehand, so ASCII implies UTF-8.)
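The ZIP case can be sketched as follows. The function names are hypothetical and this is not the Player's actual implementation; it only shows the check itself. If every file name consists solely of bytes below 0x80, decoding them as UTF-8 gives the same result as any other ASCII-compatible encoding, so detection can be skipped.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// True if the string contains only 7-bit ASCII bytes.
bool IsAscii(std::string_view s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}

// True if all file names are pure ASCII. In that case the names can be
// treated as UTF-8 without running an encoding detector.
bool AllAscii(const std::vector<std::string>& names) {
    for (const auto& name : names) {
        if (!IsAscii(name)) {
            return false;
        }
    }
    return true;
}
```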

This reduces the amount of unnecessary database loads. Check the system tab first as this leads to slightly better detection.

Ghabry · 2021-09-11T19:24:05Z

Rechecked this: using the system tab before terms is good enough. The ASCII filter was nonsense :)

fdelapena · 2021-09-11T19:33:08Z

Related (fixes?): #169

Ghabry · 2021-09-11T19:34:46Z

It is still read twice. Not possible until the save data is also using DB String. Maybe 0.7.1 ;)

carstene1ns · 2021-09-19T21:44:22Z

lcftrans is broken now :P

Ghabry · 2021-09-19T21:49:35Z

Yeah, everything that does encoding detection needs the API updated. Will provide a fix soon.

Move Load-Api to StringView

0cbf95e

Ghabry added the Encoding label Sep 7, 2021

Ghabry added this to the 0.7.0 milestone Sep 7, 2021

Ghabry mentioned this pull request Sep 7, 2021

Fix and improve encoding detection EasyRPG/Player#2641

Merged

fdelapena approved these changes Sep 8, 2021

carstene1ns reviewed Sep 11, 2021

Ghabry force-pushed the encoding branch from f90b05b to ea6ad73 Compare September 11, 2021 19:20

Change DetectEncoding Api from stream to database.

2134a2a

This reduces the amount of unnecessary database loads. Check the system tab first as this leads to slightly better detection.

Ghabry force-pushed the encoding branch from ea6ad73 to 2134a2a Compare September 11, 2021 19:22

fdelapena added the Performance label Sep 11, 2021

fdelapena requested a review from carstene1ns September 18, 2021 21:39

carstene1ns approved these changes Sep 19, 2021

carstene1ns merged commit a55a9d4 into EasyRPG:master Sep 19, 2021
