Support utf-8 encoding guessing #84495

Closed
sunbohong opened this issue Nov 11, 2019 · 8 comments · Fixed by sunbohong/vscode#1 or #84504
Labels
feature-request Request for new features or functionality verification-needed Verification of issue is requested verified Verification succeeded

@sunbohong
Contributor

sunbohong commented Nov 11, 2019

Guessing of UTF-8 (more precisely, of ['ascii', 'utf-8', 'utf-16', 'utf-32']) is disabled by this line: "Ignore encodings that cannot guess correctly".
But this link doesn't mention it.
In my test, the following text is correctly guessed as UTF-8 (result='utf-8').
image

jschardet result:

image

On the other hand, with the following setting, it is detected as GBK:

image

image
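
The effect of such an ignore list can be sketched as follows. This is a minimal illustration, not VS Code's actual code: `Guess`, `IGNORED_ENCODINGS`, and `acceptGuess` are hypothetical names, and the `{ encoding, confidence }` shape is assumed to mirror what a detector such as jschardet reports.

```typescript
// Hypothetical shape of a detector result (assumption, modeled on jschardet).
type Guess = { encoding: string | null; confidence: number };

// The list quoted in the issue: guesses for these encodings are discarded.
const IGNORED_ENCODINGS: readonly string[] = ["ascii", "utf-8", "utf-16", "utf-32"];

// Returns the lower-cased guessed encoding, or null when the guess is missing
// or ignored, in which case the caller would fall back to the configured encoding.
function acceptGuess(guess: Guess, ignored: readonly string[]): string | null {
  if (!guess.encoding) {
    return null;
  }
  const name = guess.encoding.toLowerCase();
  return ignored.indexOf(name) !== -1 ? null : name;
}

// With "utf-8" on the list, even a confident UTF-8 guess is thrown away.
const dropped = acceptGuess({ encoding: "UTF-8", confidence: 0.99 }, IGNORED_ENCODINGS);
// With an empty list, the same guess is kept.
const kept = acceptGuess({ encoding: "UTF-8", confidence: 0.99 }, []);
console.log(dropped, kept);
```

Under these assumptions, removing "utf-8" from the list is what lets a correct UTF-8 guess reach the editor instead of being replaced by the configured default.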

@vscodebot

vscodebot bot commented Nov 11, 2019

(Experimental duplicate detection)
Thanks for submitting this issue. Please also check if it is already covered by an existing one, like:

@sunbohong
Contributor Author

It's not as simple as it looks...
See these issues:
https://github.com/microsoft/vscode/labels/file-guess-encoding
aadsm/jschardet#48
aadsm/jschardet#49

These issues will be solved by #84503.

@sunbohong
Contributor Author

It's not as simple as it looks...
See these issues:
https://github.com/microsoft/vscode/labels/file-guess-encoding
aadsm/jschardet#48
aadsm/jschardet#49

In particular, aadsm/jschardet#48 will be treated as UTF-8.

image

@bpasero added the feature-request and file-guess-encoding labels Nov 12, 2019
@bpasero added this to the November 2019 milestone Nov 12, 2019
@bpasero
Member

bpasero commented Nov 12, 2019

I like this change because it makes jschardet more deterministic by giving it full control over the detection. I still think more work is needed to increase the confidence of the detection, but that can continue in other issues.

@bpasero
Member

bpasero commented Nov 12, 2019

@sunbohong I noticed a bad regression though that made me push ddfca30 to ignore the guessed encoding "ascii" for a simple reason:

  • user opens a text file with just ascii characters
  • we guess the encoding to be "ascii"
  • user types special characters (like a German umlaut)
  • user saves and closes the file
  • user reopens the file

=> the file is still guessed as "ascii" because it was not saved with a proper encoding
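
The regression can be sketched as below. Everything here is illustrative: `encodeAsciiLossy`, `isPureAscii`, and `pickEncoding` are hypothetical helpers, and the "?" substitution for unrepresentable characters is an assumption about how a lossy ASCII encoder behaves, not a statement about VS Code's encoder.

```typescript
// Assumption: an ASCII encoder that replaces every character it cannot
// represent with "?" (0x3f) instead of failing.
function encodeAsciiLossy(text: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    bytes.push(code < 0x80 ? code : 0x3f);
  }
  return bytes;
}

// True when every byte is in the 7-bit ASCII range.
function isPureAscii(bytes: number[]): boolean {
  return bytes.every((b) => b < 0x80);
}

// The fix in spirit: treat an "ascii" guess like no guess at all and
// fall back to the configured encoding.
function pickEncoding(guessed: string | null, fallback: string): string {
  if (guessed === null || guessed.toLowerCase() === "ascii") {
    return fallback;
  }
  return guessed;
}

// Saving umlauts through the lossy encoder leaves a pure-ASCII file behind,
// so a detector would have no reason to guess anything but "ascii" on reopen.
const saved = encodeAsciiLossy("Grüße");
console.log(isPureAscii(saved), pickEncoding("ascii", "utf8"));
```

Ignoring the "ascii" guess breaks this loop, because the file is then saved with the configured encoding, which can represent the new characters if it is something like UTF-8.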

@bpasero
Member

bpasero commented Nov 19, 2019

Verification:

  • configure "files.autoGuessEncoding": true
  • configure "files.encoding": "windows1252" (any non-UTF-8 encoding will do)
  • save a file whose contents include special characters (e.g. 私は和食が好きです。) as UTF-8
  • close all files
  • open it in VSCode

=> the status bar should show the encoding as UTF-8.
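
As a sanity check on the test file itself, the sketch below confirms that the sample text really is valid UTF-8 on disk, which any UTF-8 guess depends on. It does not reproduce VS Code's or jschardet's detection; `isValidUtf8` is a hypothetical helper built on the standard `TextEncoder` and `TextDecoder` globals.

```typescript
// Strict check: a TextDecoder created with fatal: true throws on malformed
// input instead of inserting replacement characters.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// The sample sentence from the verification steps, encoded as UTF-8.
const sampleBytes = new TextEncoder().encode("私は和食が好きです。");

// For contrast: the GBK bytes for "中文", which are not well-formed UTF-8.
const gbkBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);

console.log(isValidUtf8(sampleBytes), isValidUtf8(gbkBytes));
```

Passing this check is necessary but not sufficient for a detector to report UTF-8, since pure-ASCII content passes it too.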

@connor4312 added and removed the verified and verification-found labels Dec 3, 2019
@bpasero
Member

bpasero commented Dec 5, 2019

Given scary issues such as #85821, I am putting UTF-16 and UTF-32 back on the list of encodings ignored for guessing. UTF-8 can still be guessed.

@sunbohong
Contributor Author

@bpasero Judging from #85821, we really need to replace jschardet with a new library.

@vscodebot locked and limited conversation to collaborators Dec 27, 2019
3 participants