Allow to configure a list of encoding-confidences to use when guessing #84503

Closed
sunbohong opened this issue Nov 11, 2019 · 16 comments
Labels
*duplicate Issue identified as a duplicate of another issue(s) feature-request Request for new features or functionality info-needed Issue requires more information from poster
Milestone

Comments

@sunbohong
Contributor

Since there have been so many encoding issues, I plan to upgrade the encoding-guessing workflow to V2.

  • First, jschardet will be upgraded so that it returns multiple results. For example, [{encoding: 'UTF-8', confidence: 0.95}, {encoding: 'GBK', confidence: 0.95}].

  • Then, we will support configuring a list of initial encoding confidences in settings.files.encodingInitConfidences, for example:
    [{encoding: 'utf-8', confidence: 0.01}, {encoding: 'GBK', confidence: 0.03}]

  • Finally, the two are added together to get the final result. Because GBK then has the highest confidence, the file is recognized as GBK:
    [{encoding: 'utf-8', confidence: 0.96}, {encoding: 'GBK', confidence: 0.98}]
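
The three steps above could be sketched roughly as follows. This is only an illustration of the proposal, not existing VS Code or jschardet code: the `EncodingGuess` type and the `combineConfidences` function are hypothetical names, and matching encoding names case-insensitively is an assumption made here so that 'UTF-8' and 'utf-8' line up.

```typescript
// Hypothetical shape for one detection result or one configured entry.
interface EncodingGuess {
  encoding: string;
  confidence: number;
}

// Adds each configured initial confidence to the detector's confidence for the
// same encoding (names compared case-insensitively, an assumption of this sketch)
// and returns the guesses sorted from highest to lowest combined confidence.
function combineConfidences(detected: EncodingGuess[], initial: EncodingGuess[]): EncodingGuess[] {
  const bonus = new Map<string, number>();
  for (const entry of initial) {
    bonus.set(entry.encoding.toLowerCase(), entry.confidence);
  }
  return detected
    .map((guess) => ({
      encoding: guess.encoding,
      confidence: guess.confidence + (bonus.get(guess.encoding.toLowerCase()) ?? 0),
    }))
    .sort((a, b) => b.confidence - a.confidence);
}

// Example values taken from the proposal above.
const detected: EncodingGuess[] = [
  { encoding: "UTF-8", confidence: 0.95 },
  { encoding: "GBK", confidence: 0.95 },
];
const initial: EncodingGuess[] = [
  { encoding: "utf-8", confidence: 0.01 },
  { encoding: "GBK", confidence: 0.03 },
];

const ranked = combineConfidences(detected, initial);
const best = ranked[0];
console.log(best.encoding); // GBK
```

With these inputs the sums come to roughly 0.96 for UTF-8 and 0.98 for GBK, so GBK wins the tie that the detector alone could not break.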

@vscodebot

vscodebot bot commented Nov 11, 2019

(Experimental duplicate detection)
Thanks for submitting this issue. Please also check if it is already covered by an existing one, like:

@bpasero bpasero added feature-request Request for new features or functionality file-guess-encoding labels Nov 12, 2019
@bpasero bpasero added this to the Backlog Candidates milestone Nov 12, 2019
@bpasero bpasero removed their assignment Nov 12, 2019
@bpasero
Member

bpasero commented Nov 12, 2019

> First, jschardet will be upgraded and multiple results will be returned

@sunbohong are you planning to contribute to JSChardet?

@sunbohong
Contributor Author

> > First, jschardet will be upgraded and multiple results will be returned
>
> @sunbohong are you planning to contribute to JSChardet?

I found that https://github.com/runk/node-chardet can meet my requirements.

@bpasero
Member

bpasero commented Nov 12, 2019

I am not sure VSCode would move off jschardet to another module, seems rather risky. I would suggest you try to become a contributor on jschardet to improve it if possible.

@sunbohong
Contributor Author

I tried adding some logging to jschardet, but it fails the following test: aadsm/jschardet#48

node-chardet handles this test correctly; its best guess is UTF-8.

jschardet can't pass this test. It reports:

windows-1252, confidence 0.95
UTF-8, confidence 0.505

@sunbohong
Contributor Author

@bpasero

Since jschardet can't provide the expected results, we could offer node-chardet as an opt-in alternative. If enough people choose the new switch, we could remove jschardet later.

@byyxx128

byyxx128 commented Nov 19, 2019

I like the mode provided by node-chardet. Just a friendly reminder: please always use GB18030 in place of GB2312 or GBK, regardless of the confidence value.

Reasons:

  • Characters that exist only in GB18030 will be lost if the file is edited and saved in a narrower encoding.
  • Those characters cannot be displayed correctly with a narrower encoding (nothing is lost if the file is merely opened, but saving it in the narrower encoding destroys information).
  • A file encoded in GB2312 or GBK can be decoded as GB18030. New characters would still be lost if the file were later decoded with a narrower encoding again, but decoding as GB18030 itself breaks nothing.

All in all, the safest approach is to always use GB18030 instead of GB2312 or GBK.

Here is a GB18030-encoded file for testing.
test_GB18030.txt
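
One way to express the "always prefer GB18030" rule is a small normalization step applied to whatever name the detector returns. This is a minimal sketch under assumptions: `preferGB18030` and `LEGACY_GB_ENCODINGS` are hypothetical names, and it assumes the detector reports these encodings as "GB2312" or "GBK" in some letter case.

```typescript
// Encoding names that, per the suggestion above, should be opened as GB18030.
const LEGACY_GB_ENCODINGS = new Set(["gb2312", "gbk"]);

// Maps GB2312 and GBK to GB18030 regardless of confidence; other names pass through unchanged.
function preferGB18030(encoding: string): string {
  return LEGACY_GB_ENCODINGS.has(encoding.toLowerCase()) ? "GB18030" : encoding;
}

const promoted = ["GBK", "gb2312", "UTF-8"].map(preferGB18030);
console.log(promoted); // [ 'GB18030', 'GB18030', 'UTF-8' ]
```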

@vscodebot

vscodebot bot commented Jan 15, 2020

This feature request is now a candidate for our backlog. The community has 60 days to upvote the issue. If it receives 20 upvotes we will move it to our backlog. If not, we will close it. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

@vscodebot

vscodebot bot commented Mar 6, 2020

This feature request has not yet received the 20 community upvotes it takes to make it to our backlog. 10 days to go. To learn more about how we handle feature requests, please see our documentation.

Happy Coding

@vscodebot

vscodebot bot commented Mar 6, 2020

🙂 This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

@bpasero
Member

bpasero commented Nov 4, 2020

@sunbohong please merge this with #36951; I feel the two suggestions are very similar.

@bpasero bpasero added the info-needed Issue requires more information from poster label Nov 4, 2020
@sunbohong
Contributor Author

> I am not sure VSCode would move off jschardet to another module, seems rather risky. I would suggest you try to become a contributor on jschardet to improve it if possible.

Upgrading to a new version of jschardet also carries risk.
So why not try node-chardet?

@bpasero
Member

bpasero commented Nov 4, 2020

IMHO, some requirements for switching to another library such as node-chardet are:

  • it also runs on the web (a relatively new requirement)
  • jschardet's tests pass with node-chardet (this would give some confidence)
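
Both requirements could be checked with a small, library-agnostic harness. The sketch below is an assumption-heavy illustration: `Detector`, `Fixture`, `failingFixtures`, and the sample fixtures are all hypothetical, and `bomDetector` is a stand-in that only recognizes byte order marks. In practice one would wrap jschardet or node-chardet behind the same `Detector` type and load jschardet's real test files. It uses only `Uint8Array`, so the harness itself would run in a browser as well as in Node.

```typescript
// A detector maps raw bytes to an encoding name, or null when it cannot decide.
type Detector = (bytes: Uint8Array) => string | null;

// One test case: input bytes plus the encoding the detector is expected to report.
interface Fixture {
  name: string;
  bytes: Uint8Array;
  expected: string;
}

// Runs every fixture through the detector and returns the names of those that fail.
// Encoding names are compared case-insensitively.
function failingFixtures(detect: Detector, fixtures: Fixture[]): string[] {
  return fixtures
    .filter((f) => (detect(f.bytes) ?? "").toLowerCase() !== f.expected.toLowerCase())
    .map((f) => f.name);
}

// Stand-in detector for this sketch: recognizes only UTF-8 and UTF-16LE byte order marks.
const bomDetector: Detector = (bytes) => {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "UTF-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "UTF-16LE";
  }
  return null;
};

// Hypothetical fixtures; real ones would come from the existing test suite.
const fixtures: Fixture[] = [
  { name: "utf8-bom", bytes: new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]), expected: "utf-8" },
  { name: "utf16le-bom", bytes: new Uint8Array([0xff, 0xfe, 0x68, 0x00]), expected: "utf-16le" },
  { name: "plain-ascii", bytes: new Uint8Array([0x68, 0x69]), expected: "ascii" },
];

const failures = failingFixtures(bomDetector, fixtures);
console.log(failures); // [ 'plain-ascii' ]
```

Running the same fixture list against two detectors and comparing the failure lists would give a concrete measure of how risky a switch is.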

@bpasero
Member

bpasero commented Nov 10, 2020

/duplicate #36951

@github-actions github-actions bot locked and limited conversation to collaborators Dec 25, 2020
4 participants
@bpasero @sunbohong @byyxx128 and others