Please replace jschardet with a different library #85480

sunbohong · 2019-11-24T05:10:11Z

Please remove jschardet. Many people in the community have reported problems with file encoding detection. These problems are caused by the design of jschardet. In addition, these issues get labeled as "upstream", so they are not handled effectively, and some of them are even closed directly.

https://github.com/aadsm/jschardet/issues

#23570
#24195
#27419
#33720
#36230
#36951
#41393
#47872
#51125
#52132
#61663
#64419
#64931

As https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests points out, we need more up-votes. If you have trouble with Simplified Chinese files, please give this issue an up-vote.

> Has the community at large expressed interest in this functionality? I.e. has it gathered more than 10 up-votes or more than 10 comments? This criterion alone covers more than 650 of the 2850 open feature requests as of right now, October 9th, 2019.

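Chinese version (translated):
Hello everyone, this issue asks VS Code to remove a third-party dependency, jschardet.

jschardet outputs an expected encoding by scanning the document content. This mechanism works well in a purely Western-text environment. However, if the code mixes in some Chinese-specific characters, such as "中国人", it is likely to return "Windows 1252" as the encoding, so the file content cannot be displayed correctly. This defect is essentially the result of jschardet being unable to balance "accuracy" against "robustness".

For this reason, I hope everyone will vote to support improving the file encoding detection logic.

I sincerely hope that everyone who has run into this problem, or might run into it in the future, will up-vote this issue.

Your up-votes will directly affect how much support VS Code gives this change.

According to the VS Code team's feature-request process, https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests , if there are not enough up-votes, this issue will most likely be closed rather than addressed.

> Has the community at large expressed interest in this functionality? I.e. has it gathered more than 10 up-votes or more than 10 comments? This criterion alone covers more than 650 of the 2850 open feature requests as of right now, October 9th, 2019.

#84503 didn't have enough up-votes.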
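
The failure mode described above can be reproduced without any detection library. The following is a minimal sketch in Python (not the language of VS Code or jschardet, and not how jschardet scores candidates); the helper name `decodes_as` is hypothetical. It only illustrates why the bytes themselves never rule out a Windows-1252 guess: nearly every byte maps to some Windows-1252 character, so GBK-encoded Chinese decodes without error and simply comes out garbled.

```python
# Illustration only: this does not reproduce jschardet's scoring.
text = "中国人"
gbk_bytes = text.encode("gbk")

# Decoding GBK bytes as Windows-1252 raises no error; it just yields mojibake.
as_cp1252 = gbk_bytes.decode("cp1252")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(as_cp1252 == text)                # False: the text is garbled
print(decodes_as(gbk_bytes, "cp1252"))  # True: nothing signals the mistake
print(decodes_as(gbk_bytes, "utf-8"))   # False: strict UTF-8 rejects these bytes
```

Because a wrong single-byte guess cannot be detected by a decoding error, a detector has to rely on statistics, which is where mixed Western and Chinese content becomes hard.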

MxDany · 2019-11-29T05:59:11Z

It's hard to believe that this issue has existed for so long.

MxDany · 2019-12-04T05:48:20Z

The same file works fine in other editors (Visual Studio, Notepad++, Beyond Compare, etc.), yet no one cares. ☹

amnore · 2019-12-04T09:19:35Z

Is there a better alternative than jschardet?

bpasero · 2019-12-05T10:22:49Z

I see the issues with jschardet. We can certainly remove the support for auto-guessing encodings, but we can also investigate either fixing the issues or replacing jschardet with something else, if that is possible.

Encoding detection is currently only used in 2 places:

- from the encoding picker of the status bar (we offer the guessed encoding as one of the top entries if guessed)
- for any editor that is opened if files.autoGuessEncoding is set to true

Note: it is unlikely that VSCode would make an investment in improving the libraries, so this would need community help. But I am certainly willing to adopt a different library if it has good test coverage and shows improvements over the current one.

sunbohong · 2019-12-05T13:34:54Z

Thank you for your interest in this issue.

In order to provide higher-quality encoding detection, we want to collect some test cases.

A test case consists of three parts: the original file, its feature, and the expected encoding type (which can be multiple encoding types).

Demo:

File1, containing only Simplified Chinese characters (酷酷的哀殿), GBK

File2, containing Simplified Chinese and English letters (My name is 酷酷的哀殿), GBK

File3, containing Simplified Chinese and English letters (My name is 酷酷的哀殿), UTF-8

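Chinese version (translated):
Thank you all for your interest in this issue.
To provide higher-quality encoding detection, we would like to collect some test cases.
A test case should contain three parts: the original file, its feature, and the expected encoding type (which can be multiple encoding types).
Demo:
File1, containing only Simplified Chinese characters (酷酷的哀殿), GBK
File2, containing both Simplified Chinese and English letters (My name is 酷酷的哀殿), GBK
File3, containing both Simplified Chinese and English letters (My name is 酷酷的哀殿), UTF-8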
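
The three-part test case described in this comment could be captured in a small data structure. This is a sketch in Python under stated assumptions: `EncodingTestCase`, `failures`, and `naive_detect` are hypothetical names, and `naive_detect` is only a stand-in so the example runs; it is not jschardet and not a proposed replacement.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EncodingTestCase:
    name: str
    content: bytes              # the original file's raw bytes
    feature: str                # what kind of text the file contains
    expected: Tuple[str, ...]   # one or more acceptable encodings


CASES = [
    EncodingTestCase("File1", "酷酷的哀殿".encode("gbk"),
                     "Simplified Chinese only", ("gbk",)),
    EncodingTestCase("File2", "My name is 酷酷的哀殿".encode("gbk"),
                     "Simplified Chinese and English letters", ("gbk",)),
    EncodingTestCase("File3", "My name is 酷酷的哀殿".encode("utf-8"),
                     "Simplified Chinese and English letters", ("utf-8",)),
]


def failures(detect: Callable[[bytes], str],
             cases: List[EncodingTestCase]) -> List[str]:
    """Return the names of cases whose detected encoding is not acceptable."""
    return [c.name for c in cases if detect(c.content).lower() not in c.expected]


def naive_detect(data: bytes) -> str:
    """Stand-in detector for the demo: strict UTF-8 first, otherwise GBK."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "gbk"


print(failures(naive_detect, CASES))
```

Any candidate library could be wrapped in a function with the same `bytes -> str` shape and run against the collected cases, which makes comparisons between libraries repeatable.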

bpasero · 2019-12-05T19:52:47Z

I think it would also be worthwhile to look at the existing test cases for jschardet and its underlying library.

byyxx128 · 2020-01-28T00:32:42Z

@rebornix
Could you please help publicize the potential solution?
Many users (especially Chinese users) may have the same problem. UNFORTUNATELY, they do NOT search the existing issues before reporting, and they do NOT follow up on their issues after the initial report. Do you know what happens next? When these users hit the same problem again, they immediately report it as a new issue without any research, and again they do NOT follow up… Furthermore, non-English issues are generally closed, which causes their authors to open plenty of duplicate issues as well. Consequently, although many people have the same problem, they do NOT follow this issue/solution and do NOT vote for it. 😒

rebornix · 2020-01-29T01:00:31Z

Thanks for contributing to this issue and for continuing to raise awareness of it. I agree that the Chinese community might be silent on issues, and IMHO providing the right detection for Chinese (or probably even CJKV) is fundamental to our users. Thus I assigned this issue to myself and moved it to the backlog to ensure it won't get closed. @bpasero I have some capacity for CJKV support in VS Code this year, so feel free to unassign yourself if this topic doesn't fit into your plan.

As @bpasero mentioned above and in #84503 (comment), it's risky to move to another library: encoding detection is based on heuristics, so moving to another library might change the behavior significantly (for encodings whose confidence is low).

We can look into improving jschardet and then build helpers on top of jschardet (for example, a proposal from #84503). The challenge here (if I understand correctly) is how to choose the best encoding when the content matches more than one (see aadsm/jschardet#49 (comment)).

I'd love to see feedback and suggestions from everyone, including encoding experts, about how we can improve the workflow for building robust encoding detection.

bpasero · 2020-01-29T06:15:54Z

+1 for contributing to jschardet and making it robust for edge cases. However, my thinking is still as it was from the beginning: there is no such thing as correct encoding detection; it can only ever be a guess with false positives. The only encoding guess you can do properly is when files include a byte-order mark (BOM), as the UTF variants can. Maybe on top of that you can detect UTF-16 by looking at the byte patterns (that is what VSCode has actually implemented).

Given all the issues we have now, I think VSCode would be better off not providing an encoding guess at all that automatically changes the encoding for the file that opens. A better experience imho would be to give a hint to the user that the encoding might be different from the selected one and let the user make the choice of changing it.
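
To illustrate the point about byte-order marks, here is a minimal sketch in Python of BOM-based detection. It is not VS Code's implementation; `detect_bom` and the `_BOMS` table are hypothetical names, and the byte-pattern heuristic for BOM-less UTF-16 mentioned above is not covered.

```python
import codecs
from typing import Optional

# Longer BOMs are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(codecs.BOM_UTF8 + "酷酷的哀殿".encode("utf-8")))  # utf-8
print(detect_bom("酷酷的哀殿".encode("gbk")))                      # None
```

A `None` result is the common case for legacy encodings such as GBK, which is exactly where any further answer becomes a guess.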

wdtbrchan · 2020-01-29T07:16:33Z

It would be enough for me to define which encodings are to be detected.
For example: files.autoDetectEncodingsOnly: ['utf-8', 'windows-1250']
I think that solves most practical cases.
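
A sketch of how such a setting might behave, in Python. Everything here is an assumption for illustration: `files.autoDetectEncodingsOnly` is only the setting name proposed in this comment, `pick_allowed` is a hypothetical helper, and the guesses and confidence values are invented rather than real detector output.

```python
from typing import List, Optional, Tuple

Guess = Tuple[str, float]  # (encoding name, confidence between 0 and 1)


def pick_allowed(guesses: List[Guess], allowed: List[str]) -> Optional[str]:
    """Return the most confident guess whose encoding is in the allowed list."""
    allowed_set = {name.lower() for name in allowed}
    candidates = [g for g in guesses if g[0].lower() in allowed_set]
    if not candidates:
        return None  # the caller would fall back to the configured default
    return max(candidates, key=lambda g: g[1])[0]


# Invented detector output for illustration only.
guesses = [("windows-1252", 0.73), ("windows-1250", 0.40), ("utf-8", 0.10)]
allowed = ["utf-8", "windows-1250"]

print(pick_allowed(guesses, allowed))  # windows-1250
```

The top-ranked guess is skipped because the user never expects that encoding, which is the practical effect the proposal is after.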

bpasero · 2020-01-29T07:21:43Z

Yeah that is an idea that circulated a while ago and probably makes sense.

rebornix · 2020-01-29T17:56:00Z

Allowing users to configure an order/priority list can probably help. For example, GB18030 is the current standard for Chinese encoding and it is compatible with legacy encodings like GB2312 and GBK, so choosing GB18030 is safer than GB2312 and GBK (see comment #84503 (comment)).

We also got a lot of complaints about jschardet preferring windows-1252 over UTF-8 or GB18030 when users were dealing with Chinese content.

A priority list like ["GB18030", "GBK", "GB2312", "windows-1252"] can mitigate the above issues.
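
One way a priority list could be applied is sketched below in Python. This is an assumption about how such a list might be used, not a description of jschardet or VS Code: `first_decodable` is a hypothetical helper that walks the list and returns the first encoding that decodes the bytes strictly.

```python
from typing import List, Optional


def first_decodable(data: bytes, priority: List[str]) -> Optional[str]:
    """Return the first encoding, in priority order, that decodes data strictly."""
    for name in priority:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


PRIORITY = ["GB18030", "GBK", "GB2312", "windows-1252"]

chinese = "My name is 酷酷的哀殿".encode("gbk")
western = "café".encode("cp1252")

print(first_decodable(chinese, PRIORITY))  # GB18030
print(first_decodable(western, PRIORITY))  # windows-1252
```

Strict decoding can only rule encodings out: bytes from another encoding may still decode as GB18030 by coincidence, so in practice the list would more likely break ties between a detector's guesses than replace the detector. Permissive encodings such as windows-1252 belong at the end, since they accept almost any input.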

sunbohong · 2020-01-30T02:39:02Z

> Thanks for contributing to this issue and for continuing to raise awareness of it. I agree that the Chinese community might be silent on issues, and IMHO providing the right detection for Chinese (or probably even CJKV) is fundamental to our users. Thus I assigned this issue to myself and moved it to the backlog to ensure it won't get closed. @bpasero I have some capacity for CJKV support in VS Code this year, so feel free to unassign yourself if this topic doesn't fit into your plan.
>
> As @bpasero mentioned above and in #84503 (comment), it's risky to move to another library: encoding detection is based on heuristics, so moving to another library might change the behavior significantly (for encodings whose confidence is low).
>
> We can look into improving jschardet and then build helpers on top of jschardet (for example, a proposal from #84503). The challenge here (if I understand correctly) is how to choose the best encoding when the content matches more than one (see aadsm/jschardet#49 (comment)).
>
> I'd love to see feedback and suggestions from everyone, including encoding experts, about how we can improve the workflow for building robust encoding detection.

There was another issue, #84503, which suggests "Allow to configure a list of encoding-confidences to use when guessing".

sunbohong · 2020-01-30T02:41:58Z

> not providing an encoding

When I search a directory

> +1 for contributing to jschardet and making it robust for edge cases. However, my thinking is still as it was from the beginning: there is no such thing as correct encoding detection; it can only ever be a guess with false positives. The only encoding guess you can do properly is when files include a byte-order mark (BOM), as the UTF variants can. Maybe on top of that you can detect UTF-16 by looking at the byte patterns (that is what VSCode has actually implemented).
>
> Given all the issues we have now, I think VSCode would be better off not providing an encoding guess at all that automatically changes the encoding for the file that opens. A better experience imho would be to give a hint to the user that the encoding might be different from the selected one and let the user make the choice of changing it.

When we search thousands of files, there are still many problems. A lot of results will not be shown.
"Let the user make the choice of changing it" will not solve this.

rebornix · 2020-01-30T17:57:00Z

> When we search thousands of files, there are still many problems. A lot of results will not be shown.
> "Let the user make the choice of changing it" will not solve this.

@sunbohong can you please elaborate a bit more on what your suggestion is here? Sorry, I didn't quite follow.

sunbohong · 2020-02-10T04:15:53Z

> > When we search thousands of files, there are still many problems. A lot of results will not be shown.
> > "Let the user make the choice of changing it" will not solve this.
>
> @sunbohong can you please elaborate a bit more on what your suggestion is here? Sorry, I didn't quite follow.

For example, suppose I have a folder that contains a GBK file and a UTF-8 file, and I search for a keyword.

Case 1. You can locate the UTF-8 file in the index by entering 酷酷的哀殿.

Case 2. If we open the GBK file, you can locate both files in the index.

If there were thousands of files, we would have to open thousands of files.

rebornix · 2020-02-10T17:36:28Z

@sunbohong the issue you ran into is related to the search area. Please file a separate issue, as I think we don't have encoding detection in search. cc @roblourens is this a duplicate?

Tomek-PL · 2020-03-18T16:48:34Z

How do I remove jschart? I can't find it.

roblourens · 2020-03-18T18:33:50Z

Search can only work in one encoding at a time, based on files.encoding. If you open the file, then it will be searched with encoding detection if that's enabled. I think that explains the above.
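
The behavior described here can be sketched in Python. This is a simplified model, not the actual search implementation in VS Code: `matches` is a hypothetical helper that decodes every file with one fixed encoding, the way a search driven by a single files.encoding value would.

```python
def matches(file_bytes: bytes, keyword: str, search_encoding: str) -> bool:
    """Decode the file with one fixed encoding and look for the keyword."""
    text = file_bytes.decode(search_encoding, errors="replace")
    return keyword in text


keyword = "酷酷的哀殿"
files = {
    "utf8.txt": ("My name is " + keyword).encode("utf-8"),
    "gbk.txt": ("My name is " + keyword).encode("gbk"),
}

hits = [name for name, data in files.items() if matches(data, keyword, "utf-8")]
print(hits)  # ['utf8.txt']
```

With the search encoding set to UTF-8, the GBK file decodes to replacement characters and never matches, which corresponds to Case 1 in the earlier comment.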

egamma assigned egamma and bpasero and unassigned egamma Nov 25, 2019

byyxx128 mentioned this issue Nov 29, 2019

files.autoGuessEncoding is wrong with utf-8 chinese #85738

Closed

bpasero changed the title ~~Please remove jschart.~~ Please remove jschardet Dec 5, 2019

bpasero added the file-guess-encoding and under-discussion (Issue is under discussion for relevance, priority, approach) labels Dec 5, 2019

sunbohong changed the title ~~Please remove jschardet~~ Please replace jschardet with a different library Dec 5, 2019

byyxx128 mentioned this issue Dec 13, 2019

When opening a file containing Chinese, some text will become garbled. #86875

Closed

byyxx128 mentioned this issue Dec 25, 2019

通过内容猜测编码错误 (Encoding guessed from content is wrong) #87690

Closed

rebornix added this to the Backlog milestone Jan 29, 2020

rebornix self-assigned this Jan 29, 2020

Yanpas mentioned this issue Feb 20, 2020

Try using a different encoding package JohnstonCode/svn-scm#830

Closed

Akarinnnnn mentioned this issue Mar 30, 2020

[File-guess-encoding] Sort encoding guess result by user current locale #93778

Closed

rebornix removed their assignment Oct 9, 2020

bpasero added the *out-of-scope (Posted issue is not in scope of VS Code) label Oct 12, 2020
