Please replace jschardet with a different library #85480

Closed
sunbohong opened this issue Nov 24, 2019 · 20 comments
Labels: *out-of-scope (Posted issue is not in scope of VS Code), under-discussion (Issue is under discussion for relevance, priority, approach)

Comments

@sunbohong
Contributor

sunbohong commented Nov 24, 2019

Please remove jschardet. Many people in the community have reported problems with file encoding detection. These problems are caused by the design of jschardet. In addition, the reports are labeled "upstream", which means they are not handled effectively, and some of them are even closed outright.

https://github.com/aadsm/jschardet/issues

#23570
#24195
#27419
#33720
#36230
#36951
#41393
#47872
#51125
#52132
#61663
#64419
#64931

As https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests points out, we need more up-votes. If you have run into this problem with Simplified Chinese (中国简体), please give this issue an up-vote.

Has the community at large expressed interest in this functionality? I.e. has it gathered more than 10 up-votes or more than 10 comments? This criterion alone covers more than 650 of the 2850 open feature requests as of right now, October 9th, 2019.

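Chinese version (translated):
Hello everyone. This issue asks VS Code to remove a third-party dependency, jschardet.

jschardet outputs the expected encoding by scanning through the document's content. This mechanism works well in a purely Western-text environment. However, if the code mixes in some characters specific to Chinese, such as “中国人”, it will likely return the “Windows 1252” encoding, so the file contents cannot be displayed correctly. This defect is essentially the result of jschardet failing to strike a good balance between precision and robustness.

For this reason, I hope everyone will vote to support improving the file encoding detection logic.

I sincerely hope that everyone who has run into this problem, or who may run into it in the future, will up-vote this issue.

Your up-vote will directly affect how strongly VS Code supports this change.

According to the VS Code team's feature-request process (https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests), if the issue does not get enough up-votes, it will most likely be closed rather than addressed.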

#84503 didn't have enough up-votes.

@MxDany

MxDany commented Nov 29, 2019

It's hard to believe that this issue has existed for so long.

@MxDany

MxDany commented Dec 4, 2019

The same file works fine in these editors: Visual Studio, Notepad++, Beyond Compare, etc. No one cares. ☹

@amnore

amnore commented Dec 4, 2019

Is there a better alternative to jschardet?

@bpasero bpasero changed the title Please remove jschart. Please remove jschardet Dec 5, 2019
@bpasero bpasero added file-guess-encoding under-discussion Issue is under discussion for relevance, priority, approach labels Dec 5, 2019
@bpasero
Member

bpasero commented Dec 5, 2019

I see the issues with jschardet and we can certainly remove the support for auto guessing encodings but we can also investigate to either fix the issues or replace jschardet with something else if that is possible.

Encoding detection is currently only used in 2 places:

  • from the encoding picker of the status bar (we offer the guessed encoding as one of the top entries if guessed)
  • for any editor that is opened if files.autoGuessEncoding is set to true

Note: it is unlikely that VSCode would make an investment in improving the libraries, so this would need community help. But I am certainly willing to adopt a different library if it has good test coverage and shows improvements over the current one.

@sunbohong
Contributor Author

Thank you for your interest in this issue.

In order to provide higher-quality encoding detection, we want to collect some test cases.

A test case consists of three parts: the original file, its features, and the expected encoding type (which can be multiple encoding types).

Demo:

File1, containing only simplified Chinese characters (酷酷的哀殿), GBK

File2, containing simplified Chinese and English letters (My name is 酷酷的哀殿), GBK

File3, containing simplified Chinese and English letters (My name is 酷酷的哀殿), UTF-8

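Chinese version (translated):
Thank you all for your interest in this issue.
To provide higher-quality encoding detection, we would like to collect some test cases.
A test case should contain three parts: the original file, its features, and the expected encoding type (which can be multiple encoding types).
Demo:
file1, containing only simplified Chinese characters (酷酷的哀殿), GBK
file2, containing both simplified Chinese and English letters (My name is 酷酷的哀殿), GBK
file3, containing both simplified Chinese and English letters (My name is 酷酷的哀殿), UTF-8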
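
The three-part test case described in this comment could be captured roughly as follows. This is only a sketch: the `EncodingTestCase` shape, the `matchesExpectation` helper, and the name normalization are assumptions made for illustration, not part of jschardet or VS Code.

```typescript
// Hypothetical shape for one collected test case: file, feature, expected encodings.
interface EncodingTestCase {
  file: string;        // name or path of the original file (placeholder values here)
  feature: string;     // short description of what the file contains
  expected: string[];  // one or more acceptable encodings
}

// The demo entries from the comment above.
const file1: EncodingTestCase = {
  file: "File1",
  feature: "only simplified Chinese characters",
  expected: ["GBK"],
};
const file2: EncodingTestCase = {
  file: "File2",
  feature: "simplified Chinese and English letters",
  expected: ["GBK"],
};
const file3: EncodingTestCase = {
  file: "File3",
  feature: "simplified Chinese and English letters",
  expected: ["UTF-8"],
};

// Compare encoding names loosely, so "UTF-8", "utf8" and "Utf_8" are treated alike.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// A detection result passes if it matches any of the expected encodings.
function matchesExpectation(testCase: EncodingTestCase, detected: string): boolean {
  const wanted = normalizeName(detected);
  return testCase.expected.some((e) => normalizeName(e) === wanted);
}
```

Allowing several expected encodings per case matters because, for example, pure-ASCII content is valid under many encodings at once.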

@sunbohong sunbohong changed the title Please remove jschardet Please replace jschardet with a different library Dec 5, 2019
@bpasero
Member

bpasero commented Dec 5, 2019

I think it would also be worthwhile to look at the existing test cases for jschardet and its underlying library.

@byyxx128

byyxx128 commented Jan 28, 2020

@rebornix
Could you please help publicize the potential solution?
Many users (especially Chinese users) may have the same problem. UNFORTUNATELY, they do NOT search the existing issues before reporting, and they do NOT follow up on their issues after the initial report. Do you know what happens next? When those users hit the same problem again, they immediately report it as a new issue without any research, and again they do NOT follow up… Furthermore, non-English issues are generally closed, which also leads their authors to open plenty of duplicate issues. Consequently, although many people have the same problem, they do NOT follow this issue/solution and do NOT vote for it. 😒

@rebornix rebornix added this to the Backlog milestone Jan 29, 2020
@rebornix rebornix self-assigned this Jan 29, 2020
@rebornix
Member

rebornix commented Jan 29, 2020

Thanks for contributing to this issue and for continuing to bring awareness to it. I agree that the Chinese community might be silent on issues, and IMHO providing correct detection for Chinese (or probably even CJKV) is fundamental to our users. Thus I assigned this issue to myself and moved it to the backlog to ensure it won't get closed. @bpasero I have some capacity for CJKV support in VS Code this year, so feel free to unassign yourself if this topic doesn't fit into your plan.

As @bpasero mentioned above and in #84503 (comment), it's risky to move to another library: encoding detection is based on heuristics, so moving to another library might change the behavior significantly (for encodings whose confidence is low).

We can look into improving jschardet and then build helpers on top of it (for example, a proposal from #84503). The challenge here (if I understand correctly) is how to choose the best encoding when the content matches more than one (see aadsm/jschardet#49 (comment)).
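
To make the "content matches more than one encoding" problem concrete, here is a small sketch. It uses the GBK bytes of “中国人” (the example from the issue description) and shows that the same bytes are also a perfectly valid single-byte Western string, so validity alone cannot rule that reading out. The helper names are made up for this illustration. ISO-8859-1 decoding is used as a stand-in for windows-1252, which is safe here because the two agree for bytes 0xA0 to 0xFF.

```typescript
// "中国人" encoded as GBK: two bytes per character.
const gbkBytes = new Uint8Array([0xd6, 0xd0, 0xb9, 0xfa, 0xc8, 0xcb]);

// Returns true when the bytes are well-formed UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Decodes bytes as ISO-8859-1 by mapping each byte to the code point of the same value.
// Every byte sequence is "valid" under this decoding; nothing is ever rejected.
function decodeLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] as number);
  }
  return out;
}

const utf8Ok = isValidUtf8(gbkBytes);      // false: UTF-8 can be excluded
const asWestern = decodeLatin1(gbkBytes);  // "ÖÐ¹úÈË": a legal but meaningless Western string
```

UTF-8 can be rejected structurally, but choosing between GBK and a single-byte Western encoding needs statistics or user-supplied preferences, which is where the low-confidence guesses come from.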

I'd love to see feedback and suggestions from everyone, including encoding experts, on how we can improve the workflow for building robust encoding detection.

@bpasero
Member

bpasero commented Jan 29, 2020

+1 for contributing to jschardet and making it robust for edge cases. However, my thinking is still as it was from the beginning: there is no such thing as correct encoding detection, it can only ever be a guess with false positives. The only encoding guess you can do properly is if files include a byte-order-mark, such as UTF variants. Maybe on top of that you can detect UTF-16 looking at the byte patterns (that is what VSCode has actually implemented).
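
As a rough illustration of the two checks mentioned here, the following sketch detects a byte-order mark and, failing that, looks for the zero-byte pattern typical of mostly-ASCII UTF-16 text. It is not VS Code's actual implementation; the function names and the 50% threshold are assumptions chosen for the example.

```typescript
type BomEncoding = "utf-8" | "utf-16le" | "utf-16be";

// Returns the encoding announced by a byte-order mark, or undefined if there is none.
// UTF-32 BOMs are not handled in this sketch, so a UTF-32LE file (FF FE 00 00)
// would be misreported as UTF-16LE.
function detectBom(bytes: Uint8Array): BomEncoding | undefined {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return "utf-16be";
  }
  return undefined;
}

// Heuristic for BOM-less UTF-16: ASCII characters occupy one byte of each
// 16-bit unit and leave the other byte zero, always on the same side.
function guessUtf16ByZeros(bytes: Uint8Array): "utf-16le" | "utf-16be" | undefined {
  let evenZeros = 0;
  let oddZeros = 0;
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    if (bytes[i] === 0) evenZeros++;
    if (bytes[i + 1] === 0) oddZeros++;
  }
  const pairs = Math.floor(bytes.length / 2);
  if (pairs === 0) return undefined;
  if (oddZeros / pairs > 0.5 && evenZeros === 0) return "utf-16le";
  if (evenZeros / pairs > 0.5 && oddZeros === 0) return "utf-16be";
  return undefined;
}
```

The BOM check is exact, while the zero-byte check is already a guess: it works for text dominated by ASCII and says nothing about, for example, UTF-16 files that contain only CJK characters.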

Given all the issues we have now, I think VSCode would be better off not providing an encoding guess at all that automatically changes the encoding for the file that opens. A better experience imho would be to give a hint to the user that the encoding might be different from the selected one and let the user make the choice of changing it.

@wdtbrchan

It would be enough for me to define which encodings are to be detected.
For example: files.autoDetectEncodingsOnly: ['utf-8', 'windows-1250']
I think that solves most practical cases.
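
One way such an allow-list could work is sketched below. Note that `files.autoDetectEncodingsOnly` is only the setting name proposed in this comment, not an existing VS Code setting, and the `Candidate` type (a detector reporting several guesses with confidence scores) is an assumption made for the example.

```typescript
// Hypothetical detector output: one entry per guessed encoding.
interface Candidate {
  encoding: string;
  confidence: number; // 0..1, higher means more likely
}

// Keeps only guesses that appear in the user's allow-list and returns the
// most confident one, or undefined if none of the guesses is allowed.
function pickAllowed(candidates: Candidate[], allowed: string[]): string | undefined {
  const allowedSet = new Set(allowed.map((e) => e.toLowerCase()));
  let best: Candidate | undefined;
  for (const c of candidates) {
    if (!allowedSet.has(c.encoding.toLowerCase())) continue;
    if (best === undefined || c.confidence > best.confidence) best = c;
  }
  return best?.encoding;
}

// With the list from the comment, a confident windows-1252 guess is ignored.
const picked = pickAllowed(
  [
    { encoding: "windows-1252", confidence: 0.9 },
    { encoding: "windows-1250", confidence: 0.6 },
    { encoding: "utf-8", confidence: 0.2 },
  ],
  ["utf-8", "windows-1250"],
);
```

Returning `undefined` when nothing matches leaves room for the caller to fall back to the configured default encoding instead of a guess outside the list.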

@bpasero
Member

bpasero commented Jan 29, 2020

Yeah that is an idea that circulated a while ago and probably makes sense.

@rebornix
Member

rebornix commented Jan 29, 2020

Allowing users to configure an order/priority list can probably help. For example GB18030 is the current standard for Chinese encoding and it is compatible with legacy encodings like GB2312 and GBK so choosing GB18030 is safer than GB2312 and GBK (see comment #84503 (comment)).

We also got a lot of complaints about jschardet preferring windows-1252 over utf8 or GB18030 when users were dealing with Chinese content.

A priority list like ["GB18030", "GBK", "GB2312", "windows-1252"] can mitigate the above issues.
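
A priority list of this kind might be applied as in the sketch below. The `ScoredGuess` type, the `pickByPriority` helper, and the `minConfidence` cutoff of 0.2 are illustrative assumptions, not an existing jschardet or VS Code API.

```typescript
// Hypothetical detector output: one entry per guessed encoding.
interface ScoredGuess {
  encoding: string;
  confidence: number; // 0..1, higher means more likely
}

// Walks the user's priority list and returns the first entry that the detector
// also considers plausible. If nothing on the list was guessed, falls back to
// the most confident guess overall.
function pickByPriority(
  guesses: ScoredGuess[],
  priority: string[],
  minConfidence = 0.2,
): string | undefined {
  const plausible = guesses.filter((g) => g.confidence >= minConfidence);
  const names = new Set(plausible.map((g) => g.encoding.toLowerCase()));
  for (const name of priority) {
    if (names.has(name.toLowerCase())) return name;
  }
  let best: ScoredGuess | undefined;
  for (const g of plausible) {
    if (best === undefined || g.confidence > best.confidence) best = g;
  }
  return best?.encoding;
}

const priorityList = ["GB18030", "GBK", "GB2312", "windows-1252"];

// windows-1252 scores higher, but GB2312 comes first in the user's list.
const chosen = pickByPriority(
  [
    { encoding: "windows-1252", confidence: 0.95 },
    { encoding: "GB2312", confidence: 0.6 },
  ],
  priorityList,
);
```

The point of the design is that the user's ordering overrides the detector's confidence ranking, which is exactly the case where the windows-1252 guess currently wins for Chinese content.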

@sunbohong
Contributor Author

There was another issue that suggests "Allow to configure a list of encoding-confidences to use when guessing": #84503.

@sunbohong
Contributor Author

sunbohong commented Jan 30, 2020

> Given all the issues we have now, I think VSCode would be better off not providing an encoding guess at all that automatically changes the encoding for the file that opens. A better experience imho would be to give a hint to the user that the encoding might be different from the selected one and let the user make the choice of changing it.

When we search thousands of files in a directory, there are still many problems: a lot of results will not be shown. Letting the user make the choice of changing the encoding will not solve this.

@rebornix
Member

> When we search thousands of files, there are still many problems. A lot of results will not be shown. Letting the user make the choice of changing it will not solve this.

@sunbohong can you please elaborate a bit more on what your suggestion is here? Sorry, I didn't follow it.

@sunbohong
Contributor Author

> @sunbohong can you please elaborate a bit more on what your suggestion is here?

For example, suppose I have a folder that contains a GBK file and a UTF-8 file, and I search for a keyword.

Case 1. Entering 酷酷的哀殿 locates only the UTF-8 file in the index.

[image]

Case 2. If we open the GBK file first, the search locates both files in the index.

[image]

If there were thousands of files, we would have to open thousands of files.

@rebornix
Member

@sunbohong the issue you ran into is in the search area. Please file a separate issue, as I think we don't have encoding detection in search. cc @roblourens: is this a duplicate?

@Tomek-PL

How do I remove jschardet? I can't find it.

@roblourens
Member

Search can only work in one encoding at a time, based on files.encoding. If you open the file, then it will be searched with encoding detection if that's enabled. I think that explains the above.
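
The effect described here can be reproduced with a small sketch: the same text stored as UTF-8 and as GBK, both scanned using a single encoding (UTF-8). This only simulates the behavior; `searchAs` is a made-up helper, not the search implementation, and the query “中国人” is substituted for the one in the screenshots because its GBK bytes are well known.

```typescript
const query = "中国人";

// "Hello 中国人" stored as UTF-8.
const utf8File = new TextEncoder().encode("Hello " + query);

// The same text stored as GBK: ASCII bytes are unchanged, and each Chinese
// character takes two bytes.
const asciiPart = new TextEncoder().encode("Hello ");
const gbkChinese = new Uint8Array([0xd6, 0xd0, 0xb9, 0xfa, 0xc8, 0xcb]);
const gbkFile = new Uint8Array(asciiPart.length + gbkChinese.length);
gbkFile.set(asciiPart, 0);
gbkFile.set(gbkChinese, asciiPart.length);

// Decodes every file with one fixed encoding, then looks for the needle.
function searchAs(bytes: Uint8Array, encoding: string, needle: string): boolean {
  const text = new TextDecoder(encoding).decode(bytes);
  return text.indexOf(needle) !== -1;
}

const foundInUtf8File = searchAs(utf8File, "utf-8", query); // true
const foundInGbkFile = searchAs(gbkFile, "utf-8", query);   // false: bytes decode to replacement characters
const asciiStillFound = searchAs(gbkFile, "utf-8", "Hello"); // true: ASCII survives either way
```

This matches the two cases reported above: ASCII keywords are found in both files, while a Chinese keyword is found only in the file whose encoding matches the one used for the search.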

@rebornix rebornix removed their assignment Oct 9, 2020
@bpasero bpasero added the *out-of-scope Posted issue is not in scope of VS Code label Oct 12, 2020
11 participants