Please replace jschardet with a different library #85480
Comments
It's hard to believe that this issue has existed for so long. |
The same file opens fine in other editors (Visual Studio, Notepad++, Beyond Compare, etc.), yet no one seems to care. ☹ |
Is there a better alternative than jschardet? |
I see the issues. Encoding detection is currently used in only 2 places:
Note: it is unlikely that VSCode would make an investment in improving the libraries, so this would need community help. But I am certainly willing to adopt a different library if it has good test coverage and shows improvements over the current one. |
Thank you for your interest in this issue. To improve the quality of encoding detection, we want to collect some test cases. Each test case consists of three parts: the original file, its characteristics, and the expected encoding (which may be more than one encoding).
Demo:
File1, containing only simplified Chinese characters (酷酷的哀殿), GBK
File2, containing simplified Chinese and English letters (My name is 酷酷的哀殿), GBK
File3, containing simplified Chinese and English letters (My name is 酷酷的哀殿), UTF-8
Thanks, everyone, for caring about this issue. |
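For illustration, the three-part test case described above could be captured in a small harness. This is only a sketch in Python, not code from VS Code or jschardet: `EncodingTestCase`, `DEMO_CASES`, and `failing_cases` are hypothetical names, the third demo file is called File3 here so that names stay unique, and the detector is assumed to be any function that maps bytes to an encoding name.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class EncodingTestCase:
    name: str                 # label for the sample file
    content: bytes            # the original file, as raw bytes
    feature: str              # what makes this sample interesting
    expected: FrozenSet[str]  # acceptable encodings (may be more than one)


# The three demo files from the comment above, built in memory.
DEMO_CASES = [
    EncodingTestCase("File1", "酷酷的哀殿".encode("gbk"),
                     "only simplified Chinese characters", frozenset({"gbk"})),
    EncodingTestCase("File2", "My name is 酷酷的哀殿".encode("gbk"),
                     "simplified Chinese and English letters", frozenset({"gbk"})),
    EncodingTestCase("File3", "My name is 酷酷的哀殿".encode("utf-8"),
                     "simplified Chinese and English letters", frozenset({"utf-8"})),
]


def failing_cases(detect: Callable[[bytes], str],
                  cases: List[EncodingTestCase]) -> List[str]:
    """Return the names of the cases where the detector's answer is not accepted."""
    return [c.name for c in cases if detect(c.content).lower() not in c.expected]
```

Any candidate library could then be wrapped in a `detect` function and compared against the current one on the same collected files.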
I think it would also be worthwhile looking at the existing test cases for jschardet and its underlying library. |
@rebornix |
Thanks for contributing to this issue and for continuing to bring awareness to it. I agree that the Chinese community might be silent on issues, and IMHO providing correct detection for Chinese (or probably even CJKV) is fundamental to our users. I have therefore assigned this issue to myself and moved it to the backlog to ensure it won't get closed.
@bpasero I have some capacity for CJKV support in VS Code this year, so feel free to unassign yourself if this topic doesn't fit into your plan.
As @bpasero mentioned above and in #84503 (comment), it's risky to move to another library: encoding detection is based on heuristics, so a different library might change the behavior significantly (for encodings whose confidence is low). We can look into improving jschardet and then build helpers on top of it (for example, the proposal from #84503). The challenge here (if I understand correctly) is how to choose the best encoding when the content matches more than one (see aadsm/jschardet#49 (comment)).
I'd love to see feedback and suggestions from everyone, including encoding experts, on how we can improve the workflow for building robust encoding detection. |
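To make the "matches more than one encoding" problem concrete, here is a minimal Python sketch that lists every candidate encoding under which a byte sequence decodes without error. `decodable_as` is a hypothetical helper, not part of jschardet or VS Code, and strict decoding is a much cruder test than the statistical heuristics a real detector uses.

```python
from typing import Iterable, List


def decodable_as(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate encodings under which `data` decodes strictly."""
    valid = []
    for encoding in candidates:
        try:
            data.decode(encoding)  # strict mode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        valid.append(encoding)
    return valid


# GBK bytes of the sample text used earlier in this thread.
sample = "酷酷的哀殿".encode("gbk")
print(decodable_as(sample, ["utf-8", "gbk", "cp1252", "latin-1"]))
```

The GBK bytes are rejected as UTF-8 but are accepted as both GBK and Windows-1252, so validity alone cannot pick a winner; the detector has to rank the survivors, and that ranking is where low-confidence guesses go wrong.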
+1 for contributing to
Given all the issues we have now, I think VSCode would be better off not automatically applying an encoding guess to a file when it opens. A better experience imho would be to hint to the user that the encoding might differ from the selected one, and let the user decide whether to change it. |
It would be enough for me to define which encodings are to be detected. |
Yeah that is an idea that circulated a while ago and probably makes sense. |
Allowing users to configure an order/priority list can probably help. For example We also got a lot of complaints about A priority list like |
There was another issue that suggested "Allow to configure a list of encoding-confidences to use when guessing" (#84503). |
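As a rough sketch of how a user-configured list of encodings and confidences might sit on top of an existing detector: the function below takes the detector's guesses and a mapping from encoding name to minimum confidence, where the mapping's order doubles as the priority order. Everything here is hypothetical (`pick_encoding`, the shape of the guesses, and the setting itself); it is not the design proposed in #84503, only one way to read the idea.

```python
from typing import Dict, List, Optional, Tuple


def pick_encoding(guesses: List[Tuple[str, float]],
                  min_confidence: Dict[str, float]) -> Optional[str]:
    """Pick the first configured encoding whose guessed confidence is high enough.

    guesses: (encoding, confidence) pairs from some detector, in any order.
    min_confidence: user setting; insertion order is the priority order, and
    encodings that are not listed are never picked.
    """
    confidence_by_name = {name.lower(): conf for name, conf in guesses}
    for encoding, threshold in min_confidence.items():
        confidence = confidence_by_name.get(encoding.lower())
        if confidence is not None and confidence >= threshold:
            return encoding
    return None  # no acceptable guess: keep the default encoding


# Hypothetical detector output for a GBK file that was mistaken for Windows-1252.
guesses = [("windows-1252", 0.95), ("GB2312", 0.60)]
print(pick_encoding(guesses, {"gb2312": 0.5, "utf-8": 0.9}))
```

With this configuration the high-confidence `windows-1252` guess is ignored because the user never listed it, which also covers the simpler request of only defining which encodings may be detected.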
When I search a directory
When we search thousands of files, there are still many problems. A lot of results will not be shown. |
@sunbohong can you please elaborate a bit more on your suggestion here? Sorry, I didn't quite follow it. |
For example, if I have a folder that contains a GBK file and a UTF-8 file: Case 1. You can locate the UTF-8 file in the index by entering 酷酷的哀殿. Case 2. If we open the GBK file, you can locate both files in the index. If there were thousands of files, we would have to open thousands of files. |
@sunbohong the issue you ran into is related to the search area. Please file a separate issue, as I think we don't have encoding detection in search. cc @roblourens is this a duplicate? |
How do I remove jschardet? I can't find it. |
Search can only work in one encoding at a time, based on |
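The effect described in the search comments can be reproduced with a small byte-level sketch in Python. It is an illustration only and does not reflect how VS Code's search is implemented; `search_bytes` and the file names are made up. The point is that a query has to be encoded before it can be matched against file bytes, and the same text has different bytes in GBK and UTF-8.

```python
from typing import Dict, List


def search_bytes(files: Dict[str, bytes], query: str, encoding: str) -> List[str]:
    """Return the names of files whose raw bytes contain the encoded query."""
    needle = query.encode(encoding)
    return [name for name, data in files.items() if needle in data]


# Two files with identical text, saved in different encodings.
files = {
    "gbk.txt": "My name is 酷酷的哀殿".encode("gbk"),
    "utf8.txt": "My name is 酷酷的哀殿".encode("utf-8"),
}

print(search_bytes(files, "酷酷的哀殿", "utf-8"))  # finds only the UTF-8 file
print(search_bytes(files, "酷酷的哀殿", "gbk"))    # finds only the GBK file
print(search_bytes(files, "name", "utf-8"))        # ASCII text matches both
```

Finding the Chinese text in both files in a single pass would require either searching once per encoding or detecting each file's encoding first, which brings the detection problem back.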
Please remove jschardet. Many people in the community have reported problems with file encoding detection. These problems are caused by the design of jschardet. In addition, they are labeled as "upstream", which means they are not handled effectively, and some of them are even closed directly.
https://github.com/aadsm/jschardet/issues
#23570
#24195
#27419
#33720
#36230
#36951
#41393
#47872
#51125
#52132
#61663
#64419
#64931
As https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests points out, we need more upvotes. If you have had trouble with Simplified Chinese files, please give this issue an upvote.
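Chinese version (translated):
Hello everyone, this issue asks VS Code to remove a third-party dependency, jschardet.
jschardet produces an expected encoding by scanning the content of the document. This mechanism works well in purely Western text environments. However, if the code contains some Chinese-specific characters, such as "中国人", it is likely to return "Windows 1252" as the encoding, so the file content cannot be displayed correctly. This defect is essentially the result of jschardet failing to balance "precision" against "robustness".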
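A short Python sketch shows why a wrong "Windows 1252" guess is so damaging for the "中国人" example. Python's `cp1252` codec stands in for Windows-1252 here; this demonstrates the byte-level effect only and says nothing about jschardet's internals.

```python
text = "中国人"

# What is stored on disk when the file is saved as GBK: two bytes per character.
raw = text.encode("gbk")

# What an editor shows if those bytes are interpreted as Windows-1252:
# every byte becomes one accented Latin character, and no error is raised.
misread = raw.decode("cp1252")
print(len(raw), repr(misread))

# The bytes themselves are intact, so reopening with the right encoding recovers the text.
recovered = misread.encode("cp1252").decode("gbk")
print(recovered == text)
```

Because the misinterpretation is silent, the user only sees garbled text, and editing and saving in that state risks writing the garbled characters back to the file.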
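For this reason, I hope everyone will vote to support improving the file encoding detection logic.
I sincerely hope that everyone who has run into this problem, or may run into it in the future, will upvote this issue.
Your upvotes directly affect how much support VS Code gives to this change.
According to the VS Code team's feature-request process, https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests , if this issue does not receive enough upvotes it will most likely be closed rather than handled.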