I haven't had time to dig into why this is (and whether it's intended behavior), so I'm opening this issue to look into it later. cc @stevecheckoway
Help us reproduce what you're seeing
#! /usr/bin/env ruby

$: << "lib"
require 'nokogiri'
require_relative 'test/helper'

class Test < Nokogiri::TestCase
  describe "document encoding" do
    describe "HTML4" do
      describe "given a File" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.open(SHIFT_JIS_HTML)).encoding,
          )
        end
      end

      describe "given a File and an encoding" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.open(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end

      describe "given a String" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.read(SHIFT_JIS_HTML, encoding: "Shift_JIS")).encoding,
          )
        end
      end

      describe "given a String and an encoding" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.read(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end
    end

    describe "HTML5" do
      describe "given a File" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.open(SHIFT_JIS_HTML)).encoding,
          )
        end
      end

      describe "given a File and an encoding" do
        it "should detect shift_jis" do
          # errors
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.open(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end

      describe "given a String" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.read(SHIFT_JIS_HTML, encoding: "Shift_JIS")).encoding,
          )
        end
      end

      describe "given a String and an encoding" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.read(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end
    end
  end
end
yields
Error:
document encoding::HTML5::given a File and an encoding#test_0001_should detect shift_jis:
TypeError: no implicit conversion of Hash into Integer
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5.rb:266:in `read'
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5.rb:266:in `read_and_encode'
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5/document.rb:119:in `do_parse'
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5/document.rb:95:in `parse'
./html5-document-encoding.rb:64:in `block (4 levels) in <class:Test>'
Failure:
document encoding::HTML5::given a File#test_0001_should detect shift_jis [./html5-document-encoding.rb:52]
Minitest::Assertion: Expected: "Shift_JIS"
Actual: "UTF-8"
Failure:
document encoding::HTML5::given a String and an encoding#test_0001_should detect shift_jis [./html5-document-encoding.rb:82]
Minitest::Assertion: Expected: "Shift_JIS"
Actual: "UTF-8"
Failure:
document encoding::HTML5::given a String#test_0001_should detect shift_jis [./html5-document-encoding.rb:72]
Minitest::Assertion: Expected: "Shift_JIS"
Actual: "UTF-8"
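A note on the TypeError at the top of that output: the backtrace ends in a call to `read` on the File, and the message is what Ruby raises when an instance-level `IO#read` receives a Hash where it expects a length. The class method `File.read` accepts `encoding:`, but `IO#read` only takes `(length, outbuf)`. The sketch below reproduces that message outside Nokogiri. It is only a guess at the cause based on the backtrace, and the helper names (`read_with_keyword_error`, `read_as`, `demo_io_read`) are made up for illustration. The `set_encoding` variant is one possible alternative, not necessarily the fix Nokogiri should adopt.

```ruby
require "tempfile"

# Hypothetical helper: calls IO#read with an `encoding:` keyword and returns
# the exception class if a TypeError is raised, or nil otherwise.
def read_with_keyword_error(io)
  io.read(encoding: "Shift_JIS") # the Hash is treated as the `length` argument
  nil
rescue TypeError => e
  e.class
end

# Hypothetical alternative: set the external encoding on the IO first,
# then read without arguments so the String comes back tagged.
def read_as(io, encoding)
  io.set_encoding(encoding)
  io.read
end

# Writes two Shift_JIS-encoded characters to a temp file and tries both reads.
def demo_io_read
  Tempfile.create("sjis") do |f|
    f.binmode
    f.write("\u3053\u3093".encode("Shift_JIS"))
    f.flush
    error = File.open(f.path) { |io| read_with_keyword_error(io) }
    text = File.open(f.path) { |io| read_as(io, "Shift_JIS") }
    [error, text.encoding]
  end
end
```

Calling `demo_io_read` returns the error class from the keyword call together with the encoding of the String read via `set_encoding`.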
Expected behavior
I think these should both be the same?
There are a couple of things worth digging into here:
why isn't the HTML4 EncodingReader discovering the encoding in the "given a string" case? Never mind! This is because the parser uses the encoding of the String (so long as it's not binary/ascii-8bit).
looks like passing a File and an encoding isn't supported by HTML5 (it raises an exception)
can we make the HTML5 encoding detection match the HTML4 behavior? I think we might be able to use the EncodingReader if it's helpful.
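To make the first point concrete, here is a small sketch of the rule as described: a String's own encoding is trusted as the hint unless the String is binary. This is plain Ruby, not Nokogiri code. `encoding_hint_for` and `demo_encoding_hints` are hypothetical names used only for illustration, and the real parser may apply further checks.

```ruby
require "tempfile"

# Hypothetical helper (not a Nokogiri API): returns the String's encoding name
# as a hint, or nil when the String is binary (ASCII-8BIT) and carries no hint.
def encoding_hint_for(string)
  return nil if string.encoding == Encoding::ASCII_8BIT
  string.encoding.name
end

# Writes Shift_JIS bytes to a temp file, then reads them back two ways.
def demo_encoding_hints
  Tempfile.create("sjis") do |f|
    f.binmode
    f.write("\u3053\u3093".encode("Shift_JIS"))
    f.flush
    {
      # File.read with `encoding:` tags the String, so a hint is available.
      tagged: encoding_hint_for(File.read(f.path, encoding: "Shift_JIS")),
      # File.binread returns a binary String, so there is no hint to use.
      binary: encoding_hint_for(File.binread(f.path)),
    }
  end
end
```

In the reproduction script, the "given a String" cases pass a String tagged via `File.read(..., encoding: "Shift_JIS")`, which corresponds to the `:tagged` branch here.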
Since Gumbo doesn't support anything other than UTF-8, it performs the standard encoding detection pre-scan that browsers are supposed to perform to decide on the encoding and then uses that to convert to UTF-8 to pass to Gumbo.
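Roughly, the flow looks like the sketch below: sniff an encoding from the leading bytes, then transcode to UTF-8 before handing the text to the parser. This is a heavily simplified stand-in, not the actual implementation. The real pre-scan algorithm in the HTML standard also handles BOMs, `http-equiv` content attributes, comments, and more. `sniff_and_transcode` is a hypothetical name, and the 1024-byte window and single regex are assumptions made to keep the example short.

```ruby
# Hypothetical helper, not part of Nokogiri: looks for a <meta charset=...>
# declaration in the first 1024 bytes, then transcodes the input to UTF-8.
# Returns [utf8_string, detected_source_encoding].
def sniff_and_transcode(bytes, default: Encoding::UTF_8)
  head = bytes.byteslice(0, 1024).to_s.b
  name = head[/<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-]+)/i, 1]
  source = begin
    name ? Encoding.find(name) : default
  rescue ArgumentError
    default # unknown label: fall back rather than fail
  end
  utf8 = bytes.dup.force_encoding(source)
              .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  [utf8, source]
end
```

Note that after this step the parser only ever sees UTF-8, which is why the resulting Document reports UTF-8 unless the detected source encoding is recorded separately.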
I wasn't sure what the encoding property is supposed to return, so I set it to the encoding of the strings in the Document itself rather than the encoding of the input document. If we want the latter, that is probably easy to change.
I'm not sure how this is supposed to interact with #to_html or other serialization methods though. Should those produce strings in the Document's encoding? I know Ruby has a concept of internal and external encoding for streams but I don't know how those should interact with this.
Basically, I don't know what the correct behavior around encodings is supposed to be, but since Gumbo only supports UTF-8, I punted on it.
Please describe the bug
The encoding of an HTML5 document differs from the encoding of an HTML4 document.