Not encoding UTF-8 correctly #18
Comments
I tried DomCrawler, and it handles UTF-8 encoding without any problem. Code:
Result:
And I found that they have already fixed this bug.
I am trying to fix this problem. #19
I think it can help you.
Any update on this issue? Do we still have a problem with UTF-8? If it does exist, this is a huge problem, since most sites use UTF-8 anyhow.
Same problem here, in v1.3.
I just decode entities after the save: $html = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');
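A minimal sketch of what this workaround does, assuming the saved markup contains numeric entities for the non-ASCII text. The sample string is hypothetical and not taken from the thread. Note that `html_entity_decode()` also decodes named entities such as `&gt;` and `&amp;`, while `ENT_NOQUOTES` leaves `&quot;` alone.

```php
<?php
// Hypothetical saved output: numeric entities for the Chinese text, plus
// named entities that were in the source on purpose.
$saved = '<p>&#32593;&#21451; 1 &gt; 0 &amp;&amp; &quot;ok&quot;</p>';

// The workaround from the comment above: decode entities after saving.
$decoded = html_entity_decode($saved, ENT_NOQUOTES, 'UTF-8');

// The numeric entities become UTF-8 characters, but &gt; and &amp; are
// decoded as well; only the quote entities are left untouched.
echo $decoded, PHP_EOL;
```

If the document relies on escaped `&gt;` or `&amp;` in text, the decoded result may no longer be valid markup.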
It seems the underlying problem is that symfony/dom-crawler switches to entities to avoid some other bugs:
I think this is not a good idea, because it decodes all entities in the HTML (for example, I have a larger document where entities such as &gt; may be used on purpose). I found a solution which works for me, changing the line: htmlpagedom/src/HtmlPageCrawler.php Line 887 in 563bc7a
to:
based on https://stackoverflow.com/a/20675396
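The changed line itself is not shown in the thread. As a sketch only, and not the actual HtmlPageCrawler change, one DOMDocument technique that avoids entity escaping is to pass a node to `saveHTML()`. The example below assumes the document declares its charset with an `http-equiv` meta tag (chosen here for portability across libxml versions) and illustrates that serializing from the root element leaves out the DOCTYPE, because the DOCTYPE is a sibling of `<html>` rather than a child.

```php
<?php
// Assumption: a small stand-in document, not the test script from the thread.
$html = '<!DOCTYPE html><html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>t</title></head><body><p>网友</p></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

// Whole document: includes the DOCTYPE.
$whole = $doc->saveHTML();

// Only the <html> subtree: text is written as UTF-8, but the DOCTYPE node
// is not part of this subtree and is therefore missing.
$fromRoot = $doc->saveHTML($doc->documentElement);

echo $whole, PHP_EOL, $fromRoot, PHP_EOL;
```

If the DOCTYPE is needed, it would have to be emitted separately, for example by serializing `$doc->doctype` when it is not null.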
But for some reason my solution removes the DOCTYPE in this test script:
<?php
$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
</head>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body>
</html>
EOF;
use \Wa72\HtmlPageDom\HtmlPageCrawler;
$document = new HtmlPageCrawler($html);
echo "--- HtmlPageDom -----------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $document->saveHTML();
echo PHP_EOL;
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
echo "--- DomCrawler ------------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $crawler->html();
This is the output:
I made a small test script to compare various DOM parsers - https://github.com/havran/php-html-parsers-test The old simplehtmldom still seems to be the best :-).
Code to reproduce:
Result:
Expected Result:
This is a known bug in PHP's DOMDocument. Here is the reference:
http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly
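The linked question concerns `DOMDocument::loadHTML()` misreading UTF-8 input when no encoding is declared. A minimal sketch of one common workaround, assuming the mbstring extension is available: convert non-ASCII characters to numeric entities before loading, so the parser never has to guess the encoding. The fragment and variable names are illustrative only.

```php
<?php
// Hypothetical input fragment containing UTF-8 text.
$fragment = '<p>网友</p>';

// Replace every character from U+0080 upward with a decimal numeric entity.
$safe = mb_encode_numericentity($fragment, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<html><body>' . $safe . '</body></html>');

// The parser resolves the entities, so the DOM holds the original characters.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $safe, PHP_EOL, $text, PHP_EOL;
```

This fixes loading only; whether the characters come back as entities on save depends on how the document is serialized.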