Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction #27

sebastian-nagel · 2023-06-24T09:06:37Z

The WARC-Target-URI (from the WARC file) https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html loses all non-ASCII characters during WAT/WET extraction. Here is the corresponding record in the WAT file:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://esfsport.ir/1173------.html
...

...,"WARC-Target-URI":"https://esfsport.ir/1173------.html"}}}

These URLs result from redirects which are deliberately not normalized. To address the issue:

use URI.toASCIIString() when writing WARC files - URI.toString() converts the URI to a string without percent-encoding the Unicode characters
try to fix the WAT/WET extractor to cope with these URLs
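
The difference between the two java.net.URI methods named in the first item can be sketched as follows. This is a minimal illustration, not the actual WARC writer code: the class name TargetUriDemo and the helpers toAsciiTargetUri and isAscii are hypothetical and exist only for this example.

```java
import java.net.URI;

public class TargetUriDemo {

    /** Hypothetical helper: the form proposed for the WARC-Target-URI header. */
    static String toAsciiTargetUri(String url) {
        // toASCIIString() percent-encodes every non-ASCII character as UTF-8 octets.
        return URI.create(url).toASCIIString();
    }

    /** Hypothetical helper: true if the string contains only US-ASCII characters. */
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        String url = "https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html";
        URI uri = URI.create(url);

        // toString() returns the URI with the Unicode characters left as they are.
        String raw = uri.toString();
        // toASCIIString() returns an ASCII-only, percent-encoded form.
        String ascii = toAsciiTargetUri(url);

        System.out.println("toString():      " + raw);
        System.out.println("toASCIIString(): " + ascii);
        System.out.println("raw is ASCII:    " + isAscii(raw));
        System.out.println("ascii is ASCII:  " + isAscii(ascii));
    }
}
```

Because the percent-encoded form contains only ASCII characters, a downstream reader that drops or mangles non-ASCII bytes should no longer alter the URI. Decoding the percent-encoded path as UTF-8 recovers the original characters.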

Quick estimate of the impact of this bug: < 0.05% of WAT/WET records

sebastian-nagel added the bug label Jun 24, 2023

sebastian-nagel mentioned this issue Jun 24, 2023

WARC writer: use URI.toASCIIString() instead of URI.toString() commoncrawl/nutch#20

Closed
