[FIR-125] feat(scrapeURL): add encodeRawHTML transformer for charset handling #1019

Open
wants to merge 1 commit into
base: main

Conversation

ftonato
Collaborator

@ftonato ftonato commented Dec 28, 2024

No description provided.

@ftonato ftonato requested a review from mogery December 28, 2024 00:07
@ftonato ftonato force-pushed the feat/encode-to-utf8 branch from 420506f to 652f209 Compare December 28, 2024 00:07
      return document;
    }
  } catch (error) {
    throw new Error("Failed to convert rawHtml to UTF-8");
Member

Always add the source error to the error object, so we can see the true error in the logs.

Suggested change
-    throw new Error("Failed to convert rawHtml to UTF-8");
+    throw new Error("Failed to convert rawHtml to UTF-8", { cause: error });
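
For context, here is a minimal sketch of what the `cause` option preserves. The `decodeOrThrow` helper is hypothetical and not part of this PR; it only assumes a runtime with ES2022 `Error` cause support and the standard `TextDecoder`.

```typescript
// Hypothetical helper for illustration only; not code from this PR.
// Wraps the decode step and keeps the original error reachable via `cause`.
function decodeOrThrow(data: Uint8Array, charset: string): string {
  try {
    // The TextDecoder constructor throws a RangeError for an unknown label.
    return new TextDecoder(charset).decode(data);
  } catch (error) {
    // Without `cause`, the RangeError (and its message) would be lost here.
    throw new Error("Failed to convert rawHtml to UTF-8", { cause: error });
  }
}

try {
  decodeOrThrow(new Uint8Array([0x41]), "not-a-real-charset");
} catch (e) {
  if (e instanceof Error) {
    // The wrapper message plus the underlying error both reach the logs.
    console.error(e.message, "caused by:", e.cause);
  }
}
```

With the cause attached, a logger that serializes `error.cause` shows the underlying failure (for example, an unsupported charset label) instead of only the generic wrapper message.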

Comment on lines +48 to +55
  try {
    // Convert the response data if charset is not UTF-8
    if (charset.toUpperCase() !== "UTF-8") {
      const decoder = new TextDecoder(charset);
      document.rawHtml = decoder.decode(data);

      return document;
    }
Member

@mogery mogery Dec 28, 2024

The approach is good, but I think it needs to move earlier in the stack: once the response has been decoded as UTF-8, the damage is too destructive to recover usable Shift_JIS from the parsed string alone. We need to do the conversion at a point where we still have access to the raw buffer coming out of the network stack. Let's talk about this on Monday.
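
To illustrate the point about the raw buffer, here is a rough sketch, not the PR's implementation. The `decodeRawBuffer` name and the sample bytes are assumptions for the example, and it assumes a runtime whose `TextDecoder` supports the `shift_jis` label (for instance Node.js built with full ICU).

```typescript
// Shift_JIS bytes for a short Japanese string, as they might arrive from the network.
const sjisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Decoding as UTF-8 first replaces every invalid byte sequence with U+FFFD.
const misdecoded = new TextDecoder("utf-8").decode(sjisBytes);

// Re-encoding the damaged string does not give the original bytes back,
// so no later transformer can recover the Shift_JIS content from it.
const reencoded = new TextEncoder().encode(misdecoded);
const bytesSurvived =
  reencoded.length === sjisBytes.length &&
  reencoded.every((byte, i) => byte === sjisBytes[i]);

// Hypothetical helper: decode while the raw bytes are still available.
function decodeRawBuffer(data: Uint8Array, charset: string): string {
  return new TextDecoder(charset).decode(data);
}

console.log("contains U+FFFD:", misdecoded.includes("\uFFFD"));
console.log("original bytes recoverable:", bytesSurvived);
console.log("decoded from raw buffer:", decodeRawBuffer(sjisBytes, "shift_jis"));
```

The design consequence is that the charset has to be known (from the `Content-Type` header or a `<meta>` tag) before the body is first turned into a string, which is why the conversion belongs next to the network layer rather than in a later transformer.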
