The JavaScript rendering problem #95
Support for SVG content in SingleFile/SingleFileZ is quite minimal. I agree there's room for some improvements.
Do you have an example of such a heuristic? In general, I refuse to hard-code values related to specific websites because it's a nightmare to maintain and not sustainable in the long run.
Complex figures should already be saved as static images. However, tainted canvases (https://developer.mozilla.org/en-US/docs/Web/HTML/CORS_enabled_image) can't be saved in an extension. Only browser vendors can fix this limitation.
This already exists. SingleFile/SingleFileZ removes optional tags, unnecessary quotes, and space characters. It also removes unused CSS rules and properties by computing the cascade of styles, removes unused fonts, and can detect hidden elements in order to remove them. All these optimizations are enabled by default. I highly doubt a "tidy" tool would produce a smaller file without any visual degradation; otherwise I wouldn't have implemented these optimizations.
It would have been interesting to see the saved file. With the default options of SingleFileZ, the file should not be larger than the transfer of the original page and its required files.
The purpose of SFZ is to fully capture a webpage in its pre-rendered state, effectively saving the source code of the page, or at least a reasonable intermediate representation of that code. If you wanted to save a static rendering of a page, you would simply print it to a PDF.
Normally this works okay even for pages that are client-rendered with JavaScript, because SFZ works with a DOM snapshot. This is a nice, static approach that is about as good as you can get with a page that is meant to be dynamically rendered.
The problem is that a growing number of webpages generate increasingly verbose garbage with their client-rendered JavaScript. This doesn't matter much to casual viewers (other than slightly increasing memory use), but it means that instead of a clean source document, SFZ has to work with a pile of incompressible markup. For example, some NY Times pages create SVG figures in the page with JavaScript, so every coordinate in the SVG ends up as a full-precision 64-bit JavaScript number, like 27.89319248826291. Obviously a coordinate like this is many orders of magnitude more precise than necessary for pixel-level accuracy.

I'm adding this as an issue not because it's a problem specific to SFZ, but because it would be useful to discuss potential workarounds that SFZ, other software, or the end user could employ.
- For the SVG issue specifically, SFZ could have an option to simplify SVG content by reducing numeric precision.
- SFZ could use some kind of heuristic to determine when a piece of JavaScript is used to render something in the page, and (optionally) allow it to run to regenerate the DOM in the archive instead of saving the result statically.
- SFZ could have an option to save complex figures (detected with some heuristic) as static images.
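The first suggestion above can be sketched with a few lines of Python. This is only an illustration of the idea, not anything SingleFileZ actually implements; the `round_svg_numbers` helper and the chosen precision are hypothetical.

```python
import re

# Matches decimal numbers such as "27.89319248826291" (optionally negative).
NUMBER = re.compile(r'-?\d+\.\d+')

def round_svg_numbers(svg: str, digits: int = 2) -> str:
    """Round every decimal number in an SVG string to `digits` places.

    Uses %g formatting so trailing zeros are dropped: 3.0000001 -> "3".
    A real implementation would parse the SVG and only touch coordinate
    attributes; rounding every number blindly is just a sketch.
    """
    def repl(match: re.Match) -> str:
        return '%g' % round(float(match.group(0)), digits)
    return NUMBER.sub(repl, svg)

path = '<path d="M 27.89319248826291 3.0000001 L 12.5 4.25"/>'
print(round_svg_numbers(path))
# -> <path d="M 27.89 3 L 12.5 4.25"/>
```

Two decimal places is already sub-pixel for on-screen rendering, so the visual difference should be imperceptible while the saved bytes add up quickly across thousands of coordinates.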
Some type of offline "HTML simplifier" could be written that reduces the size of HTML files. The goal should be "lossy" transparency: no noticeable visual changes to the page, even if the rendering is not in theory identical. I don't know of such a tool; the tidy tool did not change the size of the index.html file created by SFZ when I tested it.

Use case: I use SFZ to send articles to a friend who lives in Africa and has to use a slow, expensive, metered Internet connection. I download the page with SFZ, extract it to a directory, and upload it to my own server. SFZ lets me make sure that ads and unnecessary connections are stripped out, and for very unoptimized pages I might even recompress the images.
However, for some pages the result is actually larger than the transfer size of the original page and all of its required files. I discovered that for one such page, opening the index.html file in a text editor and naively using a regex to reduce number precision to three decimal digits eliminated more than half the file size. A PDF printout of the page was actually even smaller than that!