Preserve whitespace of removed "empty" elements #48

Merged
newsch merged 2 commits into main from nbsp on Jul 8, 2024

Collaborator

@newsch newsch commented Jul 8, 2024

The first commit removes the pretty-printing from the test examples and adds a lot of noise to the diff.

Some articles use non-breaking spaces between quantities and units, which Wikipedia seems to wrap with a span. Elements with no or whitespace-only text were previously removed to prune <link>s and parents of other removed elements.

This fix preserves the internal whitespace of elements that would otherwise be removed for being "empty". It does not distinguish between "meaningful" whitespace and padding between elements that would otherwise be collapsed by HTML formatting rules. It also cannot distinguish between elements that started with only whitespace and nodes that now contain only whitespace after previous steps. The preserved whitespace in the latter case is unlikely to remain because of later processing steps.

Fixes #47, fixes organicmaps/organicmaps#8651
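
As a rough illustration of the behavior described above (a sketch, not the code in this PR), the pruning step can be modeled on a toy tree. The `Node` enum and the `text_of` and `prune_empty` helpers are hypothetical names invented for this example; the real change operates on the parser's own tree types. An element whose text is empty or whitespace-only is dropped, but any whitespace it contained is kept in its place as a text node.

```rust
/// Toy DOM node, a hypothetical stand-in for the parser's real tree types.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Concatenate all text found beneath a node.
fn text_of(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { children, .. } => children.iter().map(text_of).collect(),
    }
}

/// Drop elements whose text is empty or whitespace-only, keeping any
/// whitespace they held as a plain text node in the same position.
fn prune_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Text(t) => out.push(Node::Text(t)),
            Node::Element { tag, children } => {
                // Prune children first so parents of removed elements
                // are themselves seen as "empty".
                let children = prune_empty(children);
                let el = Node::Element { tag, children };
                let text = text_of(&el);
                // `char::is_whitespace` also matches U+00A0 (nbsp).
                if text.chars().all(char::is_whitespace) {
                    if !text.is_empty() {
                        out.push(Node::Text(text));
                    }
                } else {
                    out.push(el);
                }
            }
        }
    }
    out
}
```

With an input like `5,069<span>&nbsp;</span>ft<link>`, this sketch removes both the `<span>` and the `<link>`, but the non-breaking space survives as a text node between `5,069` and `ft`. Like the real fix, it makes no attempt to tell "meaningful" whitespace from padding.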

newsch added 2 commits July 8, 2024 13:44
Whitespace behavior is different between Html::html and this
half-working pretty printer. Now the tests match the parser output
exactly.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap with a span. Elements with no or
whitespace-only text were previously removed to prune `<link>`s and
parents of other removed elements.

This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and nodes that now contain
only whitespace after previous steps. The preserved whitespace in the
latter case is unlikely to remain because of later processing steps.

Fixes #47, fixes organicmaps/organicmaps#8651

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Member

@biodranik biodranik left a comment

Thanks!

<li>Dologorukovskaya (Subatkan) yayla</li>
<li>Demirci yayla</li>
<li>Qarabiy yayla</li></ul><h2>Highest peaks</h2><p>The Crimea's highest peak is the Roman-Kosh (Ukrainian: <span lang="uk">Роман-Кош</span>; Russian: <span lang="ru">Роман-Кош</span>, Crimean Tatar: <span lang="crh">Roman Qoş</span>) on the Babugan Yayla at 1,545 metres (5,069&nbsp;ft). Other important peaks over 1,200 metres include:</p><ul><li>Demir-Kapu (Ukrainian: <span lang="uk">Демір-Капу</span>, Russian: <span lang="ru">Демир-Капу</span>, Crimean Tatar: <span lang="crh">Demir Qapı</span>) 1,540 m in the Babugan Yayla;</li>
Member

Would it save a bit more space if the nbsp were encoded directly as the literal U+00A0 character, instead of &nbsp;?

Collaborator Author

It would, but we don't control that part of the writing. html5ever converts the literal to the escaped version, I assume because it is part of the serialization spec.
It's possible to write another Serializer, like the pretty-printer, that minifies instead, but I haven't figured out the whitespace collapsing rules well enough to write one. There aren't any crates that implement an html5ever::Serializer minifier, so adding an external minifier would require re-parsing the HTML.
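
For reference, the text escaping in the HTML serialization algorithm can be sketched as below. This is an illustrative re-implementation of the spec's rules for text nodes, not html5ever's actual code, and `escape_text` is a hypothetical name. It shows why the literal non-breaking space comes out as `&nbsp;`, and what that costs: the entity is 6 bytes, while U+00A0 is 2 bytes in UTF-8.

```rust
/// Escape text-node content following the HTML serialization rules
/// for text (not attribute values): `&`, U+00A0, `<` and `>` are replaced.
/// Hypothetical helper for illustration only.
fn escape_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            // The literal non-breaking space is written as an entity.
            '\u{a0}' => out.push_str("&nbsp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}
```

Calling `escape_text("5,069\u{a0}ft")` yields `5,069&nbsp;ft`, so each preserved non-breaking space costs 4 more bytes than the raw character would.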

<p>The <b>Crimean Mountains</b> (Crimean Tatar: <span lang="crh">Qırım dağları</span>; Ukrainian: <span lang="uk">Кримські гори</span>; Russian: <span lang="ru">Крымские горы</span>; Turkish: <i lang="tr">Yayla Dağları</i>) or <b>Yayla Mountains</b> are a range of mountains running parallel to the south-eastern coast of Crimea, between about 8–13 kilometers (5–8 miles) from the sea. Toward the west, the mountains drop steeply to the Black Sea, and to the east, they change slowly into a steppe landscape.</p><p>The Crimean Mountains consist of three subranges. The highest is the Main Range, which is subdivided into several yaylas or mountain plateaus (<i>yayla</i> or <i>yaylak</i> is Turkic for "alpine meadow"). They are:</p><ul><li>Baydar yayla</li>
<li>Ai-Petri yayla</li>
Member

Is there any minification used later, to remove unnecessary line endings for final HTML pages?

Collaborator Author

See above - we don't do a proper minification step, so whitespace within elements is left in the output.

@newsch newsch merged commit bab29c0 into main Jul 8, 2024
1 check passed
@newsch newsch deleted the nbsp branch July 8, 2024 21:13
Successfully merging this pull request may close these issues.

Wikipedia text discards &nbsp;
2 participants