`htmlparser2`

htmlparser2 and its associated utilities are used to:

convert Slate JSON objects into a DOM document object (returned directly by slateToDom, or serialized to an HTML string by slateToHtml); and
parse an HTML string into a DOM document object before conversion to Slate JSON.

Rationale for using htmlparser2 and its associated utilities:

Works in all environments, including Node.js.
Speed: htmlparser2 describes itself as the fastest HTML parser.
No need to implement our own formatting/encoding logic; dom-serializer handles this.

Whitespace

The presence of whitespace in the DOM can cause layout problems and make manipulation of the content tree difficult in unexpected ways, depending on where it is located.

https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace

Developers define their own schema for Slate. For example, Payload CMS renders a default <p> HTML element for any nodes without a type. <div> elements are not mapped. This means any top-level <div> elements will be rendered as paragraphs. Furthermore, slateToHtml makes the (reasonable) assumption that these elements are paragraphs, and renders the output as such.

Because we define our own schema, it's difficult to generalize rules for handling whitespace.

It may be helpful to minify HTML before passing it to htmlToSlate. This might reduce some false positives by removing extra whitespace. Some possible solutions:

https://www.npmjs.com/package/html-minifier
https://github.com/rehypejs/rehype-minify/tree/main/packages/rehype-minify-whitespace

`htmlToSlate`

Whitespace in text nodes is processed depending on context.

Content for text marks follows an inline formatting context. Within it, htmlToSlate will:

reduce line breaks and surrounding space to a single space;
replace tabs with spaces;
reduce multiple spaces to a single space; and
remove space from the beginning and end of the contents of a block element (e.g. h1, p).

Before:

<p>
  foo<b>
    <i>bar</i></b>
</p>

After:

<p>foo<b> <i>bar</i></b></p>
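
The inline rules listed above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: collapseInlineWhitespace and normalizeBlockText are hypothetical names, and the input is a hypothetical list of the text nodes found in one paragraph (foo, then whitespace inside a b element, then bar inside an i element, then a trailing line break).

```typescript
// Hypothetical helpers for illustration only; not part of the library's API.

// Applies the first three rules to a single text node.
function collapseInlineWhitespace(text: string): string {
  return text
    .replace(/[ \t]*\r?\n[ \t]*/g, ' ') // line break plus surrounding space -> one space
    .replace(/\t/g, ' ')                // tabs -> spaces
    .replace(/ {2,}/g, ' ');            // runs of spaces -> one space
}

// Applies the fourth rule: trims the start and end of a block's content.
// `texts` holds the text nodes of one block element, in document order.
function normalizeBlockText(texts: string[]): string[] {
  const out = texts.map(collapseInlineWhitespace);
  // Strip leading space, moving forward past nodes that become empty.
  for (let i = 0; i < out.length; i++) {
    out[i] = out[i].replace(/^ +/, '');
    if (out[i] !== '') break;
  }
  // Strip trailing space, moving backward past nodes that become empty.
  for (let i = out.length - 1; i >= 0; i--) {
    out[i] = out[i].replace(/ +$/, '');
    if (out[i] !== '') break;
  }
  return out;
}

// Text nodes of a paragraph, in order: p, b, i, p.
const normalized = normalizeBlockText(['\n  foo', '\n    ', 'bar', '\n']);
```

Here normalized holds 'foo', a single space, 'bar' and an empty string, which lines up with the text content of the After markup. The sketch does not merge spaces that meet across two adjacent text nodes; a fuller version would need to track the previous node's trailing space.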

Block elements follow a block formatting context.

By default, the filterWhitespaceNodes option is set to true. This option removes nodes from the Slate JSON object that:

contain only whitespace; and
have no context.

This helps when passing an HTML string that uses line breaks, tabs, or spaces to make the document more readable. This whitespace does not add any meaning to the document, so it isn't helpful to represent it in the Slate JSON. Furthermore, depending on the schema, these nodes may be interpreted as elements or as part of the content.

For text nodes inside <code> and/or <pre> HTML elements, whitespace is preserved.

References

fb55/htmlparser2#90
aknuds1/html-to-react#79

Payload CMS

The Slate configuration for Payload CMS results in some Slate nodes being stored with an undefined type. See payloadcms/payload#1141 (comment).

Note the defaultTag option that is passed in the Payload CMS configuration for slateToHtml/slateToDom. This creates a <p> HTML element tag whenever a Slate node has an undefined type. This is consistent with the approach taken by Payload CMS: In the docs for the rich text field, the serializer example renders the <p> HTML element as the default - i.e. if no types are found. See https://github.com/payloadcms/payload/blob/master/docs/fields/rich-text.mdx.

At the moment, we cannot round-trip content through slateToHtml and htmlToSlate (in either order) and expect consistent results. This is because, with the Payload configuration, slateToHtml adds p tags, and htmlToSlate then adds these p tags into the Slate JSON.

It may be possible to resolve this by removing the p tag conversion, perhaps as something that can be specified in the configuration.

HTML entity encoding/decoding

One of the trickier parts of serializing between Slate and HTML is that Slate doesn't care about HTML entity encoding. This is expected: Slate is unaware of HTML and offers a serializer-friendly format.

Special considerations are made for htmlToSlate and slateToHtml.

`slateToHtml`

This becomes an issue for code and pre tags, where you may want to encode HTML entities in order that they display correctly.

To accommodate this, an option is available for slateToHtml: alwaysEncodeCodeEntities. If this option is true and encodeEntities is false, encodeEntities is ignored for code and pre tags, and the content within them is always encoded.

Note that in the default configuration, all HTML entities are encoded.

alwaysEncodeCodeEntities defaults to false.
encodeEntities defaults to true.
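
The interaction between the two options can be sketched as a small decision function. The option names and defaults come from the text above; shouldEncode, escapeHtml and renderText are hypothetical names used only for illustration, and the escaping is deliberately minimal (the library delegates real encoding to dom-serializer).

```typescript
// Option names are taken from the text; everything else is a hypothetical sketch.
interface EncodeOptions {
  encodeEntities: boolean;
  alwaysEncodeCodeEntities: boolean;
}

const defaultOptions: EncodeOptions = {
  encodeEntities: true,
  alwaysEncodeCodeEntities: false,
};

// Decides whether the content of an element with this tag name is encoded.
function shouldEncode(tagName: string, options: EncodeOptions = defaultOptions): boolean {
  if (options.encodeEntities) return true;
  const isCode = tagName === 'code' || tagName === 'pre';
  return options.alwaysEncodeCodeEntities && isCode;
}

// Minimal escaping of the characters that matter most in HTML text.
function escapeHtml(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Renders a single element that contains only text.
function renderText(tagName: string, text: string, options: EncodeOptions = defaultOptions): string {
  const body = shouldEncode(tagName, options) ? escapeHtml(text) : text;
  return `<${tagName}>${body}</${tagName}>`;
}
```

With encodeEntities set to false and alwaysEncodeCodeEntities set to true, this sketch encodes text inside pre and code but leaves other elements untouched. It only looks at the immediate tag name, so it does not cover text nested more deeply inside a code or pre element.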

`htmlToSlate`

htmlToSlate will always decode HTML entities. There is no option to disable this behaviour. This is because in a Slate editor, we do not expect to find any HTML entity codes. As mentioned in the introduction, Slate should be as unaware of HTML as possible.

Line breaks

`htmlToSlate`

br HTML elements get special treatment. The default configuration sets convertBrToLineBreak to true, and each br HTML element will be converted to a text node in Slate that contains \n.

If you have schema rules that process br tags (e.g. in elementTags in the configuration), you may choose to disable this behaviour by setting convertBrToLineBreak to false.
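
A rough sketch of the br handling follows. The node shapes are simplified, hypothetical stand-ins (the library itself works on htmlparser2 DOM nodes), and toSlateTexts is an illustrative name rather than a library function. Only the convertBrToLineBreak option and its default come from the text above.

```typescript
// Hypothetical, simplified node shapes for illustration.
type HtmlNode =
  | { kind: 'text'; data: string }
  | { kind: 'element'; name: string; children: HtmlNode[] };

interface SlateText {
  text: string;
}

// Flattens a list of HTML nodes into Slate text nodes.
function toSlateTexts(nodes: HtmlNode[], convertBrToLineBreak = true): SlateText[] {
  const out: SlateText[] = [];
  for (const node of nodes) {
    if (node.kind === 'text') {
      out.push({ text: node.data });
    } else if (node.name === 'br') {
      // With the option on, a br becomes a text node containing "\n".
      // With it off, this sketch simply skips the br; in practice the
      // caller's own schema rules (e.g. elementTags) would handle it.
      if (convertBrToLineBreak) out.push({ text: '\n' });
    } else {
      out.push(...toSlateTexts(node.children, convertBrToLineBreak));
    }
  }
  return out;
}

const brExample: HtmlNode[] = [
  { kind: 'text', data: 'a' },
  { kind: 'element', name: 'br', children: [] },
  { kind: 'text', data: 'b' },
];
const slateTexts = toSlateTexts(brExample);
```

Here slateTexts contains three text nodes: 'a', a line break, and 'b'.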

`slateToHtml`

Line breaks get special treatment. When convertLineBreakToBr is set to true, each \n line break in a Slate text node is converted to an HTML <br> element.
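
The reverse direction can be sketched on plain strings. lineBreaksToBr is a hypothetical helper, not a library function, and it ignores entity encoding to keep the example short. The text above does not state a default for convertLineBreakToBr, so the sketch takes it as a required argument.

```typescript
// Hypothetical helper for illustration; works on the text of one Slate text node.
function lineBreaksToBr(text: string, convertLineBreakToBr: boolean): string {
  if (!convertLineBreakToBr) return text;
  // Each "\n" becomes one <br> element; the surrounding text is left as-is.
  return text.split('\n').join('<br>');
}

const withBr = lineBreaksToBr('a\nb\n', true);
```

In this sketch withBr is 'a<br>b<br>', while passing false returns the text unchanged.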
