HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.

Install

$ npm install parse5

Which API should I use?

"I need to parse an HTML string"

Use the parse5.parse method.

"I need to parse an HTML fragment string" or "I need to implement an `innerHTML` setter"

Use the parse5.parseFragment method.

"I need to serialize a node to an HTML string"

Use the parse5.serialize method.

"I need to parse HTML streamed from the network or from a file" or "I need to implement `<script>` execution and `document.write`"

Use the parse5.ParserStream class.

"I don't need a document tree, but I need basic information about tags or attributes" or "I need to extract text content from a huge number of documents" or "I need to analyze content that passes through my proxy server"

Use the parse5.SAXParser class.

"I need to implement my own tree format"

Implement the TreeAdapter interface, then pass it to the parser or serializer via the treeAdapter option.

TypeScript definitions

The parse5 package includes a TypeScript definition file, so you don't need to install any typings to use parse5 in TypeScript files. Note that because parse5 supports multiple output tree formats, you need to manually cast generic node interfaces to the appropriate tree format to access their properties:

import * as parse5 from 'parse5';

// Using default tree adapter.
var document1 = parse5.parse('<div></div>') as parse5.AST.Default.Document;

// Using htmlparser2 tree adapter.
var document2 = parse5.parse('<div></div>', {
    treeAdapter: parse5.treeAdapters.htmlparser2
}) as parse5.AST.HtmlParser2.Document;

You can find documentation for the interfaces in the API reference.

FAQ

Q: I want to work with my own document tree format. How can I do this?

You can create a custom tree adapter, so that parse5 can work with your own DOM-tree implementation. Then pass it to the parser or serializer via the treeAdapter option:

const parse5 = require('parse5');

const myTreeAdapter = {
   //Adapter methods...
};

const document = parse5.parse('<div></div>', { treeAdapter: myTreeAdapter });

const html = parse5.serialize(document, { treeAdapter: myTreeAdapter });

Refer to the API reference for the description of methods that should be exposed by the tree adapter, as well as links to their default implementation.

Q: How can I use parse5 in the browser?

Compile it with browserify and you're set.

Q: I'm parsing `<img src="foo">` with the `SAXParser` and I expect the `selfClosing` flag to be `true` for the `<img>` tag. But it's not. Is there something wrong with the parser?

No. A self-closing tag is a tag that has a / before the closing bracket, for example <br/> or <meta/>. In the provided example, the tag simply doesn't have an end tag. The parser treats self-closing tags and tags without end tags differently: for a self-closing tag, the parser does not look for the corresponding closing tag and expects the element to have no content. If a start tag is not self-closing, the parser treats everything that follows it (with a few exceptions) as the element's content. However, if the start tag is in the list of void elements, the parser expects the corresponding element to have no content and behaves as if the element were self-closing. So, semantically, for a void element a self-closing tag and a tag without a closing tag are equivalent, but this is not true for other tags.

TL;DR: selfClosing is part of the lexical information and is set only if the tag has a / before the closing bracket in the source code.

Q: I have some weird output from the parser, seems like a bug.

Most likely, it's not. There are a lot of weird edge cases in the HTML5 parsing algorithm, e.g.:

<b>1<p>2</b>3</p>

will be parsed as

<b>1</b><p><b>2</b>3</p>

Just try it in the latest version of your browser before submitting an issue.

Version history

3.0.2

Fixed: location.startTag is not available if the end tag is missing (GH #181).

3.0.1

Fixed: MarkupData.Location.col description in the TypeScript definition file (GH #170).

3.0.0

Added: parse5 now ships with TypeScript definitions from which the new documentation website is generated (GH #125).
Added: PlainTextConversionStream (GH #135).
Updated: Significantly reduced initial memory consumption (GH #52).
Updated (breaking): Added support for limited quirks mode. The document.quirksMode property was replaced with the document.mode property, which can have the values 'no-quirks', 'quirks' and 'limited-quirks'. The tree adapter setQuirksMode and isQuirksMode methods were replaced with the setDocumentMode and getDocumentMode methods (GH #83).
Updated (breaking): AST collections (e.g. the attributes dictionary) no longer have a prototype (GH #119).
Updated (breaking): Doctype is now always serialized as <!DOCTYPE html> as per spec (GH #137).
Fixed: Incorrect line for __location.endTag when the start tag contains newlines (GH #166) (by @webdesus).

2.2.3

Fixed: Incorrect LocationInfo.endOffset for non-implicitly closed elements (refix for GH #109) (by @wooorm).

2.2.2

Fixed: Incorrect location info for text in SAXParser (GH #153).
Fixed: Incorrect LocationInfo.endOffset for implicitly closed <p> element (GH #109).
Fixed: Infinite input data buffering in streaming parsers. Parsers now try not to buffer more than 64K of input data. There are still some edge cases that lead to significant memory consumption, but they are quite exotic and extremely rare in the wild (GH #102, GH #130).

2.2.1

Fixed: SAXParser HTML integration point handling for adjustable SVG tags.
Fixed: SAXParser now adjusts SVG tag names for end tags.
Fixed: Location info line calculation on tokenizer character unconsumption (by @ChadKillingsworth).

2.2.0

SAXParser (by @RReverser):
Fixed: Handling of \n in <pre>, <textarea> and <listing>.
Fixed: Tag names and attribute names adjustment in foreign content (GH #99).
Fixed: Handling of <image>.
Latest spec changes:
Updated: <isindex> no longer has special handling (GH #122).
Updated: Adoption agency algorithm now preserves lexical order of text nodes (GH #129).
Updated: <menuitem> now behaves like <option>.
Fixed: Element nesting corrections now take namespaces into consideration.

2.1.5

Fixed: ParserStream accidentally hangs up on scripts (GH #101).

2.1.4

Fixed: Keep ParserStream sync for the inline scripts (GH #98 follow up).

2.1.3

Fixed: Synchronously calling resume() leads to crash (GH #98).

2.1.2

Fixed: SAX parser silently exits on big files (GH #97).

2.1.1

Fixed: location info not attached for empty attributes (GH #96) (by @yyx990803).

2.1.0

Added: location info for attributes (GH #43) (by @sakagg and @yyx990803).
Fixed: parseFragment with locationInfo regression when parsing <template> (GH #90) (by @yyx990803).

2.0.2

Fixed: yet another case of incorrect parseFragment arguments fallback (GH #84).

2.0.1

Fixed: parseFragment arguments processing (GH #82).

2.0.0

Added: ParserStream with scripting support (GH #26).
Added: SerializerStream (GH #26).
Added: Line/column location info (GH #67).
Updated (breaking): Location info properties start and end were renamed to startOffset and endOffset respectively.
Updated (breaking): SimpleApiParser was renamed to SAXParser.
Updated (breaking): SAXParser is now a transform stream (GH #26).
Updated (breaking): SAXParser handler subscription is now done via events.
Added: SAXParser.stop() (GH #47).
Added (breaking): parse5.parse() and parse5.parseFragment() methods as a replacement for the Parser class.
Added (breaking): parse5.serialize() method as a replacement for the Serializer class.
Updated: The parsing algorithm was updated with the latest HTML spec changes.
Removed (breaking): decodeHtmlEntities and encodeHtmlEntities options (GH #75).
Added (breaking): TreeAdapter.setTemplateContent() and TreeAdapter.getTemplateContent() methods (GH #78).
Updated (breaking): The default tree adapter now stores <template> content in the template.content property instead of template.childNodes[0].

1.5.1

Fixed: Qualified tag name emission in Serializer (GH #79).

1.5.0

Added: Location info for the element start and end tags (by @sakagg).

1.4.2

Fixed: htmlparser2 tree adapter DocumentType.data property rendering (GH #45).

1.4.1

Fixed: Location info handling for the implicitly generated <html> and <body> elements (GH #44).

1.4.0

Added: Parser decodeHtmlEntities option.
Added: SimpleApiParser decodeHtmlEntities option.
Added: Parser locationInfo option.
Added: SimpleApiParser locationInfo option.

1.3.2

Fixed: <form> processing in <template> (GH #40).

1.3.1

Fixed: text node in <template> serialization problem with custom tree adapter (GH #38).

1.3.0

Added: Serializer encodeHtmlEntities option.

1.2.0

Added: <template> support.
parseFragment now uses <template> as the default contextElement. This results in more "forgiving" parsing.
TreeSerializer was renamed to Serializer. However, the serializer is still accessible as parse5.TreeSerializer for backward compatibility.

1.1.6

Fixed: apply latest changes to the htmlparser2 tree format (DOM Level1 node emulation).

1.1.5

Added: jsdom-specific parser with scripting support. Undocumented; for jsdom internal use only.

1.1.4

Added: logo
Fixed: use fake document element for fragment parsing (required by jsdom).

1.1.3

Development files (e.g. .travis.yml, .editorconfig) are removed from NPM package.

1.1.2

Fixed: crash on Linux due to upper-case leading character in module name used in require().

1.1.1

Added: SimpleApiParser.
Fixed: new line serialization in <pre>.
Fixed: SYSTEM-only DOCTYPE serialization.
Fixed: quotes serialization in DOCTYPE IDs.

1.0.0

First stable release, switch to semantic versioning.

0.8.3

Fixed: siblings calculation bug in appendChild in htmlparser2 tree adapter.

0.8.1

Added: TreeSerializer.
Added: htmlparser2 tree adapter.

0.6.1

Fixed: incorrect <menuitem> handling in <body>.

0.6.0

Initial release.
