HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
- Install
- Which API should I use?
- TypeScript definitions
- API Reference
- FAQ
- Version history
- GitHub repository
$ npm install parse5
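To parse an HTML document string, use the parse5.parse method.
To parse an HTML fragment string, use the parse5.parseFragment method.
To serialize a node to an HTML string, use the parse5.serialize method.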
"I need to parse HTML streamed from network or from file." or "I need to implement <script>
execution and document.write."
Use parse5.ParserStream class.
"I don't need a document tree, but I need a basic information about tags or attributes" or "I need to extract a text content from huge amount of documents" or "I need to analyze content that going through my proxy server".
Use parse5.SAXParser class.
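To convert a plain-text file into an HTML document the way the HTML specification requires, use the parse5.PlainTextConversionStream class.
To serialize a node to an HTML stream, use the parse5.SerializerStream class.
To get source-location information for parsed nodes, use the locationInfo options: ParserOptions.locationInfo, SAXParserOptions.locationInfo.
To switch the output tree format, use the treeAdapter options, ParserOptions.treeAdapter and SerializerOptions.treeAdapter, with one of the two built-in tree formats.
To use your own tree format, implement the TreeAdapter interface and then use the treeAdapter option to pass it to the parser or serializer.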
parse5 package includes a TypeScript definition file. Therefore you don't need to install any typings to use parse5 in TypeScript files. Note that since parse5 supports multiple output tree formats you need to manually cast generic node interfaces to the appropriate tree format to get access to the properties:
import * as parse5 from 'parse5';
// Using default tree adapter.
var document1 = parse5.parse('<div></div>') as parse5.AST.Default.Document;
// Using htmlparser2 tree adapter.
var document2 = parse5.parse('<div></div>', {
    treeAdapter: parse5.TreeAdapters.htmlparser2
}) as parse5.AST.HtmlParser2.Document;
You can find documentation for interfaces in API reference.
You can create a custom tree adapter, so that parse5 can work with your own DOM-tree implementation.
Then pass it to the parser or serializer via the treeAdapter option:
const parse5 = require('parse5');
const myTreeAdapter = {
    // Adapter methods...
};
const document = parse5.parse('<div></div>', { treeAdapter: myTreeAdapter });
const html = parse5.serialize(document, { treeAdapter: myTreeAdapter });
Refer to the API reference for the description of methods that should be exposed by the tree adapter, as well as links to their default implementation.
To use parse5 in the browser, compile it with browserify and you're set.
Q: I'm parsing <img src="foo"> with the SAXParser and I expect the selfClosing flag to be true for the <img> tag. But it's not. Is there something wrong with the parser?
No. A self-closing tag is a tag that has a / before the closing bracket, e.g. <br/>, <meta/>.
In the provided example, the tag simply doesn't have an end tag. Self-closing tags and tags without end tags are treated differently by the parser: in the case of a self-closing tag, the parser does not look for the corresponding closing tag and expects the element not to have any content. But if a start tag is not self-closing, the parser treats everything that follows it (with a few exceptions) as the element content. However, if the start tag is in the list of void elements, the parser expects the corresponding element not to have content and behaves in the same way as if the element was self-closing. So, semantically, if an element is void, self-closing tags and tags without closing tags are equivalent, but this is not true for other tags.
TL;DR: selfClosing is a part of lexical information and is set only if the tag has / before the closing bracket in the source code.
If the parser output looks weird, it is most likely not a bug. There are a lot of weird edge cases in the HTML5 parsing algorithm, e.g.:
<b>1<p>2</b>3</p>
will be parsed as
<b>1</b><p><b>2</b>3</p>
Just try it in the latest version of your browser before submitting an issue.
- Fixed: location.startTag is not available if end tag is missing (GH #181).
- Fixed: MarkupData.Location.col description in TypeScript definition file (GH #170).
- Added: parse5 now ships with TypeScript definitions from which new documentation website is generated (GH #125).
- Added: PlainTextConversionStream (GH #135).
- Updated: Significantly reduced initial memory consumption (GH #52).
- Updated (breaking): Added support for limited quirks mode. document.quirksMode property was replaced with document.mode property which can have 'no-quirks', 'quirks' and 'limited-quirks' values. Tree adapter setQuirksMode and isQuirksMode methods were replaced with setDocumentMode and getDocumentMode methods (GH #83).
- Updated (breaking): AST collections (e.g. attributes dictionary) don't have prototype anymore (GH #119).
- Updated (breaking): Doctype now always serialized as <!DOCTYPE html> as per spec (GH #137).
- Fixed: Incorrect line for __location.endTag when the start tag contains newlines (GH #166) (by @webdesus).
- Fixed: Incorrect LocationInfo.endOffset for non-implicitly closed elements (refix for GH #109) (by @wooorm).
- Fixed: Incorrect location info for text in SAXParser (GH #153).
- Fixed: Incorrect LocationInfo.endOffset for implicitly closed <p> element (GH #109).
- Fixed: Infinite input data buffering in streaming parsers. Now parsers try not to buffer more than 64K of input data. However, there are still some edge cases left that will lead to significant memory consumption, but they are quite exotic and extremely rare in the wild (GH #102, GH #130).
- Fixed: SAXParser HTML integration point handling for adjustable SVG tags.
- Fixed: SAXParser now adjusts SVG tag names for end tags.
- Fixed: Location info line calculation on tokenizer character unconsumption (by @ChadKillingsworth).
- SAXParser (by @RReverser)
  - Fixed: Handling of \n in <pre>, <textarea> and <listing>.
  - Fixed: Tag names and attribute names adjustment in foreign content (GH #99).
  - Fixed: Handling of <image>.
- Latest spec changes
  - Updated: <isindex> no longer has special handling (GH #122).
  - Updated: Adoption agency algorithm now preserves lexical order of text nodes (GH #129).
  - Updated: <menuitem> now behaves like <option>.
- Fixed: Element nesting corrections now take namespaces into consideration.
- Fixed: ParserStream accidentally hangs up on scripts (GH #101).
- Fixed: Keep ParserStream sync for the inline scripts (GH #98 follow up).
- Fixed: Synchronously calling resume() leads to crash (GH #98).
- Fixed: SAX parser silently exits on big files (GH #97).
- Fixed: location info not attached for empty attributes (GH #96) (by @yyx990803).
- Added: location info for attributes (GH #43) (by @sakagg and @yyx990803).
- Fixed: parseFragment with locationInfo regression when parsing <template> (GH #90) (by @yyx990803).
- Fixed: yet another case of incorrect parseFragment arguments fallback (GH #84).
- Fixed: parseFragment arguments processing (GH #82).
- Added: ParserStream with the scripting support. (GH #26).
- Added: SerializerStream. (GH #26).
- Added: Line/column location info. (GH #67).
- Update (breaking): Location info properties start and end were renamed to startOffset and endOffset respectively.
- Update (breaking): SimpleApiParser was renamed to SAXParser.
- Update (breaking): SAXParser is the transform stream now. (GH #26).
- Update (breaking): SAXParser handler subscription is done via events now.
- Added: SAXParser.stop(). (GH #47).
- Add (breaking): parse5.parse() and parse5.parseFragment() methods as replacement for the Parser class.
- Add (breaking): parse5.serialize() method as replacement for the Serializer class.
- Updated: parsing algorithm was updated with the latest HTML spec changes.
- Removed (breaking): decodeHtmlEntities and encodeHtmlEntities options. (GH #75).
- Add (breaking): TreeAdapter.setTemplateContent() and TreeAdapter.getTemplateContent() methods. (GH #78).
- Update (breaking): default tree adapter now stores <template> content in template.content property instead of template.childNodes[0].
- Fixed: Qualified tag name emission in Serializer (GH #79).
- Added: Location info for the element start and end tags (by @sakagg).
- Fixed: htmlparser2 tree adapter DocumentType.data property rendering (GH #45).
- Fixed: Location info handling for the implicitly generated <html> and <body> elements (GH #44).
- Added: Parser decodeHtmlEntities option.
- Added: SimpleApiParser decodeHtmlEntities option.
- Added: Parser locationInfo option.
- Added: SimpleApiParser locationInfo option.
- Fixed: <form> processing in <template> (GH #40).
- Fixed: text node in <template> serialization problem with custom tree adapter (GH #38).
- Added: Serializer encodeHtmlEntities option.
- Added: <template> support.
- parseFragment now uses <template> as default contextElement. This leads to the more "forgiving" parsing manner.
- TreeSerializer was renamed to Serializer. However, serializer is accessible as parse5.TreeSerializer for backward compatibility.
- Fixed: apply latest changes to the htmlparser2 tree format (DOM Level1 node emulation).
- Added: jsdom-specific parser with scripting support. Undocumented for jsdom internal use only.
- Added: logo
- Fixed: use fake document element for fragment parsing (required by jsdom).
- Development files (e.g. .travis.yml, .editorconfig) are removed from NPM package.
- Fixed: crash on Linux due to upper-case leading character in module name used in require().
- Added: SimpleApiParser.
- Fixed: new line serialization in <pre>.
- Fixed: SYSTEM-only DOCTYPE serialization.
- Fixed: quotes serialization in DOCTYPE IDs.
- First stable release, switch to semantic versioning.
- Fixed: siblings calculation bug in appendChild in htmlparser2 tree adapter.
- Added: TreeSerializer.
- Added: htmlparser2 tree adapter.
- Fixed: incorrect <menuitem> handling in <body>.
- Initial release.