parse5

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.

Install

$ npm install parse5

Which API should I use?

"I need to parse an HTML string"

Use the parse5.parse method.

"I need to parse an HTML fragment string" or "I need to implement an innerHTML setter."

Use the parse5.parseFragment method.

"I need to serialize a node to an HTML string"

Use the parse5.serialize method.

"I need to parse HTML streamed from the network or from a file." or "I need to implement <script> execution and document.write"

Use the parse5.ParserStream class.

"I don't need a document tree, but I need basic information about tags or attributes" or "I need to extract text content from a huge number of documents" or "I need to analyze content that is going through my proxy server".

Use the parse5.SAXParser class.

"I need to parse a plain text file as an HTML document, like browsers do"

Use the parse5.PlainTextConversionStream class.

"I need to serialize a node and stream the result to a file or the network"

Use the parse5.SerializerStream class.

"I need source file location information for the parsed document"

Use the locationInfo options: ParserOptions.locationInfo, SAXParserOptions.locationInfo.

"I need to switch output tree format"

Use the treeAdapter options: ParserOptions.treeAdapter and SerializerOptions.treeAdapter, with one of the two built-in tree formats.

"I need to implement my own tree format"

Implement the TreeAdapter interface, then use the treeAdapter option to pass it to the parser or serializer.

TypeScript definitions

The parse5 package includes a TypeScript definition file, so you don't need to install any typings to use parse5 in TypeScript files. Note that because parse5 supports multiple output tree formats, you need to manually cast generic node interfaces to the appropriate tree format to get access to the properties:

import * as parse5 from 'parse5';

// Using default tree adapter.
var document1 = parse5.parse('<div></div>') as parse5.AST.Default.Document;

// Using htmlparser2 tree adapter.
var document2 = parse5.parse('<div></div>', {
    treeAdapter: parse5.TreeAdapters.htmlparser2
}) as parse5.AST.HtmlParser2.Document;

You can find documentation for the interfaces in the API reference.

FAQ

Q: I want to work with my own document tree format. How can I do this?

You can create a custom tree adapter, so that parse5 can work with your own DOM-tree implementation. Then pass it to the parser or serializer via the treeAdapter option:

const parse5 = require('parse5');

const myTreeAdapter = {
    // Adapter methods...
};

const document = parse5.parse('<div></div>', { treeAdapter: myTreeAdapter });

const html = parse5.serialize(document, { treeAdapter: myTreeAdapter });

Refer to the API reference for the description of methods that should be exposed by the tree adapter, as well as links to their default implementation.
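To give a feel for what the adapter object looks like, here is a minimal, deliberately incomplete sketch. The method names (createDocument, createElement, appendChild, getChildNodes, getParentNode, getTagName, getAttrList, isElementNode, setDocumentMode, getDocumentMode) are taken from the TreeAdapter interface, but the node shape ({ kind, children, parent, ... }) is a hypothetical format invented for this illustration. A real adapter must implement every method listed in the API reference before it can be passed to the parser or serializer; the sketch only exercises the methods directly.

```javascript
// Partial, illustrative tree adapter. The node shape used here is hypothetical
// and is not one of the built-in parse5 tree formats.
const sketchAdapter = {
    createDocument() {
        return { kind: 'document', mode: 'no-quirks', children: [], parent: null };
    },

    createElement(tagName, namespaceURI, attrs) {
        return { kind: 'element', tagName, namespaceURI, attrs, children: [], parent: null };
    },

    // Link the new node into the tree in both directions.
    appendChild(parentNode, newNode) {
        parentNode.children.push(newNode);
        newNode.parent = parentNode;
    },

    getChildNodes(node) { return node.children; },
    getParentNode(node) { return node.parent; },
    getTagName(element) { return element.tagName; },
    getAttrList(element) { return element.attrs; },
    isElementNode(node) { return node.kind === 'element'; },

    setDocumentMode(document, mode) { document.mode = mode; },
    getDocumentMode(document) { return document.mode; }
};

// Exercise the adapter by hand, the way a parser would call it.
const doc = sketchAdapter.createDocument();
const div = sketchAdapter.createElement(
    'div',
    'http://www.w3.org/1999/xhtml',
    [{ name: 'id', value: 'main' }]
);

sketchAdapter.appendChild(doc, div);
```

Keeping all knowledge of the node shape inside the adapter is the point of the design: the parser and serializer only ever touch nodes through these methods, so the tree format can change without touching the parsing code.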

Q: How can I use parse5 in the browser?

Compile it with browserify and you're set.

Q: I'm parsing <img src="foo"> with the SAXParser and I expect the selfClosing flag to be true for the <img> tag. But it's not. Is there something wrong with the parser?

No. A self-closing tag is a tag that has a / before the closing bracket, e.g. <br/> or <meta/>. In the provided example, the tag simply doesn't have an end tag. The parser treats self-closing tags and tags without end tags differently: for a self-closing tag, the parser does not look for the corresponding closing tag and expects the element to have no content. If a start tag is not self-closing, the parser treats everything that follows it (with a few exceptions) as the element's content. However, if the start tag is in the list of void elements, the parser expects the corresponding element to have no content and behaves as if the element were self-closing. So, semantically, self-closing tags and tags without closing tags are equivalent for void elements, but not for other elements.

TL;DR: selfClosing is part of the lexical information and is set only if the tag has a / before the closing bracket in the source code.
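The difference between the lexical flag and the void-element check can be sketched in a few lines. This is not parse5 code: describeStartTag is a hypothetical helper, the regular expressions are a rough approximation of tokenization (for example, they do not handle an unquoted attribute value that ends in /), and the void element list is the one commonly given for the HTML Living Standard.

```javascript
// Void element names as commonly listed for the HTML Living Standard.
const VOID_ELEMENTS = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Hypothetical helper: inspects the source text of a single start tag.
function describeStartTag(source) {
    const match = /^<([a-zA-Z][^\s/>]*)/.exec(source);

    if (!match)
        throw new Error('Not a start tag: ' + source);

    const tagName = match[1].toLowerCase();

    return {
        tagName,
        // Lexical: is there a "/" right before the closing bracket?
        selfClosing: /\/>$/.test(source),
        // Semantic: is the element defined to have no content?
        isVoid: VOID_ELEMENTS.has(tagName)
    };
}

const img = describeStartTag('<img src="foo">');
const br = describeStartTag('<br/>');
const div = describeStartTag('<div>');
```

Here img.selfClosing is false while img.isVoid is true, which is the situation described in the question. If you need to know whether an element can have content, check the tag name against a void element list rather than relying on the selfClosing flag.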

Q: I have some weird output from the parser, seems like a bug.

Most likely, it's not. There are a lot of weird edge cases in the HTML5 parsing algorithm, e.g.:

<b>1<p>2</b>3</p>

will be parsed as

<b>1</b><p><b>2</b>3</p>

Just try it in the latest version of your browser before submitting an issue.

Version history

3.0.2

  • Fixed: location.startTag is not available if end tag is missing (GH #181).

3.0.1

  • Fixed: MarkupData.Location.col description in TypeScript definition file (GH #170).

3.0.0

  • Added: parse5 now ships with TypeScript definitions, from which the new documentation website is generated (GH #125).
  • Added: PlainTextConversionStream (GH #135).
  • Updated: Significantly reduced initial memory consumption (GH #52).
  • Updated (breaking): Added support for limited quirks mode. document.quirksMode property was replaced with document.mode property which can have 'no-quirks', 'quirks' and 'limited-quirks' values. Tree adapter setQuirksMode and isQuirksMode methods were replaced with setDocumentMode and getDocumentMode methods (GH #83).
  • Updated (breaking): AST collections (e.g. attributes dictionary) no longer have a prototype (GH #119).
  • Updated (breaking): Doctype is now always serialized as <!DOCTYPE html> as per spec (GH #137).
  • Fixed: Incorrect line for __location.endTag when the start tag contains newlines (GH #166) (by @webdesus).
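The prototype change (GH #119) mostly affects code that calls inherited Object methods on a collection. The sketch below assumes a prototype-less collection behaves like an object created with Object.create(null); the attribs object and the hasOwn helper are illustrative stand-ins, not values produced by parse5.

```javascript
// Stand-in for a prototype-less attributes dictionary (assumption: it behaves
// like an object created with Object.create(null)).
const attribs = Object.create(null);

attribs.id = 'main';
attribs['data-role'] = 'page';

// Inherited helpers such as attribs.hasOwnProperty are undefined here,
// so borrow the method from Object.prototype instead.
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

const hasId = hasOwn(attribs, 'id');

// No inherited keys: names like "toString" are not accidentally present.
const hasToString = 'toString' in attribs;

// Static Object methods keep working on prototype-less objects.
const names = Object.keys(attribs);
```

The upside of this design is that an attribute named, say, constructor or toString cannot collide with an inherited property.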

2.2.3

  • Fixed: Incorrect LocationInfo.endOffset for non-implicitly closed elements (refix for GH #109) (by @wooorm).

2.2.2

  • Fixed: Incorrect location info for text in SAXParser (GH #153).
  • Fixed: Incorrect LocationInfo.endOffset for implicitly closed <p> element (GH #109).
  • Fixed: Infinite input data buffering in streaming parsers. Parsers now try not to buffer more than 64K of input data. However, some edge cases remain that will lead to significant memory consumption, but they are quite exotic and extremely rare in the wild (GH #102, GH #130).

2.2.1

  • Fixed: SAXParser HTML integration point handling for adjustable SVG tags.
  • Fixed: SAXParser now adjusts SVG tag names for end tags.
  • Fixed: Location info line calculation on tokenizer character unconsumption (by @ChadKillingsworth).

2.2.0

  • SAXParser (by @RReverser)
      • Fixed: Handling of \n in <pre>, <textarea> and <listing>.
      • Fixed: Tag names and attribute names adjustment in foreign content (GH #99).
      • Fixed: Handling of <image>.
  • Latest spec changes
      • Updated: <isindex> no longer has special handling (GH #122).
      • Updated: Adoption agency algorithm now preserves lexical order of text nodes (GH #129).
      • Updated: <menuitem> now behaves like <option>.
  • Fixed: Element nesting corrections now take namespaces into consideration.

2.1.5

  • Fixed: ParserStream accidentally hangs up on scripts (GH #101).

2.1.4

  • Fixed: Keep ParserStream sync for the inline scripts (GH #98 follow up).

2.1.3

  • Fixed: Synchronously calling resume() leads to crash (GH #98).

2.1.2

  • Fixed: SAX parser silently exits on big files (GH #97).

2.1.1

  • Fixed: location info not attached for empty attributes (GH #96) (by @yyx990803).

2.1.0

  • Added: location info for attributes (GH #43) (by @sakagg and @yyx990803).
  • Fixed: parseFragment with locationInfo regression when parsing <template> (GH #90) (by @yyx990803).

2.0.2

  • Fixed: yet another case of incorrect parseFragment arguments fallback (GH #84).

2.0.1

  • Fixed: parseFragment arguments processing (GH #82).

2.0.0

1.5.1

  • Fixed: Qualified tag name emission in Serializer (GH #79).

1.5.0

  • Added: Location info for the element start and end tags (by @sakagg).

1.4.2

  • Fixed: htmlparser2 tree adapter DocumentType.data property rendering (GH #45).

1.4.1

  • Fixed: Location info handling for the implicitly generated <html> and <body> elements (GH #44).

1.4.0

1.3.2

  • Fixed: <form> processing in <template> (GH #40).

1.3.1

  • Fixed: text node in <template> serialization problem with custom tree adapter (GH #38).

1.3.0

  • Added: Serializer encodeHtmlEntities option.

1.2.0

  • Added: <template> support
  • parseFragment now uses <template> as the default contextElement. This leads to a more "forgiving" parsing manner.
  • TreeSerializer was renamed to Serializer. However, the serializer is still accessible as parse5.TreeSerializer for backward compatibility.

1.1.6

  • Fixed: apply latest changes to the htmlparser2 tree format (DOM Level1 node emulation).

1.1.5

  • Added: jsdom-specific parser with scripting support. Undocumented; for jsdom internal use only.

1.1.4

  • Added: logo
  • Fixed: use fake document element for fragment parsing (required by jsdom).

1.1.3

  • Development files (e.g. .travis.yml, .editorconfig) are removed from NPM package.

1.1.2

  • Fixed: crash on Linux due to upper-case leading character in module name used in require().

1.1.1

  • Added: SimpleApiParser.
  • Fixed: new line serialization in <pre>.
  • Fixed: SYSTEM-only DOCTYPE serialization.
  • Fixed: quotes serialization in DOCTYPE IDs.

1.0.0

  • First stable release, switch to semantic versioning.

0.8.3

  • Fixed: siblings calculation bug in appendChild in htmlparser2 tree adapter.

0.8.1

0.6.1

  • Fixed: incorrect <menuitem> handling in <body>.

0.6.0

  • Initial release.