JavaScript Web Scraping

This list covers JavaScript libraries for web scraping and data processing. It focuses on libraries that run in Node.js, without a real web browser.

Network

  • node-http2 - An HTTP/2 client and server implementation for node.js
  • httpinvoke - A no-dependencies HTTP client library for browsers and Node.js with a promise-based or Node.js-style callback-based API to progress events, text and binary file upload and download, partial response body, request and response headers, status code.
  • request - Simplified HTTP request client.
  • socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
  • rest - RESTful HTTP client for JavaScript
  • wreck - HTTP Client Utilities

Web-Scraping Frameworks

HTML/XML Parsing

  • General
    • parse5 - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
    • htmlparser2 - forgiving html and xml parser
    • sax-js - A sax style parser for JS
    • cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
  • Sanitizing
    • js-xss - Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • string.js - Extra JavaScript string methods.
    • accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
    • validator.js - String validation and sanitization.
  • Date and time
    • moment - Parse, validate, manipulate, and display dates in javascript.
    • date - Date() for humans.
    • ms.js - Tiny millisecond conversion utility.
  • HTML entities
    • he - A robust HTML entity encoder/decoder written in JavaScript.
  • Money
    • money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
  • Color
    • chroma.js - JavaScript library for all kinds of color manipulations.
    • color - JavaScript color conversion and manipulation library.
    • TinyColor - Fast, small color manipulation and conversion for JavaScript.
  • User Agent
    • UAParser.js - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
  • Semantic Version

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
  • Office
    • js-xlsx - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
  • CSV
    • BabyParse - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
    • CSV - A simple, blazing-fast CSV parser and encoder. Full RFC 4180 compliance.
  • JSON
    • json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.
  • EXIF
    • exif-js - JavaScript library for reading EXIF image metadata
  • CSS
    • parse-css - Standards-based CSS Parser
    • parser-lib CSS parser - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
  • Torrent
    • parse-torrent - Parse a torrent identifier (magnet uri, .torrent file, info hash)
  • SQL
    • SQL Parser - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
  • YAML
    • JS-YAML - JavaScript YAML parser and dumper. Very fast.
  • Markdown
    • markdown-it - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
  • Atom/RSS

Natural Language Processing

Libraries for working with human languages.

  • General
    • natural - general natural language facilities for node
    • nlp_compromise - natural language processing
    • Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
    • salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
    • node-summary - Node module that summarizes text using a naive summarization algorithm
  • Stemmer
    • snowball-js - javascript implementation of the popular snowball word stemming nlp algorithm
    • porter-stemmer - Martin Porter's stemmer for node.js
    • Porter-Stemmer - A Javascript Implementation of the Porter Stemmer
    • lunr-languages - a collection of language stemmers and stopwords for the Lunr JavaScript library
  • Language detection
    • franc - Natural language detection
    • guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js

Browser automation and emulation

  • phantomjs - Scriptable Headless WebKit.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - Nightmare is a high level wrapper for PhantomJS that lets you automate browser tasks

Multiprocessing

  • nexpect - spawn and control child processes in node.js with ease
  • respawn - Spawn a process and restart it if it crashes
  • node-webworker - A WebWorkers implementation for NodeJS

Asynchronous

Libraries for asynchronous network programming.

  • socket.io - Realtime application framework (Node.JS server)
  • engine.io - Engine.IO is the implementation of transport-based cross-browser/cross-device bi-directional communication layer for Socket.IO
  • async - Async utilities for node and the browser

Queue

  • kue - Kue is a priority job queue backed by redis, built for node.js
  • bull - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.

Email

Libraries for parsing email.

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

  • URL
    • query-string - Parse and stringify URL query strings.
    • URI.js - Javascript URL mutation library.
    • jsurl - Lightweight URL manipulation with JavaScript.
    • arg.js - Lightweight URL argument and parameter parser
  • Network Address
    • node-ip - IP address tools for node.js
    • ip-address - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript
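
As a rough illustration of the kind of URL parsing and modification the libraries above provide, here is a minimal sketch that uses only Node's built-in `URL` class rather than any of the listed packages. The `withPage` helper and the example URL are hypothetical and exist only for this sketch.

```javascript
// Sketch only: relies on Node's built-in WHATWG URL API,
// not on query-string, URI.js, jsurl, or arg.js.
// `withPage` is a made-up helper name for illustration.
function withPage(rawUrl, page) {
  const url = new URL(rawUrl);                // parse the URL string
  url.searchParams.set('page', String(page)); // add or overwrite one query parameter
  return url.toString();                      // serialize back to a string
}

// Typical scraping use: step through paginated listings.
const next = withPage('https://example.com/items?sort=asc&page=1', 2);
console.log(next);
```

The listed libraries may be preferable when a richer or chainable API is needed; the built-in class is usually enough for simple query-string edits like this one.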

Web Content Extracting

Libraries for extracting web content.

  • node-read - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
  • node-ytdl-core - Youtube video downloader in javascript
  • ImageResolver - Does its best to determine the main image on a URL without loading all images.

WebSocket

Libraries for working with WebSocket.

  • websocket.io - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
  • WebSocket-Node - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)

DNS Resolving

  • multicast-dns - Low level multicast-dns implementation in pure javascript
  • node-dns - Replacement dns module in pure javascript for node.js

Computer Vision

  • tracking.js - A modern approach for Computer Vision on the web.
  • ocrad.js - OCR in Javascript via Emscripten.

Proxy Server

  • toxy - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions

Data Structure

  • immutable - Immutable persistent data collections for Javascript which increase efficiency and simplicity.
  • lodash - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects

Other JavaScript lists