This project demonstrates how to use Puppeteer to render a web page and create a WARC file of the rendered page and its resources. This can be useful for archiving web pages for long-term storage or offline browsing.
- Node.js: Ensure you have Node.js installed (version 22 or higher is recommended). You can download it from nodejs.org.
- Clone the repository to download the project files to your local machine:
  git clone https://github.com/ganapativs/puppeteer-warc.git
  cd puppeteer-warc
- Install the dependencies. This command installs all the required Node.js packages specified in the package.json file:
  npm install
To create a WARC file from a website, use the src/write-warc-cli.mjs script. It renders the specified website and writes a WARC file containing the page and its resources. It also saves a screenshot of the rendered page, which can be useful for debugging.
- Command:
  node src/write-warc-cli.mjs <website-url>
- Example: To create a WARC file for https://example.com, run:
  node src/write-warc-cli.mjs https://example.com
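
To give a feel for what ends up inside the archive, the sketch below lays out a single WARC/1.1 response record by hand and gzips it, using only Node.js built-ins. It illustrates the record format and is not the code in src/write-warc-cli.mjs; the helper name buildResponseRecord and the sample HTTP payload are hypothetical. It is written as CommonJS so it runs as a plain .js file; in an .mjs file, swap the require calls for import statements.

```javascript
const { randomUUID } = require('node:crypto');
const { gzipSync } = require('node:zlib');

// Hypothetical helper: serialize one WARC/1.1 "response" record.
// `httpResponse` is the raw HTTP message (status line, headers, body) as a string.
function buildResponseRecord(targetUri, httpResponse, date = new Date()) {
  const block = Buffer.from(httpResponse, 'utf8');
  const header = [
    'WARC/1.1',
    'WARC-Type: response',
    `WARC-Target-URI: ${targetUri}`,
    `WARC-Date: ${date.toISOString()}`,
    `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
    'Content-Type: application/http; msgtype=response',
    `Content-Length: ${block.length}`, // length of the block in bytes
    '',
    '', // the two empty entries yield the blank line that ends the header
  ].join('\r\n');
  // A record is: header, block, then two CRLFs.
  return Buffer.concat([Buffer.from(header, 'utf8'), block, Buffer.from('\r\n\r\n')]);
}

// Assumed sample payload, for illustration only.
const sampleHttp = 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>';
const record = buildResponseRecord('https://example.com/', sampleHttp);

// In a .warc.gz file, each record is typically gzipped on its own and the
// compressed members are concatenated.
const compressed = gzipSync(record);
console.log(`record: ${record.length} bytes, gzipped: ${compressed.length} bytes`);
```

Gzipping each record separately lets a reader seek to a single record without decompressing the whole file.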
To read and print the contents of a WARC file, use the src/read-warc-cli.mjs script. It outputs the records contained in the specified WARC file.
- Command:
  node src/read-warc-cli.mjs <path-to-warc-file>
- Example: To read the contents of examplecom.warc.gz, run:
  node src/read-warc-cli.mjs examplecom.warc.gz
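
As a rough picture of what reading a WARC file involves, the sketch below decompresses a .warc.gz buffer and splits it into records using only Node.js built-ins. It is not the implementation in src/read-warc-cli.mjs; the parseWarc helper and the in-memory sample archive are hypothetical, and the parser assumes well-formed records that each carry a Content-Length header. It is written as CommonJS so it runs as a plain .js file; in an .mjs file, swap the require call for an import statement.

```javascript
const { gunzipSync, gzipSync } = require('node:zlib');

// Hypothetical helper: split decompressed WARC bytes into records.
function parseWarc(buffer) {
  const records = [];
  let offset = 0;
  while (offset < buffer.length) {
    // The header section ends at the first blank line (CRLF CRLF).
    const headerEnd = buffer.indexOf('\r\n\r\n', offset);
    if (headerEnd === -1) break;
    const lines = buffer.toString('utf8', offset, headerEnd).split('\r\n');
    const version = lines.shift(); // e.g. "WARC/1.1"
    const headers = {};
    for (const line of lines) {
      const sep = line.indexOf(':');
      headers[line.slice(0, sep).trim().toLowerCase()] = line.slice(sep + 1).trim();
    }
    const length = Number(headers['content-length']);
    if (!Number.isInteger(length) || length < 0) {
      throw new Error(`Record at byte ${offset} has no valid Content-Length`);
    }
    const bodyStart = headerEnd + 4;
    records.push({ version, headers, body: buffer.subarray(bodyStart, bodyStart + length) });
    offset = bodyStart + length + 4; // skip the CRLF CRLF that closes the record
  }
  return records;
}

// Assumed sample: two gzipped records concatenated, as in a typical .warc.gz.
// For a real file, read it with fs.readFileSync('examplecom.warc.gz') instead.
const sample = Buffer.from(
  'WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n'
);
const archive = Buffer.concat([gzipSync(sample), gzipSync(sample)]);

// Node's gunzip decodes concatenated gzip members in one call.
const records = parseWarc(gunzipSync(archive));
for (const { headers, body } of records) {
  console.log(headers['warc-type'], body.length);
}
```

Loading the whole file into memory keeps the sketch short; for large archives a streaming parser would be the more practical choice.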
You can preview WARC files using ReplayWeb.page, a web-based tool for viewing archived web content. This tool allows you to interact with the archived pages as if you were browsing them live.
This project is licensed under the MIT License. See the LICENSE file for details.