Super Simple Scraper

A simple command-line tool for scraping HTML content from a given URL and extracting data for further processing.

Good for grabbing all image/link/magnet URLs from a page, or extracting the text of certain elements.

NOT for scraping entire page content.

Installation

From npm

npm i -g @trippnology/super-simple-scraper

From source

  1. Clone the repository:

    git clone https://github.com/trippnology/super-simple-scraper.git
  2. Navigate to the project directory:

    cd super-simple-scraper
  3. Install dependencies:

    npm install
  4. Make the script executable (optional, for Unix-based systems):

    chmod +x index.js
  5. Link the repo as a local command (optional):

    npm link

    You can now run sss globally, as if it had been installed via npm.
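
As a quick check that everything is wired up, you can run the tool against any page using only the defaults documented below:

    # Should print the href of every link on the page
    sss -u https://example.com

    # Equivalent, if you skipped the npm link step
    node index.js -u https://example.com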

Usage

You can run the scraper using the following command:

sss [options]

Or if you installed from source:

node index.js [options]

Options

  • -u, --url <url>: The URL to scrape (required).
  • -s, --selector <selector>: CSS selector to find. Default is a.
  • -c, --content <type>: Process each element as this type of content (hash, html, image, json, link, object, or text). Default is link.
  • -o, --output <format>: Output format (html, json, object, or text). Default is text.

It's up to you to use sensible combinations of options. If you select all images and then try to process them as links, you're not going to get any results!
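
For instance (assuming, as the examples below show, that the link content type reads an element's href and the image type reads its src):

    # img elements have no href, so processing them as links returns nothing
    sss -u https://example.com -s img -c link

    # matching the content type to the element returns each image's src
    sss -u https://example.com -s img -c image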

Use the -c object and -o object options together to get the full cheerio object for debugging. You can use this to make sure you are dealing with the DOM that you think you are!
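
If you want to hand results to another tool, the json output format pairs naturally with something like jq. This is only a sketch, assuming -o json prints a single JSON array to stdout; adjust the jq filter to match the actual shape of the output:

    sss -u https://example.com -s a -c link -o json | jq '.[]'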

Examples

  1. Scrape a URL with the default options (this will find all links and return their hrefs):

    sss -u https://example.com
  2. Find all elements with a class of .foo and grab their HTML contents:

    sss -u https://example.com -s .foo -c html
  3. Find all links and return their href:

    sss -u http://localhost:8080/test.html -s a -c link
  4. Find all links and return their text:

    sss -u http://localhost:8080/test.html -s a -c text
  5. Find all images and return their src:

    sss -u http://localhost:8080/test.html -s img -c image
  6. Find all magnet links and return their infohash:

    sss -u http://localhost:8080/test.html -s 'a[href^=magnet]' -c hash
  7. Find all scripts containing JSON and return their contents:

    sss -u http://localhost:8080/test.html -s 'script[type="application/json"]' -c json
  8. Find all elements with a class of .foo and return the full cheerio object (useful for debugging):

    sss -u http://localhost:8080/test.html -s .foo -c object -o object
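
Because output is written to stdout, results can be piped straight into other commands for further processing. A sketch, assuming the default text output prints one URL per line:

    # Download every image found on the page
    sss -u https://example.com -s img -c image -o text | xargs -n 1 wget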

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

History

  • v1.0.0: Initial release with basic functionality.

Credits

License

MIT. See the LICENSE file for the full text.
