A simple command-line tool for scraping HTML content from a given URL and extracting data for further processing.
Good for grabbing all image/link/magnet URLs from a page, or extracting the text of certain elements.
NOT for scraping entire page content.
Install globally via npm:

```shell
npm i -g @trippnology/super-simple-scraper
```
Or install from source:

- Clone the repository: `git clone https://github.com/trippnology/super-simple-scraper.git`
- Navigate to the project directory: `cd super-simple-scraper`
- Install dependencies: `npm install`
- Make the script executable (optional, for Unix-based systems): `chmod +x index.js`
- Link the repo as a local command (optional): `npm link`. You can now run `sss` globally, as if it had been installed by npm.
You can run the scraper using the following command:

```shell
sss [options]
```

Or, if you installed from source:

```shell
node index.js [options]
```
- `-u, --url <url>`: The URL to scrape (required).
- `-s, --selector <selector>`: CSS selector to find. Default is `a`.
- `-c, --content <type>`: Process each element as this type of content (`hash`, `html`, `image`, `json`, `link`, `object`, or `text`). Default is `link`.
- `-o, --output <format>`: Output format (`html`, `json`, `object`, or `text`). Default is `text`.
It's up to you to use sensible combinations of options. If you select all images and then try to process them as links, you're not going to get any results!

Use the `-c object` and `-o object` options together to get the full cheerio object for debugging. You can use this to make sure you are dealing with the DOM that you think you are!
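To illustrate a sensible pairing versus a mismatched one (a sketch; `example.com` is a placeholder URL):

```shell
# Sensible: process img elements as images, returning their src attributes
sss -u https://example.com -s img -c image

# Mismatched: img elements have no href, so processing them as links
# returns no results
sss -u https://example.com -s img -c link
```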
- Scrape a specific URL with default options (this will find all links and return their hrefs): `sss -u https://example.com`
- Find all elements with a class of `.foo` and grab their HTML contents: `sss -u https://example.com -s .foo -c html`
- Find all links and return their href: `sss -u http://localhost:8080/test.html -s a -c link`
- Find all links and return their text: `sss -u http://localhost:8080/test.html -s a -c text`
- Find all images and return their src: `sss -u http://localhost:8080/test.html -s img -c image`
- Find all magnet links and return their infohash: `sss -u http://localhost:8080/test.html -s 'a[href^=magnet]' -c hash`
- Find all scripts containing JSON and return their contents: `sss -u http://localhost:8080/test.html -s 'script[type="application/json"]' -c json`
- Find all elements with a class of `.foo` and return the full cheerio object (useful for debugging): `sss -u http://localhost:8080/test.html -s .foo -c object -o object`
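The examples above all use the default text output. To change the output format, add the `-o` flag — for instance, to emit the matched hrefs as JSON rather than plain text (a sketch; the exact JSON shape produced is not specified here):

```shell
sss -u http://localhost:8080/test.html -s a -c link -o json
```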
- Fork it!
- Create your feature branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin my-new-feature`
- Submit a pull request :D
- v1.0.0: Initial release with basic functionality.
MIT. See the full LICENSE file.