This tool crawls a documentation website and converts the pages into a single Markdown document. It intelligently removes common sections that appear across multiple pages to avoid duplication, including them once at the end of the document.
- Crawls documentation websites and combines pages into a single Markdown file.
- Removes common sections that appear across many pages, including them once at the end of the document.
- Customizable similarity threshold for detecting common sections.
- Configurable selectors to remove specific elements from pages.
- Supports robots.txt compliance with an option to ignore it.
- NEW in v0.3.3: URLs matching --ignore-paths are skipped both pre-fetch (before requesting content) and post-fetch (after redirects).
- Python 3.6 or higher is required.
- (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.
If you have already cloned the repository or downloaded the source code, you can install the package using pip:
pip install .
This will install the package in your current Python environment.
If you are a developer or want to modify the source code and see your changes reflected immediately, you can install the package in editable mode. This allows you to edit the source files and test the changes without needing to reinstall the package:
pip install -e .
It is recommended to use a virtual environment to isolate the package and its dependencies. Follow these steps to set up a virtual environment and install the package:
1. Create a virtual environment (e.g., named venv):
   python -m venv venv
2. Activate the virtual environment:
   - On macOS/Linux: source venv/bin/activate
   - On Windows: .\venv\Scripts\activate
3. Install the package inside the virtual environment:
   pip install .
This ensures that all dependencies are installed within the virtual environment.
Once the package is published on PyPI, you can install it directly using:
pip install libcrawler
To upgrade the package to the latest version, use:
pip install --upgrade libcrawler
This will upgrade the package to the newest version available.
You can verify that the package has been installed correctly by running:
pip show libcrawler
This will display information about the installed package, including the version, location, and dependencies.
crawl-docs BASE_URL STARTING_POINT [OPTIONS]
- BASE_URL: The base URL of the documentation site (e.g., https://example.com).
- STARTING_POINT: The starting path of the documentation (e.g., /docs/).
- -o, --output OUTPUT: Output filename (default: documentation.md).
- --no-robots: Ignore robots.txt rules.
- --delay DELAY: Delay between requests in seconds (default: 1.0).
- --delay-range DELAY_RANGE: Range for random delay variation (default: 0.5).
- --remove-selectors SELECTOR [SELECTOR ...]: Additional CSS selectors to remove from pages.
- --similarity-threshold SIMILARITY_THRESHOLD: Similarity threshold for section comparison (default: 0.8).
- --allowed-paths PATH [PATH ...]: List of URL paths to include during crawling.
- --ignore-paths PATH [PATH ...]: List of URL paths to skip during crawling, either before or after fetching content.
- --user-agent USER_AGENT: Specify a custom User-Agent string (which will be harmonized with any additional headers).
- --headers-file FILE: Path to a JSON file containing optional headers (see the sketch after this list). Only one of --headers-file or --headers-json can be used.
- --headers-json JSON: Optional headers as a JSON string.
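The headers file is presumably a JSON object mapping header names to values. A minimal sketch, assuming illustrative headers (none of these specific header names or values are required by the tool):

# Write an example headers file; the header names/values are placeholders.
cat > headers.json <<'EOF'
{
  "Accept-Language": "en-US",
  "Referer": "https://example.com"
}
EOF

# Pass the file to the crawler...
crawl-docs https://example.com /docs/ -o output.md --headers-file headers.json

# ...or pass the same headers inline instead (not together with --headers-file).
crawl-docs https://example.com /docs/ -o output.md \
    --headers-json '{"Accept-Language": "en-US", "Referer": "https://example.com"}'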
crawl-docs https://example.com /docs/ -o output.md
crawl-docs https://example.com /docs/ -o output.md \
--similarity-threshold 0.7 \
--delay-range 0.3
crawl-docs https://example.com /docs/ -o output.md \
--remove-selectors ".sidebar" ".ad-banner"
crawl-docs https://example.com / -o output.md \
--allowed-paths "/docs/" "/api/"
crawl-docs https://example.com /docs/ -o output.md \
--ignore-paths "/old/" "/legacy/"
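For rate-limited sites, the documented --user-agent, --delay, and --delay-range flags can be combined to identify the crawler and slow it down; the agent string below is a placeholder:

crawl-docs https://example.com /docs/ -o output.md \
    --user-agent "MyDocsBot/1.0 (+https://example.com/bot)" \
    --delay 2.0 \
    --delay-range 0.5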
- Python 3.6 or higher
- BeautifulSoup4
- datasketch
- requests
- markdownify
Install dependencies using:
pip install -r requirements.txt
This project is licensed under the LGPLv3.