AsyncURLCrawler

AsyncURLCrawler navigates through web pages concurrently by following hyperlinks and collecting URLs. It traverses pages using a breadth-first search (BFS) algorithm. Before crawling a domain, check its robots.txt to make sure crawling is permitted.

👉 For complete documentation, read here

👉 Source code on GitHub here
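
As noted above, crawling should respect each domain's robots.txt. A minimal pre-crawl check with Python's standard-library urllib.robotparser might look like this (the user agent string and URL are illustrative, not part of AsyncURLCrawler's API):

from urllib.robotparser import RobotFileParser

# Fetch and parse the domain's robots.txt before crawling it.
robots = RobotFileParser("https://pouyae.ir/robots.txt")
robots.read()

# Only crawl paths that the chosen user agent is allowed to fetch.
if robots.can_fetch("Mozilla", "https://pouyae.ir/"):
    print("Crawling this URL is permitted by robots.txt.")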


Install Package

pip install AsyncURLCrawler
pip install AsyncURLCrawler==<version>

👉 The official page of the project on PyPI.


Usage Example in Code

Here is a simple Python script that shows how to use the package:

import asyncio
import os

import yaml

from AsyncURLCrawler.parser import Parser
from AsyncURLCrawler.crawler import Crawler

# Directory where the crawl result is written.
output_path = "."


async def main():
    parser = Parser(
        delay_start=0.1,
        max_retries=5,
        request_timeout=1,
        user_agent="Mozilla",
    )
    crawler = Crawler(
        seed_urls=["https://pouyae.ir"],
        parser=parser,
        exact=True,
        deep=False,
        delay=0.1,
    )
    result = await crawler.crawl()
    # crawl() maps each seed URL to a set of discovered URLs;
    # convert the sets to lists so they can be dumped to YAML.
    with open(os.path.join(output_path, "result.yaml"), "w") as file:
        for key in result:
            result[key] = list(result[key])
        yaml.dump(result, file)


if __name__ == "__main__":
    asyncio.run(main())

This is the output for the above code:

https://pouyae.ir:
- https://github.com/PouyaEsmaeili/AsyncURLCrawler
- https://pouyae.ir/images/pouya3.jpg
- https://github.com/PouyaEsmaeili/CryptographicClientSideUserState
- https://github.com/PouyaEsmaeili/RateLimiter
- https://pouyae.ir/
- https://github.com/luizdepra/hugo-coder/
- https://duman.pouyae.ir/
- https://pouyae.ir/projects/
- https://pouyae.ir/images/pouya4.jpg
- https://pouyae.ir/images/pouya5.jpg
- https://pouyae.ir/gallery/
- https://github.com/PouyaEsmaeili
- https://pouyae.ir/blog/
- https://www.linkedin.com/in/pouya-esmaeili-9124b839/
- https://pouyae.ir/about/
- https://stackoverflow.com/users/13118327/pouya-esmaeili?tab=profile
- https://pouyae.ir/contact-me/
- https://github.com/PouyaEsmaeili/SnowflakeID
- https://pouyae.ir/images/pouya2.jpg
- https://github.com/PouyaEsmaeili/gFuzz
- https://linktr.ee/pouyae
- https://gohugo.io/
- https://pouyae.ir/images/pouya1.jpg

👉 There is also a blog post about using AsyncURLCrawler to find malicious URLs in a web page. Read here.


Command-Line Tool

The crawler can be run and customized via src/cmd/cmd.py, which accepts the following arguments to configure its behavior:

argument   description
--url      Specifies a list of URLs to crawl. At least one URL must be provided.
--exact    Optional flag; if set, the crawler restricts crawling to the specified subdomain/domain only. Default is False.
--deep     Optional flag; if enabled, the crawler explores all visited URLs. Not recommended due to potential resource intensity. If --deep is set, --exact is ignored.
--delay    Sets the delay between consecutive HTTP requests, in seconds.
--output   Specifies the path of the output file, which is saved in YAML format.
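
For example, a run with a single seed URL might look like the following (the invocation path and flag forms are inferred from the argument list above, so adjust them to the actual cmd.py interface):

python src/cmd/cmd.py --url https://pouyae.ir --exact --delay 0.1 --output ./output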

Run the Command-Line Tool in a Docker Container 🐳

There is a Dockerfile in src/cmd for running the above-mentioned command-line tool in a Docker container.

docker build -t crawler .
docker run -v my_dir:/src/output --name crawler crawler

After the container has run, the resulting output file is accessible via the my_dir volume defined above. To configure the tool for your needs, check the CMD in the Dockerfile.
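
If the default CMD does not fit your use case, the arguments can also be overridden at docker run time instead of editing the Dockerfile. The example below is purely illustrative and assumes the container's working directory contains cmd.py:

docker run -v my_dir:/src/output --name crawler crawler \
    python cmd.py --url https://pouyae.ir --exact --delay 0.1 --output /src/output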


Build and Publish to the Python Package Index (PyPI)

Requirements:

python3 -m pip install --upgrade build
python3 -m pip install --upgrade twine

👉 For more details check Packaging Python Projects.

Build and upload:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Build Documentation with Sphinx

Install packages listed in docs/doc-requirements.txt.

cd docs
pip install -r doc-requirements.txt
make clean
make html

HTML files are generated in docs/build. Push them to the documentation repository and deploy on pages.dev.


Workflow

  • Branch off, implement features, and merge them into main. Remove feature branches afterwards.
  • Update the version in pyproject.toml and push to main (see the sketch after this list).
  • Add a release tag on GitHub.
  • Build and push the package to PyPI.
  • Build the documentation and push the HTML files to the AsyncURLCrawlerDocs repo.
  • The documentation is deployed on pages.dev automatically.
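
For the version bump and tagging steps, a typical sequence might look like the following sketch (the version number and tag format are illustrative; the project may use a different convention):

# after editing the version field in pyproject.toml
git commit -am "Bump version to 1.2.3"
git push origin main
git tag v1.2.3
git push origin v1.2.3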

Contact

Find me @ My Homepage


Disclaimer

⚠️ Use at your own risk. The author and contributors are not responsible for any misuse or consequences resulting from the use of this project.

