AsyncURLCrawler navigates through web pages concurrently by following hyperlinks to collect URLs. It traverses pages using a BFS (breadth-first search) strategy. Before using it, check the robots.txt of the target domains.
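As a quick illustration (not part of the package itself), a robots.txt check can be done with Python's standard library before starting a crawl; the seed URL and user agent below are only examples:
# Minimal sketch: verify that robots.txt allows crawling before running AsyncURLCrawler.
from urllib.robotparser import RobotFileParser

seed = "https://pouyae.ir"  # example seed URL
robots = RobotFileParser()
robots.set_url(seed.rstrip("/") + "/robots.txt")
robots.read()
if robots.can_fetch("Mozilla", seed):
    print("robots.txt allows crawling this seed")
else:
    print("robots.txt disallows crawling this seed")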
👉 For complete documentation, read here.
👉 Source code on GitHub here.
pip install AsyncURLCrawler
pip install AsyncURLCrawler==<version>
👉 The official page of the project on PyPI.
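To confirm which version was installed, pip's own tooling can be used; this is a generic pip command, not something specific to the package:
pip show AsyncURLCrawler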
Here is a simple Python script showing how to use the package:
import asyncio
import os

import yaml

from AsyncURLCrawler.crawler import Crawler
from AsyncURLCrawler.parser import Parser

# Directory where result.yaml will be written.
output_path = os.getcwd()


async def main():
    parser = Parser(
        delay_start=0.1,
        max_retries=5,
        request_timeout=1,
        user_agent="Mozilla",
    )
    crawler = Crawler(
        seed_urls=["https://pouyae.ir"],
        parser=parser,
        exact=True,
        deep=False,
        delay=0.1,
    )
    result = await crawler.crawl()
    # Each seed URL maps to a set of collected URLs; convert the sets to lists
    # so the result can be serialized to YAML.
    with open(os.path.join(output_path, "result.yaml"), "w") as file:
        for key in result:
            result[key] = list(result[key])
        yaml.dump(result, file)


if __name__ == "__main__":
    asyncio.run(main())
This is the output for the above code:
https://pouyae.ir:
- https://github.com/PouyaEsmaeili/AsyncURLCrawler
- https://pouyae.ir/images/pouya3.jpg
- https://github.com/PouyaEsmaeili/CryptographicClientSideUserState
- https://github.com/PouyaEsmaeili/RateLimiter
- https://pouyae.ir/
- https://github.com/luizdepra/hugo-coder/
- https://duman.pouyae.ir/
- https://pouyae.ir/projects/
- https://pouyae.ir/images/pouya4.jpg
- https://pouyae.ir/images/pouya5.jpg
- https://pouyae.ir/gallery/
- https://github.com/PouyaEsmaeili
- https://pouyae.ir/blog/
- https://www.linkedin.com/in/pouya-esmaeili-9124b839/
- https://pouyae.ir/about/
- https://stackoverflow.com/users/13118327/pouya-esmaeili?tab=profile
- https://pouyae.ir/contact-me/
- https://github.com/PouyaEsmaeili/SnowflakeID
- https://pouyae.ir/images/pouya2.jpg
- https://github.com/PouyaEsmaeili/gFuzz
- https://linktr.ee/pouyae
- https://gohugo.io/
- https://pouyae.ir/images/pouya1.jpg
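If you want to post-process the result in another script, the YAML file can be loaded back into a plain dictionary. A minimal sketch, assuming result.yaml was produced by the example above:
# Load the crawl result and print how many URLs were collected per seed.
import yaml

with open("result.yaml") as file:
    crawled = yaml.safe_load(file)

for seed, urls in crawled.items():
    print(f"{seed}: {len(urls)} URLs collected")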
👉 There is also a blog post about using AsyncURLCrawler
to find malicious URLs in a web page. Read here.
The script can be customized using the src/cmd/cmd.py
file, which accepts various arguments to configure the crawler's behavior:
argument | description |
---|---|
--url | Specifies a list of URLs to crawl. At least one URL must be provided. |
--exact | Optional flag; if set, the crawler will restrict crawling to the specified subdomain/domain only. Default is False. |
--deep | Optional flag; if enabled, the crawler will explore all visited URLs. Not recommended due to potential resource intensity. If --deep is True, the --exact flag is ignored. |
--delay | Sets the delay between consecutive HTTP requests, in seconds. |
--output | Specifies the path for the output file, which will be saved in YAML format. |
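As an illustration only, an invocation could look like the line below; the exact argument syntax (for example, whether --exact is a boolean switch) is defined in src/cmd/cmd.py and may differ from this sketch:
python src/cmd/cmd.py --url https://pouyae.ir --exact --delay 0.1 --output result.yaml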
There is a Dockerfile in src/cmd to run the above-mentioned cmd tool in a Docker container.
docker build -t crawler .
docker run -v my_dir:/src/output --name crawler crawler
After the container has run, the resulting output file will be available in the my_dir volume defined above. To configure the tool for your needs, check the CMD in the Dockerfile.
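If you would rather change the arguments at run time than edit the Dockerfile, the image command can usually be overridden when starting the container; the script path and flags below are assumptions and may not match the actual image layout:
docker run -v my_dir:/src/output --name crawler crawler \
    python cmd.py --url https://pouyae.ir --delay 0.5 --output /src/output/result.yaml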
Requirements:
python3 -m pip install --upgrade build
python3 -m pip install --upgrade twine
👉 For more details, check Packaging Python Projects.
Build and upload:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Install the packages listed in docs/doc-requirements.txt.
cd docs
pip install -r doc-requirements.txt
make clean
make html
HTML files will be generated in docs/build. Push them to the repository and deploy on pages.dev.
- Branch off, implement features, and merge them into main. Remove feature branches.
- Update the version in pyproject.toml and push to main.
- Add a release tag in GitHub.
- Build and push the package to PyPI.
- Build the documentation and push the HTML files to the AsyncURLCrawlerDocs repo.
- Documentation will be deployed on pages.dev automatically.