simple-web-crawler

A simple web crawler written in Python with the help of BeautifulSoup (and Flask!)

This program uses Flask to expose the crawler as an API that can crawl any website. It retrieves links to pages as well as links to JS, CSS and image files.

This was previously hosted on a free Heroku server, but since free Heroku servers are no longer offered, it should be downloaded, built and tested locally.

Hitting the API

The program has only one endpoint, which expects a GET request with two parameters: domain (the link to crawl) and depth. Example:

http://localhost:5000/crawl?domain=http://nubank.com.br/&depth=0
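The same request can also be made from a script. The following is a minimal sketch using only the Python standard library; it assumes the server is running at Flask's default local address and that the path and parameter names match the example above. The helper names `build_crawl_url` and `fetch_crawl_result` are hypothetical and not part of this project. The response body is returned as raw text because its exact format is not documented here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API is served locally on Flask's default port,
# at the /crawl path shown in the example request.
BASE_URL = "http://localhost:5000/crawl"


def build_crawl_url(domain, depth, base_url=BASE_URL):
    """Return the crawl URL with both query parameters percent-encoded."""
    query = urlencode({"domain": domain, "depth": depth})
    return f"{base_url}?{query}"


def fetch_crawl_result(domain, depth):
    """Send the GET request and return the raw response body as text."""
    with urlopen(build_crawl_url(domain, depth), timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_crawl_url("http://nubank.com.br/", 0))
    # Uncomment once the server is running locally:
    # print(fetch_crawl_result("http://nubank.com.br/", 0))
```

Percent-encoding the domain value keeps characters such as ":" and "/" from being misread as part of the outer URL; the server should decode them back to the original link.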

You can call it from your browser, but a tool like Postman may be a better choice, since it displays the result in a more readable way.

The endpoint returns a list of links. Each entry includes that page's JS links, CSS links and image links, along with the depth of the page relative to the link provided.
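To illustrate how the links on a single page might be sorted into these groups, here is a rough sketch. It is not the code in web_crawler.py: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, and the class name `LinkCollector`, the function `collect_links` and the group names in the returned dictionary are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect page, JS, CSS and image links from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        # Hypothetical grouping; the real API may name these differently.
        self.links = {"pages": [], "js": [], "css": [], "images": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._add("pages", attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self._add("js", attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # Only <link> tags marked as stylesheets count as CSS.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self._add("css", attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self._add("images", attrs["src"])

    def _add(self, kind, url):
        # Resolve relative links against the page's own URL.
        self.links[kind].append(urljoin(self.base_url, url))


def collect_links(html, base_url):
    """Parse the HTML text and return the grouped absolute links."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler built on this idea would fetch each URL in the "pages" group in turn, increasing the depth by one at each step, until the requested depth is reached.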

Running locally

To run it locally, you'll need Python 3 installed and also Flask and BeautifulSoup.

Installing dependencies

To install Python 3, please check the official website: https://www.python.org/

Also, check the BeautifulSoup and Flask installation pages and follow the specific steps for your OS:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

https://flask.palletsprojects.com/en/1.1.x/installation/#

Running the program

In your Terminal, go to the project folder and run the "app.py" file:

python3 app.py

By default, Flask runs on port 5000, so once the program is running, you should be able to access it at:

http://localhost:5000/
