- Create your Python virtualenv
- Run this (you can verify the container is up with the check after this list):

  ```bash
  docker run -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=<pass> -e POSTGRES_DB=ads -p 5345:5432 -d postgres
  ```
- (Optional) Get a Telegram Bot token & add it to `.env`
- Complete the `.env` as needed, then run `export $(cat .env)`
- Run your scraper with: `scrapy crawl <scraper_name>`
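To check that the Postgres container from the first step is reachable, you can connect with `psql` (this assumes the psql client is installed on the host; the port and database name match the `docker run` line above):

```bash
# Connect to the containerized Postgres exposed on host port 5345
psql -h localhost -p 5345 -U postgres -d ads
```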
This makes use of a notification bot through the SpiderBotCallback; you can disable it in the settings if you want no bot interaction (just data gathering).
Remember to set the database- and bot-specific variables as environment variables before starting! Open the `.env` file, complete it, then run `source .env`.
Database:
- POSTGRES_PASSWORD
- POSTGRES_HOST (if needed)
- POSTGRES_PORT (if needed)
Bot:
- BOT_USER_SETTINGS_FILE
- BOT_TOKEN
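A minimal `.env` sketch with the variables above (all values are placeholders, and the settings-file path is an assumption):

```bash
# Database
POSTGRES_PASSWORD=<pass>
# POSTGRES_HOST=localhost   # only if the database is not on the default host
# POSTGRES_PORT=5432        # only if you changed the exposed port

# Bot
BOT_TOKEN=<telegram-bot-token>
BOT_USER_SETTINGS_FILE=<path/to/bot_user_settings.json>   # hypothetical path, adjust to your setup
```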
- Install the following:
- git
- docker
- run-one (optional)
- Clone the repository
- Build the scraper docker image
- Create pgdata and httpcache docker volumes
Run scrapers manually:
- Install:
- pip
- virtualenvwrapper
- Create a Python 3 virtualenv
- Install the packages from `requirements.txt` (see the example after this list)
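For example, with virtualenvwrapper (the environment name `scraper` is an assumption):

```bash
# Create and activate a Python 3 virtualenv
mkvirtualenv -p python3 scraper

# Install the project dependencies
pip install -r requirements.txt

# Run a spider
scrapy crawl <scraper_name>
```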
```bash
docker volume create pgdata
docker volume create httpcache

# Build the image for all scrapers
docker build -t scraper .

# Build the image for running individual scrapers
docker build -f Dockerfile-single-spider -t single_scraper .
```
Please see the `docker_env.list` file and set the following:

```
POSTGRES_USER=postgres
POSTGRES_PASSWORD=<pass>
# database name
POSTGRES_DB=realestate
# postgres docker volume mount point
PGDATA=/var/lib/postgresql/data
```
Then you can start the Postgres instance with this command, passing the path to the env file:

```bash
docker run --env-file "<path/to/docker_env.list>" -p <exposed port>:5432 -d postgres
```
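For example, exposing the default Postgres port (the env-file path here is an assumption; point it at wherever your `docker_env.list` actually lives):

```bash
docker run --env-file "./docker_env.list" -p 5432:5432 -d postgres
```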
To run all scrapers in one container (`--network=host` lets the container reach the Postgres instance exposed on the host):

```bash
docker run --network=host -v httpcache:/var/lib/httpcache/ scraper
```
To restore a database dump into the running Postgres container:

```bash
cat <dump_name>.sql | docker exec -i <docker-postgres-container> psql -U postgres -W -d realestate
```
To rescrape all URLs from the httpcache (you need to edit the spider name in Dockerfile-only-httpcache first):

```bash
docker build -t scraper_only_httpcache . -f Dockerfile-only-httpcache
docker run --network=host -v httpcache:/var/lib/httpcache/ scraper_only_httpcache
```
To run Docker without sudo, add your user to the docker group (log out and back in for it to take effect):

```bash
sudo usermod -aG docker $USER
```
Make sure you have run-one installed; it is used to ensure that only one instance of the scraper is running at any given time:

```bash
sudo apt-get install run-one
```
Set up the crontab entries:

```bash
python setup_crontab.py
```
To run all scrapers you need only this line:

```
* * * * * run-one docker run --network=host -v httpcache:/var/lib/httpcache/ scraper
```
To run individual scrapers you'll need one of these lines for each scraper:

```
* * * * * run-one docker run --network=host -v httpcache:/var/lib/httpcache/ -e spider_name=<spider name> single_scraper
```
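If you prefer to manage the entries by hand rather than via `setup_crontab.py`, the standard crontab tools work:

```bash
# Open your crontab in an editor and paste the line(s) above
crontab -e

# List the installed entries to verify
crontab -l
```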
Don't forget the backup script!

```
* * * * * run-one /path/to/backup.sh
```
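A minimal `backup.sh` sketch, assuming the same database and user as the restore command above (`realestate`, `postgres`); the container name and output directory are placeholders to adjust:

```bash
#!/usr/bin/env bash
# Dump the realestate database from the running Postgres container as plain SQL,
# so it can later be restored with the psql command shown above.
set -euo pipefail

CONTAINER=<docker-postgres-container>   # placeholder: your Postgres container name
OUT_DIR=/path/to/backups                # placeholder: where dumps should be stored

mkdir -p "$OUT_DIR"
docker exec "$CONTAINER" pg_dump -U postgres realestate \
  > "$OUT_DIR/realestate_$(date +%Y%m%d_%H%M%S).sql"
```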
This software and the data gathered or sent is for my personal use only. I am not responsible for any damages caused by proper or improper use of the software. This software is in development and subject to change. Any data retrieved or stored does not contain personally identifying information. Please contact me for clarification about data usage or with requests for removal.