[WIP] Scrapinghub Migration #81

Open · wants to merge 8 commits into base: master
3 changes: 1 addition & 2 deletions .travis.yml
@@ -5,13 +5,12 @@ sudo: false
python:
- "2.7"

install: pip install -r requirements-testing.txt
install: pip install -r requirements.txt

script: make coverage

before_install:
- pip install codecov

after_success:
- coveralls
- codecov
11 changes: 9 additions & 2 deletions README.md
@@ -1,6 +1,5 @@
[![Build Status](https://travis-ci.org/manolo-rocks/manolo_scraper.svg)](https://travis-ci.org/aniversarioperu/manolo_scraper)
[![Coverage Status](https://coveralls.io/repos/aniversarioperu/manolo_scraper/badge.svg?branch=master&service=github)](https://coveralls.io/github/aniversarioperu/manolo_scraper?branch=master)
[![codecov.io](http://codecov.io/github/aniversarioperu/manolo_scraper/coverage.svg?branch=master)](http://codecov.io/github/aniversarioperu/manolo_scraper?branch=master)
[![codecov.io](http://codecov.io/github/manolo-rocks/manolo_scraper/coverage.svg?branch=master)](http://codecov.io/github/aniversarioperu/manolo_scraper?branch=master)
[![Code Issues](https://www.quantifiedcode.com/api/v1/project/396d38fe507441fa92d7286d07c8577a/badge.svg)](https://www.quantifiedcode.com/app/project/396d38fe507441fa92d7286d07c8577a)

# All spiders go here
@@ -93,3 +92,11 @@ production database.

* [x] Ministerio de Vivienda
* **url**: http://geo.vivienda.gob.pe/Visitas/controlVisitas/index.php?r=consultas/visitaConsulta/index


## Deploying to Scrapinghub

```bash
cd manolo_scraper
shub deploy
```
28 changes: 15 additions & 13 deletions manolo_scraper/manolo_scraper/pipelines.py
@@ -27,6 +27,19 @@ def process_item(self, item, spider):


class CleanItemPipeline(object):

def save_item(self, item):
db = db_connect()
table = db['visitors_visitor']

if table.find_one(sha1=item['sha1']) is None:
item['created'] = datetime.datetime.now()
item['modified'] = datetime.datetime.now()
table.insert(item)
logging.info("Saving: {0}, date: {1}".format(item['sha1'], item['date']))
else:
logging.info("{0}, date: {1} is found in db, not saving".format(item['sha1'], item['date']))

def process_item(self, item, spider):
for k, v in item.items():
if isinstance(v, basestring) is True:
@@ -60,17 +73,6 @@ def process_item(self, item, spider):
if 'HORA DE' in item['time_start']:
raise DropItem("This is a header, drop it: {}".format(item))

self.save_item(item)
if spider.settings.get('STORE_IN_DATABASE'):
self.save_item(item)
return item

def save_item(self, item):
db = db_connect()
table = db['visitors_visitor']

if table.find_one(sha1=item['sha1']) is None:
item['created'] = datetime.datetime.now()
item['modified'] = datetime.datetime.now()
table.insert(item)
logging.info("Saving: {0}, date: {1}".format(item['sha1'], item['date']))
else:
logging.info("{0}, date: {1} is found in db, not saving".format(item['sha1'], item['date']))
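The relocated `save_item` keeps the same sha1-based deduplication (insert only when no row with that hash exists, stamping `created`/`modified` on first save), and the call is now gated on the new `STORE_IN_DATABASE` setting. A minimal sketch of that pattern, using stdlib `sqlite3` as a stand-in for the project's `dataset` dependency (table and column names here are illustrative only):

```python
import datetime
import sqlite3

def save_item(conn, item, store_in_database=True):
    """Insert item only if its sha1 is not already present.

    Mirrors the pipeline's dedup logic; `store_in_database` plays the
    role of the STORE_IN_DATABASE toggle added to settings.py.
    """
    if not store_in_database:
        return False  # setting disabled: skip persistence entirely
    row = conn.execute(
        "SELECT 1 FROM visitors_visitor WHERE sha1 = ?",
        (item["sha1"],)).fetchone()
    if row is None:
        now = datetime.datetime.now().isoformat()
        conn.execute(
            "INSERT INTO visitors_visitor (sha1, date, created, modified) "
            "VALUES (?, ?, ?, ?)",
            (item["sha1"], item["date"], now, now))
        return True   # saved
    return False      # duplicate sha1: not saving

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors_visitor "
             "(sha1 TEXT, date TEXT, created TEXT, modified TEXT)")
item = {"sha1": "abc123", "date": "2016-10-01"}
print(save_item(conn, item))   # True: first insert
print(save_item(conn, item))   # False: duplicate skipped
```

The same item run through twice is stored once; with the flag off, nothing touches the database, which is what lets the spiders run on Scrapinghub without a reachable production database.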
2 changes: 2 additions & 0 deletions manolo_scraper/manolo_scraper/settings.py
@@ -84,3 +84,5 @@ def get_secret(setting, secrets=secrets):
DUPEFILTER_DEBUG = True
COOKIES_DEBUG = True
COOKIES_ENABLED = True

STORE_IN_DATABASE = False
10 changes: 7 additions & 3 deletions manolo_scraper/manolo_scraper/spiders/spiders.py
@@ -4,8 +4,10 @@
import re
from exceptions import NotImplementedError


import scrapy
from scrapy import exceptions
from delorean import Delorean, parse

from ..items import ManoloItem
from ..item_loaders import ManoloItemLoader
@@ -20,7 +22,9 @@ def __init__(self, date_start=None, date_end=None, *args, **kwargs):
self.date_start = date_start
self.date_end = date_end

today = datetime.date.today()
d = Delorean()
d.shift("America/Lima")
today = d.date

if self.date_start is None:
self.date_start = today.strftime('%Y-%m-%d')
@@ -39,8 +43,8 @@ def days_between_dates(date_start, date_end):
return delta.days

def start_requests(self):
d1 = datetime.datetime.strptime(self.date_start, '%Y-%m-%d').date()
d2 = datetime.datetime.strptime(self.date_end, '%Y-%m-%d').date()
d1 = parse(self.date_start, dayfirst=False, yearfirst=True).date
d2 = parse(self.date_end, dayfirst=False, yearfirst=True).date
# range to fetch
delta = d2 - d1

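The switch from `datetime.date.today()` to a Delorean instance shifted to `America/Lima` makes "today" timezone-correct for Peru no matter where the spider runs: a host on UTC just past midnight would otherwise start scraping a date that has not begun in Lima yet. The same idea with only the stdlib, using a fixed UTC-5 offset as a stand-in (Lima observes no DST, so a static offset is safe for this illustration):

```python
import datetime

# Lima is fixed at UTC-5 year-round, so a static timezone suffices here.
LIMA = datetime.timezone(datetime.timedelta(hours=-5))

def today_in_lima(now=None):
    """Return today's date as seen in Lima."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now.astimezone(LIMA).date()

# 03:00 UTC on Oct 2 is still 22:00 on Oct 1 in Lima.
utc_after_midnight = datetime.datetime(
    2016, 10, 2, 3, 0, tzinfo=datetime.timezone.utc)
print(today_in_lima(utc_after_midnight))   # 2016-10-01
```

`delorean.parse` in `start_requests` plays the role of `strptime` above; passing `dayfirst=False, yearfirst=True` keeps it pinned to the existing `%Y-%m-%d` argument format.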
8 changes: 8 additions & 0 deletions manolo_scraper/scrapinghub.yml
@@ -0,0 +1,8 @@
endpoints:
default: https://app.scrapinghub.com/api/
projects:
default: 104926
version: GIT
stacks:
default: scrapy:1.1
requirements_file: ../requirements.txt
5 changes: 3 additions & 2 deletions manolo_scraper/scrapy.cfg
@@ -7,5 +7,6 @@
default = manolo_scraper.settings

[deploy]
#url = http://localhost:6800/
project = manolo_scraper
url = http://dash.scrapinghub.com/api/scrapyd/
project = 104926
version = GIT
5 changes: 0 additions & 5 deletions requirements-testing.txt

This file was deleted.

75 changes: 47 additions & 28 deletions requirements.txt
@@ -1,29 +1,48 @@
Mako==1.0.0
alembic==0.8.8
attrs==16.2.0
Babel==2.3.4
cffi==1.8.3
click==6.6
colorama==0.3.7
cryptography==1.5.2
cssselect==0.9.2
dataset==0.7.0
Delorean==0.6.0
enum34==1.1.6
hubstorage==0.23.2
humanize==0.5.1
idna==2.1
ipaddress==1.0.17
lxml==3.6.4
Mako==1.0.4
MarkupSafe==0.23
PyYAML==3.11
SQLAlchemy==1.0.6
scrapy==1.0.3
Twisted==14.0.2
Unidecode==0.04.16
Unipath==1.0
alembic==0.7.1
argparse==1.2.1
cffi==0.8.6
characteristic==14.2.0
cryptography==0.6.1
cssselect==0.9.1
dataset==0.6.0
lxml==3.4.1
pyOpenSSL==0.14
pyasn1==0.1.7
pyasn1-modules==0.0.5
pycparser==2.10
python-slugify==0.1.0
psycopg2==2.6.1
queuelib==1.2.2
scrapylib==1.5.0
service-identity==14.0.0
six==1.8.0
w3lib==1.10.0
wsgiref==0.1.2
zope.interface==4.1.1
normality==0.2.4
nose==1.3.7
parsel==1.0.3
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.14
PyDispatcher==2.0.5
pyOpenSSL==16.1.0
python-dateutil==2.5.3
python-editor==1.0.1
python-termstyle==0.1.10
pytz==2016.6.1
PyYAML==3.12
queuelib==1.4.2
rednose==1.2.1
requests==2.11.1
retrying==1.3.3
scrapinghub==1.8.0
Scrapy==1.1.3
scrapylib==1.7.0
service-identity==16.0.0
shub==2.4.2
six==1.10.0
SQLAlchemy==1.0.15
Twisted==16.4.1
tzlocal==1.2.2
Unidecode==0.4.19
Unipath==1.1
w3lib==1.15.0
zope.interface==4.3.2