tags | aliases | cssclass
---|---|---
💽/🐍 | |
Note: All instructions were performed on macOS with Scrapy 2.8 in a Python 3.9.13 virtual environment.
Install Scrapy:
pip install Scrapy
Create a new project; this command creates a PROJECT_NAME directory with multiple .py files:
scrapy startproject PROJECT_NAME
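For reference, a newly generated project contains roughly the following files (layout as described in the Scrapy docs; details may vary slightly between versions):
PROJECT_NAME/
    scrapy.cfg            # deploy configuration file
    PROJECT_NAME/         # the project's Python module
        __init__.py
        items.py          # items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where spiders are placed
            __init__.py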
A Spider is a class that you define which Scrapy uses to scrape information from a website (or group of websites).
Example Spider from Scrapy tutorial:
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        Path(filename).write_bytes(response.body)
        self.log(f'Saved file {filename}')
To put the spider to work, go to the project's top-level directory and run:
scrapy crawl quotes
- Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the Spider. Upon receiving a response for each one, it instantiates `Response` objects and calls the callback method associated with the request (in this case, the `parse` method), passing the response as argument.

As a shortcut, instead of implementing `start_requests` you can define a `start_urls` class attribute with a list of URLs. The `parse()` method will then be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so, because `parse()` is the default callback for requests without an explicitly assigned callback.
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        Path(filename).write_bytes(response.body)
Extracting data using Scrapy shell:
Note: If on Windows, use double quotes (" ") instead of single quotes.
scrapy shell 'https://quotes.toscrape.com/page/1/'
After executing the line above, Scrapy fetches the page and opens an interactive shell with the response available. The following example commands are run inside this shell, though arbitrary Python commands can also be submitted.
response.css('title') # -> [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
response.css('title::text').getall() # -> ['Quotes to Scrape']
response.css('title').getall() # -> ['<title>Quotes to Scrape</title>']
response.css('title::text').get() # -> 'Quotes to Scrape'
response.css('title::text')[0].get() # -> 'Quotes to Scrape'
response.css('noelement')[0].get() # -> IndexError: list index out of range
# Use `.get()` directly on the `SelectorList` instance to return `None` if there are no results
response.css("noelement").get() # -> None
# Regular Expressions
response.css('title::text').re(r'Quotes.*') # -> ['Quotes to Scrape']
response.css('title::text').re(r'Q\w+') # -> ['Quotes']
response.css('title::text').re(r'(\w+) to (\w+)') # -> ['Quotes', 'Scrape']
# XPath (besides CSS, Scrapy selectors supports XPath expressions)
response.xpath('//title') # -> [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
response.xpath('//title/text()').get() # -> 'Quotes to Scrape'
The HTML div element for each quote looks like this:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
This:
# Get a list of selectors for the quote HTML elements with:
response.css("div.quote")
Returns:
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]
# Extract the text and author from the first quote element like so:
quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()
author = quote.css("small.author::text").get()
# text -> '"The world as we have created it is a process...etc"'
# author -> 'Albert Einstein'
Given that the tags are a list of strings, we can use the .getall()
method to get all of them:
tags = quote.css("div.tags a.tag::text").getall()
# tags -> ['change', 'deep-thoughts', 'thinking', 'world']
Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary. This:
for quote in response.css("div.quote"):
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    tags = quote.css("div.tags a.tag::text").getall()
    print(dict(text=text, author=author, tags=tags))
Returns:
{
'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'author': 'Albert Einstein',
'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
}
{
'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'author': 'J.K. Rowling',
'tags': ['abilities', 'choices'],
}
In this example, the spider generates many dictionaries containing the data extracted from the page. This is why we use yield
in the callback.
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
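Running this spider with `scrapy crawl quotes` prints the yielded dictionaries in the log; assuming the built-in feed exports, they can also be written straight to a file (the filename here is arbitrary):
scrapy crawl quotes -O quotes.json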
Following links gives the spider the ability to scrape data from all pages of a website. The HTML link to the next page looks like this:
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
Link extraction code
response.css('li.next a').get() # -> '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
response.css('li.next a::attr(href)').get() # -> '/page/2/'
response.css('li.next a').attrib['href'] # -> '/page/2/'
Final Spider with link following code
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Unlike `scrapy.Request` above, `response.follow` supports relative URLs directly, so there is no need to call `response.urljoin`.
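For example, the pagination block of the spider above can be shortened (a sketch mirroring the Scrapy tutorial) to:
# inside parse(), replacing the urljoin + scrapy.Request lines
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)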
# When passing a Selector for an <a> element, response.follow uses its href attribute automatically
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

# To create multiple requests from an iterable, you can use `response.follow_all` instead
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

# The above, consolidated further
yield from response.follow_all(css='ul.pager a', callback=self.parse)
Here is another spider that illustrates callbacks and following links, this time for scraping author information:
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page and follow all the links to the author pages, calling the `parse_author` callback for each of them, as well as the pagination links with the `parse` callback as we saw before.
Here we're passing callbacks to `response.follow_all` as positional arguments to make the code shorter; it also works for `Request`.
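For instance (a sketch; `author_url` is a hypothetical variable), the callback can also be passed positionally to Request:
yield scrapy.Request(author_url, self.parse_author)  # author_url is hypothetical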
The `parse_author` callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting `DUPEFILTER_CLASS`.
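A minimal sketch of how that behaviour can be adjusted (assuming the default settings module; `url` below is a placeholder): duplicate filtering can be relaxed per request or replaced globally.
# Per request: fetch this URL even if it has already been visited
yield scrapy.Request(url, callback=self.parse, dont_filter=True)

# Globally, in settings.py: swap the default RFPDupeFilter for the
# no-op BaseDupeFilter, which disables duplicate filtering entirely
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'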
- Does Scrapy respect the site's robots.txt file? link
    - According to the docs, it's enabled by default only when you create a project using the `scrapy startproject` command; otherwise it defaults to `False`.
    - Answering your question: yes, the `scrapy shell` command does respect the `robots.txt` configuration defined in `settings.py`. If `ROBOTSTXT_OBEY = True`, trying to use `scrapy shell` on a protected URL will return a `None` response.
    - You can also test it by passing the robots.txt setting via the command line:
        - `scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"`
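For completeness, a minimal sketch of the corresponding line in a project's settings.py:
# settings.py
ROBOTSTXT_OBEY = True  # set to False to ignore robots.txt; projects generated with startproject default to True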
🔗 Links to this page: [[Python]]