---
slug: scrape-crunchbase-python
title: 'How to scrape Crunchbase using Python in 2024 (Easy Guide)'
tags: [community]
description: 'Learn how to scrape Crunchbase using Crawlee for Python'
image: "./img/scrape_crunchbase.webp"
authors: [MaxB]
---

Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective [Crunchbase](https://www.crunchbase.com/) scraper in Python that gets you the data you need.

Crunchbase tracks the details that matter: locations, business focus, founders, and investment histories. Manually extracting data from such a large dataset isn't practical; automation is essential for transforming this information into an analyzable format.

By the end of this blog, we'll explore three different ways to extract data from Crunchbase using [`Crawlee for Python`](https://github.com/apify/crawlee-python). We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us understand how important it is to [choose the right data source](https://www.crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project) for a project.

:::note

This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on [Discord](https://discord.com/invite/jyEM2PRvMU) to share your experiences and blog ideas - we value these contributions from developers like you.
:::

![How to Scrape Crunchbase Using Python](./img/scrape_crunchbase.webp)

Key steps we'll cover:

1. Project setup
2. Choosing the data source
3. Implementing sitemap-based crawler
4. Analysis of search-based approach and its limitations
5. Implementing the official API crawler
6. Conclusion and repository access

## Prerequisites

- Python 3.9 or higher
- Familiarity with web scraping concepts
- Crawlee for Python `v0.5.0`
- poetry `v2.0` or higher

### Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `BeautifulSoup`), so we'll set up the project manually.

1. Install [`Poetry`](https://python-poetry.org/)

    ```bash
    pipx install poetry
    ```

2. Create and navigate to the project folder.

    ```bash
    mkdir crunchbase-crawlee && cd crunchbase-crawlee
    ```

3. Initialize the project using Poetry, leaving all fields empty.

    ```bash
    poetry init
    ```

    When prompted:

    - For "Compatible Python versions", enter: `>={your Python version},<4.0`
      (For example, if you're using Python 3.10, enter: `>=3.10,<4.0`)
    - Leave all other fields empty by pressing Enter
    - Confirm the generation by typing "yes"

4. Add and install Crawlee with the necessary dependencies to your project using `Poetry`.

    ```bash
    poetry add "crawlee[parsel,curl-impersonate]"
    ```

5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects.

    ```bash
    mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}
    ```

After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase.

### Choosing the data source

While we can extract target data directly from the [company page](https://www.crunchbase.com/organization/apify), we need to choose the best way to navigate the site.

A careful examination of Crunchbase's structure shows that we have three main options for obtaining data:

1. [`Sitemap`](https://www.crunchbase.com/www-sitemaps/sitemap-index.xml) - for complete site traversal.
2. [`Search`](https://www.crunchbase.com/discover/organization.companies) - for targeted data collection.
3. [Official API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-getting-started) - the recommended method.

Let's examine each of these approaches in detail.

## Scraping Crunchbase using sitemap and Crawlee for Python

A `sitemap` is a standard way of exposing a site's structure to crawlers such as [`Google`](https://google.com/), [`Ahrefs`](https://ahrefs.com/), and other search engines. All crawlers must follow the rules described in [`robots.txt`](https://www.crunchbase.com/robots.txt).

Let's look at the structure of Crunchbase's sitemap:

![Sitemap first lvl](./img/sitemap_lvl_one.webp)

As you can see, links to organization pages are located inside second-level sitemap files, which are compressed using `gzip`.

The structure of one of these files looks like this:

![Sitemap second lvl](./img/sitemap_lvl_two.webp)

The `lastmod` field is particularly important here. It lets you track which companies have updated their information since the previous data collection, which makes it especially useful for regular data refreshes.
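For example, if you persist a timestamp at the end of each run, you can use `lastmod` to re-crawl only the companies that have changed since then. Here's a minimal sketch of that idea (the cutoff date is illustrative, and only the date part of `lastmod` is compared to keep the parsing simple):

```python
# incremental.py - illustrative sketch, not part of the crawler itself

from datetime import datetime

from parsel import Selector

# Pretend this timestamp was persisted at the end of the previous run
LAST_RUN = datetime(2024, 12, 1)


def updated_company_urls(sitemap_xml: str) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than the previous run."""
    selector = Selector(sitemap_xml)
    urls: list[str] = []
    for entry in selector.xpath('//url'):
        loc = entry.xpath('./loc/text()').get()
        lastmod = entry.xpath('./lastmod/text()').get()
        # Compare the date part only, avoiding timezone parsing edge cases
        if loc and lastmod and datetime.fromisoformat(lastmod[:10]) > LAST_RUN:
            urls.append(loc)
    return urls
```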
### 1. Configuring the crawler for scraping

To work with the site, we'll use [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient), which impersonates a `Safari` browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features.

The reason is that Crunchbase uses [Cloudflare](https://www.cloudflare.com/) to protect against automated access. This is clearly visible when analyzing traffic on a company page:

![Cloudflare Link](./img/cloudflare_link.webp)

An interesting detail is that `challenges.cloudflare` executes after the document containing the data has loaded. This means we receive the data first, and only then does JavaScript check whether we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser's, we'll successfully receive the data.

Cloudflare [also analyzes traffic at the sitemap level](https://developers.cloudflare.com/waf/custom-rules/use-cases/allow-traffic-from-verified-bots/). If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser.

To prevent blocks due to overly aggressive crawling, we'll configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings).

When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the [documentation](https://www.crawlee.dev/python/docs/guides/proxy-management).

We'll save our scraping results in `JSON` format. Here's how the basic crawler configuration looks:

```python
# main.py

from crawlee import ConcurrencySettings, HttpHeaders
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(
        impersonate='safari17_0',
        headers=HttpHeaders(
            {
                'accept-language': 'en',
                'accept-encoding': 'gzip, deflate, br, zstd',
            }
        ),
    )
    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml'])

    await crawler.export_data_json('crunchbase_data.json')
```
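If you do add proxies when scaling up, the configuration above only needs one extra argument. A minimal sketch, assuming Crawlee's `ProxyConfiguration` from the [proxy management guide](https://www.crawlee.dev/python/docs/guides/proxy-management); the proxy URLs below are placeholders:

```python
# main.py - optional proxy setup (the proxy URLs are placeholders)

from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ]
)

crawler = ParselCrawler(
    request_handler=router,
    concurrency_settings=concurrency_settings,
    http_client=http_client,
    proxy_configuration=proxy_configuration,  # proxies are rotated between requests
    max_requests_per_crawl=30,
)
```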
### 2. Implementing sitemap navigation

Sitemap navigation happens in two stages. In the first stage, we need to get a list of all the files containing organization information:

```python
# routes.py

from crawlee import Request
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request} ...')

    requests = [
        Request.from_url(url, label='sitemap')
        for url in context.selector.xpath('//loc[contains(., "sitemap-organizations")]/text()').getall()
    ]

    # Since this is a tutorial, we only enqueue the first sitemap file
    await context.add_requests(requests, limit=1)
```

In the second stage, we process the second-level sitemap files, which are stored in `gzip` format. This requires a special approach, as the data needs to be decompressed first:

```python
# routes.py

from gzip import decompress

from parsel import Selector


@router.handler('sitemap')
async def sitemap_handler(context: ParselCrawlingContext) -> None:
    """Sitemap gzip request handler."""
    context.log.info(f'sitemap_handler processing {context.request.url} ...')

    data = context.http_response.read()
    data = decompress(data)

    selector = Selector(data.decode())

    requests = [Request.from_url(url, label='company') for url in selector.xpath('//loc/text()').getall()]

    await context.add_requests(requests)
```

### 3. Extracting and saving data

Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: `Company Name`, `Short Description`, `Website`, and `Location`.

One of Crunchbase's advantages is that all the data is stored in `JSON` format within the page:

![Company Data](./img/data_json.webp)

This significantly simplifies data extraction: we only need one `XPath` selector to get the `JSON`, and then apply [`jmespath`](https://jmespath.org/) to extract the needed fields:

```python
# routes.py

@router.handler('company')
async def company_handler(context: ParselCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    json_selector = context.selector.xpath('//*[@id="ng-state"]/text()')

    await context.push_data(
        {
            'Company Name': json_selector.jmespath('HttpState.*.data[].properties.identifier.value').get(),
            'Short Description': json_selector.jmespath('HttpState.*.data[].properties.short_description').get(),
            'Website': json_selector.jmespath('HttpState.*.data[].cards.company_about_fields2.website.value').get(),
            'Location': '; '.join(
                json_selector.jmespath(
                    'HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value'
                ).getall()
            ),
        }
    )
```

The collected data is saved to `Crawlee for Python`'s internal storage using the `context.push_data` method. When the crawler finishes, we export all the collected data to a JSON file:

```python
# main.py

await crawler.export_data_json('crunchbase_data.json')
```

### 4. Running the project

With all the components in place, we need to create an entry point for our crawler:

```python
# __main__.py

import asyncio

from .main import main

if __name__ == '__main__':
    asyncio.run(main())
```

Execute the crawler using Poetry:

```bash
poetry run python -m crunchbase-crawlee
```
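After the run completes, `crunchbase_data.json` will contain one record per company page. The values below are illustrative, but they show the shape of the output produced by `company_handler`:

```json
[
    {
        "Company Name": "Apify",
        "Short Description": "Apify is a web scraping and automation platform.",
        "Website": "https://apify.com",
        "Location": "Prague; Hlavni mesto Praha; Czech Republic; Europe"
    }
]
```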
### 5. Characteristics of the sitemap crawler

The sitemap approach has distinct advantages and limitations. It's ideal in the following cases:

- When you need to collect data about all companies on the platform
- When there are no specific company selection criteria
- If you have sufficient time and computational resources

However, there are significant limitations to consider:

- Almost no ability to filter data during collection
- Requires constant monitoring of Cloudflare blocks
- Scaling the solution requires proxy servers, which increases project costs

## Using search for scraping Crunchbase

The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality than to its public pages.

The key difference lies in how the Cloudflare protection works. While we receive the data before the `challenges.cloudflare` check when accessing a company page, the search API requires valid `cookies` that have passed this check.

Let's verify this in practice. Open the following link in Incognito mode:

```plaintext
https://www.crunchbase.com/discover/organization.companies
```

When analyzing the traffic, we'll see the following pattern:

![Search Protect](./img/search_protect.webp)

The sequence of events here is:

1. First, the request is blocked with a `403` status code
2. Then, the `challenges.cloudflare` check is performed
3. Only after successfully passing the check do we receive data with a `200` status code

Automating this process would require a `headless` browser capable of bypassing [`Cloudflare Turnstile`](https://www.cloudflare.com/application-services/products/turnstile/). The current version of `Crawlee for Python` (v0.5.0) doesn't provide this functionality, although it's planned for future development.

You can extend the capabilities of Crawlee for Python by integrating [`Camoufox`](https://camoufox.com/), following this [example](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox).

## Working with the official Crunchbase API

Crunchbase provides a [free API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-using-api) with basic functionality. Paid subscription users get expanded data access. Complete documentation for the available endpoints can be found in the [official API specification](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api).

### 1. Setting up API access

To start working with the API, follow these steps:

1. [Create a Crunchbase account](https://www.crunchbase.com/register)
2. Go to the Integrations section
3. Create a Crunchbase Basic API key

Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation.

### 2. Configuring the crawler for API work

An important API feature is the rate limit: no more than 200 requests per minute, and in the free version, this number is significantly lower. Taking this into account, let's configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). Since we're working with the official API, we don't need to mask our HTTP client. We'll use the standard [`HttpxHttpClient`](https://www.crawlee.dev/python/api/class/HttpxHttpClient) with preset headers.

First, let's save the API key in an environment variable:

```bash
export CRUNCHBASE_TOKEN={YOUR KEY}
```

Here's how the crawler configuration for working with the API looks:

```python
# main.py

import os

from crawlee import ConcurrencySettings, HttpHeaders
from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import HttpxHttpClient

from .routes import router

CRUNCHBASE_TOKEN = os.getenv('CRUNCHBASE_TOKEN', '')


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_tasks_per_minute=60)

    http_client = HttpxHttpClient(
        headers=HttpHeaders({'accept-encoding': 'gzip, deflate, br, zstd', 'X-cb-user-key': CRUNCHBASE_TOKEN})
    )
    crawler = HttpCrawler(
        request_handler=router,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(
        ['https://api.crunchbase.com/api/v4/autocompletes?query=apify&collection_ids=organizations&limit=25']
    )

    await crawler.export_data_json('crunchbase_data.json')
```
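Before wiring everything together, it can be worth sanity-checking the key with a single direct request to the autocomplete endpoint. Here's a throwaway snippet (not part of the project) using `httpx`, which is already installed as a Crawlee dependency:

```python
# check_key.py - one-off sanity check for the API key

import os

import httpx

response = httpx.get(
    'https://api.crunchbase.com/api/v4/autocompletes',
    params={'query': 'apify', 'collection_ids': 'organizations', 'limit': 1},
    headers={'X-cb-user-key': os.getenv('CRUNCHBASE_TOKEN', '')},
)

# 200 means the key is active; 401 usually means it's invalid or not yet activated
print(response.status_code)
print(response.json() if response.status_code == 200 else response.text)
```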
### 3. Processing search results

For working with the API, we'll need two main endpoints:

1. [get_autocompletes](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Autocomplete/get_autocompletes) - for searching
2. [get_entities_organizations__entity_id](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Entity/get_entities_organizations__entity_id_) - for getting data

First, let's implement the search results processing:

```python
# routes.py

import json

from crawlee import Request
from crawlee.crawlers import HttpCrawlingContext
from crawlee.router import Router

router = Router[HttpCrawlingContext]()


@router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    requests = []

    for entity in data['entities']:
        permalink = entity['identifier']['permalink']
        requests.append(
            Request.from_url(
                url=f'https://api.crunchbase.com/api/v4/entities/organizations/{permalink}?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url',
                label='company',
            )
        )

    await context.add_requests(requests)
```
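The `field_ids` query string above is hand-encoded (`%2C` is a comma). If you'd rather not encode it manually, `urllib.parse` from the standard library can build the same URL; here's a small illustrative helper:

```python
# routes.py - optional helper for building entity URLs (illustrative)

from urllib.parse import urlencode


def entity_url(permalink: str) -> str:
    """Build an organization entity URL with the fields we need."""
    # urlencode escapes the commas, producing the same %2C-encoded query string
    params = urlencode({'field_ids': 'short_description,location_identifiers,website_url'})
    return f'https://api.crunchbase.com/api/v4/entities/organizations/{permalink}?{params}'
```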
### 4. Extracting company data

After getting the list of companies, we extract detailed information about each one:

```python
@router.handler('company')
async def company_handler(context: HttpCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    await context.push_data(
        {
            'Company Name': data['properties']['identifier']['value'],
            'Short Description': data['properties']['short_description'],
            'Website': data['properties'].get('website_url'),
            'Location': '; '.join([item['value'] for item in data['properties'].get('location_identifiers', [])]),
        }
    )
```

### 5. Advanced location-based search

If you need more flexible search capabilities, the API provides a special [`search`](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Search/post_searches_organizations) endpoint. Here's an example of searching for all companies in Prague:

```python
# main.py

payload = {
    'field_ids': ['identifier', 'location_identifiers', 'short_description', 'website_url'],
    'limit': 200,
    'order': [{'field_id': 'rank_org', 'sort': 'asc'}],
    'query': [
        {
            'field_id': 'location_identifiers',
            'operator_id': 'includes',
            'type': 'predicate',
            # UUID of the Prague location identifier
            'values': ['e0b951dc-f710-8754-ddde-5ef04dddd9f8'],
        },
        {'field_id': 'facet_ids', 'operator_id': 'includes', 'type': 'predicate', 'values': ['company']},
    ],
}

serialized_payload = json.dumps(payload)
await crawler.run(
    [
        Request.from_url(
            url='https://api.crunchbase.com/api/v4/searches/organizations',
            method='POST',
            payload=serialized_payload,
            use_extended_unique_key=True,
            headers=HttpHeaders({'Content-Type': 'application/json'}),
            label='search',
        )
    ]
)
```

For processing the search results and pagination, we use the following handler:

```python
@router.handler('search')
async def search_handler(context: HttpCrawlingContext) -> None:
    """Search results handler with pagination support."""
    context.log.info(f'search_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    last_entity = None
    results = []

    for entity in data['entities']:
        last_entity = entity['uuid']
        results.append(
            {
                'Company Name': entity['properties']['identifier']['value'],
                'Short Description': entity['properties']['short_description'],
                'Website': entity['properties'].get('website_url'),
                'Location': '; '.join(
                    [item['value'] for item in entity['properties'].get('location_identifiers', [])]
                ),
            }
        )

    if results:
        await context.push_data(results)

    # Use the last seen UUID as a cursor to request the next page of results
    if last_entity:
        payload = json.loads(context.request.payload)
        payload['after_id'] = last_entity

        await context.add_requests(
            [
                Request.from_url(
                    url='https://api.crunchbase.com/api/v4/searches/organizations',
                    method='POST',
                    payload=json.dumps(payload),
                    use_extended_unique_key=True,
                    headers=HttpHeaders({'Content-Type': 'application/json'}),
                    label='search',
                )
            ]
        )
```

### 6. Free API limitations

The free version of the API has significant limitations:

- Limited set of available endpoints
- The autocompletes function only works for company searches
- Not all data fields are accessible
- Limited search filtering capabilities

Consider a paid subscription for production-level work. Even with its rate constraints, the API provides the most reliable way to access Crunchbase data.

## What’s your best path forward?

We've explored three different approaches to obtaining data from Crunchbase:

1. **Sitemap** - for large-scale data collection
2. **Search** - difficult to automate due to Cloudflare protection
3. **Official API** - the most reliable solution for commercial projects

Each method has its advantages, but for most projects, I recommend using the official API despite its free-version limitations.

The complete source code is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee). Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of developers is there to help.