GITBOOK-15: Remove preprocess queue
[email protected] authored and gitbook-bot committed Nov 30, 2024
1 parent 34b1db8 commit 093525d
Showing 3 changed files with 15 additions and 45 deletions.
23 changes: 3 additions & 20 deletions docs/advanced/migrations/reacher-configuration-v0.10.md
@@ -106,28 +106,10 @@ enable = true
# Env variable: RCH__WORKER__RABBITMQ__URL
url = "amqp://guest:guest@localhost:5672"

# Queues to consume emails from. By default, the worker consumes from all
# queues.
#
# To consume from only a subset of queues, uncomment the line `queues = "all"`
# and specify the queues you want to consume from.
#
# Below is the exhaustive list of queue names that the worker can consume from:
# - "check.gmail": subscribe exclusively to Gmail emails.
# - "check.hotmailb2b": subscribe exclusively to Hotmail B2B emails.
# - "check.hotmailb2c": subscribe exclusively to Hotmail B2C emails.
# - "check.yahoo": subscribe exclusively to Yahoo emails.
# - "check.everything_else": subscribe to all emails that are not Gmail, Yahoo, or Hotmail.
#
# Env variable: RCH__WORKER__RABBITMQ__QUEUES
#
# queues = ["check.gmail", "check.hotmail.b2b", "check.hotmail.b2c", "check.yahoo", "check.everything_else"]
queues = "all"

# Number of concurrent emails to verify for this worker across all queues.
# Number of concurrent emails to verify for this worker.
#
# Env variable: RCH__WORKER__RABBITMQ__CONCURRENCY
concurrency = 20
concurrency = 5

# Throttle the maximum number of requests per second, per minute, per hour, and
# per day for this worker.
@@ -159,6 +141,7 @@ db_url = "postgresql://localhost/reacherdb"
#
# Env variable: RCH__SENTRY_DSN
# sentry_dsn = "<PASTE_YOUR_DSN_NOW>"

```

## Usage with Docker
37 changes: 12 additions & 25 deletions docs/self-hosting/scaling-for-production.md
@@ -12,49 +12,36 @@ The architecture contains 4 components:

Note that Reacher provides a single Docker image `reacherhq/backend`, which can act as both a **Worker** and an **HTTP server**.

<figure><img src="../.gitbook/assets/Screenshot 2024-11-27 at 14.43.50.png" alt=""><figcaption><p>Reacher architecture for scaling</p></figcaption></figure>
<figure><img src="../.gitbook/assets/Screenshot 2024-11-30 at 15.33.27.png" alt=""><figcaption><p>Reacher queue architecture</p></figcaption></figure>

With this architecture, it's possible to horizontally scale the number of workers, while making sure that the individual IPs don't get blacklisted. To do so, we propose to start with two types of workers.
With this architecture, it's possible to horizontally scale the number of workers. However, to avoid spawning too many workers at once and getting their IPs blacklisted, we need to configure the concurrency and throttling parameters below.

### Shared Configuration between both workers
### Worker Configuration

To enable the above worker architecture, set the following parameters in [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention"):
To enable the above worker architecture without getting blacklisted, we need to set some parameters in [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention"):

* `worker.enable`: true
* `worker.rabbitmq.url`: Points to the URL of the RabbitMQ instance.
* `worker.postgres.db_url`: A Postgres database to store the email verification results.

### 1st worker type: SMTP worker using Proxy
Since spawning workers (generally on cloud providers) doesn't guarantee that each worker is assigned a reputable IP, we propose to configure all workers to use a proxy. Proxies are generally priced per IP per month; we recommend buying one IP for every 10000 email verifications you do per day.

These workers will consume all emails that should be verified through SMTP. Currently, this includes all emails, except Hotmail B2C and Yahoo emails, which are best verified using a headless navigator. Since maintaining IP addresses is hard, we recommend using a proxy, see [proxies.md](proxies.md "mention").
* `worker.proxy.{host,port}`: Set a proxy to route all SMTP requests through. You can optionally pass in `username` and `password` if required.

Assuming your proxy has `N` available IP addresses, we recommend spawning the same number `N` of workers, each with the config below:
We also recommend the following values for the concurrency and throttling parameters. They help keep the proxy's IPs in good standing.

* `worker.rabbitmq.queues`: `["check.gmail","check.hotmailb2b","everything_else"]`. The SMTP workers will listen to these queues.
* `worker.proxy.{host,port}`: Set a proxy to route all SMTP requests through. You can optionally pass in `username` and `password` if required.
* `worker.rabbitmq.concurrency`: 10.
* `worker.throttle.max_requests_per_minute`: 100.
* `worker.throttle.max_requests_per_day`: 10000. This is the recommended number of verifications per IP per day. Assuming there are `N` IP addresses and `N` workers, each worker should perform 10000 verifications per day.
* `worker.rabbitmq.concurrency`: 5. Each worker can process 5 emails at a time.
* `worker.throttle.max_requests_per_minute`: 60. If this value is too high, the recipient SMTP server might see sudden spikes of email verifications, resulting in an IP blacklist.
* `worker.throttle.max_requests_per_day`: 10000. This is the recommended number of verifications per IP per day. Assuming our proxy has `N` IP addresses and we run `N` workers, each worker will perform 10000 verifications per day on average.
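
As a rough illustration, the parameters listed above could be combined into a single config file as sketched below. The table layout is inferred from the dotted parameter names, and the proxy host and port are placeholder values that are not taken from the Reacher docs; check the configuration reference for the exact structure before using it.

```toml
# Minimal sketch of one worker's config, assuming the dotted names above
# map directly onto TOML tables.

[worker]
enable = true

[worker.rabbitmq]
# URL of the RabbitMQ instance.
url = "amqp://guest:guest@localhost:5672"
# Each worker processes 5 emails at a time.
concurrency = 5

[worker.throttle]
# Keep per-minute traffic low to avoid sudden spikes at recipient SMTP servers.
max_requests_per_minute = 60
# Rule of thumb: 10000 verifications per IP per day.
max_requests_per_day = 10000

[worker.proxy]
# Placeholder values: replace with your proxy provider's host and port.
host = "proxy.example.com"
port = 1080

[worker.postgres]
# Postgres database that stores the email verification results.
db_url = "postgresql://localhost/reacherdb"
```

With `N` proxy IPs, the same file would be reused for each of the `N` workers.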

You can scale up the number `N` as much as you need. Remember, the rule of thumb is 10000 verifications per IP per day. For example, if you're aiming for 10 million verifications per month, we recommend 33 or 34 IPs.
You can scale up the number `N` as much as you need, by buying more IPs and spawning more workers. Remember, the rule of thumb is 10000 verifications per IP per day. For example, if you're aiming for 10 million verifications per month, we recommend buying 33 or 34 IPs:

```
10,000,000 emails per month / 30 ≈ 333,333 emails per day / 10000 ≈ 33 IPs
```

Refer to [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention") to see how to set these settings.

### 2nd worker type: Headless worker

These workers will consume all emails that are best verified using a headless browser. The idea behind this verification method is to spawn a headless browser that will navigate to the email provider's password recovery page, and parse the website's response to inputting emails. This method currently works well for Hotmail and Yahoo emails.

To spawn such a worker, provide the config:

* `worker.rabbitmq.queues`: `["check.hotmailb2c","check.yahoo"]`. These are the emails that are best verified using headless.
* `worker.throttle.max_requests_per_minute`: 100

Refer to [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention") to see how to set these settings.

## Understanding the architecture with Docker Compose

We do not recommend using Docker Compose for a high-volume production setup. However, for understanding the architecture, the different Docker images, as well as how to configure the workers, this [`docker-compose.yaml`](../../docker-compose.yaml) file can be useful.
@@ -64,4 +51,4 @@ We do not recommend using Docker Compose for a high-volume production setup. How
Contact [[email protected]](https://app.gitbook.com/u/F1LnsqPFtfUEGlcILLswbbp5cgk2 "mention") if you have more questions about this architecture, such as:

* deploying on Kubernetes (Ansible playbook, Pulumi)
* more specialized workers (e.g. Gmail and Hotmail B2B workers can be separated)
* more specialized workers (e.g. some workers doing headless verification only, others doing SMTP only)
