This repository has been archived by the owner on Feb 22, 2023. It is now read-only.
Add zero-downtime deployments & data transformations guide (#1082)
* Add incomplete draft of data migration guidelines
* Add more details about data migrations
* Refocus document on zero-downtime deployments generally
* Update api/docs/guides/zero-downtime-database-management.md

  Co-authored-by: Madison Swain-Bowden <[email protected]>
* Remove outstanding comment
* Add additional clarifications from @krysal

---------

Co-authored-by: Madison Swain-Bowden <[email protected]>
1 parent e5b6f0c, commit 2450c38

Showing 1 changed file with 292 additions and 0 deletions.
@@ -0,0 +1,292 @@
# Zero Downtime and Database Management

Openverse practices zero-downtime deployments. This puts a handful of
constraints on our database migration and data management practices. This
document describes how to ensure migrations can be deployed with zero downtime
and how to implement and manage long-running data migrations.

Zero-downtime deployments are important to ensure service reliability. Following
the practices that enable zero-downtime deployments also promotes best practices
like ensuring changes are incremental and more easily reversible.

## External resources

This document assumes a general understanding of relational databases, including
concepts like database tables, columns, constraints, and indexes. If this is not
something you are familiar with,
[the Wikipedia article on relational databases](https://en.wikipedia.org/wiki/Relational_database)
is a good starting point.

Django's
[database migration documentation](https://docs.djangoproject.com/en/4.1/topics/migrations/)
also contains helpful background knowledge, though this document takes a more
general approach than addressing only Django-specific scenarios.

## Terms

- "Zero-downtime deployment": An application deployment that does not result in
  any period of time during which a service is inaccessible. For the purposes of
  Openverse, these require running two versions of the application at once that
  share the same underlying database infrastructure.
- "Schema": The structure of a database. The tables, columns, and their types.
- "Downtime deployment": An application deployment that does result in a period
  of time during which a service is inaccessible. The Openverse project goes to
  great lengths to avoid these. These are often caused when a new version of an
  application is incompatible with the underlying infrastructure of the
  previously deployed version.
- "Database migration": A change to the schema of a database. Common migrations
  include the addition or removal of tables and columns.
- "Data transformation": A change to the data held in a database that does not
  include, but can be related to, a database migration. Common examples include
  backfilling data to remove null values from a column or moving data between
  two related columns.
- "Data migration": A data transformation that is executed as part of a Django
  migration.
- "Long-running data transformation": A data transformation that lasts longer
  than a few seconds. Long-running data transformations are commonly caused by
  the modification of large amounts of data, especially data in indexed columns.

## How zero-downtime deployments work

To understand the motivations of these best practices, it is important to
understand how zero-downtime deployments are implemented. Openverse uses the
[blue-green deployment strategy](https://en.wikipedia.org/wiki/Blue-green_deployment).
The blue-green strategy requires running the new version of the application and
the previous version at the same time for the duration of the deployment.
This allows us to replace the multiple, load-balanced instances of our
application one-by-one. As a result, we are able to verify the health of the
instances running the new version before fully replacing our entire cluster of
application instances with the new version. At all times during a successful
deployment process, both versions of the application must be fully operable,
healthy, and able to handle requests. For the entire deployment, which can last
several minutes, the load balancer sends requests to both the previous and new
versions of the application. This requires both versions of the application to
be strictly compatible with the underlying database schema.

## What causes downtime during a deployment?

The most common cause of downtime during a deployment is a database schema
incompatibility between the previous and new versions of the application. The
classic example of a schema incompatibility involves column name changes.

Imagine there is a column on a table of audio files called "length", but we
want to change the column name to specify the expected units, to make it
clearer for new contributors. If we simply change the name of the column to
"length_ms", then when the new version of the application deploys, it will apply
the migration to change the name. The new version will, of course, work just
fine in this case. However, during deployments, the previous version of the
application will still be running for a period of time. Requests by the previous
version of the application to retrieve the "length" column will fail
catastrophically because the "length" column will no longer exist! It has been
renamed to "length_ms". If we prevented the new version of the application from
applying the migration, the same issue would arise, but for the new version, as
the "length_ms" column would not yet exist.

This, in addition to column data-type changes, is the most common reason why
downtime would be required during a deployment process that is otherwise capable
of deploying without downtime. When schema incompatibilities arise between the
new and previous versions of an application, it is impossible to safely serve
requests from both using the same underlying database.

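The failure described above can be reproduced in miniature. The following is a minimal sketch, not Openverse code: it uses Python's built-in `sqlite3` module as a stand-in for the production database, and the `audio` table and the `previous_version_query` helper are hypothetical names chosen for illustration. It assumes the bundled SQLite supports `RENAME COLUMN` (SQLite 3.25 or newer).

```python
import sqlite3

# In-memory database standing in for the shared production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audio (id INTEGER PRIMARY KEY, length INTEGER)")
conn.execute("INSERT INTO audio (length) VALUES (183000)")


def previous_version_query(connection):
    """Query issued by the previously deployed version of the application."""
    return connection.execute("SELECT length FROM audio").fetchall()


# Before the migration, the previous version works as expected.
rows_before = previous_version_query(conn)

# The new version applies its migration while the previous version is
# still serving requests.
conn.execute("ALTER TABLE audio RENAME COLUMN length TO length_ms")

# The previous version now fails because "length" no longer exists.
try:
    previous_version_query(conn)
    old_version_error = None
except sqlite3.OperationalError as exc:
    old_version_error = str(exc)

print(old_version_error)
```

Any request handled by the previous version between the rename and the end of the deployment would hit this error, which is exactly the window a blue-green deployment keeps open.
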
Other causes are variations on this same pattern: a shared dependency is neither
forward- nor backward-compatible between two consecutive versions of the
application.

> **Note**: This issue of incompatibility only applies to _consecutive_ versions
> of an application because only consecutive versions are ever deployed
> simultaneously with the same underlying support infrastructure. So long as
> there is at least one version between them, application versions may, and
> indeed sometimes do, have fundamental incompatibilities with each other and
> could not be simultaneously deployed.

## How to achieve zero-downtime deployments

Sometimes you need to change the name of a column or introduce some other,
non-backwards compatible change to the database schema. Luckily, this is still
possible, even with zero-downtime deployments, though admittedly the process is
more tedious.

Continuing with the column name change case-study, the following approach must
be followed.

1. Create a new column with the desired name and data type. The new column must
   be nullable and should default to null. This step should happen with a new
   version of the application that continues to use the existing column.
1. If the column is written to by the application, deploy a new version that
   starts writing new or updated data to both columns. It should read the data
   from the new column and only fall back to the old column if the new column is
   not yet populated.
1. Use a data transformation management command to move data from the previous
   column to the new column. To find the rows that need updating, iterate
   through the table by querying for rows that do not have a value in the new
   column yet. Because the version of the application running at this point is
   writing to and reading from the new column (falling back to the old for reads
   when necessary), the query will eventually return zero rows.
1. Once the data transformation is complete, deploy a new version of the
   application that removes the old column and the fallback reads to it and only
   uses the new column. Also, add the corresponding constraints for the new
   column if required, e.g. non-nullable, default value, etc.

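The first two steps, plus the query that drives the third, can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: it uses Python's built-in `sqlite3` module instead of Django, and the `audio` table and the helper names `save_length`, `read_length`, and `rows_left_to_backfill` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audio (id INTEGER PRIMARY KEY, length INTEGER)")
conn.executemany(
    "INSERT INTO audio (id, length) VALUES (?, ?)",
    [(1, 183000), (2, 95000)],
)

# Step 1: add the new column as nullable, defaulting to NULL. The previous
# version of the application keeps working because "length" is untouched.
conn.execute("ALTER TABLE audio ADD COLUMN length_ms INTEGER DEFAULT NULL")


def save_length(connection, audio_id, value):
    """Step 2 (writes): keep both columns in sync on every write."""
    connection.execute(
        "UPDATE audio SET length = ?, length_ms = ? WHERE id = ?",
        (value, value, audio_id),
    )


def read_length(connection, audio_id):
    """Step 2 (reads): prefer the new column, fall back to the old one."""
    row = connection.execute(
        "SELECT COALESCE(length_ms, length) FROM audio WHERE id = ?",
        (audio_id,),
    ).fetchone()
    return row[0] if row else None


def rows_left_to_backfill(connection):
    """Step 3 is finished once this count reaches zero."""
    return connection.execute(
        "SELECT COUNT(*) FROM audio "
        "WHERE length_ms IS NULL AND length IS NOT NULL"
    ).fetchone()[0]


# Row 1 has not been backfilled yet, so the read falls back to "length".
legacy_read = read_length(conn, 1)

# Row 2 is updated by the application, which populates both columns.
save_length(conn, 2, 96000)
updated_read = read_length(conn, 2)
remaining = rows_left_to_backfill(conn)
```

Because writes populate both columns, rows touched by normal traffic drop out of the backfill query on their own, and the old column stays current in case a revert is needed.
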
To reiterate, yes, this is a much more tedious process. However, the benefits to
this approach are listed below.

Relatively similar processes and patterns can be applied to other
"downtime-causing" database changes. These are covered in
[this GitHub gist](https://gist.github.com/majackson/493c3d6d4476914ca9da63f84247407b)
with specific instructions for handling them in a Django context.

### Benefits of this approach

#### Zero-downtime

The entire point, of course. This benefits everyone who depends on the
application's uptime and reliability.

#### Reversibility

If the new version of the application has a critical bug, whether related to the
data changes or not, we can revert each step to the previous version without
issue or data loss. Even during the data transformation process, because the
running version of the application updates both columns, if you have to
revert to the first version (or even earlier) that doesn't use the new column,
the old column will still have up-to-date data and no user data will be lost.
This would complicate the data transformation process, however: previous
versions of the application will not update the new column, so the new column
would likely need to be cleared and the transformation started over. This can be
very time-consuming but is overall less of a headache than data loss or fully
broken deployments.

#### Intentionality and expediency

Due to the great lengths it takes to change a column name, the process will
inevitably cause contributors to ask themselves: is this worth it? While
changing the name of a column can be helpful to disambiguate the data in the
column, using a model attribute alias can be just as helpful without any of the
disruption or time of a data transformation. These kinds of questions prompt us
to make expedient choices that deliver features, bug fixes, and developer
experience improvements faster.

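An attribute alias of the kind mentioned above might look like the following. This is a minimal sketch using a plain dataclass as a stand-in for a model; the `Audio` class and the `length_ms` alias are hypothetical, and in a Django project the same property would presumably be defined on the model class itself.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    # Mirrors the existing "length" column, which holds milliseconds.
    length: int

    @property
    def length_ms(self) -> int:
        """Unit-explicit alias for ``length``; no schema change required."""
        return self.length

    @length_ms.setter
    def length_ms(self, value: int) -> None:
        self.length = value


track = Audio(length=183000)
alias_read = track.length_ms  # reads through to "length"
track.length_ms = 200000  # writes through to "length"
```

New code can use the clearer name right away while the database column, and every previously deployed version that depends on it, stays unchanged.
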
#### Shorter deployment times

Ideally, maintainers orchestrating a production deployment of the service are
keenly aware of the progress of the deployment. This is only a realistic and
sustainable expectation, however, if deployments take a "short" amount of time.
What "short" means is up for debate, but an initial benchmark can be the
Openverse production frontend deployments, which currently take about 10
minutes. It seems unreasonable to expect someone to keep a very close eye on a
process that takes longer than this. Sticking to zero-downtime deployments helps
keep short deployments the norm. Even though it sometimes asks us to deploy more
_often_, those deployments can (and in all likelihood should) be spread over
multiple days. This makes the expectation of keeping a close watch on the
deployment more sustainable long-term and helps encourage us to deploy more
often. In turn, this means new features and bug fixes get to production sooner.

#### Possibility to throttle

Management commands that iterate over data progressively can be throttled to
prevent excessive load on the database or other related services that need to be
accessed.

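One simple way to throttle is to pause between batches. The sketch below is an assumption about how such a loop could be structured, not an existing Openverse utility: `run_throttled`, its parameters, and the in-memory `pending` list are all hypothetical. The `sleep` function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable


def run_throttled(
    process_batch: Callable[[], int],
    pause_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call ``process_batch`` until it reports no work, pausing in between.

    ``process_batch`` must return the number of rows it handled; returning
    zero signals that the transformation is complete.
    """
    total = 0
    while True:
        processed = process_batch()
        if processed == 0:
            return total
        total += processed
        # Give the database (and any other service involved) room to breathe.
        sleep(pause_seconds)


# Usage with an in-memory stand-in for rows that still need processing.
pending = list(range(25))


def process_up_to_ten() -> int:
    batch = pending[:10]
    del pending[:10]
    return len(batch)


pauses = []  # records requested pauses instead of really sleeping
total_processed = run_throttled(
    process_up_to_ten, pause_seconds=0.1, sleep=pauses.append
)
```

The batch size and pause length are the two knobs to tune; both would normally be exposed as command-line options of the management command.
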
#### Unit testing

Management command data transformations can be far more easily unit tested using
our existing tools and fixture utilities.

### Long-running migrations

Sometimes long-running schema changes are unavoidable. In these cases, provided
that the instructions above are followed to prevent the need for downtime, it is
reasonable to take alternative approaches to deploying the migration.

At the moment we do not have specific recommendations or policies regarding
these hopefully rare instances. If you come across the need for this, please
carefully consider the reasons why it is necessary in the particular case and
document the steps taken to prepare and deploy the migration. Please update this
document with any general findings or advice, as applicable.

## Django management command based data transformations

### Why use management commands for data transformations instead of Django migrations?

Django comes with a data transformation feature built in that allows executing
data transformations during the migration process. Transformations are described
in Django's ORM and executed in a single pass at migration time. If you want to
move data between two columns, it is trivial to do so with these "data
migrations".
[Documentation for this Django feature is available here](https://docs.djangoproject.com/en/4.1/topics/migrations/#data-migrations).

When considering the potential issues with using Django migrations for data
transformations with our current deployment strategy, keep in mind the following
details:

- Migrations are run _at the time of deployment_ by the first instance of the
  new version of the application that runs in the pool.
  - **Note**: This specific detail will only be the case once we've fully
    migrated to ECS-based deployments. For now one of the people deploying the
    application manually runs the migrations before deploying. The effect is the
    same though: we end up with a version of the application running against a
    database schema that it's not entirely configured to work with. Whether that
    is an issue depends solely on whether the practices described in this
    document regarding migrations have been followed.
- Deployments should be timely so that developers are able to reasonably monitor
  their progress and have clear expectations for how long a deployment should
  take. Ideally a full production deployment should not take much longer than 10
  minutes once the Docker images are built. Those minutes are already spent by
  the process ECS undergoes to deploy a new version of the application.

With those two key details in mind, the main deficiency of using migrations for
data transformations may already be evident: time. Django migration-based data
transformations dealing with certain smaller tables may not take very long and
this issue, in some cases, might not be applicable. However, because it is
extremely difficult to predetermine the amount of time a migration will take,
even data transformations for small datasets should still heed the
recommendation to use management commands. In particular, it can be difficult to
predict how tables with indexes (especially unique constraints) will perform
during a SQL data migration.

Realistically (and provided it is avoidable), any Django migration that takes
longer than 30 or so seconds is not acceptable for our current deployment
strategy. Because the vast majority of data migrations will take longer than a
few seconds, there is a strong, blanket recommendation against using them.
Exceptions may exist for this recommendation, however. If you're working on an
issue that involves a data transformation, and you think a migration is truly
the best tool for the job and can demonstrate that it will not take longer than
30 seconds in production, then please include these details in the PR.

### General rules for data transformations

These rules apply for data transformations executed as management commands or
otherwise.

#### Data transformations must be [idempotent](https://en.wikipedia.org/wiki/Idempotence)

This rule particularly applies to management commands because they can
theoretically be run multiple times, either by accident or as an attempt to
recover from or continue after a failure.

Idempotency is important for data transformations because it prevents
unnecessary duplicate processing of data. Idempotency can be achieved in three
ways:

1. By checking the state of the data and only applying the transformation to
   rows for which the transformation has not yet been applied. For example, if
   moving data between two columns, only process rows for which the new column
   is null. Once data has been moved for a row, it will no longer be null and
   will be excluded from the query.
1. By checking a timestamp available for each row before which it is known that
   data transformations have already been applied.
1. By caching a list of identifiers for already processed rows in Redis.

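The first approach in the list above (checking the state of the data) can be sketched as follows. This is an illustrative example using Python's built-in `sqlite3` module rather than a Django management command; the `audio` table and the `backfill_length_ms` function, including its `max_batches` parameter used here to simulate an interrupted run, are hypothetical.

```python
import sqlite3


def backfill_length_ms(connection, batch_size=2, max_batches=None):
    """Copy "length" into "length_ms" for rows not yet transformed.

    Only rows whose new column is still NULL are selected, so re-running
    the function after a failure (or by accident) never reprocesses a row.
    Returns the number of rows updated by this run.
    """
    updated = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        ids = [
            row[0]
            for row in connection.execute(
                "SELECT id FROM audio "
                "WHERE length_ms IS NULL AND length IS NOT NULL "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        connection.execute(
            f"UPDATE audio SET length_ms = length WHERE id IN ({placeholders})",
            ids,
        )
        connection.commit()  # each batch is durable on its own
        updated += len(ids)
        batches += 1
    return updated


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audio (id INTEGER PRIMARY KEY, length INTEGER, length_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO audio (length) VALUES (?)",
    [(1000,), (2000,), (3000,), (4000,), (5000,)],
)

first_run = backfill_length_ms(conn, max_batches=1)  # stops early, as if interrupted
second_run = backfill_length_ms(conn)  # picks up where the first run stopped
third_run = backfill_length_ms(conn)  # nothing left to do
```

Committing per batch means an interrupted run loses at most one batch of work, and the selection query makes the next run resume automatically.
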
#### Data transformations should not be destructive

Data transformations should avoid being destructive, if possible. Sometimes it
is unavoidable because data needs to be updated "in place". In these cases, it
is imperative to save a list of modified rows (for example, in a Redis set) so
that the transformation can be reversed if necessary.

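A minimal sketch of recording modified rows before an in-place update is shown below. Several things here are assumptions made for illustration: the transformation (trimming whitespace from a hypothetical `title` column), the `strip_titles` and `revert_titles` helpers, and the use of a plain dictionary as a stand-in for an external store such as Redis. The sketch also stores each row's original value alongside its identifier, which goes slightly beyond a bare list of modified rows but is what makes a lossy change reversible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audio (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO audio (title) VALUES (?)",
    [("  Intro ",), ("Outro",)],
)

# Stand-in for an external store such as Redis: row id -> original value.
undo_log = {}


def strip_titles(connection, log):
    """Trim titles in place, recording each original value first."""
    rows = connection.execute(
        "SELECT id, title FROM audio WHERE title != TRIM(title)"
    ).fetchall()
    for audio_id, title in rows:
        # setdefault keeps the earliest recorded original if the command
        # is re-run after a partial failure.
        log.setdefault(audio_id, title)
        connection.execute(
            "UPDATE audio SET title = TRIM(title) WHERE id = ?", (audio_id,)
        )
    return len(rows)


def revert_titles(connection, log):
    """Restore every recorded original value, then empty the log."""
    for audio_id, original in log.items():
        connection.execute(
            "UPDATE audio SET title = ? WHERE id = ?", (original, audio_id)
        )
    log.clear()


changed = strip_titles(conn, undo_log)
after_transform = [r[0] for r in conn.execute("SELECT title FROM audio ORDER BY id")]

revert_titles(conn, undo_log)
after_revert = [r[0] for r in conn.execute("SELECT title FROM audio ORDER BY id")]
```

The record is written before the row is modified, so a crash between the two steps leaves at worst an extra log entry rather than an unrecoverable change.
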
### If a Django migration _must_ be used

In the rare case where a Django migration must be used, keep in mind that using
a
[non-atomic migration](https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations)
can help make it easier to recover from unexpected errors without causing the
entire transformation process to be reversed.