Database Backups to Avro

The goal of this project is to load database backups (e.g. Oracle Data Pump exports, MS SQL Server .bak files) into a temporary Docker container running the appropriate vendor database, and then export the tables within that restored database as .avro files for loading into other big-data systems.

As of August 2022 this goal remains aspirational: the current system requires a pre-existing database, whose connection string and credentials must be passed on the command line.

Current Functionality

This jar connects to an existing database and emits .avro files, one per table, either for every table in the database or only for the tables specified on the command line.

MS SQL Server backups and Oracle Data Pump exports are currently supported. Oracle exports support row-level multi-threading for high performance.

Architecture

On a host running Docker, the application will (one day) instantiate a database (as a container), load the backup files into the container, and then dump the database as Avro files. If a database is already running, the restore step can be skipped by providing a JDBC connection string.

The application is asynchronous and multi-threaded using RxJava2.
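The following is a minimal sketch of that pattern (not the project's actual classes; the table names and the export step are placeholders), showing how RxJava2 can fan table exports out across I/O threads:

import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;
import java.util.Arrays;
import java.util.List;

public class ParallelExportSketch {
  public static void main(String[] args) {
    // Hypothetical table list; in the real application this comes from schema introspection
    List<String> tables = Arrays.asList("PERSON", "ADDRESS", "SALES_ORDER");

    Flowable.fromIterable(tables)
        .parallel()                        // split the stream across parallel rails
        .runOn(Schedulers.io())            // run each rail on an I/O thread
        .map(table -> table + " exported") // placeholder for "query table, write .avro part files"
        .sequential()
        .blockingSubscribe(System.out::println);
  }
}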

Basic Command Line Usage

MS SQL Server

java -jar db-to-avro.jar \
   --flavor=sqlserver \
   --backup-dir=backups/ \
   --backup-files=AdventureWorksDW2016_EXT.bak \
   --catalog=AdventureWorks2016 \
   --schemas=dbo \
   --destination=avro/

This command currently produces the error message "You must provide database.url".

Once this project is complete, the above command will restore the Microsoft AdventureWorks database from a backup set located in backups/, using the single backup file AdventureWorksDW2016_EXT.bak. All tables within schema AdventureWorks2016.dbo will be exported as Avro files to the directory avro/.

Oracle

java -jar db-to-avro.jar \
   --flavor=oracle \
   --backup-dir=backups/ \
   --backup-files=HR.par \
   --schemas=HRDEMO \
   --destination=avro/

Once this project is complete, this command will restore the Oracle HR demo database from a backup set located in backups/, using the parameter (.par) file HR.par. All tables within the schema given by --schemas will be exported as Avro files to the directory avro/.

To make either of the above commands work at this time, you must also supply the following connection options on the command line, as documented in the next section (a combined example follows):

--connect=...
--user=...
--password=...
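
For example, a complete SQL Server invocation that combines the backup options with an explicit connection would currently look like the following (the connection string and credentials are placeholders):

java -jar db-to-avro.jar \
   --flavor=sqlserver \
   --backup-dir=backups/ \
   --backup-files=AdventureWorksDW2016_EXT.bak \
   --catalog=AdventureWorks2016 \
   --schemas=dbo \
   --destination=avro/ \
   --connect="jdbc:sqlserver://localhost" \
   --user="SA" \
   --password="..."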

Exporting an Existing Database

MS SQL Server

java -jar db-to-avro.jar \
  --connect="jdbc:sqlserver://localhost" \
  --user="SA" \
  --password="..."  \
  --flavor=sqlserver  \
  --catalog=AdventureWorks2016 \
  --schemas=dbo  \
  --destination=avro/

Oracle

java -jar db-to-avro.jar \
  --connect="jdbc:oracle:thin:@localhost:1521/ORCLPDB1" \
  --user="system" \
  --password="..."  \
  --flavor=oracle  \
  --schemas=HR  \
  --destination=avro/

These commands do the same as the above, but instead of restoring a database backup, they export an existing database reached via the connection string.

Command Line Options

Option (* = required)    Description
---------------------    -----------
--backup-dir             directory containing backup to restore
--backup-files           comma-delimited list of backup files, or a single .par file
--catalog                catalog to export (Oracle N/A)
--connect                jdbc connection string for existing database
--destination            avro destination directory
--exclude                exclusions in form schema(.table)(.column)
* --flavor               database type (sqlserver, oracle)
--password               database password (existing db)
--password-file          read database password from file (existing db)
--post-sql               path of sql file to execute after restore/connect
--pre-sql                path of sql file to execute before restore/connect
--schemas                only export this comma-delimited list of schemas
--tables                 only export this comma-delimited list of tables
--user                   database user (existing db)
--split-table-strategy   flexible split-table strategies (SQL Server only)
-h, --help               show help
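
As an illustration of the filtering options, a hypothetical invocation that exports only two tables and excludes a single column might look like the following (the table and column names are made up):

java -jar db-to-avro.jar \
  --connect="jdbc:sqlserver://localhost" \
  --user="SA" \
  --password="..." \
  --flavor=sqlserver \
  --catalog=AdventureWorks2016 \
  --schemas=dbo \
  --tables=Person,Address \
  --exclude=dbo.Person.SSN \
  --destination=avro/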

Example Configuration Properties

# Application defaults, can be overridden by a job
#
# Note ${UUID} is substituted everywhere with the same random UUID at startup time

# Docker host for managing containers
docker.host=unix:///var/run/docker.sock

# Make sure this is at least the number of export threads or you will get Hikari connection timeouts
database.pool.size=64

# MS SQL Server defaults
sqlserver.database.url=jdbc:sqlserver://localhost;user=SA;password=${UUID};database=master;autoCommit=false
sqlserver.database.user=SA
sqlserver.database.password=${UUID}
sqlserver.image=mcr.microsoft.com/mssql/server:2017-latest
sqlserver.env=ACCEPT_EULA=Y,SA_PASSWORD=${UUID}
# Attempt to dump a table in parallel using partitions based on PKs (beta)
sqlserver.optimized.enable=false

# Oracle defaults
oracle.database.url=jdbc:oracle:thin:@10.10.10.100:1521/ORCLPDB1
oracle.database.user=system
oracle.database.password=${UUID}
oracle.ports=1521:1521
oracle.image=gcr.io/som-rit-phi-starr-dev/oracle-database:12.2.0.1-ee
oracle.mounts=/mnt/database:/opt/oracle/oradata,/mnt/backups:/backup
oracle.env=ORACLE_CHARACTERSET=WE8ISO8859P1,ORACLE_PWD=${UUID}
oracle.impdp.core.multiplier=1
oracle.optimized.enable=true

avro.filename=%{SCHEMA}.%{TABLE}-%{PART}.avro
avro.logfile=job.json
avro.codec=snappy
# Target size for generated Avro files, based on *uncompressed* source table bytes.
# Set to zero for unlimited file size.
avro.targetsize=1000000000
avro.fetchsize=5000
# Normalize table names (columns always normalized)
avro.tidy=true

# Core-count multiplier determines number of avro threads
avro.core.multiplier=0.75
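
As a worked example of how avro.targetsize and avro.filename interact: with the defaults above, a table whose uncompressed size is roughly 3.5 GB would be split into four part files, each covering about 1 GB of source data. The resulting file names would look like the following (the table name is hypothetical, and avro.tidy=true may further normalize it):

dbo.SalesOrderDetail-0.avro
dbo.SalesOrderDetail-1.avro
dbo.SalesOrderDetail-2.avro
dbo.SalesOrderDetail-3.avro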

Future Features

Currently the application consists of a single module, "db-to-avro-runner". More modules can be added in the future to invoke this runner from a REST API, a Pub/Sub queue, etc.

TODO, in order of priority:

  • Pass Docker database mounts as command line options in addition to the properties file

  • Resume features:

    • Use temp files when writing to disk, rename to final name when complete.

    • Cancelled/crashed jobs should resume at last table exported

    • Option: if destination file exists, don’t export that table

  • Option to save directly to a GCS bucket

  • Support for regex in schema/table/column exclusion filters

  • Get row counts from Avro generation, instead of DB introspection (which is too slow for billions of rows)

  • Ability to restore a backup into a pre-existing database (e.g. from CoS boot creation)

  • Schema introspection should be able to exclude tables that will cause problems, such as a '?' in a column name

  • Validation: Number of rows in database should match number of rows written by db-goodies ETL

  • Automation

    • Ability to self-bootstrap into a new VM created in GCP and monitor output (?)

    • Job runner that reads VM metadata for job input (?)

    • Pub/sub job runner (?)

  • Formalized progress reporting (not logging)

  • Switches for:

    • Deleting docker container after successful export

    • Listing catalogs, schemas, and tables

    • Testing connection to db

  • Option for saving output as .avro instead of .json, so the manifest is included with the dataset itself

  • Unit tests(!)

Known Issues

  • Table and column names are normalized by the db-goodies ETL, which is not reflected in the job output log.

  • Excluded tables are not explicitly noted in job output (they just aren’t listed)
