CLI tool for easy execution of Singer Taps, managing module dependencies and persistence of state files and config files in AWS.
- package the whole project as Docker image
- create infrastructure: S3 buckets, SSM parameter store items (manually),
SingerTap
IAM role tied to S3 bucket and allowed to access Redshift - store logs in logs/ folder
Currently this module can be installed only by running pip install .
at the root of the project, locally.
- Add item to the list in
singer_project_config.yml
. You may give that item a custom name (different from tap module name), just remember to addmodule: ...
parameter with the valid name of a tap. This can be useful e.g. if you want to maintain two tap integrations based on the same tap library (e.g. one integration tied to release0.9.1
and the other tied to1.0.0
), you would assign two different aliases. - add folder in
./taps/...
with name corresponding to tap alias - add a file with all modules needed to tap installation, by creating
./taps/tap-alias/requirements.txt
file - create virtual environment for the tap by running:
singer-aws-install
- if you want to store tap config in the project folder, add config.json to
.taps/tap-alias/config.json
. Otherwise make sure that tap configuration is available as environment variable or AWS Parameter Store. Check Tap and target config files section.
Same steps can be performed for adding target integration, just replace tap
with target
when following the instructions.
The order of proceeding when preparing singer config file (applies to both tap and target):
- Use
config.json
file found intaps/tap-alias/
(targets/target-name/
) directory - Create
config.json
file based onTAP_TAPNAME_CONFIG
orTARGET_TARGETNAME_CONFIG
env variables - Create
config.json
file based on SSM parameter with a path like/ssm_prefix/TAP_TAPNAME_CONFIG
or/ssm_prefix/TARGET_TARGETNAME_CONFIG
. Value of ssm_prefix is configurable from singer_project_config.yml file, e.g./acme_singer_project/
- in the command line, export AWS_PROFILE that has permissions to read from SSM and read/write to S3:
export AWS_PROFILE=your_profile
- run
singer-aws-discover --tap adwords
Subsequent schema discoveries, e.g. updates or selecting new fields for replication may be annoying because the schema-discovery
utility (see credits below) only shows what's available, without showing what's already selected (i.e. you kind of start catalog selection form the scratch with each discovery). To allow for more seamless workflow, there's a singer-aws-inspect
command that can be used to inspect current state of catalog of a given tap. For instance:
singer-aws-inspect --tap adwords
would print:
...
inspecting stream: ads...
❌ stream is NOT SELECTED
• found 6 available properties
• found 0 selected properties
• selected properties: []
inspecting stream: accounts...
❌ stream is NOT SELECTED
• found 6 available properties
• found 0 selected properties
• selected properties: []
inspecting stream: KEYWORDS_PERFORMANCE_REPORT...
✅ stream is SELECTED
• found 139 available properties
• found 12 selected properties
• selected properties: ['account', 'campaignID', 'campaign', 'clicks', 'conversions', 'cost', 'keyword', 'day', 'device', 'customerID', 'keywordID', 'impressions']
...
singer-aws-discover
command leverages an amazing utility: singer-discover. Thank you @chrisgoddard!
- in the command line, export AWS_PROFILE that has permissions to read from SSM and read/write to S3:
export AWS_PROFILE=your_profile
- run
export AWS_PROFILE=your_profile_name
singer-aws-sync --tap adwords --target redshift
which will run a command identical to:
venv/tap-adwords/bin/tap-adwords --config taps/tap-adwords/config.json | venv/target-redshift/bin/target-redshift --config targets/target-redshift/config.json > state.json
or, if there is any state file existing for a given tap (state file must exist in a remote S3 bucket), the following command will be executed:
venv/tap-adwords/bin/tap-adwords --config taps/tap-adwords/config.json --properties taps/tap-adwords/catalog.json --state states_in/tap-adwords-state.json | venv/target-redshift/bin/target-redshift --config targets/target-redshift/config.json > /Users/tomaszzbrozek/singer-aws/states_out/tap-adwords-state.json
☝️ state file (in states_in/
directory) is fetched on the fly before executing the tap | target
. Specify S3 bucket name and prefix in the states_bucket
parameter inside singer_project_config.yml
file.
You can pass --ignore-state
to ignore previous state files when executing the tap and use start_date
as specified in tap's config to replicate all streams:
singer-aws-sync --tap adwords --target redshift --ignore-state
- this project will not create schemas in your target data warehouse. You still need to do it as you would need when working with singer tap directly.