EUBFR makes heavy use of AWS cloud services and technologies. This is great for reducing costs and utilizing modern tools and approaches for solving problems.
However, developing with these services locally is not always easy and straightforward.
In this section of the documentation you will find hints on how to improve your development workflow so that you can work with the codebase locally faster and more easily.
EUBFR is a data lake project. ETLs are the plugins used for transforming producers' data from their original formats and structures into the desired target model of the project.
Normally, when new producers are introduced in the system, they provide raw files which serve as the base for building ETLs. Use these raw files to:
- Create a mapping document in the corresponding ETL's folder.
- Open a pull request to build a preview and reach an agreement about the mapping.
- Use this mapping agreement as a base for the development of the transform function.
Example: imagine you receive a file `foo.csv` from the `BAR` producer. In this case:

- Create the mapping document at `/services/ingestion/etl/bar/csv/README.md`. You can find an example in an existing ETL's README.
- Open a pull request to build a preview and agree on the mapping.
- Use this mapping as described in the next section.
Producers' raw data files vary in format and structure, but all of them should reach a consistent structure and format. This is the "harmonization" stage. Transform functions are functions which are called on each record/row of data, and the output of the process is a newline-delimited JSON file stored in the harmonized storage.
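As a rough illustration of that flow (the record fields and the helper below are hypothetical, each ETL defines its own transform function), here is how per-record results end up as newline-delimited JSON:

```js
// Hypothetical illustration only: field names and the transformRecord helper
// are made up; each real ETL defines its own transform function.
const transformRecord = record => ({
  title: record.Title || '',
  budget: Number(record.Budget) || 0,
});

const rawRecords = [
  { Title: 'Project A', Budget: '1000' },
  { Title: 'Project B', Budget: '2500' },
];

// One JSON document per line: the newline-delimited JSON (NDJSON) format
// stored in the harmonized storage.
const ndjson = rawRecords
  .map(record => JSON.stringify(transformRecord(record)))
  .join('\n');

console.log(ndjson);
```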
This is the conventional structure of an ETL:
```
.
└── csv
├── node_modules
├── package.json
├── README.md
├── serverless.yml
├── src
│ ├── events
│ │ └── onParseCSV.js
│ └── lib
│ ├── sns.js
│ ├── transform.js
│ └── uploadFromStream.js
├── test
│ ├── stubs
│ │ └── record.json
│ └── unit
│ ├── events
│ │ └── onParseCSV.spec.js
│ └── lib
│ ├── __snapshots__
│ │ └── transform.spec.js.snap
│ └── transform.spec.js
    └── webpack.config.js
```
Here's a possible workflow which has proved efficient so far:
- Create the skeleton of the serverless service. It includes:
  - The serverless service manifest file, `serverless.yml`, which defines cloud resources and the glue between the pieces of logic of the service.
  - The AWS Lambda function which gets triggered when a new file for the ETL comes in; in this case it's `./src/events/onParseCSV.js`. Without going into too many details here, you can take an example from another service matching the same file format (extension), CSV for example.
  - Boilerplate assets: the `test` folder, the `webpack.config.js` file, etc. You possibly already have these if you took an example service in the previous step.
- Grab an example record structure:
  - Deploy the service by running `npx serverless deploy` from the root of the service.
  - Trigger the function and `console.log` the incoming `event` parameter in order to generate stubs for the event and the record. These are later useful for local development. You might want to `JSON.stringify` objects for better readability in CloudWatch logs. (A sketch follows below.)
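A sketch of such a logging handler; the exact signature and contents of `onParseCSV.js` differ per service, this only shows the stub-generation idea:

```js
// ./src/events/onParseCSV.js (sketch)
export const handler = async (event, context) => {
  // Copy this output from CloudWatch logs and save it locally,
  // e.g. as eventStub.json, to replay the event with `serverless invoke local`.
  console.log(JSON.stringify(event, null, 2));

  // ... the actual parsing logic follows here.
};
```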
- Write a small transform function:
  - It's usually placed at `src/lib/transform.js`.
  - It should include the flow type for `Project`.
  - Include JSDoc comments on all helper functions in the transform in order to expose the smaller transformation steps in user-friendly API pages. (A sketch follows below.)
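A minimal sketch of such a transform; the `Project` type is defined inline here for illustration only, whereas the real flow type and field mapping come from the project's harmonized model:

```js
// @flow
// src/lib/transform.js (sketch)

// Illustrative stand-in for the real Project flow type.
type Project = {
  project_id: string,
  title: string,
  budget: number,
};

/**
 * Converts a raw budget value to a number.
 *
 * @param {string} budget The raw budget field from the producer's file.
 * @returns {number} The parsed budget, or 0 when missing.
 */
const formatBudget = budget => Number(budget) || 0;

/**
 * Maps a single record from the producer's file to the harmonized model.
 *
 * @param {Object} record A single row of the source file.
 * @returns {Project} The harmonized record.
 */
export default (record: Object): Project => ({
  project_id: record.Id || '',
  title: record.Title || '',
  budget: formatBudget(record.Budget),
});
```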
- Write a test for the transform function:
  - It's usually placed at `test/unit/lib/transform.spec.js`.
  - Include assertions for matching the snapshot:

    ```js
    test('Produces correct JSON output structure', () => {
      expect(result).toMatchSnapshot();
    });
    ```

  - Run jest in watch mode to iterate faster. Each time you edit the file with the transform function, the test runner will re-run the test. You might want to use `console.log` to see the results of your transform function while you develop it. (A fuller spec sketch follows after these steps.)
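For reference, a fuller sketch of such a spec, assuming the conventional paths shown in the structure above:

```js
// test/unit/lib/transform.spec.js (sketch)
import transform from '../../../src/lib/transform';
import record from '../../stubs/record.json';

describe('Transform function', () => {
  let result = {};

  beforeAll(() => {
    result = transform(record);
  });

  test('Produces correct JSON output structure', () => {
    expect(result).toMatchSnapshot();
  });
});
```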
AWS Lambda functions, being cloud functions, are not always easy to debug locally. However, with the help of the Serverless Framework, there are a few ways to debug them on your machine.
- In `webpack.config.js`, set a new property, if not already set: `devtool: 'source-map'`. This way, Webpack will generate human-readable code.
- Include a `debugger;` statement on the line of your code where you want to set a breakpoint (see the sketch after these steps).
- Run the following in the CLI:

  ```
  node --inspect-brk ./node_modules/.bin/serverless invoke local -f fooFunction --path eventStub.json
  ```

  Where:

  - `node --inspect-brk` starts the Node.js core inspector, which lets you open a debugging session from the Chrome browser
  - `serverless invoke local` is a specific command of the Serverless CLI
  - `fooFunction` is the name of the function as defined in the `serverless.yml` file
  - `eventStub.json` is the JSON file you can obtain by using `console.log(JSON.stringify(event))` at the beginning of the lambda function
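For instance, a breakpoint in the transform function could look like this (the surrounding code is only a sketch):

```js
// src/lib/transform.js (sketch)
export default record => {
  // Execution pauses here when a debugger is attached,
  // e.g. via `node --inspect-brk` as shown above.
  debugger;

  return {
    title: record.Title || '',
  };
};
```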
To use Chrome DevTools for debugging:
- Open Chrome
- Open DevTools panel
- Use the debugger as shown in this video
Alternatively, you can also use Node.js core REPL to step into your code.
As a result, you will be able to step in and debug your cloud function locally, with an emulation decently close to the real environment.
Another debugger you can use is `ndb`. If you decide to use `ndb`, the command could be:

```
$ ndb npx sls invoke local -f parseXls --path eventStub.json
```
Note: because the Lambda function of a given ETL depends on a file being present on S3, and the transform function is meant to take this file as input and output another file with a normalized structure, do not remove the given S3 file during the debugging phase. This way, the resource you want to test will be an actual file existing in the cloud and the Lambda function will operate much closer to how it does in real life.
Another way to closely emulate an S3 read stream of a file is to use Node.js's core `fs.createReadStream(path[, options])`. If you use this swapping approach, you gain independence from AWS services, but you lose possible system-specific behaviors of the AWS SDK.
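A sketch of that swap, using the AWS SDK v2 `getObject(...).createReadStream()` shape; the bucket, key, and the `LOCAL_FILE` switch are placeholders for illustration:

```js
import fs from 'fs';
import AWS from 'aws-sdk';

// Hypothetical switch: point LOCAL_FILE to a local copy of the producer's
// file while debugging, and fall back to S3 otherwise.
const getFileStream = () => {
  if (process.env.LOCAL_FILE) {
    return fs.createReadStream(process.env.LOCAL_FILE);
  }

  const s3 = new AWS.S3();

  return s3
    .getObject({ Bucket: 'eubfr-example-bucket', Key: 'bar/csv/foo.csv' })
    .createReadStream();
};

// Both branches return a readable stream for the parser to consume.
getFileStream().pipe(process.stdout);
```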
Apart from using unit tests for the faster development of ETLs, TDD can also be useful for the faster development of any other type of custom logic which does not rely on AWS services.
In fact, we do sometimes mock AWS services, but those tests have far less value compared to tests of our own code, which contains specific business logic and for which maintainability matters more. AWS services are maintained by AWS, not by us, and they can also change over time.
For an example, you can have a look at the `@eubfr/ingestion-quality-analyzer` service. Its use cases are described in its assertions, which also make it easy to iterate faster on the logic: again by using jest's watch mode and comparing the results of the various helper functions.
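As an illustration of this style of testing on plain helpers (the helper below is hypothetical and not taken from the quality analyzer):

```js
// A pure helper with business logic and no AWS dependencies.
const computeCoverage = fields => {
  const keys = Object.keys(fields);
  if (keys.length === 0) return 0;

  const filled = keys.filter(key => fields[key] !== null && fields[key] !== '');
  return filled.length / keys.length;
};

describe('computeCoverage', () => {
  test('returns 0 for an empty record', () => {
    expect(computeCoverage({})).toBe(0);
  });

  test('counts only filled fields', () => {
    expect(computeCoverage({ title: 'Foo', budget: '' })).toBe(0.5);
  });
});
```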
Each producer has its own dashboard through which data is loaded to the data lake. These dashboards are based on Create React App. The APIs used by the client apps are based on serverless services.
In a regular workflow, when in `dev`, each developer deploys the necessary resources in order to work with a given dashboard, both locally and remotely when deployed on S3. This is not always a convenient workflow for developers who want to focus their work only on the React web app, without spending the time and CloudFormation resources.
In order to be able to develop locally, one still needs a set of deployed cloud resources. The `test` stage is one which is always deployed, and if the developer needs to improve the web app without adding a new ETL, then it's acceptable to use the outputs of the already existing `test` CloudFormation resources.
To do so, one first needs to get the required information about surface APIs and set the necessary environment variables. Steps to achieve that:
- Edit the `config.json` file:

  ```json
  {
    "eubfr_env": "test",
    "region": "eu-central-1",
    "stage": "test",
    "username": "euinvest",
    "demo": ["euinvest"]
  }
  ```
- Export environment variables for the existing resources on AWS:

  ```
  $ eubfr-cli env generate
  ```

After this operation succeeds, three `.env` files will be created on your local file system. For working with the React web app, the interesting one is `eubfr-data-lake/demo/dashboard/client/.env`, with the following contents:
```
REACT_APP_STAGE=test
REACT_APP_PRODUCER=
REACT_APP_DEMO_SERVER=
REACT_APP_ES_PUBLIC_ENDPOINT=
REACT_APP_ES_PRIVATE_ENDPOINT=
```
The only property to change here is `REACT_APP_PRODUCER`, in order to specify which producer's dashboard to visualize.
Then, when you have your producer selected, start the project:
```
$ yarn react:start
```

This will use the existing cloud resources for the given set of environment variables and will start the dashboard locally.