Development tips and tricks

EUBFR makes heavy use of AWS cloud services and technologies. This is great for reducing costs and utilizing modern tools and approaches for solving problems.

However, developing against these services locally is not always easy or straightforward.

In this section of the documentation you will find hints for improving your development workflows so that you can work with the codebase locally faster and more easily.

Developing an ETL service

EUBFR is a data lake project. ETLs are the plugins that transform producers' data from their original formats and structures into the target project model.

Normally, when new producers are introduced to the system, they provide raw files which serve as the base for building ETLs. Use these raw files to:

  • Create a mapping document in the corresponding ETL's folder.
  • Open a pull request to build a preview and reach an agreement about the mapping.
  • Use this mapping agreement as a base for the development of the transform function.

Example: imagine you receive a file foo.csv from the BAR producer. In this case:

  • Create the mapping document in /services/ingestion/etl/bar/csv/README.md. You can find an example here.
  • Open a pull request, like this.
  • Use this mapping as described in the next section.

Developing a new ETL

Producers' raw data files vary in format and structure, but all of them should reach a consistent structure and format. This is the "harmonization" stage. Transform functions are called on each record/row of data, and the output of the process is a newline-delimited JSON file stored in the harmonized storage.
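
For illustration, a newline-delimited JSON file holds one JSON object per line; the field names below are hypothetical and do not reflect the actual project model:

{"project_id":"BAR-001","title":"First example project","budget":100000}
{"project_id":"BAR-002","title":"Second example project","budget":250000}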

This is the conventional structure of an ETL:

.
└── csv
    ├── node_modules
    ├── package.json
    ├── README.md
    ├── serverless.yml
    ├── src
    │   ├── events
    │   │   └── onParseCSV.js
    │   └── lib
    │       ├── sns.js
    │       ├── transform.js
    │       └── uploadFromStream.js
    ├── test
    │   ├── stubs
    │   │   └── record.json
    │   └── unit
    │       ├── events
    │       │   └── onParseCSV.spec.js
    │       └── lib
    │           ├── __snapshots__
    │           │   └── transform.spec.js.snap
    │           └── transform.spec.js
    └── webpack.config.js

Here's a possible workflow which has proved efficient so far:

  1. Create the skeleton of the serverless service

It includes:

  • the serverless service manifest file serverless.yml, which defines the cloud resources and the glue between them and the logic of the service
  • the AWS Lambda function which gets triggered when a new file for the ETL comes in; in this case it's ./src/events/onParseCSV.js. Without going into too many details here, you can take an example from another service matching the same file format (extension), CSV for example.
  • boilerplate assets: the test folder, the webpack.config.js file, etc. You possibly already have these if you took an example service in the previous step.
  2. Grab an example record structure
  • Deploy the service by running npx serverless deploy from the root of the service
  • Trigger the function and console.log the incoming event parameter in order to generate stubs for the event and a record. These are useful later for local development. You might want to JSON.stringify objects for better readability in CloudWatch logs.
  3. Write a small transform function
  • It's usually placed at src/lib/transform.js
  • It should include the flow type for Project
  • Add JSDoc comments to all helper functions in the transform in order to expose the smaller transformation steps in user-friendly API pages (see the sketch after this list)
  4. Write a test for the transform function
  • It's usually placed at test/unit/lib/transform.spec.js
  • Include assertions matching a snapshot; with the conventional layout above, a minimal spec could look like this:
import transform from '../../../src/lib/transform';
import record from '../../stubs/record.json';

test('Produces correct JSON output structure', () => {
  expect(transform(record)).toMatchSnapshot();
});
  • Run jest in watch mode to iterate faster. Each time you edit the file with the transform function, the test runner will re-run the test. You might want to use console.log to see the results of your transform function while you develop it.
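
To make steps 3 and 4 more concrete, here is a minimal sketch of what src/lib/transform.js could look like. The input field names (Nid, Title, Budget) and the local Project type are illustrative assumptions, not the actual project model; in a real ETL you should use the project's own Project flow type instead.

// @flow
// Illustrative sketch of src/lib/transform.js; not the real mapping.

// Stand-in for the project's real Project flow type.
type Project = {
  project_id: string,
  title: string,
  total_cost: number,
};

/**
 * Converts a raw budget string such as "EUR 1 000 000" to a number.
 *
 * @param {string} budget Raw budget value from the CSV record
 * @returns {number} The numeric budget, or 0 when it cannot be parsed
 */
const getBudget = (budget: string): number =>
  Number(String(budget).replace(/[^0-9.]/g, '')) || 0;

/**
 * Maps the fields of a single CSV record to the harmonized Project shape.
 *
 * @param {Object} record A single row from the producer's file
 * @returns {Project} The harmonized record
 */
export default (record: Object): Project => ({
  project_id: String(record.Nid || ''),
  title: String(record.Title || '').trim(),
  total_cost: getBudget(String(record.Budget || '')),
});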

Debugging a lambda function locally

AWS Lambda functions, also known as cloud functions, are not always easy to debug locally. However, by making use of the serverless framework, we have some helpers for debugging locally.

  1. In webpack.config.js, set the devtool: 'source-map' property if it is not already set. This way, Webpack will generate source maps so the code you step through is human-readable.

  2. Include a debugger; statement on the line of your code where you want to set a breakpoint.

  3. Run the following in the CLI: node --inspect-brk ./node_modules/.bin/serverless invoke local -f fooFunction --path eventStub.json

Where:

  • node --inspect-brk starts the Node.js core inspector and breaks before user code runs, opening a debugging session you can attach to from the Chrome browser
  • serverless invoke local is a specific command of the serverless CLI
  • fooFunction is the name of the function as defined in the serverless.yml file
  • eventStub.json is the JSON file which you can capture by using console.log(JSON.stringify(event)) at the beginning of the lambda function
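
For example, a temporary log statement at the top of the handler is enough to capture the stub from CloudWatch; the handler name and signature below are illustrative:

// src/events/onParseCSV.js (temporary change, remove once the stub is saved)
export const handler = async (event, context) => {
  // Copy this output from the CloudWatch logs into eventStub.json
  console.log(JSON.stringify(event));
  // ... the rest of the handler stays as-is
};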

To use Chrome DevTools for debugging, open chrome://inspect in Chrome and attach to the listed Node.js target.

Alternatively, you can also use Node.js core REPL to step into your code.

As a result, you will be able to step in and debug your cloud function locally with an emulation decently close to the real environment.

Another debugger you can use is ndb. If you decide to use ndb, the command could be:

$ ndb npx sls invoke local -f parseXls --path eventStub.json
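
If you don't have ndb yet, it can be installed globally with npm:

$ npm install -g ndb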

[Screenshot: Serverless debugging]

Note

Because the lambda function of a given ETL depends on a file being present on S3, and the transform function is meant to take this file as input and output another file with a normalized structure, do not remove the given S3 file during the debugging phase. This way, the resource you want to test will be an actual file existing in the cloud and the lambda function will operate much closer to real life.

Another way to closely resemble the S3 read stream of files is to use Node.js's core fs.createReadStream(path[, options]). If you use this swapping approach, you gain independence from AWS services, but you lose possible AWS SDK-specific behaviors.
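
A minimal sketch of that swap, assuming the handler normally streams the object via the AWS SDK's getObject().createReadStream(); the bucket and key names are illustrative:

const fs = require('fs');
const AWS = require('aws-sdk');

// Cloud version: stream the raw file from S3.
const s3 = new AWS.S3();
const s3Stream = s3
  .getObject({ Bucket: 'eubfr-raw-storage', Key: 'bar/foo.csv' })
  .createReadStream();

// Local swap: stream a copy of the same file from disk instead.
const localStream = fs.createReadStream('./test/stubs/foo.csv');

// The downstream parsing code can consume either stream in the same way.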

Use TDD for faster iterations on custom features

Apart from using unit tests for faster development of ETLs, TDD can also be useful for the faster development of any other type of custom logic which does not rely on AWS services.

In fact, we do sometimes mock AWS services, but such tests have far less value compared to tests of custom code, which contains specific business logic and is more important to keep maintainable. AWS services are not only maintained by AWS, but they can also change over time.

For an example, have a look at the @eubfr/ingestion-quality-analyzer service. Its use cases are described in the assertions, which also make it easy to iterate faster on the logic by again using jest's watch mode and comparing the results of the various helper functions.
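
For instance, running the test suite in watch mode from the root of the service you are iterating on (assuming jest is available as a local dependency) is as simple as:

$ npx jest --watch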

Develop demo clients locally

Each producer has its own dashboard through which data is loaded to the data lake. These dashboards are based on Create React App. The APIs used by the client apps are based on serverless services.

In a regular development workflow, each developer deploys the necessary resources in order to work with a given dashboard, both locally and remotely when deployed on S3. This is not always convenient for developers who want to focus their work only on the React web app, without spending the time and CloudFormation resources.

In order to develop locally, one still needs a set of deployed cloud resources. The test stage is always deployed, so if you need to improve the web app without adding a new ETL, it's acceptable to use the outputs of the already existing test CloudFormation resources.

To do so, first get the required information about the exposed APIs and set the necessary environment variables. Steps to achieve that:

  1. Edit the config.json file:
{
  "eubfr_env": "test",
  "region": "eu-central-1",
  "stage": "test",
  "username": "euinvest",
  "demo": ["euinvest"]
}
  2. Export environment variables for the existing resources on AWS
$ eubfr-cli env generate

After this operation succeeds, three .env files will be created on your local file system. For working with the React web app, the interesting one is at eubfr-data-lake/demo/dashboard/client/.env, with the following contents:

REACT_APP_STAGE=test
REACT_APP_PRODUCER=
REACT_APP_DEMO_SERVER=
REACT_APP_ES_PUBLIC_ENDPOINT=
REACT_APP_ES_PRIVATE_ENDPOINT=

The only property to change here is REACT_APP_PRODUCER in order to specify which producer's dashboard to visualize.
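
For example, to work with the dashboard of the euinvest producer from the configuration above, the line would become:

REACT_APP_PRODUCER=euinvest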

Then, when you have your producer selected, start the project:

$ yarn react:start

This will use the existing cloud resources for the given set of environment variables and will start the dashboard locally.