This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Mastro

Data Catalogue

Data providers can describe and publish data using a shared definition format. Data definitions can then be crawled from networked and distributed file systems, or published directly to a common endpoint.

Catalogue API

A Catalogue service endpoint implements the following interface:

type CatalogueService interface {
	Init(cfg *conf.Config) *resterrors.RestErr
	UpsertAssets(assets *[]Asset) (*[]Asset, *resterrors.RestErr)
	GetAssetByID(assetID string) (*Asset, *resterrors.RestErr)
	GetAssetByName(name string) (*Asset, *resterrors.RestErr)
	SearchAssetsByTags(tags []string, limit int, page int) (*Paginated[Asset], *resterrors.RestErr)
	Search(query string, limit int, page int) (*Paginated[Asset], *resterrors.RestErr)
	ListAllAssets(limit int, page int) (*Paginated[Asset], *resterrors.RestErr)
}

This interface maps directly onto a DAO backend:

type AssetDAOProvider interface {
	Init(*conf.DataSourceDefinition)
	Upsert(asset *Asset) error
	GetById(id string) (*Asset, error)
	GetByName(name string) (*Asset, error)
	SearchAssetsByTags(tags []string, limit int, page int) (*Paginated[Asset], error)
	ListAllAssets(limit int, page int) (*Paginated[Asset], error)
	Search(query string, limit int, page int) (*Paginated[Asset], error)
	CloseConnection()
}

Have a look at catalogue/daos/* for example implementations.
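
As a rough illustration of what such a backend might look like, here is a minimal in-memory sketch covering three of the DAO methods. The Asset and Paginated types below are trimmed, hypothetical stand-ins rather than the project's own definitions, and memoryDAO, newMemoryDAO and errNotFound are invented names. Init, CloseConnection and the search methods are left out for brevity, so this type does not satisfy the full interface.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Asset is a trimmed, hypothetical stand-in for the catalogue's Asset type.
type Asset struct {
	Name string
	Tags []string
}

// Paginated is a hypothetical stand-in for the generic page wrapper.
type Paginated[T any] struct {
	Data  []T
	Total int
	Page  int
}

// errNotFound is an assumed sentinel error, not part of the project.
var errNotFound = errors.New("asset not found")

// memoryDAO keeps assets in a map keyed by asset name.
type memoryDAO struct {
	assets map[string]Asset
}

func newMemoryDAO() *memoryDAO {
	return &memoryDAO{assets: map[string]Asset{}}
}

// Upsert inserts the asset or overwrites an existing one with the same name.
func (d *memoryDAO) Upsert(asset *Asset) error {
	if asset == nil || asset.Name == "" {
		return errors.New("asset must have a name")
	}
	d.assets[asset.Name] = *asset
	return nil
}

// GetByName returns a copy of the stored asset, or errNotFound.
func (d *memoryDAO) GetByName(name string) (*Asset, error) {
	a, ok := d.assets[name]
	if !ok {
		return nil, errNotFound
	}
	return &a, nil
}

// ListAllAssets returns one page of assets, ordered by name; pages start at 1.
func (d *memoryDAO) ListAllAssets(limit int, page int) (*Paginated[Asset], error) {
	if limit < 1 || page < 1 {
		return nil, errors.New("limit and page must be >= 1")
	}
	names := make([]string, 0, len(d.assets))
	for n := range d.assets {
		names = append(names, n)
	}
	sort.Strings(names) // map iteration order is random, so sort for stable pages
	start := (page - 1) * limit
	if start > len(names) {
		start = len(names)
	}
	end := start + limit
	if end > len(names) {
		end = len(names)
	}
	out := make([]Asset, 0, end-start)
	for _, n := range names[start:end] {
		out = append(out, d.assets[n])
	}
	return &Paginated[Asset]{Data: out, Total: len(names), Page: page}, nil
}

func main() {
	dao := newMemoryDAO()
	if err := dao.Upsert(&Asset{Name: "example_featureset", Tags: []string{"featureset"}}); err != nil {
		fmt.Println(err)
		return
	}
	page, err := dao.ListAllAssets(4, 1)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(page.Total, page.Data[0].Name)
}
```

Keying the map by name makes Upsert idempotent, which is one plausible reading of the upsert semantics; a real backend may key on an ID instead.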

This is translated to the following endpoints:

| Verb | Endpoint | Maps to |
|------|----------|---------|
| GET | /healthcheck/asset | github.com/data-mill-cloud/mastro/catalogue.Ping |
| GET | /asset/id/:asset_id | github.com/data-mill-cloud/mastro/catalogue.GetAssetByID |
| GET | /asset/name/:asset_name | github.com/data-mill-cloud/mastro/catalogue.GetAssetByName |
| PUT | /asset/ | github.com/data-mill-cloud/mastro/catalogue.UpsertAsset |
| PUT | /assets/ | github.com/data-mill-cloud/mastro/catalogue.BulkUpsert |
| POST | /assets/tags | github.com/data-mill-cloud/mastro/catalogue.SearchAssetsByTags |
| POST | /assets/search | github.com/data-mill-cloud/mastro/catalogue.Search |
| GET | /assets/ | github.com/data-mill-cloud/mastro/catalogue.ListAllAssets |

Those crossed out are meant for testing purposes and will be removed in the following releases.

Examples

We provide a few examples below:

List all - GET on localhost:8085/assets with an empty body. While the catalogue holds no assets, the result is:

{
    "message": "Error while retrieving asset :: mongo: no documents in result",
    "status": 404,
    "error": "not_found"
}

Upsert - PUT on localhost:8085/asset with body:

{"last-discovered-at" : "2021-03-22T21:19:39.634Z", "published-on" : "0001-01-01T00:00:00.000Z", "name" : "example_featureset", "description" : "my first featureset pushed to the catalogue", "depends-on" : ["table.mydb.mytable"], "type" : "featureset"}
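
The same request could be assembled from Go along these lines. This is only a sketch: the Asset struct is a hypothetical stand-in whose JSON tags are inferred from the example bodies in this README, and encodeAsset, newUpsertRequest and exampleAsset are invented helper names. The request is built but not sent, so the program runs without a live service.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Asset is a hypothetical stand-in; field tags follow the JSON shown above.
type Asset struct {
	LastDiscoveredAt time.Time `json:"last-discovered-at"`
	PublishedOn      time.Time `json:"published-on"`
	Name             string    `json:"name"`
	Description      string    `json:"description"`
	DependsOn        []string  `json:"depends-on"`
	Type             string    `json:"type"`
	Tags             []string  `json:"tags,omitempty"`
}

// exampleAsset mirrors the upsert body from the README example.
func exampleAsset() Asset {
	return Asset{
		LastDiscoveredAt: time.Date(2021, 3, 22, 21, 19, 39, 634000000, time.UTC),
		Name:             "example_featureset",
		Description:      "my first featureset pushed to the catalogue",
		DependsOn:        []string{"table.mydb.mytable"},
		Type:             "featureset",
	}
}

// encodeAsset serialises the asset into the JSON body of the request.
func encodeAsset(a Asset) ([]byte, error) {
	return json.Marshal(a)
}

// newUpsertRequest builds (but does not send) a PUT to the /asset/ route.
func newUpsertRequest(baseURL string, a Asset) (*http.Request, error) {
	body, err := encodeAsset(a)
	if err != nil {
		return nil, err
	}
	target := strings.TrimRight(baseURL, "/") + "/asset/"
	req, err := http.NewRequest(http.MethodPut, target, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newUpsertRequest("http://localhost:8085", exampleAsset())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

Leaving PublishedOn unset yields Go's zero time, which encoding/json writes as 0001-01-01T00:00:00Z, the same value that appears in the example responses.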

Bulk upsert - PUT on localhost:8085/assets with body:

[
    {"last-discovered-at" : "2021-03-22T21:19:39.634Z", "published-on" : "0001-01-01T00:00:00.000Z", "name" : "example_featureset", "description" : "my first featureset pushed to the catalogue", "depends-on" : ["table.mydb.mytable"], "type" : "featureset", "tags" : ["featureset"]},
    {"last-discovered-at" : "2021-03-22T21:19:39.634Z", "published-on" : "0001-01-01T00:00:00.000Z", "name" : "example_featureset", "description" : "my first featureset pushed to the catalogue", "depends-on" : ["table.mydb.mytable"], "type" : "featureset", "tags" : ["featureset"]}
]

GetByName - GET on localhost:8085/asset/name/example_featureset now returns:

{
    "last-discovered-at": "2021-03-23T13:52:43.787Z",
    "published-on": "0001-01-01T00:00:00Z",
    "name": "example_featureset",
    "description": "my first featureset pushed to the catalogue",
    "depends-on": [
        "table.mydb.mytable"
    ],
    "type": "featureset",
    "tags": [
        "featureset"
    ]
}
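
A client-side lookup by name might be sketched as follows. The helper names assetByNameURL and fetchAssetByName, and the restError struct, are illustrative assumptions; the error fields simply mirror the message/status/error bodies shown in this README. The asset is decoded into a generic map so that no particular struct layout is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// restError mirrors the error bodies shown in the examples (assumed shape).
type restError struct {
	Message string `json:"message"`
	Status  int    `json:"status"`
	Error   string `json:"error"`
}

// assetByNameURL builds the GET /asset/name/:asset_name URL, escaping the
// name so that characters such as "/" or spaces do not alter the path.
func assetByNameURL(baseURL, name string) string {
	return strings.TrimRight(baseURL, "/") + "/asset/name/" + url.PathEscape(name)
}

// fetchAssetByName performs the request and decodes either the asset or,
// for non-200 responses, the error body.
func fetchAssetByName(client *http.Client, baseURL, name string) (map[string]any, error) {
	resp, err := client.Get(assetByNameURL(baseURL, name))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		var re restError
		if err := json.NewDecoder(resp.Body).Decode(&re); err != nil {
			return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		return nil, fmt.Errorf("%s (%d): %s", re.Error, re.Status, re.Message)
	}

	var asset map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&asset); err != nil {
		return nil, err
	}
	return asset, nil
}

func main() {
	// Only the URL is printed here; calling fetchAssetByName with
	// http.DefaultClient requires a catalogue listening on this address.
	fmt.Println(assetByNameURL("http://localhost:8085", "example_featureset"))
}
```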

SearchAssetsByTags - POST on localhost:8085/assets/tags with a JSON body such as:

{
    "tags" : ["something"],
    "limit": 4,
    "page": 1
}

returns an HTTP error status with a JSON body such as:

{
    "message": "Error while retrieving assets using filter :: empty result set",
    "status": 404,
    "error": "not_found"
}

while with body:

{
    "tags" : ["featureset"],
    "limit": 4,
    "page": 1
}

we get a paginated list of all assets having the provided tags:

{
    "data": [
        {
            "last-discovered-at": "2021-03-23T13:52:43.787Z",
            "published-on": "0001-01-01T00:00:00Z",
            "name": "example_featureset",
            "description": "my first featureset pushed to the catalogue",
            "depends-on": [
                "table.mydb.mytable",
                "example_featureset1",
                "otherdep2"
            ],
            "type": "featureset",
            "labels": {
                "environment": "test",
                "instance": "2020_08",
                "project": "example"
            },
            "tags": [
                "featureset",
                "example"
            ],
            "versions": {}
        },
        {
            "last-discovered-at": "2021-03-23T13:52:43.788Z",
            "published-on": "0001-01-01T00:00:00Z",
            "name": "example_featureset1",
            "description": "my second featureset pushed to the catalogue",
            "depends-on": [
                "table.mydb.mytable",
                "example_featureset2",
                "otherdep2"
            ],
            "type": "featureset",
            "labels": {
                "environment": "test",
                "instance": "2020_08",
                "project": "example"
            },
            "tags": [
                "featureset",
                "example"
            ],
            "versions": {}
        }
    ],
    "pagination": {
        "total": 15,
        "page": 1,
        "perPage": 4,
        "prev": 0,
        "next": 2,
        "totalPage": 4
    }
}
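
The pagination block above is consistent with simple ceiling arithmetic: 15 results at 4 per page give 4 pages. The sketch below reproduces those numbers. It is not the project's implementation: the Pagination struct and the paginate function are hypothetical, and treating 0 as "no such page" for prev and next is an assumption drawn only from "prev": 0 on the first page.

```go
package main

import "fmt"

// Pagination mirrors the "pagination" object in the response above.
type Pagination struct {
	Total     int `json:"total"`
	Page      int `json:"page"`
	PerPage   int `json:"perPage"`
	Prev      int `json:"prev"`
	Next      int `json:"next"`
	TotalPage int `json:"totalPage"`
}

// paginate derives the page metadata for a 1-based page number.
func paginate(total, page, perPage int) (Pagination, error) {
	if total < 0 || page < 1 || perPage < 1 {
		return Pagination{}, fmt.Errorf("invalid arguments: total=%d page=%d perPage=%d", total, page, perPage)
	}
	// Integer ceiling of total/perPage.
	totalPage := (total + perPage - 1) / perPage
	p := Pagination{
		Total:     total,
		Page:      page,
		PerPage:   perPage,
		Prev:      page - 1, // 0 on the first page
		TotalPage: totalPage,
	}
	// Assumption: Next stays 0 when there is no further page.
	if page < totalPage {
		p.Next = page + 1
	}
	return p, nil
}

func main() {
	p, err := paginate(15, 1, 4)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```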