Mastro can be easily deployed to a Kubernetes (K8s) cluster. This guide walks through the required steps.
The catalogue, the feature store and the metric store are services that can be compiled statically and easily moved across environments.
The main difference is the CGO_ENABLED flag, which is set to 1 (the default) in the dynamically compiled version and to 0 in the static one; compare the catalogue Dockerfile with the crawler Dockerfile.
The crawler, by contrast, may depend on system libraries (e.g. Kerberos authentication libraries) and therefore has to be compiled dynamically.
If you are short on time, use the available Helm chart:
helm repo add mastro https://data-mill-cloud.github.io/mastro/helm-charts
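Once the repo is added, the chart can be installed as usual. A sketch (the release and namespace names are examples; the chart name matches the dependency used below):
helm repo update
helm install mastro mastro/mastro --namespace mastro --create-namespace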
You can also use a GitOps tool such as ArgoCD.
For instance, MongoDB and Mastro can be installed together by declaring them as dependencies of a new wrapping chart (Chart.yaml):
apiVersion: v2
name: mastro
description: A Helm chart for Mastro
type: Application
version: 0.1.0
appVersion: "0.3.1"
dependencies:
- name: mongodb
version: 10.7.1
repository: https://charts.bitnami.com/bitnami
- name: mastro
version: 0.1.0
repository: https://data-mill-cloud.github.io/mastro/helm-charts
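With the parent chart defined, it can be installed as usual. A sketch, assuming the Chart.yaml above sits in the current directory and using an example release name:
helm dependency update .
helm install mastro .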
Mind, however, that the chart does not install any of the crawlers; for those, refer to the following sections.
In the examples below, we assume a MongoDB database already deployed in the same namespace (or on any reachable host) at mongo-mongodb:27017.
For instance, we used the Bitnami Helm chart, which deploys MongoDB as a StatefulSet (see here).
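For reference, a sketch of such an installation (the release name mongo yields the mongo-mongodb service used below; the credentials match the connection strings in the examples, and parameter names may differ across chart versions):
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mongo bitnami/mongodb --set auth.username=mastro,auth.password=mastro,auth.database=mastro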
The config for the catalogue can be defined as a K8s config map, as follows:
apiVersion: v1
data:
catalogue-conf.yaml: |
type: catalogue
details:
port: 8085
backend:
name: catalogue-mongo
type: mongo
settings:
database: mastro
collection: mastro-catalogue
connection-string: "mongodb://mastro:mastro@mongo-mongodb:27017/mastro"
kind: ConfigMap
metadata:
name: catalogue-conf
Mind that in the example above we specified the DB user and password directly (i.e., mastro:mastro).
A K8s Secret, or one injected by an external vault (e.g. HashiCorp Vault), can be used for this purpose instead.
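A minimal sketch of this approach is to keep the whole config file (credentials included) out of the manifests, create it as a Secret, and mount it in the Deployment below by swapping the configMap volume source for a secret one (file and secret names are just examples):
# create a Secret from the local config file
kubectl create secret generic catalogue-conf --from-file=catalogue-conf.yaml
# then, in the Deployment below, reference the Secret instead of the ConfigMap
volumes:
  - name: catalogue-conf-volume
    secret:
      secretName: catalogue-conf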
A Deployment can then be created to spawn one or more replicas of the catalogue.
The configuration is mounted as a volume and its path is set using the MASTRO_CONFIG environment variable.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: mastro-catalogue
name: mastro-catalogue
spec:
replicas: 1
selector:
matchLabels:
app: mastro-catalogue
strategy: {}
template:
metadata:
labels:
app: mastro-catalogue
spec:
containers:
- image: datamillcloud/mastro-catalogue:v0.3.1
imagePullPolicy: Always
name: mastro-catalogue
resources: {}
ports:
- containerPort: 8085
protocol: TCP
env:
- name: MASTRO_CONFIG
value: /conf/catalogue-conf.yaml
volumeMounts:
- mountPath: /conf
name: catalogue-conf-volume
securityContext: {}
volumes:
- name: catalogue-conf-volume
configMap:
defaultMode: 420
name: catalogue-conf
A service is created with:
apiVersion: v1
kind: Service
metadata:
labels:
app: mastro-catalogue
name: mastro-catalogue
spec:
ports:
- name: rest-8085
port: 8085
protocol: TCP
targetPort: 8085
selector:
app: mastro-catalogue
type: ClusterIP
Mind that the Service only exposes the catalogue within the cluster.
You will have to create an Ingress (on plain K8s) or a Route (on OpenShift) to make it reachable from the outside world.
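A minimal Ingress sketch, assuming an NGINX ingress controller and a hypothetical host name mastro.example.com:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mastro-catalogue
spec:
  ingressClassName: nginx # assumption: an NGINX ingress controller is installed
  rules:
    - host: mastro.example.com # hypothetical host name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mastro-catalogue
                port:
                  number: 8085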
The feature store is deployed analogously, with its own ConfigMap, Deployment and Service:
apiVersion: v1
data:
fs-conf.yaml: |
type: featurestore
details:
port: 8085
backend:
name: fs-mongo
type: mongo
settings:
database: mastro
collection: mastro-featurestore
connection-string: "mongodb://mastro:mastro@mongo-mongodb:27017/mastro"
kind: ConfigMap
metadata:
creationTimestamp: null
name: fs-conf
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: mastro-featurestore
name: mastro-featurestore
spec:
replicas: 1
selector:
matchLabels:
app: mastro-featurestore
strategy: {}
template:
metadata:
labels:
app: mastro-featurestore
spec:
containers:
- image: datamillcloud/mastro-featurestore:v0.3.1
imagePullPolicy: Always
name: mastro-featurestore
resources: {}
ports:
- containerPort: 8085
protocol: TCP
env:
- name: MASTRO_CONFIG
value: /conf/fs-conf.yaml
volumeMounts:
- mountPath: /conf
name: fs-conf-volume
securityContext: {}
volumes:
- name: fs-conf-volume
configMap:
defaultMode: 420
name: fs-conf
apiVersion: v1
kind: Service
metadata:
labels:
app: mastro-featurestore
name: mastro-featurestore
spec:
ports:
- name: rest-8085
port: 8085
protocol: TCP
targetPort: 8085
selector:
app: mastro-featurestore
type: ClusterIP
The crawling agent can be easily debugged locally by overriding the default entrypoint:
docker run --entrypoint "/bin/sh" -it datamillcloud/mastro-crawlers:v0.3.1
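Alternatively, a sketch of a local run with a config file mounted and passed via MASTRO_CONFIG (the local file name is just an example; its content is shown next):
docker run -it \
  -v $(pwd)/crawler-conf.yaml:/conf/crawler-conf.yaml \
  -e MASTRO_CONFIG=/conf/crawler-conf.yaml \
  datamillcloud/mastro-crawlers:v0.3.1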
A config for the crawler can be mounted as a ConfigMap, for instance:
apiVersion: v1
data:
crawler-conf.yaml: |
type: crawler
backend:
name: impala-enterprise-datalake
type: impala
crawler:
root: ""
schedule-period: sunday
schedule-value: 1
start-now: true
catalogue-endpoint: "http://mastro-catalogue:8085/assets/"
settings:
host: "impala.domain.com"
port: "21000"
use-kerberos: true
kind: ConfigMap
metadata:
name: crawler-conf
This sets the agent to run every Sunday, as well as immediately after its Pod is created.
For the example Impala crawler, we need to spawn both the mastro crawler and a Kerberos authentication process.
To this end, we use a sidecar container performing a kinit on behalf of the user and renewing the ticket cache upon expiration.
We previously documented this process in this blog post.
Specifically, we rely on another GitHub project, Geronzio, to automatically build a Docker image (see the Dockerfile here) that includes krb5 and kstart, i.e. the Kerberos client libraries and the k5start client, respectively.
For the Kerberos configuration we need: i) a krb5.conf file, and ii) a keytab or password to authenticate with.
A krb5.conf
file defines the realm and the location of the KDC.
See here for the full documentation.
apiVersion: v1
data:
krb5.conf: |
[logging]
...
[libdefaults]
...
[realms]
...
[appdefaults]
..
kind: ConfigMap
metadata:
creationTimestamp: null
name: krb5-conf
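For reference, a minimal sketch of what such a file may contain (the realm and KDC host names are hypothetical):
[libdefaults]
  default_realm = DOMAIN.COM
  dns_lookup_kdc = false
[realms]
  DOMAIN.COM = {
    kdc = kdc.domain.com
    admin_server = kdc.domain.com
  }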
The user keytab can be mounted as a Secret, either created directly on K8s or injected from an external vault. For instance:
apiVersion: v1
data:
user.keytab: blablablablablablablablablabla
kind: Secret
metadata:
name: user-keytab
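The same Secret can also be created directly from the keytab file (the local path is just an example):
kubectl create secret generic user-keytab --from-file=user.keytab=/path/to/user.keytab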
Depending on the crawled source, a crawler may be scheduled to run once or periodically.
There are three ways to deploy a crawler on K8s: i) a Deployment, ii) a Job, or iii) a CronJob.
When using a Deployment, the github.com/go-co-op/gocron
library schedules the agent runs; the Deployment keeps a Pod running that executes either once or periodically.
On K8s, the Job and CronJob resources can be used for the same purpose, respectively for one-off and periodic runs.
Another possibility, out of the scope of this document, is to use a workflow manager such as Argo Workflows.
To deploy the crawler with an auth sidecar container, the following steps are taken:
- add as a sidecar a container providing a Kerberos client;
- mount the krb5.conf ConfigMap on both the application container and the sidecar (as a read-only volume); mind that mounting anything directly at /etc is not a good idea, since Kubernetes normally injects host and DNS information there, and the mount may result in the Pod being rejected by the admission controller or failing at runtime. Something similar may occur at /tmp, so pick custom paths instead. The KRB5_CONFIG and KRB5CCNAME variables can be used to override the default locations of the krb5.conf and ticket-cache files, respectively. Specifically, the cache file can be written to an ephemeral volume shared between the main and the sidecar container;
- mount the keytab Secret on the sidecar container as a read-only volume, e.g. at /keytabs.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: mastro-impala-crawler
name: mastro-impala-crawler
spec:
replicas: 1
selector:
matchLabels:
app: mastro-impala-crawler
strategy: {}
template:
metadata:
labels:
app: mastro-impala-crawler
spec:
containers:
# sidecar container
- image: pilillo/geronzio:20210305
imagePullPolicy: IfNotPresent
name: geronzio
env:
- name: KRB5_CONFIG
value: /etc-krb5/krb5.conf
- name: KRB5CCNAME
value: /tmp-krb5/krb5cc
- name: KRBUSER
value: SMARTUSER01
- name: REALM
value: DOMAIN.COM
# directly using the entrypoint
#command: ["kinit", "-kt", "/keytabs/user.keytab", "$(KRBUSER)@$(REALM)"]
command: ["/bin/sh", "-c"]
# use k5start to keep the ticket cache fresh, checking/renewing it every 60 minutes (-K takes minutes) and blocking while waiting
args: ["k5start -f /keytabs/user.keytab $(KRBUSER)@$(REALM) -v -K 60 -x"]
# note: a container-level restartPolicy and the Sidecar lifecycle type are not
# part of the stable K8s API and would be rejected; kept here only as a reference
#restartPolicy: OnFailure
#lifecycle:
#  type: Sidecar
volumeMounts:
- mountPath: /keytabs
name: keytab-volume
readOnly: true
- mountPath: /etc-krb5
name: krb5-conf-volume
readOnly: true
- mountPath: /tmp-krb5
name: shared-cache
# actual crawler
- image: datamillcloud/mastro-crawlers:v0.3.1
imagePullPolicy: Always
#IfNotPresent
name: mastro-crawler
resources: {}
env:
- name: KRB5_CONFIG
value: /etc-krb5/krb5.conf
- name: KRB5CCNAME
value: /tmp-krb5/krb5cc
- name: MASTRO_CONFIG
value: /conf/crawler-conf.yaml
volumeMounts:
- mountPath: /conf
name: crawler-conf-volume
- mountPath: /etc-krb5
name: krb5-conf-volume
readOnly: true
- mountPath: /tmp-krb5
name: shared-cache
securityContext: {}
volumes:
- name: crawler-conf-volume
configMap:
defaultMode: 420
name: crawler-conf
- name: krb5-conf-volume
configMap:
defaultMode: 420
name: krb5-conf
- name: shared-cache
emptyDir: {}
- name: keytab-volume
secret:
secretName: user-keytab
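Once applied, a quick way to verify that the ticket is obtained and the crawler runs is to follow the logs of both containers (the manifest file name is an example; container names are as defined above):
kubectl apply -f crawler-deployment.yaml
kubectl logs deploy/mastro-impala-crawler -c geronzio -f
kubectl logs deploy/mastro-impala-crawler -c mastro-crawler -f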
A K8s Job is a batch process meant to run once. For instance:
apiVersion: batch/v1
kind: Job
metadata:
creationTimestamp: null
name: mastro-impala-crawler
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- image: datamillcloud/mastro-crawlers:v0.3.1
name: mastro-crawler
resources: {}
restartPolicy: Never
status: {}
This is only a skeleton Job example; please refer to the Deployment case above for the complete Impala setup (sidecar, volumes and environment).
A CronJob can be created with the following syntax:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
creationTimestamp: null
name: mastro-impala-crawler
spec:
jobTemplate:
metadata:
creationTimestamp: null
name: mastro-impala-crawler
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- image: datamillcloud/mastro-crawlers:v0.3.1
name: mastro-crawler
resources: {}
restartPolicy: OnFailure
schedule: 0 0 * * 0
status: {}
This is only a skeleton CronJob example; again, refer to the Deployment case above for the complete Impala setup.
This section describes the schedule format introduced above, i.e., the format string defining the schedule interval of cron jobs.
Specifically, it consists of the following 5 fields:
| Field | Description | Values |
|---|---|---|
| 1 | Minute | 0 to 59, or * |
| 2 | Hour | 0 to 23, or * |
| 3 | Day of the Month | 1 to 31, or * |
| 4 | Month | 1 to 12, or * |
| 5 | Day of the Week | 0 to 7 (0 == 7, Sunday), or * |
Mind that the string must contain an entry for each field, using an asterisk (*) where no constraint applies.
For instance:
- 0 0 * * 0 schedules the job every Sunday at midnight (00:00)
Also, instead of a specific value, a slash followed by a step size can be used to schedule periodically at the granularity of that field:
- */5 * * * * schedules the job every 5 minutes
- 0 */2 * * * schedules the job every second hour, on the hour