We're going to provision two workload clusters, integrate them with Tanzu Service Mesh, deploy some applications, configure a Global Namespace, define a public service, and execute an application failover scenario. The result should demonstrate resiliency and high availability across clusters, no matter where they're running.
This assumes you have access to Harbor (or another container image registry) from which to source container images for the applications you'll deploy.
We're just going to create a couple of clusters in the same region but in different availability zones. We assume a management cluster is already provisioned.
cat > zoolabs-workload-1.yml <<EOF
CLUSTER_NAME: zoolabs-workload-1
CLUSTER_PLAN: dev
NAMESPACE: default
CNI: antrea
IDENTITY_MANAGEMENT_TYPE: none
CONTROL_PLANE_MACHINE_TYPE: t3.large
NODE_MACHINE_TYPE: m5.xlarge
AWS_REGION: "us-west-2"
AWS_NODE_AZ: "us-west-2a"
AWS_SSH_KEY_NAME: "se-cphillipson-cloudgate-aws-us-west-2"
BASTION_HOST_ENABLED: false
ENABLE_MHC: true
MHC_UNKNOWN_STATUS_TIMEOUT: 5m
MHC_FALSE_STATUS_TIMEOUT: 12m
ENABLE_AUDIT_LOGGING: false
ENABLE_DEFAULT_STORAGE_CLASS: true
CLUSTER_CIDR: 100.96.0.0/11
SERVICE_CIDR: 100.64.0.0/13
ENABLE_AUTOSCALER: false
EOF
cat > zoolabs-workload-2.yml <<EOF
CLUSTER_NAME: zoolabs-workload-2
CLUSTER_PLAN: dev
NAMESPACE: default
CNI: antrea
IDENTITY_MANAGEMENT_TYPE: none
CONTROL_PLANE_MACHINE_TYPE: t3.large
NODE_MACHINE_TYPE: m5.xlarge
AWS_REGION: "us-west-2"
AWS_NODE_AZ: "us-west-2b"
AWS_SSH_KEY_NAME: "se-cphillipson-cloudgate-aws-us-west-2"
BASTION_HOST_ENABLED: false
ENABLE_MHC: true
MHC_UNKNOWN_STATUS_TIMEOUT: 5m
MHC_FALSE_STATUS_TIMEOUT: 12m
ENABLE_AUDIT_LOGGING: false
ENABLE_DEFAULT_STORAGE_CLASS: true
CLUSTER_CIDR: 100.96.0.0/11
SERVICE_CIDR: 100.64.0.0/13
ENABLE_AUTOSCALER: false
EOF
tanzu cluster create --file zoolabs-workload-1.yml
tanzu cluster create --file zoolabs-workload-2.yml
tanzu cluster scale zoolabs-workload-1 --worker-machine-count 3
tanzu cluster scale zoolabs-workload-2 --worker-machine-count 3
You'll want to change a few of the values for the properties of each cluster's configuration above; minimally CLUSTER_NAME and AWS_SSH_KEY_NAME. It's also worth mentioning that you'll want to specify a NODE_MACHINE_TYPE that has at least 4 CPUs. It'll take ~15-20 minutes to provision the supporting infrastructure and scale the worker nodes for each cluster.
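While you wait, you can keep an eye on provisioning progress with the Tanzu CLI. A minimal sketch, assuming the cluster names from the configuration above:
tanzu cluster list
tanzu cluster get zoolabs-workload-1
tanzu cluster get zoolabs-workload-2
Both clusters should eventually report a running status with the expected number of worker nodes.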
Obtain the new workload cluster kubectl configuration.
tanzu cluster kubeconfig get zoolabs-workload-1 --admin
tanzu cluster kubeconfig get zoolabs-workload-2 --admin
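With the admin kubeconfigs merged into your local config, you can switch contexts and confirm the nodes are Ready. The context names below follow the typical <cluster-name>-admin@<cluster-name> convention; adjust them if yours differ.
kubectl config use-context zoolabs-workload-1-admin@zoolabs-workload-1
kubectl get nodes
kubectl config use-context zoolabs-workload-2-admin@zoolabs-workload-2
kubectl get nodes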
While we could enable TSM for an organization then integrate both clusters with TSM via Tanzu Mission Control, we're going to follow these instructions:
- Create an IAM policy for managing domain records in a Route53 hosted zone
- Create account, attach policy, and obtain credentials
- Manage integration
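For the first two steps above, a minimal CLI sketch of what the policy and account setup might look like is below; the policy name, user name, and broad hosted zone scoping are assumptions, so defer to the linked instructions for the exact permissions Tanzu Service Mesh requires.
cat > tsm-route53-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["route53:ChangeResourceRecordSets"],
      "Resource": ["arn:aws:route53:::hostedzone/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"],
      "Resource": ["*"]
    }
  ]
}
EOF
aws iam create-policy --policy-name tsm-route53-access --policy-document file://tsm-route53-policy.json
aws iam create-user --user-name tsm-dns
aws iam attach-user-policy --user-name tsm-dns --policy-arn <arn-from-create-policy-output>
aws iam create-access-key --user-name tsm-dns
The access key id and secret returned by the last command are the credentials you'll supply when configuring the integration.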
A visual montage...
You are creating a Domain provider.
Follow these instructions.
A visual montage...
If you want to be able to encrypt traffic using TLS, then you will need to manage keys and certificates.
Follow these instructions.
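If you don't already have a certificate and key for your lab domain, one option (purely for lab purposes) is a self-signed wildcard certificate generated with openssl. The domain below is just an example and the -addext flag requires OpenSSL 1.1.1 or newer; substitute your own domain and follow the linked instructions for uploading the key and certificate.
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout lab.zoolabs.me.key -out lab.zoolabs.me.crt \
  -subj "/CN=*.lab.zoolabs.me" \
  -addext "subjectAltName=DNS:*.lab.zoolabs.me,DNS:lab.zoolabs.me"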
Rinse-and-repeat these instructions for each cluster.
A visual montage...
That last screenshot is multi-step. Make sure you pay attention to detail. Supply names for the clusters (in each dialog box). Generate the security token. Target a workload cluster, apply the YAML (to register then connect), then choose whether to install TSM cluster-wide or in just certain namespaces (honoring excludes), then click on Install Tanzu Service Mesh. Rinse-and-repeat for the second and subsequent clusters.
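Once the install finishes on a cluster, you can spot-check it from the command line. The namespaces below are what a Tanzu Service Mesh onboarding typically creates; treat them as assumptions and verify against what the console reports.
kubectl get pods -n vmware-system-tsm
kubectl get pods -n istio-system
All pods in both namespaces should settle into a Running state before you move on.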
Rinse-and-repeat these instructions targeting each cluster (a rough sketch of the underlying App resource follows this list):
- primes
  - Employ Option 1
- console-availability
  - Choose to follow one section's steps (employing Public or Private manifests)
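For context on what gets deployed under the hood, each app is delivered as a kapp-controller App custom resource that continuously fetches manifests from a git repository, which is why we'll be editing spec.fetch.git.ref later on. A rough sketch for the primes app follows; the service account name, fork URL, and subPath are assumptions, so defer to the linked instructions for the authoritative manifests.
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: primes-dev
  namespace: apps
spec:
  serviceAccountName: apps-sa   # hypothetical; use whatever service account the instructions create
  fetch:
    - git:
        url: https://github.com/your-org/k8s-manifests   # your fork
        ref: origin/main
        subPath: com/vmware/primes   # hypothetical path within the repo
  template:
    - ytt: {}
  deploy:
    - kapp: {}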
Once both clusters are integrated we will create a Global Namespace.
Follow these instructions.
Note: the Global Namespace should never be given the same name as the Domain.
A visual montage...
Make sure that the Domain is not set to your real domain.
Follow these instructions.
Here's what the configuration with no checks looks like...
Whew! You'd think you were done by now.
The primes app is a simple, single micro-service application, so you're done with that one. The console availability application, however, consists of two micro-services: a client and a server. Up to this point, the client has been configured to interact with the server via a cluster-local DNS service name.
To leverage service discovery within the mesh, and to facilitate failover in case of a pod or cluster failure (where a server instance would become unavailable), we'll want to update the configuration of the client to employ the GNS domain.
Fork and clone either of https://github.com/pacphi/k8s-manifests or https://github.com/pacphi/k8s-manifests-private. Then make the necessary update.
E.g.,
git clone ... # fill in the rest targeting your fork
cd k8s-manifests
git checkout mesh
cd com/vmware/console-availability/client/apps
sed -i 's/apps.svc.cluster.local/{gns-domain}/g' values.yml
Replace {gns-domain} above with your actual GNS domain.
Commit and push the update.
git add .
git commit -m "Update GNS domain"
git push -u origin mesh
Next, point the client's App CR on one of the clusters at the new branch:
kubectl edit app console-availability-client -n apps
Look for spec.fetch.git.ref and update the value to be origin/mesh. Then save your changes and exit from the editor. (If you're relying on vi, type :wq.)
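If you'd prefer a non-interactive alternative, a patch along these lines should achieve the same result, assuming fetch is a single-element list as in a standard kapp-controller App CR:
kubectl patch app console-availability-client -n apps --type json \
  -p '[{"op":"replace","path":"/spec/fetch/0/git/ref","value":"origin/mesh"}]'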
Guess what? You're still not done. You need to rinse-and-repeat the above edit targeting your other cluster to be truly resilient. Do that now.
Congratulations! You've just completed multi-cluster, high-availability deployments of two applications.
Here's what things should look like now...
You should see some new records in the hosted zone(s) you permitted Tanzu Service Mesh to access and write to.
Let's test that we can interact with the service we exposed.
curl http://primes.lab.zoolabs.me/primes/1/50 | jq
Your domain may be different from what's in the example above.
Sample output
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 72 0 72 0 0 545 0 --:--:-- --:--:-- --:--:-- 545
{
"end": 50,
"primes": [
2,
3,
5,
7,
11,
13,
17,
19,
23,
29,
31,
37,
41,
43,
47
],
"start": 1
}
Start by targeting a cluster.
We're going to simulate application failure by reconfiguring the liveness probe of an application in one cluster. Because we have configured continuous deployment, we can't just edit or patch an existing deployment. Instead, we will create a branch on the git repository (where the deployment manifests are maintained) and make an update to the manifest like so:
git branch broken
git checkout broken
cd com/vmware/console-availability/server/apps
sed -i '0,/health/{s/health/broken/}' config.yml
This assumes you've checked out and have been working with a fork of the source from either https://github.com/pacphi/k8s-manifests or https://github.com/pacphi/k8s-manifests-private.
Commit and push the update.
git add .
git commit -m "Intentionally break liveness probe"
git push -u origin broken
It's not enough to have made the update to the manifest on the branch to trigger a deployment. Why? Because the original App CR was configured to watch for updates on origin/main. So we need to update the App CR as well to have it point to our new branch. The easiest way to do this is to run:
kubectl edit app primes-dev -n apps
Look for spec.fetch.git.ref and update the value to be origin/broken. Then save your changes and exit from the editor. (If you're relying on vi, type :wq.)
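Before heading back to the console, you could watch the effect take hold on the cluster you just broke; failing liveness probes will show up as container restarts. A quick check, assuming the apps namespace used earlier:
kubectl get pods -n apps
kubectl get events -n apps --sort-by=.lastTimestamp | tail -n 20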
After a few moments, go have another look at the GNS Topology view. You should see clients in both clusters fail over to the available server instance in the remaining healthy cluster.
Pretty neat, huh?