
Generate High level Design Document for HA builder #1675

Closed
rahulgoel1 opened this issue Dec 13, 2021 · 11 comments
rahulgoel1 commented Dec 13, 2021

Author a high-level design document for the Highly Available Builder pattern:

The research on the pattern has been done and documented in the issue below.

#1623

@mwrock mwrock self-assigned this Dec 15, 2021
mwrock commented Dec 22, 2021

Current State

Today we provide an installer for a standalone single-node Builder deployment that includes PostgreSQL, MinIO, the middle-tier API service, memcached, and an NGINX front end. Both the data and object storage services can easily be configured to reside external to the front-end services. We also provide some instructions on how to scale out front ends. That guidance includes an --install-frontend argument that can be used with the above install.sh to install only the front-end services (builder-api-proxy, builder-api, and memcached). This guidance has a key flaw: each memcached instance is set up standalone, so the cache is not distributed among the front-end nodes. As a result, the load balancer must be configured with sticky sessions.
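For reference, adding a front-end node under the current guidance is a single invocation of the existing installer (a sketch; prerequisites such as the environment file are covered in the on-prem install docs):

```shell
# Sketch: adding a front-end node with the existing installer.
# Installs only builder-api-proxy, builder-api, and memcached.
# Because each memcached is standalone, the load balancer in front
# of these nodes currently has to use sticky sessions.
./install.sh --install-frontend
```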

Proposed plan

This issue proposes the following in an initial HA offering:

Data Services

Provide documentation for configuring an external PostgreSQL installation. This should include memory and CPU recommendations and any other Builder-specific database requirements, if any. We will not attempt to provide an automated multi-instance PostgreSQL install; instead, I would argue we rely heavily on the HA documentation provided by PostgreSQL and by AWS RDS. If a customer simply wants a single PostgreSQL instance to offload data services from the front end, we could provide an --install-postgresql argument to the install.sh installer that they could run on that instance.

Object Storage Services

Take a similar approach for object storage as discussed above for data services. If a customer simply wants to set up a single-node MinIO instance, we can provide an --install-minio argument to install.sh. Otherwise we should point customers to the docs for AWS S3, Artifactory, or MinIO's distributed setup guide.

Memcache

Adjust how builder-api configures memcached so that it can include all front-end nodes in the cache and thereby not require sticky sessions on the load balancer. Alternatively, we could recommend an entirely external memcached cluster. However, I would argue that a separate memcached cluster adds unneeded complexity, since the resources used on the front-end nodes are relatively small. Currently builder-api can recognize and connect to the memcached instances on all other front-end nodes as long as all nodes are participating in a Habitat Supervisor ring. My concern is that designating a permanent peer for the ring potentially adds more complexity than the value the ring provides. If the Builder setup connects to a single on-prem PostgreSQL node, we could easily use that as the permanent peer, but there is no guarantee that a single on-prem data node will exist. Provisioning a bastion cluster for this is overkill if all we gain is the ability to auto-configure memcached endpoints.
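A rough sketch of the ring-based discovery described above, using standard Supervisor options (the peer address and the package/bind names here are illustrative assumptions, not the installer's actual defaults):

```shell
# Join this front end's Supervisor to the ring by peering with an
# existing front-end node (address is illustrative):
hab sup run --peer 10.0.0.11 &

# Bind builder-api to the memcached service group so it can discover
# every memcached instance participating in the ring:
hab svc load habitat/builder-api --bind memcached:builder-memcached.default
```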

Next Steps

So this work will be largely documentation-based, including basic requirements for external services but relying on vendor guidance to install those services. Having reviewed the HA support docs of other distributed software (for example Artifactory, TeamCity, and Microsoft TFS), this seems like a common pattern. It should be safe to assume that customers demanding full HA capabilities will be familiar with HA concepts and configurations.

The only builder specific work I anticipate is the memcache configuration discussed above.

The on-prem scripts will add arguments, similar to the existing --install-frontend, for a single-instance database or MinIO.

pozsgaic commented Jan 4, 2022

I agree that if one of our products already has a documented HA solution, we probably do not want to try to re-invent one of our own. Many of these are commercial-grade solutions, such as PostgreSQL, and we would not want to deviate from them.

As for the memcached solution to front-end scaling, this would be a good step to remove sticky sessions and have a cleaner load-balanced solution. Some questions to consider with regard to the memcached update:

  • What is the impact of having multiple builder-api instances hitting the front ends?
  • Does there need to be any sync between builder-api instances?

In addition to any research on new HA development work, such as memcached, we should also invest some time to understand each component's HA solution so that we can be proactive in supporting new customer configurations. For example, MinIO uses erasure coding as part of its solution to achieve HA. We want to have a decent grasp of how mechanisms like this work if a customer decides to go that route.

Also, I believe there was an issue where a large number of Supervisor nodes were performing downloads at the same time and this caused a number of failures. Can we use this memcached upgrade as an opportunity to mitigate that issue or to address other known pain points?
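For context on the MinIO point above: MinIO's distributed mode applies erasure coding across a pool of drives, and is started by running the same command on every node in the pool (hostnames and drive paths here are illustrative):

```shell
# Run on each of the four nodes; MinIO forms a distributed,
# erasure-coded pool across all listed hosts and drives:
minio server http://node{1...4}.example.com/data{1...2}
```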

mwrock commented Jan 4, 2022

The memcached impact would be the same as what we see in the public SaaS offering, since this would mimic how multiple builder-api instances interact with multiple memcached instances.

There may need to be a sync after removing or adding a builder-api/memcached node. That should be investigated.

Of course, another possibility introduced by habitat-sh/on-prem-builder#253 is to remove memcached from on-prem deployments.

Definitely a good point about having familiarity with third party HA configs of postgres, minio, etc. I do wonder if that should be owned by support rather than the development team.

rahulgoel1 (issue author) commented:

I remember that in one of the calls it was mentioned that there was an attempt to set up PostgreSQL on Azure and apparently there was an issue with that. Should we consider supporting the Azure cloud for Builder and an HA setup in Azure? It could be a separate work item, which is fine.

mwrock commented Jan 4, 2022

Yes, supporting Azure PostgreSQL definitely seems like something worth looking into. And I agree that it can be tracked separately.

mwrock commented Jan 10, 2022

Based on some offline feedback, there is a desire to utilize a Habitat ring for the front-end services. Currently we support binding builder-api to memcached. If all front ends are peered in the same ring, then the cache will be distributed across all memcached instances. Additionally, there is a boolean load_balanced setting in the builder-api-proxy config. If this is true (it's false by default), then a single builder-api-proxy node can balance API requests across multiple builder-api services.
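As a sketch, the load_balanced setting could be flipped at runtime through the ring with hab config apply (the service-group name and version number below are illustrative):

```shell
# Push a config change to every builder-api-proxy in the ring;
# "1" is an incarnation number that must increase on each apply.
echo 'load_balanced = true' | hab config apply builder-api-proxy.default 1
```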

There was also interest in supporting the ability to spin up an HA PostgreSQL and/or MinIO cluster via Habitat. Currently the Habitat configuration for builder-datastore and builder-minio does not support HA. While the PostgreSQL/MinIO clusters would not need to be in the same ring as the front-end services, it would be nice to support an HA setup for these services from our Habitat configuration.

I think there was consensus in the above feedback that supporting Habitat-enabled clustering of PostgreSQL and MinIO could be deferred to a subsequent iteration of this HA work.

mwrock commented Jan 10, 2022

I think we can go ahead and add logic to our install script to peer the front-end nodes. This absolutely makes sense for the memcached config; however, I wonder if we should support the load_balanced option. Our scaled-out SaaS configuration does not enable this option and has a separate builder-api-proxy service on each builder-api node. A single builder-api-proxy service would allow only one instance to host the Builder UI and is therefore not truly HA. My suggestion would be for customers to stand up their own load balancer pointing to the builder-api-proxy service on each node where builder-api and memcached are running.

mwrock commented Jan 12, 2022

We had a follow-up discussion regarding the basic HA topology. We agreed to go ahead with enabling the load_balanced option and to continue with a proxy service on each front-end node. The idea here is that customers would still put a load balancer in front of the proxies, but if the builder-api service on a proxy node was down, the load_balanced config would route the API request to a different, healthy node.

mwrock commented Jan 12, 2022

[Image: diagram of the recommended scaled-out Builder topology]

This is the basic topology of our recommended scaled-out configuration. The load balancer will be provisioned by the customer. The MinIO and PostgreSQL instances can either be installed via our install.sh installer or provisioned by the customer, if the customer already manages their own PostgreSQL or S3-compatible infrastructure. install.sh can install the MinIO instance on the same node as PostgreSQL or on separate nodes:

./install.sh --install-minio --install-postgresql

or

./install.sh --install-minio
./install.sh --install-postgresql

The front-end nodes will each host the builder-api-proxy, builder-api, and builder-memcached services. These services will all be joined to a single Habitat ring. Requests from builder-api will be routed to all memcached services. Also, builder-api-proxy will load balance the builder-api services, so builder-api requests from any builder-api-proxy service will be round-robined across all available builder-api services.
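Putting the pieces together, a hypothetical bring-up of this topology might look like the following (addresses are illustrative, and the exact peering mechanics in install.sh are still to be designed):

```shell
# Back-end node: data and object storage via the installer
# (or use customer-managed PostgreSQL / S3-compatible services):
./install.sh --install-minio --install-postgresql

# Each front-end node: install the front-end services, then peer the
# Supervisor into the shared ring via an existing front end:
./install.sh --install-frontend
hab sup run --peer 10.0.0.11
```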

mwrock commented Jan 13, 2022

I'm going to close this for now since it sounds like we have general consensus to move forward with this.

@mwrock mwrock closed this as completed Jan 13, 2022