Generate High level Design Document for HA builder #1675
Comments
Current State

Today we provide an installer for a standalone, single node builder deployment which includes postgres, minio, the middle tier API service, memcache, and an nginx front end. Both the data and object storage services can easily be configured to reside external to the front end services. We also provide some instructions on how to scale out front ends. The scaling out guidance includes an

Proposed plan

This issue proposes the following in an initial HA offering:

Data Services

Provide documentation for configuring an external PostgreSQL installation. This should include memory and CPU recommendations and any other Habitat database specific requirements, if any. We will not attempt to provide an automated multi instance postgres install. I would argue that we should rely heavily on the HA documentation provided by PostgreSQL and the AWS RDS services. If a customer simply wants a single PostgreSQL instance for offloading data services from the front end, we could provide a

Object Storage Services

Take a similar approach for object storage as discussed above for data services. If a customer simply wants to set up a single node minio instance, we can provide a

Memcache

Adjust how builder-api configures memcache so that it can include all front end nodes in the cache and thereby not require sticky sessions on the load balancer. Alternatively, we could recommend an entirely external memcache cluster. However, I would argue that adding a separate memcache cluster adds unneeded complexity since the resources used on the front end nodes are relatively small. Currently builder-api can recognize and connect to the memcache instances on all other front end nodes as long as all nodes are participating in a Habitat Supervisor ring (a sketch follows below). My concern is that designating a permanent peer for the ring potentially adds more complexity than the value added by the ring. If the builder setup connects to a single on-prem PostgreSQL node, we could easily use that as the permanent peer. However, there is no guarantee that a single on-prem data node will exist. Provisioning a bastion cluster for this is just overkill if all we are getting is the ability to auto configure memcache endpoints.

Next Steps

So this work will be largely documentation based, including basic requirements for external services but relying on vendor guidance to install those services. Having reviewed the HA support docs of other distributed software (examples: Artifactory, TeamCity, Microsoft TFS), this seems like a common pattern. It should be safe to assume that customers demanding full HA capabilities will be familiar with HA concepts and configurations. The only builder specific work I anticipate is the memcache configuration discussed above. The on-prem scripts will add arguments similar to the existing
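As a rough illustration of the ring-based memcache discovery described above, the following is a minimal sketch. The package idents, bind name, and IP address are assumptions for illustration only, not the confirmed builder configuration.

```bash
# On each front end node, start a Supervisor and peer it with one of the
# other front ends (the IP below is only a placeholder).
hab sup run --peer 10.0.0.11 &

# Load memcached and builder-api on every front end. The bind name
# "memcached" is an assumption; a bind like this is what would let
# builder-api discover every memcached instance that joins the
# memcached.default service group.
hab svc load core/memcached
hab svc load habitat/builder-api --bind memcached:memcached.default
```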
I agree that if one of our products already has a documented HA solution, we probably do not want to try to re-invent one of our own. Since many of these are commercial-grade solutions, such as Postgres, we would not want to deviate from their solutions.
The memcache impact would be the same as what we see in the public SaaS offering, since this would mimic how multiple builder-api instances interact with multiple memcache instances. There may need to be a sync after removing or adding a builder-api/memcache node; that should be investigated. Of course, another possibility introduced by habitat-sh/on-prem-builder#253 is to remove memcache from on-prem deployments entirely.

Definitely a good point about having familiarity with third party HA configs of postgres, minio, etc. I do wonder if that should be owned by support rather than the development team.
I remember that in one of the calls it was mentioned that there was an attempt to set up Postgres on Azure and apparently there was an issue with that. Should we consider supporting the Azure cloud for builder and an HA setup in Azure? It could be a separate work item, which is fine.
Yes, supporting Azure Postgres definitely seems like something worth looking into. And I agree that it can be tracked separately.
Based on some offline feedback, there is a desire to utilize a Habitat ring for the front end services. Currently we support binding the builder-api to memcached. If all front ends are peered in the same ring, then the cache will be distributed across all memcache instances. Additionally there is a boolean

There was also interest in supporting the ability to spin up an HA PostgreSQL and/or minio cluster via Habitat. Currently the Habitat configuration for builder-datastore and builder-minio does not support HA. While the PostgreSQL/minio clusters would not need to be in the same ring as the front-end services, it would be nice to support an HA setup for these services from our Habitat configuration.

I think there was consensus in the above feedback that supporting Habitat enabled clustering of PostgreSQL and minio could be delayed to a subsequent iteration of this HA work.
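As a side note on mechanics: once the front ends are peered, a ring-wide boolean setting like the one referenced above could presumably be flipped with `hab config apply`. This is only a sketch; the `[api]` table and `feature_enabled` key are hypothetical placeholders, not actual builder-api settings.

```bash
# Hypothetical TOML key; the real setting is whatever builder-api exposes.
cat > enable_feature.toml <<'EOF'
[api]
feature_enabled = true
EOF

# Gossip version 1 of this configuration to every builder-api instance
# participating in the ring.
hab config apply builder-api.default 1 enable_feature.toml
```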
I think we can go ahead and add logic to our install script to peer the front end nodes. This absolutely makes sense for the memcached config, however I wonder if we should support the
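A rough sketch of the kind of install-script change being discussed, assuming hypothetical pass-through arguments; the only real flags here are the Supervisor's own `--peer` and `--permanent-peer` options on `hab sup run`.

```bash
#!/bin/bash
# Hypothetical argument handling for the on-prem install script; only
# --peer and --permanent-peer are real `hab sup run` options.
PEER_OPTS=""

while [ $# -gt 0 ]; do
  case "$1" in
    --peer)
      PEER_OPTS="$PEER_OPTS --peer $2"
      shift 2
      ;;
    --permanent-peer)
      PEER_OPTS="$PEER_OPTS --permanent-peer"
      shift
      ;;
    *)
      shift
      ;;
  esac
done

# Join (or seed) the front end ring when starting the Supervisor.
hab sup run $PEER_OPTS &
```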
We had a follow up discussion regarding the basic HA topology. We agreed to go ahead with enabling the
This is the basic topology of our recommended scaled out configuration. The load balancer will be provisioned by the customer. The minio and postgresql instances can either be installed via our
or
The frontend nodes will each host
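For illustration only, one way the back end pieces could be laid out under Habitat. The package idents below are assumptions based on the standard on-prem packages; an externally managed PostgreSQL or object storage endpoint is equally valid.

```bash
# Dedicated data node (or skip this and point builder at an external
# PostgreSQL / RDS instance instead).
hab svc load habitat/builder-datastore

# Dedicated object storage node (or an external minio/S3 endpoint).
hab svc load habitat/builder-minio
```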
I'm going to close this for now since it sounds like we have general consensus to move forward with this.
Author a high level design document for the Highly Available Builder pattern.
The research on the pattern has been done and documented in the issue below.
#1623