From 8faf9192ba902437e7adc6d6d415c35aff8128cd Mon Sep 17 00:00:00 2001 From: sj-williams Date: Mon, 11 Nov 2024 11:32:12 +0000 Subject: [PATCH 1/3] =?UTF-8?q?docs:=20=E2=9C=8F=EF=B8=8F=20fix=20broken?= =?UTF-8?q?=20links,=20some=20updates=20to=20out-of-date=20info?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Relates to [link-checker-report](https://github.com/ministryofjustice/cloud-platform/issues/6429) --- architecture-decision-record/022-EKS.md | 6 +++--- architecture-decision-record/023-Logging.md | 7 +++++-- architecture-decision-record/026-Managed-Prometheus.md | 6 ++++-- runbooks/source/leavers-guide.html.md.erb | 2 +- 4 files changed, 13 insertions(+), 8 deletions(-) diff --git a/architecture-decision-record/022-EKS.md b/architecture-decision-record/022-EKS.md index 94942929..e9c17a1d 100644 --- a/architecture-decision-record/022-EKS.md +++ b/architecture-decision-record/022-EKS.md @@ -1,6 +1,6 @@ # EKS -Date: 02/05/2021 +Date: 11/11/2024 ## Status @@ -32,7 +32,7 @@ We already run the Manager cluster on EKS, and have gained a lot of insight and Developers in service teams need to use the k8s auth, and GitHub continues to be the most common SSO amongst them with good tie-in to JML processes - see [ADR 6 Use GitHub as our identity provider](006-Use-github-as-user-directory.md) -Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/global-resources/resources/auth0-rules): +Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-terraform-global-resources-auth0): * it ensures that the user is in the ministryofjustice GitHub organization, so only staff can get a kubeconfig and login to CP websites like Grafana * it inserts the user's GitHub teams into the OIDC ID token as claims. These are used by k8s RBAC to authorize the user for the correct namespaces @@ -157,7 +157,7 @@ Advantages of AWS's CNI: * it is the default with EKS, native to AWS, is fully supported by AWS - low management overhead * offers good network performance -The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image. +The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image. The alternative considered was [Calico networking](https://docs.projectcalico.org/getting-started/kubernetes/managed-public-cloud/eks#install-eks-with-calico-networking). 
This has the advantage of not needing an IP address per pod, and associated instance limit. And it is open source. However: diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md index 1eb16149..64168926 100644 --- a/architecture-decision-record/023-Logging.md +++ b/architecture-decision-record/023-Logging.md @@ -1,6 +1,6 @@ # 23 Logging -Date: 02/06/2021 +Date: 11/11/2024 ## Status @@ -8,7 +8,10 @@ Date: 02/06/2021 ## Context -Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). This allows [service teams](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/access-logs.html#accessing-application-log-data) and Cloud Platform team to use Kibana's search and browse functionality, for the purpose of debug and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days. +> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). + +As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster). +Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days. Concerns with existing ElasticSearch logging: diff --git a/architecture-decision-record/026-Managed-Prometheus.md b/architecture-decision-record/026-Managed-Prometheus.md index face2c2a..45c0cfc0 100644 --- a/architecture-decision-record/026-Managed-Prometheus.md +++ b/architecture-decision-record/026-Managed-Prometheus.md @@ -1,6 +1,6 @@ # 26 Managed Prometheus -Date: 2021-10-08 +Date: 2024-11-11 ## Status @@ -67,7 +67,9 @@ We also need to address: **Sharding**: We could split/shard the Prometheus instance: perhaps dividing into two - tenants and platform. Or if we did multi-cluster we could have one Prometheus instance per cluster. This appears relatively straightforward to do. There would be concern that however we split it, as we scale in the future we'll hit future scaling thresholds, where it will be necessary to change how to divide it into shards, so a bit of planning would be needed. -**High Availability**: The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://prometheus-operator.dev/docs/operator/high-availability/#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. 
This doesn't feel like a very paved path, with Prometheus Operator [saying](https://prometheus-operator.dev/docs/operator/high-availability/) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since. +**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). Keeping the findings below as we may have some additional elements of HA to consider in the future: + +> [Source](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since. **Managed Prometheus**: Using a managed service of prometheus, such as AMP, would address most of these concerns, and is evaluated in detail in the next section. diff --git a/runbooks/source/leavers-guide.html.md.erb b/runbooks/source/leavers-guide.html.md.erb index cd8f3c1a..718188cd 100644 --- a/runbooks/source/leavers-guide.html.md.erb +++ b/runbooks/source/leavers-guide.html.md.erb @@ -70,7 +70,7 @@ Below are the list of 3rd party accounts that need to be removed when a member l 4. [Pagerduty](https://moj-digital-tools.pagerduty.com/users) -5. [DockerHub MoJ teams](https://cloud.docker.com/orgs/ministryofjustice/teams) +5. DockerHub MoJ teams 6. [Pingdom](https://www.pingdom.com) From 2bc873c14b5295a3e4f36c5cd1d58b9869d23f5f Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Mon, 11 Nov 2024 11:33:03 +0000 Subject: [PATCH 2/3] Commit changes made by code formatters --- architecture-decision-record/022-EKS.md | 123 +++++++++--------- architecture-decision-record/023-Logging.md | 2 +- .../026-Managed-Prometheus.md | 2 +- 3 files changed, 65 insertions(+), 62 deletions(-) diff --git a/architecture-decision-record/022-EKS.md b/architecture-decision-record/022-EKS.md index e9c17a1d..8480dc45 100644 --- a/architecture-decision-record/022-EKS.md +++ b/architecture-decision-record/022-EKS.md @@ -14,13 +14,13 @@ Use Amazon EKS for running the main cluster, which hosts MOJ service teams' appl Benefits of EKS: -* a managed control plane (master nodes), reducing operational overhead compared to kOps, such as scaling the control plane nodes. And reduces risk to k8s API availability, if there was a sudden increase in k8s API traffic. -* [managed nodes](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html), further reducing operational overhead -* Kubernetes upgrades are smoother: - * kOps rolling upgrades have been problematic. e.g. 
during 1.18 to 1.19 upgrade kOps caused us to have to [work around a networking issue](https://docs.google.com/document/d/1HzmTk0IvuW1XsXmVJEsSOzB4jkiMpjwZWlUwIF7P9Gc/edit) - * CP team sees kOps upgrades as particularly stressful, and 3rd on our risk register -* it opens the door to using [ELB for ingress](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html). Being managed, it is seen as preferable to self-managed nginx, which requires upgrades, scaling etc. -* avoid security challenge of managing tokens that are exported with `kops export kubeconfig` +- a managed control plane (master nodes), reducing operational overhead compared to kOps, such as scaling the control plane nodes. And reduces risk to k8s API availability, if there was a sudden increase in k8s API traffic. +- [managed nodes](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html), further reducing operational overhead +- Kubernetes upgrades are smoother: + - kOps rolling upgrades have been problematic. e.g. during 1.18 to 1.19 upgrade kOps caused us to have to [work around a networking issue](https://docs.google.com/document/d/1HzmTk0IvuW1XsXmVJEsSOzB4jkiMpjwZWlUwIF7P9Gc/edit) + - CP team sees kOps upgrades as particularly stressful, and 3rd on our risk register +- it opens the door to using [ELB for ingress](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html). Being managed, it is seen as preferable to self-managed nginx, which requires upgrades, scaling etc. +- avoid security challenge of managing tokens that are exported with `kops export kubeconfig` We already run the Manager cluster on EKS, and have gained a lot of insight and experience of using it. @@ -34,13 +34,13 @@ Developers in service teams need to use the k8s auth, and GitHub continues to be Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-terraform-global-resources-auth0): -* it ensures that the user is in the ministryofjustice GitHub organization, so only staff can get a kubeconfig and login to CP websites like Grafana -* it inserts the user's GitHub teams into the OIDC ID token as claims. These are used by k8s RBAC to authorize the user for the correct namespaces +- it ensures that the user is in the ministryofjustice GitHub organization, so only staff can get a kubeconfig and login to CP websites like Grafana +- it inserts the user's GitHub teams into the OIDC ID token as claims. These are used by k8s RBAC to authorize the user for the correct namespaces Future options: -* Azure AD SSO is growing in MOJ - there's a case for switching to that, if it is adopted amongst our users -* IAM auth has the benefit of immediately revoking access. Maybe we could use federated login with GitHub? (But would that give only temporary kubecfg?) Or sync the GitHub team info to IAM? +- Azure AD SSO is growing in MOJ - there's a case for switching to that, if it is adopted amongst our users +- IAM auth has the benefit of immediately revoking access. Maybe we could use federated login with GitHub? (But would that give only temporary kubecfg?) Or sync the GitHub team info to IAM? **Status**: Completed 2/6/21 [#2854](https://github.com/ministryofjustice/cloud-platform/issues/2854) @@ -52,12 +52,12 @@ We've long used Kuberos for issuing kubecfg credentials to users. 
The [original Other options considered: -* [Gangway](https://github.com/heptiolabs/gangway) - similar to Kuberos, it has not had releases for 2 years (v3.2.0) -* [kubelogin](https://github.com/int128/kubelogin) - * CP team would have to distribute the client secret to all users. It seems odd to go to the trouble of securely sharing that secret, to overcome the perceived difficulty of issuing kubecfg credentials. - * Requires all of our users to install the software, rather than doing it server-side centrally -* [kubehook](https://github.com/negz/kubehook) - not compatible with EKS - doesn't support web hook authn -* [dex](https://github.com/dexidp/dex) - doesn't have a web front-end for issuing creds - it is more of an OIDC broker +- [Gangway](https://github.com/heptiolabs/gangway) - similar to Kuberos, it has not had releases for 2 years (v3.2.0) +- [kubelogin](https://github.com/int128/kubelogin) + - CP team would have to distribute the client secret to all users. It seems odd to go to the trouble of securely sharing that secret, to overcome the perceived difficulty of issuing kubecfg credentials. + - Requires all of our users to install the software, rather than doing it server-side centrally +- [kubehook](https://github.com/negz/kubehook) - not compatible with EKS - doesn't support web hook authn +- [dex](https://github.com/dexidp/dex) - doesn't have a web front-end for issuing creds - it is more of an OIDC broker **Status:** Completed 24/6/21 [#1254](https://github.com/ministryofjustice/cloud-platform/issues/2945) @@ -75,9 +75,9 @@ We'll continue to use our existing RBAC configuration from the previous cluster. Options: -* Self-managed nodes -* Managed node groups - automates various aspects of the node lifecycle, including creating the EC2s, the auto scaling group, registration of nodes with kubernetes and recycling nodes -* Fargate nodes - fully automated nodes, the least to manage. Benefits from more isolation between pods and automatic scaling. Doesn't support daemonsets. +- Self-managed nodes +- Managed node groups - automates various aspects of the node lifecycle, including creating the EC2s, the auto scaling group, registration of nodes with kubernetes and recycling nodes +- Fargate nodes - fully automated nodes, the least to manage. Benefits from more isolation between pods and automatic scaling. Doesn't support daemonsets. We aim to take advantage of as much automation as possible, to minimize the team's operational overhead and risk. Initially we'll use managed node groups, before looking at Fargate for workloads. @@ -85,18 +85,18 @@ We aim to take advantage of as much automation as possible, to minimize the team #### Future Fargate considerations -*Pod limits* - there is a quota limit of [500 Fargate pods per region per AWS Account](https://aws.amazon.com/about-aws/whats-new/2020/09/aws-fargate-increases-default-resource-count-service-quotas/) which could be an issue, considering we currently run ~2000 pods. We can request AWS raise the limit - not currently sure what scope there is. With Multi-cluster stage 5, the separation of loads into different AWS accounts will settle this issue. +_Pod limits_ - there is a quota limit of [500 Fargate pods per region per AWS Account](https://aws.amazon.com/about-aws/whats-new/2020/09/aws-fargate-increases-default-resource-count-service-quotas/) which could be an issue, considering we currently run ~2000 pods. We can request AWS raise the limit - not currently sure what scope there is. 
With Multi-cluster stage 5, the separation of loads into different AWS accounts will settle this issue. -*Daemonset functionality* - needs replacement: +_Daemonset functionality_ - needs replacement: -* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch. -* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network +- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch. +- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network -*No EBS support* - Prometheus will run still in a managed node group. Likely other workloads too to consider. +_No EBS support_ - Prometheus will run still in a managed node group. Likely other workloads too to consider. -*how people check the status of their deployments* - to be investigated +_how people check the status of their deployments_ - to be investigated -*ingress can't be nginx? - just the load balancer in front* - to be investigated +_ingress can't be nginx? - just the load balancer in front_ - to be investigated If we don't use Fargate then we should take advantage of Spot instances for reduced costs. However Fargate is the priority, because the main driver here is engineer time, not EC2 cost. @@ -124,12 +124,12 @@ The choice of AWS CNI networking in the new cluster, initially added a constrain So the primary choice is: -* r5.xlarge - 4 vCPUs, 32GB memory +- r5.xlarge - 4 vCPUs, 32GB memory With fallbacks (should the cloud provider run out of these in the AZ): -* r5.2xlarge - 8 vCPUs, 64GB memory -* r5a.xlarge - 4 vCPUs, 32GB memory +- r5.2xlarge - 8 vCPUs, 64GB memory +- r5a.xlarge - 4 vCPUs, 32GB memory In the future we might consider the ARM processor ranges, but we'd need to consider the added complexity of cross-compiled container images. 
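For context on the pod-per-node ceiling that drives this instance choice, the sketch below works through the standard VPC CNI calculation. It is an illustration only: the ENI and IP-per-ENI figures are assumptions taken from AWS documentation at the time of writing and should be checked against the linked `eni-max-pods.txt`, and the formula changes if IP prefix delegation is enabled.

```python
# Illustrative sketch of the EKS pod limit per node under the AWS VPC CNI
# (without IP prefix delegation):
#   max_pods = max_enis * (ipv4_addresses_per_eni - 1) + 2
# The "+ 2" covers host-network pods such as aws-node and kube-proxy.
# ENI/IP figures below are assumptions; verify against eni-max-pods.txt.

INSTANCE_LIMITS = {
    # instance type: (max ENIs, IPv4 addresses per ENI)
    "r5.xlarge": (4, 15),
    "r5.2xlarge": (4, 15),
    "r5a.xlarge": (4, 15),
}


def max_pods(instance_type: str) -> int:
    max_enis, ips_per_eni = INSTANCE_LIMITS[instance_type]
    return max_enis * (ips_per_eni - 1) + 2


for instance_type in INSTANCE_LIMITS:
    print(f"{instance_type}: ~{max_pods(instance_type)} pods per node")
```

With these assumed figures each of the candidate types supports roughly 58 pods per node, which is the constraint the node-count and cost calculations above are working around; if IP prefix delegation is adopted (see the backlog item under Status below), this ceiling rises significantly.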
@@ -137,12 +137,13 @@ In the future we might consider the ARM processor ranges, but we'd need to consi The existing cluster uses r5.2xlarge, so we'll continue with that, and add some fall-backs: -* r5.2xlarge - memory optimized range - 8 cores 64 GB -* r4.2xlarge - memory optimized range - 8 cores 61 GB +- r5.2xlarge - memory optimized range - 8 cores 64 GB +- r4.2xlarge - memory optimized range - 8 cores 61 GB **Status:** -* 1/9/21 r5.xlarge is in place for main node group - a temporarily a high number of instances -* IP prefixes is in the backlog [#3086](https://github.com/ministryofjustice/cloud-platform/issues/3086) + +- 1/9/21 r5.xlarge is in place for main node group - a temporarily a high number of instances +- IP prefixes is in the backlog [#3086](https://github.com/ministryofjustice/cloud-platform/issues/3086) ### Pod networking (CNI) @@ -154,16 +155,16 @@ AWS's CNI is used for the pod networking (IPAM, CNI and Routing). Each pod is gi Advantages of AWS's CNI: -* it is the default with EKS, native to AWS, is fully supported by AWS - low management overhead -* offers good network performance +- it is the default with EKS, native to AWS, is fully supported by AWS - low management overhead +- offers good network performance The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image. The alternative considered was [Calico networking](https://docs.projectcalico.org/getting-started/kubernetes/managed-public-cloud/eks#install-eks-with-calico-networking). This has the advantage of not needing an IP address per pod, and associated instance limit. And it is open source. However: -* We wouldn't have any support from the cloud provider if there were networking issues. -* We have to maintain a customized image with Calico installed. It's likely that changes to EKS over time will frequently cause breakages with this networking setup. -* Installation requires recycling the nodes, which is not a good fit with declarative config. +- We wouldn't have any support from the cloud provider if there were networking issues. +- We have to maintain a customized image with Calico installed. It's likely that changes to EKS over time will frequently cause breakages with this networking setup. +- Installation requires recycling the nodes, which is not a good fit with declarative config. **Status**: Completed 2/6/21 [#2854](https://github.com/ministryofjustice/cloud-platform/issues/2854) @@ -187,13 +188,14 @@ Cluster auto-scaling should be considered soon though. This is to embrace one of Considerations for auto-scaler: -* we need to maintain spare capacity, so that workloads that scale up don't have to wait for nodes to start-up, which can take about 7 minutes. This may require some tuning. -* tenants should be encouraged to auto-scale their pods effectively (e.g. 
using the Horizontal pod autoscaler), to capitalize on cluster auto-scaling -* scaling down non-prod namespaces will need agreement from service teams +- we need to maintain spare capacity, so that workloads that scale up don't have to wait for nodes to start-up, which can take about 7 minutes. This may require some tuning. +- tenants should be encouraged to auto-scale their pods effectively (e.g. using the Horizontal pod autoscaler), to capitalize on cluster auto-scaling +- scaling down non-prod namespaces will need agreement from service teams **Status:** -* 18/8/21 Manual scaling in place [#3033](https://github.com/ministryofjustice/cloud-platform/issues/3033) -* 23/9/21 Auto-scaler is still desired + +- 18/8/21 Manual scaling in place [#3033](https://github.com/ministryofjustice/cloud-platform/issues/3033) +- 23/9/21 Auto-scaler is still desired ### Network policy enforcement @@ -231,29 +233,30 @@ The logs go to CloudWatch. Maybe we need to export them elsewhere. Further discu Links: -* https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/install-ssm-agent-on-amazon-eks-worker-nodes-by-using-kubernetes-daemonset.html -* https://github.com/aws/containers-roadmap/issues/593 +- https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/install-ssm-agent-on-amazon-eks-worker-nodes-by-using-kubernetes-daemonset.html +- https://github.com/aws/containers-roadmap/issues/593 AWS Systems Manager Session Manager benefits: -* easy to install - daemonset -* auth is via a team member's AWS creds, so it's tied into JML processes and access can be removed immediately if they leave the team, and 2FA is the norm -* terminal commands are logged - useful for audit purposes -* [it's an EKS best practice](https://aws.github.io/aws-eks-best-practices/security/docs/hosts/#minimize-access-to-worker-nodes) -* we can take advantage of other Systems Manager features in future, including diagnostic and compliance monitoring +- easy to install - daemonset +- auth is via a team member's AWS creds, so it's tied into JML processes and access can be removed immediately if they leave the team, and 2FA is the norm +- terminal commands are logged - useful for audit purposes +- [it's an EKS best practice](https://aws.github.io/aws-eks-best-practices/security/docs/hosts/#minimize-access-to-worker-nodes) +- we can take advantage of other Systems Manager features in future, including diagnostic and compliance monitoring To note: -* requires permissions `hostNetwork: true` and `privileged: true` so may need its own PSP -* it's no use if the node is failing to boot or join the cluster properly, but we can live with that - it's likely that it's the pods we want to characterize, not the node, because the node is managed +- requires permissions `hostNetwork: true` and `privileged: true` so may need its own PSP +- it's no use if the node is failing to boot or join the cluster properly, but we can live with that - it's likely that it's the pods we want to characterize, not the node, because the node is managed The traditional method of node access would be to SSH in via a bastion. This involves a shared ssh key, and shared credentials is not an acceptable security practice. 
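To make the access model above concrete, here is a minimal sketch of confirming that worker nodes have registered with Systems Manager using a team member's own AWS credentials. The use of boto3 and the `eu-west-2` region are assumptions for illustration, not part of the documented procedure:

```python
# Minimal sketch: list nodes registered with AWS Systems Manager before
# starting a Session Manager session. Assumes the caller's own AWS credentials
# (with ssm:DescribeInstanceInformation); "eu-west-2" is an assumed region.
import boto3

ssm = boto3.client("ssm", region_name="eu-west-2")

paginator = ssm.get_paginator("describe_instance_information")
for page in paginator.paginate():
    for instance in page["InstanceInformationList"]:
        # PingStatus of "Online" means the SSM agent daemonset has registered the node
        print(instance["InstanceId"], instance["PingStatus"], instance.get("ComputerName", ""))
```

An interactive shell on a node is then typically opened with `aws ssm start-session --target <instance-id>` (which requires the session-manager-plugin for the AWS CLI); the runbook linked under Status below covers the supported procedure.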
**Status** Completed 2/9/21 -* Implementation ticket: https://github.com/ministryofjustice/cloud-platform/issues/2962 -* Runbook for usage: https://runbooks.cloud-platform.service.justice.gov.uk/eks-node-terminal-access.html + +- Implementation ticket: https://github.com/ministryofjustice/cloud-platform/issues/2962 +- Runbook for usage: https://runbooks.cloud-platform.service.justice.gov.uk/eks-node-terminal-access.html ### PodSecurityPolicies @@ -287,11 +290,11 @@ PSPs are [deprecated](https://kubernetes.io/blog/2021/04/06/podsecuritypolicy-de Benefits of IRSA over kiam or kube2iam: -* kiam/kube2iam require running and managing a daemonset container. -* kiam/kube2iam require [powerful AWS credentials](https://github.com/jtblin/kube2iam#iam-roles), which allow EC2 boxes to assume any role. Appropriate configuration of kiam/kube2iam aims to provide containers with only a specific role. However there are security concerns with this approach: - * With kube2iam you have to remember to set a `--default-role` to use when annotation is not set on a pod. - * When a node boots, there may be a short window until kiam/kube2iam starts up, when there is no protection of the instance metadata. In comparison, IRSA injects the token into the pod, avoiding this concern. - * With kube2iam/kiam, an attacker able to get root on the node could access the credentials and therefore any AWS Role. In comparison, with IRSA a breach of k8s might only have bring access to the AWS Roles that are associated with k8s service roles. +- kiam/kube2iam require running and managing a daemonset container. +- kiam/kube2iam require [powerful AWS credentials](https://github.com/jtblin/kube2iam#iam-roles), which allow EC2 boxes to assume any role. Appropriate configuration of kiam/kube2iam aims to provide containers with only a specific role. However there are security concerns with this approach: + - With kube2iam you have to remember to set a `--default-role` to use when annotation is not set on a pod. + - When a node boots, there may be a short window until kiam/kube2iam starts up, when there is no protection of the instance metadata. In comparison, IRSA injects the token into the pod, avoiding this concern. + - With kube2iam/kiam, an attacker able to get root on the node could access the credentials and therefore any AWS Role. In comparison, with IRSA a breach of k8s might only have bring access to the AWS Roles that are associated with k8s service roles. #### Blocking access instance metadata diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md index 64168926..9bca70a0 100644 --- a/architecture-decision-record/023-Logging.md +++ b/architecture-decision-record/023-Logging.md @@ -8,7 +8,7 @@ Date: 11/11/2024 ## Context -> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). +> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster). Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. 
All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days. diff --git a/architecture-decision-record/026-Managed-Prometheus.md b/architecture-decision-record/026-Managed-Prometheus.md index 45c0cfc0..39b8d3a9 100644 --- a/architecture-decision-record/026-Managed-Prometheus.md +++ b/architecture-decision-record/026-Managed-Prometheus.md @@ -67,7 +67,7 @@ We also need to address: **Sharding**: We could split/shard the Prometheus instance: perhaps dividing into two - tenants and platform. Or if we did multi-cluster we could have one Prometheus instance per cluster. This appears relatively straightforward to do. There would be concern that however we split it, as we scale in the future we'll hit future scaling thresholds, where it will be necessary to change how to divide it into shards, so a bit of planning would be needed. -**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). Keeping the findings below as we may have some additional elements of HA to consider in the future: +**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). Keeping the findings below as we may have some additional elements of HA to consider in the future: > [Source](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since. From d5212b724ff92e4e07bc0fd508fba51e1f4b1cb6 Mon Sep 17 00:00:00 2001 From: Steve Williams <105657964+sj-williams@users.noreply.github.com> Date: Mon, 11 Nov 2024 11:38:19 +0000 Subject: [PATCH 3/3] Update 023-Logging.md --- architecture-decision-record/023-Logging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md index 9bca70a0..13f1f846 100644 --- a/architecture-decision-record/023-Logging.md +++ b/architecture-decision-record/023-Logging.md @@ -11,7 +11,7 @@ Date: 11/11/2024 > Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster). -Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. 
All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days.
+Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. All pods' stdout gets [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and stored for 30 days. The full lifecycle policy configuration for OpenSearch can be viewed [here](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/account/resources/opensearch/ism-policy.json.tpl).

Concerns with existing ElasticSearch logging: