Adding EKS-Addon Gpu test #411

Merged
merged 62 commits into from
Jun 13, 2024
Commits
53ffe75
adding e2e
Paramadon May 17, 2024
8523e45
adding if statments to go test so failure can occur
Paramadon May 20, 2024
3dcc4ca
fixing tests
Paramadon May 20, 2024
7d84ad6
adding security
Paramadon May 20, 2024
1c366e6
fixing test
Paramadon May 20, 2024
3dbcb3a
adding locals
Paramadon May 20, 2024
8c30284
making test
Paramadon May 20, 2024
ad3663d
adding to test
Paramadon May 21, 2024
031d072
fixing some issues
Paramadon May 21, 2024
b97c68c
fixing terrafom
Paramadon May 21, 2024
f8514ed
reverting back and adding kubectl commands
Paramadon May 22, 2024
e9ae147
fixing dupes
Paramadon May 22, 2024
db72428
adding test
Paramadon May 22, 2024
a3dcc91
fixing log error
Paramadon May 23, 2024
b35c91e
fixing test
Paramadon May 23, 2024
747b439
EksClusterValidationMap had wrong casing for ClusterDaemonSet
Paramadon May 23, 2024
7dd0669
tyring to fix logging
Paramadon May 23, 2024
399ede3
fixing some issues
Paramadon May 23, 2024
f4f64fe
adding other mocks to see if test works
Paramadon May 24, 2024
0ae4f56
fixing schema
Paramadon May 24, 2024
c8c5fea
fixing test
Paramadon May 24, 2024
e0c5902
fixing cluster deployment
Paramadon May 24, 2024
ba229fa
increasing size of instance
Paramadon May 24, 2024
b07a1a6
adding a sleep
Paramadon May 24, 2024
4469748
removing logs just go to the commit before if you want lags
Paramadon May 24, 2024
7d8b9db
Merge branch 'main' of github.com:aws/amazon-cloudwatch-agent-test in…
Paramadon May 28, 2024
f785e3f
resolving comments
Paramadon May 28, 2024
09f98ba
fixing lint
Paramadon May 28, 2024
ddc61ec
organized code
Paramadon Jun 6, 2024
78bf10e
fixing terraform
Paramadon Jun 6, 2024
133aca1
fixing terraform
Paramadon Jun 6, 2024
ee46157
removing change to ethtool
Paramadon Jun 6, 2024
07e2141
fixing aws cluster name
Paramadon Jun 6, 2024
28ea49e
fixing add_version
Paramadon Jun 6, 2024
2a28a1c
fixed file organization
Paramadon Jun 6, 2024
4fbdf0d
change cluster name
Paramadon Jun 6, 2024
a1eec84
adding to test generator
Paramadon Jun 6, 2024
05ce023
adding to generate test matrix
Paramadon Jun 6, 2024
b4c57ec
fixing test
Paramadon Jun 7, 2024
7e31f9b
update the agent image
Paramadon Jun 7, 2024
3951058
toml file
Paramadon Jun 7, 2024
ac98069
adding extra vars
Paramadon Jun 7, 2024
92a6217
correcting terraform dir
Paramadon Jun 7, 2024
6fb6a51
correcting terraform dir
Paramadon Jun 9, 2024
89d58e2
fixing dir
Paramadon Jun 9, 2024
3889710
removing patching of agent to see if that is the problem
Paramadon Jun 10, 2024
b2b5490
adding kubectl get pods -A
Paramadon Jun 10, 2024
c8aac72
adding test name
Paramadon Jun 10, 2024
5855c50
removing go test
Paramadon Jun 10, 2024
db41f67
Test works locally needed to add some metrics to dims
Paramadon Jun 10, 2024
dee4d19
Merge branch 'main' of github.com:aws/amazon-cloudwatch-agent-test in…
Paramadon Jun 10, 2024
29288f8
cleaning up unused vars in terraform (test works)
Paramadon Jun 11, 2024
dd4ee55
adding test_dir to fix yml
Paramadon Jun 11, 2024
e76ae00
adding sleep and log statements
Paramadon Jun 12, 2024
5ad174f
adding go test retry instead of retrying all of terraform
Paramadon Jun 12, 2024
74844bd
removing sleep
Paramadon Jun 12, 2024
63de98d
removing log lines
Paramadon Jun 12, 2024
82c2b9e
making sure to seperate integ test metrics with e2e metrics (prev com…
Paramadon Jun 13, 2024
d4440eb
adding retries to integ test to help it pass
Paramadon Jun 13, 2024
13ac63d
fixing test
Paramadon Jun 13, 2024
cbbab97
fixing test
Paramadon Jun 13, 2024
234bc29
test works
Paramadon Jun 13, 2024
10 changes: 10 additions & 0 deletions generator/resources/eks_addon_test_matrix.json
@@ -0,0 +1,10 @@
[
{
"k8s_version": "1.29",
"addon_name":"amazon-cloudwatch-observability",
"addon_version":"v1.6.0-eksbuild.1",
"ami_type": "AL2_x86_64_GPU",
"terraform_dir": "terraform/eks/addon/gpu",
"test_dir": "../../../../test/gpu"
}
]
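
As a rough illustration of how a matrix entry like the one above could be consumed, here is a minimal Go sketch that decodes the JSON into a struct. The type `addonMatrixEntry` and the function `parseAddonMatrix` are hypothetical names introduced for this example; the generator's actual types and loading code may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addonMatrixEntry is a hypothetical struct mirroring one entry of
// eks_addon_test_matrix.json. The real generator may use a different type.
type addonMatrixEntry struct {
	K8sVersion   string `json:"k8s_version"`
	AddonName    string `json:"addon_name"`
	AddonVersion string `json:"addon_version"`
	AmiType      string `json:"ami_type"`
	TerraformDir string `json:"terraform_dir"`
	TestDir      string `json:"test_dir"`
}

// sampleMatrix copies the single entry added in this PR.
const sampleMatrix = `[
  {
    "k8s_version": "1.29",
    "addon_name": "amazon-cloudwatch-observability",
    "addon_version": "v1.6.0-eksbuild.1",
    "ami_type": "AL2_x86_64_GPU",
    "terraform_dir": "terraform/eks/addon/gpu",
    "test_dir": "../../../../test/gpu"
  }
]`

// parseAddonMatrix decodes the raw JSON array into matrix entries.
func parseAddonMatrix(raw []byte) ([]addonMatrixEntry, error) {
	var entries []addonMatrixEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, fmt.Errorf("decoding addon matrix: %w", err)
	}
	return entries, nil
}

func main() {
	entries, err := parseAddonMatrix([]byte(sampleMatrix))
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %s on k8s %s (%s): terraform=%s test=%s\n",
			e.AddonName, e.AddonVersion, e.K8sVersion, e.AmiType,
			e.TerraformDir, e.TestDir)
	}
}
```

Note that `test_dir` is relative to `terraform_dir`, which is why it climbs four levels before entering `test/gpu`.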
14 changes: 10 additions & 4 deletions generator/test_case_generator.go
jefchien marked this conversation as resolved.
@@ -179,10 +179,17 @@ var testTypeToTestConfig = map[string][]testConfig{
 targets: map[string]map[string]struct{}{"metadataEnabled": {"enabled": {}}},
 },
 },
+"eks_addon": {
+{
+testDir: "../../../../test/gpu",
+terraformDir: "terraform/eks/addon/gpu",
+},
+},
 "eks_daemon": {
 {
-testDir: "./test/metric_value_benchmark",
-targets: map[string]map[string]struct{}{"arc": {"amd64": {}}},
+testDir: "./test/metric_value_benchmark",
+targets: map[string]map[string]struct{}{"arc": {"amd64": {}}},
+instanceType: "g4dn.xlarge",
Paramadon marked this conversation as resolved.
 },
 {
 testDir: "./test/metric_value_benchmark",
@@ -210,8 +217,7 @@ var testTypeToTestConfig = map[string][]testConfig{
 {testDir: "./test/fluent", terraformDir: "terraform/eks/daemon/fluent/windows/2022"},
 {
 testDir: "./test/gpu", terraformDir: "terraform/eks/daemon/gpu",
-targets: map[string]map[string]struct{}{"arc": {"amd64": {}}},
-instanceType: "g4dn.xlarge",
+targets: map[string]map[string]struct{}{"arc": {"amd64": {}}},
 },
 },
 "eks_deployment": {
29 changes: 29 additions & 0 deletions terraform/eks/addon/gpu/gpuBurner.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
kind: Deployment
apiVersion: apps/v1
metadata:
name: gpu-burn
namespace: amazon-cloudwatch
labels:
app: gpu-burn
spec:
replicas: 1
selector:
matchLabels:
app: gpu-burn
template:
metadata:
labels:
app: gpu-burn
spec:
containers:
- name: main
image: oguzpastirmaci/gpu-burn
imagePullPolicy: IfNotPresent
command:
- bash
- '-c'
- while true; do /app/gpu_burn 20; sleep 20; done
resources:
limits:
nvidia.com/gpu: 1

134 changes: 134 additions & 0 deletions terraform/eks/addon/gpu/main.tf
Contributor

This isn't using the current build of the agent image, so currently, it's more like a canary test instead of an integration test.

The other eks tests use the agent from the provided repo/tag https://github.com/aws/amazon-cloudwatch-agent-test/blob/main/terraform/eks/daemon/main.tf#L201. This needs to do that as well.

Contributor Author

@Paramadon Paramadon Jun 7, 2024

Made the change: I replaced the CWA image with the current build.

Contributor

Where is this being done?

@@ -0,0 +1,134 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

module "common" {
source = "../../../common"
}

module "basic_components" {
source = "../../../basic_components"
region = var.region
}


data "aws_eks_cluster_auth" "this" {
name = aws_eks_cluster.this.name
}

locals {
role_arn = format("%s%s", module.basic_components.role_arn, var.beta ? "-eks-beta" : "")
aws_eks = format("%s%s", "aws eks --region ${var.region}", var.beta ? " --endpoint ${var.beta_endpoint}" : "")
}

resource "aws_eks_cluster" "this" {
name = "cwagent-operator-eks-integ-${module.common.testing_id}"
role_arn = local.role_arn
version = var.k8s_version
enabled_cluster_log_types = [
"api",
"audit",
"authenticator",
"controllerManager",
"scheduler"
]
vpc_config {
subnet_ids = module.basic_components.public_subnet_ids
security_group_ids = [module.basic_components.security_group]
}
}

# EKS Node Groups
resource "aws_eks_node_group" "this" {
cluster_name = aws_eks_cluster.this.name
node_group_name = "cwagent-operator-eks-integ-node"
node_role_arn = aws_iam_role.node_role.arn
subnet_ids = module.basic_components.public_subnet_ids

scaling_config {
desired_size = 2
max_size = 2
min_size = 2
}

ami_type = "AL2_x86_64_GPU"
capacity_type = "ON_DEMAND"
disk_size = 20
instance_types = [var.instance_type]

depends_on = [
aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly,
aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.node_CloudWatchAgentServerPolicy
]
}

# EKS Node IAM Role
resource "aws_iam_role" "node_role" {
name = "cwagent-operator-eks-Worker-Role-${module.common.testing_id}"

assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKSWorkerNodePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.node_role.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKS_CNI_Policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.node_role.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEC2ContainerRegistryReadOnly" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.node_role.name
}

resource "aws_iam_role_policy_attachment" "node_CloudWatchAgentServerPolicy" {
policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
role = aws_iam_role.node_role.name
}


resource "null_resource" "kubectl" {
depends_on = [
aws_eks_cluster.this,
aws_eks_node_group.this
]
provisioner "local-exec" {
command = <<-EOT
${local.aws_eks} update-kubeconfig --name ${aws_eks_cluster.this.name}
${local.aws_eks} list-clusters --output text
${local.aws_eks} describe-cluster --name ${aws_eks_cluster.this.name} --output text
EOT
}
}

resource "aws_eks_addon" "this" {
depends_on = [
null_resource.kubectl
]
addon_name = var.addon_name
cluster_name = aws_eks_cluster.this.name
addon_version = var.addon_version
}
output "eks_cluster_name" {
value = aws_eks_cluster.this.name
}



20 changes: 20 additions & 0 deletions terraform/eks/addon/gpu/providers.tf
@@ -0,0 +1,20 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

provider "aws" {
region = var.region
endpoints {
eks = var.beta ? var.beta_endpoint : null
}
}

provider "kubernetes" {
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", aws_eks_cluster.this.name]
}
host = aws_eks_cluster.this.endpoint
cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority.0.data)
token = data.aws_eks_cluster_auth.this.token
}
47 changes: 47 additions & 0 deletions terraform/eks/addon/gpu/variables.tf
@@ -0,0 +1,47 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

variable "region" {
type = string
default = "us-west-2"
}

variable "test_dir" {
type = string
default = "../../../../test/gpu"
}

variable "addon_name" {
type = string
default = "amazon-cloudwatch-observability"
}

variable "addon_version" {
type = string
default = "v1.6.0-eksbuild.1"
}

variable "k8s_version" {
type = string
default = "1.29"
}

variable "ami_type" {
type = string
default = "AL2_x86_64_GPU"
}

variable "instance_type" {
type = string
default = "g4dn.xlarge"
}

variable "beta" {
type = bool
default = true
}

variable "beta_endpoint" {
type = string
default = "https://api.beta.us-west-2.wesley.amazonaws.com"
}
13 changes: 9 additions & 4 deletions terraform/eks/daemon/gpu/main.tf
@@ -701,19 +701,24 @@ resource "kubernetes_cluster_role_binding" "rolebinding" {
namespace = "amazon-cloudwatch"
}
}

resource "null_resource" "validator" {
depends_on = [
aws_eks_node_group.this,
kubernetes_daemonset.service,
kubernetes_cluster_role_binding.rolebinding,
kubernetes_service_account.cwagentservice,
]

provisioner "local-exec" {
command = <<-EOT
echo "Validating EKS metrics/logs for EMF"
cd ../../../..
go test ${var.test_dir} -eksClusterName=${aws_eks_cluster.this.name} -computeType=EKS -v -eksDeploymentStrategy=DAEMON -eksGpuType=nvidia
i=0
while [ $i -lt 10 ]; do
Contributor

@movence movence Jun 13, 2024

Why 10? I'm curious how long each iteration will take. If there is an issue with a new build or something similar, this loop will delay integ test execution. Should the test itself retry x times (probably fewer than 10) with some sleep in between?

Contributor Author

I can reduce it to 5, but Terraform already has a timeout of 1 hour, and since each test takes around 3 minutes, this adds up to 30 minutes plus 10-15 minutes for Terraform. So it's about 5 minutes less than the timeout. I think 10 is fine.

i=$((i+1))
go test ${var.test_dir} -eksClusterName=${aws_eks_cluster.this.name} -computeType=EKS -v -eksDeploymentStrategy=DAEMON -eksGpuType=nvidia && exit 0
sleep 60
done
exit 1
jefchien marked this conversation as resolved.
EOT
}
}
}