Adding EKS-Addon Gpu test #411

Merged
merged 62 commits into from
Jun 13, 2024

Conversation

@Paramadon Paramadon commented May 24, 2024

Description of the issue

Adds a GPU e2e test that uses Terraform to create the infrastructure and install the EKS add-on, which supplies the DCGM exporter for the metrics.

Description of changes

Here are some images of the metrics

Screenshot 2024-05-24 at 3 00 40 PM

And these are the pods that are being run:
Screenshot 2024-05-24 at 3 01 08 PM

Passing test:
Screenshot 2024-05-24 at 4 08 00 PM

Link to test: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/9500991317/job/26185580186

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Describe what tests you have done.

@Paramadon Paramadon requested a review from a team as a code owner May 24, 2024 20:15
cd ../../../..
go test ${var.test_dir} -eksClusterName=${aws_eks_cluster.this.name} -computeType=EKS -v -eksDeploymentStrategy=DAEMON -eksGpuType=nvidia
i=0
while [ $i -lt 10 ]; do

@movence movence Jun 13, 2024

Why 10? I'm curious how long each iteration will take. If there is an issue with a new build or something, this test will delay integ test execution time. Should the test itself iterate x times (probably <10) with some sleep in between?

Contributor Author

I can reduce it to 5, but the Terraform run already has a timeout of 1 hour, and since each test takes around 3 minutes, this adds up to 30 minutes plus 10 to 15 minutes for Terraform. So it's about 5 minutes less than the timeout. I think 10 is fine.

@@ -83,6 +90,45 @@ var expectedDimsToMetrics = map[string][]string{
},
}

var expectedDimsToMetricsE2E = map[string][]string{
Contributor

Is this different from expectedDimsToMetricsIntegTest? Why?

Contributor Author

The integ test does not use the current build of the agent; it uses the public image. So metrics like pod_gpu_request, pod_gpu_total, pod_gpu_limit, ... aren't being accounted for in the integ test, but they are in the E2E test because we are using the built image of CWA.
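
To illustrate the relationship described above, here is a minimal sketch of how an E2E expectation could be expressed as the integ expectation plus the extra metrics. The PR itself defines `expectedDimsToMetricsE2E` as its own map, so this is only an illustration: `withExtraMetrics`, the dimension key `example-dimension-set`, and the metric `example_public_image_metric` are hypothetical; only `pod_gpu_request`, `pod_gpu_total`, and `pod_gpu_limit` come from the discussion.

```go
package main

import "fmt"

// withExtraMetrics returns a new map containing every entry of base, with the
// metrics from extra appended per dimension key. Neither input is modified.
// Hypothetical helper, not part of the repository.
func withExtraMetrics(base, extra map[string][]string) map[string][]string {
	out := make(map[string][]string, len(base))
	for dims, metrics := range base {
		// Copy the slice so appends below never touch base's backing array.
		out[dims] = append([]string(nil), metrics...)
	}
	for dims, metrics := range extra {
		out[dims] = append(out[dims], metrics...)
	}
	return out
}

func main() {
	// Placeholder dimension key; the real keys are not shown in this thread.
	const exampleDims = "example-dimension-set"

	// What the public image is assumed to emit (placeholder metric name).
	integExpected := map[string][]string{
		exampleDims: {"example_public_image_metric"},
	}

	// Metrics only emitted by the agent image built for the E2E test.
	e2eOnly := map[string][]string{
		exampleDims: {"pod_gpu_request", "pod_gpu_total", "pod_gpu_limit"},
	}

	e2eExpected := withExtraMetrics(integExpected, e2eOnly)
	fmt.Println(len(integExpected[exampleDims]), len(e2eExpected[exampleDims])) // prints "1 4"
}
```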

@Paramadon Paramadon changed the title Gpu e2e test Adding Eks Addon Gpu test Jun 13, 2024
@Paramadon Paramadon changed the title Adding Eks Addon Gpu test Adding EKS-Addon Gpu test Jun 13, 2024
@Paramadon Paramadon merged commit 2cd967b into main Jun 13, 2024
2 checks passed
@Paramadon Paramadon deleted the gpuE2eTest branch June 13, 2024 21:04
@Paramadon Paramadon restored the gpuE2eTest branch July 2, 2024 15:34