Adding EKS-Addon Gpu test #411
test/metric_value_benchmark/eks_resources/test_schemas/cluster_daemonset.json
cd ../../../..
go test ${var.test_dir} -eksClusterName=${aws_eks_cluster.this.name} -computeType=EKS -v -eksDeploymentStrategy=DAEMON -eksGpuType=nvidia
i=0
while [ $i -lt 10 ]; do
Why 10? I'm curious how long each iteration will take. If there is an issue with a new build or something similar, this loop will delay the integ test execution. Should the test itself iterate x times (probably fewer than 10) with some sleep in between?
I can reduce it to 5, but the Terraform run already has a timeout of 1 hour, and since each test takes around 3 minutes, 10 iterations add up to 30 minutes, plus 10 to 15 minutes for Terraform. So it's about 5 minutes less than the timeout. I think 10 is fine.
@@ -83,6 +90,45 @@ var expectedDimsToMetrics = map[string][]string{
	},
}

var expectedDimsToMetricsE2E = map[string][]string{
Is this different from expectedDimsToMetricsIntegTest? Why?
The integ test does not use the current build of the agent; it uses the public image. So metrics like pod_gpu_request, pod_gpu_total, pod_gpu_limit, ... aren't being accounted for in the integ test, but they are in the E2E test because it uses the freshly built CWA image.
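To illustrate the relationship described in this reply, the sketch below derives an E2E expectation map from an integ-test map by appending the extra GPU metrics. It is only an assumption-laden illustration: the PR itself declares `expectedDimsToMetricsE2E` as a separate literal, `withExtraMetrics` is a hypothetical helper, and the dimension key `ClusterName` and the base metric `example_gpu_metric` are placeholders rather than values taken from the test. Only the three `pod_gpu_*` names come from the comment above.

```go
package main

import "fmt"

// withExtraMetrics (hypothetical helper) returns a copy of base with the
// metric names in extra appended under each matching dimension key.
// Neither input map is modified.
func withExtraMetrics(base, extra map[string][]string) map[string][]string {
	out := make(map[string][]string, len(base))
	for dims, metrics := range base {
		// Copy the slice so that appends below cannot alias base.
		out[dims] = append([]string(nil), metrics...)
	}
	for dims, metrics := range extra {
		out[dims] = append(out[dims], metrics...)
	}
	return out
}

func main() {
	// Placeholder dimension key and metric name, for illustration only.
	integ := map[string][]string{
		"ClusterName": {"example_gpu_metric"},
	}
	// Metrics named in the reply as available only with the built agent image.
	e2eOnly := map[string][]string{
		"ClusterName": {"pod_gpu_request", "pod_gpu_total", "pod_gpu_limit"},
	}

	e2e := withExtraMetrics(integ, e2eOnly)
	fmt.Println(len(integ["ClusterName"]), len(e2e["ClusterName"]))
}
```

Keeping two explicit literals, as the PR does, is easier to read in a diff; deriving one map from the other mainly helps if the shared portion changes often.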
Description of the issue
Adds a GPU E2E test that uses Terraform to create the infrastructure and the EKS add-on, which provides the DCGM exporter for the metrics.
Description of changes
Here are some images of the metrics
And these are the pods that are being run
Passing test:
Link to test: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/9500991317/job/26185580186
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Describe what tests you have done.