Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 add benchmark pipeline #1651

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

jianzhangbjz
Copy link

@jianzhangbjz jianzhangbjz commented Jan 27, 2025

Description

See more: #920

  • Serial run on AWS, Cluster version is 4.18.0-0.nightly-2025-01-25-163410
jiazha-mac:e2e jiazha$ go test -run=^$ -bench=. -count=10 -memprofile=mem.out -cpuprofile=cpu.out
goos: darwin
goarch: arm64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: Apple M1 Pro
BenchmarkCreateClusterCatalog-10    	       1	1444716583 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 619736229 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 594375416 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 599388104 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 578713375 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 604820354 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 614665062 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 613025938 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 622365104 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 591780896 ns/op
PASS
ok  	github.com/operator-framework/operator-controller/test/e2e	18.360s
jiazha-mac:e2e jiazha$ go tool pprof mem.out 
File: e2e.test
Type: alloc_space
Time: Jan 27, 2025 at 2:00pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 6893.09kB, 72.92% of 9453.29kB total
Showing top 10 nodes out of 93
      flat  flat%   sum%        cum   cum%
 1762.94kB 18.65% 18.65%  1762.94kB 18.65%  runtime/pprof.StartCPUProfile
  902.59kB  9.55% 28.20%  1485.59kB 15.72%  compress/flate.NewWriter
  583.01kB  6.17% 34.36%   583.01kB  6.17%  compress/flate.newDeflateFast (inline)
  548.84kB  5.81% 40.17%  1573.29kB 16.64%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
  532.26kB  5.63% 45.80%   532.26kB  5.63%  github.com/gogo/protobuf/proto.RegisterType
  513.50kB  5.43% 51.23%   513.50kB  5.43%  k8s.io/apimachinery/pkg/conversion.ConversionFuncs.AddUntyped
  512.75kB  5.42% 56.66%   512.75kB  5.42%  vendor/golang.org/x/crypto/cryptobyte.(*Builder).add
  512.62kB  5.42% 62.08%   512.62kB  5.42%  k8s.io/api/apps/v1beta2.addKnownTypes
  512.44kB  5.42% 67.50%   512.44kB  5.42%  sync.(*Map).dirtyLocked
  512.14kB  5.42% 72.92%   512.14kB  5.42%  k8s.io/api/resource/v1alpha3.init
(pprof) exit
jiazha-mac:e2e jiazha$ go tool pprof cpu.out 
File: e2e.test
Type: cpu
Time: Jan 27, 2025 at 2:00pm (CST)
Duration: 17.83s, Total samples = 160ms (  0.9%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 160ms, 100% of 160ms total
Showing top 10 nodes out of 93
      flat  flat%   sum%        cum   cum%
      40ms 25.00% 25.00%       40ms 25.00%  runtime.pthread_cond_signal
      40ms 25.00% 50.00%       40ms 25.00%  runtime.scanobject
      40ms 25.00% 75.00%       40ms 25.00%  syscall.syscall
      10ms  6.25% 81.25%       10ms  6.25%  crypto/internal/edwards25519/field.addMul64 (inline)
      10ms  6.25% 87.50%       10ms  6.25%  k8s.io/apimachinery/pkg/runtime.(*clientNegotiator).Decoder
      10ms  6.25% 93.75%       10ms  6.25%  runtime.pthread_kill
      10ms  6.25%   100%       10ms  6.25%  runtime.usleep
         0     0%   100%       30ms 18.75%  bufio.(*Writer).Flush
         0     0%   100%       10ms  6.25%  crypto/ecdh.(*PrivateKey).PublicKey
         0     0%   100%       10ms  6.25%  crypto/ecdh.(*PrivateKey).PublicKey.func1
(pprof) exit
  • parallel run on AWS, Cluster version is 4.18.0-0.nightly-2025-01-25-163410
jiazha-mac:e2e jiazha$ go test -run=^$ -bench=. -count=10 -memprofile=mem.out -cpuprofile=cpu.out
goos: darwin
goarch: arm64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: Apple M1 Pro
BenchmarkCreateClusterCatalog-10    	       1	1559796167 ns/op
BenchmarkCreateClusterCatalog-10    	      12	 105038868 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  88542141 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  94152035 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  93457205 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  94673955 ns/op
BenchmarkCreateClusterCatalog-10    	      20	  62803019 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  87578115 ns/op
BenchmarkCreateClusterCatalog-10    	      12	 107728125 ns/op
BenchmarkCreateClusterCatalog-10    	      12	  98580924 ns/op
PASS
ok  	github.com/operator-framework/operator-controller/test/e2e	38.984s
jiazha-mac:e2e jiazha$ go tool pprof cpu.out 
File: e2e.test
Type: cpu
Time: Jan 27, 2025 at 2:09pm (CST)
Duration: 38.06s, Total samples = 570ms ( 1.50%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 410ms, 71.93% of 570ms total
Showing top 10 nodes out of 201
      flat  flat%   sum%        cum   cum%
      70ms 12.28% 12.28%       70ms 12.28%  runtime.kevent
      70ms 12.28% 24.56%       70ms 12.28%  runtime.pthread_cond_signal
      60ms 10.53% 35.09%       60ms 10.53%  runtime.pthread_cond_wait
      60ms 10.53% 45.61%       60ms 10.53%  syscall.syscall
      50ms  8.77% 54.39%       50ms  8.77%  runtime.pthread_kill
      30ms  5.26% 59.65%       30ms  5.26%  runtime.madvise
      30ms  5.26% 64.91%       30ms  5.26%  runtime.pthread_cond_timedwait_relative_np
      20ms  3.51% 68.42%       20ms  3.51%  runtime.(*mspan).writeHeapBitsSmall
      10ms  1.75% 70.18%       10ms  1.75%  crypto/internal/bigmod.(*Nat).reset
      10ms  1.75% 71.93%       10ms  1.75%  k8s.io/apimachinery/pkg/runtime.setTargetKind
(pprof) exit
jiazha-mac:e2e jiazha$ go tool pprof mem.out 
File: e2e.test
Type: alloc_space
Time: Jan 27, 2025 at 2:09pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 10718.54kB, 61.67% of 17380.55kB total
Showing top 10 nodes out of 166
      flat  flat%   sum%        cum   cum%
 2048.12kB 11.78% 11.78%  2048.12kB 11.78%  path.Join
 1762.94kB 10.14% 21.93%  1762.94kB 10.14%  runtime/pprof.StartCPUProfile
 1536.56kB  8.84% 30.77%  1536.56kB  8.84%  golang.org/x/net/http2.(*ClientConn).roundTrip
 1065.48kB  6.13% 36.90%  2101.58kB 12.09%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
 1025.38kB  5.90% 42.80%  1025.38kB  5.90%  sync.(*Pool).pinSlow
 1024.14kB  5.89% 48.69%  1536.17kB  8.84%  k8s.io/client-go/rest.(*Request).URL
  650.62kB  3.74% 52.43%   650.62kB  3.74%  compress/flate.(*compressor).init
  553.04kB  3.18% 55.62%   553.04kB  3.18%  github.com/gogo/protobuf/proto.RegisterType
  528.17kB  3.04% 58.65%   528.17kB  3.04%  regexp.(*bitState).reset
  524.09kB  3.02% 61.67%   524.09kB  3.02%  k8s.io/apimachinery/pkg/conversion.ConversionFuncs.AddUntyped
(pprof) exit

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@jianzhangbjz jianzhangbjz requested a review from a team as a code owner January 27, 2025 06:21
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
Copy link

netlify bot commented Jan 27, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 17c9d51
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/67adb658357e960008fa1987
😎 Deploy Preview https://deploy-preview-1651--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

codecov bot commented Jan 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.45%. Comparing base (becde51) to head (17c9d51).
Report is 14 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1651   +/-   ##
=======================================
  Coverage   67.45%   67.45%           
=======================================
  Files          61       61           
  Lines        5245     5245           
=======================================
  Hits         3538     3538           
  Misses       1446     1446           
  Partials      261      261           
Flag Coverage Δ
e2e 52.07% <ø> (-0.08%) ⬇️
unit 55.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jianzhangbjz
Copy link
Author

jianzhangbjz commented Jan 27, 2025

  • parallel run on IBMCloud, Cluster version is 4.18.0-0.nightly-2025-01-25-163410
jiazha-mac:e2e jiazha$ go test -run=^$ -bench=. -count=10 -memprofile=mem.out -cpuprofile=cpu.out
goos: darwin
goarch: arm64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: Apple M1 Pro
BenchmarkCreateClusterCatalog-10    	       1	2093042334 ns/op
BenchmarkCreateClusterCatalog-10    	       4	 611432146 ns/op
BenchmarkCreateClusterCatalog-10    	      10	 224809304 ns/op
BenchmarkCreateClusterCatalog-10    	      13	  92630189 ns/op
BenchmarkCreateClusterCatalog-10    	       6	 174559444 ns/op
BenchmarkCreateClusterCatalog-10    	      12	 107088038 ns/op
BenchmarkCreateClusterCatalog-10    	       1	1003581583 ns/op
BenchmarkCreateClusterCatalog-10    	       3	 379469264 ns/op
BenchmarkCreateClusterCatalog-10    	       2	 606936271 ns/op
BenchmarkCreateClusterCatalog-10    	       1	2773485917 ns/op
PASS
ok  	github.com/operator-framework/operator-controller/test/e2e	34.115s
jiazha-mac:e2e jiazha$ go tool pprof mem.out 
File: e2e.test
Type: alloc_space
Time: Jan 27, 2025 at 5:22pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 8433.33kB, 62.21% of 13555.35kB total
Showing top 10 nodes out of 126
      flat  flat%   sum%        cum   cum%
 1536.16kB 11.33% 11.33%  1536.16kB 11.33%  golang.org/x/net/http2.(*ClientConn).roundTrip
 1536.09kB 11.33% 22.66%  1536.09kB 11.33%  path.Join
 1184.27kB  8.74% 31.40%  1184.27kB  8.74%  runtime/pprof.StartCPUProfile
  902.59kB  6.66% 38.06%  1553.21kB 11.46%  compress/flate.NewWriter
  650.62kB  4.80% 42.86%   650.62kB  4.80%  compress/flate.(*compressor).init
  553.04kB  4.08% 46.94%   553.04kB  4.08%  github.com/gogo/protobuf/proto.RegisterType
  528.17kB  3.90% 50.84%   528.17kB  3.90%  regexp.(*bitState).reset
  516.01kB  3.81% 54.64%   516.01kB  3.81%  google.golang.org/protobuf/internal/filedesc.(*File).initDecls
  513.69kB  3.79% 58.43%   513.69kB  3.79%  regexp.mergeRuneSets.func2
  512.69kB  3.78% 62.21%   512.69kB  3.78%  regexp/syntax.(*compiler).inst
(pprof) exit
jiazha-mac:e2e jiazha$ go tool pprof cpu.out 
File: e2e.test
Type: cpu
Time: Jan 27, 2025 at 5:22pm (CST)
Duration: 33.19s, Total samples = 380ms ( 1.14%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 310ms, 81.58% of 380ms total
Showing top 10 nodes out of 154
      flat  flat%   sum%        cum   cum%
      90ms 23.68% 23.68%       90ms 23.68%  runtime.kevent
      50ms 13.16% 36.84%       50ms 13.16%  runtime.pthread_cond_signal
      50ms 13.16% 50.00%       50ms 13.16%  syscall.syscall
      40ms 10.53% 60.53%       40ms 10.53%  runtime.pthread_cond_wait
      30ms  7.89% 68.42%       60ms 15.79%  runtime.scanobject
      10ms  2.63% 71.05%       20ms  5.26%  k8s.io/client-go/rest.(*Request).tryThrottleWithInfo
      10ms  2.63% 73.68%       10ms  2.63%  runtime.(*itabTableType).find
      10ms  2.63% 76.32%       10ms  2.63%  runtime.(*mheap).allocSpan
      10ms  2.63% 78.95%       10ms  2.63%  runtime.(*mspan).heapBitsSmallForAddr
      10ms  2.63% 81.58%       10ms  2.63%  runtime.(*unwinder).resolveInternal
(pprof) exit

}

// GetRandomString generates a random string of the given length
func getRandomString(length int) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k8s.io/apimachinery has a rand.String function that can be used instead of us maintaining a separate implementation. https://pkg.go.dev/k8s.io/apimachinery/pkg/util/rand#String

"time"
)

func BenchmarkCreateClusterCatalog(b *testing.B) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this benchmark tells us a whole lot because it is mostly measuring:

  • client CPU/memory used to submit a CREATE request to the apiserver
  • client CPU/memory used to submit a DELETE request to the apiserver

But it notably isn't able to account for the resources or time spent by the controller to reconcile the ClusterCatalog, which I think is what we actually want to measure.

But regardless, in my mind the biggest open question in is NOT "what machinery do we use to take measurements?". The real question is "where/how do we store/retrieve historical measurements that we can use as a baseline for comparison?"

Copy link
Author

@jianzhangbjz jianzhangbjz Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real question is "where/how do we store/retrieve historical measurements that we can use as a baseline for comparison?"

The test results will be stored on the GCS bucket if we run these test cases in Prow CI. And then, we can use Big Query to retrieve historical measurements. Similar to Sippy.
I think it's a little bit complex. How about storing the baseline in benchmarks/baseline.txt? The initial benchmark results as the baseline? Once a new version is released, update it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it notably isn't able to account for the resources or time spent by the controller to reconcile the ClusterCatalog, which I think is what we actually want to measure.

Yes, however, I was trying to cover the "real world" case instead of the unit test. Do we need to benchmark some functions? Similar to the unit test?

Copy link
Member

@joelanford joelanford Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about storing the baseline in benchmarks/baseline.txt?

Maybe? But if, for example, we wanted to build a histogram of response times of the catalogd web server over multiple commits and/or CI runs in order to get a smoothed out baseline, that seems hard if a file is committed to the repo because we'd have to update it often.

Maybe another option would be to make use of github's upload-artifact/download-artifact actions? If an artifact can be uploaded/download and shared between GH actions runs, then maybe we could do something like:

  1. download a snapshotted prometheus DB
  2. run prometheus with the DB, and configure scarping with a short interval to pull metrics from our components into the DB
  3. verify via promql that we haven't broken our thresholds, if we have fail the run
  4. snapshot the updated prometheus DB
  5. upload the snapshot
  6. success!

Copy link
Author

@jianzhangbjz jianzhangbjz Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! If so, we don't need to use the benchstat, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to use the Prometheus.

@jianzhangbjz
Copy link
Author

Test on 4.19.0-0.nightly-2025-02-04-230011, GCP.

jiazha-mac:e2e jiazha$ export CATALOG_IMG=registry.redhat.io/redhat/redhat-operator-index:v4.18
jiazha-mac:e2e jiazha$ go test -run=^$ -bench=. -count=10 -memprofile=mem.out -cpuprofile=cpu.out
goos: darwin
goarch: arm64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: Apple M1 Pro
BenchmarkCreateClusterCatalog-10    	       1	1274109792 ns/op
BenchmarkCreateClusterCatalog-10    	      18	  85367590 ns/op
BenchmarkCreateClusterCatalog-10    	      21	  98587806 ns/op
BenchmarkCreateClusterCatalog-10    	      20	  80611660 ns/op
BenchmarkCreateClusterCatalog-10    	      15	  95789072 ns/op
BenchmarkCreateClusterCatalog-10    	      15	  87705625 ns/op
BenchmarkCreateClusterCatalog-10    	      14	 186946500 ns/op
BenchmarkCreateClusterCatalog-10    	      12	 166680833 ns/op
BenchmarkCreateClusterCatalog-10    	       9	 177902880 ns/op
BenchmarkCreateClusterCatalog-10    	       4	 447856864 ns/op
PASS
ok  	github.com/operator-framework/operator-controller/test/e2e	38.205s

- name: Compare with baseline
run: |
go install golang.org/x/perf/cmd/benchstat@latest
benchstat benchmarks/baseline.txt new.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

benchstat is nice to get comparisons, but we'd have to be careful to somehow generate and commit a baseline from GH CI's system to avoid issues like "Jian's laptop is beefier than Joe's, and both are beefier than a GHA VM"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah

@joelanford
Copy link
Member

I really think we need to get a Brief and/or RFC written up to get agreement and consensus on the approaches before we merge stuff. But I also think the kind of prototyping and brainstorming that you're doing here and that @OchiengEd did will be necessary to get to the point that we are confident in our approach.

@jianzhangbjz jianzhangbjz force-pushed the benchmark branch 3 times, most recently from 3cde584 to e0e379d Compare February 8, 2025 03:55
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2025
@jianzhangbjz jianzhangbjz force-pushed the benchmark branch 6 times, most recently from dd32769 to 8b1afa9 Compare February 8, 2025 08:10
@jianzhangbjz
Copy link
Author

I really think we need to get a Brief and/or RFC written up to get agreement and consensus on the approaches before we merge stuff.

A Brief is drafting here: [WIP]Brief: Benchmarking test OLMv1

@jianzhangbjz jianzhangbjz force-pushed the benchmark branch 2 times, most recently from ff8faec to 08a24d1 Compare February 8, 2025 10:06
@jianzhangbjz
Copy link
Author

Benchmark test pass: https://github.com/operator-framework/operator-controller/actions/runs/13214623057/job/36892420269?pr=1651
download-artifact:

jiazha-mac:~ jiazha$ tree Downloads/benchmark-artifacts/
Downloads/benchmark-artifacts/
├── new.txt
└── output

1 directory, 2 files
jiazha-mac:~ jiazha$ cat Downloads/benchmark-artifacts/new.txt 
goos: linux
goarch: amd64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkCreateClusterCatalog
BenchmarkCreateClusterCatalog-4   	      81	  82695425 ns/op	   36570 B/op	     397 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99872913 ns/op	   37266 B/op	     404 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99852972 ns/op	   37327 B/op	     402 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99882173 ns/op	   37409 B/op	     405 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	 100024048 ns/op	   37350 B/op	     405 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	 100098746 ns/op	   37568 B/op	     406 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	 100037742 ns/op	   38134 B/op	     405 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99984867 ns/op	   37121 B/op	     403 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99855796 ns/op	   38886 B/op	     406 allocs/op
BenchmarkCreateClusterCatalog-4   	      12	  99946190 ns/op	   38851 B/op	     404 allocs/op
PASS
ok  	github.com/operator-framework/operator-controller/test/e2e	20.427s
jiazha-mac:~ jiazha$ cat Downloads/benchmark-artifacts/output 
goos: darwin
goarch: arm64
pkg: github.com/operator-framework/operator-controller/test/e2e
cpu: Apple M1 Pro
                        │ benchmarks/baseline.txt │
                        │         sec/op          │
CreateClusterCatalog-10              85.11m ± 16%

                        │ benchmarks/baseline.txt │
                        │          B/op           │
CreateClusterCatalog-10              36.21Ki ± 6%

                        │ benchmarks/baseline.txt │
                        │        allocs/op        │
CreateClusterCatalog-10                394.5 ± 1%

goos: linux
goarch: amd64
cpu: AMD EPYC 7763 64-Core Processor                
                       │ /tmp/artifacts/new.txt │
                       │         sec/op         │
CreateClusterCatalog-4              99.91m ± 0%

                       │ /tmp/artifacts/new.txt │
                       │          B/op          │
CreateClusterCatalog-4             36.50Ki ± 4%

                       │ /tmp/artifacts/new.txt │
                       │       allocs/op        │
CreateClusterCatalog-4               404.5 ± 1%

@jianzhangbjz jianzhangbjz changed the title [WIP] benchmark createClusterCatalog func 🌱 [WIP] benchmark createClusterCatalog func Feb 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2025
@jianzhangbjz jianzhangbjz force-pushed the benchmark branch 2 times, most recently from 521a49f to 9305638 Compare February 10, 2025 10:01
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@jianzhangbjz jianzhangbjz changed the title 🌱 [WIP] benchmark createClusterCatalog func 🌱 [WIP] add benchmark pipeline Feb 11, 2025
@jianzhangbjz jianzhangbjz force-pushed the benchmark branch 15 times, most recently from a1d1b5a to 24b9450 Compare February 13, 2025 08:41
@jianzhangbjz
Copy link
Author

This benchmark pipeline logic as follows:

  1. run-benchmark, here is the running log: https://github.com/operator-framework/operator-controller/actions/runs/13304141348/job/37151204944?pr=1651
  • Run benchmark test cases by using the go test -v -run=^$$ -bench=. -benchmem -count=10 -v ./test/e2e/...
  • Convert the test results to Prometheus metrics
  • Upload this metrics file by using action artifacts
  1. run-prometheus, here is the running log: https://github.com/operator-framework/operator-controller/actions/runs/13304141348/job/37151497828?pr=1651
  • Download the previous job's Prometheus DB across jobs.(ToDo: make it across repos by using the GitHub REST API)
  • Extract and Restore Prometheus Snapshot
  • Set Up Prometheus Config to listen to $HOST_IP:9000.
          cat << EOF > prometheus.yml
          global:
            scrape_interval: 5s
          scrape_configs:
            - job_name: 'benchmark_metrics'
              static_configs:
                - targets: ['$HOST_IP:9000']
          EOF
  • Run Prometheus in container
          docker run -d --name prometheus -p 9090:9090 \
            --user=root \
            -v ${{ github.workspace }}/prometheus.yml:/etc/prometheus/prometheus.yml \
            -v ${{ github.workspace }}/prometheus-data:/prometheus \
            prom/prometheus --config.file=/etc/prometheus/prometheus.yml \
            --storage.tsdb.path=/prometheus \
            --storage.tsdb.retention.time=1h \
            --web.enable-admin-api
  • Start HTTP Server to Expose Metrics (Prometheus grabs the metrics and store them to its tsdb).
  • Check Benchmark Metrics Against Threshold.
    For the Threshold, we can update them to the appropriate value after running days.
    For query metrics, we can add more with the benchmark test cases increasing.
          MAX_TIME_NS=1200000000  # 1.2s
          MAX_ALLOCS=4000
          MAX_MEM_BYTES=450000

          # Query Prometheus Metrics, get the max value
          time_ns=$(curl -s "http://localhost:9090/api/v1/query?query=max(benchmark_createclustercatalog_ns)" | jq -r '.data.result[0].value[1]')
          allocs=$(curl -s "http://localhost:9090/api/v1/query?query=max(benchmark_createclustercatalog_allocs)" | jq -r '.data.result[0].value[1]')
          mem_bytes=$(curl -s "http://localhost:9090/api/v1/query?query=max(benchmark_createclustercatalog_mem_bytes)" | jq -r '.data.result[0].value[1]')
  • Find and Upload Prometheus Snapshot
  • Stop Prometheus
  • Upload Prometheus Snapshot
  • Done

Hi @joelanford , I have implemented the logic you suggested above, could you help have a review when you get a chance? Thanks!

@jianzhangbjz jianzhangbjz changed the title 🌱 [WIP] add benchmark pipeline 🌱 add benchmark pipeline Feb 13, 2025
@jianzhangbjz
Copy link
Author

I download the snapshot from https://github.com/operator-framework/operator-controller/actions/runs/13304141348
and check them in Prometheus. It works as expected. As follows,

jiazha-mac:prometheus-3.1.0.darwin-arm64 jiazha$ tree data
data
├── 01JHSRNJT32YH9858HS49WKGNS
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01JKZ7W6XKK9WQ4B9JM5GCNZMW
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   ├── tmp_dbro_sandbox3927239696
│   └── tombstones
├── 01JKZ9ZF06MSA9VCPFBRSRWB4J
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── chunks_head
├── prometheus.tar
├── queries.active
└── wal
    ├── 00000000
    ├── 00000001
    └── 00000002

10 directories, 17 files

jiazha-mac:prometheus-3.1.0.darwin-arm64 jiazha$ ./prometheus
time=2025-02-13T09:38:09.214Z level=INFO source=main.go:636 msg="No time or size retention was set so using the default time retention" duration=15d
time=2025-02-13T09:38:09.214Z level=INFO source=main.go:683 msg="Starting Prometheus Server" mode=server version="(version=3.1.0, branch=HEAD, revision=7086161a93b262aa0949dbf2aba15a5a7b13e0a3)"
...

http://localhost:9090/query?g0.expr=benchmark_createclustercatalog_allocs&g0.show_tree=0&g0.tab=graph&g0.range_input=1h&g0.res_type=auto&g0.res_density=medium&g0.display_mode=lines&g0.show_exemplars=0
image
image

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2025
@openshift-merge-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants