Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add receiver downscale endpoint #88

Merged
merged 19 commits into from
Nov 14, 2024
Merged

Conversation

yuchen-db
Copy link

@yuchen-db yuchen-db commented Oct 11, 2024

The endpoint expose number of TSDBs, rollout operator will patch the sts if n_tsdb=0
related PRs:
https://github.com/databricks/universe/pull/753653
databricks/rollout-operator#7

@yuchen-db yuchen-db changed the title update downscale Oct 12, 2024
@yuchen-db yuchen-db changed the title downscale db downscale Oct 14, 2024
@yuchen-db yuchen-db changed the title db downscale Add receiver downscale endpoint Oct 14, 2024
@yuchen-db yuchen-db requested review from a team, christopherzli, jnyi, hczhu-db and yulong-db and removed request for a team October 14, 2024 23:05
@@ -17,6 +15,11 @@ import (
toolkit_web "github.com/prometheus/exporter-toolkit/web"
"golang.org/x/net/http2"
"golang.org/x/net/http2/h2c"
"net/http"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

@@ -311,6 +311,8 @@ func runReceive(
httpserver.WithGracePeriod(time.Duration(*conf.httpGracePeriod)),
httpserver.WithTLSConfig(*conf.httpTLSConfig),
)
var lastDownscalePrepareTimestamp *int64 = nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation assumes the pod won't restart, to preserve the state, we might need to persist it on local PVC too, any thoughts?

Copy link
Author

@yuchen-db yuchen-db Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but it's safe to lose this state. rollout operator will only do downscale if the returned timestamp is more than 30 seconds ago. If we lose this state, it returns the current timestamp and downscale will not accidentally happen. In the worst case, a scheduled downscale will be delayed for 30 seconds.

@@ -106,6 +106,14 @@ func (t *MultiTSDB) SkipMatchExternalLabels() {
t.skipMatchExternalLabels = true
}

func (t *MultiTSDB) GetTenants() map[string]*tenant {
return t.tenants
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is protected by mtx. A call site of GetTenants() can hold a pointer to the map and do reads/writes while the map is being updated by other goroutines. That'd be a data race.
It's safer to return a deep copy of the map if copying is not too expensive.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored code

func RegisterDownscale[K comparable, V any](s *Server, m map[K]V, mtx *sync.RWMutex, t *int64) {
s.mux.Handle("/-/downscale", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
mtx.RLock()
n := len(m)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't pass in a map, if only the size of map is needed. Pass in the size of the map.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the size of the map changes dynamically during handler runtime and Golang passes map by reference

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored code

@yuchen-db yuchen-db force-pushed the yuchen-db/scaledown-with-operator branch from cf878f5 to ea0d891 Compare October 22, 2024 05:51
@hczhu-db hczhu-db requested a review from a team October 24, 2024 22:53
func (t *MultiTSDB) GetTenantsLen() int {
t.mtx.RLock()
defer t.mtx.RUnlock()
return len(t.tenants)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we log how many tenants left and what are they?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add unit test?

@@ -7,8 +7,10 @@ import (
"context"
"fmt"
"net"
"net/http"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Copy link
Collaborator

@jnyi jnyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, would be nice if you can add unit tests, especially after prune.

@yuchen-db yuchen-db merged commit 6cf9daa into db_main Nov 14, 2024
13 of 14 checks passed
@yuchen-db yuchen-db deleted the yuchen-db/scaledown-with-operator branch November 14, 2024 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants