Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backup: Export tenant backup schedule metric #141156

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

edwardguo-crl
Copy link
Contributor

@edwardguo-crl edwardguo-crl commented Feb 11, 2025

Export Cloud tenant backup schedule completion metric with tenant id tag.

Issue: #141167

Epic: None

Release note: None

Copy link

blathers-crl bot commented Feb 11, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@edwardguo-crl
Copy link
Contributor Author

Quick question: Is there a way I can test the exported metric before running it in staging? @msbutler

@@ -578,6 +584,12 @@ func init() {
Measurement: "Jobs",
Unit: metric.Unit_TIMESTAMP_SEC,
}),
RpoTenantMetric: NewExportedGaugeVec(metric.Metadata{
Name: "schedules.BACKUP.tenant-last-completed-time",
Help: "The unix timestamp of the most recently completed backup by a tenant schedule specified as maintaining this metric",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a "tenant schedule" but rather a schedule for backing up a tenant. So the schedule still exists on the host cluster, but backs up the tenant.

@@ -362,6 +363,11 @@ func (e *scheduledBackupExecutor) backupSucceeded(
e.metrics.RpoMetric.Update(details.(jobspb.BackupDetails).EndTime.GoTime().Unix())
}

if details.(jobspb.BackupDetails).SpecificTenantIds != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the UpdatesLastBackupMetric for determining the "special" schedule that will update the metric — should we do the same for these backup tenant schedules?

Thoughts @msbutler?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it looks like the update for this metric should be in the UpdatesLastBackupmetric branch above.

@@ -362,6 +363,11 @@ func (e *scheduledBackupExecutor) backupSucceeded(
e.metrics.RpoMetric.Update(details.(jobspb.BackupDetails).EndTime.GoTime().Unix())
}

if details.(jobspb.BackupDetails).SpecificTenantIds != nil {
e.metrics.RpoTenantMetric.Update(map[string]string{"tenant_id": details.(jobspb.BackupDetails).SpecificTenantIds},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpecificTenantIDs is a []string, so you'll want to loop over them and emit a metric for each one.

@kev-cao
Copy link
Contributor

kev-cao commented Feb 11, 2025

@edwardguo-crl I'm not aware of a way to use a custom Cockroach binary for the clusters in your local CC deployment, and my gut tells me that we don't have a mechanism built for that. @benbardin any insights on this?

@kev-cao
Copy link
Contributor

kev-cao commented Feb 11, 2025

@edwardguo-crl The commit should ideally be prefixed with the package it affects and indicated in the PR title as well. We also generally don't link to the CC ticket in this repo.

nit: I think the PR description can be a bit more specific — there's no mechanism in place that makes the metric "cloud only". I'd be a bit more descriptive in what the metric actually represents as well.

@msbutler
Copy link
Collaborator

@edwardguo-crl thanks for opening this! A few general process nits:

  • it looks like you're linking the crdb pr to a cc issue. In general, crdb pr's shoudl be linked to crdb issues. could you open a corresponding crdb issue instead?
  • once you have the crdb issue, you can follow the general crdb commit guidelines: this pr should be named: backup: export tenant ..... The commit message should contain: Fixes {crdb_issue}.

@msbutler
Copy link
Collaborator

Is there a way I can test the exported metric before running it in staging?

@edwardguo-crl i would augment a test in backup_test.go that creates a backup of several tenants and verify the metrics were updated using some of the infra used in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/metric_test.go

@edwardguo-crl edwardguo-crl changed the title [CC-30018] Export tenant backup schedule metric [backup] Export tenant backup schedule metric Feb 11, 2025
@edwardguo-crl edwardguo-crl changed the title [backup] Export tenant backup schedule metric backup: Export tenant backup schedule metric Feb 11, 2025
@edwardguo-crl edwardguo-crl marked this pull request as ready for review February 24, 2025 18:55
@edwardguo-crl edwardguo-crl requested a review from a team as a code owner February 24, 2025 18:55
@@ -578,6 +588,12 @@ func init() {
Measurement: "Jobs",
Unit: metric.Unit_TIMESTAMP_SEC,
}),
RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{
Name: "schedules.BACKUP.tenant-last-completed-time",
Help: "The Unix timestamp of the most recently completed backup by host schedule for the tenant",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like "The unix timestamp of the most recently completed tenant backup by a host schedule specified as maintaining this metric" would be more fitting and similar to the existing description.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like: "The unix timestamp of the most recently completed host scheduled backup by virtual cluster specified as maintaining this metric"

RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{}, []string{"tenant_id"}),
},
}
env := scheduledjobs.ProdJobSchedulerEnv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like our existing tests use a test environment instead of the production scheduler environment. I'm not familiar with this part of the codebase though, so @msbutler would have a better idea on how to best do the testing here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the env matters here. @edwardguo-crl this is an awesome test!

@edwardguo-crl edwardguo-crl force-pushed the cc-30018 branch 3 times, most recently from eb3a930 to 081f236 Compare February 24, 2025 22:01
Copy link
Collaborator

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edwardguo-crl awesome job with the test! this patch is nearly there.

@@ -574,10 +581,16 @@ func init() {
ExecutorPTSMetrics: &pm,
RpoMetric: metric.NewGauge(metric.Metadata{
Name: "schedules.BACKUP.last-completed-time",
Help: "The unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric",
Help: "The unix timestamp of the most recently completed tenant backup by a host schedule specified as maintaining this metric",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would revert this change. The metric is bumped by a schedule within the cluster, so "host" is misleading. We also typically refer to backups taken within a tenant as just a "backup". A "tenant backup" typically refers to a backup of a tenant.

Measurement: "Jobs",
Unit: metric.Unit_TIMESTAMP_SEC,
}),
RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{
Name: "schedules.BACKUP.tenant-last-completed-time",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we're trying to scrub all mentions of "tenant" in our codebase and docs. Use virtual cluster instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about this metric name: "schedules.BACKUP.last-completed-time-by-virtual_cluster"

@@ -578,6 +588,12 @@ func init() {
Measurement: "Jobs",
Unit: metric.Unit_TIMESTAMP_SEC,
}),
RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{
Name: "schedules.BACKUP.tenant-last-completed-time",
Help: "The Unix timestamp of the most recently completed backup by host schedule for the tenant",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like: "The unix timestamp of the most recently completed host scheduled backup by virtual cluster specified as maintaining this metric"

return tenantIDs
}

func verifyRPOTenantMetricLabels(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also pass in and verify the expected end time?

RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{}, []string{"tenant_id"}),
},
}
env := scheduledjobs.ProdJobSchedulerEnv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the env matters here. @edwardguo-crl this is an awesome test!

err := executor.backupSucceeded(ctx, nil, schedule, details, env)
require.NoError(t, err)

expectedTenantIDs := []string{"system", "2"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also verify that executor.metrics.RpoMetric.Value() was also updated in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a Value() method in RpoMetric. I verified the Gauge metric value in a separate function.

@edwardguo-crl edwardguo-crl requested review from a team and removed request for golgeek, arjunmahishi, xinhaoz, mw5h and a team February 25, 2025 21:48
@edwardguo-crl edwardguo-crl force-pushed the cc-30018 branch 4 times, most recently from bd86e18 to e30ec71 Compare February 26, 2025 17:48
Export Cloud tenant backup schedule completion metric with tenant id tag.

Issue: cockroachdb#141167

Epic: None

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants