Pull latest thanos main with Cuckoo filter #89

Merged: 159 commits merged on Oct 21, 2024

@jnyi (Collaborator) commented Oct 17, 2024

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

ahurtaud and others added 30 commits October 16, 2024 13:32
Update Prometheus version to include
prometheus/prometheus#13242 which is important
for me - it unblocks further postings work.

Signed-off-by: Giedrius Statkevičius <[email protected]>
1. Because of weaveworks/common#239, the replace directive in go.mod pins grpc to 1.45.0, which has known vulnerabilities. To fix CVE-2023-44478, grpc needs to be upgraded to 1.57.2.
2. Upgrading grpc also requires upgrading weaveworks/common; otherwise the build fails.

Signed-off-by: hanyuting8 <[email protected]>
If the requested label is an external label and we have series matchers
we should only return results if the series matchers actually match a
series.

Signed-off-by: Michael Hoffmann <[email protected]>
…-io#7087)

Receiver hangs waiting for the HTTP Handler to shut down if an error occurs
before the Handler is initialized. This might happen, for example, if the
hashring is too small for a given replication factor.

Signed-off-by: Mikhail Nozdrachev <[email protected]>
The Prometheus Helm chart has been a community-maintained chart for a few
years. As a result, the old example pointed to an outdated chart and the
provided example values no longer work.

This updates the documentation.

Signed-off-by: Mario Constanti <[email protected]>
Adds a flag to register the extended promql functions supported by the thanos
query engine when running the rule component.  This will allow rule config
files containing query expressions with (xrate / xincrease / xdelta) to pass
validation.  This will only work if the query endpoint in use is running the
thanos engine.

Signed-off-by: Samuel Dufel <[email protected]>
* Allow using different listing strategies

Signed-off-by: Filip Petkovski <[email protected]>

* Expose flags for block list strategy

Signed-off-by: Filip Petkovski <[email protected]>

* Run make docs

Signed-off-by: Filip Petkovski <[email protected]>

* Fix whitespace

Signed-off-by: Filip Petkovski <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
* receive/handler: implement tenant label splitting

Implement splitting incoming HTTP requests along some label inside of
the timeseries themselves. This functionality is useful when you have
one big application exposing lots of series and, for instance, a label
`team` that identifies the different owners of metrics in that
application. You can then use that `team` label to map series to
different tenants in Thanos.

The only downside I could spot is that if one of the requests fails
after splitting, its status code is used for all tenants, which skews
the Receiver metrics a little bit. I think that can be left as a TODO
task.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* test/e2e: add more receiver tests

Signed-off-by: Giedrius Statkevičius <[email protected]>

* thanos/receive: note that splitting takes precedence over HTTP

Signed-off-by: Giedrius Statkevičius <[email protected]>

* thanos/receive: fix typo

Signed-off-by: Giedrius Statkevičius <[email protected]>

---------

Signed-off-by: Giedrius Statkevičius <[email protected]>
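
The tenant label splitting described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the Receiver's actual code: the `Label` and `Series` types, the `splitByTenantLabel` function, and the fallback to a default tenant for series without the label are all assumptions made for the example.

```go
package main

// Label is a single name/value pair attached to a series (hypothetical type).
type Label struct {
	Name  string
	Value string
}

// Series is a simplified stand-in for one remote-write time series.
type Series struct {
	Labels []Label
}

// splitByTenantLabel groups series by the value of splitLabel.
// Series that do not carry the label fall back to defaultTenant
// (an assumption of this sketch).
func splitByTenantLabel(series []Series, splitLabel, defaultTenant string) map[string][]Series {
	out := make(map[string][]Series)
	for _, s := range series {
		tenant := defaultTenant
		for _, l := range s.Labels {
			if l.Name == splitLabel && l.Value != "" {
				tenant = l.Value
				break
			}
		}
		out[tenant] = append(out[tenant], s)
	}
	return out
}
```

Calling `splitByTenantLabel(series, "team", "default")` would yield one group of series per distinct `team` value, and each group could then be written under its own tenant.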
Signed-off-by: Giedrius Statkevičius <[email protected]>
* Receive: fix issue-7248 by introducing a worker pool

Signed-off-by: Yi Jin <[email protected]>

* fix unit test bug

Signed-off-by: Yi Jin <[email protected]>

* fix CLI flags not being passed into the receive handler

Signed-off-by: Yi Jin <[email protected]>

* address comments

Signed-off-by: Yi Jin <[email protected]>

* init context in constructor

Signed-off-by: Yi Jin <[email protected]>

---------

Signed-off-by: Yi Jin <[email protected]>
* Show warnings in query frontend

QFE currently does not parse warnings from downstream queriers.
This commit fixes that by adding the field to proto messages and
modifies the merge function to take warnings into account.

Signed-off-by: Filip Petkovski <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Filip Petkovski <[email protected]>

* Omit empty warnings

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
Remove a long-standing TODO item in the code - let's use the great loser
tree implementation by Bryan. It is faster than the heap because fewer
comparisons are needed. Should be a nice improvement given that the heap
is used in a lot of hot paths.

Since Prometheus also uses this library, it's tricky to import the "any"
version. I tried doing bboreham/go-loser#3 but
it's still impossible to do that. Let's just copy/paste the code, it's
not a lot.

Bench:

```
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/store
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
             │   oldkway   │               newkway               │
             │   sec/op    │    sec/op     vs base               │
KWayMerge-16   2.292m ± 3%   2.075m ± 15%  -9.47% (p=0.023 n=10)

             │   oldkway    │               newkway               │
             │     B/op     │     B/op      vs base               │
KWayMerge-16   1.553Mi ± 0%   1.585Mi ± 0%  +2.04% (p=0.000 n=10)

             │   oldkway   │              newkway               │
             │  allocs/op  │  allocs/op   vs base               │
KWayMerge-16   27.26k ± 0%   26.27k ± 0%  -3.66% (p=0.000 n=10)
```

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
* compact/planner: fix issue 6775

It doesn't make sense to vertically compact downsampled blocks so mark
them with the no compact marker if downsampled blocks were detected in
the plan. Seems like the Planner is the best place for this logic - I
just repeated the previous pattern with the large index file filter.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* CHANGELOG: add item

Signed-off-by: Giedrius Statkevičius <[email protected]>

---------

Signed-off-by: Giedrius Statkevičius <[email protected]>
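
As a rough illustration of the planner change above, the filter could look something like the sketch below. The `blockMeta` type and `filterDownsampled` function are hypothetical; the sketch assumes a resolution of 0 means raw (non-downsampled) data and only returns the IDs to mark, leaving the actual writing of no-compact markers out.

```go
package main

// blockMeta is a simplified stand-in for block metadata. Resolution is the
// downsampling resolution in milliseconds; 0 is assumed to mean raw data.
type blockMeta struct {
	ID         string
	Resolution int64
}

// filterDownsampled splits a compaction plan into the blocks that may still
// be compacted and the IDs of downsampled blocks that should receive a
// no-compact marker instead.
func filterDownsampled(plan []blockMeta) (keep []blockMeta, noCompact []string) {
	for _, b := range plan {
		if b.Resolution > 0 {
			noCompact = append(noCompact, b.ID)
			continue
		}
		keep = append(keep, b)
	}
	return keep, noCompact
}
```

Keeping this check inside the planner mirrors the pattern the commit message mentions for the large index file filter: the plan is inspected once and offending blocks are excluded before compaction starts.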
* allow configurable request logger for Store Gateway

Signed-off-by: Ben Ye <[email protected]>

* lint

Signed-off-by: Ben Ye <[email protected]>

* lint

Signed-off-by: Ben Ye <[email protected]>

* fix tests

Signed-off-by: Ben Ye <[email protected]>

* fix test

Signed-off-by: Ben Ye <[email protected]>

* address comments

Signed-off-by: Ben Ye <[email protected]>

* fix tests

Signed-off-by: Ben Ye <[email protected]>

* changelog

Signed-off-by: Ben Ye <[email protected]>

---------

Signed-off-by: Ben Ye <[email protected]>
* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <[email protected]>

* fix lint

Signed-off-by: Thibault Mange <[email protected]>

* update changelog

Signed-off-by: Thibault Mange <[email protected]>

* delete invalid comment

Signed-off-by: Thibault Mange <[email protected]>

* remove temp dev test

Signed-off-by: Thibault Mange <[email protected]>

* remove timer channel drain

Signed-off-by: Thibault Mange <[email protected]>

---------

Signed-off-by: Thibault Mange <[email protected]>
If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <[email protected]>
We have seen deadlocks with endpoint discovery caused by the metric
collector hanging and not releasing the store labels lock. This causes
the endpoint update to hang, which also makes all endpoint readers hang on
acquiring a read lock for the resolved endpoints slice.

This commit makes sure the Collect method on the metrics collector has
a built in timeout to guard against cases where an upstream call never
reads from the collection channel.

Signed-off-by: Filip Petkovski <[email protected]>
…ne (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <[email protected]>

* small fix

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>
In LabelNames and LabelValues, gRPC calls were not pruned properly. While
the results are not wrong, this leads to inefficient fan-out for setups
with many endpoints.
We took the opportunity to unify the store filtering and generally also
the larger layout of the gRPC methods, including logging and tracing.

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
* Appending warn to changelog about breaking change

Signed-off-by: Pedro Tanaka <[email protected]>

* Including warning emoji

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
…7392)

If we have a new querier it will create query hints even without the
pushdown feature being present anymore. Old sidecars will then trigger
query pushdown, which leads to broken max, min, max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <[email protected]>
* *: Using native histograms for grpc middleware metrics

Since we updated the middleware library, we can now use native histograms to keep track of latencies in grpc calls.
This is a semi-breaking change if people enabled native histogram collection on their Prometheus monitoring Thanos instances.

Signed-off-by: Pedro Tanaka <[email protected]>

adding change log

Signed-off-by: Pedro Tanaka <[email protected]>

* removing empty space;

Signed-off-by: Pedro Tanaka <[email protected]>

* Put full disclaimer in changelog

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
* compact: recover from panics (thanos-io#7318)

For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* Sidecar: wait for prometheus on startup (thanos-io#7323)

Signed-off-by: Michael Hoffmann <[email protected]>

* Receive: fix serverAsClient.Series goroutines leak (thanos-io#6948)

* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <[email protected]>

* fix lint

Signed-off-by: Thibault Mange <[email protected]>

* update changelog

Signed-off-by: Thibault Mange <[email protected]>

* delete invalid comment

Signed-off-by: Thibault Mange <[email protected]>

* remove temp dev test

Signed-off-by: Thibault Mange <[email protected]>

* remove timer channel drain

Signed-off-by: Thibault Mange <[email protected]>

---------

Signed-off-by: Thibault Mange <[email protected]>

* Receive: fix stats (thanos-io#7373)

If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <[email protected]>

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <[email protected]>

* small fix

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>

* Query: don't pass query hints to avoid triggering pushdown (thanos-io#7392)

If we have a new querier it will create query hints even without the
pushdown feature being present anymore. Old sidecars will then trigger
query pushdown, which leads to broken max, min, max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <[email protected]>

* Cut patch release v0.35.1

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Co-authored-by: Giedrius Statkevičius <[email protected]>
Co-authored-by: Michael Hoffmann <[email protected]>
Co-authored-by: Thibault Mange <[email protected]>
Previously we deferred starting the gRPC server by blocking the whole
startup until we could ping Prometheus. This breaks use cases that rely
on the config reloader to start Prometheus.
We fix it by using a channel to defer starting the gRPC server
and loading external labels in an actor concurrently.

Signed-off-by: Michael Hoffmann <[email protected]>
* Update Prometheus

Signed-off-by: alanprot <[email protected]>

* Updating prometheus to 4e664035e84e

Signed-off-by: alanprot <[email protected]>

* Temporarily pinning prometheus common

Signed-off-by: alanprot <[email protected]>

* fixing lint

Signed-off-by: alanprot <[email protected]>

* Using jsoniter to encode promql responses

Signed-off-by: alanprot <[email protected]>

* Removing e2e test case with invalid hyphen in a matcher -> Prometheus now supports this use case

Signed-off-by: alanprot <[email protected]>

* Updating prometheus to v0.52.2-0.20240606174736-edd558884b24

Signed-off-by: alanprot <[email protected]>

* pinning grpc to v1.63.2

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
GiedriusS and others added 12 commits October 16, 2024 19:11
When trimming is not disabled, receivers end up recoding all chunks
in order to drop samples that are outside of the range.
This ends up being very expensive and causes ingestion problems during high
query load.

This commit disables trimming which should reduce CPU usage in receivers.

Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
…hanos-io#7827)

Bumps google.golang.org/protobuf from 1.34.2 to 1.35.1.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
thanos-io#7825)

Bumps [go.opentelemetry.io/otel/trace](https://github.com/open-telemetry/opentelemetry-go) from 1.29.0 to 1.31.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.29.0...v1.31.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/trace
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…s-io#7822)

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.26.10 to 3.26.13.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@e2b3eaf...f779452)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [golang.org/x/time](https://github.com/golang/time) from 0.6.0 to 0.7.0.
- [Commits](golang/time@v0.6.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/time
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix coroutine leak

The in-process client uses a pull based iterator which needs
to be closed, otherwise it will leak the underlying coroutine.
When this happens, the tsdb reader will remain open which blocks head
compaction indefinitely.

Signed-off-by: Filip Petkovski <[email protected]>

* Fix race condition

Signed-off-by: Filip Petkovski <[email protected]>

* Fix CHANGELOG

Signed-off-by: Filip Petkovski <[email protected]>

* Improve tests

Signed-off-by: Filip Petkovski <[email protected]>

* Fix blockSeriesClient

Signed-off-by: Filip Petkovski <[email protected]>

* Fix unit test

Signed-off-by: Filip Petkovski <[email protected]>

* Fix another unit test

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
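
The leak described above is easy to reproduce with any pull-based iterator. The sketch below assumes Go 1.23's standard `iter.Pull`, which may or may not be the exact mechanism used by the in-process client; `firstN` is a hypothetical helper. The point it shows is that `stop` has to be called whenever the consumer finishes early.

```go
package main

import "iter"

// firstN pulls at most n values from seq. The deferred stop call is what
// releases the coroutine behind the pull iterator; without it the producer
// stays parked forever and any resources it holds remain open.
func firstN(seq iter.Seq[int], n int) []int {
	next, stop := iter.Pull(seq)
	defer stop()

	out := make([]int, 0, n)
	for len(out) < n {
		v, ok := next()
		if !ok {
			break
		}
		out = append(out, v)
	}
	return out
}
```

When `stop` runs, the producer's `yield` returns false, so its deferred cleanup (for example, closing a reader) gets a chance to execute.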
This commit updates the go version to 1.23 in the CI, including
unit, e2e tests and promu crossbuild.

It also bumps bingo dependencies where needed.

Signed-off-by: Filip Petkovski <[email protected]>
add Memcached deployment in Kubernetes, similar to Cortex [1].

[1] https://cortexmetrics.io/docs/blocks-storage/store-gateway/#memcached-index-cache

Signed-off-by: Kien Nguyen Tuan <[email protected]>
Expose the new concurrent evaluation functionality from Ruler.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Yi Jin <[email protected]>
@jnyi jnyi requested review from a team, christopherzli, hczhu-db, yuchen-db and yulong-db and removed request for a team October 17, 2024 02:21
Signed-off-by: Yi Jin <[email protected]>
@jnyi jnyi force-pushed the cuckoo_filter branch 4 times, most recently from baf4b1e to 927bd45 Compare October 21, 2024 18:14
Signed-off-by: Yi Jin <[email protected]>
@jnyi jnyi merged commit f18efee into databricks:db_main Oct 21, 2024
14 checks passed