[Thanos Store] Guidance on best way to scale sharded and autoscaled Thanos Store with persistence enabled #7797
Comments
Hello @kaiohenricunha! How can I help you today? I couldn't find specific guidance in the repository on configuring a custom StorageClass for high performance and high availability in a Thanos Store setup. However, here are some general recommendations that might help:
These adjustments should help you achieve the desired high performance and high availability for your Thanos Store setup.
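For illustration only, a gp3-based StorageClass along those lines might look like the sketch below; the class name and the IOPS/throughput values are assumptions, not recommendations from this thread:

```yaml
# Hypothetical StorageClass for Thanos Store PVCs on EBS gp3.
# Name, IOPS, and throughput values are illustrative assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thanos-store-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"       # gp3 lets IOPS be provisioned independently of volume size
  throughput: "500"  # MiB/s
# Delay volume creation until the pod is scheduled, so the EBS volume
# is created in the same AZ as the pod that will mount it.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```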
It turns out I was having really slow responses from Store because of this issue, a bug in Store caching. Changing bucket caching from Redis to in-memory resolved it temporarily, but guidance on how to optimize Store is still welcome. I'm now seeing latency of around ~15 seconds for SLI dashboards. I used this Store configuration:
Along with this caching configuration:
bucketCacheConfig used to point to the same Redis instance, but on db 1.
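For reference, a caching-bucket configuration along the lines described (now using an in-memory cache instead of Redis) might look roughly like the sketch below; the cache size and TTL values are assumptions, not the exact settings used here. The file is passed to Store via `--store.caching-bucket.config-file`:

```yaml
# Hypothetical bucket cache config after moving from Redis to in-memory.
# max_size and the TTLs below are illustrative assumptions.
type: IN-MEMORY
config:
  max_size: 2GB
metafile_content_ttl: 24h       # cache meta.json contents
metafile_exists_ttl: 2h         # cache "object exists" lookups
metafile_doesnt_exist_ttl: 15m  # cache negative lookups for missing meta.json
```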
Thanos, Prometheus and Golang version used:
Thanos version: 0.34.1
Prometheus version: 2.45.0 (I guess)
Golang version: 1.21.1
Object Storage Provider:
AWS S3
What happened:
I am working on a Thanos setup that includes sharded and autoscaled Thanos Store StatefulSets with persistent volumes for filesystem storage. My goal is to achieve the highest performance and availability possible.
The current architecture aims to let Thanos Store pods be scheduled on any node across the cluster without having to worry about whether the required PersistentVolumeClaim (PVC) is available on a specific node or AZ. This is crucial, as I want to maintain flexibility and avoid any constraints regarding node or AZ affinity for the PVCs.
I have already enabled Thanos Query Frontend (in-memory) and Store (ElastiCache Redis) caching, but my setup still struggles with sudden spikes.
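For context, block-level sharding of Store is commonly done with a hashmod relabel selector, one config per StatefulSet, passed via `--selector.relabel-config-file`. A minimal sketch for shard 0 of 3 (the shard count and file path are assumptions):

```yaml
# Hypothetical selector relabel config for Store shard 0 of 3.
# Each shard keeps only blocks whose __block_id hashes to its own shard number.
- action: hashmod
  source_labels: ["__block_id"]
  target_label: shard
  modulus: 3
- action: keep
  source_labels: ["shard"]
  regex: "0"
```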
What you expected to happen:
I am looking for guidance on the recommended Thanos Store configuration for such a setup that provides:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
Environment:
Recently, I tried using Amazon EFS for its high availability, since it allows access from any node/AZ, but Thanos pods with persistence enabled took far too long to start up. I went back to EBS gp3 after noticing that Thanos Store had been "fetching metadata" at startup for more than 30 minutes while consuming little CPU and memory. In the EFS console, throughput utilization was close to 100%.
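For illustration, the Store-side settings that relate to startup metadata fetching might look like this; the flag values are assumptions, not settings taken from this setup:

```yaml
# Hypothetical Thanos Store container args related to startup metadata fetching.
# Concurrency values are illustrative assumptions.
args:
  - store
  - --data-dir=/var/thanos/store
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --block-meta-fetch-concurrency=32                                # parallel meta.json fetches at startup
  - --block-sync-concurrency=20                                      # parallel block syncs
  - --store.caching-bucket.config-file=/etc/thanos/bucket-cache.yml  # caches meta.json / exists lookups
```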
I would greatly appreciate guidance on:
Thank you very much for your help!