WIP: Proof of concept using Loki to store and retrieve logs #1540
This is my proof of concept for using Loki to store and retrieve logs. It is based on my work on the metal-lb branch, which cleaned up some of the ways we interact with operators.
Changes
Each Kubernetes namespace becomes an "organization" (tenant) in Loki, so logs are automatically kept isolated between deployments. Promtail picks up the logs automatically and forwards them to Loki. Accessing them is simple: I wrote a client that makes the HTTP requests to Loki directly, since importing Loki's own client is impractical and the requests themselves are not complex. I copied some code from Loki, which uses the same license as our project anyway.
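A minimal sketch of what the client's request-building could look like, assuming Loki's standard `query_range` endpoint and its `X-Scope-OrgID` multi-tenancy header (the base URL and function names here are illustrative, not the actual code in this PR):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// buildLokiQuery constructs an HTTP request against Loki's query_range
// endpoint. The namespace doubles as the Loki tenant ("organization"),
// passed via the X-Scope-OrgID header, which is how Loki keeps each
// deployment's logs isolated.
func buildLokiQuery(base, namespace, logQL string) (*http.Request, error) {
	u, err := url.Parse(base + "/loki/api/v1/query_range")
	if err != nil {
		return nil, err
	}
	q := u.Query()
	q.Set("query", logQL)
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	// Scope the query to this tenant only.
	req.Header.Set("X-Scope-OrgID", namespace)
	return req, nil
}

func main() {
	req, err := buildLokiQuery("http://loki:3100", "my-deployment-ns", `{app="web"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.Path)
	fmt.Println(req.Header.Get("X-Scope-OrgID"))
}
```

Because isolation comes from the tenant header rather than from the query itself, a caller can never read another namespace's logs by crafting a clever LogQL selector.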
Selecting Loki and supporting it as our only solution makes sense for now: it is an open-source project under active development, designed to work with Kubernetes. It would be nice to support other log-retention solutions eventually, but I don't think that is practical yet.
One neat feature I've added is the ability to identify and limit the logs to a specific run of a container (a new run begins each time the container fails and restarts). This is helpful when a container deploys but fails quickly: an end user can request all the logs from a specific run to figure out why it is failing. It's still on the end user to deploy a container that logs helpful information, but this should make things easier.
One thing I haven't tried to account for is that running Loki should be optional. A provider that mostly hosts computational workloads (miners, or whatever) shouldn't be required to run Loki, so we need to figure out what to do in that case. Should we just fall back to querying the kube logs?
Incomplete / Unresolved