Copy-paste microservices-connector changes from EnterpriseRAG (only for discussion purposes) #644

Draft

ppalucki wants to merge 1 commit into main from ppalucki-gmcrouter-telemetry-changes

Conversation

@ppalucki commented Dec 16, 2024

Description

This PR is for discussion purposes only: upstreaming changes from EnterpriseRAG to the opea/GenAIInfra microservices-connector.

It is not expected (at the current stage) to compile or work until we decide how much of the available functionality we want to backport, and in which form.

The PR includes changes only from the following directories:

Please note that the PR excludes:

  • all manifest YAMLs from the following directories:

    • /config/manifests/ - changes were made on both sides
    • /config/samples/ - changes were made on both sides
    • /helm/manifests_common/
  • changes in the docs README.md / user_guide.md

  • Go dependencies in go.mod/go.sum

Description of changes:

Note: when following the links, please click "Load diff", because main.go has over 500 changed lines and GitHub hides large diffs by default.


1. Telemetry

Metrics were proposed previously in #296, but this version adds a lot of new functionality (logs and traces included).

Main changes in:

Telemetry known issues:

  • metric names are misleading (they do not follow AI conventions) and "first token latency" doesn't reflect the "real" latency before the user gets output (full description here)
  • enabling log collection through the OpenTelemetry collector steals logs from stdout/stderr (a limitation of the logging adapter used for OpenTelemetry)
  • disabling OTEL_TRACES does not disable the tracing "wrappers" completely (it just doesn't send them anywhere) - a minor optimization opportunity
  • first/next token latency metric/trace attributes are based on "first bytes read" latency, with no logic to parse those bytes coming off the socket from the last component in the pipeline; this can be misleading when deployed for a dataprep pipeline or for a component that does not stream (when an output guardrail is enabled) - the values aren't wrong, but the name misleadingly suggests that a "token" is somehow detected, when it is really just the "packet of bytes" we received (see the sketch below)
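
To make the last point concrete, here is a minimal sketch of the measurement pattern (a hypothetical helper, not the actual router code): the timestamp is taken when the first chunk of bytes arrives on the socket, so nothing guarantees those bytes correspond to a token.

package telemetrysketch

import (
	"context"
	"net/http"
	"time"

	"go.opentelemetry.io/otel/metric"
)

// recordFirstChunk illustrates the known issue: "first token latency" is
// really the time from requestStart until the first bytes are read from
// the upstream response, whether or not those bytes contain a token.
func recordFirstChunk(ctx context.Context, requestStart time.Time, resp *http.Response, hist metric.Float64Histogram) {
	buf := make([]byte, 4096)
	// Read returns as soon as any bytes arrive; no token parsing happens here.
	if n, err := resp.Body.Read(buf); err == nil && n > 0 {
		hist.Record(ctx, float64(time.Since(requestStart).Milliseconds()))
	}
}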

2. Router logic changes

Handling the guard-specific 466 HTTP status code to stop pipeline processing while still returning the result body.

3. GMC operator: a more elaborate "string" description of the GMC object status

4. New EnterpriseRAG components

TODO (Jakub Piasecki)

5. Other

TODO (Jakub Piasecki)

@ppalucki force-pushed the ppalucki-gmcrouter-telemetry-changes branch from 95e9841 to de22d7b on December 16, 2024, 17:34
@mkbhanda (Collaborator) commented:

@irisdingbj if you get a chance, please follow this PR at a high level and comment. Note this PR is meant to give an idea; it is a work in progress and not expected to compile.

@poussa (Collaborator) commented Dec 17, 2024

@ppalucki thanks for the PR. I think the telemetry is a good, albeit large, addition to GMC. I will try the GMC telemetry in the eRAG repo to gain more insights.

@irisdingbj (Collaborator) commented:

> @irisdingbj if you get a chance, please follow this PR at a high level and comment. Note this PR is meant to give an idea; it is a work in progress and not expected to compile.

These are all good improvements for GMC, thanks!

Comment on lines +20 to +30
common_tag: &tag "ts1734346962"
common_repository: &repo "localhost:5000"

llm_model: &cpu_model "Intel/neural-chat-7b-v3-3"
llm_model_gaudi: &hpu_model "mistralai/Mixtral-8x7B-Instruct-v0.1"

images:
  gmcManager:
    image: "opea/gmcmanager"
    repository: *repo
    tag: *tag
Contributor:
Is this &tag and *tag syntax something supported by Helm, or something requiring pre-processing before it's fed to Helm?


Yes, it's Helm's native variable reference syntax

Contributor:

Hm. I did not find any mention of it in the Helm docs. According to this, it's not Helm but YAML syntax: https://stackoverflow.com/questions/75600964/what-does-and-denote-in-a-helm-template
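
For what it's worth, anchors (&) and aliases (*) are resolved by the YAML parser itself, before Helm's templating ever sees the values. A small sketch using gopkg.in/yaml.v3 (chosen here purely for illustration) shows the alias already expanded after parsing:

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

const values = `
common_tag: &tag "ts1734346962"
images:
  gmcManager:
    image: "opea/gmcmanager"
    tag: *tag
`

func main() {
	var out map[string]any
	if err := yaml.Unmarshal([]byte(values), &out); err != nil {
		panic(err)
	}
	// The alias was already expanded during parsing; no Helm
	// pre-processing is involved.
	images := out["images"].(map[string]any)
	gmc := images["gmcManager"].(map[string]any)
	fmt.Println(gmc["tag"]) // ts1734346962
}

Note that anchors only resolve within a single YAML document, which is why they work inside one values.yaml but cannot reference anchors defined in another file.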

@@ -38,23 +139,22 @@ podSecurityContext: {}
securityContext:
capabilities:
drop:
- ALL
- ALL
Contributor:
typo?

Comment on lines +188 to +200
pvc:
  - name: model-volume-embedding
    accessMode: ReadWriteOnce
    namespace: chatqa
    storage: 20Gi
  - name: model-volume-embedding
    accessMode: ReadWriteOnce
    namespace: dataprep
    storage: 20Gi
  - name: model-volume-llm
    accessMode: ReadWriteOnce
    namespace: chatqa
    storage: 100Gi
Contributor:
PV/PVC setup needs some documentation. Are there also separate PVs for each model, or are all these PVCs using the same PV?

Comment on lines +524 to +530
statusVerbose := "Not ready"
if deployment.Status.AvailableReplicas == *deployment.Spec.Replicas {
	readyCnt += 1
	statusVerbose = "Ready"
}
deploymentStatus.WriteString(fmt.Sprintf("%s; Replicas: %d desired | %d updated | %d total | %d available | %d unavailable\n",
	statusVerbose,
Contributor:
When HPA is used, deployments should not specify replica counts.

I.e. GMC should not be setting replica counts unless it's also responsible for scaling the services.
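
If GMC only needs a readiness signal, one alternative (a minimal sketch, not the actual GMC code) is to derive it from the Deployment's Available condition, which stays valid whether replicas are set by GMC or by an HPA:

package statussketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isAvailable derives readiness from the Available condition, which the
// deployment controller maintains regardless of who owns .spec.replicas.
func isAvailable(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}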

provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
otel.SetMeterProvider(provider)

// ppalucki: Own metrics defintion bellow
Contributor:
Suggested change
-// ppalucki: Own metrics defintion bellow
+// ppalucki: Own metrics definition below

Comment on lines +149 to +153
metric.WithDescription("Measures the duration of generating all but first tokens."),
api.WithExplicitBucketBoundaries(1, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16364),
)
if err != nil {
	log.Error(err, "metrics: cannot register next token histogram measure")
Contributor:
A similar improvement could be applied to all metric descriptions/errors:

Suggested change
-metric.WithDescription("Measures the duration of generating all but first tokens."),
-api.WithExplicitBucketBoundaries(1, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16364),
-)
-if err != nil {
-	log.Error(err, "metrics: cannot register next token histogram measure")
+metric.WithDescription("Duration of generating all but the first tokens."),
+api.WithExplicitBucketBoundaries(1, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16364),
+)
+if err != nil {
+	log.Error(err, "metrics: cannot register next token duration histogram")

Comment on lines +157 to +159
"llm.all.token.latency",
metric.WithUnit("ms"),
metric.WithDescription("Measures the duration to generate response with all tokens."),
@eero-t (Contributor) commented Dec 18, 2024:

I think a more established name for this metric would be "llm.request.latency"?

Author:

I agree completely :) please check my comment here: #296 (comment)
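
For concreteness, a sketch of what the rename could look like at the registration site (mirroring the style of the quoted code, with the meter and logger passed in; this is not the exact source):

package metricsketch

import (
	"github.com/go-logr/logr"
	"go.opentelemetry.io/otel/metric"
)

// registerRequestLatency shows the suggested rename at the registration site.
func registerRequestLatency(meter metric.Meter, log logr.Logger) metric.Float64Histogram {
	hist, err := meter.Float64Histogram(
		"llm.request.latency",
		metric.WithUnit("ms"),
		metric.WithDescription("Duration to generate the complete response, all tokens included."),
	)
	if err != nil {
		log.Error(err, "metrics: cannot register request latency histogram")
	}
	return hist
}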

// Stop the execution of sequence right away if step is a hard dependency and is unsuccessful
otlpr.WithContext(log, ctx).Info("This step is a hard dependency and it is unsuccessful. Stop pipeline execution.", "stepName", step.StepName, "statusCode", statusCode)
// err is nil here, so we cannot record any details about this unsuccesful response without parsing the responseBody.
Contributor:

Suggested change
-// err is nil here, so we cannot record any details about this unsuccesful response without parsing the responseBody.
+// err is nil here, so we cannot record any details about this unsuccessful response without parsing the responseBody

}
return
}

// Close span if there was not err and not guardarils were activated
Contributor:

Suggested change
-// Close span if there was not err and not guardarils were activated
+// Close span if there was not err, and no guardrails were activated

Comment on lines -744 to -747
// create a handler to handle traffic to /ui
// if the payload is empty, redirect to ui service
// if there's payload, format the payload and redirect to backend service
func mcUiHandler(w http.ResponseWriter, req *http.Request) {
Contributor:

> Router logic changes
>
> Handling the guard-specific 466 HTTP status code to stop pipeline processing while still returning the result body

Um, what about the removal of this large UI handling function, and the next asset handling one?

@eero-t (Contributor) commented Dec 18, 2024

In general the changes look IMHO quite OK, but I'm not familiar (at all) with the current GMC code.

> First/next token latency metric/trace attributes are based on "first bytes read" latency, with no logic to parse those bytes coming off the socket from the last component in the pipeline; this can be misleading when deployed for a dataprep pipeline or for a component that does not stream (when an output guardrail is enabled) - the values aren't wrong, but the name misleadingly suggests that a "token" is somehow detected, when it is really just the "packet of bytes" we received

Megaservice handles streaming and non-streaming requests separately. Token metrics are reported only for streaming (LLM) requests. Couldn't GMC do the same / is this some future TODO?
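
For illustration, the gating could be as simple as checking whether the upstream response actually streams; a sketch under the assumption that streamed responses are served as SSE (this is not the Megaservice implementation):

package routersketch

import (
	"net/http"
	"strings"
)

// isStreaming guesses whether an upstream response is a token stream;
// first/next-token metrics would be recorded only when this returns true.
func isStreaming(resp *http.Response) bool {
	return strings.HasPrefix(resp.Header.Get("Content-Type"), "text/event-stream")
}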
