
OOM Occurrence in Next.js when Using metricReader & resourceDetectors #5493

Open
masaya-fukazawa opened this issue Feb 20, 2025 · 3 comments

What happened?

Steps to Reproduce

  1. In a Next.js 14.1.3 project, migrate from dd-trace to OpenTelemetry.
  2. Configure OpenTelemetry using NodeSDK with various instrumentations (Http, DNS, Net, Undici, etc.).
  3. Use a configuration that includes both metricReader and resourceDetectors.
  4. Run the application, which eventually triggers an Out Of Memory (OOM) error.

Expected Result

  • OpenTelemetry should collect traces and metrics without causing the application to run out of memory.

Actual Result

  • The application process terminates abnormally due to an OOM error during runtime.

Additional Details

  • The issue occurs in a Next.js environment after switching from dd-trace to OpenTelemetry.
  • The OpenTelemetry configuration includes a NodeSDK setup with W3CTraceContextPropagator, OTLPTraceExporter, and OTLPMetricExporter.
  • The HttpInstrumentation's requestHook is used to set the HTTP route for spans.
  • Investigation Findings: When both metricReader and resourceDetectors are removed from the OpenTelemetry configuration, the OOM error no longer occurs. This indicates that these configurations might be contributing to the memory issue.

Memory trends for the datadog-agent and the application are shown below:

datadog-agent's memory usage: [image]

application's memory usage: [image]

OpenTelemetry Setup Code

// instrumentation.node.ts

import { IncomingMessage } from "node:http";
import { context } from "@opentelemetry/api";
import { W3CTraceContextPropagator } from "@opentelemetry/core";
import { RPCType, getRPCMetadata, setRPCMetadata } from "@opentelemetry/core";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-grpc";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { DnsInstrumentation } from "@opentelemetry/instrumentation-dns";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { NetInstrumentation } from "@opentelemetry/instrumentation-net";
import { UndiciInstrumentation } from "@opentelemetry/instrumentation-undici";
import { awsEcsDetector } from "@opentelemetry/resource-detector-aws";
import {
  Resource,
  envDetector,
  hostDetector,
  osDetector,
  processDetector,
} from "@opentelemetry/resources";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-node";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
import {
  ATTR_CONTAINER_IMAGE_TAGS,
  ATTR_DEPLOYMENT_ENVIRONMENT,
} from "@opentelemetry/semantic-conventions/incubating";

const sdk = new NodeSDK({
  textMapPropagator: new W3CTraceContextPropagator(),
  traceExporter: new OTLPTraceExporter(),
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME,
    [ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NEXT_PUBLIC_DEPLOYMENT_ENV,
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION,
    [ATTR_CONTAINER_IMAGE_TAGS]: process.env.SERVICE_VERSION,
  }),
  instrumentations: [
    new HttpInstrumentation({
      requestHook: (span, request) => {
        const route = (request as IncomingMessage)?.url;
        // Try to apply the route only for pages and client-side fetches
        if (route && (route.endsWith(".json") || !route.includes("."))) {
          // Retrieve RPC metadata from the active context
          const rpcMetadata = getRPCMetadata(context.active());
          if (rpcMetadata) {
            if (rpcMetadata.type === RPCType.HTTP) {
              rpcMetadata.route = route;
            }
          } else {
            setRPCMetadata(context.active(), {
              type: RPCType.HTTP,
              route,
              span,
            });
          }
        }
      },
    }),
    new DnsInstrumentation(),
    new NetInstrumentation(),
    new UndiciInstrumentation(),
  ],
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
  }),
  spanProcessors: [new BatchSpanProcessor(new OTLPTraceExporter())],
  resourceDetectors: [
    awsEcsDetector,
    envDetector,
    processDetector,
    hostDetector,
    osDetector,
  ],
});
sdk.start();

package.json

{
  "name": "***",
  "version": "***",
  "dependencies": {
    "@datadog/browser-logs": "5.25.0",
    "@datadog/browser-rum": "5.25.0",
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/core": "^1.30.1",
    "@opentelemetry/exporter-metrics-otlp-grpc": "^0.57.1",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.57.1",
    "@opentelemetry/instrumentation-dns": "^0.43.0",
    "@opentelemetry/instrumentation-http": "^0.57.1",
    "@opentelemetry/instrumentation-net": "^0.43.0",
    "@opentelemetry/instrumentation-undici": "^0.10.0",
    "@opentelemetry/resource-detector-aws": "^1.11.0",
    "@opentelemetry/resources": "^1.30.1",
    "@opentelemetry/sdk-metrics": "^1.30.1",
    "@opentelemetry/sdk-node": "^0.57.1",
    "@opentelemetry/sdk-trace-node": "^1.30.1",
    "@opentelemetry/semantic-conventions": "^1.28.0",
    "next": "14.1.3",
    "react": "18.2.0",
    "react-dom": "18.2.0"
  }
}

Relevant log output

no logs output

Operating System and Version

Docker containers

Runtime and Version

datadog-agent: 7.50.3
Node.js: 20.15.1

masaya-fukazawa added the bug and triage labels on Feb 20, 2025
@pichlermarc (Member) commented Feb 20, 2025

Hi @masaya-fukazawa, thanks for reaching out.

That request hook looks unsafe, especially around the route metadata. route must be just that, a route - not a URL, as a URL is likely to be high cardinality (think query strings, path parameters, etc.).

Every unique set of attributes (including http.route) becomes its own metric stream. By default, OTel (regardless of language implementation) sets no limit on the number of metric streams that can be allocated, so if the SDK is fed high-cardinality attributes, your app will eventually run out of memory: the SDK has to keep the attribute set of every stream that was ever created in memory.
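
For illustration, a minimal sketch (hypothetical meter and counter names, using only the @opentelemetry/api metrics API) of how high-cardinality attribute values multiply streams:

import { metrics } from "@opentelemetry/api";

const counter = metrics
  .getMeter("cardinality-example") // hypothetical meter name
  .createCounter("http.server.request.count"); // hypothetical counter name

// One attribute set -> one stream, no matter how many requests arrive.
counter.add(1, { "http.route": "/users/:id" });

// A new attribute set per concrete URL -> a new stream per unique URL,
// all of which the SDK must keep in memory until shutdown.
counter.add(1, { "http.route": "/users/42?session=abc" });
counter.add(1, { "http.route": "/users/43?session=def" });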

I think the fix would be to adapt the hook to supply low-cardinality data to route. As a safeguard against this happening again in the future, you can set a cardinality limit through a View configuration in your NodeSDK constructor:

// View is exported from '@opentelemetry/sdk-metrics'
new NodeSDK({
  // ... your config
  views: [
    new View({
      instrumentName: '*', // wildcard selector: apply this view to every instrument
      aggregationCardinalityLimit: 2000, // limit cardinality to 2000 streams per metric
    }),
  ],
});

This will limit cardinality by introducing an overflow metric stream. You will lose data on your metric when hitting the limit, though, so the recommendation to adapt the data passed to route still applies even when using such a cardinality limit.
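
For reference, a hedged sketch of how the requestHook from the setup code above could drop the query string before assigning route. It assumes the same imports as that file and is not a true route template (e.g. "/users/[id]"), so some cardinality may remain:

requestHook: (span, request) => {
  const url = (request as IncomingMessage)?.url;
  if (!url) return;
  // Strip query string and fragment; ideally this would map to the
  // Next.js route template (e.g. "/users/[id]") instead of the raw path.
  const route = url.split("?")[0].split("#")[0];
  if (route.endsWith(".json") || !route.includes(".")) {
    const rpcMetadata = getRPCMetadata(context.active());
    if (rpcMetadata) {
      if (rpcMetadata.type === RPCType.HTTP) {
        rpcMetadata.route = route;
      }
    } else {
      setRPCMetadata(context.active(), { type: RPCType.HTTP, route, span });
    }
  }
},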

@pichlermarc (Member)

(A way to test my theory above would be to run without the request hook for a while and see if that makes a difference. If it still runs out of memory, then something else may be the culprit.)

@masaya-fukazawa (Author)

Hi @pichlermarc.

Thanks, I'll try your suggestion and get back to you once I've tried it.
