Reliable JavaScript/SourceMap processing via DebugId
We want to make processing / SourceMap-ing of JavaScript stack traces more reliable.
To achieve this, we want to uniquely identify a (minified / deployed) JavaScript file using a DebugId.
The same DebugId also uniquely identifies the corresponding SourceMap.
That way it should be possible to _reliably_ look up the SourceMap corresponding to
a JavaScript file, which is necessary to have reliable SourceMap processing.
Swatinem committed Mar 23, 2023
1 parent 5fb8214 commit 0a6d0ca
Showing 2 changed files with 363 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -36,3 +36,4 @@ This repository contains RFCs and DACIs. Lost?
- [0071-continue-trace-over-process-boundaries](text/0071-continue-trace-over-process-boundaries.md): Continue trace over process boundaries
- [0072-kafka-schema-registry](text/0072-kafka-schema-registry.md): Kafka Schema Registry
- [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues
- [0081-sourcemap-debugid](text/0081-sourcemap-debugid.md): Reliable JavaScript/SourceMap processing via `DebugId`
362 changes: 362 additions & 0 deletions text/0081-sourcemap-debugid.md
@@ -0,0 +1,362 @@
- Start Date: 2023-03-21
- RFC Type: initiative
- RFC PR: https://github.com/getsentry/rfcs/pull/81
- RFC Status: draft

# Summary / Motivation

We want to make processing / SourceMap-ing of JavaScript stack traces more reliable.
To achieve this, we want to uniquely identify a (minified / deployed) JavaScript file using a `DebugId`.
The same `DebugId` also uniquely identifies the corresponding SourceMap.
That way it should be possible to _reliably_ look up the SourceMap corresponding to
a JavaScript file.

# Background

It is currently not possible to _reliably_ find the associated SourceMap for a
JavaScript file.

A JavaScript stack trace only points to the (minified / transformed) source file
by its URL, such as `https://example.com/file.min.js`, or `/path/to/local/file.min.js`.

The corresponding SourceMap is often referenced using a `sourceMappingURL` comment
at the end of that file. It is also possible to have a "hidden" SourceMap that is
not referenced in such a way, but is typically found by its filename `{js_filename}.map`.

However, it is not guaranteed that the SourceMap found in such a way actually
corresponds to the JavaScript file in which the error happened.

A classic example is caching:

1. An end-user is loading version `1` of `https://example.com/file.min.js`.
2. A new app version `2` is deployed.
3. The user experiences an error.
4. The SourceMap at `https://example.com/file.min.js.map` (version `2`) at this point in time does not correspond to
the code the user was running.

This problem is even worse at Sentry scale, as at any point in time, errors can come in that happened with arbitrary
versions of the deployed code, sometimes even involving multiple files which might be out-of-sync with each other.

To work around this problem, Sentry has used the combination of `release` and optional `dist` to better associate
JavaScript files from one release with SourceMaps uploaded to Sentry.

However, this solution is still not reliable. As mentioned above, even two files loaded in the end user's browser can
belong to different releases, due to caching or other reasons.

Using a `DebugId`, which uniquely associates the JavaScript file and its corresponding SourceMap, should make source-mapping
a lot more reliable.

# Supporting Data

TODO: please fill in the gaps here!

Sentry has used the `release + dist` solution for quite some time and found it inadequate.
A lot of events are not being resolved correctly due to these mismatches, and problems with source-mapping are very
common in customer-support interactions.

On the other hand, using a `DebugId` for symbolication of Native crashes and stack traces is working reliably both in
Sentry and in the wider native ecosystem. The Native and C# community has the concept of _Symbol Servers_, which can
serve any debug file based on its `DebugId`, which allows reliable symbolication for any release, at any point in time.

# Options Considered

To make `DebugId` work, we need to generate one, and associate it to both the JavaScript file, and its corresponding
SourceMap.

## The `DebugId` format

The `DebugId` should have the same format as a standard UUID. Specifically,
it should be a 128-bit (16-byte) value, formatted as a string using base-16 (hex) encoding like so:

`XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX`
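
As an illustration of this format, the following sketch formats 16 raw bytes as a `DebugId` string and checks whether a string has the expected shape. The helper names `formatDebugId` and `isDebugId` are hypothetical and not part of any existing Sentry API. Accepting both upper- and lowercase hex digits is an assumption, as this RFC does not specify casing.

```js
// Hypothetical helpers, for illustration only.
// Assumption: hex digits may be upper- or lowercase.
const DEBUG_ID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Formats exactly 16 bytes as `XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX`.
function formatDebugId(bytes) {
  if (bytes.length !== 16) {
    throw new RangeError("a DebugId needs exactly 16 bytes");
  }
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Returns true if `value` is a string in the 8-4-4-4-12 hex layout.
function isDebugId(value) {
  return typeof value === "string" && DEBUG_ID_PATTERN.test(value);
}
```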

## How to generate `DebugId`s?

There are two basic options for choosing a `DebugId`: making it completely random, or making it reproducible by deriving it from
a content hash.

### Based on JavaScript Content-hash

This creates a new `DebugId` by hashing the contents of the JavaScript file.

**pros**

- Is fully reproducible. The same JavaScript file will always have the same `DebugId`.
- Works well with existing caching solutions.

**cons**

- Increases overhead in server-side SourceMap processing, as one file can potentially be included in multiple _bundles_.
See [_What is an `ArtifactBundle`_](#what-is-an-artifactbundle) below.
- A difference in a source file might not be reflected in the JavaScript file. An example of this might be changes to
whitespace, comments, or code that was dead-code-eliminated by bundlers.
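
A minimal sketch of a content-hash based `DebugId` could look like the following. The choice of SHA-256, truncating the digest to its first 16 bytes, and the function name `debugIdFromContent` are all assumptions made for illustration. This RFC does not prescribe a hash function, and the sketch does not set any UUID version bits.

```js
const { createHash } = require("node:crypto");

// Hypothetical helper: derives a reproducible DebugId from file contents.
// Assumption: SHA-256, truncated to the first 16 bytes (128 bits).
function debugIdFromContent(content) {
  const digest = createHash("sha256").update(content).digest(); // 32 bytes
  const hex = digest.subarray(0, 16).toString("hex"); // 32 hex characters
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// The same content always yields the same DebugId,
// while any change to the content yields a different one.
const first = debugIdFromContent("someRandomJSCode();");
const second = debugIdFromContent("someRandomJSCode();");
const changed = debugIdFromContent("someOtherJSCode();");
console.log(first === second, first === changed);
```

The same helper could be applied to the contents of the SourceMap instead of the JavaScript file.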

### Based on SourceMap Content-hash

This creates a new `DebugId` by hashing the contents of the SourceMap file.

**pros**

- Generates a new `DebugId` for changes to source files that would otherwise not lead to changes in the JavaScript file.

**cons**

- Does lead to slightly more cache invalidation.

### Random `DebugId`

This option would create a new random `DebugId` for each file, on each build.

**pros**

- Simpler server-side SourceMap processing, as one `DebugId` is only included in a single _bundle_, and that one bundle
can serve multiple stack frames for multiple files of the same build.

**cons**

- Completely breaks the concept of _caching_, as every file is unique for every build.
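
For comparison, a random `DebugId` needs no input at all. The sketch below assumes a Node.js build environment and uses `crypto.randomUUID()`, which produces a random (version 4) UUID in the string format described above. The function name `newRandomDebugId` is hypothetical.

```js
const { randomUUID } = require("node:crypto");

// Hypothetical helper: a fresh random DebugId for each file, on each build.
function newRandomDebugId() {
  return randomUUID();
}

// Two builds of the very same file get two different DebugIds,
// which is what breaks caching.
const buildOne = newRandomDebugId();
const buildTwo = newRandomDebugId();
console.log(buildOne !== buildTwo);
```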

## How to inject the `DebugId` into the JavaScript file?

### `//# debugId` comment

We propose to add a new magic comment to the end of JavaScript files, similar to the existing `//# sourceMappingURL`
comment. It should be at the end of the file, preferably on the line right before the `sourceMappingURL` comment, as the
second line from the bottom.

It should look like this:

```js
someRandomJSCode();
//# debugId=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
//# sourceMappingURL=file.min.js.map
```
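
Reading this comment back could be sketched as follows. The function name `readDebugIdComment` is hypothetical, and only inspecting the last five lines of the file is an arbitrary assumption, not something this RFC specifies.

```js
// Hypothetical helper: extracts the DebugId from a `//# debugId=` comment
// near the end of a JavaScript source string.
// Assumption: the comment sits within the last five lines of the file.
function readDebugIdComment(source) {
  const lines = source.trimEnd().split(/\r?\n/);
  for (const line of lines.slice(-5).reverse()) {
    const match = /^\s*\/\/# debugId=([0-9a-fA-F-]{36})\s*$/.exec(line);
    if (match) {
      return match[1];
    }
  }
  return undefined; // no DebugId comment found
}

const example = [
  "someRandomJSCode();",
  "//# debugId=0a6d0ca0-1234-4abc-8def-0123456789ab",
  "//# sourceMappingURL=file.min.js.map",
].join("\n");
console.log(readDebugIdComment(example));
```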

### Runtime Detection / Resolution of `DebugId`

In a shiny utopian future, browsers would directly expose built-in APIs to programmatically access each frame of an `Error`'s stack.
This might include the absolute path, the line and column number, and the `DebugId`.
Though the reality of today is that each browser has its own text-based `Error.stack` format, which might even give
completely different line and column numbers across the different browsers.
No programmatic API exists today, and might never exist. At the very least, widespread support for this is years away.

It is therefore necessary to extract this `DebugId` through other means.

#### Reading the `//# debugId` comment when capturing Errors

Current JavaScript stack traces include the absolute path (called `abs_path`) of each stack frame. It should be possible
to load and inspect that file at runtime whenever an error happens.

**pros**

- Does not require injecting any _code_ into the JavaScript files.

**cons**

- Might incur some async fetching / IO when capturing an `Error`, though any `abs_path` in the stack trace should already be cached.

#### Add the `DebugId` to a global at load time

One solution here is to inject a small snippet of JS which will be executed when the JavaScript file is loaded, and adds
the `DebugId` to a global map.

An example snippet is here:

```js
!function(){try{var e="undefined"!=typeof window?window:"undefined"!=typeof global?global:"undefined"!=typeof self?self:{},n=(new Error).stack;n&&(e._sentryDebugIds=e._sentryDebugIds||{},e._sentryDebugIds[n]="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX")}catch(e){}}()
```

This snippet uses the complete `Error.stack` string as a key in a global map called `_sentryDebugIds`, with the `DebugId` as the value.
Further post-processing at the time of capturing an `Error` is required to extract the `abs_path` from that captured stack.

**pros**

- Does not require any async fetching at time of capturing an `Error`.

**cons**

- It does however require parsing of the `Error.stack` at time of capturing the `Error`.
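
The post-processing step could be sketched roughly as follows. The function name `debugIdsByAbsPath` is hypothetical, and the regular expression only covers the common Chromium-style (`at <url>:line:col`) and Firefox/Safari-style (`@<url>:line:col`) frame formats. A real SDK would use its existing, more thorough stack parser.

```js
// Hypothetical helper: turns the `stack -> DebugId` map filled by the injected
// snippet into an `abs_path -> DebugId` lookup.
function debugIdsByAbsPath(stackToDebugId) {
  const result = {};
  for (const [stack, debugId] of Object.entries(stackToDebugId)) {
    for (const line of stack.split("\n")) {
      // Matches "at https://host/file.js:1:2", "at fn (https://host/file.js:1:2)"
      // and "fn@https://host/file.js:1:2"; captures the part before ":line:col".
      const match = /(?:\(|@|at\s+)([^\s()@]+):\d+:\d+\)?\s*$/.exec(line);
      if (match) {
        result[match[1]] = debugId;
        break; // the first frame belongs to the file that ran the snippet
      }
    }
  }
  return result;
}

const lookup = debugIdsByAbsPath({
  "Error\n    at https://example.com/file.min.js:1:150":
    "11111111-1111-4111-8111-111111111111",
  "@https://example.com/other.min.js:1:10\n@https://example.com/other.min.js:1:99":
    "22222222-2222-4222-8222-222222222222",
});
console.log(lookup);
```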

An alternative implementation might use the [`import.meta.url`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/import.meta)
property. This would avoid capturing and post-processing an `Error.stack`, but does require usage of ECMAScript Modules.

```js
((globalThis._sentryDebugIds=globalThis._sentryDebugIds||{})[import.meta.url]="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX");
```

**pros**

- More compact snippet.
- No post-processing required.

**cons**

- Depends on usage of ECMAScript Modules.

## When to inject the `DebugId` into the JavaScript file?

Deploying JavaScript applications can range from a simple _copy files via ftp_
to a complex workflow like the following:

```mermaid
graph TD
transpile[Transpile source files] --> bundle[Bundle source files]
bundle --> minify[Minify bundled chunk]
minify --> fingerprint[Fingerprint minified chunks]
minify --> sentry[Upload release to Sentry]
fingerprint --> upload[Upload assets to CDN]
upload --> propagate[Wait for CDN assets to propagate]
fingerprint --> deploy[Deploy updated asset references]
propagate --> deploy
```

In this example, assets are _fingerprinted_, and once they have fully propagated
through a global CDN, they start to be referenced from the backend
service via HTML.

This may work with unique content-hash based filenames, and even use _fingerprinting_ and
[Subresource Integrity (SRI)](https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity).

An example may look like this, for a CDN-deployed and fingerprinted reference
to [katex](https://katex.org/docs/browser.html#starter-template):

```html
<script
defer
src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js"
integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4"
crossorigin="anonymous"
></script>
```

Not only is the deployment pipeline very complex, it can also involve a variety of tools with varying degrees of
integration between them.
The example `<script>` tag shown above might be generated as part of one integrated JS bundler tool, or it might be
generated by a Rust or Python backend, based on a supplied JSON file.

The checksums themselves might be directly output by a JS bundler tool, or they might be generated by a completely
different tool at another stage of the build pipeline.

Each application and build pipeline is unique, and there is an ever growing multitude of tools.
_Insert joke about a new JS bundler being created each week here._

It is therefore important that whatever comments and/or code we end up injecting into the final JavaScript assets is
being injected at the right point in this pipeline. Ideally it would be injected **before** fingerprinting happens, and
**before** any content-hash based naming happens.

As most JavaScript bundlers support automatic bundle-splitting, and will insert dynamic `import` or `require` statements
referencing those chunks by (fingerprinted) filename, a deep integration into those various bundlers might be needed.
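
To illustrate why the ordering matters, the following sketch injects the comment and then computes a Subresource Integrity value over the result. Both function names are hypothetical, the handling of line endings is simplified to `\n`, and this is not how any existing tool is known to implement injection. The point is only that the fingerprint has to be computed over the bytes that already contain the `DebugId`.

```js
const { createHash } = require("node:crypto");

// Hypothetical helper: inserts the `//# debugId` comment right before a
// trailing `//# sourceMappingURL` comment, or appends it if there is none.
function injectDebugIdComment(source, debugId) {
  const comment = `//# debugId=${debugId}`;
  const lines = source.replace(/\s+$/, "").split("\n");
  const last = lines[lines.length - 1];
  if (last.startsWith("//# sourceMappingURL=")) {
    lines.splice(lines.length - 1, 0, comment);
  } else {
    lines.push(comment);
  }
  return lines.join("\n") + "\n";
}

// Hypothetical helper: an SRI value as used in `<script integrity="...">`.
function sriHash(content) {
  return "sha384-" + createHash("sha384").update(content).digest("base64");
}

const original = "someRandomJSCode();\n//# sourceMappingURL=file.min.js.map\n";
const injected = injectDebugIdComment(
  original,
  "0a6d0ca0-1234-4abc-8def-0123456789ab"
);

// A fingerprint taken before injection no longer matches the deployed file.
console.log(sriHash(original) === sriHash(injected));
```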

### Injection via `sentry-cli inject`

With this, injection would happen with a new command, `sentry-cli inject`. It will be the responsibility of the developer
to call this at the appropriate time depending on their unique build pipeline.

**pros**

- Gives full control for build pipelines that involve a heterogeneous set of tools and stages.

**cons**

- Requires manually using this command.
- Does not work with bundlers that integrate fingerprinting.

### Injection at `sentry-cli upload` time

In this scenario, injection happens at the time of `sentry-cli upload`, and will also modify the files at that time.

**pros**

- Makes sure that assets uploaded to Sentry have a `DebugId`.
- No additional command and invocation needed.

**cons**

- Does not work with bundlers that integrate fingerprinting.
- Does not work in build pipelines where `sentry-cli upload` is not in the main deployment path.

### Injection via bundler plugins

Here, we would build `DebugId` injection right into the various JavaScript bundlers. This can happen with a third-party
plugin at first, and might move into the core bundler packages once there is enough community buy-in for `DebugId`s.

Each bundler is unique though, and has different hooks at different stages of its internal pipeline. Some bundlers
might not have the necessary hooks at the necessary stage at all.

#### Rollup

Rollup has a very comprehensive plugin system, with good documentation about the various hooks and the internal pipeline:
https://rollupjs.org/plugin-development/#output-generation-hooks

According to the above diagram, the appropriate plugin hook to use might be the
[`renderChunk`](https://rollupjs.org/plugin-development/#renderchunk) hook, which allows
access and modification of a chunk's `code` and `map` (SourceMap) output.
This hook runs before the `augmentChunkHash` and `generateBundle` hooks which are responsible for fingerprinting and
generating the _final_ output for each chunk.
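
As a starting point for that experimentation, a plugin using `renderChunk` might look like the sketch below. It is untested against a real Rollup build. The plugin name, the use of a random `DebugId`, and returning `map: null` on the assumption that appending a trailing comment leaves the existing mappings valid are all assumptions. The sketch also does not yet write the `DebugId` into the SourceMap itself.

```js
const { randomUUID } = require("node:crypto");

// Hypothetical Rollup plugin sketch: appends a `//# debugId` comment to each
// chunk in the `renderChunk` hook, before fingerprinting happens.
function debugIdPlugin() {
  return {
    name: "debug-id-sketch",
    renderChunk(code) {
      // Assumption: a random DebugId per chunk; a content hash would also work.
      const debugId = randomUUID();
      return {
        code: `${code}\n//# debugId=${debugId}`,
        // Assumption: only a trailing line is added, so no new mappings are needed.
        map: null,
      };
    },
  };
}

// The hook can be exercised directly, without running Rollup:
const rendered = debugIdPlugin().renderChunk("someRandomJSCode();");
console.log(rendered.code);
```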

TODO: further investigation and experimentation for this is needed

#### Webpack

Webpack documentation for plugin hooks is not as extensive, and there is no broad overview of the internal pipeline and
phases. There is a general overview of all the `Compilation` hooks though:
https://webpack.js.org/api/compilation-hooks/

It might be possible to use the [`processAssets`](https://webpack.js.org/api/compilation-hooks/#processassets) hook
for this purpose. Documentation mentions the `PROCESS_ASSETS_STAGE_DEV_TOOLING` phase which is responsible for
extracting SourceMaps, or the `PROCESS_ASSETS_STAGE_OPTIMIZE_HASH` which looks to be responsible for generating the
final fingerprint of an asset.

TODO: further investigation and experimentation for this is needed

#### TODO: other popular bundlers and build tools

## Injecting the `DebugId` into the SourceMap

This is a less controversial part, as SourceMaps are in general not distributed to production, and are less likely to
be fingerprinted or integrity-checked. They are also plain JSON, making it trivial to inject additional fields.
We propose to add a new JSON field to the root of the SourceMap object called `debugId`.
This new field should encode the `DebugId` as a plain string.
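
Since SourceMaps are plain JSON, this injection could be sketched as follows. The function name `addDebugIdToSourceMap` is hypothetical.

```js
// Hypothetical helper: adds a root-level `debugId` field to a SourceMap,
// given as a JSON string, and returns the updated JSON string.
function addDebugIdToSourceMap(sourceMapJson, debugId) {
  const sourceMap = JSON.parse(sourceMapJson);
  sourceMap.debugId = debugId;
  return JSON.stringify(sourceMap);
}

const updated = addDebugIdToSourceMap(
  JSON.stringify({
    version: 3,
    file: "file.min.js",
    sources: ["file.js"],
    names: [],
    mappings: "",
  }),
  "0a6d0ca0-1234-4abc-8def-0123456789ab"
);
console.log(JSON.parse(updated).debugId);
```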

# Drawbacks

The main drawback is that this might feel like an invasive change to the JavaScript ecosystem. It is a huge implementation
burden, and might not be received positively by either customers or the wider JS tools ecosystem.

Especially injecting a piece of JavaScript into every production asset might alienate some users.

The effectiveness and success of this initiative needs to be proved out first, and is not guaranteed.

# Unresolved questions

- ~~Why do we call the new SourceMap field `debug_id` and not `debugId`?
All existing fields in SourceMaps are camelCase, and so is the general convention in the JS ecosystem.~~

# Implementation

- TODO: link to some implementation breadcrumbs and PRs
- TODO: change the existing SourceMap implementation to use a camelCased `debugId` instead of the snake_cased `debug_id` field.

---

# Appendix

## What is an `ArtifactBundle`

Sentry bundles up all the assets of one release / build into a so-called `ArtifactBundle` (also called `SourceBundle`, or `ReleaseBundle`).

This is a special ZIP file which includes all the minified / production JavaScript files, their corresponding SourceMap,
and the original source files as referenced by the SourceMaps in whatever format (TypeScript or other).

It also has a `manifest.json`, which has more metadata per file, like the type of a file, its `DebugId`, and an optional
`SourceMap` reference from minified files to their SourceMap.

**pros**

- Customers naturally think in _releases_, so having one archive per release is good.
- Only needing to download / cache / process a single file for one release can be more efficient.

**cons**

- Does not work well with content-hash based `DebugId`s, as one `DebugId` can appear in a multitude of archives.
- Feels like a workaround for inefficiencies in other parts of the processing pipeline when dealing with many smaller files.
