Caching for fetched/url datasets #7316
-
In a situation similar to #4146, I would like to be able to embed two or more charts that rely on the same fetched datasets/URLs while minimizing network data exchange. For example, consider these two charts that use the same datasets:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"url": "data/unemployment.tsv"},
  "transform": [{"calculate": "slice(datum.id, 0, 2)", "as": "state"}],
  "mark": "bar",
  "encoding": {
    "x": {"field": "rate", "type": "quantitative", "aggregate": "mean"},
    "y": {"field": "state", "type": "ordinal"}
  }
}
```

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"url": "data/us-10m.json", "format": {"type": "topojson", "feature": "counties"}},
  "transform": [{"lookup": "id", "from": {
    "data": {"url": "data/unemployment.tsv"}, "key": "id", "fields": ["rate"]
  }}],
  "projection": {"type": "albersUsa"},
  "mark": "geoshape",
  "encoding": {"color": {"field": "rate", "type": "quantitative"}}
}
```
So each chart fetches its own data with a parallel request. This could easily scale to a series of many requests to the same resources if more similar charts were added to the page. To avoid the duplicate requests, I know I could fetch the data once and pass it to the charts using the View API, but I don't want to, because I want each chart to encapsulate its own transformation logic and be self-sufficient, for reusability and documentation purposes. So I ended up building the following caching mechanism (a bit crude, but it works). Given a spec, it finds each data object requesting a URL, fetches the raw data through a shared cache, and replaces the URL with inline values:

```js
import { loader as loaderFactory, read } from "vega-loader";
import { iterateDeep } from "./helpers";

const loader = loaderFactory();
const cache = {};

/**
 * Add the loader promise to the cache if needed and return the cached dataset promise.
 * @param {string} url The dataset url to fetch
 * @returns {Promise<string>} A promise of the unparsed dataset from cache
 */
// REF: https://github.com/vega/vega-lite/issues/4146
function syncCache(url) {
  if (cache[url] == null) cache[url] = loader.load(url); // Set cached data
  return cache[url];
}

/**
 * Convert a data object into an inline one holding cached data values.
 * @param {{url?: string, format?: object, values?: string}} dataObj The data object to convert (and cache)
 * @returns {Promise<void>} A promise of conversion success
 */
function convertDataObj(dataObj) {
  return syncCache(/** @type {string} */ (dataObj.url)).then(rawData => {
    // BUG: In some cases, e.g. a lookup transform on JSON data, Vega doesn't parse raw data using format
    const parsedData = read(rawData, dataObj.format); // TODO: Use Vega internals to parse raw data
    delete dataObj.url;
    delete dataObj.format; // TODO: Keep format object
    dataObj.values = parsedData;
  });
}

/**
 * Given a Vega or Vega-Lite spec, convert all datasets with urls into cached inline ones.
 * @param {vegaSpec | vegaLiteSpec} chartSpec The spec with urls to fetch
 * @returns {Promise<vegaSpec | vegaLiteSpec>} A promise of the converted spec with inline data
 */
export function cacheSpec(chartSpec) {
  const newSpec = JSON.parse(JSON.stringify(chartSpec)); // Clone spec
  const promises = []; // Collect all the promises to sync all data replacements
  iterateDeep(newSpec, (key, obj) => {
    if (key === "url") {
      promises.push(
        // Replace each data object with inline data values
        convertDataObj(obj)
      );
    }
  });
  // Sync all promises and return the spec with inline data values
  return Promise.all(promises).then(() => newSpec);
}
```

Now I can embed the charts, avoiding multiple calls to the same URLs:

```js
cacheSpec(barSpec).then(cachedSpec => embed("#bar", cachedSpec, chartConfig));
cacheSpec(mapSpec).then(cachedSpec => embed("#map", cachedSpec, chartConfig));
```

And immediately I thought that it would be possible to extend the data definition with a `cache` property, to enable caching with:

```json
{"data": {"url": "data/unemployment.tsv", "cache": true}}
```

or disable it with:

```json
{"data": {"url": "data/us-10m.json", "cache": false, "format": {"type": "topojson", "feature": "counties"}}}
```

If all network requests were handled within a single instance of the Vega loader, which I believe is already happening, I think adding a promise-based caching mechanism might not be too complicated.
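For completeness: the `iterateDeep` helper imported from `./helpers` above is not shown in the snippet. A minimal sketch of what it could look like (an assumption on my part, not the thread author's code — it only needs to visit every key together with its enclosing object, recursively):

```javascript
// Hypothetical sketch of the iterateDeep helper: recursively walk a nested
// object/array and call the visitor with each key and its enclosing object.
function iterateDeep(node, visit) {
  if (node === null || typeof node !== "object") return;
  for (const key of Object.keys(node)) {
    visit(key, node);              // Visit the key within its parent object
    iterateDeep(node[key], visit); // Recurse into the nested value
  }
}
```

With a helper like this, `cacheSpec` collects a `convertDataObj` promise for every data object declaring a `url`, wherever it appears in the spec (top-level data, layers, lookup transforms, and so on).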
Replies: 3 comments
-
Thank you for the detailed feature request. However, I will close this request for a few reasons that I hope make sense. Please ask about anything that's unclear.
Again, I appreciate the detailed feature request.
-
Thank you for the immediate and accurate reply.

The problem with the solution above is that the spec has to be cloned every time, and it gets polluted with inline values. Finally, I followed this comment vega/vega#2095 (comment) and refactored all the code to use a custom loader. The result is a much cleaner and lighter solution, without reinventing the wheel:

```js
import { loader } from "vega";

const cache = {};
const cacheLoader = loader();
const originalHttp = cacheLoader.http;

/**
 * Wrap the original http method to use the cache.
 * See {@link https://github.com/vega/vega/tree/master/packages/vega-loader#load_file}.
 * Add the loader promise to the cache if needed and return the cached dataset promise.
 */
cacheLoader.http = function cacheLoaderHttp(url, options) {
  if (cache[url] == null) {
    cache[url] = originalHttp.call(this, url, options); // Set cached data
  }
  return cache[url];
};
```

This allows us to enable the cache by doing:

```js
const chartConfig = {
  loader: cacheLoader
};

embed("#bar", barSpec, chartConfig);
embed("#map", mapSpec, chartConfig);
```

It would be great if you could add this example (or a similar one) to the documentation, to highlight this feature, which could also serve other use cases such as adding dynamic parameters or custom headers to http requests.
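As an illustration of the custom-headers use case mentioned above (a hypothetical sketch, not part of the solution in this thread): the same http-wrapping pattern can inject request options. Here `withAuthHeader` and the bearer token are made-up names for illustration, and `baseLoader` stands in for the object returned by Vega's `loader()`:

```javascript
// Hypothetical sketch: wrap a loader's http method to add an Authorization
// header to every request, merging with any headers the caller passed in.
function withAuthHeader(baseLoader, token) {
  const originalHttp = baseLoader.http;
  baseLoader.http = function (url, options = {}) {
    const merged = {
      ...options,
      headers: { ...(options.headers || {}), Authorization: `Bearer ${token}` }
    };
    return originalHttp.call(this, url, merged);
  };
  return baseLoader;
}
```

The wrapped loader would then be passed to embed the same way as `cacheLoader` above, e.g. `embed("#bar", barSpec, { loader: withAuthHeader(loader(), token) })`.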
-
Thank you for writing up your solution here. I think these issues are actually a good resource that many people use. I converted this issue into a discussion at https://github.com/vega/vega-lite/discussions. If you feel we need this in the docs, please send a pull request.