Explore "nanoarrow-js" #41

Closed
kylebarron opened this issue Aug 30, 2023 · 5 comments

@kylebarron
Owner

Arrow JS is a big library! It's not really a tenable dependency for a very bundle-size-conscious library or application.

This is actually the same story as in C/C++/Python. The C++ Arrow library got so big that many projects didn't want to depend on it. That's why nanoarrow was created: as a super minimal library that works with the C Data Interface representation of Arrow arrays.

I think there's definitely potential for a low level Arrow library in JS, that hews very closely to the C Data Interface.

Data structures would be essentially the JS counterpart of C Data Interface structs. All array data (no matter the logical type) would be a Uint8Array, that could later be viewed as another type or as strings.
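To make the idea concrete, here is a minimal sketch of what such a structure might look like. Nothing here is an existing API: the names `ArrowArray`, `viewInt32`, and `readUtf8` are hypothetical, and the fields only loosely mirror the C Data Interface `ArrowArray` struct. The buffer order for strings (validity, offsets, data) follows the Arrow columnar layout.

```typescript
// Hypothetical JS counterpart of the C Data Interface `ArrowArray` struct.
interface ArrowArray {
  length: number;
  nullCount: number;
  offset: number;
  // Every buffer is raw bytes, no matter the logical type.
  buffers: (Uint8Array | null)[];
  children: ArrowArray[];
  dictionary: ArrowArray | null;
}

// Reinterpret raw bytes as Int32 values without copying.
// Throws a RangeError if `bytes.byteOffset` is not a multiple of 4.
function viewInt32(bytes: Uint8Array): Int32Array {
  return new Int32Array(bytes.buffer, bytes.byteOffset, bytes.byteLength >> 2);
}

// Read element `i` of a UTF-8 string array (nulls are ignored in this sketch).
function readUtf8(arr: ArrowArray, i: number): string {
  const offsets = viewInt32(arr.buffers[1]!);
  const data = arr.buffers[2]!;
  const start = offsets[arr.offset + i];
  const end = offsets[arr.offset + i + 1];
  return new TextDecoder().decode(data.subarray(start, end));
}

// Example: the two strings "ab" and "cde" stored as offsets + bytes.
const strings: ArrowArray = {
  length: 2,
  nullCount: 0,
  offset: 0,
  buffers: [
    null, // no validity bitmap: no nulls
    new Uint8Array(new Int32Array([0, 2, 5]).buffer),
    new TextEncoder().encode("abcde"),
  ],
  children: [],
  dictionary: null,
};
```

With this layout, `readUtf8(strings, 1)` decodes bytes 2 through 5 of the data buffer.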

Because array data are all Uint8Arrays, it means an array could either be "owned" in JS memory or "viewed" from wasm memory. So the memory safety wouldn't be great, but this is JS after all!
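The difference between "owned" and "viewed" can be sketched as follows. This is only an illustration: `viewBytes` and `ownBytes` are hypothetical helper names, and a plain `ArrayBuffer` stands in for the buffer of a `WebAssembly.Memory`.

```typescript
// Stand-in for wasm linear memory (in practice, the `buffer` of a WebAssembly.Memory).
const wasmMemory = new ArrayBuffer(64);

// "Viewed": no copy, the Uint8Array aliases wasm memory. It silently changes
// if wasm overwrites that region, and the underlying buffer is detached if
// the wasm memory grows.
function viewBytes(memory: ArrayBuffer, ptr: number, len: number): Uint8Array {
  return new Uint8Array(memory, ptr, len);
}

// "Owned": copy the bytes into a fresh buffer managed by the JS garbage collector.
function ownBytes(memory: ArrayBuffer, ptr: number, len: number): Uint8Array {
  return new Uint8Array(memory, ptr, len).slice();
}

const viewed = viewBytes(wasmMemory, 8, 4);
const owned = ownBytes(wasmMemory, 8, 4);

// Simulate the wasm side overwriting the region after the arrays were created.
new Uint8Array(wasmMemory)[8] = 42;
// `viewed[0]` now reads 42, while `owned[0]` still holds the original 0.
```

Both results have the same type, which is what makes the approach flexible and also why the memory safety is weak: nothing in the type tells the caller whether the bytes can change underneath them.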

It would make sense to have toArrowJS and fromArrowJS functions that convert to and from Arrow JS arrays/Data instances.

An emphasis should be placed on a functional API instead of a class API to keep bundle size low.
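As a rough sketch of what a functional style could mean here: plain data objects plus free functions, so that a bundler can drop any function an application never calls. The `NanoArray` shape and the function names are hypothetical. The bit ordering follows the Arrow validity bitmap layout (least-significant bit first).

```typescript
// Hypothetical minimal array shape: plain data, no methods.
interface NanoArray {
  length: number;
  offset: number;
  validity: Uint8Array | null; // null means the array has no nulls
}

// Bit `i` of an Arrow validity bitmap lives in byte `i >> 3` at position `i & 7`.
function isValid(arr: NanoArray, i: number): boolean {
  if (arr.validity === null) return true;
  const bit = arr.offset + i;
  return (arr.validity[bit >> 3] & (1 << (bit & 7))) !== 0;
}

// A second, independent helper. In a real module each function would be a
// separate export, so importing only `isValid` would not pull this one in.
function countNulls(arr: NanoArray): number {
  let nulls = 0;
  for (let i = 0; i < arr.length; i++) {
    if (!isValid(arr, i)) nulls++;
  }
  return nulls;
}

// Three elements with the middle one null: bitmap 0b00000101.
const example: NanoArray = {
  length: 3,
  offset: 0,
  validity: new Uint8Array([0b00000101]),
};
```

Here `countNulls(example)` visits bits 0 through 2 and finds one unset bit.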

Ideally, this would allow high-performance programs to rely on Arrow memory without fear of a huge bundle size impact! But this would be complementary to Arrow JS, not a competitor.

@kylebarron
Owner Author

kylebarron commented Aug 30, 2023

Look at the zarrita.js implementation to consider TypeScript typing for this approach. Seems like type guards would be very useful here.

let arrayData: ArrowArray = ...;

function isStringArray(data: ArrowArray): data is StringArray 

Keep in mind, though, that TypeScript's type system is structural: if StringArray doesn't change the actual interface, a StringArray object will type-check the same as a normal array.
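One way around that caveat is to give each array type a distinguishing field, so the types differ structurally and the guard has something to check at runtime. The sketch below is an assumption, not taken from zarrita.js: the `typeId` field, the interface names, and `describeArray` are all hypothetical.

```typescript
interface ArrowArrayBase {
  length: number;
  buffers: (Uint8Array | null)[];
}

// The literal `typeId` makes each interface structurally distinct.
interface StringArray extends ArrowArrayBase {
  typeId: "utf8";
}

interface Int32ArrowArray extends ArrowArrayBase {
  typeId: "int32";
}

type ArrowArray = StringArray | Int32ArrowArray;

// Type guard: narrows `data` to StringArray when it returns true.
function isStringArray(data: ArrowArray): data is StringArray {
  return data.typeId === "utf8";
}

function describeArray(data: ArrowArray): string {
  if (isStringArray(data)) {
    // Inside this branch `data` is typed as StringArray.
    return `utf8 array of length ${data.length}`;
  }
  // Here the compiler has narrowed `data` to Int32ArrowArray.
  return `int32 array of length ${data.length}`;
}
```

Without a field like `typeId`, the two interfaces would be identical to the compiler and the guard could not be implemented meaningfully.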

@domoritz
Contributor

domoritz commented Dec 6, 2023

Arrow JS is a larger library but it is super treeshakeable. So if you don't need IPC reading/writing for example, you can get a much smaller bundle. If you just import one type, it can be tiny.

@kylebarron
Owner Author

As a disclaimer, I'm horrible at bundling, so it's very possible I'm doing something wrong, but in geoarrow/geoarrow-js#20 I found that the apache-arrow import wasn't getting tree-shaken by esbuild.

In particular, comparing the tree shaking output of

import { BufferType, Type } from "apache-arrow/enum";
import { Data } from "apache-arrow/data";
import { Vector } from "apache-arrow/vector";
import { Field } from "apache-arrow/schema";

with

import { BufferType, Type } from "apache-arrow";
import { Data } from "apache-arrow";
import { Vector } from "apache-arrow";
import { Field } from "apache-arrow";

Just that one change (from the latter to the former) reduced the minified earcut worker from 205kb to 74kb.

I was suspicious because originally the unminified worker output from the latter still had IPC read/write code.

In the end, because I knew in this worker I was only using attributes of the Data class and no methods, I avoided any arrow import from the worker at all and got the compressed size down to 6kb. But for workers that return Arrow data, that won't be possible.
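A sketch of the attribute-only approach, under assumptions: the worker declares a local structural type naming just the fields it reads, so no runtime Arrow import is needed. The `Float64DataLike` interface and `boundsOf` function are hypothetical and are not the actual earcut worker code. The field names `length` and `values` are modeled on Arrow JS's `Data` class, and the sketch assumes `values` starts at the first logical element.

```typescript
// Hypothetical structural subset of an Arrow JS `Data` object.
// Any object with these fields type-checks, so nothing from apache-arrow
// has to be bundled into the worker.
interface Float64DataLike {
  length: number;
  values: Float64Array;
}

// Example worker-side computation that only touches plain attributes.
function boundsOf(data: Float64DataLike): [number, number] {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < data.length; i++) {
    const v = data.values[i];
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return [min, max];
}

// A plain object stands in for the data received by the worker.
const received: Float64DataLike = {
  length: 3,
  values: new Float64Array([3, 1, 2]),
};
const bounds = boundsOf(received);
```

This only works in one direction: a worker that has to construct and return Arrow data still needs the real classes.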

So, naively, it seems to enable tree shaking I have to ensure imports are from the internal file? Or maybe I'm using esbuild wrong 🤷‍♂️

In any case, as I mentioned here, I'm already spread too thin and don't think I have the bandwidth to make a stable nanoarrow-js right now.

@domoritz
Contributor

domoritz commented Dec 6, 2023

I think esbuild doesn't treeshake. We have a bundle test in arrow that compares different bundlers.

$ yarn test:bundle
$ gulp bundle
[13:00:47] Using gulpfile ~/Code/arrow/js/gulpfile.js
[13:00:47] Starting 'bundle'...
[13:00:47] Starting 'bundle:clean'...
[13:00:47] Finished 'bundle:clean' after 12 ms
[13:00:47] Starting 'bundle:esbuild'...
[13:00:48] field-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] makeTable-bundle.js: 197.89 kB (gzipped: 46.45 kB)
[13:00:48] makeVector-bundle.js: 197.81 kB (gzipped: 46.42 kB)
[13:00:48] schema-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] table-bundle.js: 196.29 kB (gzipped: 46.14 kB)
[13:00:48] tableFromArrays-bundle.js: 199.53 kB (gzipped: 47.07 kB)
[13:00:48] tableFromIPC-bundle.js: 197.54 kB (gzipped: 46.4 kB)
[13:00:48] vector-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] vectorFromArray-bundle.js: 199.44 kB (gzipped: 47.04 kB)
[13:00:48] Finished 'bundle:esbuild' after 197 ms
[13:00:48] Starting 'bundle:rollup'...
[13:00:53] table-bundle.js: 88.28 kB (gzipped: 19.07 kB)
[13:00:53] vectorFromArray-bundle.js: 101.59 kB (gzipped: 21.51 kB)
[13:00:53] vector-bundle.js: 66.66 kB (gzipped: 14.99 kB)
[13:00:53] schema-bundle.js: 13.79 kB (gzipped: 3.54 kB)
[13:00:53] field-bundle.js: 799 B (gzipped: 367 B)
[13:00:53] tableFromIPC-bundle.js: 195.94 kB (gzipped: 40.75 kB)
[13:00:53] makeTable-bundle.js: 91.71 kB (gzipped: 19.61 kB)
[13:00:53] makeVector-bundle.js: 74.75 kB (gzipped: 16.03 kB)
[13:00:53] tableFromArrays-bundle.js: 112.72 kB (gzipped: 24.4 kB)
[13:00:53] Finished 'bundle:rollup' after 4.88 s
[13:00:53] Starting 'bundle:webpack'...
[13:00:55] field-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] makeTable-bundle.js: 74.28 kB (gzipped: 17.84 kB)
[13:00:55] makeVector-bundle.js: 60.11 kB (gzipped: 14.52 kB)
[13:00:55] schema-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] table-bundle.js: 72.61 kB (gzipped: 17.53 kB)
[13:00:55] tableFromArrays-bundle.js: 91.64 kB (gzipped: 22.31 kB)
[13:00:55] tableFromIPC-bundle.js: 167.49 kB (gzipped: 37.04 kB)
[13:00:55] vector-bundle.js: 58.48 kB (gzipped: 14.2 kB)
[13:00:55] vectorFromArray-bundle.js: 83.03 kB (gzipped: 20 kB)
[13:00:55] Finished 'bundle:webpack' after 2.67 s

I filed an issue about it at evanw/esbuild#1922 but it sounds like esbuild will expect annotations so we should add those if esbuild is becoming popular.
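For illustration, one kind of annotation that esbuild, Rollup, and Terser all recognize is the `/* @__PURE__ */` comment placed before a call expression. It tells the bundler the call has no side effects and can be removed when its result is unused. Whether this is the annotation Arrow JS would need is not established in this thread, and `makeByteWidths` below is a hypothetical function, not Arrow JS code.

```typescript
// Hypothetical module-level initializer. A bundler cannot easily prove that
// an arbitrary call is free of side effects.
function makeByteWidths(): Map<string, number> {
  return new Map([
    ["int32", 4],
    ["float64", 8],
  ]);
}

// Without the annotation, a bundler has to keep this top-level call even if
// `byteWidths` is never imported. With it, the call may be dropped when unused.
const byteWidths = /* @__PURE__ */ makeByteWidths();
```

A related, coarser mechanism is the `sideEffects` field in `package.json`, which marks whole files as safe to drop when none of their exports are used.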

I have pretty good experiences with rollup. Would be awesome if the problem just went away with a better bundler so you don't have to rewrite the Arrow APIs.

@kylebarron
Owner Author

I'm going to close this because I don't have the maintenance bandwidth to try and implement data structures for Arrow outside of Arrow JS, and I don't have a use case at this point where Arrow JS's bundle size is a deal-breaker.
