Add bibtex to docs (#2094)
And update the landing page
gatesn authored Jan 28, 2025
1 parent 22fb4d0 commit 07b37e4
Showing 8 changed files with 368 additions and 47 deletions.
62 changes: 41 additions & 21 deletions README.md
@@ -6,67 +6,83 @@
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/vortex-array)](https://pypi.org/project/vortex-array/)

> [!TIP]
> Check out the [Docs](https://spiraldb.github.io/vortex/docs/) or jump straight into the [Getting Started Guide](https://spiraldb.github.io/vortex/docs/quickstart.html)
> Check out the [Docs](https://docs.vortex.dev/)
Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache
Arrow arrays
in-memory, on-disk, and over-the-wire.

Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and
scans (2-10x faster),
while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device
decompression on GPUs.

Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
extremely fast, & batteries-included.

> [!CAUTION]
> This library is still under rapid development and is a work in progress!
>
> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking ways,
> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking
> ways,
> and we cannot yet guarantee correctness in all cases.
The major features of Vortex are:

* **Logical Types** - a schema definition that makes no assertions about physical layout.
* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from Apache Arrow arrays.
* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible encodings,
the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are implemented
* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from
Apache Arrow arrays.
* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible
encodings,
the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are
implemented
as extensions. While arbitrary encodings can be implemented as extensions, we have intentionally chosen a small set
of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access reads,
of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access
reads,
and (in the future) decompression on GPUs.
* **Cascading Compression** - data can be recursively compressed with multiple nested encodings.
* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can trivially be used instead.
* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can
trivially be used instead.
* **Compute** - basic compute kernels that can operate over encoded data (e.g., for filter pushdown).
* **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
These are available to compute kernels as well as to the compressor.
* **Serialization** - Zero-copy serialization of arrays, both for IPC and for file formats.
* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed array data.
* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed
array data.
Optimized for random access reads and extremely fast scans; an aspiring successor to Apache Parquet.
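The "Cascading Compression" item above can be illustrated with a minimal pure-Python sketch. It does not use the Vortex API: `dict_encode`, `rle_encode`, `cascade_encode`, and `cascade_decode` are hypothetical helper names, and dictionary encoding followed by run-length encoding is only one assumed example of nesting two lightweight encodings.

```python
from itertools import groupby


def dict_encode(values):
    """Map each distinct value to a small integer code (first-seen order)."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [v for v, n in runs for _ in range(n)]


def cascade_encode(values):
    """Dictionary-encode the values, then run-length encode the codes."""
    dictionary, codes = dict_encode(values)
    return dictionary, rle_encode(codes)


def cascade_decode(dictionary, runs):
    """Undo both layers: expand the runs, then look up each code."""
    return [dictionary[code] for code in rle_decode(runs)]


column = ["us", "us", "us", "de", "de", "us", "us"]
dictionary, runs = cascade_encode(column)
assert cascade_decode(dictionary, runs) == column
```

The inner encoding operates on the output of the outer one (the integer codes), which is the sense in which the encodings are nested.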

## Overview: Logical vs Physical

One of the core design principles in Vortex is strict separation of logical and physical concerns.

For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical
encoding
(the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct
Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and
`chunked`) that are useful building blocks for other encodings. The included extension encodings are mostly designed
to model compressed in-memory arrays, such as run-length or dictionary encoding.

Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays. Choices
Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays.
Choices
about which encodings to use or how to logically chunk data are left up to the `Compressor` implementation.

One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the
One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data
within the
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it
can choose
to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is
constant
across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
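As a rough sketch of the per-column chunking idea described above, the following pure-Python function plans chunk boundaries for a single column. It is not the Vortex `Compressor`: `plan_chunks` and `max_chunk_bytes` are hypothetical names, and the uncompressed UTF-8 width of each value (or an assumed 8 bytes for non-strings) stands in for the compressed size a real strategy would consult.

```python
def plan_chunks(column, max_chunk_bytes):
    """Return (start, stop) row ranges for one column.

    A constant column becomes a single chunk; any other column is split
    whenever adding the next value would exceed max_chunk_bytes.
    """
    n = len(column)
    if n == 0:
        return []
    if all(v == column[0] for v in column):
        return [(0, n)]

    chunks, start, size = [], 0, 0
    for i, v in enumerate(column):
        # Assumed cost model: UTF-8 length for strings, 8 bytes otherwise.
        width = len(v.encode("utf-8")) if isinstance(v, str) else 8
        if i > start and size + width > max_chunk_bytes:
            chunks.append((start, i))
            start, size = i, 0
        size += width
    chunks.append((start, n))
    return chunks


constant_plan = plan_chunks([7] * 1000, max_chunk_bytes=64)
string_plan = plan_chunks(["aaaa", "bbbb", "cccc", "dddd", "eeee"], max_chunk_bytes=8)
```

Here the constant column is planned as one chunk covering all rows, while the string column is split at every 8-byte boundary, so the two columns of the same table end up with different chunk layouts.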

In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly
into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.

## Components
@@ -239,7 +255,8 @@ Licensed under the Apache License, Version 2.0 (the "License").
## Governance

Vortex is and will remain an open-source project. Our intent is to model its governance structure after the
[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software Foundation.
[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software
Foundation.
Expect more details on this in Q4 2024.

## Acknowledgments 🏆
@@ -252,7 +269,8 @@ In particular, the following academic papers have strongly influenced developmen
* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
Proc. ACM Manag. Data 1, 2, Article 118 (June 2023), 14 pages.
* Azim Afroozeh and Peter Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar
* Azim Afroozeh and Peter
Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar
Code](https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf). PVLDB, 16(9): 2132 - 2144, 2023.
* Peter Boncz, Thomas Neumann, and Viktor Leis. [FSST: Fast Random Access String
Compression](https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf).
@@ -270,10 +288,12 @@ Additionally, we benefited greatly from:

* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project
by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).
* the [Velox](https://github.com/facebookincubator/velox) and [Nimble](https://github.com/facebookincubator/nimble) projects,
* the [Velox](https://github.com/facebookincubator/velox) and [Nimble](https://github.com/facebookincubator/nimble)
projects,
and discussions with their maintainers.

Thanks to all of the aforementioned for sharing their work and knowledge with the world! 🚀
12 changes: 10 additions & 2 deletions docs/_static/style.css
@@ -1,3 +1,11 @@
html .pst-navbar-icon {
font-size: 1.5rem;
h2 {
font-size: 1.75rem;
}

h3 {
font-size: 1.5rem;
}

h4 {
font-size: 1.25rem;
}
5 changes: 5 additions & 0 deletions docs/conf.py
@@ -24,6 +24,7 @@
"sphinx.ext.napoleon",
"sphinx_copybutton",
"sphinx_inline_tabs",
"sphinxcontrib.bibtex",
"sphinxext.opengraph",
]

@@ -70,3 +71,7 @@

ogp_site_url = "https://docs.vortex.dev"
ogp_image = "https://docs.vortex.dev/_static/vortex_spiral_logo.svg"

# -- Options for Sphinx BibTEX -------------------------------------------

bibtex_bibfiles = ["references.bib"]
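For context, sphinxcontrib-bibtex reads a few further settings from `conf.py` beyond `bibtex_bibfiles`. The sketch below shows two of them; only the `bibtex_bibfiles` line comes from this commit, and the two style values are assumed preferences rather than the project's configuration.

```python
# Set by this commit: the BibTeX database(s) the extension loads.
bibtex_bibfiles = ["references.bib"]

# Assumed, not in the commit: list bibliography entries in citation order.
bibtex_default_style = "unsrt"

# Assumed, not in the commit: render inline citations as author-year
# rather than the default bracketed labels.
bibtex_reference_style = "author_year"
```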
62 changes: 38 additions & 24 deletions docs/index.md
@@ -1,34 +1,47 @@
# Vortex: a State-of-the-Art Columnar File Format
# Vortex: the columnar data toolkit

Vortex is a fast & extensible columnar file format that is based around the latest research from the
database community. It is built around cascading compression with lightweight, vectorized encodings
(i.e., no block compression), allowing for both efficient random access and extremely fast
decompression.
Vortex is a general purpose toolkit for working with columnar data built around the latest research from the
database community.

Vortex includes an accompanying in-memory format for these (recursively) compressed arrays,
that is zero-copy compatible with Apache Arrow in uncompressed form. Taken together, the Vortex
library is a useful toolkit with compressed Arrow data in-memory, on-disk, & over-the-wire.
## In-memory

Vortex consolidates the metadata in a series of flatbuffers in the footer, in order to minimize
the number of reads (important when reading from object storage) & the deserialization overhead
(important for wide tables with many columns).
Vortex in-memory arrays support:

Vortex aspires to succeed Apache Parquet by pushing the Pareto frontier outwards: 1-2x faster
writes, 2-10x faster scans, and 100-200x faster random access reads, while preserving the same
approximate compression ratio as Parquet v2 with zstd.
* Zero-copy interoperability with [Apache Arrow](https://arrow.apache.org).
* Cascading compression with lightweight, vectorized encodings such as
[FastLanes](https://github.com/spiraldb/fastlanes),
[FSST](https://github.com/spiraldb/fsst),
and [ALP](https://github.com/spiraldb/alp).
* Fast random access to compressed data.
* Compute push-down over compressed data.
* Array statistics for efficient compute.

Its features include:
## On-disk

- A zero-copy data layout for disk, memory, and the wire.
- Kernels for computing on, filtering, slicing, indexing, and projecting compressed arrays.
- Builtin state-of-the-art codecs including FastLanes (integer bit-packing), ALP (floating point),
and FSST (strings).
- Support for custom user-implemented codecs.
- Support for, but no requirement for, row groups.
- A read sub-system supporting filter and projection pushdown.
Vortex ships with an extensible file format supporting:

Vortex's flexible layout empowers writers to choose the right layout for their setting: fast writes,
fast reads, small files, few columns, many columns, over-sized columns, etc.
* Zero-allocation reads, deferring both deserialization and decompression.
* Zero-copy reads from memory-mapped files.
* FlatBuffer metadata to support ultra-wide schemas (>>100k columns).
* Fully customizable layouts and encodings (row-groups, column-groups, writer decides).
* Forwards compatibility by optionally embedding [WASM](https://webassembly.org/) decompression kernels.

## Over-the-wire

Vortex defines a work-in-progress IPC format for sending possibly compressed arrays over the wire.

* Zero-copy serialization and deserialization.
* Support for both compressed and uncompressed data.
* Enables partial compute push-down to storage servers.
* Enables client-side browser decompression with Vortex WASM.

## Extensibility

Vortex is designed to be incredibly extensible. Almost all reader and writer logic is extensible at compile-time
by providing various implementations of Rust traits, and encodings and layouts are extensible at runtime with
dynamically loaded libraries or WebAssembly kernels.

Please reach out to us if you'd like to extend Vortex with your own encodings, layouts, or other functionality.

## Concepts

@@ -91,6 +104,7 @@ hidden:
caption: Project Links
---
references
Spiral <https://spiraldb.com>
GitHub <https://github.com/spiraldb/vortex>
PyPI <https://pypi.org/project/vortex-array>
2 changes: 2 additions & 0 deletions docs/pyproject.toml
@@ -6,10 +6,12 @@ authors = []
dependencies = [
"furo>=2024.8.6",
"myst-parser>=4.0.0",
"setuptools>=75.8.0", # Required by sphinxcontrib-bibtex
"sphinx-autobuild>=2024.10.3",
"sphinx-copybutton>=0.5.2",
"sphinx-inline-tabs>=2023.4.21",
"sphinx>=8.0.2",
"sphinxcontrib-bibtex>=2.6.3",
"sphinxext-opengraph>=0.9.1",
"vortex-array",
]