HTTP GET Arrow Data: Compression Examples

This directory contains examples of HTTP servers/clients that transmit/receive data in the Arrow IPC streaming format and use compression (in various ways) to reduce the size of the transmitted data.

Since we re-use the Arrow IPC format for transferring Arrow data over HTTP and both Arrow IPC and HTTP standards support compression on their own, there are at least two approaches to this problem:

  1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed array buffers.
  2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed array buffers.

Applying both IPC buffer and HTTP compression to the same data is not recommended. The extra CPU overhead of decompressing the data twice is not worth any possible gains that double compression might bring. If compression ratios are unambiguously more important than reducing CPU overhead, then a different compression algorithm that optimizes for that can be chosen.

This table shows the support for different compression algorithms in HTTP and Arrow IPC:

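| Codec     | Identifier | HTTP Support | IPC Support |
|-----------|------------|--------------|-------------|
| GZip      | gzip       | X            |             |
| DEFLATE   | deflate    | X            |             |
| Brotli    | br         | X [1]        |             |
| Zstandard | zstd       | X [1]        | X [2]       |
| LZ4       | lz4        |              | X [2]       |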

Since not all Arrow IPC implementations support compression, HTTP compression based on accepted formats negotiated with the client is a great way to increase the chances of efficient data transfer.

Servers may check the Accept-Encoding header of the client and choose the compression format in this order of preference: zstd, br, gzip, identity (no compression). If the client does not specify a preference, the only constraint on the server is the availability of the compression algorithm in the server environment.
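A minimal sketch of that preference order, assuming a simplified check that ignores q-values and the "*" wildcard. The function name `pick_coding_simple` and the `available` set (the codings the server environment can produce) are hypothetical names used only for illustration.

```python
# Server-side order of preference described above.
PREFERENCE = ["zstd", "br", "gzip", "identity"]


def pick_coding_simple(accept_encoding, available):
    """Pick the first preferred coding that the client lists and the server has.

    accept_encoding: raw header value, or None if the header is absent.
    available: set of compression codings the server can produce.
    """
    # "identity" (no compression) never needs a library, so it is always usable.
    usable = [c for c in PREFERENCE if c == "identity" or c in available]
    if accept_encoding is None:
        # No stated preference: only server-side availability matters.
        return usable[0]
    # Keep only the coding names, dropping parameters such as ";q=0.5".
    listed = {item.split(";")[0].strip().lower()
              for item in accept_encoding.split(",")}
    for coding in usable:
        if coding in listed:
            return coding
    # Nothing in common: send the response uncompressed.
    return "identity"
```

For example, `pick_coding_simple("gzip, deflate, br, zstd", {"gzip", "zstd"})` selects `"zstd"`.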

Arrow IPC Compression

When IPC buffer compression is preferred and servers can't assume all clients support it [3], clients may be asked to explicitly list the supported compression algorithms in the request headers. The Accept header can be used for this, since Accept-Encoding (and Content-Encoding) controls compression of the entire HTTP response stream and instructs HTTP clients (like browsers) to decompress the response before handing the data to the application or saving it.

Accept: application/vnd.apache.arrow.stream; codecs="zstd, lz4"

This is similar to clients requesting video streams by specifying the container format and the codecs they support (e.g. Accept: video/webm; codecs="vp8, vorbis").

The server is allowed to choose any of the listed codecs, or not compress the IPC buffers at all. Uncompressed IPC buffers should always be acceptable by clients.

If a server adopts this approach and a client does not specify any codecs in the Accept header, the server can fall back to checking Accept-Encoding header to pick a compression algorithm for the entire HTTP response stream.
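The negotiation above could be sketched as follows. This is an illustration only: the helper names (`split_outside_quotes`, `requested_ipc_codecs`, `choose_ipc_codec`) are hypothetical, and picking the first client-listed codec that the server supports is one possible policy, since the server may choose any listed codec or none.

```python
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def split_outside_quotes(value, sep):
    """Split on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in value:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts]


def requested_ipc_codecs(accept):
    """Return the codecs the client listed for the Arrow stream media type."""
    for media_range in split_outside_quotes(accept, ","):
        media_type, *params = split_outside_quotes(media_range, ";")
        if media_type.lower() != ARROW_STREAM:
            continue
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "codecs":
                # Quotes around the codec list are optional.
                codecs = value.strip().strip('"').split(",")
                return [c.strip().lower() for c in codecs if c.strip()]
    return []


def choose_ipc_codec(accept, server_codecs):
    """Return an IPC buffer codec, or None when no listed codec is usable.

    None means: send uncompressed buffers, or fall back to Accept-Encoding.
    """
    for codec in requested_ipc_codecs(accept):
        if codec in server_codecs:
            return codec
    return None
```

With the header shown earlier, `requested_ipc_codecs('application/vnd.apache.arrow.stream; codecs="zstd, lz4"')` yields `["zstd", "lz4"]`.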

To make debugging easier, servers may include the chosen compression codec(s) in the Content-Type header of the response (quotes are optional):

Content-Type: application/vnd.apache.arrow.stream; codecs=zstd

This is not necessary for correct decompression because the payload already contains information that tells the IPC reader how to decompress the buffers, but it can help developers understand what is going on.

When programmatically checking whether the Content-Type header contains a specific format, it is important to use a parser that can handle parameters, or to look only at the media type part of the header. This is not specific to the Arrow IPC format; it is a general rule for all media types. For example, application/json; charset=utf-8 should match application/json.
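A small sketch of the "look only at the media type part" option. The helper names `media_type_of` and `is_arrow_stream` are hypothetical.

```python
def media_type_of(content_type):
    """Return the media type of a Content-Type value, without parameters."""
    # Everything after the first ";" is parameters (charset, codecs, ...).
    return content_type.split(";", 1)[0].strip().lower()


def is_arrow_stream(content_type):
    """True if the response carries an Arrow IPC stream, with or without codecs."""
    return media_type_of(content_type) == "application/vnd.apache.arrow.stream"
```

Comparing the full header value with `==` would wrongly reject `application/vnd.apache.arrow.stream; codecs=zstd`, which is why the parameters are stripped first.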

When considering use of IPC buffer compression, check the IPC format section of the Arrow Implementation Status page [4] to see whether the Arrow implementations you are targeting support it.

HTTP/1.1 Response Compression

HTTP/1.1 offers an elaborate way for clients to specify their preferred content encoding (read: compression algorithm) using the Accept-Encoding header [5].

At least the Python server (in python/) implements a fully compliant parser for the Accept-Encoding header. Application servers may choose to implement a simpler check of the Accept-Encoding header or assume that the client accepts the chosen compression scheme when talking to that server.

Here is an example of a header that a client may send and what it means:

Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0

This header says that the client prefers that the server compress the response with zstd, but if that is not possible, then brotli and gzip are acceptable (in that order because 0.8 is greater than 0.5). The client does not want the response to be uncompressed. This is communicated by "identity" being listed with q=0.

To tell the server the client only accepts zstd responses and nothing else, not even uncompressed responses, the client would send:

Accept-Encoding: zstd, *;q=0

RFC 2616 [5] specifies the rules for how a server should interpret the Accept-Encoding header:

A server tests whether a content-coding is acceptable, according to
an Accept-Encoding field, using these rules:

   1. If the content-coding is one of the content-codings listed in
      the Accept-Encoding field, then it is acceptable, unless it is
      accompanied by a qvalue of 0. (As defined in section 3.9, a
      qvalue of 0 means "not acceptable.")

   2. The special "*" symbol in an Accept-Encoding field matches any
      available content-coding not explicitly listed in the header
      field.

   3. If multiple content-codings are acceptable, then the acceptable
      content-coding with the highest non-zero qvalue is preferred.

   4. The "identity" content-coding is always acceptable, unless
      specifically refused because the Accept-Encoding field includes
      "identity;q=0", or because the field includes "*;q=0" and does
      not explicitly include the "identity" content-coding. If the
      Accept-Encoding field-value is empty, then only the "identity"
      encoding is acceptable.
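The four rules above can be sketched in Python as follows. This is not the parser from python/; the function names are hypothetical, and two choices are assumptions of this sketch: an unparsable q-value is treated as 0, and an unlisted "identity" is used only as a fallback when no compression coding is acceptable.

```python
def parse_accept_encoding(header):
    """Map each listed content-coding to its q-value (default 1.0)."""
    qvalues = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, *params = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: malformed q-value means "not acceptable"
        qvalues[coding.lower()] = q
    return qvalues


def select_coding(header, available):
    """Return the coding to use, or None when nothing is acceptable (406).

    header: Accept-Encoding value, or None if the header is absent.
    available: compression codings the server supports, in its own order.
    """
    if header is None:
        return available[0] if available else "identity"
    qvalues = parse_accept_encoding(header)

    def q_of(coding):
        if coding in qvalues:      # rule 1: explicitly listed
            return qvalues[coding]
        if "*" in qvalues:         # rule 2: wildcard covers unlisted codings
            return qvalues["*"]
        return None                # not mentioned at all

    best, best_q = None, 0.0
    for coding in available:
        q = q_of(coding) or 0.0
        if q > best_q:             # rule 3: highest non-zero q-value wins
            best, best_q = coding, q

    identity_q = q_of("identity")
    if identity_q is None:         # rule 4: acceptable unless refused
        return best or "identity"
    if identity_q > best_q:
        return "identity"
    return best
```

With the example header `zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0`, a server that supports only gzip and br would select `br`, and a server with no compression support would get `None`, which maps to a "406 Not Acceptable" response.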

If you're targeting web browsers, check the compatibility table of compression algorithms on MDN Web Docs [1].

Another important rule is that if the server compresses the response, it must include a Content-Encoding header in the response.

If the content-coding of an entity is not "identity", then the
response MUST include a Content-Encoding entity-header (section
14.11) that lists the non-identity content-coding(s) used.
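As an illustration of this rule, the sketch below compresses a response body with gzip from the Python standard library and sets Content-Encoding only when a non-identity coding was applied. The function name `encode_body` and the plain-dict headers are hypothetical; zstd and br are left out because they need third-party packages.

```python
import gzip


def encode_body(body, coding):
    """Return (headers, payload) for an Arrow IPC stream response body."""
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    if coding == "gzip":
        body = gzip.compress(body)
        # Required whenever the content-coding is not "identity".
        headers["Content-Encoding"] = "gzip"
    elif coding != "identity":
        raise ValueError(f"unsupported content-coding: {coding}")
    headers["Content-Length"] = str(len(body))
    return headers, body
```

An identity response carries no Content-Encoding header at all, so a client can tell from the headers alone whether it needs to decompress the payload.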

Since not all servers implement the full Accept-Encoding header parsing logic, clients tend to stick to simple header values like Accept-Encoding: identity when no compression is desired, and Accept-Encoding: gzip, deflate, zstd, br when the client supports different compression formats and is indifferent to which one the server chooses. Clients should expect uncompressed responses as well in these cases. The only way to force a "406 Not Acceptable" response when no compression is available is to send identity;q=0 or *;q=0 at the end of the Accept-Encoding header. But that relies on the server implementing the full Accept-Encoding handling logic.

Footnotes

  1. MDN Web Docs: Accept-Encoding

  2. Arrow Columnar Format: Compression

  3. Web applications using the JavaScript Arrow implementation don't have access to the compression APIs to decompress zstd and lz4 IPC buffers.

  4. Arrow Implementation Status: IPC Format

  5. Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.