Octet-Encoded Data (OED)

For more-efficient network transmission and storage, we propose an octet-stream encoding which is compatible with JSON. There are multiple ways to encode the same abstract JSON value. All valid encodings produce equivalent JSON values, although some representation details are lost. Every valid JSON value can be represented in OED. Values may be round-tripped without semantic loss. Arbitrarily large values are fully supported. Decimal values can be represented exactly, without loss due to (for example) base-2 translation.

Goals

The encoding should achieve the following goals:

  • Self-describing data-types and representations
  • Better information-density than JSON
  • Arbitrarily large (finite) representation sizes
  • Lossless translation from JSON and machine types
  • Well-defined translation to JSON and machine types
  • Extension mechanism for application-defined representations
  • Capabilities can be distinguished from other data-types
  • Easy-to-implement encode/decode
  • Encoded data is navigable without fully decoding
  • Efficient to use as an in-memory data format

Design

Let's start with the easy values, false (2#1000_0000), true (2#1000_0001), and null (2#1000_1111). A single octet for each typed value. The four remaining JSON types, Number, String, Array, and Object, can encode arbitrarily-large values, so we will need a Number to describe their size. A single-octet encoding for small integers provides a base-case for the recursive definition of size of the size. Including both positive and negative numbers in the single-octet encoding keeps the encoding small for additional fields of larger encodings. However, it is unclear how large the range should be, so we will use all the encoding space that remains after encoding the other types.
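The single-octet cases can be sketched as a small classifier. This is only an illustration: the function name `decode_single_octet` is hypothetical, and the small-integer ranges are taken from the Summary table, read here as a two's-complement interpretation of the octet (an assumption consistent with the ranges listed there).

```python
def decode_single_octet(octet):
    """Return (True, value) if the octet is a complete value by itself,
    or (False, None) if it is the type prefix of a longer encoding."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet")
    if octet == 0b1000_0000:
        return True, False
    if octet == 0b1000_0001:
        return True, True
    if octet == 0b1000_1111:
        return True, None
    if octet < 0x80:            # 2#0xxx_xxxx: small integers 0..127
        return True, octet
    if octet >= 0x90:           # 2#1001_0000 and above: small integers -112..-1
        return True, octet - 256
    return False, None          # 2#1000_0010..2#1000_1110: multi-octet types
```

For example, `decode_single_octet(0xFF)` yields `(True, -1)`, while `decode_single_octet(0x82)` yields `(False, None)` because that octet introduces a Positive Integer.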

The most basic form of arbitrary-sized Numbers are arbitrary-length bit-strings representing natural numbers. With the addition of a sign bit, we can describe any finite integer value. By adding an integer field for an exponent, we can describe any finite decimal value (assuming base-10). By adding another integer field for base, we can describe any finite rational value, and encode alternate bases (such as 2 for IEEE floats) without loss of precision.

  • Positive Integer: type=2#1000_0010 size::Number natural::Octet*
  • Negative Integer: type=2#1000_0011 size::Number natural::Octet*
  • Positive Decimal: type=2#1000_0100 exponent::Number size::Number natural::Octet*
  • Negative Decimal: type=2#1000_0101 exponent::Number size::Number natural::Octet*
  • Positive Rational: type=2#1000_0110 base::Number exponent::Number size::Number natural::Octet*
  • Negative Rational: type=2#1000_0111 base::Number exponent::Number size::Number natural::Octet*

The basic encoding is sign and magnitude, least-significant-octet first. The sign is positive (0) or negative (1), held in the LSB of the type prefix octet. A 1-component Number is just an integer value. A 2-component Number also includes an exponent, encoded as an additional Number. A 3-component Number also includes a base, encoded as an additional Number. The default exponent is 0. The default base is 10. The size field is a Number describing the number of bits (not octets) in the natural value. There is no requirement that a Number is encoded with the minimum number of octets. If the size is 0, the Number is 0, and there are no natural octets. The octets of the natural value (LSB to MSB) follow the size. If the number of encoded bits is not a multiple of 8, the final octet (MSB) will be padded with 0. The number designated is equal to (natural × base ^ exponent), or (-natural × base ^ exponent) if the sign is negative. Note that the base and exponent are signed integers. Rational numbers may be encoded as an exponent of -1, with the signed magnitude as the numerator, and the base as the denominator.
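A minimal sketch of the Number rules above, under a few assumptions: the names `encode_number` and `decode_number` are hypothetical, the encoder picks the shortest of the three forms (the format does not require this), and the decoder returns a Python `Fraction` when the value is not an integer.

```python
from fractions import Fraction


def encode_number(n, exponent=0, base=10):
    """Encode the value n * base**exponent as an OED Number."""
    if exponent == 0 and -112 <= n <= 127:
        return bytes([n & 0xFF])                 # single-octet small integer
    sign = 1 if n < 0 else 0                     # sign lives in the prefix LSB
    magnitude = abs(n)
    bits = magnitude.bit_length()                # size counts bits, not octets
    body = encode_number(bits) + magnitude.to_bytes((bits + 7) // 8, "little")
    if exponent == 0:
        return bytes([0b1000_0010 | sign]) + body
    if base == 10:
        return bytes([0b1000_0100 | sign]) + encode_number(exponent) + body
    return (bytes([0b1000_0110 | sign]) + encode_number(base)
            + encode_number(exponent) + body)


def decode_number(data, pos=0):
    """Decode a Number starting at pos; return (value, next_pos)."""
    prefix = data[pos]
    pos += 1
    if prefix < 0x80:
        return prefix, pos
    if prefix >= 0x90:
        return prefix - 256, pos
    if not 0b1000_0010 <= prefix <= 0b1000_0111:
        raise ValueError("not a Number")
    base, exponent = 10, 0                       # defaults from the text
    if prefix >= 0b1000_0110:
        base, pos = decode_number(data, pos)
    if prefix >= 0b1000_0100:
        exponent, pos = decode_number(data, pos)
    bits, pos = decode_number(data, pos)
    n_octets = (bits + 7) // 8
    natural = int.from_bytes(data[pos:pos + n_octets], "little")
    pos += n_octets
    if prefix & 1:
        natural = -natural
    value = natural * Fraction(base) ** exponent
    return (int(value) if value.denominator == 1 else value), pos
```

With this sketch, 1000 encodes as the octets `82 0A E8 03` (10 bits, least-significant octet first), and 3.14 encoded as 314 × 10^-2 becomes `84 FE 09 3A 01`, which decodes back to exactly 157/50.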

The String type represents an arbitrary-sized sequence of Unicode code-points. UTF-8 has become the default encoding for textual data throughout the world-wide-web, so we require explicit support for that encoding. Raw octet data (BLOBs) are also an important use-case, where the code-points represented are restricted to the range 0 through 255. In order to support extensions for application-defined representations, and to encapsulate foreign data verbatim, BLOBs may be labelled with encoding meta-data. It is unclear if a memoization feature is worth the additional complexity it introduces, particularly when link data compression is likely.

  • Raw Octet BLOB: type=2#1000_1010 size::Number data::Octet*
  • Extension BLOB: type=2#1000_1011 meta::Value size::Number data::Octet*
  • UTF-8 String: type=2#1000_1100 length::Number size::Number data::Octet*

The size field is a Number describing the number of octets in the data. If the encoding is UTF-8, the length field is a Number describing the number of code-points in the String. If the length is 0, there is no size field (and no data). The extension encoding includes a meta field, which is an arbitrary (OED-encoded) Value. The extension may be converted to a String by treating the octets of the entire encoded value (including the Extension BLOB type prefix) as code-points.
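The String layouts might be produced as follows. This is a sketch, not a reference implementation: `encode_int`, `encode_string`, and `encode_blob` are hypothetical names, and `encode_int` covers only the integer forms of Number, which is all the length and size fields need here.

```python
def encode_int(n):
    """Encode an integer as an OED Number (small or multi-octet integer form)."""
    if -112 <= n <= 127:
        return bytes([n & 0xFF])
    bits = abs(n).bit_length()
    natural = abs(n).to_bytes((bits + 7) // 8, "little")
    return bytes([0b1000_0010 | (1 if n < 0 else 0)]) + encode_int(bits) + natural


def encode_string(text):
    """UTF-8 String: prefix, length in code-points, size in octets, data."""
    if len(text) == 0:
        return bytes([0b1000_1100]) + encode_int(0)   # no size field, no data
    data = text.encode("utf-8")
    # len(text) counts code-points; len(data) counts encoded octets.
    return (bytes([0b1000_1100]) + encode_int(len(text))
            + encode_int(len(data)) + data)


def encode_blob(data):
    """Raw Octet BLOB: prefix, size in octets, data."""
    return bytes([0b1000_1010]) + encode_int(len(data)) + data
```

Note how length and size differ for non-ASCII text: `"héllo"` has 5 code-points but 6 UTF-8 octets, so it encodes as `8C 05 06` followed by the data.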

The Array type represents an arbitrary-sized sequence of Value elements. The values are not required to have the same type.

  • Array: type=2#1000_1000 length::Number size::Number elements::Value*

The size field is a Number describing the number of octets encoding the elements. The length field is a Number describing the number of elements in the Array. If the length is 0, there is no size field (and no elements).
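Because the size field gives the octet count of the elements, a reader can step over an Array without decoding its contents. The sketch below illustrates both directions under stated assumptions: all function names are hypothetical, `encode_array` takes elements that are already OED-encoded, and the length and size fields are assumed to use the integer forms of Number.

```python
def encode_int(n):
    """Encode an integer as an OED Number (small or multi-octet integer form)."""
    if -112 <= n <= 127:
        return bytes([n & 0xFF])
    bits = abs(n).bit_length()
    natural = abs(n).to_bytes((bits + 7) // 8, "little")
    return bytes([0b1000_0010 | (1 if n < 0 else 0)]) + encode_int(bits) + natural


def decode_int(data, pos):
    """Decode an integer-form Number at pos; return (value, next_pos)."""
    prefix = data[pos]
    if prefix < 0x80:
        return prefix, pos + 1
    if prefix >= 0x90:
        return prefix - 256, pos + 1
    if prefix not in (0b1000_0010, 0b1000_0011):
        raise ValueError("expected an integer Number")
    bits, pos = decode_int(data, pos + 1)
    end = pos + (bits + 7) // 8
    natural = int.from_bytes(data[pos:end], "little")
    return (-natural if prefix & 1 else natural), end


def encode_array(encoded_elements):
    """Array: prefix, element count, octet count of elements, elements."""
    if not encoded_elements:
        return b"\x88\x00"                       # length 0: no size field
    body = b"".join(encoded_elements)
    return (b"\x88" + encode_int(len(encoded_elements))
            + encode_int(len(body)) + body)


def array_extent(data, pos=0):
    """Return (length, first_element_pos, end_pos) without decoding elements."""
    if data[pos] != 0b1000_1000:
        raise ValueError("not an Array")
    length, pos = decode_int(data, pos + 1)
    if length == 0:
        return 0, pos, pos
    size, pos = decode_int(data, pos)
    return length, pos, pos + size
```

For instance, the JSON value `[[1, 2, 3], true]` encodes as `88 02 07 88 03 03 01 02 03 81`, and `array_extent` on that buffer reports 2 elements occupying octets 3 up to (not including) 10.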

The Object type represents an arbitrary-sized collection of name/value members. Each name should be a String for JSON compatibility, otherwise the JSON value is the String of encoded octets. Each value may be any type, including nested Object or Array values.

  • Object: type=2#1000_1001 length::Number size::Number members::(name::Value value::Value)*

The size field is a Number describing the number of octets encoding the members. The length field is a Number describing the number of members in the Object. If the length is 0, there is no size field (and no members).
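Putting the pieces together, an encoder for a JSON-like subset could look like the sketch below. The name `encode_value` is hypothetical, Python `float` values are deliberately rejected (an exact decimal or rational form would have to be chosen by the caller), and member order simply follows the order of the Python `dict`.

```python
def encode_int(n):
    """Encode an integer as an OED Number (small or multi-octet integer form)."""
    if -112 <= n <= 127:
        return bytes([n & 0xFF])
    bits = abs(n).bit_length()
    natural = abs(n).to_bytes((bits + 7) // 8, "little")
    return bytes([0b1000_0010 | (1 if n < 0 else 0)]) + encode_int(bits) + natural


def encode_counted(prefix, count, body):
    """Shared layout: prefix, length, then size and body unless length is 0."""
    if count == 0:
        return bytes([prefix]) + encode_int(0)
    return bytes([prefix]) + encode_int(count) + encode_int(len(body)) + body


def encode_value(value):
    """Encode None, bool, int, str, list, or dict as OED octets."""
    if value is None:
        return b"\x8f"
    if value is False:
        return b"\x80"
    if value is True:
        return b"\x81"
    if isinstance(value, int):
        return encode_int(value)
    if isinstance(value, str):
        return encode_counted(0b1000_1100, len(value), value.encode("utf-8"))
    if isinstance(value, list):
        body = b"".join(encode_value(v) for v in value)
        return encode_counted(0b1000_1000, len(value), body)
    if isinstance(value, dict):
        # Each member is a name Value followed by its value Value.
        body = b"".join(encode_value(k) + encode_value(v)
                        for k, v in value.items())
        return encode_counted(0b1000_1001, len(value), body)
    raise TypeError("unsupported type in this sketch: " + type(value).__name__)
```

As a small example, `{"a": 1}` encodes as `89 01 05 8C 01 01 61 01`: one member, five octets of member data, the name `"a"` as a UTF-8 String, then the small integer 1.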

Interoperability

Every OED document corresponds to a JSON document containing all the same data, although without some of the meta-data describing representations. Converting a JSON document to OED and back to JSON should result in an equivalent document. Following the recommendations in RFC 8259 and/or RFC 7493 will increase the portability and interoperability of OED documents.

Summary

| type/prefix | suffix | description |
|---|---|---|
| 2#0xxx_xxxx | - | positive small integer (0..127) |
| 2#1000_0000 | - | false |
| 2#1000_0001 | - | true |
| 2#1000_0010 | size::Number nat::Octet* | Number (positive integer) |
| 2#1000_0011 | size::Number nat::Octet* | Number (negative integer) |
| 2#1000_0100 | exp::Number size::Number nat::Octet* | Number (positive decimal) |
| 2#1000_0101 | exp::Number size::Number nat::Octet* | Number (negative decimal) |
| 2#1000_0110 | base::Number exp::Number size::Number nat::Octet* | Number (positive rational) |
| 2#1000_0111 | base::Number exp::Number size::Number nat::Octet* | Number (negative rational) |
| 2#1000_1000 | length::Number size::Number elements::Value* | Array |
| 2#1000_1001 | length::Number size::Number members::Octet* | Object |
| 2#1000_1010 | size::Number data::Octet* | String (Raw BLOB) |
| 2#1000_1011 | meta::Value size::Number data::Octet* | String (Extension BLOB) |
| 2#1000_1100 | length::Number size::Number data::Octet* | String (UTF-8) |
| 2#1000_1101 | length::Number size::Number data::Octet* | String (UTF-8 +memo) |
| 2#1000_1110 | index::Octet | String (memo reference) |
| 2#1000_1111 | - | null |
| 2#1001_xxxx | - | negative small integer (-112..-97) |
| 2#101x_xxxx | - | negative small integer (-96..-65) |
| 2#11xx_xxxx | - | negative small integer (-64..-1) |