gobs.tex

\cleardoublepage
\phantomsection
\addcontentsline{toc}{chapter}{Gobs of Data}
\chapter*{Gobs of Data}

To transmit a data structure across a network or to store it in a
file, it must be encoded and then decoded again. There are many
encodings available, of course: \href{http://www.json.org/}{JSON},
\href{http://www.w3.org/XML/}{XML}, Google's
\href{http://code.google.com/p/protobuf}{protocol buffers}, and
more.  And now there's another, provided by Go's
\href{http://golang.org/pkg/encoding/gob/}{gob} package.

Why define a new encoding? It's a lot of work and redundant at that.
Why not just use one of the existing formats? Well, for one thing,
we do! Go has \href{http://golang.org/pkg/}{packages} supporting
all the encodings just mentioned (the
\href{http://code.google.com/p/goprotobuf}{protocol buffer package}
is in a separate repository but it's one of the most frequently
downloaded). And for many purposes, including communicating with
tools and systems written in other languages, they're the right
choice.

But for a Go-specific environment, such as communicating between two
servers written in Go, there's an opportunity to build something much
easier to use and possibly more efficient.

Gobs work with the language in a way that an externally-defined,
language-independent encoding cannot. At the same time, there are
lessons to be learned from the existing systems.

\section*{Goals}

The gob package was designed with a number of goals in mind.

First, and most obvious, it had to be very easy to use. First, because
Go has reflection, there is no need for a separate interface definition
language or ``protocol compiler''. The data structure itself is all the
package should need to figure out how to encode and decode it. On the
other hand, this approach means that gobs will never work as well with
other languages, but that's OK: gobs are unashamedly Go-centric.

Efficiency is also important. Textual representations, exemplified by
XML and JSON, are too slow to put at the center of an efficient
communications network. A binary encoding is necessary.

Gob streams must be self-describing. Each gob stream, read from the
beginning, contains sufficient information that the entire stream can be
parsed by an agent that knows nothing a priori about its contents. This
property means that you will always be able to decode a gob stream
stored in a file, even long after you've forgotten what data it
represents.

There were also some things to learn from our experiences with Google
protocol buffers.

\section*{Protocol buffer misfeatures}

Protocol buffers had a major effect on the design of gobs, but have
three features that were deliberately avoided. (Leaving aside the
property that protocol buffers aren't self-describing: if you don't know
the data definition used to encode a protocol buffer, you might not be
able to parse it.)

First, protocol buffers only work on the data type we call a struct in
Go. You can't encode an integer or array at the top level, only a struct
with fields inside it. That seems a pointless restriction, at least in
Go. If all you want to send is an array of integers, why should you have
to put it into a struct first?

Next, a protocol buffer definition may specify that fields \texttt{T.x}
and \texttt{T.y} are required to be present whenever a value of type
\texttt{T} is encoded or decoded. Although such required fields may seem
like a good idea, they are costly to implement because the codec must
maintain a separate data structure while encoding and decoding, to be
able to report when required fields are missing. They're also a
maintenance problem. Over time, one may want to modify the data
definition to remove a required field, but that may cause existing
clients of the data to crash. It's better not to have them in the
encoding at all. (Protocol buffers also have optional fields. But if we
don't have required fields, all fields are optional and that's that.
There will be more to say about optional fields a little later.)

The third protocol buffer misfeature is default values. If a protocol
buffer omits the value for a ``defaulted'' field, then the decoded
structure behaves as if the field were set to that value. This idea
works nicely when you have getter and setter methods to control access
to the field, but is harder to handle cleanly when the container is just
a plain idiomatic struct. Required fields are also tricky to implement:
where does one define the default values, what types do they have (is
text UTF-8? uninterpreted bytes? how many bits in a float?) and despite
the apparent simplicity, there were a number of complications in their
design and implementation for protocol buffers. We decided to leave them
out of gobs and fall back to Go's trivial but effective defaulting rule:
unless you set something otherwise, it has the ``zero value'' for that
type - and it doesn't need to be transmitted.

So gobs end up looking like a sort of generalized, simplified protocol
buffer. How do they work?

\section*{Values}

The encoded gob data isn't about \texttt{int8}s and \texttt{uint16}s.
Instead, somewhat analogous to constants in Go, its integer values are
abstract, sizeless numbers, either signed or unsigned. When you encode
an \texttt{int8}, its value is transmitted as an unsized,
variable-length integer. When you encode an \texttt{int64}, its value is
also transmitted as an unsized, variable-length integer. (Signed and
unsigned are treated distinctly, but the same unsized-ness applies to
unsigned values too.) If both have the value 7, the bits sent on the
wire will be identical. When the receiver decodes that value, it puts it
into the receiver's variable, which may be of arbitrary integer type.
Thus an encoder may send a 7 that came from an \texttt{int8}, but the
receiver may store it in an \texttt{int64}. This is fine: the value is
an integer and as a long as it fits, everything works. (If it doesn't
fit, an error results.) This decoupling from the size of the variable
gives some flexibility to the encoding: we can expand the type of the
integer variable as the software evolves, but still be able to decode
old data.

This flexibility also applies to pointers. Before transmission, all
pointers are flattened. Values of type \texttt{int8}, \texttt{*int8},
\texttt{**int8}, \texttt{****int8}, etc. are all transmitted as an
integer value, which may then be stored in \texttt{int} of any size, or
\texttt{*int}, or \texttt{******int}, etc. Again, this allows for
flexibility.

Flexibility also happens because, when decoding a struct, only those
fields that are sent by the encoder are stored in the destination. Given
the value

\begin{Verbatim}[frame=single]
type T struct { X, Y, Z int } // Only exported fields are encoded and decoded.
var t = T{X: 7, Y: 0, Z: 8}
\end{Verbatim}

the encoding of \texttt{t} sends only the 7 and 8. Because it's zero,
the value of \texttt{Y} isn't even sent; there's no need to send a zero
value.

The receiver could instead decode the value into this structure:

\begin{Verbatim}[frame=single]
type U struct{ X, Y *int8 } // Note: pointers to int8s
var u U
\end{Verbatim}

and acquire a value of \texttt{u} with only \texttt{X} set (to the
address of an \texttt{int8} variable set to 7); the \texttt{Z} field is
ignored - where would you put it? When decoding structs, fields are
matched by name and compatible type, and only fields that exist in both
are affected. This simple approach finesses the ``optional field''
problem: as the type \texttt{T} evolves by adding fields, out of date
receivers will still function with the part of the type they recognize.
Thus gobs provide the important result of optional fields -
extensibility - without any additional mechanism or notation.

From integers we can build all the other types: bytes, strings, arrays,
slices, maps, even floats. Floating-point values are represented by
their IEEE 754 floating-point bit pattern, stored as an integer, which
works fine as long as you know their type, which we always do. By the
way, that integer is sent in byte-reversed order because common values
of floating-point numbers, such as small integers, have a lot of zeros
at the low end that we can avoid transmitting.

One nice feature of gobs that Go makes possible is that they allow you
to define your own encoding by having your type satisfy the
\href{http://golang.org/pkg/encoding/gob/\#GobEncoder}{GobEncoder} and
\href{http://golang.org/pkg/encoding/gob/\#GobDecoder}{GobDecoder} interfaces, in a
manner analogous to the \href{http://golang.org/pkg/encoding/json/}{JSON} package's
\href{http://golang.org/pkg/encoding/json/\#Marshaler}{Marshaler} and
\href{http://golang.org/pkg/encoding/json/\#Unmarshaler}{Unmarshaler} and also to the
\href{http://golang.org/pkg/fmt/\#Stringer}{Stringer} interface from
\href{http://golang.org/pkg/fmt/}{package fmt}. This facility makes it possible to
represent special features, enforce constraints, or hide secrets when
you transmit data. See the \href{http://golang.org/pkg/encoding/gob/}{documentation} for
details.

\section*{Types on the wire}

The first time you send a given type, the gob package includes in the
data stream a description of that type. In fact, what happens is that
the encoder is used to encode, in the standard gob encoding format, an
internal struct that describes the type and gives it a unique number.
(Basic types, plus the layout of the type description structure, are
predefined by the software for bootstrapping.) After the type is
described, it can be referenced by its type number.

Thus when we send our first type \texttt{T}, the gob encoder sends a
description of \texttt{T} and tags it with a type number, say 127. All
values, including the first, are then prefixed by that number, so a
stream of \texttt{T} values looks like:

\begin{Verbatim}[frame=single]
("define type id" 127, definition of type T)(127, T value)(127, T value), ...
\end{Verbatim}

These type numbers make it possible to describe recursive types and send
values of those types. Thus gobs can encode types such as trees:

\begin{Verbatim}[frame=single]
type Node struct {
        Value       int
        Left, Right *Node
}
\end{Verbatim}

(It's an exercise for the reader to discover how the zero-defaulting
rule makes this work, even though gobs don't represent pointers.)

With the type information, a gob stream is fully self-describing except
for the set of bootstrap types, which is a well-defined starting point.

\section*{Compiling a machine}

The first time you encode a value of a given type, the gob package
builds a little interpreted machine specific to that data type. It uses
reflection on the type to construct that machine, but once the machine
is built it does not depend on reflection. The machine uses package
unsafe and some trickery to convert the data into the encoded bytes at
high speed. It could use reflection and avoid unsafe, but would be
significantly slower. (A similar high-speed approach is taken by the
protocol buffer support for Go, whose design was influenced by the
implementation of gobs.) Subsequent values of the same type use the
already-compiled machine, so they can be encoded right away.

Decoding is similar but harder. When you decode a value, the gob package
holds a byte slice representing a value of a given encoder-defined type
to decode, plus a Go value into which to decode it. The gob package
builds a machine for that pair: the gob type sent on the wire crossed
with the Go type provided for decoding. Once that decoding machine is
built, though, it's again a reflectionless engine that uses unsafe
methods to get maximum speed.

\section*{Use}

There's a lot going on under the hood, but the result is an efficient,
easy-to-use encoding system for transmitting data. Here's a complete
example showing differing encoded and decoded types. Note how easy it is
to send and receive values; all you need to do is present values and
variables to the \href{http://golang.org/pkg/encoding/gob/}{gob package} and it does all
the work.

\begin{Verbatim}[frame=single]
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

type P struct {
	X, Y, Z int
	Name    string
}

type Q struct {
	X, Y *int32
	Name string
}

func main() {
	// Initialize the encoder and decoder.  Normally enc and dec would be
	// bound to network connections and the encoder and decoder would
	// run in different processes.
	var network bytes.Buffer        // Stand-in for a network connection
	enc := gob.NewEncoder(&network) // Will write to network.
	dec := gob.NewDecoder(&network) // Will read from network.
	// Encode (send) the value.
	err := enc.Encode(P{3, 4, 5, "Pythagoras"})
	if err != nil {
		log.Fatal("encode error:", err)
	}
	// Decode (receive) the value.
	var q Q
	err = dec.Decode(&q)
	if err != nil {
		log.Fatal("decode error:", err)
	}
	fmt.Printf("%q: {%d,%d}\n", q.Name, *q.X, *q.Y)
}
\end{Verbatim}

You can compile and run this example code in the
\href{http://play.golang.org/p/\_-OJV-rwMq}{Go Playground}.

The \href{http://golang.org/pkg/net/rpc/}{rpc package} builds on
gobs to turn this encode/decode automation into transport for method
calls across the network. That's a subject for another article.

\textbf{Details}

The \href{http://golang.org/pkg/encoding/gob/}{gob package
documentation}, especially the file
\href{http://golang.org/src/pkg/encoding/gob/doc.go}{doc.go}, expands
on many of the details described here and includes a full worked
example showing how the encoding represents data. If you are
interested in the innards of the gob implementation, that's a good
place to start.