Adaptable Research

Algebraic Tag Length Value

What should a general purpose extensible data exchange format for binary and algebraic data look like?

For converge we need a data exchange format that generically (and efficiently) encodes binary data (for chunks of files, hash digests, etc) and algebraic data types (for types of hashes, etc).

The prototypical data exchange format family for binary data is the tag/type length value format. The traditional setup is two fixed width fields for the tag/type and length followed by a value of length words:

value:	tag=(byte byte) len=(byte byte) byte[len]
byte:	xxxxxxxx

This is undesirable for a number of reasons:

As the popularity of JSON shows, we don’t need tags all the time, and including one for every value mostly just increases overhead. We really just want them where there are multiple options for an embedded value. We also want to be able to distinguish value sequences from byte sequences.

So, abstractly, we want something more like:

value:	binary | array | union
binary:	len=? byte[len]
array:	len=? value[len]
union:	tag=? value
byte:	xxxxxxxx

To encode this we need to encode a kind (binary, array, union), a quantity representing the length or the tag, and then a known-length sequence of bytes or values. We have three kinds, which can fit into the top two bits of a byte with a value to spare. The quantity can be encoded into the bottom six bits a sequence of bytes, ending when the top two bits are one of the kinds.

The result is something like this:

value:	binary | array | union
binary:	len=vlq(00) byte[len]
array:	len=vlq(01) value[len]
union:	tag=vlq(10) value
vlq(YY):	YYxxxxxx | 11xxxxxx vlq(YY)
byte:	xxxxxxxx



There is only as much overhead as needed, often only a single byte. Every possible byte stream is a prefix of a sequence of encoded values, so there is no redundancy in the encoding.


Decoding/parsing can proceed linearly with no backtracking.


Every value has a single valid representation.

Additionally, every byte string is a valid prefix of a sequence of encoded values.

Usage and Implementations



published updated