Algebraic Tag Length Value
What should a general purpose extensible data exchange format for binary and algebraic data look like?
For converge we need a data exchange format that generically (and efficiently) encodes binary data (for chunks of files, hash digests, etc) and algebraic data types (for types of hashes, etc).
The prototypical data exchange format family for binary data is the tag/type length value format. The traditional setup is two fixed width fields for the tag/type and length followed by a value of length words:
value: tag=(byte byte) len=(byte byte) byte[len]
byte: xxxxxxxx
This is undesirable for a number of reasons:
- Every value has four bytes of overhead.
- Values longer than 65535 cannot be encoded, usually this is “fixed” in increasing the overhead to six or even eight bytes.
- The structure of the TLV cannot be examined without either understanding the tags, or a non-deterministic process by guessing that each value is a sequence of contained values, and seeing if they fit together.
As the popularity of JSON shows, we don’t need tags all the time, and including one for every value mostly just increases overhead. We really just want them where there are multiple options for an embedded value. We also want to be able to distinguish value sequences from byte sequences.
So, abstractly, we want something more like:
value: binary | array | union
binary: len=? byte[len]
array: len=? value[len]
union: tag=? value
byte: xxxxxxxx
To encode this we need to encode a kind (binary, array, union), a quantity representing the length or the tag, and then a known-length sequence of bytes or values. We have three kinds, which can fit into the top two bits of a byte with a value to spare. The quantity can be encoded into the bottom six bits a sequence of bytes, ending when the top two bits are one of the kinds.
The result is something like this:
value: binary | array | union
binary: len=vlq(00) byte[len]
array: len=vlq(01) value[len]
union: tag=vlq(10) value
vlq(YY): YYxxxxxx | 11xxxxxx vlq(YY)
byte: xxxxxxxx
Properties
Compact
There is only as much overhead as needed, often only a single byte. Every possible byte stream is a prefix of a sequence of encoded values, so there is no redundancy in the encoding.
Fast
Decoding/parsing can proceed linearly with no backtracking.
Bijective
Every value has a single valid representation.
Additionally, every byte string is a valid prefix of a sequence of encoded values.
Usage and Implementations
Protocols
Libraries
- Rust atlv