The extension of a code is the mapping of finite-length source sequences to finite-length bit strings that is obtained by concatenating, for each symbol of the source sequence, the corresponding codeword produced by the original code.
Using terms from formal language theory, the precise mathematical definition is as follows: Let S and T be two finite sets, called the source and target alphabets, respectively. A code C : S → T* is a total function mapping each symbol from S to a sequence of symbols over T, and the extension of C to a homomorphism of S* into T*, which naturally maps each sequence of source symbols to a sequence of target symbols, is referred to as its extension.
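The extension-by-concatenation can be sketched in a few lines of Python. The dictionary below is a hypothetical example code (the symbol names and codewords are illustrative assumptions, not taken from the text):

```python
# A hypothetical code C : S -> T*, given as a dict from source symbols
# to bit strings over the target alphabet {0, 1}.
C = {"a": "0", "b": "10", "c": "110", "d": "111"}

def extend(code, sequence):
    """Extension of `code` to S*: encode each symbol and concatenate."""
    return "".join(code[symbol] for symbol in sequence)

print(extend(C, "abad"))  # concatenation of 0, 10, 0, 111 -> "0100111"
```

The empty source sequence maps to the empty target string, as required of a monoid homomorphism.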
Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes, and prefix codes. Prefix codes are always uniquely decodable, and uniquely decodable codes are in turn always non-singular:
A code is non-singular if each source symbol is mapped to a different non-empty bit string; that is, the mapping from source symbols to bit strings is injective.
A code is uniquely decodable if its extension is non-singular. Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm.
Main article: Prefix code
A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.
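Checking the prefix property is straightforward; one sketch, assuming the codewords are given as a list of bit strings, exploits the fact that after lexicographic sorting any prefix of a codeword appears immediately before it:

```python
def is_prefix_code(codewords):
    """True iff no codeword is a prefix of a different codeword."""
    words = sorted(codewords)  # a prefix sorts directly before its extensions
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

So {0, 10, 110, 111} passes, while {0, 01} fails because 0 is a prefix of 01.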
Block codes are a special case of prefix codes in which all codewords have the same length. Block codes are not very useful in the context of source coding, but often serve as forward error correction in the context of channel coding.
Another special case of prefix codes are LEB128 and variable-length quantity (VLQ) codes, which encode arbitrarily large integers as a sequence of octets—i.e., the length of every codeword is a multiple of 8 bits.
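The unsigned LEB128 scheme can be sketched as follows: each octet carries 7 payload bits, and its high bit flags whether another octet follows (function names here are illustrative):

```python
def leb128_encode(n):
    """Unsigned LEB128: 7 payload bits per octet, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more octets follow
        else:
            out.append(byte)         # final octet: high bit clear
            return bytes(out)

def leb128_decode(data):
    """Decode an unsigned LEB128 byte string back to an integer."""
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return n
```

Because the final octet of every codeword has its high bit clear and all earlier octets have it set, no encoding is a prefix of another: the decoder knows exactly where each integer ends.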
The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low expected codeword length. For the above example, if the probabilities of (a, b, c, d) were (1/2, 1/4, 1/8, 1/8), the expected number of bits used to represent a source symbol using the code above would be:

1 × 1/2 + 2 × 1/4 + 3 × 1/8 + 3 × 1/8 = 1.75 bits.
As the entropy of this source is 1.75 bits per symbol, this code compresses the source as much as possible so that the source can be recovered with zero error.
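The expected-length and entropy computation can be verified directly. The codeword lengths (1, 2, 3, 3) below are an assumption consistent with the stated probabilities and the claim that the code meets the entropy bound:

```python
from math import log2

probs   = [1/2, 1/4, 1/8, 1/8]  # probabilities of (a, b, c, d)
lengths = [1, 2, 3, 3]          # assumed codeword lengths of the example code

expected_len = sum(p * l for p, l in zip(probs, lengths))
entropy      = -sum(p * log2(p) for p in probs)

print(expected_len, entropy)    # both equal 1.75 bits per symbol
```

Equality of the two quantities is no accident here: each probability is a power of 1/2, so a code with length −log2(p) for each symbol exactly achieves the entropy.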
This code is based on an example found in Berstel et al. (2009), Example 2.3.1, p. 63.