The extension of a code is the mapping of finite-length source sequences to finite-length bit strings that is obtained by concatenating, for each symbol of the source sequence, the corresponding codeword produced by the original code.
Using terms from formal language theory, the precise mathematical definition is as follows: Let S and T be two finite sets, called the source and target alphabets, respectively. A code C : S → T* is a total function mapping each symbol from S to a sequence of symbols over T, and the extension of C to a homomorphism of S* into T*, which naturally maps each sequence of source symbols to a sequence of target symbols, is referred to as its extension.
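The extension-by-concatenation can be sketched in a few lines of Python. The dictionary below is a hypothetical example code (the symbol names and codewords are illustrative assumptions, not taken from the text):

```python
# A hypothetical code C : S -> T*, given as a dict from source symbols
# to bit strings over the target alphabet {0, 1}.
C = {"a": "0", "b": "10", "c": "110", "d": "111"}

def extend(code, sequence):
    """Extension of `code` to S*: encode each symbol and concatenate."""
    return "".join(code[symbol] for symbol in sequence)

print(extend(C, "abad"))  # concatenation of 0, 10, 0, 111 -> "0100111"
```

The empty source sequence maps to the empty target string, as required of a monoid homomorphism.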
Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes, and prefix codes. Prefix codes are always uniquely decodable, and uniquely decodable codes are in turn always non-singular:
A code is non-singular if each source symbol is mapped to a different non-empty bit string; that is, the mapping from source symbols to bit strings is injective.
A code is uniquely decodable if its extension is non-singular. Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm.
Main article: Prefix code
A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.
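Checking the prefix property is straightforward; one sketch, assuming the codewords are given as a list of bit strings, exploits the fact that after lexicographic sorting any prefix of a codeword appears immediately before it:

```python
def is_prefix_code(codewords):
    """True iff no codeword is a prefix of a different codeword."""
    words = sorted(codewords)  # a prefix sorts directly before its extensions
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

So {0, 10, 110, 111} passes, while {0, 01} fails because 0 is a prefix of 01.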
Block codes are a special case of prefix codes in which all codewords have the same length. Block codes are not very useful in the context of source coding, but often serve as forward error correction in the context of channel coding.
Another special case of prefix codes are LEB128 and variable-length quantity (VLQ) codes, which encode arbitrarily large integers as a sequence of octets—i.e., the length of every codeword is a multiple of 8 bits.
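The unsigned LEB128 scheme can be sketched as follows: each octet carries 7 payload bits, and its high bit flags whether another octet follows (function names here are illustrative):

```python
def leb128_encode(n):
    """Unsigned LEB128: 7 payload bits per octet, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more octets follow
        else:
            out.append(byte)         # final octet: high bit clear
            return bytes(out)

def leb128_decode(data):
    """Decode an unsigned LEB128 byte string back to an integer."""
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return n
```

Because the final octet of every codeword has its high bit clear and all earlier octets have it set, no encoding is a prefix of another: the decoder knows exactly where each integer ends.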
The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low expected codeword length. For the above example, if the probabilities of (a, b, c, d) were (1/2, 1/4, 1/8, 1/8), the expected number of bits used to represent a source symbol using the code above would be:

1 × 1/2 + 2 × 1/4 + 3 × 1/8 + 3 × 1/8 = 1.75 bits.
As the entropy of this source is 1.75 bits per symbol, this code compresses the source as much as possible so that the source can be recovered with zero error.
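The expected-length and entropy computation can be verified directly. The codeword lengths (1, 2, 3, 3) below are an assumption consistent with the stated probabilities and the claim that the code meets the entropy bound:

```python
from math import log2

probs   = [1/2, 1/4, 1/8, 1/8]  # probabilities of (a, b, c, d)
lengths = [1, 2, 3, 3]          # assumed codeword lengths of the example code

expected_len = sum(p * l for p, l in zip(probs, lengths))
entropy      = -sum(p * log2(p) for p in probs)

print(expected_len, entropy)    # both equal 1.75 bits per symbol
```

Equality of the two quantities is no accident here: each probability is a power of 1/2, so a code with length −log2(p) for each symbol exactly achieves the entropy.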
This code is based on an example found in Berstel et al. (2009), Example 2.3.1, p. 63.