Shannon's source coding theorem

<h2 id="statements">Statements</h2>
Source coding is a mapping from (a sequence of) symbols from an information <a href="/facts/Information_theory/qNLH3U5P">source</a> to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the bits (lossless source coding) or recovered within some distortion (lossy source coding). This is one approach to <a href="/facts/Data_compression/gMFuY4FT">data compression</a>.

<h3>Source coding theorem</h3>
In information theory, the source coding theorem (Shannon 1948)<a class="footnote-ref" id="fnref:2" href="#fn:2">2</a> informally states that (MacKay 2003, pg. 81,<a class="footnote-ref" id="fnref:3" href="#fn:3">3</a> Cover 2006, Chapter 5<a class="footnote-ref" id="fnref:4" href="#fn:4">4</a>):

<blockquote>N <a href="/facts/Independent_and_identically_distributed_random_variables/othIRaWt">i.i.d.</a> random variables each with entropy H(X) can be compressed into more than N H(X) <a href="/facts/Bit/S972NSaD">bits</a> with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.</blockquote>
The \(NH(X)\)-bit coded sequence represents the compressed message in a one-to-one way, under the assumption that the decoder knows the source. In practice, this assumption does not always hold. Consequently, when entropy encoding is applied, the transmitted message has length \(NH(X) + (\text{inf. source})\), where the second term is the information needed to describe the source. Usually, the information that characterizes the source is inserted at the beginning of the transmitted message.
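As a rough numerical illustration of the N H(X) figure above, the following Python sketch computes the entropy of a hypothetical biased coin and the corresponding asymptotic bit count. The probabilities 0.9 and 0.1 and the block length N = 1000 are assumptions chosen for illustration, not values taken from the theorem.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X), in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical source: a biased coin with P(heads) = 0.9.
probs = [0.9, 0.1]
N = 1000

H = entropy_bits(probs)
compressed_bits = N * H                  # asymptotic limit N H(X) from the theorem
raw_bits = N * math.log2(len(probs))     # one bit per symbol without compression

print(f"H(X) = {H:.4f} bits per symbol")
print(f"about {math.ceil(compressed_bits)} bits for N = {N} symbols "
      f"(versus {raw_bits:.0f} uncompressed)")
```

For this source H(X) is a little under half a bit per symbol, so in the limit of large N fewer than half of the raw bits are needed.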
<h3>Source coding theorem for symbol codes</h3>
Let \(\Sigma_1\), \(\Sigma_2\) denote two finite alphabets and let \(\Sigma_1^*\) and \(\Sigma_2^*\) denote the <a href="/facts/Kleene_star/I6SX0f2Y">set of all finite words</a> from those alphabets (respectively).
Suppose that X is a random variable taking values in \(\Sigma_1\) and let f be a <a href="/facts/Variable-length_code/DImYG6eg">uniquely decodable</a> code from \(\Sigma_1^*\) to \(\Sigma_2^*\) where \(|\Sigma_2| = a\). Let S denote the random variable given by the length of codeword f(X).
If f is optimal in the sense that it has the minimal expected word length for X, then (Shannon 1948):

\[\frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1\]

where \(\mathbb{E}\) denotes the <a href="/facts/Expected_value/1XV0JKL8">expected value</a> operator.
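The bound can be checked numerically for a binary code alphabet (a = 2). The sketch below uses Huffman's algorithm, which is not part of the statement above but is a standard way to build a prefix code of minimal expected length. The distribution and the function names are hypothetical, chosen only for illustration.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # every symbol under the merged node gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.4, 0.3, 0.2, 0.1]       # hypothetical source distribution
lengths = huffman_lengths(probs)
H = entropy_bits(probs)
expected_len = sum(p * s for p, s in zip(probs, lengths))

print("lengths:", lengths)
print(f"H(X) = {H:.4f}, E[S] = {expected_len:.4f}, H(X) + 1 = {H + 1:.4f}")
```

With a = 2 the denominator \(\log_2 a\) equals 1, so the check reduces to H(X) ≤ E[S] < H(X) + 1.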

<h2 id="proof-source-coding-theorem">Proof: source coding theorem</h2>
Given that X is an <a href="/facts/Independent_identically-distributed_random_variables/othIRaWt">i.i.d.</a> source, its <a href="/facts/Time_series/fSXPR817">time series</a> \(X_1, \ldots, X_n\) is i.i.d. with <a href="/facts/Entropy_(information_theory)/NLg4NLvt">entropy</a> H(X) in the discrete-valued case and <a href="/facts/Differential_entropy/EoJSyw95">differential entropy</a> in the continuous-valued case. The source coding theorem states that for any ε > 0, i.e. for any <a href="/facts/Information_theory/qNLH3U5P">rate</a> H(X) + ε larger than the <a href="/facts/Entropy/s52IXeFz">entropy</a> of the source, there is a large enough n and an encoder that takes n i.i.d. repetitions of the source, \(X_{1:n}\), and maps them to n(H(X) + ε) bits such that the source symbols \(X_{1:n}\) are recoverable from those bits with probability at least 1 − ε.
Proof of Achievability. Fix some ε > 0, and let

\[p(x_1, \ldots, x_n) = \Pr\left[X_1 = x_1, \cdots, X_n = x_n\right].\]

The <a href="/facts/Typical_set/trEjxlVz">typical set</a>, \(A_n^\varepsilon\), is defined as follows:

\[A_n^\varepsilon = \left\{(x_1, \cdots, x_n) \ :\ \left|-\frac{1}{n}\log p(x_1, \cdots, x_n) - H_n(X)\right| < \varepsilon\right\}.\]

Here \(H_n(X)\) is the per-symbol entropy of \((X_1, \ldots, X_n)\), which equals \(H(X)\) for an i.i.d. source.

The <a href="/facts/Asymptotic_equipartition_property/Jcy9SLWc">asymptotic equipartition property</a> (AEP) shows that for large enough n, the probability that a sequence generated by the source lies in the typical set \(A_n^\varepsilon\), as defined, approaches one. In particular, for sufficiently large n, \(P((X_1, X_2, \cdots, X_n) \in A_n^\varepsilon)\) can be made arbitrarily close to 1, and specifically, greater than \(1 - \varepsilon\) (see <a href="/facts/Asymptotic_equipartition_property/Jcy9SLWc">AEP</a> for a proof).
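The convergence claimed by the AEP can be illustrated numerically. The sketch below assumes a Bernoulli(p) source and base-2 logarithms (both are assumptions made for illustration) and computes the probability of the typical set by grouping sequences by their number of ones. The function name prob_typical is hypothetical.

```python
import math

def prob_typical(p, n, eps):
    """Probability that n i.i.d. Bernoulli(p) bits form an eps-typical sequence."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    total = 0.0
    # All sequences with the same number of ones k have the same probability.
    for k in range(n + 1):
        log2_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_prob / n - H) < eps:
            # log of the binomial coefficient, via lgamma to avoid overflow
            log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                         - math.lgamma(n - k + 1))
            total += math.exp(log_binom + log2_prob * math.log(2))
    return total

for n in (10, 100, 1000, 10000):
    print(n, round(prob_typical(0.3, n, 0.05), 4))
```

The printed probabilities should increase towards 1 as n grows, which is the behaviour the proof relies on.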
The definition of typical sets implies that those sequences that lie in the typical set satisfy:

\[2^{-n(H(X)+\varepsilon)} \leq p\left(x_1, \cdots, x_n\right) \leq 2^{-n(H(X)-\varepsilon)}\]

<ul><li>The probability of a sequence \((X_1, X_2, \cdots, X_n)\) being drawn from \(A_n^\varepsilon\) is greater than 1 − ε.</li>
<li>\(\left|A_n^\varepsilon\right| \leq 2^{n(H(X)+\varepsilon)}\), which follows from the left-hand side (lower bound) for \(p(x_1, x_2, \cdots, x_n)\).</li>
<li>\(\left|A_n^\varepsilon\right| \geq (1-\varepsilon)2^{n(H(X)-\varepsilon)}\), which follows from the upper bound for \(p(x_1, x_2, \cdots, x_n)\) and the lower bound on the total probability of the whole set \(A_n^\varepsilon\).</li></ul>
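The two cardinality bounds can be checked by brute force for a small block length. The sketch below assumes a Bernoulli(p) source and base-2 logarithms; the parameters are hypothetical and chosen so that all \(2^n\) sequences can be enumerated. Because n is small, the probability of the typical set is still well below 1 − ε, so the lower bound is evaluated with the actual probability of the set in place of 1 − ε.

```python
import itertools
import math

def typical_set(p, n, eps):
    """Enumerate the eps-typical set of n i.i.d. Bernoulli(p) bits.

    Returns (typical sequences, total probability of the set, H(X)).
    """
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    typical, mass = [], 0.0
    for x in itertools.product((0, 1), repeat=n):
        k = sum(x)
        log2_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_prob / n - H) < eps:
            typical.append(x)
            mass += 2.0 ** log2_prob
    return typical, mass, H

p, n, eps = 0.3, 16, 0.1            # hypothetical parameters
A, mass, H = typical_set(p, n, eps)
upper = 2 ** (n * (H + eps))        # |A| <= 2^{n(H + eps)}
lower = mass * 2 ** (n * (H - eps)) # |A| >= Pr(A) 2^{n(H - eps)}

print(f"|A| = {len(A)} of {2 ** n} sequences, Pr(A) = {mass:.3f}")
print(f"bounds: {lower:.0f} <= |A| <= {upper:.0f}")
```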
Since \(\left|A_n^\varepsilon\right| \leq 2^{n(H(X)+\varepsilon)}\), \(n(H(X)+\varepsilon)\) bits are enough to point to any string in this set.
The encoding algorithm: the encoder checks whether the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary n(H(X) + ε)-digit number. As long as the input sequence lies within the typical set (with probability at least 1 − ε), the encoder does not make any error. So, the probability of error of the encoder is bounded above by ε.
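The encoding algorithm described above can be sketched in a few lines for a toy source. This is a minimal illustration, assuming a Bernoulli(p) source, base-2 logarithms, and a block length small enough to enumerate; the names build_codebook, encode and decode are hypothetical, and atypical inputs are mapped to index 0 as the "arbitrary" output.

```python
import itertools
import math

def build_codebook(p, n, eps):
    """Enumerate the eps-typical set of n i.i.d. Bernoulli(p) bits.

    Returns the ordered list of typical sequences, a lookup table from
    sequence to index, and the fixed codeword length in bits.
    """
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    book = []
    for x in itertools.product((0, 1), repeat=n):
        k = sum(x)
        log2_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_prob / n - H) < eps:
            book.append(x)
    index = {seq: i for i, seq in enumerate(book)}
    n_bits = math.ceil(n * (H + eps))   # enough bits to index the typical set
    return book, index, n_bits

def encode(x, index, n_bits):
    """Bit string of the index of x; atypical inputs get the arbitrary index 0."""
    return format(index.get(tuple(x), 0), f"0{n_bits}b")

def decode(bits, book):
    """Look the received index up in the typical set."""
    return book[int(bits, 2)]

book, index, n_bits = build_codebook(p=0.1, n=16, eps=0.1)
x = (1, 0, 0, 0, 0, 0, 0, 1) + (0,) * 8   # two ones out of 16: typical here
code = encode(x, index, n_bits)
print(f"{len(x)} source bits -> {len(code)} code bits, "
      f"recovered: {decode(code, book) == x}")
```

A typical input is recovered exactly, while an atypical one (for example the all-zeros sequence with these parameters) decodes to the wrong sequence; that event is the error whose probability the proof bounds by ε for large n.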
Proof of converse: the converse is proved by showing that any set of size smaller than \(A_n^\varepsilon\) (in the sense of exponent) would cover a set of probability bounded away from 1.

<h2 id="proof-source-coding-theorem-for-symbol-codes">Proof: Source coding theorem for symbol codes</h2>
For 1 ≤ i ≤ n let \(s_i\) denote the word length of each possible \(x_i\), and let \(p_i\) denote the probability of \(x_i\). Define \(q_i = a^{-s_i}/C\), where C is chosen so that \(q_1 + \cdots + q_n = 1\). Then

\[\begin{aligned}
H(X) &= -\sum_{i=1}^{n} p_i \log_2 p_i \\
&\leq -\sum_{i=1}^{n} p_i \log_2 q_i \\
&= -\sum_{i=1}^{n} p_i \log_2 a^{-s_i} + \sum_{i=1}^{n} p_i \log_2 C \\
&= -\sum_{i=1}^{n} p_i \log_2 a^{-s_i} + \log_2 C \\
&\leq -\sum_{i=1}^{n} -s_i p_i \log_2 a \\
&= \mathbb{E}[S] \log_2 a
\end{aligned}\]

where the second line follows from <a href="/facts/Gibbs%2527_inequality/LjzY1B89">Gibbs' inequality</a> and the fifth line follows from <a href="/facts/Kraft%2527s_inequality/yU2yoHcY">Kraft's inequality</a>:

\[C = \sum_{i=1}^{n} a^{-s_i} \leq 1\]

so log C ≤ 0.
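The chain of inequalities can be evaluated numerically for a concrete code. The sketch below is only an illustration: the three-symbol distribution and the word lengths (those of the binary prefix code {0, 10, 11}) are hypothetical, and the function name lower_bound_chain is an assumption.

```python
import math

def lower_bound_chain(probs, lengths, a=2):
    """Evaluate the quantities appearing in the derivation above.

    Returns (C, H(X), -sum p_i log2 q_i, E[S] log2 a).
    """
    C = sum(a ** (-s) for s in lengths)             # Kraft sum
    q = [a ** (-s) / C for s in lengths]            # normalised so that sum(q) == 1
    H = -sum(p * math.log2(p) for p in probs)
    cross = -sum(p * math.log2(qi) for p, qi in zip(probs, q))
    expected_len = sum(p * s for p, s in zip(probs, lengths))
    return C, H, cross, expected_len * math.log2(a)

probs = [0.5, 0.3, 0.2]      # hypothetical source distribution
lengths = [1, 2, 2]          # word lengths of the prefix code {0, 10, 11}
C, H, cross, bound = lower_bound_chain(probs, lengths)

print(f"C = {C}, H(X) = {H:.4f}, cross term = {cross:.4f}, "
      f"E[S] log2 a = {bound:.4f}")
```

Gibbs' inequality gives the first comparison (H(X) is at most the cross term) and the Kraft sum C ≤ 1 gives the second (the cross term is at most E[S] log<sub>2</sub> a).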
For the second inequality we may set

\[s_i = \lceil -\log_a p_i \rceil\]

so that

\[-\log_a p_i \leq s_i < -\log_a p_i + 1\]

and so

\[a^{-s_i} \leq p_i\]

and

\[\sum a^{-s_i} \leq \sum p_i = 1\]

and so by Kraft's inequality there exists a prefix-free code having those word lengths. The expected word length of this code, which is an upper bound on that of the optimal code, satisfies

\[\begin{aligned}
\mathbb{E}[S] &= \sum p_i s_i \\
&< \sum p_i \left(-\log_a p_i + 1\right) \\
&= \sum -p_i \frac{\log_2 p_i}{\log_2 a} + 1 \\
&= \frac{H(X)}{\log_2 a} + 1
\end{aligned}\]
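The construction used for the upper bound can be tried on a small example. The sketch below computes the word lengths \(s_i = \lceil -\log_a p_i \rceil\), checks Kraft's inequality, and compares the expected length with the two bounds. The distribution is hypothetical, and the use of exact rational arithmetic (Python's fractions.Fraction) is an implementation choice made here to avoid floating-point rounding in the ceiling, not something the proof requires.

```python
import math
from fractions import Fraction

def shannon_lengths(probs, a=2):
    """Word lengths s_i = ceil(-log_a p_i), found as the least s with a**-s <= p_i."""
    lengths = []
    for p in probs:
        s = 0
        while Fraction(1, a ** s) > p:
            s += 1
        lengths.append(s)
    return lengths

def check_bounds(probs, a=2):
    """Return (lengths, Kraft sum, E[S], H(X) / log2 a) for the lengths above."""
    lengths = shannon_lengths(probs, a)
    kraft = sum(Fraction(1, a ** s) for s in lengths)
    expected_len = float(sum(p * s for p, s in zip(probs, lengths)))
    h_over_log_a = -sum(float(p) * math.log2(float(p)) for p in probs) / math.log2(a)
    return lengths, kraft, expected_len, h_over_log_a

# Hypothetical source distribution, given as exact fractions.
probs = [Fraction(2, 5), Fraction(3, 10), Fraction(1, 5), Fraction(1, 10)]
for a in (2, 3):
    lengths, kraft, e_s, h = check_bounds(probs, a)
    print(f"a = {a}: lengths {lengths}, Kraft sum {kraft}, "
          f"{h:.4f} <= {e_s:.4f} < {h + 1:.4f}")
```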

<h2 id="extension-to-non-stationary-independent-sources">Extension to non-stationary independent sources</h2>
<h3>Fixed rate lossless source coding for discrete time non-stationary independent sources</h3>
Define the typical set \(A_n^\varepsilon\) as:

\[A_n^\varepsilon = \left\{x_1^n \ :\ \left|-\frac{1}{n}\log p\left(x_1, \cdots, x_n\right) - \overline{H_n}(X)\right| < \varepsilon\right\},\]

where \(\overline{H_n}(X)\) denotes the average entropy per symbol, \(\frac{1}{n}H(X_1, \ldots, X_n)\).

Then, for given δ > 0, for n large enough, \(\Pr(A_n^\varepsilon) > 1 - \delta\). Now we just encode the sequences in the typical set, and the usual methods in source coding show that the cardinality of this set is smaller than \(2^{n(\overline{H_n}(X)+\varepsilon)}\). Thus, on average, \(\overline{H_n}(X) + \varepsilon\) bits per symbol suffice for encoding with probability greater than 1 − δ, where ε and δ can be made arbitrarily small by making n larger.
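The cardinality claim can be checked by enumeration for a short sequence of independent but differently distributed bits. The sketch below rests on two assumptions that are not stated in the text: the average entropy \(\overline{H_n}(X)\) is taken to be the mean of the individual entropies \(H(X_i)\) (which equals \(\frac{1}{n}H(X_1, \ldots, X_n)\) for independent symbols), and logarithms are base 2. The drifting biases are hypothetical.

```python
import itertools
import math

def h2(p):
    """Binary entropy function in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_independent(ps, eps):
    """Typical set for independent bits where bit i is Bernoulli(ps[i])."""
    n = len(ps)
    H_bar = sum(h2(p) for p in ps) / n   # assumed meaning of the average entropy
    typical, mass = [], 0.0
    for x in itertools.product((0, 1), repeat=n):
        prob = math.prod(p if xi else 1 - p for p, xi in zip(ps, x))
        if abs(-math.log2(prob) / n - H_bar) < eps:
            typical.append(x)
            mass += prob
    return typical, mass, H_bar

# Hypothetical non-stationary source: the bias cycles through 0.1, 0.15, ..., 0.3.
ps = [0.1 + 0.05 * (i % 5) for i in range(14)]
eps = 0.1
A, mass, H_bar = typical_set_independent(ps, eps)
bound = 2 ** (len(ps) * (H_bar + eps))

print(f"average entropy = {H_bar:.4f} bits per symbol")
print(f"|A| = {len(A)} < {bound:.0f}, Pr(A) = {mass:.3f}")
```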

<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Channel_coding/cLUTUJrA">Channel coding</a></li>
<li><a href="/facts/Error_exponent/pyzBMNvK">Error exponent</a></li>
<li><a href="/facts/Noisy-channel_coding_theorem/vvr8u4Gg">Noisy-channel coding theorem</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Shen, A.; Uspensky, V. A.; Vereshchagin, N. (2017). "Chapter 7.3: Complexity and entropy". Kolmogorov Complexity and Algorithmic Randomness. American Mathematical Society. p. 226. ISBN 9781470431822. <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2"><a href="/wiki/C.E._Shannon" target="_blank">C.E. Shannon</a>, "A Mathematical Theory of Communication", Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October, 1948. <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1. <a href="http://www.inference.phy.cam.ac.uk/mackay/itila/book.html" target="_blank">http://www.inference.phy.cam.ac.uk/mackay/itila/book.html</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
<li id="fn:4">Cover, Thomas M. (2006). "Chapter 5: Data Compression". Elements of Information Theory. John Wiley & Sons. pp. 103–142. ISBN 0-471-24195-4. <a href="#fnref:4" class="footnote-back-ref">↩</a></li>
</ol>
