tf–idf

<h2 id="motivations">Motivations</h2>
<p><a href="/facts/Karen_Sp%C3%A4rck_Jones/NLFBuKaf">Karen Spärck Jones</a> (1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting:<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a>
</p>
<blockquote><p>The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.</p></blockquote><p>For example, the df (document frequency) and idf for some words in Shakespeare's 37 plays are as follows:<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a>
</p><table><tbody><tr><th>Word</th><th>df</th><th>idf</th></tr><tr><td>Romeo</td><td>1</td><td>1.57</td></tr><tr><td>salad</td><td>2</td><td>1.27</td></tr><tr><td>Falstaff</td><td>4</td><td>0.967</td></tr><tr><td>forest</td><td>12</td><td>0.489</td></tr><tr><td>battle</td><td>21</td><td>0.246</td></tr><tr><td>wit</td><td>34</td><td>0.037</td></tr><tr><td>fool</td><td>36</td><td>0.012</td></tr><tr><td>good</td><td>37</td><td>0</td></tr><tr><td>sweet</td><td>37</td><td>0</td></tr></tbody></table>
<p>We see that "<a href="/facts/Romeo/0LMSRTWW">Romeo</a>", "<a href="/facts/John_Falstaff/hY8L8L7K">Falstaff</a>", and "salad" appear in very few plays, so seeing one of these words gives a good idea of which play the text comes from. In contrast, "good" and "sweet" appear in every play and are completely uninformative as to which play it is.
</p>
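<p>The idf column can be recomputed from the df column. The following is a minimal sketch, assuming a base-10 logarithm (the text does not state the base, but base 10 agrees with the tabulated values up to rounding); the names <code>N_PLAYS</code>, <code>DF</code> and <code>idf</code> are illustrative only.</p>

```python
import math

# Document frequencies (df) copied from the table; 37 is the number of plays.
N_PLAYS = 37
DF = {"Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12,
      "battle": 21, "wit": 34, "fool": 36, "good": 37, "sweet": 37}

def idf(df, n_docs=N_PLAYS):
    """Base-10 log of (number of documents / document frequency)."""
    return math.log10(n_docs / df)

# Print each word with its document frequency and recomputed idf.
for word, df in DF.items():
    print(f"{word:10s} df={df:2d} idf={idf(df):.3f}")
```

<p>A word that occurs in every play gets an idf of exactly zero, while a word confined to a single play gets the largest value the corpus allows, log(37).</p>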
<h2 id="definition">Definition</h2>
<ol><li>The tf–idf is the product of two statistics, <i>term frequency</i> and <i>inverse document frequency</i>. There are various ways for determining the exact values of both statistics.</li>
<li>A formula that aims to quantify the importance of a keyword or phrase within a document or a web page.</li></ol>
Variants of term frequency (tf) weight<table><tbody><tr><th>weighting scheme</th><th>tf weight</th></tr>
<tr><td>binary</td><td>\(0, 1\)</td></tr>
<tr><td>raw count</td><td>\(f_{t,d}\)</td></tr>
<tr><td>term frequency</td><td>\(f_{t,d} \Big/ \sum_{t' \in d} f_{t',d}\)</td></tr>
<tr><td>log normalization</td><td>\(\log(1 + f_{t,d})\)</td></tr>
<tr><td>double normalization 0.5</td><td>\(0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{\{t' \in d\}} f_{t',d}}\)</td></tr>
<tr><td>double normalization K</td><td>\(K + (1 - K) \frac{f_{t,d}}{\max_{\{t' \in d\}} f_{t',d}}\)</td></tr></tbody></table>
<h3>Term frequency</h3>
<p>Term frequency, tf(<i>t</i>,<i>d</i>), is the relative frequency of term <i>t</i> within document <i>d</i>, 
</p>

<p>\[\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}},\]</p>
<p>where <i>f</i><sub><i>t</i>,<i>d</i></sub> is the <i>raw count</i> of a term in a document, i.e., the number of times that term <i>t</i> occurs in document <i>d</i>. Note that the denominator is simply the total number of terms in document <i>d</i> (counting each occurrence of the same term separately). There are various other ways to define term frequency:<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a><sup>:128</sup>
</p>
<ul><li>the raw count itself: tf(<i>t</i>,<i>d</i>) = <i>f</i><sub><i>t</i>,<i>d</i></sub>;</li>
<li><a href="/facts/Boolean_data_type/0GWYLDoC">Boolean</a> "frequencies": tf(<i>t</i>,<i>d</i>) = 1 if <i>t</i> occurs in <i>d</i> and 0 otherwise;</li>
<li><a href="/facts/Logarithmic_scale/CvIGK5TQ">logarithmically scaled</a> frequency: tf(<i>t</i>,<i>d</i>) = log (1 + <i>f</i><sub><i>t</i>,<i>d</i></sub>);<a class="footnote-ref" id="fnref:6" href="#fn:6"><sup>6</sup></a></li>
<li>augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequently occurring term in the document:</li></ul>

<p>\[\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}\]</p>
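<p>The term-frequency variants described in this section can be compared side by side. The following is a minimal sketch, assuming a document is given as a non-empty list of tokens and that the natural logarithm is used; the function name <code>tf_variants</code> and its parameter <code>k</code> are hypothetical names chosen for illustration.</p>

```python
import math
from collections import Counter

def tf_variants(term, tokens, k=0.5):
    """Return several term-frequency weights for `term` in a tokenized document.

    Assumes `tokens` is non-empty; `k` is the constant K of double normalization.
    """
    counts = Counter(tokens)
    f = counts[term]                 # raw count f_{t,d} (0 if the term is absent)
    total = sum(counts.values())     # total number of term occurrences in d
    f_max = max(counts.values())     # count of the most frequent term in d
    return {
        "binary": 1 if f > 0 else 0,
        "raw_count": f,
        "term_frequency": f / total,
        "log_normalization": math.log(1 + f),
        "double_normalization_K": k + (1 - k) * f / f_max,
    }

doc = "this is a a sample".split()
weights = tf_variants("a", doc)
```

<p>With the default <code>k=0.5</code> the last entry is the augmented frequency: the most frequent term of a document always receives weight 1, whatever the document's length.</p>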

<h3>Inverse document frequency</h3>
Variants of inverse document frequency (idf) weight<table><tbody><tr><th>weighting scheme</th><th>idf weight (\(n_t = |\{d \in D : t \in d\}|\))</th></tr>
<tr><td>unary</td><td>1</td></tr>
<tr><td>inverse document frequency</td><td>\(\log \frac{N}{n_t} = -\log \frac{n_t}{N}\)</td></tr>
<tr><td>inverse document frequency smooth</td><td>\(\log\left(\frac{N}{1 + n_t}\right) + 1\)</td></tr>
<tr><td>inverse document frequency max</td><td>\(\log\left(\frac{\max_{\{t' \in d\}} n_{t'}}{1 + n_t}\right)\)</td></tr>
<tr><td>probabilistic inverse document frequency</td><td>\(\log \frac{N - n_t}{n_t}\)</td></tr></tbody></table>
<p>The inverse document frequency is a measure of how much information the word provides, i.e., how common or rare it is across all documents. It is the <a href="/facts/Logarithmic_scale/CvIGK5TQ">logarithmically scaled</a> inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):
</p>

<p>\[\mathrm{idf}(t, D) = \log \frac{N}{|\{d : d \in D \text{ and } t \in d\}|}\]</p>

<p>with
</p>
<ul><li>\(N\): total number of documents in the corpus, \(N = |D|\)</li>
<li>\(|\{d \in D : t \in d\}|\): number of documents where the term \(t\) appears (i.e., \(\mathrm{tf}(t,d) \neq 0\)). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the numerator to \(1 + N\) and the denominator to \(1 + |\{d \in D : t \in d\}|\).</li></ul>
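<p>The division-by-zero case and the adjusted form can be illustrated as follows. This is a minimal sketch, assuming each document is represented as a set of terms and that the natural logarithm is used; the function names are hypothetical.</p>

```python
import math

def document_frequency(term, docs):
    """Number of documents (given as collections of terms) that contain `term`."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    """Plain idf: log(N / n_t). Raises ZeroDivisionError if the term is absent."""
    return math.log(len(docs) / document_frequency(term, docs))

def idf_adjusted(term, docs):
    """Adjusted idf: log((1 + N) / (1 + n_t)), defined even when n_t = 0."""
    return math.log((1 + len(docs)) / (1 + document_frequency(term, docs)))

docs = [{"this", "is", "a", "sample"}, {"this", "is", "another", "example"}]
```

<p>Here <code>idf("missing", docs)</code> raises <code>ZeroDivisionError</code>, whereas <code>idf_adjusted("missing", docs)</code> returns log(3). The adjustment keeps the ordering of terms by rarity but changes the numerical values, so weights computed with the two forms are not directly comparable.</p>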

<h3>Term frequency–inverse document frequency</h3>
Variants of term frequency-inverse document frequency (tf–idf) weights<table><tbody><tr><th>weighting scheme</th><th>tf-idf</th></tr>
<tr><td>count-idf</td><td>\(f_{t,d} \cdot \log \frac{N}{n_t}\)</td></tr>
<tr><td>double normalization-idf</td><td>\(\left(0.5 + 0.5 \frac{f_{t,q}}{\max_t f_{t,q}}\right) \cdot \log \frac{N}{n_t}\)</td></tr>
<tr><td>log normalization-idf</td><td>\((1 + \log f_{t,d}) \cdot \log \frac{N}{n_t}\)</td></tr></tbody></table>
<p>Then tf–idf is calculated as
</p>

<p>\[\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)\]</p>

<p>A high weight in tf–idf is reached by a high term <a href="/facts/Frequency_(statistics)/FtFNmhDS">frequency</a> (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
</p>
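<p>The product and its non-negativity can be checked on a small corpus. The following is a minimal sketch, assuming tokenized documents, relative-frequency tf, and the unadjusted idf with a natural logarithm; the three-sentence corpus and the function name <code>tfidf</code> are made up for illustration.</p>

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each (document index, term) pair to tf(t, d) * idf(t, D).

    tf is the relative frequency and idf is log(N / n_t) with a natural log.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # n_t: number of documents containing each term
    doc_freq = Counter(term for c in counts for term in c)
    weights = {}
    for i, c in enumerate(counts):
        total = sum(c.values())
        for term, f in c.items():
            weights[(i, term)] = (f / total) * math.log(n_docs / doc_freq[term])
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tfidf(corpus)
```

<p>In this corpus "the" occurs in every document, so its weight is zero everywhere despite its high term frequency, while "mat", which occurs in a single document, outweighs "cat", which occurs in two.</p>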

<h2 id="justification-of-idf">Justification of idf</h2>
<p>Idf was introduced as "term specificity" by <a href="/facts/Karen_Sp%C3%A4rck_Jones/NLFBuKaf">Karen Spärck Jones</a> in a 1972 paper. Although it has worked well as a <a href="/facts/Heuristic/BcBEkv9c">heuristic</a>, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find <a href="/facts/Information_theory/qNLH3U5P">information theoretic</a> justifications for it.<a class="footnote-ref" id="fnref:7" href="#fn:7"><sup>7</sup></a>
</p><p>Spärck Jones's own explanation did not propose much theory, aside from a connection to <a href="/facts/Zipf%27s_law/bspk1du8">Zipf's law</a>.<a class="footnote-ref" id="fnref:8" href="#fn:8"><sup>8</sup></a> Attempts have been made to put idf on a <a href="/facts/Probability_theory/mFBn51rE">probabilistic</a> footing,<a class="footnote-ref" id="fnref:9" href="#fn:9"><sup>9</sup></a> by estimating the probability that a given document d contains a term t as the relative document frequency,
</p>

<p>\[P(t|D) = \frac{|\{d \in D : t \in d\}|}{N},\]</p>

<p>so that we can define idf as
</p>

<p>\[\mathrm{idf} = -\log P(t|D) = \log \frac{1}{P(t|D)} = \log \frac{N}{|\{d \in D : t \in d\}|}\]</p>

<p>Namely, the inverse document frequency is the logarithm of "inverse" relative document frequency.
</p><p>This probabilistic interpretation in turn takes the same form as that of <a href="/facts/Self-information/EkUdL669">self-information</a>. However, applying such information-theoretic notions to problems in information retrieval leads to problems when trying to define the appropriate <a href="/facts/Event_space/WVldkN4F">event spaces</a> for the required <a href="/facts/Probability_distribution/EpsKKVRu">probability distributions</a>: not only documents need to be taken into account, but also queries and terms.<a class="footnote-ref" id="fnref:10" href="#fn:10"><sup>10</sup></a>
</p>
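<p>As a worked instance of this reading, take the word "forest" from the Shakespeare table in the Motivations section (df = 12 out of 37 plays), and assume a base-10 logarithm, which is not stated in the text but agrees with the tabulated value:</p>
<p>\[P(\text{"forest"}|D) = \frac{12}{37} \approx 0.324, \qquad \mathrm{idf} = -\log_{10} P(\text{"forest"}|D) = \log_{10} \frac{37}{12} \approx 0.489\]</p>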
<h2 id="link-with-information-theory">Link with information theory</h2>
<p>Both term frequency and inverse document frequency can be formulated in terms of <a href="/facts/Information_theory/qNLH3U5P">information theory</a>; this helps to explain why their product has a meaning in terms of the joint informational content of a document. A characteristic assumption about the distribution \(p(d,t)\) is that:
</p>

<p>\[p(d|t) = \frac{1}{|\{d \in D : t \in d\}|}\]</p>

<p>This assumption and its implications, according to Aizawa: "represent the heuristic that tf–idf employs."<a class="footnote-ref" id="fnref:11" href="#fn:11"><sup>11</sup></a>
</p><p>The <a href="/facts/Conditional_entropy/gHzPezGB">conditional entropy</a> of a "randomly chosen" document in the corpus \(D\), conditional on the fact that it contains a specific term \(t\) (and assuming that all documents have equal probability of being chosen), is:
</p>

<p>\[H(\mathcal{D}|\mathcal{T} = t) = -\sum_{d} p_{d|t} \log p_{d|t} = -\log \frac{1}{|\{d \in D : t \in d\}|} = \log \frac{|\{d \in D : t \in d\}|}{|D|} + \log |D| = -\mathrm{idf}(t) + \log |D|\]</p>
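<p>The identity between the conditional entropy and log |D| minus idf can be checked numerically. The following is a minimal sketch, assuming the uniform distribution p(d|t) = 1/n<sub>t</sub> over the documents that contain the term and a natural logarithm; the toy corpus and function names are hypothetical.</p>

```python
import math

def conditional_entropy_given_term(term, docs):
    """H(D | T = t) under the uniform assumption p(d | t) = 1 / n_t."""
    n_t = sum(1 for doc in docs if term in doc)
    p = 1.0 / n_t
    # Only the n_t documents containing the term have non-zero probability.
    return -sum(p * math.log(p) for _ in range(n_t))

def idf(term, docs):
    """Unadjusted idf: log(N / n_t); the term must occur in some document."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_t)

docs = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"e"}]
for term in ["a", "b", "e"]:
    lhs = conditional_entropy_given_term(term, docs)
    rhs = math.log(len(docs)) - idf(term, docs)
    # The two sides agree up to floating-point rounding.
    assert math.isclose(lhs, rhs, abs_tol=1e-12)
```

<p>A term found in a single document leaves no uncertainty about which document was drawn (entropy 0), which corresponds to the largest possible idf, log |D|.</p>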

<p>In terms of notation, \(\mathcal{D}\) and \(\mathcal{T}\) are "random variables" corresponding to drawing a document or a term, respectively. The <a href="/facts/Mutual_information/HIUvsjvV">mutual information</a> can be expressed as
</p>

<p>\[M(\mathcal{T}; \mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D}|\mathcal{T}) = \sum_{t} p_{t} \cdot (H(\mathcal{D}) - H(\mathcal{D}|\mathcal{T} = t)) = \sum_{t} p_{t} \cdot \mathrm{idf}(t)\]</p>

<p>The last step is to expand \(p_{t}\), the unconditional probability of drawing a term, with respect to the (random) choice of a document, to obtain:
</p>

<p>\[M(\mathcal{T}; \mathcal{D}) = \sum_{t,d} p_{t|d} \cdot p_{d} \cdot \mathrm{idf}(t) = \sum_{t,d} \mathrm{tf}(t,d) \cdot \frac{1}{|D|} \cdot \mathrm{idf}(t) = \frac{1}{|D|} \sum_{t,d} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t).\]</p>

<p>This expression shows that summing the tf–idf of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution.<a class="footnote-ref" id="fnref:12" href="#fn:12"><sup>12</sup></a> Each tf–idf hence carries the "bit of information" attached to a term–document pair.
</p>
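<p>The final expression is straightforward to evaluate. The following is a minimal sketch, assuming tokenized documents, relative-frequency tf, uniform p(d) = 1/|D| and a natural logarithm; it computes the right-hand side of the expression and inherits the modelling assumptions made in this section. The function name is hypothetical.</p>

```python
import math
from collections import Counter

def mutual_information_estimate(docs):
    """(1 / |D|) * sum over (t, d) of tf(t, d) * idf(t), per the expression above."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # n_t: number of documents containing each term
    doc_freq = Counter(term for c in counts for term in c)
    total = 0.0
    for c in counts:
        length = sum(c.values())
        for term, f in c.items():
            total += (f / length) * math.log(n_docs / doc_freq[term])
    return total / n_docs

identical = [["a", "b"], ["a", "b"]]
disjoint = [["a"], ["b"]]
```

<p>Two limiting cases serve as a sanity check: for <code>identical</code> the value is 0, since no term says anything about which document was drawn, and for <code>disjoint</code> it is log 2, the full entropy of a uniform choice between two documents.</p>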
<h2 id="example-of-tfidf">Example of tf–idf</h2>
<p>Suppose that we have term count tables of a corpus consisting of only two documents, as listed below.
</p>
Document 1<table><tbody><tr><th>Term</th><th>Term Count</th></tr><tr><td>this</td><td>1</td></tr><tr><td>is</td><td>1</td></tr><tr><td>a</td><td>2</td></tr><tr><td>sample</td><td>1</td></tr></tbody></table>
Document 2<table><tbody><tr><th>Term</th><th>Term Count</th></tr><tr><td>this</td><td>1</td></tr><tr><td>is</td><td>1</td></tr><tr><td>another</td><td>2</td></tr><tr><td>example</td><td>3</td></tr></tbody></table>
<p>The calculation of tf–idf for the term "this" is performed as follows:
</p><p>In its raw frequency form, tf is just the count of the word "this" in each document. The word "this" appears once in each document, but because document 2 has more words, its relative frequency there is smaller.
</p>

<p>\[\mathrm{tf}(\text{"this"}, d_{1}) = \frac{1}{5} = 0.2\]</p>
<p>\[\mathrm{tf}(\text{"this"}, d_{2}) = \frac{1}{7} \approx 0.14\]</p>

<p>An idf is constant per corpus, and accounts for the ratio of documents that include the word "this". In this case, we have a corpus of two documents and all of them include the word "this".
</p>

<p>\[\mathrm{idf}(\text{"this"}, D) = \log\left(\frac{2}{2}\right) = 0\]</p>

<p>So tf–idf is zero for the word "this", which implies that the word is not very informative as it appears in all documents.
</p>

<p>\[\mathrm{tfidf}(\text{"this"}, d_{1}, D) = 0.2 \times 0 = 0\]</p>
<p>\[\mathrm{tfidf}(\text{"this"}, d_{2}, D) = 0.14 \times 0 = 0\]</p>
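<p>The calculation for "this" can be reproduced directly from the two term count tables. The following is a minimal sketch, assuming relative-frequency tf and a base-10 logarithm (an assumption chosen to agree with the numbers used in this example); the helper names <code>tf</code>, <code>idf</code> and <code>tfidf</code> are illustrative.</p>

```python
import math
from collections import Counter

# Term counts copied from the two tables in this example.
d1 = Counter({"this": 1, "is": 1, "a": 2, "sample": 1})
d2 = Counter({"this": 1, "is": 1, "another": 2, "example": 3})
corpus = [d1, d2]

def tf(term, doc):
    """Relative frequency of `term` in `doc` (a Counter of term counts)."""
    return doc[term] / sum(doc.values())

def idf(term, docs):
    """Base-10 log of N / n_t; assumes the term occurs in at least one document."""
    n_t = sum(1 for doc in docs if doc[term] > 0)
    return math.log10(len(docs) / n_t)

def tfidf(term, doc, docs):
    """Product of the two statistics for one term and one document."""
    return tf(term, doc) * idf(term, docs)

print(tf("this", d1), round(tf("this", d2), 2))              # 0.2 0.14
print(idf("this", corpus))                                   # 0.0
print(tfidf("this", d1, corpus), tfidf("this", d2, corpus))  # 0.0 0.0
```

<p>The same helpers apply unchanged to any other term in the two tables.</p>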

<p>The word "example" is more interesting, since it occurs three times but only in the second document:
</p>

{\displaystyle \mathrm {tf} ({\mathsf {''example''}},d_{1})={\frac {0}{5}}=0}

{\displaystyle \mathrm {tf} ({\mathsf {''example''}},d_{2})={\frac {3}{7}}\approx 0.429}

{\displaystyle \mathrm {idf} ({\mathsf {''example''}},D)=\log \left({\frac {2}{1}}\right)\approx 0.301}

<p>Finally,
</p>

{\displaystyle \mathrm {tfidf} ({\mathsf {''example''}},d_{1},D)=\mathrm {tf} ({\mathsf {''example''}},d_{1})\times \mathrm {idf} ({\mathsf {''example''}},D)=0\times 0.301=0}

{\displaystyle \mathrm {tfidf} ({\mathsf {''example''}},d_{2},D)=\mathrm {tf} ({\mathsf {''example''}},d_{2})\times \mathrm {idf} ({\mathsf {''example''}},D)=0.429\times 0.301\approx 0.129}

<p>(using the <a href="/facts/Base_10_logarithm/q6EhDnWk">base 10 logarithm</a>).
</p>
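<p>The arithmetic of this worked example can be reproduced with a minimal Python sketch. The per-document term counts below are an assumption chosen to match the totals used in the example (5 terms in the first document, 7 in the second, with "this" once in each and "example" three times in the second); the helper names <code>tf</code>, <code>idf</code> and <code>tfidf</code> are illustrative only and do not belong to any library.</p>

```python
import math

# Assumed term counts for the two example documents (5 and 7 terms in total).
d1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
d2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [d1, d2]


def tf(term, doc):
    """Relative term frequency: count of the term divided by document length."""
    return doc.get(term, 0) / sum(doc.values())


def idf(term, docs):
    """Base-10 log of (number of documents / number of documents containing the term).

    This sketch assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)


def tfidf(term, doc, docs):
    """Product of the two statistics."""
    return tf(term, doc) * idf(term, docs)


print(tfidf("this", d1, corpus))                  # 0.0
print(round(tfidf("example", d2, corpus), 3))     # 0.129
```

<p>Because "this" appears in both documents, its idf is log(2/2) = 0, so its tf–idf is zero regardless of its term frequency.</p>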
<h2 id="beyond-terms">Beyond terms</h2>
<p>The idea behind tf–idf also applies to entities other than terms. In 1998, the concept of idf was applied to citations.<a class="footnote-ref" id="fnref:13" href="#fn:13"><sup>13</sup></a> The authors argued that "if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents". In addition, tf–idf has been applied to "visual words" for object matching in videos,<a class="footnote-ref" id="fnref:14" href="#fn:14"><sup>14</sup></a> and to entire sentences.<a class="footnote-ref" id="fnref:15" href="#fn:15"><sup>15</sup></a> However, tf–idf has not proved more effective than a plain tf scheme (without idf) in every case. When tf–idf was applied to citations, researchers found no improvement over a simple citation-count weight that had no idf component.<a class="footnote-ref" id="fnref:16" href="#fn:16"><sup>16</sup></a>
</p>
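<p>The quoted intuition about shared citations can be illustrated with a small Python sketch. This is not the formula used in the cited work; it only assumes that each citation receives a base-10 idf weight over the collection and that two documents are scored by summing the weights of the citations they share. The document labels, citation labels and function names are hypothetical.</p>

```python
import math

# Hypothetical collection: each document maps to the set of works it cites.
cites = {
    "A": {"r1", "r2"},
    "B": {"r1", "r2"},
    "C": {"r2"},
    "D": {"r2", "r3"},
}


def citation_idf(citation, docs):
    """idf of a citation: log10(total documents / documents making that citation)."""
    n_citing = sum(1 for cited in docs.values() if citation in cited)
    return math.log10(len(docs) / n_citing)


def shared_citation_score(a, b, docs):
    """Sum of idf weights over the citations that documents a and b have in common."""
    shared = docs[a] & docs[b]
    return sum(citation_idf(c, docs) for c in shared)


# "r1" is cited by only two documents, so sharing it counts for more than
# sharing "r2", which every document cites and which therefore has idf 0.
print(round(shared_citation_score("A", "B", cites), 3))  # 0.301
print(shared_citation_score("A", "C", cites))            # 0.0
```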
<h2 id="derivatives">Derivatives</h2>
<p>A number of term-weighting schemes have been derived from tf–idf. One of them is TF–PDF (term frequency * proportional document frequency).<a class="footnote-ref" id="fnref:17" href="#fn:17"><sup>17</sup></a> TF–PDF was introduced in 2001 in the context of identifying emerging topics in the media. The PDF component measures the difference in how often a term occurs across different domains. Another derivative is TF–IDuF. In TF–IDuF,<a class="footnote-ref" id="fnref:18" href="#fn:18"><sup>18</sup></a> idf is not calculated on the document corpus that is to be searched or recommended. Instead, idf is calculated on users' personal document collections. The authors report that TF–IDuF was as effective as tf–idf but could also be applied in situations where, e.g., a user modeling system has no access to a global document corpus.
</p>
<h2 id="see-also">See also</h2>

<ul><li><a href="/facts/Word_embedding/7uRcBPqo">Word embedding</a></li>
<li><a href="/facts/Kullback%E2%80%93Leibler_divergence/nh7SjlPE">Kullback–Leibler divergence</a></li>
<li><a href="/facts/Latent_Dirichlet_allocation/BUWlwKWt">Latent Dirichlet allocation</a></li>
<li><a href="/facts/Latent_semantic_analysis/IugrMyMl">Latent semantic analysis</a></li>
<li><a href="/facts/Mutual_information/HIUvsjvV">Mutual information</a></li>
<li><a href="/facts/Noun_phrase/mmz3tJt9">Noun phrase</a></li>
<li><a href="/facts/Okapi_BM25/YunFGFj3">Okapi BM25</a></li>
<li><a href="/facts/PageRank/KtJRJFUX">PageRank</a></li>
<li><a href="/facts/Vector_space_model/WC1LqLA5">Vector space model</a></li>
<li><a href="/facts/Word_count/3zj0NjCg">Word count</a></li>
<li><a href="/facts/SMART_Information_Retrieval_System/fTUg9hzE">SMART Information Retrieval System</a></li></ul>

<ul><li><a href="/facts/Gerard_Salton/K9dbA2dN">Salton, G.</a>; McGill, M. J. (1986). <a href="https://archive.org/details/introductiontomo00salt"><i>Introduction to modern information retrieval</i></a>. <a href="/facts/McGraw-Hill/oG8VpKJT">McGraw-Hill</a>. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-07-054484-0.</li>
<li><a href="/facts/Gerard_Salton/K9dbA2dN">Salton, G.</a>; Fox, E. A.; Wu, H. (1983). "Extended Boolean information retrieval". <i>Communications of the ACM</i>. 26 (11): 1022–1036. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1145%2F182.358466">10.1145/182.358466</a>. <a href="/facts/Hdl_(identifier)/rdebSxmC">hdl</a>:<a href="https://hdl.handle.net/1813%2F6351">1813/6351</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:207180535">207180535</a>.</li>
<li><a href="/facts/Gerard_Salton/K9dbA2dN">Salton, G.</a>; Buckley, C. (1988). <a href="https://ecommons.cornell.edu/bitstream/1813/6721/1/87-881.pdf">"Term-weighting approaches in automatic text retrieval"</a> (PDF). <i>Information Processing & Management</i>. 24 (5): 513–523. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1016%2F0306-4573%2888%2990021-0">10.1016/0306-4573(88)90021-0</a>. <a href="/facts/Hdl_(identifier)/rdebSxmC">hdl</a>:<a href="https://hdl.handle.net/1813%2F6721">1813/6721</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:7725217">7725217</a>.</li>
<li>Wu, H. C.; Luk, R.W.P.; Wong, K.F.; Kwok, K.L. (2008). "Interpreting TF-IDF term weights as making relevance decisions". <i>ACM Transactions on Information Systems</i>. 26 (3): 1. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1145%2F1361684.1361686">10.1145/1361684.1361686</a>. <a href="/facts/Hdl_(identifier)/rdebSxmC">hdl</a>:<a href="https://hdl.handle.net/10397%2F10130">10397/10130</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:18303048">18303048</a>.</li></ul>
<h2 id="external-links-and-suggested-reading">External links and suggested reading</h2>
<ul><li><a href="/facts/Gensim/UgF45Tkh">Gensim</a> is a Python library for vector space modeling and includes tf–idf weighting.</li>
<li><a href="http://www.codeproject.com/KB/IP/AnatomyOfASearchEngine1.aspx">Anatomy of a search engine</a></li>
<li><a href="https://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/search/Similarity.html">tf–idf and related definitions</a> as used in <a href="/facts/Lucene/tPQ0D39Y">Lucene</a></li>
<li><a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a> in <a href="/facts/Scikit-learn/HqxDHRMJ">scikit-learn</a></li>
<li><a href="https://www.hpclab.ceid.upatras.gr/tmg/">Text to Matrix Generator (TMG)</a> is a MATLAB toolbox that can be used for various tasks in text mining (TM), specifically (i) indexing, (ii) retrieval, (iii) dimensionality reduction, (iv) clustering, and (v) classification. The indexing step lets the user apply local and global weighting methods, including tf–idf.</li>
<li><a href="https://www.opinosis-analytics.com/knowledge-base/term-frequency-explained/">Term-frequency explained</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Rajaraman, A.; Ullman, J.D. (2011). "Data Mining" (PDF). Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 978-1-139-05845-2. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Breitinger, Corinna; Gipp, Bela; Langer, Stefan (2015-07-26). "Research-paper recommender systems: a literature survey". International Journal on Digital Libraries. 17 (4): 305–338. doi:10.1007/s00799-015-0156-0. ISSN 1432-5012. S2CID 207035184. <a href="https://kops.uni-konstanz.de/bitstreams/8b886e4a-ea4b-4eae-bba1-19918f353170/download" target="_blank">https://kops.uni-konstanz.de/bitstreams/8b886e4a-ea4b-4eae-bba1-19918f353170/download</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Spärck Jones, K. (1972). "A Statistical Interpretation of Term Specificity and Its Application in Retrieval". Journal of Documentation. 28 (1): 11–21. CiteSeerX 10.1.1.115.8343. doi:10.1108/eb026526. S2CID 2996187. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Jurafsky, Dan; Martin, James H. Speech and Language Processing (3rd ed. draft). Chapter 14. <a href="https://web.stanford.edu/~jurafsky/slp3/14.pdf" target="_blank">https://web.stanford.edu/~jurafsky/slp3/14.pdf</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Manning, C.D.; Raghavan, P.; Schutze, H. (2008). "Scoring, term weighting, and the vector space model" (PDF). Introduction to Information Retrieval. p. 100. doi:10.1017/CBO9780511809071.007. ISBN 978-0-511-80907-1. <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
<li id="fn:6"><p>"TFIDF statistics | SAX-VSM". <a href="https://jmotif.github.io/sax-vsm_site/morea/algorithm/TFIDF.html" target="_blank">https://jmotif.github.io/sax-vsm_site/morea/algorithm/TFIDF.html</a> <a href="#fnref:6" class="footnote-back-ref">↩</a></p></li>
<li id="fn:7"><p>Robertson, S. (2004). "Understanding inverse document frequency: On theoretical arguments for IDF". Journal of Documentation. 60 (5): 503–520. doi:10.1108/00220410410560582. <a href="#fnref:7" class="footnote-back-ref">↩</a></p></li>
<li id="fn:8"><p>Robertson, S. (2004). "Understanding inverse document frequency: On theoretical arguments for IDF". Journal of Documentation. 60 (5): 503–520. doi:10.1108/00220410410560582. <a href="#fnref:8" class="footnote-back-ref">↩</a></p></li>
<li id="fn:9"><p>See also Probability estimates in practice in Introduction to Information Retrieval. <a href="http://nlp.stanford.edu/IR-book/html/htmledition/probability-estimates-in-practice-1.html#p:justificationofidf" target="_blank">http://nlp.stanford.edu/IR-book/html/htmledition/probability-estimates-in-practice-1.html#p:justificationofidf</a> <a href="#fnref:9" class="footnote-back-ref">↩</a></p></li>
<li id="fn:10"><p>Robertson, S. (2004). "Understanding inverse document frequency: On theoretical arguments for IDF". Journal of Documentation. 60 (5): 503–520. doi:10.1108/00220410410560582. <a href="#fnref:10" class="footnote-back-ref">↩</a></p></li>
<li id="fn:11"><p>Aizawa, Akiko (2003). "An information-theoretic perspective of tf–idf measures". Information Processing and Management. 39 (1): 45–65. doi:10.1016/S0306-4573(02)00021-3. S2CID 45793141. <a href="#fnref:11" class="footnote-back-ref">↩</a></p></li>
<li id="fn:12"><p>Aizawa, Akiko (2003). "An information-theoretic perspective of tf–idf measures". Information Processing and Management. 39 (1): 45–65. doi:10.1016/S0306-4573(02)00021-3. S2CID 45793141. <a href="#fnref:12" class="footnote-back-ref">↩</a></p></li>
<li id="fn:13"><p>Bollacker, Kurt D.; Lawrence, Steve; Giles, C. Lee (1998-01-01). "CiteSeer". Proceedings of the second international conference on Autonomous agents - AGENTS '98. pp. 116–123. doi:10.1145/280765.280786. ISBN 978-0-89791-983-8. S2CID 3526393. <a href="#fnref:13" class="footnote-back-ref">↩</a></p></li>
<li id="fn:14"><p>Sivic, Josef; Zisserman, Andrew (2003-01-01). "Video Google: A text retrieval approach to object matching in videos". Proceedings Ninth IEEE International Conference on Computer Vision. ICCV '03. pp. 1470–. doi:10.1109/ICCV.2003.1238663. ISBN 978-0-7695-1950-0. S2CID 14457153. <a href="#fnref:14" class="footnote-back-ref">↩</a></p></li>
<li id="fn:15"><p>Seki, Yohei. "Sentence Extraction by tf/idf and Position Weighting from Newspaper Articles" (PDF). National Institute of Informatics. <a href="http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3-TSC-SekiY.pdf" target="_blank">http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3-TSC-SekiY.pdf</a> <a href="#fnref:15" class="footnote-back-ref">↩</a></p></li>
<li id="fn:16"><p>Beel, Joeran; Breitinger, Corinna (2017). "Evaluating the CC-IDF citation-weighting scheme – How effectively can 'Inverse Document Frequency' (IDF) be applied to references?" (PDF). Proceedings of the 12th IConference. Archived from the original (PDF) on 2020-09-22. Retrieved 2017-01-29. <a href="https://web.archive.org/web/20200922150304/http://beel.org/publications/2017%20iConference%20--%20Evaluating%20the%20CC-IDF%20citation-weighting%20scheme%20--%20preprint.pdf" target="_blank">https://web.archive.org/web/20200922150304/http://beel.org/publications/2017%20iConference%20--%20Evaluating%20the%20CC-IDF%20citation-weighting%20scheme%20--%20preprint.pdf</a> <a href="#fnref:16" class="footnote-back-ref">↩</a></p></li>
<li id="fn:17"><p>Bun, Khoo Khyou; Ishizuka, M. (2001). "Emerging Topic Tracking System". Proceedings Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems. WECWIS 2001. pp. 2–11. CiteSeerX 10.1.1.16.7986. doi:10.1109/wecwis.2001.933900. ISBN 978-0-7695-1224-2. S2CID 1049263. <a href="#fnref:17" class="footnote-back-ref">↩</a></p></li>
<li id="fn:18"><p>Langer, Stefan; Gipp, Bela (2017). "TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users' Personal Document Collections" (PDF). IConference. <a href="https://www.gipp.com/wp-content/papercite-data/pdf/beel17.pdf" target="_blank">https://www.gipp.com/wp-content/papercite-data/pdf/beel17.pdf</a> <a href="#fnref:18" class="footnote-back-ref">↩</a></p></li>
</ol>
