BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.
Given a query $Q$ containing keywords $q_1, \ldots, q_n$, the BM25 score of a document $D$ is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
where $f(q_i, D)$ is the number of times the keyword $q_i$ occurs in the document $D$, $|D|$ is the length of the document $D$ in words, and avgdl is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen, in the absence of advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$.[3] $\text{IDF}(q_i)$ is the IDF (inverse document frequency) weight of the query term $q_i$. It is usually computed as:

$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$
where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_i$.
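The scoring function above can be sketched directly in Python. This is a minimal illustrative implementation, not code from any particular search library; the function and variable names (`bm25_score`, `corpus`, and so on) are chosen here for clarity.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.

    corpus is a list of tokenized documents (lists of words).
    k1 and b are the free parameters; the defaults fall in the
    commonly recommended ranges k1 in [1.2, 2.0] and b = 0.75.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for q in query_terms:
        # n(q): number of documents containing the term
        n_q = sum(1 for d in corpus if q in d)
        # IDF with the "+1" inside the log, so it stays non-negative
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = doc_terms.count(q)  # term frequency f(q, D)
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
print(bm25_score(["cat", "dog"], corpus[1], corpus))
```

Note that because BM25 is a bag-of-words function, the score depends only on term counts: a document containing both query terms outranks one containing a single term, regardless of where the terms appear.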
There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
Here is an interpretation from information theory. Suppose a query term $q$ appears in $n(q)$ documents. Then a randomly picked document $D$ will contain the term with probability $\frac{n(q)}{N}$ (where $N$ is again the cardinality of the set of documents in the collection). Therefore, the information content of the message "$D$ contains $q$" is:

$$-\log \frac{n(q)}{N} = \log \frac{N}{n(q)}$$
Now suppose we have two query terms $q_1$ and $q_2$. If the two terms occur in documents entirely independently of each other, then the probability of seeing both $q_1$ and $q_2$ in a randomly picked document $D$ is:

$$\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N}$$
and the information content of such an event is:

$$\log \frac{N}{n(q_1)} + \log \frac{N}{n(q_2)} = \sum_{i=1}^{2} \log \frac{N}{n(q_i)}$$
With a small variation, this is exactly what is expressed by the IDF component of BM25.
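The additivity argument above is easy to verify numerically. The snippet below uses made-up toy counts ($N$, $n(q_1)$, $n(q_2)$ are illustrative values, not data from any collection) to check that, under independence, the information content of seeing both terms equals the sum of the per-term information contents.

```python
import math

# Toy counts: N documents in the collection; n1 and n2 documents
# contain the query terms q1 and q2 respectively (illustrative values).
N, n1, n2 = 1000, 50, 20

# Information content of "D contains q" for each term: log(N / n(q)).
i1 = math.log(N / n1)
i2 = math.log(N / n2)

# Under independence, P(both) = (n1/N) * (n2/N), so the information
# content of seeing both terms is -log of that product.
joint = -math.log((n1 / N) * (n2 / N))

# The joint information content decomposes into the per-term sum.
print(abs(joint - (i1 + i2)) < 1e-12)
```

This additivity over query terms is what the outer sum in the BM25 formula exploits: each term contributes its own IDF-weighted amount to the total score.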
"OKAPI". smcse.city.ac.uk. Retrieved 2023-10-16.{{cite web}}: CS1 maint: url-status (link) https://smcse.city.ac.uk/doc/cisr/web/okapi/okapi.html ↩
Stephen Robertson & Hugo Zaragoza (2009). "The Probabilistic Relevance Framework: BM25 and Beyond". Foundations and Trends in Information Retrieval. 3 (4): 333–389. CiteSeerX 10.1.1.156.5282. doi:10.1561/1500000019. S2CID 207178704. http://dl.acm.org/citation.cfm?id=1704810 ↩
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval, Cambridge University Press, 2009, p. 233. ↩
"The BM25 Weighting Scheme". http://xapian.org/docs/bm25.html ↩
Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC-2004. http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf ↩
Robertson, Stephen; Zaragoza, Hugo; Taylor, Michael (2004-11-13). "Simple BM25 extension to multiple weighted fields". Proceedings of the thirteenth ACM international conference on Information and knowledge management. CIKM '04. New York, NY, USA: Association for Computing Machinery. pp. 42–49. doi:10.1145/1031171.1031181. ISBN 978-1-58113-874-0. S2CID 16628332. 978-1-58113-874-0 ↩
Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In Proceedings of CIKM'2011, pages 7-16. http://sifaka.cs.uiuc.edu/~ylv2/pub/cikm11-lowerbound.pdf ↩