SimHash

<h2 id="evaluation-and-benchmarks">Evaluation and benchmarks</h2>
<p>A large-scale evaluation was conducted by <a href="/facts/Google/GT9Sugza">Google</a> in 2006<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a> to compare the performance of the <a href="/facts/Minhash/NL6bV6SV">Minhash</a> and Simhash<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a> algorithms. In 2007, Google reported using Simhash for duplicate detection in web crawling<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a> and using Minhash and <a href="/facts/Locality-sensitive_hashing/hj8UllQG">LSH</a> for <a href="/facts/Google_News/YHSEYXK6">Google News</a> personalization.<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a>
</p>
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/MinHash/NL6bV6SV">MinHash</a></li>
<li><a href="/facts/W-shingling/6NIj7xtQ">w-shingling</a></li>
<li><a href="/facts/Count%25E2%2580%2593min_sketch/FQdt8laA">Count–min sketch</a></li>
<li><a href="/facts/Locality-sensitive_hashing/hj8UllQG">Locality-sensitive hashing</a></li></ul>

<h2 id="external-links">External links</h2>
<ul><li><a href="http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf">Simhash Princeton Paper</a></li>
<li><a href="http://matpalm.com/resemblance/simhash/">Simhash explained</a></li>
<li><a href="http://proceedings.mlr.press/v33/shrivastava14.pdf">Comparison of MinHash vs. Simhash</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Cyphers, Bennett (2021-03-03). "Google's FLoC Is a Terrible Idea". Electronic Frontier Foundation. Retrieved 2021-04-13. <a href="https://www.eff.org/deeplinks/2021/03/googles-floc-terrible-idea" target="_blank">https://www.eff.org/deeplinks/2021/03/googles-floc-terrible-idea</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 284, doi:10.1145/1148170.1148222, ISBN 978-1595933690, S2CID 207160068. <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Charikar, Moses S. (2002), "Similarity estimation techniques from rounding algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 380–388, doi:10.1145/509907.509965, ISBN 978-1581134957, S2CID 4229473. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Manku, Gurmeet Singh; Jain, Arvind; Das Sarma, Anish (2007), "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web, p. 141, doi:10.1145/1242572.1242592, ISBN 9781595936547. <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th International Conference on World Wide Web, p. 271, doi:10.1145/1242572.1242610, ISBN 9781595936547, S2CID 207163129. <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
</ol>
