Fuzzy hashing exists to solve the problem of detecting data that is similar to, but not exactly the same as, other data. Fuzzy hashing algorithms are designed so that two similar inputs generate two similar hash values. This property is the exact opposite of the avalanche effect desired in cryptographic hash functions, where a small change to the input produces a drastically different hash value.
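The contrast with the avalanche effect can be illustrated with a minimal sketch. The functions `piecewise_digest` and `digest_similarity` below are hypothetical names for a toy fixed-block scheme written for this illustration; they are not the algorithm used by ssdeep, TLSH, or any other real fuzzy hashing tool. The toy scheme only tolerates in-place changes, since an insertion would shift every later block.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def piecewise_digest(data: bytes, block_size: int = 8) -> list:
    """Toy digest (assumption, not a real scheme): hash each fixed-size
    block separately and keep two hex characters per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()[:2]
        for i in range(0, len(data), block_size)
    ]

def digest_similarity(d1: list, d2: list) -> float:
    """Fraction of block positions whose digest pieces match (0.0 to 1.0)."""
    if not d1 and not d2:
        return 1.0
    matches = sum(p == q for p, q in zip(d1, d2))
    return matches / max(len(d1), len(d2))

original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"  # one byte changed

# Cryptographic hash: a one-byte change flips many of the 256 output bits.
sha_bits = bit_difference(
    hashlib.sha256(original).digest(),
    hashlib.sha256(modified).digest(),
)
print(f"SHA-256 bits changed: {sha_bits} of 256")

# Toy piecewise digest: only the block containing the change is affected.
d_original = piecewise_digest(original)
d_modified = piecewise_digest(modified)
print("original digest:", "".join(d_original))
print("modified digest:", "".join(d_modified))
print(f"digest similarity: {digest_similarity(d_original, d_modified):.2f}")
```

In this sketch the two SHA-256 values share no useful relationship, while the two toy digests agree on every block except the one that contains the changed byte.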
Fuzzy hashing can also be used to detect when one object is contained within another.
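Containment detection can be sketched in a similar spirit. The example below is a simplified illustration under stated assumptions: `feature_set` and `containment` are hypothetical helpers that hash every overlapping byte window and measure what fraction of the smaller object's features also occur in the larger one. Real similarity digest tools select and store features far more compactly, so this is not how any particular tool works.

```python
import hashlib

def feature_set(data: bytes, n: int = 8) -> set:
    """Hash every overlapping n-byte window to a short 4-byte feature.
    (Illustrative assumption: real tools pick features more selectively.)"""
    return {
        hashlib.sha256(data[i:i + n]).digest()[:4]
        for i in range(len(data) - n + 1)
    }

def containment(small: bytes, large: bytes, n: int = 8) -> float:
    """Fraction of the smaller object's features that also occur in the
    larger object; a value near 1.0 suggests the small object is embedded."""
    small_features = feature_set(small, n)
    if not small_features:
        return 0.0
    large_features = feature_set(large, n)
    return len(small_features & large_features) / len(small_features)

fragment = b"context triggered piecewise hashing"
document = b"Intro text. " + fragment + b" More trailing text."
unrelated = b"An entirely different sequence of bytes with nothing shared."

print(f"fragment in document:  {containment(fragment, document):.2f}")
print(f"fragment in unrelated: {containment(fragment, unrelated):.2f}")
```

The score is deliberately asymmetric: it is normalized by the smaller object's feature count, so a short fragment embedded in a much larger file still scores highly.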