Locality-Sensitive Hashing (LSH)
Similarity Measure
A similarity measure is a real-valued function that quantifies how alike two objects are. Although no single definition exists, such measures are usually in some sense the inverse of distance metrics: similar objects score high and dissimilar objects score low.
- Cosine similarity
- Euclidean similarity
- Nucleotide similarity
- Amino acid similarity
- Hamming similarity
- Jaccard similarity
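As a rough illustration of three of the measures listed above, here is a minimal Python sketch. The function names are chosen for this example only and do not come from any of the libraries referenced in this document; it assumes non-zero numeric vectors for cosine similarity and equal-length sequences for Hamming similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def hamming_similarity(x, y):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if len(x) == 0:
        return 1.0  # convention assumed here: two empty sequences agree
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return matches / len(x)
```

Each function returns a value that grows as the inputs become more alike, which is the sense in which these measures are the inverse of the corresponding distances.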
Types of LSH
- HyperplaneLSH for Cosine Distance
- Super-Bit Locality-Sensitive Hashing for Hamming distance
- MinHash for Jaccard similarity
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
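Two of the schemes listed above, hyperplane LSH and MinHash, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code of any library referenced here: all function names are hypothetical, set elements are assumed to be non-negative integers, sets are assumed to be non-empty, and the MinHash hash family `(a * x + b) mod PRIME` is one common choice among several.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, assumed modulus for the MinHash hash family


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hyperplane_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(h_i * x_i for h_i, x_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )


def random_hash_params(num_hashes, prime=PRIME, seed=0):
    """Draw (a, b) pairs defining hash functions x -> (a * x + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime))
            for _ in range(num_hashes)]


def minhash_signature(items, hash_params, prime=PRIME):
    """For each hash function, keep the minimum hash value over the set."""
    return tuple(min((a * x + b) % prime for x in items) for a, b in hash_params)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(1 for p, q in zip(sig_a, sig_b) if p == q) / len(sig_a)
```

Vectors separated by a small angle tend to agree on most hyperplane bits, and sets with high Jaccard similarity tend to agree on most MinHash positions, so matching signatures (or matching bands of a signature) can be used as bucket keys.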
Implementations
- tdebatty/java-LSH: A Java implementation of Locality Sensitive Hashing (LSH) MinHash & Super-Bit
- apache/incubator-datafu: A collection of libraries for working with large-scale data in Hadoop.
- marufaytekin/lsh-spark: HyperplaneLSH for Spark
- soundcloud/cosine-lsh-join-spark: Approximate Nearest Neighbors in Spark
- karlhigley/spark-neighbors: Spark-based approximate nearest neighbor search using locality-sensitive hashing; supports Hamming, Jaccard, Euclidean, and cosine distance.
- rholder/nilsimsa: Nilsimsa locality-sensitive hashing algorithm in Java.
- chrisjmccormick/MinHash: MinHash Tutorial with Python Code, with an example of mining document similarity.
- barneygovan/lsh-scala: A Locality-Sensitive Hashing Library for Scala with optional Redis storage.
- treadstone90/Locality-Sensitive-Hashing: Works only for text and supports only Jaccard similarity.
- richwhitjr/DistNN: Distributed LSH Implementation in Scala.
- beckgael/Mean-Shift-LSH: Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing (DNNMS-LSH). Scala/Spark implementation.
- ohtaman/LSH: MinHash and SimHash implemented in C++.
- JorenSix/TarsosLSH: A Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time.
Papers
- Practical and Optimal LSH for Angular Distance
- Optimal Data-Dependent Hashing for Approximate Near Neighbors
- Beyond Locality Sensitive Hashing
- Original LSH algorithm (1999)
- Efficient Distributed Locality Sensitive Hashing
- Jaccard distance: Mining Massive Data Sets, chapter 3
- Hamming norm: A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999).
- Lp norms: M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://people.csail.mit.edu/indyk/nips-nn.ps
- Cosine distance and Earth mover's distance (EMD): M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
- Very Sparse Random Projections: Ping Li, T. Hastie and K. W. Church, 2006
- Similarity Estimation Techniques from Rounding Algorithms
- Random projection: Random projection in dimensionality reduction: Applications to image and text data
- An Introduction to Sequence Similarity (“Homology”) Searching
- Efficient large-scale sequence comparison by locality-sensitive hashing
Finding Nearest Neighbors
Additional Reading
Issues for LSH
- SPARK-5992: Locality Sensitive Hashing (LSH) for Spark
- spark/pull/15148
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
- Implement abstract LSH, LSHModel classes as Estimator-Model
- Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
- Implement Random Projection as LSH subclass for Euclidean distance, MinHash for Jaccard Distance
- Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
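To make the idea behind random projection for Euclidean distance and an approximate nearest-neighbor query more concrete, here is a small pure-Python sketch. It is not the Spark implementation described in the pull request: the class `RandomProjectionIndex`, its parameters, and the single-table design are all assumptions made for illustration. Each hash value is `floor(<x, r> / bucket_length)` for a random direction `r`, and a query only examines points that share its full bucket key.

```python
import math
import random
from collections import defaultdict


class RandomProjectionIndex:
    """Hypothetical single-table LSH index for Euclidean distance."""

    def __init__(self, dim, bucket_length, num_projections=4, seed=0):
        rng = random.Random(seed)
        self.bucket_length = bucket_length
        # One random Gaussian direction per hash function.
        self.directions = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_projections)
        ]
        self.buckets = defaultdict(list)

    def _key(self, point):
        # Project onto each direction and cut the line into equal-width buckets.
        return tuple(
            math.floor(
                sum(r_i * x_i for r_i, x_i in zip(r, point)) / self.bucket_length
            )
            for r in self.directions
        )

    def add(self, point):
        self.buckets[self._key(point)].append(tuple(point))

    def approx_nearest_neighbors(self, query, k):
        """Return up to k points from the query's bucket, closest first."""
        query = tuple(query)
        candidates = self.buckets.get(self._key(query), [])
        return sorted(candidates, key=lambda p: math.dist(p, query))[:k]
```

Because only one bucket is inspected, true neighbors that fall just across a bucket boundary are missed; practical systems usually reduce this by querying several independent hash tables, at the cost of more candidates to rank.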
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark integration for the Scala classes and methods.
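Bit sampling, named in the follow-up list above, is simple enough to sketch directly. The following Python is a hypothetical illustration, not code from the pull request: the hash of a fixed-length bit string is the tuple of its bits at a few randomly chosen positions, so two strings collide whenever they agree at every sampled position.

```python
import random


def sample_positions(length, num_samples, seed=0):
    """Choose which bit positions the hash function will read."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), num_samples))


def bit_sampling_hash(bits, positions):
    """Hash a bit string (or bit list) by reading only the sampled positions."""
    return tuple(bits[i] for i in positions)
```

Strings at small Hamming distance differ in few positions, so a random sample of positions is unlikely to hit a differing bit and the two strings are likely to share a hash value.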