Locality Sensitive Hasing [LSH] Example
This article is for showing an example of How LSH works on simple documents.
Overview and given problem of LSH
data:image/s3,"s3://crabby-images/db91e/db91e8d375572606cf9701a8724c8ec22a91293f" alt=""
As you can see from the above image, The LSH process contains 3-big steps.
- Shingling
- documents to set of fixed length words
- Min Hashing
- Making signature from document, reduce the size of data
- Locality Sensitive Hashing
- Get candidate pairs by applying hash functions.
Given example documents
We will use below 3 documents.
- doc1 -
abcabcabc
- doc2 -
cbacbacba
- doc3 -
bacbaacba
1. Shingling
We will use 2-shingles to represent each document.
- doc1 :
abcabcabc
-> <ab, bc, ca>
- doc2 :
cbacbacba
-> <ac, ba, cb>
- doc3 :
bacbaacba
-> <aa, ac, ba, cb>
2. MinHashing
Below image shows algorithm of MinHashing. The key point of MinHashing is hashing can be interpreted as a permutation, when it maps index to index.
data:image/s3,"s3://crabby-images/4ec11/4ec11a2cfb77de3b8302e97b3b773335635cfe72" alt=""
Steps:
- Represent documents to Bitmap
- Using MinHash functions, make signatures
Set of Shingles to Bitmap
data:image/s3,"s3://crabby-images/1f952/1f952a7ed115c259a9a51222117d855840514875" alt=""
Min Hashing
We will use 6 different hash functions.
- $h_1(x)=x\,mod\,5$
- $h_2(x)=2x+1\,mod\,5$
- $h_3(x)=x+3\,mod\,5$
- $h_4(x)=2x+3\,mod\,5$
- $h_5(x)=x+4\,mod\,5$
- $h_6(x)=2x+4\,mod\,5$
Applying this, we will get 6-dimensional($h_1$ ~ $h_6$) integer vector for each document.
Below shows example of finding MinHash values.
data:image/s3,"s3://crabby-images/b09f4/b09f4d3ea8d44c406b9497bbfbf9407b1016c3ec" alt=""
You can see the result signature vectors below.
data:image/s3,"s3://crabby-images/cdb6c/cdb6ce7d1ea16a9ec40245880d75cce69890bc7f" alt=""
3. Locality Sensitive Hashing
Steps:
- Signature matrix and Hashing it
- Hash the signatures of documents
- AND-OR or OR-AND (BAND) technique.
- Compare Hash values
- AND-OR
- all hash values in each band have to be same, and at least one band has to satisfy the condition
- OR-AND
- at least one row in a band have to have same hash values, and all band have to satisfy the condition
data:image/s3,"s3://crabby-images/2862d/2862d44a026234d73676cf993f57c5d809ba93ae" alt=""
data:image/s3,"s3://crabby-images/97141/9714128730c6bf531cd3fe6d3037ffe6c8594cef" alt=""
For AND-OR technique, (doc1, doc2)
and (doc1, doc3)
are candidate pairs.
For OR-OR technique, (doc1, doc3)
and (doc2, doc3)
are candidate pairs.
Written on
June
24th,
2024
by
Wonbin Kim
Feel free to share!