Rst

Locality Sensitive Hasing [LSH] Example

/Algorithm

This article is for showing an example of How LSH works on simple documents.

Overview and given problem of LSH

As you can see from the above image, The LSH process contains 3-big steps.

  1. Shingling
    • documents to set of fixed length words
  2. Min Hashing
    • Making signature from document, reduce the size of data
  3. Locality Sensitive Hashing
    • Get candidate pairs by applying hash functions.

Given example documents
We will use below 3 documents.

1. Shingling

We will use 2-shingles to represent each document.

2. MinHashing

Below image shows algorithm of MinHashing. The key point of MinHashing is hashing can be interpreted as a permutation, when it maps index to index.

Steps:

  1. Represent documents to Bitmap
  2. Using MinHash functions, make signatures

Set of Shingles to Bitmap

Min Hashing

We will use 6 different hash functions.

Applying this, we will get 6-dimensional($h_1$ ~ $h_6$) integer vector for each document.
Below shows example of finding MinHash values.

You can see the result signature vectors below.

3. Locality Sensitive Hashing

Steps:

  1. Signature matrix and Hashing it
    • Hash the signatures of documents
  2. AND-OR or OR-AND (BAND) technique.
    • Compare Hash values
    • AND-OR
      • all hash values in each band have to be same, and at least one band has to satisfy the condition
    • OR-AND
      • at least one row in a band have to have same hash values, and all band have to satisfy the condition

For AND-OR technique, (doc1, doc2) and (doc1, doc3) are candidate pairs.
For OR-OR technique, (doc1, doc3) and (doc2, doc3) are candidate pairs.

comments powered by Disqus