
How to Calculate Jaccard Similarity in R

by Erma Khan

The Jaccard similarity index measures the similarity between two sets of data. It ranges from 0 to 1, where 0 means the sets share no values and 1 means they contain exactly the same values. The higher the number, the more similar the two sets.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
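To make the formula concrete, here is a small sketch that uses base R's `intersect()` and `union()` on two made-up sets. The values of `A` and `B` are illustrative assumptions, not data from this tutorial.

```r
# Two small illustrative sets (hypothetical values)
A <- c(1, 2, 3)
B <- c(2, 3, 4)

# Size of the intersection: values found in both sets (2 and 3)
n_both <- length(intersect(A, B))

# Size of the union: distinct values found in either set (1, 2, 3, 4)
n_either <- length(union(A, B))

# Jaccard similarity = intersection size / union size
n_both / n_either
# [1] 0.5
```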

This tutorial explains how to calculate Jaccard Similarity for two sets of data in R.

Example: Jaccard Similarity in R

Suppose we have the following two sets of data:

a 
b 
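For a runnable starting point, the sketch below defines two illustrative vectors. Their values are an assumption made here, chosen so that the sets share 4 of their 10 distinct values, which is consistent with the 0.4 result reported in this example.

```r
# Illustrative sets (assumed values, chosen to share 4 of 10 distinct elements)
a <- c(0, 1, 2, 5, 6, 8, 9)
b <- c(0, 2, 3, 4, 5, 7, 9)

# Values present in both sets
intersect(a, b)
# [1] 0 2 5 9

# Number of distinct values across both sets
length(union(a, b))
# [1] 10
```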

We can define the following function to calculate the Jaccard Similarity between the two sets:

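#define Jaccard Similarity function
jaccard <- function(a, b) {
    #count the values that appear in both sets
    intersection <- length(intersect(a, b))
    #size of the union, assuming neither vector contains duplicate values
    union <- length(a) + length(b) - intersection
    return(intersection/union)
}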

#find Jaccard Similarity between the two sets 
jaccard(a, b)

[1] 0.4

The Jaccard Similarity between the two sets is 0.4.
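One caveat: because the function computes the size of the union as `length(a) + length(b) - intersection`, it implicitly assumes that neither vector contains repeated values. A minimal sketch of a variant that counts distinct values only, using base R's `union()`, might look like the following. The name `jaccard_unique` and the sample vectors are hypothetical.

```r
# Variant that treats the inputs as true sets (hypothetical helper name)
jaccard_unique <- function(a, b) {
  n_both <- length(intersect(a, b))   # intersect() drops duplicates
  n_either <- length(union(a, b))     # union() drops duplicates
  n_both / n_either
}

# Vectors with a repeated value (illustrative data)
x <- c(1, 1, 2)
y <- c(2, 3)

jaccard_unique(x, y)
# [1] 0.3333333
```

With these inputs the length-based formula would instead give 1 / (3 + 2 - 1) = 0.25, because the repeated 1 is counted twice.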

Note that the function will return 0 if the two sets don’t share any values:

c 
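As a quick check of the disjoint case, the sketch below redefines the function so that it runs on its own and applies it to two vectors with no values in common. The vectors and their names `c_set` and `d_set` are illustrative assumptions.

```r
# Jaccard similarity as defined in this tutorial
jaccard <- function(a, b) {
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  intersection / union
}

# Two sets with no values in common (illustrative values)
c_set <- c(0, 1, 2, 3)
d_set <- c(4, 5, 6, 7)

jaccard(c_set, d_set)
# [1] 0
```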

And the function will return 1 if the two sets are identical:

e 
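The identical case can be checked the same way. This is a self-contained sketch, and the vectors `e_set` and `f_set` are hypothetical values picked for illustration.

```r
# Jaccard similarity as defined in this tutorial
jaccard <- function(a, b) {
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  intersection / union
}

# Two sets containing exactly the same values (illustrative values)
e_set <- c(1, 2, 3, 4)
f_set <- c(1, 2, 3, 4)

jaccard(e_set, f_set)
# [1] 1
```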

The function also works for sets that contain strings:

g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')

jaccard(g, h)

[1] 0.1428571

You can also use this function to find the Jaccard distance between two sets, which is the dissimilarity between two sets and is calculated as 1 – Jaccard Similarity.

#find Jaccard distance between sets a and b
1 - jaccard(a, b)

[1] 0.6
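If the distance is needed often, it may be convenient to wrap the calculation in its own function. The sketch below is one way to do that. The name `jaccard_distance` is a hypothetical helper, not part of base R, and the block redefines `jaccard` and two sample string vectors so that it can be run on its own.

```r
# Jaccard similarity as defined in this tutorial
jaccard <- function(a, b) {
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  intersection / union
}

# Jaccard distance = 1 - Jaccard similarity (hypothetical helper name)
jaccard_distance <- function(a, b) {
  1 - jaccard(a, b)
}

# Sample string sets sharing one value out of seven distinct values
g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')

jaccard_distance(g, h)
# [1] 0.8571429
```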

Refer to the Wikipedia page on the Jaccard index to learn more details about the Jaccard Similarity Index.
