FuzzyWuzzy Python Library

Last Updated : 29 Sep, 2024

There are many ways to compare strings in Python. Some of the main methods are:

  1. Using regex
  2. Simple compare
  3. Using difflib

One of the easiest methods, however, is the fuzzywuzzy library, which returns a score out of 100 indicating how similar two strings are. This article explains how to get started with fuzzywuzzy. FuzzyWuzzy is a Python library for string matching. Fuzzy string matching is the process of finding strings that approximately match a given pattern. It uses Levenshtein Distance to calculate the differences between sequences.
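To make the idea of Levenshtein Distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. The function name `levenshtein` is chosen for illustration only; it is not part of FuzzyWuzzy, and the library's own scoring may be computed differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))            # 3
print(levenshtein("geeksforgeeks", "geeksgeeks"))  # 3 (delete "f", "o", "r")
```

A smaller distance means the strings are closer; a similarity score out of 100 can be thought of as a normalized view of the same idea.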

FuzzyWuzzy was developed and open-sourced by SeatGeek, a service for finding sports and concert tickets. Their original use case is discussed in their blog.

Requirements of fuzzywuzzy

  • Python 2.7 or higher
  • python-Levenshtein (optional, used for speed)
  • difflib (part of the Python standard library)

Install via pip :

pip install fuzzywuzzy
pip install python-Levenshtein

How to use the FuzzyWuzzy Python Library?

First, import these modules:

Python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Simple ratio usage :

Python
fuzz.ratio('geeksforgeeks', 'geeksgeeks')
87

# Exact match
fuzz.ratio('GeeksforGeeks', 'GeeksforGeeks')
100

# Simple ratio is case-sensitive, so case differences lower the score
fuzz.ratio('geeks for geeks', 'Geeks For Geeks')
80
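
Since FuzzyWuzzy is built on top of difflib, the simple ratio can be approximated with the standard library alone. The sketch below is an illustration under that assumption; `simple_ratio` is a hypothetical helper, not the library's function, and results may differ slightly when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    # SequenceMatcher.ratio() returns 2 * M / T, where M is the number of
    # matched characters and T is the combined length of both strings.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(simple_ratio("geeksforgeeks", "geeksgeeks"))         # 87
print(simple_ratio("GeeksforGeeks", "GeeksforGeeks"))      # 100
print(simple_ratio("geeks for geeks", "Geeks For Geeks"))  # 80
```

The comparison is character by character and case-sensitive, which is why the three capital letters in the last call cost 20 points.
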

Partial ratio usage :

Python
fuzz.partial_ratio("geeks for geeks", "geeks for geeks!")
100
# The second string has an extra exclamation mark, but the first
# string matches a substring of it exactly, so the score is 100.

fuzz.partial_ratio("geeks for geeks", "geeks geeks")
64
# The score is lower because the first string has an extra
# token in the middle.
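
One way to picture partial matching is to slide the shorter string across the longer one and keep the best score. The following is a simplified sketch of that idea using difflib; `partial_ratio_sketch` is a hypothetical name, and the real `fuzz.partial_ratio` chooses its candidate windows differently, so scores are not guaranteed to match in every case.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0  # arbitrary choice for this sketch: nothing to match
    width = len(shorter)
    best = 0.0
    # Try every window of the longer string that has the shorter string's length.
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The first string appears verbatim inside the second one.
print(partial_ratio_sketch("geeks for geeks", "geeks for geeks!"))  # 100
# No window of the longer string contains "geeks geeks" exactly.
print(partial_ratio_sketch("geeks for geeks", "geeks geeks"))
```
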

Now, token set ratio and token sort ratio:

Python
# Token Sort Ratio
fuzz.token_sort_ratio("geeks for geeks", "for geeks geeks")
100

# This gives 100 because every word is the same, irrespective of position

# Token Set Ratio
fuzz.token_sort_ratio("geeks for geeks", "geeks for for geeks")
88
fuzz.token_set_ratio("geeks for geeks", "geeks for for geeks")
100
# The score is 100 in the second case because token_set_ratio
# treats duplicate words as a single word.
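
The two token-based ratios can be sketched with difflib as follows. This is a simplified illustration, assuming plain lowercasing and whitespace splitting; `token_sort_sketch` and `token_set_sketch` are hypothetical names, and the library's own preprocessing is more thorough.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_sketch(s1: str, s2: str) -> int:
    """Compare the strings after sorting their lowercased words."""
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return _ratio(sorted1, sorted2)


def token_set_sketch(s1: str, s2: str) -> int:
    """Compare the shared words against each string's leftover words."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    shared = " ".join(sorted(tokens1 & tokens2))
    # Each combined string is the shared words followed by one side's extras.
    combined1 = (shared + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (shared + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(_ratio(shared, combined1),
               _ratio(shared, combined2),
               _ratio(combined1, combined2))


print(token_sort_sketch("geeks for geeks", "for geeks geeks"))      # 100
print(token_sort_sketch("geeks for geeks", "geeks for for geeks"))  # 88
print(token_set_sketch("geeks for geeks", "geeks for for geeks"))   # 100
```

Converting the words to a set is what collapses the repeated "for", which is why the set-based score reaches 100 while the sort-based score does not.
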

Now suppose we have a list of options and want to find the closest match(es). We can use the process module:

Python
query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks'] 
 
# Get a list of matches ordered by score (the default limit is 5)
process.extract(query, choices)
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)]

# If we want only the top one
process.extractOne(query, choices)
('g. for geeks', 95)
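
Conceptually, extracting matches means scoring every choice and sorting by score. The sketch below shows that idea with a stand-in scorer built on difflib; `score`, `extract_sketch`, and `extract_one_sketch` are hypothetical names, and the scores will not match those of `process.extract`, which uses a different scorer by default.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    # Stand-in scorer for this sketch; any function returning 0-100 would work.
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    best = extract_sketch(query, choices, limit=1)
    return best[0] if best else None


options = ["geek for geek", "geek geek", "geeks for geeks"]
print(extract_sketch("geeks for geeks", options, limit=2))
print(extract_one_sketch("Geeks For Geeks", options))  # ('geeks for geeks', 100)
```
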

There is one more ratio that is used often, called WRatio. Sometimes it is better to use WRatio instead of the simple ratio, because WRatio handles lower and upper case and some other variations too.

Python
fuzz.WRatio('geeks for geeks', 'Geeks For Geeks')
100
fuzz.WRatio('geeks for geeks!!!','geeks for geeks')
100
# whereas the simple ratio gives a lower score for the same strings
fuzz.ratio('geeks for geeks!!!','geeks for geeks')
91
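
The case and punctuation handling can be pictured as a normalization step applied before comparing. The sketch below illustrates only that preprocessing idea; `normalize` and `ratio_after_normalizing` are hypothetical helpers, and WRatio itself does more than this, so treat it as an assumption-laden approximation rather than the library's method.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and replace punctuation with spaces."""
    return re.sub(r"\W", " ", text).lower().strip()


def ratio_after_normalizing(s1: str, s2: str) -> int:
    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(normalize("Geeks For Geeks!!!"))  # geeks for geeks
print(ratio_after_normalizing("geeks for geeks!!!", "Geeks For Geeks"))  # 100
```

Once both strings are reduced to the same lowercase, punctuation-free form, the comparison scores 100, whereas the raw strings differ in case and trailing characters.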

Full Code

Python
# Python code showing all the ratios together.
# Make sure you have installed the fuzzywuzzy module.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s1 = "I love GeeksforGeeks"
s2 = "I am loving GeeksforGeeks"
print("FuzzyWuzzy Ratio: ", fuzz.ratio(s1, s2))
print("FuzzyWuzzy PartialRatio: ", fuzz.partial_ratio(s1, s2))
print("FuzzyWuzzy TokenSortRatio: ", fuzz.token_sort_ratio(s1, s2))
print("FuzzyWuzzy TokenSetRatio: ", fuzz.token_set_ratio(s1, s2))
print("FuzzyWuzzy WRatio: ", fuzz.WRatio(s1, s2), '\n\n')

# For the process module
query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']
print("List of ratios: ")
print(process.extract(query, choices), '\n')
print("Best among the above list: ", process.extractOne(query, choices))

Output:

FuzzyWuzzy Ratio:  84
FuzzyWuzzy PartialRatio:  85
FuzzyWuzzy TokenSortRatio:  84
FuzzyWuzzy TokenSetRatio:  86
FuzzyWuzzy WRatio:  84 


List of ratios: 
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)] 

Best among the above list:  ('g. for geeks', 95)

The FuzzyWuzzy library is built on top of the difflib library, and python-Levenshtein is used for speed. This makes it a convenient option for string matching in Python.

Conclusion

The FuzzyWuzzy library offers an efficient and straightforward approach for string comparison in Python. It simplifies the process of measuring similarity between strings by providing various ratios like Simple Ratio, Token Sort Ratio, and WRatio, making it highly versatile for different use cases. With its foundation built on Levenshtein Distance, FuzzyWuzzy is not only powerful but also easy to implement, requiring minimal setup via pip installation. Whether you’re working on text matching, data deduplication, or comparing user inputs, FuzzyWuzzy stands out as one of the best libraries for fuzzy string matching in Python.

FuzzyWuzzy Python Library - FAQs

1. What is FuzzyWuzzy used for in Python?

FuzzyWuzzy is a Python library used for fuzzy string matching, which helps find approximate matches between strings. It is commonly used for tasks like data deduplication, matching user inputs, and comparing text with minor differences by providing a similarity score.

2. How does FuzzyWuzzy calculate string similarity?

FuzzyWuzzy uses Levenshtein Distance to calculate the difference between two strings. It provides various ratios, such as Simple Ratio, Token Sort Ratio, and WRatio, to measure the similarity between strings and return a score out of 100.

3. What are the key features of FuzzyWuzzy?

Key features of FuzzyWuzzy include easy-to-use string comparison functions, multiple similarity ratios (like Token Set Ratio and WRatio), and support for finding the best match from a list of strings. Some ratios, such as WRatio, also ignore differences in case and minor variations such as punctuation, while the simple ratio is case-sensitive.

4. When should I use WRatio over Simple Ratio in FuzzyWuzzy?

WRatio is more versatile than Simple Ratio because it ignores differences in case and tolerates minor variations such as trailing punctuation. It is a good choice when comparing strings with inconsistent casing or extra characters, where the simple ratio would report a lower score.


