Babel Street Analytics Complete Guide To Name Matching


The Complete Guide to Name Matching
What It Is, How It Works, and Deciding Which Approach to Take
Name matching is difficult
Modern commerce, finance, and security require fast, accurate data from multiple
sources. But many times the only common link between data sources is the name of a
person, place, or organization. And unlike other forms of data, names are highly variable.
Spelling variations, misordered name components, and even nicknames (to name just a few
challenges) can result in a failed or false match. The consequences of mismatching a
name are no joke, either. Border patrol could fail to detain a watchlisted individual. A
hospital could create duplicate patient records. A bank could transfer funds to a
sanctioned organization.

You may be thinking, okay, why not just use regular text search? But regular text search
doesn’t go far enough to address the challenges of name matching or searching.
Traditional methods for matching names do exist, but their effectiveness is greatly
limited by a trifecta of frustrating side effects. They’re weak, slow, and expensive to
maintain and run.

Businesses need something better, and that something is modern, AI-driven, intelligent fuzzy name matching.

What is name matching used for?
Numerous critical situations require being able to verify an identity, and that almost always begins with looking up a name. Is the person at the border entry point on a list of suspected terrorists? Is that organization trying to move money to a bad actor? You have to get those questions right, and you only get a few minutes to answer. If you can’t do it quickly, you hold up law-abiding citizens. And if you can’t do it right, the bad actors get away. Here’s how name matching can improve speed, accuracy, and efficiency in a few specific use cases:

Border security
Missing even a single match against a watchlist puts citizens at risk. Border security agencies must deliver the most accurate results possible within real-world time constraints to ensure the seamless flow of people and goods through points of entry.

Anti-money laundering
Financial institutions are required by anti-money laundering regulations to avoid doing business with known bad actors. They must be able to check an entity against sanctions lists, verify information against business directory listings, and reduce the time required for manual remediation of false positive results.

Patient matching
Healthcare providers and payers need to quickly find and link patient records, creating a 360-degree view that enables better care and avoids duplicate records, which lower productivity and increase costs.

Investigations
Criminal investigations hinge on unambiguous identification of suspects, making name matching technology crucial for law enforcement agencies.

Getting fuzzy: Why name matching is far from easy
In a structured database, a name is a data point within a record, just like an email address, phone number, or unique ID number. But what happens if you only have a name to look up a record? It happens more often than you think, as privacy regulations may prevent the creation or sharing of ID numbers.

When names are your only unifying data point, you have a problem. (Actually, you have a lot of potential problems!) Unlike Social Security and other ID numbers, names are highly variable. There isn’t one correct way to spell a name. And a name isn’t unique to a single person or entity. Nicknames, transliteration spellings, multiple spellings of the same name, or the same name written in different languages or scripts are just a few of the ways matching can fail.

We’ve come to expect our search engines to overcome spelling errors, like when you type Searsha, and the search asks if you meant Saoirse. You might expect the same of name matching, but it’s technically much harder behind the scenes.

By the way
This one common Arabic given name can be transliterated more than
1,000 different ways in English. Here are just four possible spellings.

Why name search is unique

While there is an abundance of search tools on the market, name search is a different animal than document search, and requires a fundamentally different approach. Here’s why:

1. Typos have an outsized impact on name search. For example, “teh” is 1/250th of a one-page document; the typo in Jhon Smith is 50% of the name.

2. Spell-checking names is impossible. Cindy, Cyndi, Cindi, Cyndy, Syndy, Syndi, Sindi, and Sindy are all correct!

3. Common names (such as John) are as important as unusual names (like Zappa), yet internet search engines demote the importance of frequently occurring words.

[Figure: A few challenges to name matching]

Precision, recall and accuracy
The key metrics to use when evaluating a search solution or the performance of a system are precision and recall. Together, these define
accuracy. Already confused? Here’s the breakdown.

Precision measures the percentage of “matches found” that are correct. High precision means fewer false positives (incorrect matches).

$$\text{Precision} = \frac{tp}{tp + fp}$$

Recall measures the percentage of all possible correct matches that are actually found. High recall means fewer false negatives (missed matches).

$$\text{Recall} = \frac{tp}{tp + fn}$$

Key
tp = true positives (correct matches found)
fp = false positives (non-matches labeled as “matches”)
fn = false negatives (missed matches)

The pros and cons of common name matching methods
As a business user, you’re interested in results. You probably prefer leaving the
technological details to the pros. But you still need to understand a few key
terms and solutions when talking about name matching — fuzzy or otherwise —
with your technical team. You have to know what kind of accuracy your
business needs (see “Precision, recall, and accuracy”). Once you’ve gained an
understanding of some popular approaches to name matching, and their pros
and cons, you’ll see that a new, more sophisticated approach is necessary.

Simple and unsophisticated

Edit distance
Edit distance looks at how many character changes it takes to get from one
name to another, but it lacks the linguistic smarts to understand that Jack to
Jcak is more likely a match than Jack to Mack.

Pros: Easy to implement; fast


Cons: Limited to languages written in the Latin script (e.g., English, French, Spanish); all
swaps are weighted evenly, missing linguistic nuances
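
To make the Jack example concrete, here is a minimal sketch of one common form of edit distance, the Levenshtein distance. This guide does not say which variant it has in mind, so treat the choice and the function name `levenshtein` as assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b (classic Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Jack", "Jcak"))  # 2: the transposed letters count as two edits
print(levenshtein("Jack", "Mack"))  # 1: a single substitution
```

Under this measure the likely typo (Jcak) scores as further from Jack than a different name (Mack) does, which is the weakness described above.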

Smarter, but limited

Rules-based methods
Nearly 30 years ago, the go-to name matching technology was to create huge static lists of variations for every name on a watchlist.
These rules-based systems are complicated, top-heavy, and difficult (therefore costly) to maintain. There might be hundreds of
variations for an average name with three components — an Arabic name might easily have five. Multiply that times a million names on a
list, and you have a good sense of how this is bad for real-time processes. And you still might not make the match.

Rules-based methods are restricted by human knowledge. They can only capture situations that people encounter or imagine. As a
result, they can be as frustrating as a game of whack-a-mole: a new variant pops up, leading to a new rule, which may affect how other
rules work, adding complexity atop complexity.

There are two broad categories of rules-based methods.

The list method attempts to list all possible spelling variations of each name component and looks for matching names from these lists of name variations.

Pros: Easy to edit or add new rules

Cons: Difficult to coordinate potentially conflicting rules; computationally intensive; requires expensive hardware to run against long lists quickly; can’t handle unlisted names, missing/added spaces between components, or name components in incorrect fields

The common key method addresses some of the limitations of lists by reducing names to a key or code based on their English pronunciation, such that similar sounding names share the same key.

Pros: Fast execution; high recall

Cons: Mostly limited to Latin-based languages; transliterating non-Latin names reduces precision
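
As an illustration of the common key idea, the sketch below implements American Soundex, a well-known phonetic key based on English pronunciation. The guide does not name a specific key algorithm, so Soundex is an assumption chosen for illustration, not necessarily what any given product uses.

```python
# Consonant groups that American Soundex treats as sounding alike.
_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(name: str) -> str:
    """Return a four-character key: first letter plus up to three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous_code = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != previous_code:
            key += code
        previous_code = code  # a vowel resets the "same as previous" check
    return (key + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Cindy"), soundex("Cyndi"))    # C530 C530
print(soundex("Sindy"))                      # S530
```

Robert and Rupert share a key, which is where the high recall comes from. Cindy and Sindy do not, because Soundex keeps the first letter as-is, a small example of how a fixed key can still miss variants.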

Here’s an example
Your matching strategy finds three matches to your target name from a list of 200 names. Of those matches, two are correct, and one is incorrect. There were an additional three correct matches in the sample that your strategy did not uncover.

tp = 2 true positives (correct matches found)
fp = 1 false positives (non-matches labeled as “matches”)
fn = 3 false negatives (missed matches)

The precision of your results is .67 = 2/(2+1).

The recall of your results is .40 = 2/(2+3).

Sad but true: As precision increases, recall tends to decrease, and vice versa. If every name were picked as a match, you’d have perfect recall, but dismal precision. If you picked only the top match and it was correct, precision would be perfect, but recall would be low.

So what is accuracy? Accuracy (known as an “F-score”) is a calculation that combines precision and recall. What most businesses need to understand is whether they need higher precision or higher recall.

What kind of accuracy do you need? It depends!
Higher precision searches are those that home in on the likeliest matches in order to reduce time-wasting false hits. They’re designed for finding patient records or bank compliance screenings.

Higher recall searches are best for high-stakes situations, such as border security, where a miss could mean a potential terrorist attack.
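
The arithmetic in the worked example (tp = 2, fp = 1, fn = 3) can be checked with a few lines of Python. The guide does not say which F-score it means, so the sketch assumes the balanced F1 score, the harmonic mean of precision and recall. The function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported matches that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of all true matches that were actually found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (assumed F-score variant)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


p = precision(tp=2, fp=1)
r = recall(tp=2, fn=3)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))  # 0.67 0.4 0.5
```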

Smartest, but slowest

AI-powered statistical models


A statistical approach takes hundreds, if not thousands, of matching name pairs and trains a model to recognize what two “similar
names” look like, so that the model can calculate the probability that two names are a match and assign a similarity score.

Some statistical models can compare the semantic similarity of common words (i.e., how close in meaning they are), such as “drug” and
“pharmaceutical,” to spot a possible match in “PennyLuck Drugs” and “PennyLuck Pharmaceuticals.” Semantic similarity is especially
powerful across languages.

日本電信電話株式会社 would be phonetically transliterated as Nippon Denshin Denwa Kabushikigaisha, but its official English name is
Nippon Telegraph and Telephone Corporation. The semantic match would be telegraph to 電信 (denshin); telephone to 電話 (denwa);
and corporation to 株式会社 (kabushikigaisha).

Pros: Matches across languages and scripts; offers greater precision


Cons: Slower performance; high barrier to entry, as it requires training data and customized algorithms
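
To show the general shape of a statistical approach, here is a toy sketch: a tiny logistic-regression model trained on labeled name pairs that outputs a match probability. Everything in it is an assumption made for illustration (the two features, the eight training pairs, the learning rate). A production model would need far more data, including the non-matching pairs used here as negative examples, and much richer features.

```python
import math


def bigrams(name: str) -> set[str]:
    """Character bigrams of a lowercased name, padded to mark its start and end."""
    padded = f" {name.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def features(a: str, b: str) -> list[float]:
    """Bias term, bigram overlap (Dice coefficient), and shared first letter."""
    x, y = bigrams(a), bigrams(b)
    dice = 2 * len(x & y) / (len(x) + len(y))
    same_initial = 1.0 if a[:1].lower() == b[:1].lower() else 0.0
    return [1.0, dice, same_initial]


def match_probability(weights: list[float], a: str, b: str) -> float:
    """Logistic function of the weighted features: a similarity score in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs: int = 500, learning_rate: float = 0.5) -> list[float]:
    """Fit the weights with plain stochastic gradient ascent on the log-likelihood."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for a, b, label in pairs:
            error = label - match_probability(weights, a, b)
            for i, value in enumerate(features(a, b)):
                weights[i] += learning_rate * error * value
    return weights


# Hypothetical training data: 1 = same name, 0 = different names.
TRAINING_PAIRS = [
    ("Catherine", "Katherine", 1), ("Jon", "John", 1),
    ("Cindy", "Cyndi", 1), ("Mohammed", "Muhammad", 1),
    ("Jack", "Mary", 0), ("Smith", "Jones", 0),
    ("Robert", "Linda", 0), ("Anna", "Peter", 0),
]

weights = train(TRAINING_PAIRS)
print(match_probability(weights, "Steven", "Stephen"))  # higher score
print(match_probability(weights, "Steven", "Maria"))    # lower score
```

The point of the sketch is the workflow, not the model: the weights come from examples rather than hand-written rules, and the output is a graded score rather than a yes/no answer.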

The winner: AI-powered hybrid two-pass
All of the previous methods excel at solving a specific problem, but only
one. Edit distance is a blunt instrument for the complexity of name
matching. Lists are customizable, but hard to maintain and slow to
execute. Common keys are fast to execute, but offer limited extensibility
and accuracy, lacking the smarts of the slower statistical approach.

But a method can’t be good at just one thing when it comes to successful name matching. It must be able to address all of the name matching challenges listed above to succeed. The solution is a hybrid strategy that uses the strength of one approach to overcome the weakness of another. This is known as the AI-powered hybrid two-pass method.

Artificial intelligence can dynamically and simultaneously consider all the key ways that names vary, not just the ones found in a list. It can weigh computational methods (hence the “hybrid”) to refine a search and return the likeliest match. The dynamic quality of AI enables the identification of name variations in real time, instead of iterating over static lists.

The two-pass approach is a strategy for maximizing both recall and precision.
It works like this:

1. The first pass uses the common key method, which quickly produces a high-recall match set. In other words, it gets rid of obvious non-matches.
2. The second pass uses myriad computational methods combined with a statistical model (the AI). These take longer, so starting with a list that’s already winnowed down saves a lot of time. This pass is extremely high precision because the system is making smart decisions to assign a match score to each pair of names it considers.

The two-pass approach can handle a huge variety of variations: nicknames, misspellings, misplaced field data, even different languages and alphabets (scripts). This greatly improves accuracy compared with the common key method alone. Instead of being locked into a coarse comparison of derived keys (for better or worse), the second pass of the hybrid approach takes a fresh look at the original names in their original scripts before scoring their similarity.

The hybrid method also avoids the weaknesses of the list approach by not relying on pre-generation of name variations. Instead, it dynamically considers (via AI) the linguistic variations of names in each language. This linguistic knowledge of name variations also gives the hybrid approach an edge over the edit distance method, which lacks linguistic knowledge and cannot directly compare names in different scripts. The result is a fast, accurate name matching solution.
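
The control flow of the two passes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the vendor's implementation: the first pass uses Soundex keys per name token, and the second pass uses Python's `difflib.SequenceMatcher` as a stand-in for the statistical model. The function `two_pass_match`, the threshold, and the sample watchlist are all hypothetical.

```python
from difflib import SequenceMatcher

_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(token: str) -> str:
    """American Soundex key for one name component (assumed common key)."""
    letters = [ch for ch in token.upper() if ch.isalpha()]
    if not letters:
        return ""
    key, previous = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue
        code = _CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code
    return (key + "000")[:4]


def name_keys(name: str) -> set[str]:
    """One key per whitespace-separated component of the name."""
    return {soundex(token) for token in name.split()}


def two_pass_match(query: str, watchlist: list[str], threshold: float = 0.8):
    # Pass 1 (fast, high recall): keep names sharing at least one key with the query.
    query_keys = name_keys(query)
    candidates = [name for name in watchlist if name_keys(name) & query_keys]
    # Pass 2 (slower, higher precision): score each surviving pair on the original
    # strings. SequenceMatcher is only a placeholder for a trained model.
    scored = [
        (name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in candidates
    ]
    return sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )


watchlist = ["Jon Smith", "John Smyth", "Joan Smithe", "Mary Jones", "Robert Brown"]
for name, score in two_pass_match("John Smith", watchlist):
    print(f"{name}: {score:.2f}")
```

The design point is that the expensive scorer only ever sees the short candidate list from pass 1, so its cost depends on the number of plausible matches rather than the size of the watchlist.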

The Winning Combination

Hybrid: Uses multiple methods to maximize accuracy and speed across a broad range of name variations.

AI-powered: Adds precision in evaluating matches using machine learning.

Two-pass: Pass 1 quickly eliminates obvious non-matches. Pass 2 takes a closer look to rank matches from highest to lowest confidence with high precision.

Congratulations! You know that misspellings, aliases, nicknames, initials, and different languages are but a few of the things that get in the way of making the right match. You also know how important name matching is for keeping people and businesses safe. So why are you using a slow, inaccurate solution?

If you really want to solve the problem, you can begin with a smart, powerful name matching solution that integrates into your current systems without disruptive and risky rip-and-replace implementation. A solution that takes the best of traditional name matching technology, throws in some serious intelligence, and gives you a match score you can trust.

More than names: Address and date matching
If you have more than a name to identify a person, it’s only
prudent to use every bit of data you have. Addresses and dates of
birth can help identify one individual among many with similar
names. Then again, address fields such as building name, street,
city, province, and country are names themselves. They benefit
from name matching algorithms and an understanding of
address-specific abbreviations and postal codes.

For date fields, fuzzy matching should handle different formats, swapped numbers, and chronological proximity, such as December 31, 2021 and January 1, 2022.
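
A rough sketch of what fuzzy date matching might look like is shown below. The accepted formats, the reading of "swapped numbers" as a day/month swap, the three-day window, and the score values are all assumptions chosen for illustration, not rules taken from any product. Parsing month names with `%B` also assumes an English locale.

```python
from datetime import date, datetime

# Hypothetical set of input formats; a real system would likely accept more.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y", "%B %d, %Y")


def parse_date(text: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def date_similarity(a: date, b: date, window_days: int = 3) -> float:
    """Return a score in [0, 1]; the specific values are illustrative only."""
    if a == b:
        return 1.0
    # Day and month swapped, e.g. 03/04 entered as 04/03.
    if a.year == b.year and a.day == b.month and a.month == b.day:
        return 0.9
    # Chronological proximity: the score decays with the gap in days.
    gap = abs((a - b).days)
    if gap <= window_days:
        return 1.0 - gap / (window_days + 1)
    return 0.0


print(date_similarity(parse_date("December 31, 2021"), parse_date("2022-01-01")))  # 0.75
print(date_similarity(date(2021, 3, 4), date(2021, 4, 3)))                         # 0.9
```

A comparison on the raw strings would treat December 31, 2021 and January 1, 2022 as sharing almost nothing, while comparing parsed dates shows they are one day apart.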

Babel Street Analytics for identity intelligence
You need Babel Street Analytics. Its hybrid AI-powered name matching technology is the go-to choice for mission-critical applications
that need to verify or match identities because it is accurate, fast, and easily integrated into any system. Babel Street Analytics’ plugins
for Elasticsearch and Apache Solr handle the complexity of name matching and only deliver match-score ranked results.

Babel Street Analytics uses a two-pass approach to take advantage of the speed of the common key method and the precision of
machine learning to perform fuzzy name matching, decreasing false positives and false negatives. It lets you tune parameters for each
use case by adjusting the precision/recall ratio, accommodating name data idiosyncrasies, and weighting data fields to account for their
reliability.

Customers report up to a 90% reduction in false positives and an increase in true positives when using the global name matching capabilities of Babel Street Analytics.

Every day, Babel Street customers conduct over 500 million watchlist checks worldwide.

Babel Street is the trusted technology partner for the world’s most advanced identity
intelligence and risk operations. The Babel Street Insights platform delivers advanced AI and
data analytics solutions to close the Risk-Confidence Gap.

Babel Street provides unmatched, analysis-ready data regardless of language, proactive risk
identification, 360-degree insights, high-speed automation, and seamless integration into
existing systems. We empower government and commercial organizations to transform high-
stakes identity and risk operations into a strategic advantage.

Learn more at babelstreet.com

All names, companies, and incidents portrayed in this document are fictitious. No identification with actual persons
(living or deceased), places, companies, and products are intended or should be inferred.

© 2024 Babel Street. All Rights Reserved.
