Babel Street Analytics Complete Guide To Name Matching
What It Is, How It Works, and Deciding Which Approach to Take
Name matching is difficult
Modern commerce, finance, and security require fast, accurate data from multiple
sources. But many times the only common link between data sources is the name of a
person, place, or organization. And unlike other forms of data, names are highly variable.
Spelling, misordered name components, and even nicknames (to name just a few
challenges) can result in a failed or false match. The consequences of mismatching a
name are no joke, either. Border patrol could fail to detain a watchlisted individual. A
hospital could create duplicate patient records. A bank could transfer funds to a
sanctioned organization.
You may be thinking, okay, why not just use regular text search? But regular text search
doesn’t go far enough to address the challenges of name matching or searching.
Traditional methods for matching names do exist, but their effectiveness is greatly
limited by a trifecta of frustrating side effects. They’re weak, slow, and expensive to
maintain and run.
Missing even a single match against a watchlist puts citizens at risk. Border security agencies must deliver the most accurate results possible within real-world time constraints to ensure the seamless flow of people and goods through points of entry.
Financial institutions are required by anti-money laundering regulations to avoid doing business with known bad actors. They must be able to check an entity against sanctions lists, verify information against business directory listings, and reduce the time required for manual remediation of false positive results.
Healthcare providers and payers need to quickly find and link patient records, creating a 360-degree view that enables better care and avoids duplicate records, which lower productivity and increase costs.
Criminal investigations hinge on unambiguous identification of suspects, making name matching technology crucial for law enforcement agencies.
By the way
This one common Arabic given name can be transliterated more than
1,000 different ways in English. Here are just four possible spellings.
Precision measures the percentage of matches found that are actually correct. High precision means fewer false positives (incorrect matches).
Precision = tp / (tp + fp)
Recall measures the percentage of all possible correct matches that are actually found. High recall means fewer false negatives (missed matches).
Recall = tp / (tp + fn)
tp = true positives (correct matches found)
fp = false positives (non-matches labeled as “matches”)
fn = false negatives (missed matches)
Edit distance
Edit distance looks at how many character changes it takes to get from one
name to another, but it lacks the linguistic smarts to understand that Jack to
Jcak is more likely a match than Jack to Mack.
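The limitation described above can be seen in a minimal sketch of classic (Levenshtein) edit distance. This is a textbook formulation written for illustration, not the implementation used by any particular product. It counts a swapped pair of letters as two changes, so it rates the likely typo Jcak as farther from Jack than the different name Mack.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: each insertion, deletion, or substitution costs 1."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("Jack", "Jcak"))  # 2 (the swapped letters count as two substitutions)
print(levenshtein("Jack", "Mack"))  # 1 (a single substitution)
```

A purely character-based score therefore ranks Mack as the closer candidate, which is the opposite of what a human reviewer would expect.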
Rules-based methods
Nearly 30 years ago, the go-to name matching technology was to create huge static lists of variations for every name on a watchlist.
These rules-based systems are complicated, top-heavy, and difficult (therefore costly) to maintain. There might be hundreds of
variations for an average name with three components, and an Arabic name might easily have five. Multiply that by a million names on a
list, and you can see why this approach is poorly suited to real-time processes. And you still might not make the match.
Rules-based methods are restricted by human knowledge. They can only capture situations that people encounter or imagine. As a
result, they can be as frustrating as a game of whack-a-mole: a new variant pops up, leading to a new rule, which may affect how other
rules work, adding complexity atop complexity.
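The static-list idea can be sketched in a few lines. The variant lists and the `list_match` helper below are hypothetical and exist only for illustration; a real rules-based system would hold far more variants per name.

```python
# Hypothetical, hand-maintained variant lists for two watchlist entries.
VARIANTS = {
    "Robert Smith": {"robert smith", "rob smith", "bob smith", "robert smyth"},
    "Katherine Jones": {"katherine jones", "catherine jones", "kate jones"},
}


def list_match(query: str) -> list:
    """Return watchlist entries whose variant list contains the query exactly."""
    normalized = " ".join(query.lower().split())  # lowercase, collapse whitespace
    return [entry for entry, variants in VARIANTS.items() if normalized in variants]


print(list_match("Bob Smith"))    # ['Robert Smith']  (variant was listed)
print(list_match("Bobby Smith"))  # []  (nobody anticipated this nickname)
print(list_match("Smith, Bob"))   # []  (reordered components are not listed)
```

Every miss requires someone to add another variant or rule by hand, which is where the maintenance cost comes from.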
The list method attempts to list all possible spelling variations of each name component and looks for matching names from these lists of name variations.
Pros: Easy to edit or add new rules
Cons: Difficult to coordinate potentially conflicting rules, computationally intensive, requires expensive hardware to run against long lists quickly; can’t handle unlisted names, missing/added spaces between components, or name components in incorrect fields
The common key method addresses some of the limitations of lists by reducing names to a key or code based on their English pronunciation, such that similar sounding names share the same key.
Pros: Fast execution; high recall
Cons: Mostly limited to Latin-based languages; transliterating non-Latin names reduces precision
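One well-known example of a common key is American Soundex, sketched below. The source does not say which key algorithm any particular product uses, so treat Soundex here as an illustrative stand-in for the common key method in general.

```python
# Soundex digit groups: consonants that sound alike in English share a digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""  # nothing to encode, for example a name in a non-Latin script
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            key += code
        if c not in "hw":  # h and w do not separate repeated codes; vowels do
            previous = code
    return (key + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(repr(soundex("Мухаммед")))             # '' (no key for non-Latin input)
```

Comparing short keys is fast and catches many spelling variants, but the key is built from English letter sounds, so it has nothing to say about a name until that name has been transliterated into Latin script.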
Sad but true: as precision increases, recall tends to decrease, and vice versa. If every name were
picked as a match, you’d have perfect recall but dismal precision. If you picked only the
top match and it was correct, precision would be perfect, but recall would be low.
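The trade-off can be illustrated with a small worked example. The candidate scores and the set of true matches below are invented for illustration; only the precision and recall definitions come from the text.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Compute precision and recall from predicted and actual match sets."""
    tp = len(predicted & actual)  # correct matches found
    fp = len(predicted - actual)  # non-matches labeled as matches
    fn = len(actual - predicted)  # missed matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical match scores for five candidates; A, B, and D are the real matches.
scores = {"A": 0.95, "B": 0.80, "C": 0.60, "D": 0.40, "E": 0.20}
true_matches = {"A", "B", "D"}

for threshold in (0.0, 0.5, 0.9):
    predicted = {name for name, score in scores.items() if score >= threshold}
    p, r = precision_recall(predicted, true_matches)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
# threshold=0.0 precision=0.60 recall=1.00
# threshold=0.5 precision=0.67 recall=0.67
# threshold=0.9 precision=1.00 recall=0.33
```

Accepting every candidate gives perfect recall with weak precision, while accepting only the top-scoring candidate gives perfect precision and misses two of the three real matches.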
Some statistical models can compare the semantic similarity of common words (i.e., how close in meaning they are), such as “drug” and
“pharmaceutical,” to spot a possible match in “PennyLuck Drugs” and “PennyLuck Pharmaceuticals.” Semantic similarity is especially
powerful across languages.
日本電信電話株式会社 would be phonetically transliterated as Nippon Denshin Denwa Kabushikigaisha, but its official English name is
Nippon Telegraph and Telephone Corporation. The semantic match would be telegraph to 電信 (denshin); telephone to 電話 (denwa);
and corporation to 株式会社 (kabushikigaisha).
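As a toy illustration of the cross-language example above, the sketch below replaces the statistical model with a tiny hand-written gloss table. The `GLOSSES` table, the stopword list, and both helper functions are hypothetical; a real system would learn these associations rather than look them up.

```python
# Hypothetical gloss table: each Japanese component maps to acceptable English renderings.
GLOSSES = {
    "日本": {"nippon", "japan"},
    "電信": {"telegraph"},
    "電話": {"telephone"},
    "株式会社": {"corporation"},
}
STOPWORDS = {"and", "of", "the"}


def segment(name: str, vocab: dict) -> list:
    """Greedy longest-match segmentation against the known components."""
    parts, i = [], 0
    longest = max(len(key) for key in vocab)
    while i < len(name):
        for size in range(min(longest, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in vocab or size == 1:  # fall back to a single character
                parts.append(piece)
                i += size
                break
    return parts


def semantic_overlap(japanese_name: str, english_name: str) -> float:
    """Share of components whose gloss appears among the English tokens."""
    english_tokens = {t for t in english_name.lower().split() if t not in STOPWORDS}
    parts = segment(japanese_name, GLOSSES)
    matched = sum(1 for part in parts if GLOSSES.get(part, set()) & english_tokens)
    return matched / max(len(parts), len(english_tokens))


print(segment("日本電信電話株式会社", GLOSSES))
# ['日本', '電信', '電話', '株式会社']
print(semantic_overlap("日本電信電話株式会社", "Nippon Telegraph and Telephone Corporation"))
# 1.0
```

A character-level comparison of the phonetic transliteration against the official English name would score poorly here, while the meaning-based comparison lines up every component.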
1. The first pass uses the common key method, which quickly produces a high-recall match set. In other words, it gets rid of obvious non-matches.
2. The second pass uses myriad computational methods combined with a statistical model (the AI). These take longer, so starting with a list that’s already winnowed down saves a lot of time. This pass is extremely high precision because the system is making smart decisions to assign a match score to each pair of names it considers.
The two-pass approach can handle a huge variety of variations: nicknames, misspellings, misplaced field data, even different languages and alphabets (scripts). This greatly improves accuracy compared with the common key method alone. Instead of being locked into a coarse comparison of derived keys (for better or worse), the second pass of the hybrid approach takes a fresh look at the original names in their original scripts before scoring their similarity.
The hybrid method also avoids the weaknesses of the list approach by not relying on pre-generation of name variations. Instead, it dynamically considers (via AI) the linguistic variations of names in each language. This linguistic knowledge of name variations also gives the hybrid approach an edge over the edit distance method, which lacks linguistic knowledge and cannot directly compare names in different scripts. The result is a fast, accurate name matching solution.
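The two-pass structure can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's implementation: `crude_key` is a hypothetical stand-in for a real common key, and `difflib.SequenceMatcher` stands in for the statistical model, which it does not resemble in linguistic knowledge.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiouy")


def crude_key(part: str) -> str:
    """Hypothetical coarse key: first letter plus later consonants, repeats collapsed."""
    letters = [c for c in part.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c not in VOWELS and c != key[-1]:
            key.append(c)
    return "".join(key)


def component_keys(name: str) -> set:
    return {crude_key(part) for part in name.split()} - {""}


def two_pass_match(query: str, watchlist: list, threshold: float = 0.7) -> list:
    # Pass 1: cheap key comparison keeps any name sharing a component key.
    query_keys = component_keys(query)
    shortlist = [name for name in watchlist if component_keys(name) & query_keys]
    # Pass 2: slower pairwise scoring, run only on the shortlist.
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in shortlist
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


watchlist = ["Jon Smyth", "John Smith", "Maria Lopez", "Jane Smart"]
for score, name in two_pass_match("John Smith", watchlist):
    print(f"{score:.2f} {name}")
```

In this run only the two Smith/Smyth entries survive the first pass, so the more expensive scorer is called twice instead of four times. On a list of a million names, that pruning is where the speed comes from.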
Uses multiple methods to maximize accuracy and speed across a broad range of name variations.
Adds precision in evaluating matches using machine learning.
Pass 1 quickly eliminates obvious non-matches. Pass 2 takes a closer look to rank matches from highest to lowest confidence with high precision.
Babel Street Analytics uses a two-pass approach to take advantage of the speed of the common key method and the precision of
machine learning to perform fuzzy name matching, decreasing false positives and false negatives. It lets you tune parameters for each
use case by adjusting the precision/recall ratio, accommodating name data idiosyncrasies, and weighting data fields to account for their
reliability.
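To make the tuning ideas concrete, here is a minimal sketch of field weighting and threshold adjustment. The field names, weights, threshold values, and functions are all hypothetical and do not reflect the Babel Street Analytics API. `difflib.SequenceMatcher` again stands in for a real similarity model.

```python
from difflib import SequenceMatcher

# Hypothetical weights: more reliable fields contribute more to the record score.
WEIGHTS = {"name": 0.6, "date_of_birth": 0.3, "city": 0.1}


def field_similarity(a: str, b: str):
    """Similarity in [0, 1], or None when either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(query: dict, candidate: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of field similarities, skipping missing fields."""
    total = weight_sum = 0.0
    for field, weight in weights.items():
        similarity = field_similarity(query.get(field, ""), candidate.get(field, ""))
        if similarity is None:
            continue  # a missing field neither helps nor hurts the score
        total += weight * similarity
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def is_match(query: dict, candidate: dict, threshold: float = 0.8) -> bool:
    """A higher threshold favors precision; a lower one favors recall."""
    return record_score(query, candidate) >= threshold


query = {"name": "John Smith", "date_of_birth": "1980-01-01", "city": "Boston"}
candidate = {"name": "John Smith", "date_of_birth": "1980-01-01", "city": "Denver"}
print(is_match(query, candidate))                  # True
print(is_match(query, candidate, threshold=0.95))  # False
```

Because the city field carries little weight, a mismatch there barely lowers the score. Raising the threshold trades recall for precision on the same data.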
Babel Street provides unmatched, analysis-ready data regardless of language, proactive risk
identification, 360-degree insights, high-speed automation, and seamless integration into
existing systems. We empower government and commercial organizations to transform high-
stakes identity and risk operations into a strategic advantage.
All names, companies, and incidents portrayed in this document are fictitious. No identification with actual persons
(living or deceased), places, companies, or products is intended or should be inferred.