Entity resolution (ER) - sounds complicated, right? Well, it doesn’t have to be! It’s basically just figuring out which pieces of data in your messy datasets actually refer to the same real-world thing. Think of it as connecting the dots between different bits of information about the same person, place, or concept.
Why bother with ER? Because having clean, linked data is crucial for tons of stuff like:
- Customer 360: Get a complete picture of your customers, so you can personalize their experiences
- Fraud Detection: Spot suspicious activity by linking related transactions
- Supply Chain Resilience: Accurately monitor supply and demand of your products, from manufacturing to sales
Outlook - How to Deduplicate in Python
There are some fantastic open-source tools that can help you wrangle those duplicates and get your data back in shape. Let’s dive in and explore a few of them using an open dataset of messy music records and the Python ecosystem.
This article covers:
- What it means to create an end-to-end deduplication pipeline using pandas, recordlinkage, and networkx.
- How to fail twice when relying on just the most popular techniques.
- How to expand your toolbox beyond the standard using faiss, fasttext, and some simple heuristics.
Example: Messy Music Records
The database group of the University of Leipzig open-sourced the Musicbrainz 20K dataset, among many others.
import pandas as pd

file_path = 'musicbrainz-20-A01.csv'
# Read the duplicates ground truth - two records with the same cluster ID are duplicates:
xref = pd.read_csv(file_path, dtype='str').filter(['TID', 'CID'])
xref.columns = ['record_id', 'cluster_id']
# Read the 20k records:
df = (pd.read_csv(file_path, dtype='str')
.filter(['TID', 'number', 'title', 'length', 'artist', 'album', 'year', 'language'])
.rename(columns={'TID': 'record_id'}).set_index('record_id'))
# Print five example rows:
df.head(10).tail(5)
| record_id | number | title | length | artist | album | year | language |
|---|---|---|---|---|---|---|---|
| 6 | 6 | Try (acoustic) - 2008-02-15: Le Grand Rex, Paris, France | | Neil Young | | | English |
| 7 | 11 | Bruce Maginnis - Sttreet Hype | 177 | | Groove City | 05 | English |
| 8 | 9 | Luce Dufault - Ballade à donner | 242 | Luce Dufault | | 96 | French |
| 9 | 9 | Just Like Tom Thumb's Blues (live) | 462000 | Wendy Saddington | Blues Women Anthology, Volume 7 | 2007 | English |
| 10 | 4 | Στα καμένα | 195000 | Λαυρέντης Μαχαιρίίτσας | Συλλογή Δίφωνο, 22: Μουσικοί βιότοποι | 1997 | Greek |
The dataset df covers 20k records across different languages, with a ton of data quality issues. Just have a look at
cluster 8320, with its missing values, inconsistently formatted attributes, and misplaced values:
| record_id | number | title | length | artist | album | year | language |
|---|---|---|---|---|---|---|---|
| 6310 | 8 | DJ Rashad feat. Spinn - Double Cup | 249 | | Double Cup | 13 | English |
| 7690 | 8 | Double Cup | 4.15 | DJ Rashad feat. Spinn | Double Cup | '13 | English |
| 11167 | 8 | 008-Double Cup | 4m 9sec | DJ Rashad feat. Spinn | Double Cup (2013) | | Eng. |
| 16154 | 008 | Double Cup | 04:09 | DJ Rashad feta. Spinn | Double Cup | 2013 | |
Table xref encodes the ground truth of this benchmark dataset. That’s something we don’t have in practice, but it is exactly the desired outcome of our entity resolution pipeline. Every two records with the same cluster_id are duplicates:
| record_id | cluster_id |
|---|---|
| 6310 | 8320 |
| 7690 | 8320 |
| 11167 | 8320 |
| 16154 | 8320 |
Key Steps in End-to-End ER
Let’s break down the key steps to get you from chaos to clarity:
- Preprocessing: Clean up your data! This involves things like fixing inconsistencies, standardizing formats, and handling missing values.
- Pairing: Narrow down the number of comparisons you need to make. Group similar records together based on certain attributes (e.g., names, locations).
- Matching: Compare records within each block using similarity measures. Think of this as figuring out how likely it is that two records refer to the same entity.
- Clustering: Group together records that are likely to represent the same entity, leveraging collective evidence and resolving potential match conflicts.
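The four stages above can be sketched end-to-end on toy data. This is a minimal illustration, not the article's pipeline: exact matching and exact-key grouping stand in for the fuzzier scoring and graph clustering used below, and all record values are made up.

```python
import pandas as pd

# Toy records: two spellings of the same artist plus one distinct artist.
toy = pd.DataFrame({
    'record_id': ['a', 'b', 'c'],
    'artist': ['Neil Young', ' neil  young', 'Luce Dufault'],
})

# 1. Preprocessing: normalize case and whitespace.
toy['artist_clean'] = (toy['artist'].str.lower()
                       .str.replace(r'\s+', ' ', regex=True).str.strip())

# 2. Pairing: block on the first character of the cleaned artist,
#    so only records within the same block are ever compared.
toy['block_key'] = toy['artist_clean'].str[0]
pairs = toy.merge(toy, on='block_key', suffixes=('_0', '_1'))
pairs = pairs.loc[pairs.record_id_0 < pairs.record_id_1]

# 3. Matching: here, trivially, exact equality of the cleaned attribute.
matches = pairs.loc[pairs.artist_clean_0 == pairs.artist_clean_1]

# 4. Clustering: with exact matching, grouping on the cleaned value
#    already yields conflict-free clusters.
clusters = toy.groupby('artist_clean')['record_id'].apply(set).tolist()
```

Records 'a' and 'b' end up in one cluster; 'c' stays on its own.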
Preprocessing
Alright, let’s keep it lean and mean – just four attributes for now, the ones that’ll really matter when we get down to the nitty-gritty. We’re gonna:
- Cast number values to integers.
- Impute missing artist values using a simple heuristic.
- Translate Unicode into ASCII, such as “Στα καμένα” into “Sta kamena”.
- Leverage RecordLinkage’s clean to remove special characters and any redundant white space.
- Remove “unknown” and similar placeholders for a missing value.
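To see the Unicode-to-ASCII step in isolation (assuming the unidecode package is installed):

```python
from unidecode import unidecode

# Transliterate Greek characters to their closest ASCII representation:
unidecode('Στα καμένα')   # -> 'Sta kamena'
```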
import numpy as np
import recordlinkage.preprocessing as rlp
from unidecode import unidecode
# Cast numbers
df.number = df.number.str.replace('[^0-9]', '', regex=True)
df.loc[df.number.eq(''), 'number'] = np.nan
df.number = df.number.astype('Int64')
# Impute artist with a simple heuristic:
df.loc[df.artist.isnull(), 'artist'] = df.title.str.split('-').str[0]
df.loc[df.artist.isnull(), 'artist'] = df.title
unknowns = ['', 'unknown', 'untitled', 'na', 'unk']
for col in ('title', 'artist', 'album'):
    # Unicode -> ASCII:
    df[col] = rlp.clean(df[col].fillna('').apply(unidecode))
    # Replace "unknown" placeholders with a formal missing value:
    df.loc[df[col].isin(unknowns), col] = np.nan
Pairing - Standard Blocking
Comparing every pair of records? Nightmare! Think of blocking like sorting your laundry before washing - grouping similar items (records) together based on key attributes (like color or fabric type) to make the whole process more efficient. In entity resolution, blocking does just that, streamlining the matching process by narrowing down the potential candidates.
Standard Blocking is easily the most popular pairing technique in entity resolution, and it’s built right into the RecordLinkage package for your convenience.
# Initiate Pairing with RecordLinkage:
import recordlinkage as rl
indexer = rl.Index()
# We add two records to our index if
# they share the same `number` value:
indexer.block('number')
# or the first character of `artist` is the same:
df['same_first_char_artist'] = df['artist'].str.slice(stop=1)
indexer.block('same_first_char_artist')
# Run Blocking logic on our data:
candidate_pairs = indexer.index(df)
The candidate_pairs we get is a massive list of record_id duos, flagged as possible matches. To gauge the
effectiveness of our pairing, we’ll zoom in on two crucial metrics.
- We’re down to 19.7 million potential matches from a staggering 187.7 million. That’s a 90% reduction in our workload.
- How many of the real matches are we actually catching in our net? We call this metric “Recall”, and we’re lucky to have the xref ground truth data to calculate it:
from itertools import combinations
def cross_ref_to_matches(df: pd.DataFrame) -> pd.MultiIndex:
    match_lists = (df.sort_values('record_id', ascending=False)
                   .dropna(subset=['record_id'])
                   .groupby('cluster_id')['record_id'].apply(list))
    match_lists = match_lists.loc[match_lists.apply(len) > 1]
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)
actual_matches = cross_ref_to_matches(xref)
# Compute recall after pairing:
actual_matches.isin(candidate_pairs).mean()
In this case, we’re looking at a 49% Recall rate, which means we’re snagging about half of the true matches out there.
Time to think outside the RecordLinkage box! We can boost recall with a less conventional technique.
Better Pairing with FastText and FAISS
This technique has been applied in information retrieval long before it found its way into entity resolution. First, we represent every record as an element in a numeric vector space equipped with a metric, say, Cosine distance.
import fasttext
# Assuming we have downloaded a fastText model to the working directory:
ft_model = fasttext.load_model('cc.en.300.bin')
# Concatenate key attributes into a single "sentence"
sentence_representations = (df[['title', 'artist', 'album']]
.apply(lambda s: ' '.join(s.dropna()), axis=1)
.str.replace(r'\s\s+', ' ', regex=True).values)
# Compute a vector representation for every sentence using a FastText model:
embeddings = np.array([ft_model.get_sentence_vector(val) for val in sentence_representations])
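Side note on the metric: with unit-length vectors, the inner product equals cosine similarity, which is why the FAISS index below combines the 'L2norm' factory with METRIC_INNER_PRODUCT. A quick numpy sanity check:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity via explicit normalization:
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value as a plain inner product after L2-normalizing both vectors:
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
inner = a_n @ b_n
```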
Second, for every input record x_0, retrieve its k nearest neighbors x_1,...,x_k in that vector space and
add (x_0,x_1),...,(x_0,x_k) to the candidate matches:
import faiss
import numpy as np
import pandas as pd

def get_nn_pairs(embeddings: np.ndarray, record_ids: np.ndarray, k: int = 5,
                 index_factory: str = 'L2norm,Flat') -> pd.DataFrame:
    index = faiss.index_factory(embeddings.shape[1], index_factory, faiss.METRIC_INNER_PRODUCT)
    if not index.is_trained:
        # Some factories need to be fitted to data:
        index.train(embeddings)
    index.add(embeddings)
    # FAISS computes the k nearest neighbors for every input embedding:
    nn_distances, nn_indices = index.search(embeddings, k=k)
    # Reshape results into pairs of neighbors:
    res = []
    for idx, row in enumerate(nn_indices):
        res.append(np.array(np.meshgrid([record_ids[idx]], record_ids[row])).T.reshape(-1, 2))
    res = np.vstack(res)
    # Reduce the results to non-trivial pairs:
    res = pd.DataFrame(res[res[:, 0] != res[:, 1]], columns=['record_id_0', 'record_id_1'])
    # Order pairs so that first component > second component, by RecordLinkage convention:
    res_c = res.copy(deep=True)
    must_flip = res_c.record_id_0 < res_c.record_id_1
    res.loc[must_flip, 'record_id_0'] = res_c.record_id_1
    res.loc[must_flip, 'record_id_1'] = res_c.record_id_0
    return res.drop_duplicates().reset_index(drop=True)
candidate_pairs_knn = pd.MultiIndex.from_frame(get_nn_pairs(embeddings=embeddings, record_ids=df.index.values, k=5))
# Let's revisit our two key metrics for pairing:
n_all_pairs = len(df) * (len(df) - 1) // 2  # every possible pair across the ~20k records
shrinkage_percent = candidate_pairs_knn.shape[0] / n_all_pairs * 100
recall = actual_matches.isin(candidate_pairs_knn).mean()
- Boom! We’ve slashed those potential matches down to a mere 59k. From a mountain of 187.7 million, we’re now looking at a tiny fraction – a minuscule 0.03%! That’s the power of smart blocking techniques in action.
- Hold onto your hats, folks! What’s truly mind-blowing is that our Recall rate catapulted to a jaw-dropping 97.4%! That’s nearly perfect accuracy in snagging those true matches. Talk about a game-changer!
We unravel the secrets of this powerful technique in a series of blog posts, starting with this one.
Matching
Normally, this is where things get computationally heavy. But not today! With a lean 59k pairs in our net, we’re in the clear. Let’s turn our attention to similarity scoring. We’ll focus on those three key attributes and flex our muscles with three different string metrics for each.
# Initiate similarity scoring with RecordLinkage:
comparer = rl.Compare(n_jobs=-1)
# There are a few more methods for numeric attributes but the interesting stuff is for strings:
comparer.string(left_on='title', right_on='title', method='jarowinkler', label='title_jw')
comparer.string(left_on='title', right_on='title', method='lcs', label='title_lcs')
comparer.string(left_on='title', right_on='title', method='qgram', label='title_qgram')
comparer.string(left_on='artist', right_on='artist', method='jarowinkler', label='artist_jw')
comparer.string(left_on='artist', right_on='artist', method='lcs', label='artist_lcs')
comparer.string(left_on='artist', right_on='artist', method='qgram', label='artist_qgram')
comparer.string(left_on='album', right_on='album', method='jarowinkler', label='album_jw')
comparer.string(left_on='album', right_on='album', method='lcs', label='album_lcs')
comparer.string(left_on='album', right_on='album', method='qgram', label='album_qgram')
scores = comparer.compute(candidate_pairs_knn, df)
So, for every potential match, we’ve got a set of nine features, each giving us a similarity score between 0 and 1. Think of it as a multi-dimensional picture of how closely those records resemble each other. Here are five examples:
| record_id_0 | record_id_1 | title_jw | title_qgram | title_lcs | artist_jw | artist_lcs | artist_qgram | album_jw | album_lcs | album_qgram |
|---|---|---|---|---|---|---|---|---|---|---|
| 15184 | 1 | 0.729 | 0.646 | 0.525 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 11221 | 1 | 0.588 | 0.331 | 0.222 | 0.661 | 0.133 | 0.176 | 0.535 | 0.25 | 0.147 |
| 3979 | 1 | 0.638 | 0.444 | 0.405 | 0.849 | 0.364 | 0.389 | 0.661 | 0.411 | 0.268 |
| 16801 | 1 | 0.451 | 0.071 | 0.048 | 0.542 | 0.0 | 0.059 | 0.525 | 0.382 | 0.294 |
| 2 | 16796 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
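To build intuition for what these [0, 1] scores mean, here is a stand-in from the standard library. difflib's ratio is not one of RecordLinkage's comparators, just an analogous normalized similarity: identical strings score 1.0, and related strings score higher than unrelated ones.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

similarity('double cup', 'double cup')       # identical strings -> 1.0
similarity('double cup', '008 double cup')   # close variants score high
similarity('double cup', 'luce dufault')     # unrelated strings score low
```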
Imagine we’re manually reviewing a random sample of 1000 pairs. We can mimic this process using the xref ground truth
data and some handy random sampling techniques in pandas.
import numpy as np
# Adding the ground truth as the `label` column derived from xref:
scores['label'] = np.where(scores.index.isin(actual_matches), 1., 0.)
# Simulating manual labeling of 1000 pairs:
df_train = scores.sample(1000, random_state=1)
df_test = scores.loc[~scores.index.isin(df_train.index)]
Now, let’s train a binary classifier on our data. We’ll use the catboost package, known for its resistance to overfitting. It also lets us add monotonicity constraints, which align perfectly with the nature of our similarity scores.
from catboost import CatBoostClassifier
# Catboost can incorporate monotonicity constraints
# "1" means non-decreasing in a feature - we set non-decreasing for all five:
model = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1, 1, 1, 1, 1, 1, 1])
model.fit(X=df_train.iloc[:, :-1], y=df_train['label'], verbose=False)
Let’s put our matcher to the test and see how it performs on a fresh set of data:
from sklearn.metrics import precision_score, recall_score
# Can be tuned using the training performance - keeping it on 50% for simplicity:
decision_threshold = 0.5
y_test_pred_proba = model.predict_proba(df_test.iloc[:, :-1])[:, 1]
y_test_pred = np.where(y_test_pred_proba > decision_threshold, 1., 0.)
precision = precision_score(df_test['label'], y_test_pred)
recall = recall_score(df_test['label'], y_test_pred)
f1 = 2 * precision * recall / (precision + recall)
Drumroll, please! The results are in, and they’re impressive: We’ve hit a precision of 92.8%, recall of 92.9%, and an F1 score of 92.9%. That means our model is not only accurate in identifying matches but also manages to capture a vast majority of them.
Clustering - Transitive Closure
Hold on a second! Let’s say our process flags (x_0,x_1) and (x_1,x_2) as duplicates, but (x_0,x_2) isn’t. That’s a
logical hiccup – it simply doesn’t make sense. If we’re aiming to generate our own reliable xref table, we need to address
these inconsistencies head-on.
These contradictions pop up constantly in the real world, which is why the entity resolution community delves into a variety of techniques known as “clustering.” The most basic and commonly used method is called transitive closure:
import networkx as nx
def matches_to_cross_ref(matches: pd.MultiIndex) -> pd.DataFrame:
    graph = nx.from_edgelist(matches)
    sub_graphs = [graph.subgraph(c) for c in nx.connected_components(graph)]
    match_groups = []
    for i, sub_graph in enumerate(sub_graphs):
        match_groups.append(
            pd.DataFrame({'record_id': list(sub_graph.nodes)})
            .assign(**{
                'cluster_id': str(i),
                'match_density': nx.density(sub_graph)
            })
        )
    match_groups = pd.concat(match_groups, ignore_index=True)
    return match_groups.round(decimals=3)
df_test = df_test.assign(pred_label=y_test_pred)
matches = pd.concat([
    df_train.loc[df_train.label.eq(1.)],
    df_test.loc[df_test.pred_label.eq(1.)]
]).index
xref_pred = matches_to_cross_ref(matches)
In a nutshell: We trust those positive predictions wholeheartedly. If (x_0,x_1) and (x_1,x_2) are matches, then
(x_0,x_2) has to be too, even if our model says otherwise. We’re overriding those negative predictions in the name
of logical consistency.
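The chaining effect can be seen in isolation. Here is a minimal pure-Python sketch of transitive closure via union-find (the pipeline above uses networkx connected components; this is just a dependency-free illustration):

```python
def transitive_closure(edges):
    # Union-find over record ids: records connected by any
    # chain of predicted matches end up in the same cluster.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# (x0, x1) and (x1, x2) are predicted matches, (x0, x2) is not -
# transitive closure still puts all three records in one cluster:
transitive_closure([('x0', 'x1'), ('x1', 'x2')])   # [{'x0', 'x1', 'x2'}]
```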
# Compute performance after clustering:
df_final = df.reset_index().merge(xref_pred[['record_id', 'cluster_id']], on='record_id', how='left')
final_matches = cross_ref_to_matches(df_final)
n_true_positives = actual_matches.intersection(final_matches).shape[0]
n_false_positives = final_matches.difference(actual_matches).shape[0]
n_false_negatives = actual_matches.difference(final_matches).shape[0]
precision = n_true_positives / (n_true_positives + n_false_positives)
recall = n_true_positives / (n_true_positives + n_false_negatives)
f1 = 2 * precision * recall / (precision + recall)
Yikes! Our recall did get a slight bump from 92.9% to 93.9%, but at what cost? Precision took a nosedive, plummeting from 92.8% to a dismal 56.2%. This trade-off is definitely not worth it! Transitive closure might be simple, but in this case, it’s clearly not the answer. We need to explore more sophisticated clustering techniques to achieve a better balance between recall and precision.
Better Clustering - A Simple Heuristic
The following illustration shines a light on the notorious downside of transitive clustering, especially when you’re dealing with clusters bigger than just two records.
![Transitive closure chaining loosely connected records into one sprawling cluster]()
Alright, let’s keep things straightforward for now. We’ll ditch any matches that are hanging out in sparsely connected clusters. Think of it as weeding out the stragglers – if a cluster isn’t densely packed with links, it’s probably not a reliable source of matches. We’ll set our density threshold at 50%, meaning we’ll only keep clusters where at least half of the possible connections are actually present.
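For a cluster of n records, the density used here is the fraction of the n*(n-1)/2 possible links that were actually predicted as matches (this is what networkx's density returns for an undirected graph). A quick sketch:

```python
def cluster_density(n_records: int, n_matches: int) -> float:
    # Fraction of possible pairwise links actually present in the cluster:
    n_possible = n_records * (n_records - 1) / 2
    return n_matches / n_possible

cluster_density(4, 6)   # fully linked 4-record cluster -> 1.0
cluster_density(4, 3)   # a 4-record chain -> 0.5, right at the threshold
```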
# Drop clusters with poor density:
xref_pred = xref_pred.loc[xref_pred.match_density.ge(0.5)]
df_final = df.reset_index().merge(xref_pred[['record_id', 'cluster_id']], on='record_id', how='left')
final_matches = cross_ref_to_matches(df_final)
n_true_positives = actual_matches.intersection(final_matches).shape[0]
n_false_positives = final_matches.difference(actual_matches).shape[0]
n_false_negatives = actual_matches.difference(final_matches).shape[0]
precision = n_true_positives / (n_true_positives + n_false_positives)
recall = n_true_positives / (n_true_positives + n_false_negatives)
f1 = 2 * precision * recall / (precision + recall)
Crisis averted! We took a minor hit in recall, now at 88.3%, but it was a small price to pay for a squeaky clean, conflict-free dataset. Precision stayed rock-solid at 92.8%, giving us a final F1 score of 90.5%.
Wrap-up
Standard entity resolution tools are great, but they’re not magic. They might miss some subtle clues or make wrong connections if your data is a bit quirky. It’s like using a map of New York City to navigate Amsterdam – sure, you might find some landmarks, but you’ll definitely get lost along the way. Knowing the limits of your toolbox helps you spot those tricky situations and find smarter workarounds for truly accurate results.
Don’t miss out on future posts like this! Join my mailing list and stay updated on all things entity resolution.
Need Help with Entity Resolution?
If you’re struggling with messy data or just don’t have the time to tackle ER yourself, reach out to us. We specialize in helping businesses get their data in order, so they can focus on what they do best. Contact us today for a free consultation!