Entity resolution (ER) - sounds complicated, right? Well, it doesn’t have to be! It’s basically just figuring out which pieces of data in your messy datasets actually refer to the same real-world thing. Think of it as connecting the dots between different bits of information about the same person, place, or concept.

Why bother with ER? Because having clean, linked data is crucial for tons of stuff like:

  • Customer 360: Get a complete picture of your customers, so you can personalize their experiences
  • Fraud Detection: Spot suspicious activity by linking related transactions
  • Supply Chain Resilience: Accurately monitor supply and demand of your products, from manufacturing to sales

Outlook - How to Deduplicate in Python

There are some fantastic open-source tools that can help you wrangle those duplicates and get your data back in shape. Let’s dive in and explore a few of them using an open dataset of messy music records and the Python ecosystem.

This article covers:

  • What it means to create an end-to-end deduplication pipeline using pandas, recordlinkage, and networkx.
  • How to fail twice when relying on just the most popular techniques.
  • How to expand your toolbox beyond the standard using faiss, fasttext, and some simple heuristics.

Example: Messy Music Records

The database group of the University of Leipzig open-sourced the Music Brainz 20k dataset, among many others.

import pandas as pd

file_path = 'musicbrainz-20-A01.csv'

# Read the duplicates ground truth - two records with the same cluster ID are duplicates:
xref = pd.read_csv(file_path, dtype='str').filter(['TID', 'CID'])
xref.columns = ['record_id', 'cluster_id']

# Read the 20k records:
df = (pd.read_csv(file_path, dtype='str')
      .filter(['TID', 'number', 'title', 'length', 'artist', 'album', 'year', 'language'])
      .rename(columns={'TID': 'record_id'}).set_index('record_id'))

# Print five example rows:
df.head(10).tail(5)
record_id | number | title | length | artist | album | year | language
66 | | Try (acoustic) - 2008-02-15: Le Grand Rex, Paris, France | | Neil Young | | | English
71 | 1 | Bruce Maginnis - Sttreet Hype | 177 | | Groove City | 05 | English
89 | | Luce Dufault - Ballade à donner | 242 | Luce Dufault | | 96 | French
99 | | Just Like Tom Thumb's Blues (live) | 462000 | Wendy Saddington | Blues Women Anthology, Volume 7 | 2007 | English
104 | | Στα καμένα | 195000 | Λαυρέντης Μαχαιρίίτσας | Συλλογή Δίφωνο, 22: Μουσικοί βιότοποι | 1997 | Greek

The dataset df covers 20k records in different languages and exhibits a ton of data quality issues. Just have a look at cluster 8320, with its missing values, inconsistent formatting of individual attributes, and misplaced values:

record_id | number | title | length | artist | album | year | language
6310 | 8 | DJ Rashad feat. Spinn - Double Cup | 249 | | Double Cup | 13 | English
7690 | 8 | Double Cup | 4.15 | DJ Rashad feat. Spinn | Double Cup | '13 | English
11167 | 8 | 008-Double Cup | 4m 9sec | DJ Rashad feat. Spinn | Double Cup (2013) | | Eng.
16154 | 008 | Double Cup | 04:09 | DJ Rashad feta. Spinn | Double Cup | 2013 |

Table xref encodes the ground truth of this benchmark dataset. That’s something we don’t have in practice; it’s precisely the desired outcome of our entity resolution pipeline. Every two records with the same cluster_id are duplicates:

record_id | cluster_id
6310 | 8320
7690 | 8320
11167 | 8320
16154 | 8320

Key Steps in End-to-End ER

Let’s break down the key steps to get you from chaos to clarity:

  1. Preprocessing: Clean up your data! This involves things like fixing inconsistencies, standardizing formats, and handling missing values.
  2. Pairing: Narrow down the number of comparisons you need to make. Group similar records together based on certain attributes (e.g., names, locations).
  3. Matching: Compare records within each block using similarity measures. Think of this as figuring out how likely it is that two records refer to the same entity.
  4. Clustering: Group together records that are likely to represent the same entity, leveraging collective evidence and resolving potential match conflicts.
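Before we dive into the real dataset, the four steps can be sketched end-to-end on toy data. The implementations below are deliberately naive placeholders, just to show the shape of the pipeline:

```python
import itertools

# Toy records: record_id -> title
records = {
    1: "Double Cup",
    2: "double cup ",
    3: "Street Hype",
}

# 1. Preprocessing: normalize case and whitespace.
clean = {rid: title.strip().lower() for rid, title in records.items()}

# 2. Pairing: only compare records sharing the same first character.
pairs = [(a, b) for a, b in itertools.combinations(clean, 2)
         if clean[a][0] == clean[b][0]]

# 3. Matching: naive exact comparison of the cleaned titles.
matches = [(a, b) for a, b in pairs if clean[a] == clean[b]]

# 4. Clustering: connected components via a tiny union-find.
parent = {rid: rid for rid in clean}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in matches:
    parent[find(a)] = find(b)

clusters = {rid: find(rid) for rid in clean}
```

Records 1 and 2 land in the same cluster, record 3 stays on its own. Each of the four steps gets a far more capable implementation below.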

Preprocessing

Alright, let’s keep it lean and mean – just four attributes for now, the ones that’ll really matter when we get down to the nitty-gritty. We’re gonna:

  • Cast numbers to integers.
  • Impute missing artist values using a simple heuristic.
  • Translate Unicode into ASCII, such as “Στα καμένα” into “Sta kamena”.
  • Leverage RecordLinkage’s clean to remove special characters and any redundant white space.
  • Remove “unknown” and similar placeholders for a missing value.

import numpy as np
import recordlinkage.preprocessing as rlp
from unidecode import unidecode

# Cast numbers
df.number = df.number.str.replace('[^0-9]', '', regex=True)
df.loc[df.number.eq(''), 'number'] = np.nan
df.number = df.number.astype('Int64')

# Impute artist with a simple heuristic:
df.loc[df.artist.isnull(), 'artist'] = df.title.str.split('-').str[0]
df.loc[df.artist.isnull(), 'artist'] = df.title

unknowns = ['', 'unknown', 'untitled', 'na', 'unk']

for col in ('title', 'artist', 'album'):
    # Unicode -> ASCII:
    df[col] = rlp.clean(df[col].fillna('').apply(unidecode))

    # Replace "unknown" placeholders with a formal missing value:
    df.loc[df[col].isin(unknowns), col] = np.nan

Pairing - Standard Blocking

Comparing every pair of records? Nightmare! Think of blocking like sorting your laundry before washing - grouping similar items (records) together based on key attributes (like color or fabric type) to make the whole process more efficient. In entity resolution, blocking does just that, streamlining the matching process by narrowing down the potential candidates.

Standard Blocking is easily the most popular pairing technique in entity resolution, and it’s built right into the RecordLinkage package for your convenience.

# Initiate Pairing with RecordLinkage:
import recordlinkage as rl

indexer = rl.Index()

# We add two records to our index if
# they share the same `number` value:
indexer.block('number')

# or the first character of `artist` is the same:
df['same_first_char_artist'] = df['artist'].str.slice(stop=1)
indexer.block('same_first_char_artist')

# Run Blocking logic on our data:
candidate_pairs = indexer.index(df)

The candidate_pairs we get is a massive list of record_id duos, flagged as possible matches. To gauge the effectiveness of our pairing, we’ll zoom in on two crucial metrics.

  1. We’re down to 19.7 million potential matches from a staggering 187.7 million. That’s a 90% reduction in our workload.
  2. How many of the real matches are we actually catching in our net? We call this metric “Recall”, and we’re lucky to have the xref ground truth data to calculate it:
from itertools import combinations

def cross_ref_to_matches(df: pd.DataFrame) -> pd.DataFrame:
    match_lists = (df.sort_values('record_id', ascending=False)
                   .dropna(subset=['record_id'])
                   .groupby('cluster_id')['record_id'].apply(lambda s: list(s)))
    match_lists = match_lists.loc[match_lists.apply(lambda s: len(s)) > 1]

    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)

actual_matches = cross_ref_to_matches(xref)

# Compute recall after pairing:
actual_matches.isin(candidate_pairs).mean()

In this case, we’re looking at a 49% Recall rate, which means we’re snagging about half of the true matches out there.
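The recall computation above relies on a single pandas idiom: MultiIndex.isin flags each ground-truth pair that survived pairing, and the mean of those booleans is the recall. A toy version with made-up record IDs:

```python
import pandas as pd

# Three true matches in the ground truth (first id > second id by convention):
actual = pd.MultiIndex.from_tuples([(5, 2), (7, 3), (9, 4)])

# Our pairing only proposed two of them, plus one false candidate:
candidates = pd.MultiIndex.from_tuples([(5, 2), (9, 4), (8, 1)])

# Fraction of true matches caught by the candidate pairs:
recall = actual.isin(candidates).mean()  # 2 of 3 -> 0.667
```

Note that recall is computed over the ground truth, so the false candidate (8, 1) doesn’t hurt it; that’s what precision is for, later.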

Time to think outside the RecordLinkage box! We can boost recall with a less conventional technique.

Better Pairing with FastText and FAISS

This technique was used in information retrieval long before it found its way into entity resolution. First, we represent every record as an element of a numeric vector space equipped with a metric, say, cosine distance.
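Before bringing in fastText, it’s worth seeing the metric itself in isolation. A minimal cosine similarity on plain numpy vectors (the three-dimensional "embeddings" here are made up for illustration):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors scaled by their lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings: two near-duplicate titles and one unrelated title.
double_cup_a = np.array([0.9, 0.1, 0.0])
double_cup_b = np.array([0.8, 0.2, 0.1])
street_hype = np.array([0.0, 0.1, 0.9])

sim_dup = cosine_similarity(double_cup_a, double_cup_b)   # close to 1
sim_diff = cosine_similarity(double_cup_a, street_hype)   # close to 0
```

This is also why the FAISS index below combines 'L2norm,Flat' with METRIC_INNER_PRODUCT: the inner product of L2-normalized vectors is exactly their cosine similarity.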

import fasttext

# Assuming we have downloaded a fastText model to the working directory:
ft_model = fasttext.load_model('cc.en.300.bin')

# Concatenate key attributes into a single "sentence"
sentence_representations = (df[['title', 'artist', 'album']]
                            .apply(lambda s: ' '.join(s.dropna()), axis=1)
                            .str.replace(r'\s\s+', ' ', regex=True).values)

# Compute a vector representation for every sentence using a FastText model:
embeddings = np.array([ft_model.get_sentence_vector(val) for val in sentence_representations])

Second, for every input record x_0, retrieve its k nearest neighbors x_1,...,x_k in that vector space and add (x_0,x_1),...,(x_0,x_k) to the candidate matches:

import numpy as np
import pandas as pd
import faiss

def get_nn_pairs(embeddings: np.ndarray, record_ids: np.ndarray, k: int = 5,
                 index_factory: str = 'L2norm,Flat') -> pd.DataFrame:
    index = faiss.index_factory(embeddings.shape[1], index_factory, faiss.METRIC_INNER_PRODUCT)
    
    if not index.is_trained:
        # Some factories need to be fitted to data:
        index.train(embeddings)

    index.add(embeddings)

    # FAISS computes the k nearest neighbors for every input embedding:
    nn_distances, nn_indices = index.search(embeddings, k=k)

    # Reshape results into pairs of neighbors:
    res = []
    for idx, row in enumerate(nn_indices):
        res.append(np.array(np.meshgrid([record_ids[idx]], record_ids[row])).T.reshape(-1, 2))

    res = np.vstack(res)

    # Reduce the results to non-trivial pairs (drop self-matches):
    res = pd.DataFrame(res[res[:, 0] != res[:, 1]], columns=['record_id_0', 'record_id_1'])

    # We order pairs so that first component > second component by RecordLinkage convention
    res_c = res.copy(deep=True)
    must_flip = res_c.record_id_0 < res_c.record_id_1
    res.loc[must_flip, 'record_id_0'] = res_c.record_id_1
    res.loc[must_flip, 'record_id_1'] = res_c.record_id_0

    return res.drop_duplicates().reset_index(drop=True)

candidate_pairs_knn = pd.MultiIndex.from_frame(get_nn_pairs(embeddings=embeddings, record_ids=df.index.values, k=5))

# Let's revisit our two key metrics for pairing:
n_all_pairs = len(df) * (len(df) - 1) // 2  # all ~187.7M possible record pairs
shrinkage_percent = candidate_pairs_knn.shape[0] / n_all_pairs * 100
recall = actual_matches.isin(candidate_pairs_knn).mean()

  1. Boom! We’ve slashed those potential matches down to a mere 59k. From a mountain of 187.7 million, we’re now looking at a minuscule 0.03%. That’s the power of smart pairing in action.
  2. Even better: our Recall rate catapulted to 97.4%. We’re now catching nearly all of the true matches.

We unravel the secrets of this powerful technique in a series of blog posts, starting with this one.

Matching

Normally, this is where things get computationally heavy. But not today! With a lean 59k pairs in our net, we’re in the clear. Let’s turn our attention to similarity scoring. We’ll focus on those three key attributes and flex our muscles with three different string metrics for each.
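To build some intuition for what these scores measure, here is a hand-rolled bigram variant of the q-gram metric: a Dice coefficient over character 2-grams (RecordLinkage’s exact normalization may differ):

```python
def bigrams(s: str) -> list:
    # All overlapping character 2-grams of a string.
    return [s[i:i + 2] for i in range(len(s) - 1)]

def qgram_similarity(a: str, b: str) -> float:
    # Dice coefficient: 2 * |shared bigrams| / (|bigrams a| + |bigrams b|)
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0
    shared, remaining = 0, list(grams_b)
    for g in grams_a:
        if g in remaining:
            remaining.remove(g)
            shared += 1
    return 2 * shared / (len(grams_a) + len(grams_b))

score_typo = qgram_similarity("dj rashad", "dj rashda")   # 0.75
score_diff = qgram_similarity("dj rashad", "luce dufault")  # 0.0
```

Note how a single transposition near the end of a string still leaves most bigrams shared. That is exactly the kind of typo ("feta." vs. "feat.") these metrics are built to tolerate.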

# Initiate similarity scoring with RecordLinkage:
comparer = rl.Compare(n_jobs=-1)

# There are a few more methods for numeric attributes but the interesting stuff is for strings:
comparer.string(left_on='title', right_on='title', method='jarowinkler', label='title_jw')
comparer.string(left_on='title', right_on='title', method='qgram', label='title_qgram')
comparer.string(left_on='title', right_on='title', method='lcs', label='title_lcs')
comparer.string(left_on='artist', right_on='artist', method='jarowinkler', label='artist_jw')
comparer.string(left_on='artist', right_on='artist', method='lcs', label='artist_lcs')
comparer.string(left_on='artist', right_on='artist', method='qgram', label='artist_qgram')
comparer.string(left_on='album', right_on='album', method='jarowinkler', label='album_jw')
comparer.string(left_on='album', right_on='album', method='lcs', label='album_lcs')
comparer.string(left_on='album', right_on='album', method='qgram', label='album_qgram')

scores = comparer.compute(candidate_pairs_knn, df)

So, for every potential match, we’ve got a set of nine features, each giving us a similarity score between 0 and 1. Think of it as a multi-dimensional picture of how closely those records resemble each other. Here are five examples:

record_id_0 | record_id_1 | title_jw | title_qgram | title_lcs | artist_jw | artist_lcs | artist_qgram | album_jw | album_lcs | album_qgram
15184 | 1 | 0.729 | 0.646 | 0.525 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0
11221 | 1 | 0.588 | 0.331 | 0.222 | 0.661 | 0.133 | 0.176 | 0.535 | 0.25 | 0.147
3979 | 1 | 0.638 | 0.444 | 0.405 | 0.849 | 0.364 | 0.389 | 0.661 | 0.411 | 0.268
16801 | 1 | 0.451 | 0.071 | 0.048 | 0.542 | 0.0 | 0.059 | 0.525 | 0.382 | 0.294
2 | 16796 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0

Imagine we’re manually reviewing a random sample of 1000 pairs. We can mimic this process using the xref ground truth data and some handy random sampling techniques in pandas.

import numpy as np

# Adding the ground truth as the `label` column derived from xref:
scores['label'] = np.where(scores.index.isin(actual_matches), 1., 0.)

# Simulating manual labeling of 1000 pairs:
df_train = scores.sample(1000, random_state=1)
df_test = scores.loc[~scores.index.isin(df_train.index)]

Now, let’s train a binary classifier on our data. We’ll use the catboost package, known for its resistance to overfitting. It also lets us add monotonicity constraints, which align perfectly with the nature of our similarity scores.

from catboost import CatBoostClassifier

# Catboost can incorporate monotonicity constraints
# "1" means non-decreasing in a feature - we set non-decreasing for all five:
model = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1, 1, 1, 1, 1, 1, 1])
model.fit(X=df_train.iloc[:, :-1], y=df_train['label'], verbose=False)

Let’s put our matcher to the test and see how it performs on a fresh set of data:

from sklearn.metrics import precision_score, recall_score

# Can be tuned using the training performance - keeping it on 50% for simplicity:
decision_threshold = 0.5

y_test_pred_proba = model.predict_proba(df_test.iloc[:, :-1])[:, 1]
y_test_pred = np.where(y_test_pred_proba > decision_threshold, 1., 0.)

# Keep the hard predictions around - the clustering step below needs them:
df_test = df_test.assign(pred_label=y_test_pred)

precision = precision_score(df_test['label'], y_test_pred)
recall = recall_score(df_test['label'], y_test_pred)
f1 = 2 * precision * recall / (precision + recall)

Drumroll, please! The results are in, and they’re impressive: We’ve hit a precision of 92.8%, recall of 92.9%, and an F1 score of 92.9%. That means our model is not only accurate in identifying matches but also manages to capture a vast majority of them.

Clustering - Transitive Closure

Hold on a second! Let’s say our process flags (x_0,x_1) and (x_1,x_2) as duplicates, but (x_0,x_2) isn’t. That’s a logical hiccup – it simply doesn’t make sense. If we’re aiming to generate our own reliable xref table, we need to address these inconsistencies head-on.

These contradictions pop up constantly in the real world, which is why the entity resolution community delves into a variety of techniques known as “clustering.” The most basic and commonly used method is called transitive closure:

import networkx as nx

def matches_to_cross_ref(matches: pd.MultiIndex) -> pd.DataFrame:
    graph = nx.from_edgelist(matches)
    sub_graphs = [graph.subgraph(c) for c in nx.connected_components(graph)]
    match_groups = []
    for i, sub_graph in enumerate(sub_graphs):
        match_groups.append(
            pd.DataFrame({'record_id': sub_graph.nodes})
            .assign(**{
                'cluster_id': str(i),
                'match_density': nx.density(sub_graph)
            })
        )

    match_groups = pd.concat(match_groups, ignore_index=True)

    return match_groups.round(decimals=3)

matches = pd.concat([
    df_train.loc[df_train.label.eq(1.)],
    df_test.loc[df_test.pred_label.eq(1.)]
]).index

xref_pred = matches_to_cross_ref(matches)

In a nutshell: We trust those positive predictions wholeheartedly. If (x_0,x_1) and (x_1,x_2) are matches, then (x_0,x_2) has to be too, even if our model says otherwise. We’re overriding those negative predictions in the name of logical consistency.
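Stripped of pandas and networkx, transitive closure is nothing more than connected components over the match graph. A minimal sketch of the override in action, with hypothetical record IDs:

```python
from collections import defaultdict, deque

# The classifier flagged (0, 1) and (1, 2), but stayed silent on (0, 2):
matches = [(0, 1), (1, 2)]

adjacency = defaultdict(set)
for a, b in matches:
    adjacency[a].add(b)
    adjacency[b].add(a)

# BFS over the match graph: every record reachable from another ends up
# in the same cluster, so (0, 2) becomes an implied match.
clusters, seen = [], set()
for node in adjacency:
    if node in seen:
        continue
    component, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        component.add(current)
        queue.extend(adjacency[current] - seen)
    clusters.append(component)
```

Records 0 and 2 were never flagged by the classifier, yet they land in the same cluster purely because both connect to record 1.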

# Compute performance after clustering:
df_final = df.reset_index().merge(xref_pred[['record_id', 'cluster_id']], on='record_id', how='left')
final_matches = cross_ref_to_matches(df_final)

n_true_positives = actual_matches.intersection(final_matches).shape[0]
n_false_positives = final_matches.difference(actual_matches).shape[0]
n_false_negatives = actual_matches.difference(final_matches).shape[0]

precision = n_true_positives / (n_true_positives + n_false_positives)
recall = n_true_positives / (n_true_positives + n_false_negatives)
f1 = 2 * precision * recall / (precision + recall)

Yikes! Our recall did get a slight bump from 92.9% to 93.9%, but at what cost? Precision took a nosedive, plummeting from 92.8% to a dismal 56.2%. This trade-off is definitely not worth it! Transitive closure might be simple, but in this case it’s clearly not the answer. We need a clustering technique that restores the balance between recall and precision.

Better Clustering - A Simple Heuristic

The following illustration shines a light on the notorious downside of transitive clustering, especially when you’re dealing with clusters bigger than just two records.

Two clusters of records are linked through a false match between x and y. Transitive clustering would add many more false matches, which degrades precision.

Alright, let’s keep things straightforward for now. We’ll ditch any matches that are hanging out in sparsely connected clusters. Think of it as weeding out the stragglers – if a cluster isn’t densely packed with links, it’s probably not a reliable source of matches. We’ll set our density threshold at 50%, meaning we’ll only keep clusters where at least half of the possible connections are actually present.
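The match_density computed earlier is just graph density: the matches the model actually produced, divided by the n·(n−1)/2 matches a fully connected cluster would have. A quick sanity check of the 50% threshold (the edge counts are illustrative):

```python
def cluster_density(n_records: int, n_matches: int) -> float:
    # Graph density: observed matches over all possible pairs in the cluster.
    n_possible = n_records * (n_records - 1) / 2
    return n_matches / n_possible

# A 5-record cluster held together only by a chain of 4 matches is sparse -
# two genuine sub-clusters glued together by one false match look like this:
sparse = cluster_density(5, 4)   # 4 / 10 = 0.4 -> dropped at the 0.5 threshold

# A 5-record cluster where 9 of the 10 possible pairs matched is dense:
dense = cluster_density(5, 9)    # 9 / 10 = 0.9 -> kept
```

networkx computes exactly this quantity via nx.density for undirected graphs.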

# Drop clusters with poor density:
xref_pred = xref_pred.loc[xref_pred.match_density.ge(0.5)]

df_final = df.reset_index().merge(xref_pred[['record_id', 'cluster_id']], on='record_id', how='left')
final_matches = cross_ref_to_matches(df_final)

n_true_positives = actual_matches.intersection(final_matches).shape[0]
n_false_positives = final_matches.difference(actual_matches).shape[0]
n_false_negatives = actual_matches.difference(final_matches).shape[0]

precision = n_true_positives / (n_true_positives + n_false_positives)
recall = n_true_positives / (n_true_positives + n_false_negatives)
f1 = 2 * precision * recall / (precision + recall)

Crisis averted! We took a minor hit in recall, now at 88.3%, but that’s a small price to pay for a squeaky clean, conflict-free dataset. Precision is back to a rock-solid 92.8%, giving us a final F1 score of 90.5%.

Wrap-up

Standard entity resolution tools are great, but they’re not magic. They might miss some subtle clues or make wrong connections if your data is a bit quirky. It’s like using a map of New York City to navigate Amsterdam – sure, you might find some landmarks, but you’ll definitely get lost along the way. Knowing the limits of your toolbox helps you spot those tricky situations and find smarter workarounds for truly accurate results.

Don’t miss out on future posts like this! Join my mailing list and stay updated on all things entity resolution.

Need Help with Entity Resolution?

If you’re struggling with messy data or just don’t have the time to tackle ER yourself, reach out to us. We specialize in helping businesses get their data in order, so they can focus on what they do best. Contact us today for a free consultation!