Hey there, data wranglers! If you’ve ever wrestled with messy, duplicated data, then you know entity resolution (ER) is your trusty sidekick. Buckle up, because today we’re diving deep into the world of sentence embeddings and how this nifty technique can supercharge your pairing game in entity resolution.

If pairing sounds like a new dance craze to you, pump the brakes and check out our beginner-friendly introduction to pairing first.

In this post, we’ll unravel the mysteries of lexical and semantic embedding techniques. You’ll get a clear picture of how they work their magic and, more importantly, how to blend their unique strengths for entity resolution that’s truly next-level. And to top it all off, we’ll harness the speed of FAISS to make pairing a breeze.

Faster, Smarter Pairing with Sentence Embeddings

Traditional pairing methods can be a headache. They require domain expertise, don’t work well on unstructured data, or don’t scale well. Embedding-based pairing swoops in with the following approach:

  1. Turn data into words: Combine key details from each record into a single sentence, creating a simple profile for easy comparison.
  2. Turn words into numbers: Use an embedding model to convert each record’s sentence into a vector of numbers. These can represent keywords or even capture the semantics of a sentence, allowing us to compare records.
  3. Find neighbors: Iterate over all vectors and pair each with its K Nearest Neighbors.

It’s refreshingly simple. Just pick which parts of your data to include in your sentences, choose an embedding model, and then decide how many neighbors (we call that ‘K’) you want to consider for each record. Boom, you’re ready to roll!

Pairing Five Million People Records

The database group of the University of Leipzig open-sourced the North Carolina Voters 5M dataset, among many others.

Let’s roll up our sleeves and dive into the pairing process we’ve outlined:

from pathlib import Path
import pandas as pd

# Load the data stored in a local directory '5Party-ocp20':
df = pd.concat([
    pd.read_csv(path, dtype='str') for path in Path('5Party-ocp20').glob('*.csv')
], ignore_index=True)
df.index.name = 'record_id'

# The data comes with the solution, which we put into a separate cross-reference table:
xref_true = df['recid'].rename('cluster_id').reset_index()
df = df.drop('recid', axis=1)

# Check how many clusters there are of each size:
xref_true.cluster_id.value_counts().value_counts().sort_index()

# Shows the first 5:
df.head()
record_id  givenname  surname   suburb         postcode
0          kadelyn    gragnani  waxhaw         28|73
1          ronel      carter    washington     2788g
2          jason      roehrig   millers ceeek  286s1
3          antrea     mure      greensboro     27410
4          caeda      oliver    jacksonville   28542

Our dataset paints a picture of messy reality: around 3 million duplicate-free people, 84k clusters where each person has one duplicate, 83k clusters with two duplicates per person, and so on. All in all, we’re looking at 5 million records for just 3.5 million actual people. That’s 12.5 trillion possible pairs!
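
In case you want to double-check that headline number, the back-of-the-envelope arithmetic is a one-liner:

n = 5_000_000
possible_pairs = n * (n - 1) // 2  # all unordered record pairs: roughly 1.25e13, i.e. 12.5 trillion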

# 1. Turn data into words:
sentences = (
    (
        # Sentence = join strings with a whitespace separator:
        df.givenname.fillna('') + ' ' + df.surname.fillna('') + ' ' + df.suburb.fillna('')
    )
    # Remove any redundant whitespace:
    .str.replace(r'\s\s+', ' ', regex=True)
)

Our sentences consist of simple string joins of key attributes with a whitespace separator. For example, the first record’s sentence is "kadelyn gragnani waxhaw". There’s a whole universe of options to explore - a topic we’ll dive deeper into in a future article. Join our newsletter and stay tuned! But for now, let’s keep things simple and focus on the basics.
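
If you want to sanity-check the result, peeking at the first sentence is enough (assuming your data loaded in the same order as the df.head() output above):

print(sentences.iloc[0])  # kadelyn gragnani waxhaw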

Time to take the next step and transform those sentences into numbers.

TF-IDF: The OG of Text Embeddings

TF-IDF has been around since the dinosaurs roamed the earth (well, before the internet anyway!), but don’t let its age fool you. This old-school technique still packs a punch when it comes to creating text embeddings that get the job done.

from sklearn.feature_extraction.text import TfidfVectorizer

# 2a. Turn words into numbers - TF-IDF model using character 2- and 3-grams:
tfidf_model = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), min_df=10**3)  # min_df drops very rare n-grams
tfidf_embeddings = tfidf_model.fit_transform(sentences.values).toarray().astype('float32')  # FAISS needs dense float32 arrays

This method counts character chunks (n-grams) in a sentence, then gives a boost to the less common ones. We control the breakdown of sentences into tokens with its hyperparameters. Here, we consider all character 2- and 3-grams.
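
To get a feel for what those tokens look like, you can ask the vectorizer for its analyzer and feed it a toy sentence (a quick, optional peek; the list below is just an illustration):

analyzer = tfidf_model.build_analyzer()
analyzer('bob smith')
# Roughly: ['bo', 'ob', 'b ', ' s', 'sm', 'mi', 'it', 'th', 'bob', 'ob ', 'b s', ' sm', 'smi', 'mit', 'ith']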

TF-IDF is all about keywords. It’s a lexical method, meaning it focuses on matching exact tokens. So, if two sentences share a lot of the same tokens, they’ll be considered similar. But if they use different words to express the same idea, TF-IDF might miss the connection. For example, “Robert” and “Bob” are the same person, but to TF-IDF, they’re complete strangers.

FastText: Subword Embeddings

FastText brings a fresh perspective to the table, going beyond simple keyword matching to capture the nuances of language. It’s like TF-IDF with a superpower: the ability to understand words it’s never even seen before.

import fasttext
import numpy as np

# 2b. Turn words into numbers - English FastText model:
fasttext_model = fasttext.load_model('cc.en.300.bin')  # Download a pre-trained model from fasttext.cc
fasttext_embeddings = np.array([fasttext_model.get_sentence_vector(val) for val in sentences.values])

FastText has a unique way of chopping up sentences into tokens, and it stores a special vector for each token in a massive file, here 'cc.en.300.bin'. This file is like a dictionary of word meanings, built by training a powerful deep learning model on a mountain of text data (think Wikipedia-sized!).

Thanks to this training, FastText can see connections between words that TF-IDF might miss. For example, it knows that “VW” and “Volkswagen” are close, giving them relatively similar vectors.

To create a sentence embedding, FastText simply takes the average of all the vectors for the words in that sentence. It’s a straightforward approach that can be surprisingly effective, especially when dealing with abbreviations, misspellings, or words that traditional methods might not recognize.
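
Here’s a rough do-it-yourself illustration of both ideas. Note that FastText’s own get_sentence_vector additionally normalizes each word vector before averaging, so the numbers won’t match its output exactly:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Subword information gives related tokens related vectors (the exact score depends on the model):
cosine(fasttext_model.get_word_vector('vw'), fasttext_model.get_word_vector('volkswagen'))

# A naive sentence embedding: average the word vectors of the sentence's tokens:
words = 'kadelyn gragnani waxhaw'.split()
diy_sentence_vector = np.mean([fasttext_model.get_word_vector(w) for w in words], axis=0)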

SBERT: Contextual Embeddings

SBERT, which stands for Sentence-BERT, is another powerful deep learning model for creating sentence embeddings. But here’s the kicker: unlike FastText, SBERT understands that words can have different meanings depending on the company they keep. It considers the whole sentence to figure out the true meaning of each word.

For example, the word “bank” takes on a very different vibe in the sentence “river bank” compared to “bank account.” SBERT gets this, while simpler methods might miss the nuance.

from sentence_transformers import SentenceTransformer

# 2c. Turn words into numbers - pre-trained sentence-transformer:
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')  # Downloads a pre-trained model from Huggingface
sbert_embeddings = sbert_model.encode(sentences.values, show_progress_bar=True)
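
To see the earlier “river bank” vs. “bank account” intuition in action, here’s a quick, optional sanity check (the exact similarity scores depend on the model):

from sentence_transformers import util

bank_embeddings = sbert_model.encode(['river bank', 'bank account', 'savings account'])
util.cos_sim(bank_embeddings, bank_embeddings)  # we'd expect 'bank account' closer to 'savings account' than to 'river bank'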

SBERT’s superpower comes at a cost: it’s not as speedy as FastText. You can’t just pre-calculate vectors for each token and store them in a lookup table. Instead, every sentence needs to take a ride through the full SBERT model, which requires more time or a hefty GPU to handle the load.

Turning those 5 million records into embeddings took a breezy 25 minutes on my MacBook Pro M3. Big shoutout to PyTorch’s MPS support for making this happen so quickly!

FAISS: Lightning-Fast Nearest Neighbor Search

Think of FAISS as a super-organized filing system for your data. It takes your embeddings and arranges them in a clever way that allows for lightning-fast searches. So, when you need to find the closest match for a new record, FAISS can pinpoint the most similar candidates in the blink of an eye.

We’ll explore how to use FAISS to accelerate our pairing process, ensuring we find the best matches without getting bogged down in endless comparisons.

import numpy as np
import pandas as pd
import faiss


def get_nn_pairs(embeddings: np.ndarray, record_ids: np.ndarray, k: int = 5,
                 index_factory: str = 'L2norm,Flat') -> pd.DataFrame:
    # Build the index; the L2norm prefix plus the inner-product metric gives us cosine similarity:
    index = faiss.index_factory(embeddings.shape[1], index_factory, faiss.METRIC_INNER_PRODUCT)
    if not index.is_trained:
        index.train(embeddings)

    index.add(embeddings)

    # For every record, retrieve its k nearest neighbors (the record itself is usually among them):
    nn_distances, nn_indices = index.search(embeddings, k=k)

    res = []
    for idx, row in enumerate(nn_indices):
        row = row[row >= 0]  # FAISS pads with -1 when it finds fewer than k neighbors
        res.append(np.array(np.meshgrid([record_ids[idx]], record_ids[row])).T.reshape(-1, 2))

    res = np.vstack(res)

    # Reduce the results to non-trivial pairs (drop each record's pairing with itself):
    res = pd.DataFrame(res[res[:, 0] != res[:, 1]], columns=['record_id_0', 'record_id_1'])

    # By convention, we order pairs so that first component > second component:
    res_c = res.copy(deep=True)
    must_flip = res_c.record_id_0 < res_c.record_id_1
    res.loc[must_flip, 'record_id_0'] = res_c.record_id_1
    res.loc[must_flip, 'record_id_1'] = res_c.record_id_0

    return res.drop_duplicates().reset_index(drop=True)

# 3a. Find neighbors:
tfidf_candidate_pairs = pd.MultiIndex.from_frame(
    get_nn_pairs(embeddings=tfidf_embeddings, record_ids=df.index.values, k=10, index_factory='L2norm,IVF1024,Flat')
)

# 3b. Find neighbors:
fasttext_candidate_pairs = pd.MultiIndex.from_frame(
    get_nn_pairs(embeddings=fasttext_embeddings, record_ids=df.index.values, k=10, index_factory='L2norm,IVF1024,Flat')
)

# 3c. Find neighbors:
sbert_candidate_pairs = pd.MultiIndex.from_frame(
    get_nn_pairs(embeddings=sbert_embeddings, record_ids=df.index.values, k=10, index_factory='L2norm,IVF1024,Flat')
)

Out of the box, our function does a full-blown KNN search using cosine distances — accurate, but potentially slow on large datasets. To keep things snappy, we’ve switched to a smarter, non-exhaustive search by tweaking the index_factory argument. For all the juicy details on this, head over to the FAISS docs and prepare to be amazed!
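
One knob worth knowing about: an IVF index only probes a single cluster per query by default, which trades recall for speed. If you want to claw some recall back, you could bump nprobe right after building and training the index inside get_nn_pairs (a hypothetical tweak, not part of the helper above):

# Inside get_nn_pairs, after faiss.index_factory(...) and index.train(...):
ivf = faiss.extract_index_ivf(index)  # unwrap the L2norm pre-transform to reach the IVF index
ivf.nprobe = 8                        # search 8 of the 1024 clusters instead of the default 1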

A Word of Warning for Sparse Embeddings

We used the same approximate nearest neighbor search for all our embeddings, but there’s a catch. FAISS, our trusty tool, is optimized for dense vectors, where most values are non-zero. TF-IDF, on the other hand, produces sparse embeddings with lots of zeros.

To get the best performance with sparse embeddings like TF-IDF, we need to be careful with how we configure FAISS. Tweaking the index_factory setting or exploring techniques like HNSW or product quantization can help. Alternatively, we could switch to a different tool specifically designed for sparse embeddings, like FALCONN or datasketch.
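
As a concrete example of the index_factory route, here is what swapping IVF for an HNSW graph index could look like with the same helper. Treat it as a sketch only - we haven’t benchmarked it on this dataset, and the right choice depends on your data and hardware:

hnsw_candidate_pairs = pd.MultiIndex.from_frame(
    get_nn_pairs(embeddings=tfidf_embeddings, record_ids=df.index.values, k=10, index_factory='L2norm,HNSW32,Flat')
)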

Intrigued by the world of sparse embeddings and their role in entity resolution? Stay tuned for a dedicated blog post where we’ll unravel their mysteries and show you how to harness their power.

Evaluating Recall

Recall is the proportion of true matches that our method successfully catches. In other words, it measures how good we are at not accidentally throwing away good matches. The higher the recall, the fewer matches slip through the cracks.

Let’s see how our contenders stack up!

from itertools import combinations

def cross_ref_to_matches(df: pd.DataFrame) -> pd.MultiIndex:
    # Collect the record ids of each cluster, sorted descending so that the resulting pairs
    # follow the same "first component > second component" convention as get_nn_pairs:
    match_lists = (df.sort_values('record_id', ascending=False)
                   .dropna(subset=['record_id'])
                   .groupby('cluster_id')['record_id'].apply(list))
    # Only clusters with at least two records contribute true matches:
    match_lists = match_lists.loc[match_lists.apply(len) > 1]

    # Expand every cluster into all of its record pairs:
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)

actual_matches = cross_ref_to_matches(xref_true)

# Recall:
tfidf_recall = actual_matches.isin(tfidf_candidate_pairs).mean()
fasttext_recall = actual_matches.isin(fasttext_candidate_pairs).mean()
sbert_recall = actual_matches.isin(sbert_candidate_pairs).mean()

# Count candidates:
tfidf_size = tfidf_candidate_pairs.shape[0]
fasttext_size = fasttext_candidate_pairs.shape[0]
sbert_size = sbert_candidate_pairs.shape[0]
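
To put the numbers side by side, a small summary frame does the trick:

pd.DataFrame({
    'recall': [tfidf_recall, fasttext_recall, sbert_recall],
    'candidate_pairs': [tfidf_size, fasttext_size, sbert_size],
}, index=['TF-IDF', 'FastText', 'SBERT'])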

Our recall evaluation shows TF-IDF leading the pack at 88.89%, followed by SBERT at 71.98%, and FastText trailing behind at 67.80%. All three methods produced a similar number of candidate pairs (around 34-35 million), which is significantly below the theoretical maximum of K*n = 50 million.

Contrastive Learning: Training SBERT to be a Pairing Champ

72% recall with SBERT is OK, but we can do better. Let’s crank it up a notch and show you how to squeeze even more performance out of this powerful tool.

Up until now, we’ve been using the all-MiniLM-L6-v2 model. It’s great for information retrieval tasks, but it’s never actually learned anything about entity resolution — even though those two disciplines have some things in common.

We’re about to change that, and it’s a two-step process.

  1. We craft a training dataset of triplets packed with challenging examples from a variety of other entity resolution benchmark datasets.
  2. We use the sentence-transformers API to fine-tune our model with a contrastive learning objective, feeding it our carefully curated training dataset.
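
To make step 2 a bit more tangible, here’s a minimal, hypothetical sketch of what such a contrastive fine-tuning run could look like with the sentence-transformers API. The triplets below are made up for illustration; the real training set and hyperparameters are a story for the upcoming post:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Each training example is a triplet: (anchor, positive = same entity, negative = a hard look-alike):
train_examples = [
    InputExample(texts=['robert smith durham', 'bob smith durham', 'robert smyth raleigh']),
    InputExample(texts=['kadelyn gragnani waxhaw', 'kadelyn gragnani waxhew', 'katelyn gagnon waxhaw']),
    # ... many more, mined from other entity resolution benchmark datasets
]

model = SentenceTransformer('all-MiniLM-L6-v2')
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model)  # pulls anchor and positive together, pushes the negative away

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save('fine-tuned-all-MiniLM-L6-v2')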

Soon we will have a whole other post that dives deep into this process, so we’ll skip the nitty-gritty details here. For now, we just load the fine-tuned model and evaluate it on our 5 million people records:

fine_tuned_model = SentenceTransformer('fine-tuned-all-MiniLM-L6-v2')  # Load fine-tuned model from disk
fine_tuned_embeddings = fine_tuned_model.encode(sentences.values, show_progress_bar=True)

fine_tuned_candidate_pairs = pd.MultiIndex.from_frame(
    get_nn_pairs(embeddings=fine_tuned_embeddings, record_ids=df.index.values, k=10, index_factory='L2norm,IVF1024,Flat')
)

actual_matches.isin(fine_tuned_candidate_pairs).mean()

Fine-tuning definitely gave SBERT a boost, jumping from 71.98% to a respectable 75.19%.

Hold on! If TF-IDF is already winning, why even bother with fancy semantic stuff like FastText and SBERT?

TF-IDF is a workhorse, handling most of our matching needs with ease. But it does stumble on those trickier cases where understanding the meaning of words is key. The good news? We don’t have to pick just one method! We can combine the strengths of TF-IDF with the semantic smarts of other approaches. Here is how:

  1. Generate embeddings for each record using as many embedding models as you like.
  2. Run KNN search independently for each method.
  3. Union the candidate sets.

By blending the three embedding approaches from the previous sections, we can create a pairing strategy that leverages the best of each world. Let’s prove that it indeed pushes our recall boundaries to the next level:

combined_candidate_pairs = tfidf_candidate_pairs.union(fine_tuned_candidate_pairs).union(fasttext_candidate_pairs)

# Recall:
actual_matches.isin(combined_candidate_pairs).mean()

We’re hitting an impressive 92.09% recall with a mere 92 million candidate pairs. That’s still less than 0.001% of the astronomical number we’d be dealing with if we compared every possible pair.

A Note on Near-linear Scaling

The pairing approach combining embeddings with KNN search should give us a maximum of K*n candidate pairs on a dataset of size n. That’s linear scaling, which is music to our compute budget’s ears.

But let’s be real here: that formula is a bit of a fairy tale. In the trenches, I’ve seen time and again that K isn’t just a constant – a reasonable choice changes with n. No fancy proof to back it up, but my gut says it’s more like K=c*log(n).

Picture this: the bigger your dataset, the more likely it is for a non-match to sneak in between two real matches. It’s like trying to find your friend in a crowded concert – the more people there are, the higher the odds of someone else blocking your view.

Not perfectly linear anymore, but hey, still way better than the combinatorial explosion we’d face without pairing or with standard blocking. We’ll take it!

Outlook: Sentence Embeddings for Matching

FAISS’s speedy KNN search makes sentence embeddings a practical choice for pairing. But there are other, more computationally intensive methods for comparing sentence embeddings that can boost accuracy. These heavy-duty techniques shine in the matching stage, where precision is paramount.

Picture a simple two-step pipeline - pairing, then matching - to decide whether a pair of records are duplicates. Sentence embeddings can play a role in both steps.

Want the inside scoop on how to leverage sentence embeddings for matching? We’ve got a deep dive in the works, where we’ll reveal how BERT can revolutionize your matching game. Don’t miss out – subscribe to our blog and stay tuned!

Wrapping Up

Sentence embeddings paired with KNN search give you a simple, effective, and scalable way to tackle pairing. While the classic TF-IDF still holds its own against the newer deep learning methods, the real magic happens when you combine lexical and semantic search. This gives you the best of both worlds—the efficiency of TF-IDF and the nuanced understanding of SBERT. And with the ability to fine-tune SBERT, you can create a model that truly excels at entity resolution.

Want to Dive Deeper? Check Out These Resources

We’re here to help! Contact us today for a free consultation and let’s discuss your specific entity resolution challenges.

Join my mailing list and stay tuned for upcoming posts.