BERT, or Bidirectional Encoder Representations from Transformers, quickly proved its worth and took the entity resolution community by storm.

If BERT is new to you, picture it as ChatGPT’s older, brainy cousin. BERT is the translator, converting text into numbers that computers can work with. ChatGPT, meanwhile, is the storyteller, turning representations like those back into fresh text – that’s the magic behind generative AI.

What’s the Big Deal with BERT?

The original base version of BERT is a deep learning model with roughly 110 million parameters. It was pre-trained on a massive corpus of unlabeled text using two clever techniques:

  1. Masked Language Modeling (MLM): Imagine BERT is reading a sentence, but some words are randomly hidden. MLM challenges BERT to predict those missing words based on the surrounding context (a quick demo follows this list). This forces it to understand the relationships between words and grasp the overall meaning of the sentence.
  2. Next Sentence Prediction (NSP): BERT is given two sentences and has to figure out if the second sentence logically follows the first. This helps BERT understand broader discourse and relationships between sentences.
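
To get a feel for MLM, here is a minimal sketch using Hugging Face’s transformers library (a tooling assumption on my part; nothing in BERT itself prescribes it): we hide one word and let the pre-trained model guess it from context.

```python
from transformers import pipeline

# Load the original pre-trained BERT (base, uncased) for masked-word prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT guesses the hidden word purely from the surrounding context.
for prediction in fill_mask("The package was delivered to her [MASK] address."):
    print(prediction["token_str"], round(prediction["score"], 3))
```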

Through these pre-training tasks, BERT essentially learned the structure and nuances of language, making it a powerful tool for all sorts of natural language processing tasks - including entity resolution!

BERT’s Key Advances in Entity Resolution

  • Semantic Similarity: Traditional methods look for exact or near-exact matches. BERT grasps the underlying meaning, even when the wording differs (the snippet after this list gives a feel for this).
  • Contextual Representation: BERT doesn’t just look at words or fields in isolation. It understands each piece of a record within the context of the entire record. Even more impressive, BERT can represent one record in relation to another, making it ideal for comparing pairs of records to see if they match.
  • Transfer Learning: BERT’s layers are like building blocks: some form the foundation of language understanding, others can be customized for specific tasks. Fine-tune BERT on company deduplication and repurpose it for deduplicating people records with impressive results – that’s the power of its adaptable design.
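
To make the semantic-similarity point concrete, here is a minimal sketch using an off-the-shelf bert-base-uncased with simple mean pooling (both choices are my assumptions; a fine-tuned model gives much crisper scores, and raw similarities from an un-fine-tuned BERT are noisy): two differently worded records for the same person should land closer together than an unrelated record.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool BERT's token vectors into a single sentence embedding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return token_vectors.mean(dim=1).squeeze(0)

record_a = embed("Robert Smith, 12 Main Street, New York City")
record_b = embed("Bob Smith, 12 Main St, NYC")
record_c = embed("Acme Corporation, 99 Industrial Road, Houston")

cosine = torch.nn.CosineSimilarity(dim=0)
print("A vs B:", cosine(record_a, record_b).item())  # same person, different wording
print("A vs C:", cosine(record_a, record_c).item())  # unrelated record; expect a lower score
```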

Alright, let’s roll up our sleeves and see how BERT can be put to work in an entity resolution pipeline.

The Bi-Encoder Approach

Let’s say we need to figure out whether two customer records, call them record A and record B, actually belong to the same person.

  1. We need to turn both records into simple text s_i=f(r_i) (“sentences”), since text is what BERT works with. Think of it like turning all the different pieces of information in each record (like names, addresses, maybe even birthdates) into words, and then sticking all those words together into one big sentence.
  2. We use BERT to create “embeddings” v_i=avg(BERT(s_i)) - unique fingerprints for each sentence. BERT breaks each sentence into tokens, assigning each a vector. We then average these vectors to create a single representation of the whole sentence.
  3. We plug these two embeddings into a classifier. This could be something simple, like a trained logistic regression model, which takes the difference x=abs(v_A-v_B) as its input and tells us whether the records are a match or not (see the sketch right after this list).
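
Here is a minimal sketch of those three steps, assuming Hugging Face’s transformers and scikit-learn. The helper names (serialize, embed, train_matcher) and the labeled_pairs variable are purely illustrative, and a frozen off-the-shelf BERT is used for brevity where in practice you would fine-tune it.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def serialize(record: dict) -> str:
    """Step 1: s_i = f(r_i) - flatten a record's fields into one 'sentence'."""
    return " ".join(f"{key}: {value}" for key, value in record.items())

def embed(sentence: str) -> np.ndarray:
    """Step 2: v_i = avg(BERT(s_i)) - average BERT's token vectors."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_vectors = bert(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return token_vectors.mean(dim=1).squeeze(0).numpy()

def train_matcher(labeled_pairs):
    """Step 3: fit a classifier on x = abs(v_A - v_B).

    `labeled_pairs` is a hypothetical list of (record_a, record_b, is_match) triples.
    """
    X = [np.abs(embed(serialize(a)) - embed(serialize(b))) for a, b, _ in labeled_pairs]
    y = [is_match for _, _, is_match in labeled_pairs]
    return LogisticRegression(max_iter=1000).fit(X, y)
```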

An illustration of the bi-encoder approach to predict whether two records are duplicates.

The real number crunching happens when we use BERT to create those embeddings. The beauty of this approach is that we can recycle the embedding for record A when we compare it to a different record, say, record C. This keeps the computational cost in check – the expensive encoding step scales with the number of records rather than the number of pairs to compare – a big win!

The Cross-Encoder Approach

Let’s stick with the previous example of two customer records, now doing the following:

  1. We combine our two records into one sentence s_ij=g(r_i,r_j) for BERT. We’ll use a special separator “[SEP]” so BERT knows there are two distinct pieces of information. Basically, we take the text versions f(r_i) and f(r_j) from the bi-encoder approach and join them with this separator in the middle.
  2. Once again, we ask BERT to create a representation x_ij=BERT(s_ij), but now it’s for the whole combined sentence. This time, we grab the special “[CLS]” token’s embedding – it’s like a super-concentrated summary of the entire sentence, ideal for our task.
  3. Finally, we feed this concentrated representation x_ij directly into our classifier – perhaps another trusty logistic regression model – which gives us the final verdict: match or no match (a sketch follows right after this list).
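
A minimal sketch of the cross-encoder representation, again assuming transformers with an off-the-shelf bert-base-uncased; the pair_representation helper is an illustrative name, and in practice the whole stack is usually fine-tuned end-to-end with a classification head on top (e.g. via AutoModelForSequenceClassification).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def pair_representation(record_a: dict, record_b: dict) -> torch.Tensor:
    """x_ij = the [CLS] embedding of the combined sentence 'f(r_i) [SEP] f(r_j)'."""
    text_a = " ".join(f"{key}: {value}" for key, value in record_a.items())
    text_b = " ".join(f"{key}: {value}" for key, value in record_b.items())
    # Passing two texts makes the tokenizer build: [CLS] text_a [SEP] text_b [SEP]
    inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = bert(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return hidden_states[:, 0, :].squeeze(0)  # position 0 holds the [CLS] token

# This x_ij then goes into a classifier such as logistic regression, which returns
# the final match / no-match verdict.
```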

An illustration of the cross-encoder approach to predict whether two records are duplicates.

The downside here is that we can’t reuse our embeddings anymore. Each record pair needs its own unique representation, and that gets computationally expensive fast – especially when you’ve got loads of records to compare. So, why bother with this fancy cross-encoder stuff when the bi-encoder seemed to work just fine?

Well, the key is context. Cross-encoders let BERT understand one record in relation to the other, almost like it can peek over that “[SEP]” fence while it’s processing each piece of the puzzle. And the research backs it up: cross-encoders consistently outperform bi-encoders, sometimes by a little, sometimes by a lot.

How to Combine Bi- and Cross-Encoders

The struggle between cost and performance is a common one. Entity resolution tackles this with “pairing and matching”: a quick, initial filter followed by a more thorough examination of the most likely matches.

An illustration of a combined pairing & matching approach using two BERT models. This strategy is called 'Retrieve & Re-rank' in the information retrieval community.

You might’ve noticed that our two BERTs are shown in different colors. That’s a visual cue to remind us that BERT can be fine-tuned for specific tasks, and it’s a good idea to do so for pairing and matching. Think of it like training a chef for two different cuisines – each requires specialized skills.

Our pairing-BERT thrives on “contrastive” or “metric” learning, which helps it measure similarity between records. On the other hand, the matching-BERT is perfectly content with traditional binary classification training, learning to make that final yes-or-no decision.
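Putting it together, here is a minimal sketch of the pairing-then-matching (retrieve & re-rank) flow using the sentence-transformers library, which is one convenient way to run bi- and cross-encoders; the two model names are placeholders for your own fine-tuned pairing and matching checkpoints.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Pairing-BERT: a bi-encoder, ideally trained with contrastive/metric learning.
# Matching-BERT: a cross-encoder, trained as a plain binary classifier.
# Both names below are placeholders for your own fine-tuned checkpoints.
pairing_bert = SentenceTransformer("your-finetuned-bi-encoder")
matching_bert = CrossEncoder("your-finetuned-cross-encoder")

def resolve(query_record: str, candidate_records: list, top_k: int = 10):
    # 1. Pairing: retrieve the top-k nearest candidates via bi-encoder embeddings.
    #    (In a real pipeline the corpus embeddings would be computed once and indexed.)
    corpus_embeddings = pairing_bert.encode(candidate_records, convert_to_tensor=True)
    query_embedding = pairing_bert.encode(query_record, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

    # 2. Matching: score only the shortlisted pairs with the expensive cross-encoder.
    pairs = [(query_record, candidate_records[hit["corpus_id"]]) for hit in hits]
    scores = matching_bert.predict(pairs)
    return sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
```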

The Key to Success with BERT Is Training

The beauty of these BERT approaches is their flexibility. They’re not just limited to entity resolution. Swap out “records” for “queries” and “documents,” and the same architecture tackles information retrieval tasks. Instead of thinking in terms of “matches,” we’re now looking for the most “relevant” results.

So, the challenge isn’t about designing the model anymore – that’s already been figured out. The real hurdle lies in training those massive BERT models effectively. Some specific challenges include:

  • Curating training data that is both large enough and accurate enough - BERT can be unforgiving when label quality slips.
  • Dealing with imbalanced labels - here, too, BERT is less robust than, say, a tree ensemble.
  • Growing the training data with the most informative additions, for example via active learning.
  • Choosing the right training strategy - and knowing why - among the ever-growing number of options.
  • Choosing cost-effective hardware.

Need Help with BERT for Entity Resolution?

Struggling to harness the power of BERT for your entity resolution needs? We can help you navigate the complexities of training and fine-tuning these models, turning your data challenges into successes. Contact us today for a personalized solution.

Don’t miss out on future posts like this! Join my mailing list and stay updated on all things entity resolution.