In our data-drenched world, it’s easy to drown in duplicates and disconnected info. It’s like having a messy closet, but way worse for your business!

When you’re dealing with a single dataset and need to eliminate duplicate records within it, it’s natural to call this process “deduplication.” On the other hand, when you have multiple, already deduplicated datasets and need to connect records that represent the same real-world entity across them, the term “linkage” is commonly used.

Example - DuDe Restaurants

The restaurants dataset is open data provided by the DuDe team at the Hasso-Plattner Institute, University of Potsdam. We show the first five columns. The goal of a deduplication pipeline is to create the last column, entity_id, a unique identifier assigned to each distinct restaurant entity.

| record_id | name | addr | city | phone | type | entity_id |
|---|---|---|---|---|---|---|
| r1 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american | e1 |
| r2 | arnie morton's of chicago | 435 s. la cienega blvd. | los angeles | 310-246-1501 | steakhouses | e1 |
| r3 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | american | e2 |
| r4 | art's deli | 12224 ventura blvd. | studio city | 818-762-1221 | delis | e2 |
| r5 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | californian | e3 |

A naive approach compares every record with every other record. Pairs that are similar enough are considered matches and receive the same entity_id value. With five records, that's 5 × 4 / 2 = 10 pairs to compare in our toy example.

Let’s assume instead that our restaurant records are distributed across the two duplicate-free datasets below. Now we are talking about a linkage problem, and again the goal is to add an entity_id to every record:

1st dataset:

| record_id | name | street | city | phone | entity_id |
|---|---|---|---|---|---|
| r1 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | e1 |
| r2 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | e2 |
| r3 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | e3 |

2nd dataset:

| record_id | name | address | phone | type | entity_id |
|---|---|---|---|---|---|
| s1 | arnie morton's of chicago | 435 s. la cienega blvd., los angeles | 310-246-1501 | steakhouses | e1 |
| s2 | art's deli | 12224 ventura blvd., studio city | 818-762-1221 | delis | e2 |

Here, a naive approach would compare every record in the first dataset with every record in the second. This time, we have only 3 × 2 = 6 pairs to process, but only because we assumed that each dataset is duplicate-free. Without that assumption, we would also compare 3 pairs within the first and 1 pair within the second dataset. That's a total of 10 again.
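To make the counting concrete, here is a minimal Python sketch that enumerates the naive pairs for both toy setups. The helper names `dedup_pairs` and `linkage_pairs` are made up for this illustration and do not come from any library.

```python
from itertools import combinations, product
from math import comb

# Record ids of the two toy datasets from the linkage example
first = ["r1", "r2", "r3"]
second = ["s1", "s2"]

def dedup_pairs(ids):
    """All unordered pairs within one dataset (naive deduplication)."""
    return list(combinations(ids, 2))

def linkage_pairs(left, right):
    """All pairs across two datasets (naive linkage)."""
    return list(product(left, right))

# Five records in one dataset: 5 * 4 / 2 pairs
print(len(dedup_pairs(first + second)))                   # 10
# Two duplicate-free datasets: 3 * 2 pairs
print(len(linkage_pairs(first, second)))                  # 6
# Pairs we skipped thanks to the duplicate-free assumption: 3 + 1
print(len(dedup_pairs(first)) + len(dedup_pairs(second))) # 4
# The naive count grows quadratically: one million records
print(comb(1_000_000, 2))                                 # 499999500000
```

The last line hints at why the naive approach stops being practical long before datasets reach real-world sizes.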

Deduplication and Linkage Under One Framework

Deduplication and linkage are both specific applications of entity resolution, tailored to different data scenarios. Our toy example demonstrates that the difference between deduplication and linkage is less about the number of datasets we consider than about our belief about which record pairs we should compare and which we can skip.

Real-world scenarios deal with datasets containing millions of records. Comparing every possible pair within one or across two datasets is prohibitively expensive. Some common strategies for reducing the number of pairs are:

  • (standard blocking) Choose a match key among the attributes and compare two records only if they share the same match key.
  • (sorted neighborhood) Sort records alphabetically using a single attribute as the sort key. Only records that are in close proximity in the sorted order are paired and compared across all their attributes.
  • (K nearest embeddings) Represent every record as a numeric vector and identify the K nearest neighbors per record in this vector space equipped with a distance metric.
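The three strategies can be sketched in a few lines of plain Python on the toy restaurant records. This is an illustration under simplifying assumptions, not a production implementation: the function names are made up, the window size and K are arbitrary, and the "embedding" is a simple character-bigram count vector compared by cosine similarity, standing in for whatever learned embedding and distance metric a real pipeline would use.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

records = {
    "r1": {"name": "arnie morton's of chicago", "city": "los angeles"},
    "r2": {"name": "arnie morton's of chicago", "city": "los angeles"},
    "r3": {"name": "art's delicatessen", "city": "studio city"},
    "r4": {"name": "art's deli", "city": "studio city"},
    "r5": {"name": "hotel bel-air", "city": "bel air"},
}

def standard_blocking(records, key):
    """Pair two records only if they share the same match key value."""
    blocks = {}
    for rid, rec in records.items():
        blocks.setdefault(rec[key], []).append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def sorted_neighborhood(records, key, window=2):
    """Sort by one attribute and pair records inside a sliding window."""
    ordered = sorted(records, key=lambda rid: records[rid][key])
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs

def embed(text, n=2):
    """Toy embedding (assumption): counts of character n-grams."""
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def k_nearest(records, key, k=1):
    """Pair every record with its k most similar records."""
    vectors = {rid: embed(rec[key]) for rid, rec in records.items()}
    pairs = set()
    for rid, vec in vectors.items():
        others = sorted(
            (o for o in vectors if o != rid),
            key=lambda o: cosine(vec, vectors[o]),
            reverse=True,
        )
        for other in others[:k]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs

print(standard_blocking(records, "city"))
print(sorted_neighborhood(records, "name", window=2))
print(k_nearest(records, "name", k=1))
```

On this toy data, blocking on city keeps only the two true duplicate pairs, while the other two strategies keep those pairs plus a few non-matches for the matching step to reject. The brute-force neighbor search in `k_nearest` still compares all vectors; at scale it would be replaced by an approximate nearest-neighbor index.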

No pairing technique is perfect. Each one selects a subset (blue in the figure) of all possible pairs (red) that should cover most actual matches (green).

Note, for instance, that standard blocking splits a dataset into several subsets and assumes that there are no duplicates across these subsets. That is the opposite of our linkage assumption, where every subset is duplicate-free and matches occur only across subsets.

All these strategies are relatively cheap to apply and can eliminate most non-matches. Selecting candidate pairs is the first step of every entity resolution pipeline and an actively researched problem, variously called indexing, pairing, or blocking.
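How well a pairing strategy does can be checked on labeled data by asking two questions: what share of all possible pairs did it prune, and what share of the true matches did it keep? The sketch below answers both for the toy deduplication example, assuming city as a hypothetical match key and using the entity_id column as ground truth. The function name `pairing_quality` is made up for this illustration.

```python
from itertools import combinations

# Ground truth and match key, copied from the toy deduplication table
entity_id = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e2", "r5": "e3"}
city = {
    "r1": "los angeles",
    "r2": "los angeles",
    "r3": "studio city",
    "r4": "studio city",
    "r5": "bel air",
}

# Enumerating all pairs is only feasible here because the example is tiny
all_pairs = set(combinations(sorted(entity_id), 2))
true_matches = {(a, b) for a, b in all_pairs if entity_id[a] == entity_id[b]}
# Candidate pairs under standard blocking on city
candidates = {(a, b) for a, b in all_pairs if city[a] == city[b]}

def pairing_quality(candidates, all_pairs, true_matches):
    """Return (share of pairs pruned, share of true matches kept)."""
    pruned = 1 - len(candidates) / len(all_pairs)
    covered = len(candidates & true_matches) / len(true_matches)
    return pruned, covered

pruned, covered = pairing_quality(candidates, all_pairs, true_matches)
print(f"pruned {pruned:.0%} of pairs, kept {covered:.0%} of true matches")
```

A good pairing strategy pushes both numbers as high as possible; on real data there is usually a trade-off between them.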

Deduplication vs. Linkage in Practice

In practice, it can help to break down a single entity resolution problem into multiple independent deduplication and linkage tasks. Think of the following situations:

  • A single dataset of customer records following a consistent data schema, where one subset represents bill-to and the other ship-to addresses. Different business processes behind the different address types can result in duplicates that follow different patterns. Two individual matching models, one per address type, can work better than a single one.
  • Two datasets of customer records with vastly different data schemas. Merging both into a single tabular format means we need to find a compromise structure, which is suboptimal for one or both subsets. Poorer input quality likely translates to poorer matching accuracy.

Note that our toy linkage example has two different schemas. The first dataset has no type, and the second has the street and city in a single address field. Let’s drop the duplicate-free assumption now and do the following:

  1. Build a deduplication pipeline for the 1st dataset with a matching model that takes into account individual similarity scores for name, street, city, and phone value pairs.
  2. Build another deduplication pipeline for the 2nd dataset, this time scoring on name, address, type, and phone.
  3. Unify the two data schemas by concatenating street and city into a single address attribute in the 1st dataset and by dropping type in the 2nd. Finally, build a linkage pipeline between the individually deduplicated datasets by comparing name, address, and phone value pairs.
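Step 3 can be sketched as follows. The code assumes that steps 1 and 2 have already run (the toy datasets happen to be duplicate-free anyway) and only shows the schema unification plus a naive linkage across the two datasets. The function names are made up for this illustration, the unified attribute is named address to match the 2nd table, and the average of `difflib.SequenceMatcher` ratios is a stand-in for a real matching model.

```python
from difflib import SequenceMatcher
from itertools import product

first = [
    {"record_id": "r1", "name": "arnie morton's of chicago",
     "street": "435 s. la cienega blv.", "city": "los angeles", "phone": "310/246-1501"},
    {"record_id": "r2", "name": "art's delicatessen",
     "street": "12224 ventura blvd.", "city": "studio city", "phone": "818/762-1221"},
    {"record_id": "r3", "name": "hotel bel-air",
     "street": "701 stone canyon rd.", "city": "bel air", "phone": "310/472-1211"},
]
second = [
    {"record_id": "s1", "name": "arnie morton's of chicago",
     "address": "435 s. la cienega blvd., los angeles", "phone": "310-246-1501",
     "type": "steakhouses"},
    {"record_id": "s2", "name": "art's deli",
     "address": "12224 ventura blvd., studio city", "phone": "818-762-1221",
     "type": "delis"},
]

def unify_first(rec):
    """Concatenate street and city into a single address attribute."""
    return {
        "record_id": rec["record_id"],
        "name": rec["name"],
        "address": f"{rec['street']}, {rec['city']}",
        "phone": rec["phone"],
    }

def unify_second(rec):
    """Drop type, which the 1st dataset does not have."""
    return {k: v for k, v in rec.items() if k != "type"}

def similarity(a, b):
    """Average string similarity over name, address, and phone (illustrative)."""
    fields = ("name", "address", "phone")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)

left = [unify_first(r) for r in first]
right = [unify_second(r) for r in second]

# Linkage: compare only across the two datasets, never within one
scores = {
    (a["record_id"], b["record_id"]): similarity(a, b)
    for a, b in product(left, right)
}

# Best candidate in the 1st dataset for every record of the 2nd
best = {
    b["record_id"]: max(left, key=lambda a: scores[(a["record_id"], b["record_id"])])["record_id"]
    for b in right
}
print(best)
```

A real pipeline would replace the plain maximum with a matching model or a similarity threshold, so that records without a counterpart, such as r3 here, stay unlinked.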

Conclusion

The only difference between deduplication and linkage lies in pairing: the strategy for selecting the subset of pairs that we pass on to the expensive matching step. Sometimes it makes sense to break down a single entity resolution problem into individual deduplication and linkage tasks.

If you’re struggling with duplicate or fragmented data, our consultancy can help. We specialize in implementing effective deduplication and linkage strategies that drive data quality and business success.
