In our data-drenched world, it’s easy to drown in duplicates and disconnected info. It’s like having a messy closet, but way worse for your business!
When you’re dealing with a single dataset and need to eliminate duplicate records within it, it’s natural to call this process “deduplication.” On the other hand, when you have multiple, already deduplicated datasets and need to connect records that represent the same real-world entity across them, the term “linkage” is commonly used.
Example - DuDe Restaurants
The restaurants dataset is open data provided by the DuDe team
at the Hasso Plattner Institute, University of Potsdam. We show its first five columns. The goal of a deduplication
pipeline is to create that last column, entity_id, a unique identifier assigned to each distinct
restaurant entity.
| record_id | name | addr | city | phone | type | entity_id |
|---|---|---|---|---|---|---|
| r1 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american | e1 |
| r2 | arnie morton's of chicago | 435 s. la cienega blvd. | los angeles | 310-246-1501 | steakhouses | e1 |
| r3 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | american | e2 |
| r4 | art's deli | 12224 ventura blvd. | studio city | 818-762-1221 | delis | e2 |
| r5 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | californian | e3 |
A naive approach compares every record with every other. Pairs that are similar enough are considered matches and
receive the same entity_id value. With five records, that's C(5, 2) = 10 pairs in our toy example.
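A minimal sketch of this all-pairs approach, using the name and phone values from the table above. The `similar` function is a hypothetical stand-in for a real matching model; here it just compares phone digits after stripping separators:

```python
from itertools import combinations

records = {
    "r1": {"name": "arnie morton's of chicago", "phone": "310/246-1501"},
    "r2": {"name": "arnie morton's of chicago", "phone": "310-246-1501"},
    "r3": {"name": "art's delicatessen", "phone": "818/762-1221"},
    "r4": {"name": "art's deli", "phone": "818-762-1221"},
    "r5": {"name": "hotel bel-air", "phone": "310/472-1211"},
}

def similar(a, b):
    # Stand-in matcher: normalize phone separators and compare digits only.
    digits = lambda s: "".join(ch for ch in s if ch.isdigit())
    return digits(a["phone"]) == digits(b["phone"])

# Naive pairing: every record against every other -> C(5, 2) = 10 pairs.
pairs = list(combinations(records, 2))
matches = [(a, b) for a, b in pairs if similar(records[a], records[b])]

print(len(pairs))  # 10
print(matches)     # [('r1', 'r2'), ('r3', 'r4')]
```

Grouping the matched pairs (e.g. with a union-find structure) then yields the entity_id assignment: r1 and r2 become e1, r3 and r4 become e2, and the unmatched r5 becomes e3.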
Let’s assume instead that our restaurant records are distributed across the two duplicate-free datasets below.
Now we are talking about a linkage problem, and again the goal is to add an entity_id to every record:
1st dataset:
| record_id | name | street | city | phone | entity_id |
|---|---|---|---|---|---|
| r1 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | e1 |
| r2 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | e2 |
| r3 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | e3 |
2nd dataset:
| record_id | name | address | phone | type | entity_id |
|---|---|---|---|---|---|
| s1 | arnie morton's of chicago | 435 s. la cienega blvd., los angeles | 310-246-1501 | steakhouses | e1 |
| s2 | art's deli | 12224 ventura blvd., studio city | 818-762-1221 | delis | e2 |
Here, a naive approach would compare every record in the first dataset with every record in the second. This time we have only 3 × 2 = 6 pairs to process, but only because we assumed every dataset is duplicate-free. Without that assumption, we would also compare 3 more pairs within the first dataset and one within the second, for a total of 10 again.
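The pair counting above is just the cross product versus all combinations, which we can sketch directly with the record ids from the two tables:

```python
from itertools import combinations, product

dataset_1 = ["r1", "r2", "r3"]  # assumed duplicate-free
dataset_2 = ["s1", "s2"]        # assumed duplicate-free

# Linkage: only cross-dataset pairs need to be compared.
cross_pairs = list(product(dataset_1, dataset_2))
print(len(cross_pairs))  # 3 * 2 = 6

# Dropping the duplicate-free assumption adds the within-dataset pairs back.
within_1 = list(combinations(dataset_1, 2))  # 3 pairs
within_2 = list(combinations(dataset_2, 2))  # 1 pair
print(len(cross_pairs) + len(within_1) + len(within_2))  # 10
```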
Deduplication and Linkage Under One Framework
Deduplication and linkage are both specific applications of entity resolution, tailored to different data scenarios. Our toy example demonstrates that the difference between them is less about the number of datasets involved and more about which record pairs we believe need to be compared at all.
Real-world scenarios involve datasets with millions of records. Since the number of pairs grows quadratically with the number of records, comparing every possible pair within one or across two datasets is prohibitively expensive. Some common strategies to reduce it are:
- (standard blocking) Choose a match key among the attributes and compare two records only if they share the same match key.
- (sorted neighborhood) Sort records alphabetically using a single attribute as the sort key. Only pairs with records in close proximity will be compared across all their attributes.
- (K nearest embeddings) Represent every record as a numeric vector and identify the K nearest neighbors per record in this vector space equipped with a distance metric.
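A minimal sketch of the first strategy, standard blocking, using city as a hypothetical match key for the five restaurant records:

```python
from collections import defaultdict
from itertools import combinations

records = {
    "r1": "los angeles",
    "r2": "los angeles",
    "r3": "studio city",
    "r4": "studio city",
    "r5": "bel air",
}

# Standard blocking: group records by match key, then pair only within a block.
blocks = defaultdict(list)
for rid, city in records.items():
    blocks[city].append(rid)

candidate_pairs = [
    pair for block in blocks.values() for pair in combinations(block, 2)
]
print(candidate_pairs)  # [('r1', 'r2'), ('r3', 'r4')] -- 2 instead of 10 pairs
```

The choice of match key matters: a dirty or missing city value would silently place true duplicates in different blocks, which is why real pipelines often combine several blocking passes.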

Note, for instance, that standard blocking splits a dataset into several blocks and assumes that there are no duplicates across blocks - the opposite of our linkage assumption of duplicate-free subsets.
All these strategies are relatively cheap to apply and can eliminate most non-matches. This step is the first in every entity resolution pipeline and an actively researched problem known variously as indexing, pairing, or blocking.
Deduplication vs. Linkage in Practice
In practice, it can help to break a single entity resolution problem down into multiple independent deduplication and linkage tasks. Consider the following situations:
- A single dataset of customer records following a consistent data schema, where one subset represents bill-to and the other ship-to addresses. Different business processes behind the different address types can result in duplicates that follow different patterns. Two individual matching models work better than one.
- Two datasets of customer records with vastly different data schemas. A union of both into a single tabular format means we need to find a compromise structure, which is suboptimal for one or both subsets. Poorer input quality likely translates into poorer matching accuracy.
Note that our toy linkage example has two different schemas. The first dataset has no type column, and the second
combines street and city in a single address field. Let's drop the duplicate-free assumption now and do the following:
- Build a deduplication pipeline for the 1st dataset with a matching model that takes into account individual similarity scores for the name, street, city, and phone value pairs.
- Build another deduplication pipeline for the 2nd dataset, this time scoring on name, address, type, and phone.
- Unify the two data schemas by concatenating street and city into a single address attribute in the 1st dataset and by dropping type from the 2nd. Finally, build a linkage pipeline between the individually deduplicated datasets by comparing name, address, and phone value pairs.
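The schema-unification step in the last bullet can be sketched on the matching records r2 and s2 from the toy tables. The `unify_1` and `unify_2` helpers are hypothetical; they map each schema onto the shared name, address, phone structure the linkage pipeline compares:

```python
def unify_1(rec):
    # 1st schema: concatenate street and city into a single address field.
    return {
        "name": rec["name"],
        "address": f'{rec["street"]}, {rec["city"]}',
        "phone": rec["phone"],
    }

def unify_2(rec):
    # 2nd schema: drop type, keep the already combined address.
    return {"name": rec["name"], "address": rec["address"], "phone": rec["phone"]}

r2 = {"name": "art's delicatessen", "street": "12224 ventura blvd.",
      "city": "studio city", "phone": "818/762-1221"}
s2 = {"name": "art's deli", "address": "12224 ventura blvd., studio city",
      "phone": "818-762-1221", "type": "delis"}

a, b = unify_1(r2), unify_2(s2)
print(a["address"] == b["address"])  # True: both schemas now line up
```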
Conclusion
The only difference between deduplication and linkage lies in pairing: the strategy for selecting the subset of pairs we consider in the expensive matching step. Sometimes it does make sense to break a single resolution problem down into individual deduplication and linkage procedures.
If you’re struggling with duplicate or fragmented data, our consultancy can help. We specialize in implementing effective deduplication and linkage strategies that drive data quality and business success.
Don’t miss out on future posts like this! Join my mailing list and stay updated on all things entity resolution.
