Master Data Management (MDM) is how the big software companies talk about deduplicating data. But here’s the kicker: most of them sell it as a service, and they charge you based on how many records you feed into their system. For larger companies, that can mean spending hundreds of thousands, even millions, of dollars every year.

The target audience for this article

Thinking about jumping on the MDM bandwagon? You’ve probably already gotten a few quotes, and let’s be honest, they’re eye-watering. Or maybe you’re already locked into an MDM subscription that’s taking a big bite out of your budget.

But what if there was a way to slash those annual costs without sacrificing results? Imagine getting 80% of the benefits with just 20% of the effort. It’s possible.

The secret? Tackle the easy deduplication tasks with open-source tools, and let your MDM handle the truly complex matches.

Why entity resolution matters

Picture this: a company’s on a roll, snapping up businesses left and right. New products, new markets, it’s all exciting… until you look at the IT side. Suddenly, you’re juggling a hundred different ERP systems, teams are working in isolation, and data’s scattered everywhere.

You’ve got the same customer listed in multiple systems, product information is out of sync, and duplicates are running rampant. It’s a mess, and it’s costing the company time, money, and opportunities.

  • Missed Sales: Without a clear view of their data, the company’s missing out on chances to sell more to existing customers, just because different teams don’t know what others are doing.
  • Inefficient Teams: Those old regional and product line divisions are holding back the field teams, preventing them from working together effectively.
  • Supplier Squeeze: Different teams buying the same stuff separately? That’s like leaving money on the table. The company’s losing out on bulk discounts and better deals.
  • Order Chaos: Manufacturing, procurement, and sales aren’t on the same page, which means backlogs are piling up and customers are getting frustrated.

Figure: System integrations. Larger enterprises maintain many IT systems across functions, regions, and lines of business. Arrows represent implemented processes, the flow of data, or just manual paperwork. Can we trace a product from closed opportunity to manufacturing, distribution, installation/sale, and service?

Don’t despair! MDM can untangle this mess. It connects the dots between your scattered data, creating a unified view. Imagine the possibilities: increased sales, efficient teams, better supplier deals, and smooth order processing.

Figure: IT architecture of a data model. We extract and load (EL) data from the sources into the raw layer of our lake. Some data flows into our Master Data Management (MDM) platform, which handles entity resolution end-to-end. Transformations (T) build on the MDM output to produce a full data mart. Analytics and reverse ETL deliver actionable insights. Image by the author.

End-to-end entity resolution

Want to dive deeper into the world of entity resolution? Check out the comprehensive survey by Christophides and co-authors, “End-to-End Entity Resolution for Big Data.” It’s packed with insights and covers many aspects we won’t touch on here.

The next figure represents one of many ways to implement entity resolution.

Figure: End-to-end entity resolution. Entity resolution can be an iterative process. We ingest and preprocess records, engineer similarity features, select (and fit) a classification model, and cluster matches. We can set rules (similarity thresholds) under which highly similar pairs are considered true matches automatically (red) and distribute a batch of likely but unsure cases across humans for review (green). Resolved examples help us to learn and refine.

Alright, let’s break down the typical steps involved in an end-to-end entity resolution process:

  1. Data Cleanup: We’ll get your data ready by tidying it up - fixing errors, standardizing formats, and filling in any gaps.
  2. Smart Pairing: Instead of comparing every single record with every other record (n records yield n(n-1)/2 pairs, so the work grows quadratically), we’ll cleverly narrow down the possibilities, focusing only on the pairs that have a decent chance of being matches. This step is often called blocking.
  3. Similarity Scoring: We’ll develop ways to measure how alike different attributes are. For example, we might compare names, addresses, or product descriptions to see how closely they match.
  4. Match Prediction: We’ll use a model to predict which pairs are most likely to be the same entity. Think of it like a super-smart detective, weighing the evidence and making educated guesses.
  5. Grouping: We’ll take those predicted matches and group them together into clusters, representing the same real-world entity.
  6. Human Check: Just to be extra sure, we’ll have humans review some of the trickier cases where the model isn’t completely confident.
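
To make steps 1 through 5 concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration and not a production recipe. The toy records, the field names, blocking on the city, and the 0.7 threshold are all assumptions made for this example, and a plain threshold stands in for a fitted model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import re

# Toy records; the field names are assumptions for this sketch.
RECORDS = [
    {"id": 1, "name": "Acme Corp.", "city": "Berlin"},
    {"id": 2, "name": "ACME Corp", "city": "berlin"},
    {"id": 3, "name": "Acme Corporation", "city": "Berlin"},
    {"id": 4, "name": "Globex GmbH", "city": "Munich"},
]

def clean(text):
    """Step 1: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def candidate_pairs(records):
    """Step 2: blocking; only compare records that share a cleaned city."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[clean(rec["city"])].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

def similarity(a, b):
    """Step 3: a single similarity feature, the name ratio in [0, 1]."""
    return SequenceMatcher(None, clean(a["name"]), clean(b["name"])).ratio()

def resolve(records, threshold=0.7):
    """Steps 4 and 5: a threshold as stand-in model, then union-find clustering."""
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        if similarity(a, b) >= threshold:
            parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(set)
    for rec in records:
        clusters[find(rec["id"])].add(rec["id"])
    return sorted(sorted(ids) for ids in clusters.values())

clusters = resolve(RECORDS)
print(clusters)
```

In practice you would combine several similarity features and fit a classifier on labeled pairs, but the shape of the pipeline stays the same.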

Usually, you’ll only need humans to review a small group of records where the model isn’t certain. Think of it like a quality check. Their feedback can be used to make the model even smarter, or even help you rethink earlier steps in the process.
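
One simple way to decide what goes to human review is a pair of thresholds on the match score. The sketch below assumes scores between 0 and 1. The cut-offs of 0.9 and 0.6 are placeholders you would tune on your own data.

```python
def triage(scored_pairs, auto_match=0.9, review_floor=0.6):
    """Split scored pairs into auto-accepted matches, a review queue, and non-matches."""
    matches, review, rejected = [], [], []
    for pair, score in scored_pairs:
        if score >= auto_match:
            matches.append(pair)            # confident enough to accept automatically
        elif score >= review_floor:
            review.append((pair, score))    # likely but unsure: send to a human
        else:
            rejected.append(pair)           # treated as different entities
    return matches, review, rejected

# Hypothetical scores for three record pairs.
scored = [((1, 2), 0.97), ((2, 3), 0.72), ((1, 4), 0.15)]
matches, review, rejected = triage(scored)
print(len(matches), len(review), len(rejected))
```

The wider the band between the two thresholds, the more pairs land in the review queue, so these two numbers are also your main lever on review costs.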

The Budget Crunch: Keeping Costs Under Control

Entity resolution isn’t just about algorithms and data. It also hits your wallet. Between cloud computing costs and paying people to review uncertain matches, things can get expensive fast.

That’s why a quick-and-dirty solution might actually cost you more in the long run. It’s time to level up your game.

Building the Dream Team: It Takes a Village

We’re talking about adding more advanced features to your entity resolution process. Think things like using expert knowledge to automatically label training data or prioritizing the most uncertain matches for human review.
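
Both ideas fit in a few lines. The sketch below is deliberately naive. The rule functions, the field names `tax_id` and `country`, and the "first vote wins" combination are assumptions for illustration. Dedicated weak-supervision and active-learning tooling does this far more carefully.

```python
ABSTAIN, NO_MATCH, MATCH = -1, 0, 1

def lf_same_tax_id(a, b):
    """Expert rule: identical, non-empty tax IDs indicate a match."""
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return MATCH
    return ABSTAIN

def lf_different_country(a, b):
    """Expert rule: records from different countries indicate a non-match."""
    if a.get("country") and b.get("country") and a["country"] != b["country"]:
        return NO_MATCH
    return ABSTAIN

def weak_label(a, b, rules=(lf_same_tax_id, lf_different_country)):
    """Return the first non-abstaining vote (a naive way to combine rules)."""
    for rule in rules:
        vote = rule(a, b)
        if vote != ABSTAIN:
            return vote
    return ABSTAIN

def most_uncertain(scored_pairs, k):
    """Pick the k pairs whose predicted match probability is closest to 0.5."""
    return sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))[:k]

# Hypothetical model scores: review the two least certain pairs first.
queue = most_uncertain([("a", 0.95), ("b", 0.52), ("c", 0.10), ("d", 0.40)], k=2)
print(queue)
```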

These are all doable, but here’s the challenge: you need a diverse skill set to build and manage it all. Infrastructure, security, backend development, machine learning, a user-friendly review interface… it’s a lot!

You could tackle some parts in-house, but don’t be afraid to bring in vendors for the heavy lifting. I’ve chatted with a few who offer powerful matching engines you can install on your own servers. And there are also companies that specialize in providing software for managing the human review process.

Open-Source First: Your Secret Weapon

All this vendor talk might seem overwhelming. But remember, it’s also a chance to learn. Before you start signing contracts, I highly recommend experimenting with open-source frameworks. Here’s why:

  • Cut through the Marketing Hype: You’ll know exactly what you need, so you can avoid those sales pitches full of empty promises.
  • Challenge Vendors: You’ll have real-world examples to test their solutions and expose any weaknesses.
  • Negotiate from a Position of Strength: When you know what you’re talking about, you can drive a harder bargain. Vendors will know you’re not an easy target.

Trust me, a little upfront experimentation can save you a lot of money in the long run.

How you can reduce your MDM costs

When you’re talking to MDM vendors, remember this: their pricing isn’t just about how much data you have. They’ll also try to upsell you on additional features and integrations, like fancy address validation services. These add-ons can significantly inflate your costs.

But what if you could cut back on some of those expenses? Take a look at the data preparation step in the diagram below. There are several opportunities to save money, highlighted in green.

Figure: Data processing and cleansing. Every green box is a money-saving opportunity. Preprocessing (e.g., SQL) narrows the input down to just the relevant records. Open-source entity resolution takes care of the simple cases, further reducing the number of records fed into MDM. Finally, expensive third-party APIs are called only where cheaper alternatives cannot replace them.

Save money with simple preprocessing

A lot of the customer records sitting in your source systems are dead weight. They might be outdated or just plain irrelevant to your business goals. Why waste time and money cleaning up data that won’t add any value?

Figure: Not every record is relevant for MDM. We extract+load relevant data to our lake. Some records won’t be beneficial for MDM. We can identify those with rules translated into SQL and executed on our data lake. The rules can change anytime.

  • Zombie Records: If a customer record has no connection to any orders, transactions, contracts, or open opportunities, it’s likely a dead end. Deduplicating these “zombie” records is a waste of resources.
  • B2C Customers: Consider if deduplicating B2C customers is worth the effort. The classic MDM pitch is about creating a 360-degree view of customers across regions and product lines. But if that’s not a game-changer for your B2C business, don’t invest in resolving those entities.

Don’t worry about making the wrong call on excluding data. If your business needs change and those previously ignored records become valuable, it’s easy to update your settings. A simple code tweak, and the next MDM batch will include the data you need.
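
As a rough illustration, the two rules above can be written as a single SQL filter. The sketch runs it against an in-memory SQLite database so that it is self-contained. The table and column names (`customers`, `orders`, `segment`) are made up for this example, and on a real data lake you would run the equivalent query in your warehouse’s SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Acme Corp', 'B2B'),
        (2, 'Old Lead Ltd', 'B2B'),
        (3, 'Jane Doe', 'B2C');
    INSERT INTO orders VALUES (10, 1), (11, 3);
""")

# Rule 1: drop "zombie" records that have no orders.
# Rule 2: drop B2C customers (a business decision, easy to reverse later).
MDM_CANDIDATES = """
    SELECT c.id, c.name
    FROM customers AS c
    WHERE c.segment = 'B2B'
      AND EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
"""

relevant = conn.execute(MDM_CANDIDATES).fetchall()
print(relevant)
```

Changing your mind later means editing the `WHERE` clause. A real zombie check would also look at transactions, contracts, and open opportunities, not just orders.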

Save money with cheaper alternatives for 3rd party APIs

Your MDM system is powerful, but it’s not an island. To get the most out of it, you’ll want to connect it to external data. This integration unlocks a whole new world of possibilities, allowing you to enrich and validate internal data for deeper insights and better decision-making.

Here are two prime examples of how API integrations can supercharge your MDM:

  • Geocoding and Address Validation: Say goodbye to typos, incomplete addresses, and inaccurate location data. With a geocoding and address validation API, you can ensure that your customer and supplier addresses are accurate, standardized, and available as geographic coordinates. This improves everything from marketing campaigns to logistics planning.
  • B2B Customer Enrichment: Go beyond the basics with your B2B customer data. Enrich your records with external information like industry classifications, company hierarchies (parent companies, subsidiaries), and key performance indicators like annual revenue and headcount. This deeper understanding of your customers empowers you to tailor your sales and marketing strategies, identify new opportunities, and build stronger relationships.

Most MDM vendors will try to steer you toward their in-house solutions or the big names in the industry. It’s the classic “nobody gets fired for buying IBM” mentality. But let’s be honest, does that really guarantee you’re getting the best bang for your buck?

Figure: Save costs with cheap alternatives. Geocode your addresses with a low-cost, permissively licensed service first. Some services return a confidence score along with the search results. Where that score falls below a threshold, call a second, more expensive service if needed.

Let’s talk geocoding services. The big names like Google Maps and Mapbox are tempting, but their proprietary nature can be a trap. High costs and restrictive licenses can limit your flexibility and leave you at the mercy of their pricing whims.

Enter the alternatives built on open data, like Geoapify and OpenCage. They leverage the power of OpenStreetMap, a massive, community-driven map of the world. These services often offer significantly lower prices than their closed competitors, and their friendlier terms generally let you store and share the results, typically subject to attributing the underlying open data.
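
The cheap-first, premium-as-fallback idea can be sketched independently of any particular provider. Everything below is hypothetical: the `GeocodeResult` shape, the 0.8 threshold, and the two stub functions stand in for real API clients. Confidence scales differ between providers, so normalizing them to a 0 to 1 range is an assumption of this sketch.

```python
from typing import Callable, NamedTuple, Optional

class GeocodeResult(NamedTuple):
    lat: float
    lon: float
    confidence: float  # assumed to be normalized to [0, 1]

Geocoder = Callable[[str], Optional[GeocodeResult]]

def geocode_with_fallback(address: str, cheap: Geocoder, premium: Geocoder,
                          min_confidence: float = 0.8):
    """Try the cheap service first; call the premium one only for weak results."""
    result = cheap(address)
    if result is not None and result.confidence >= min_confidence:
        return result, "cheap"
    fallback = premium(address)
    if fallback is not None:
        return fallback, "premium"
    return result, "cheap"  # low-confidence or None; let the caller decide

# Stub geocoders with made-up coordinates (no network calls).
def cheap_stub(address):
    known = {"1 Main St, Springfield": GeocodeResult(39.80, -89.64, 0.95)}
    return known.get(address, GeocodeResult(0.0, 0.0, 0.2))

def premium_stub(address):
    return GeocodeResult(48.14, 11.58, 0.99)

hit, source = geocode_with_fallback("1 Main St, Springfield", cheap_stub, premium_stub)
retry, retry_source = geocode_with_fallback("Marienplatz, Muenchen", cheap_stub, premium_stub)
print(source, retry_source)
```

Logging which service answered each request also tells you what share of addresses still needs the expensive provider, which is useful when you negotiate volumes.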

Save money with open-source entity resolution

Yes, building a top-tier MDM solution requires effort. You need accuracy and a streamlined review process, especially for those tricky edge cases. But let’s not forget the reality: the majority of duplicates are low-hanging fruit.

For every complex match that needs careful consideration, there are likely several straightforward duplicates we can catch with a simple recipe.

Figure: Use open source to measure the similarity of pairs of entity records. Let sophisticated proprietary MDM solutions do the heavy lifting for you. Increase value for your money.

Let’s talk architecture. We can tackle this initial cleanup with a simple script deployed right on your data lake. Run it once after you’ve extracted and loaded your source data, and you’ll likely knock out the bulk of those easy duplicates. Set it up as a periodic job to catch any stragglers that pop up later.

The script’s output – those low-hanging duplicate pairs – can be stored in a handy cross-reference table. This table, combined with the results from your MDM system, will give you the complete, deduplicated picture you’ve been aiming for.
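
Here is one possible shape for that cross-reference table, sketched as a plain mapping from record ID to a master ID. The record IDs and the two lists of pairs are invented for illustration. In a real setup, both would come from tables on your data lake.

```python
def build_xref(record_ids, *pair_sources):
    """Merge duplicate pairs from several sources into record_id -> master_id."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pairs in pair_sources:
        for a, b in pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller ID as master so the result is deterministic.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {rid: find(rid) for rid in record_ids}

open_source_pairs = [("A1", "A2"), ("B7", "B8")]  # easy duplicates from the script
mdm_pairs = [("A2", "C5")]                        # harder matches from the MDM
xref = build_xref(["A1", "A2", "B7", "B8", "C5", "D9"], open_source_pairs, mdm_pairs)
print(xref)
```

Because the pairs are merged transitively, a record matched by the script and a record matched by the MDM end up under the same master ID when they share a duplicate.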

Prove the concept and negotiate with confidence

MDM is a major commitment, both in terms of time and money. So how do you convince decision-makers that it’s worth it? The answer: a proof of concept (POC). An expert in entity resolution can demonstrate the tangible benefits of resolved internal data in a few days of work. A good POC answers questions like these:

  • How many duplicate customer/product/supplier records are lurking in our most important systems?
  • How much information is missing or mismatched across different sources?
  • How are these data inconsistencies impacting our business?

Don’t just report the number of duplicates you can detect with high confidence. Investigate the likely but unsure cases as well, using random sampling and manual review.

Figure: You detected many duplicates confidently (green) with open-source and a few lines of code. Where does your algorithm need to catch up? Get an idea by random stratified sampling and some good old manual investigative work.
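
Stratified sampling here can simply mean drawing a few pairs from each similarity-score band, so your manual review covers the whole unsure range and not just its most crowded part. The sketch below assumes scores between 0 and 1. Ten bands, two pairs per band, and the fixed seed are arbitrary choices for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(scored_pairs, n_bands=10, per_band=2, seed=42):
    """Sample up to `per_band` pairs from each similarity-score band."""
    rng = random.Random(seed)  # fixed seed keeps the review batch reproducible
    bands = defaultdict(list)
    for pair, score in scored_pairs:
        band = min(int(score * n_bands), n_bands - 1)  # a score of 1.0 joins the top band
        bands[band].append((pair, score))
    sample = {}
    for band in sorted(bands):
        members = bands[band]
        sample[band] = rng.sample(members, min(per_band, len(members)))
    return sample

# Hypothetical unsure pairs with their similarity scores.
unsure = [
    (("c1", "c2"), 0.61), (("c3", "c4"), 0.65), (("c5", "c6"), 0.68),
    (("c7", "c8"), 0.72),
    (("c9", "c10"), 0.85), (("c11", "c12"), 0.88),
]
sample = stratified_sample(unsure)
for band, pairs in sample.items():
    print(band, pairs)
```

The share of true duplicates you find in each band gives a rough estimate of how many matches your current thresholds miss.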

Where does your open-source solution fall short? Is it struggling with misspellings, synonyms, acronyms, or foreign languages? Use those examples to challenge MDM vendors. Can their fancy systems handle what yours can’t?

If your preferred vendor stumbles, you’ve got leverage. Point out their weaknesses and use that evidence to negotiate a better price. Remember, it’s a two-way street. You’re not just a customer; you’re bringing valuable insights to the table.

Conclusion

MDM platforms come with a hefty price tag, and vendors are quick to tout their value. I don’t disagree. But there’s always room to squeeze more out of your investment.

Building everything in-house sounds tempting. Keep it simple, start small, and you’re already ahead of the game. But will that be enough for your business needs?

  • Focus on the Business: Don’t treat entity resolution as just an IT project. Identify specific business cases where better data can make a real difference. Estimate the value of those improvements. This will help you make informed decisions about in-house vs. vendor solutions and set realistic expectations.
  • Assess Your Talent: Do you have the skills and expertise to build and maintain an in-house solution? Hiring a dedicated team can be costly. Be realistic about what you can achieve with your current resources.
  • Explore the Market: MDM pricing varies wildly. If budget is a concern, don’t assume you have to go with the big players. Plenty of vendors offer competitive solutions at a fraction of the cost. You might be surprised at how much you can save.
