What is deduplication and how does it work?

What is deduplication?


Deduplication is the process of identifying duplicate records and merging them into a single, well-defined version (sometimes called the "golden record" or the "single version of the truth").

Where can I use deduplication in Octolis?


In Octolis, deduplication is used to unify data coming from different Sources and build Datasets that are free of duplicate records.

How does deduplication work in Octolis?


How to identify duplicate records?

Octolis enables you to identify duplicate records based on a set of columns (aka a deduplication key).
Let’s take a few examples:
  • You want to treat contacts that have the same email as duplicates.
  • You want to treat contacts that have the same email and the same phone as duplicates.
Each time a new record enters the Dataset, we compare its deduplication key to those of all records that already exist in the Dataset to determine whether it is a duplicate.
If several records enter the Dataset all at once, we also make sure to identify duplicates amongst them.
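
As an illustration only, the identification step can be sketched as grouping records by their deduplication key. This is not Octolis code: the function name `find_duplicates`, the column names, and the sample contacts are all hypothetical, and the sketch compares values exactly, without any normalization.

```python
from collections import defaultdict


def find_duplicates(records, key_columns):
    """Group records by deduplication key and keep groups with 2+ records."""
    groups = defaultdict(list)
    for record in records:
        # The deduplication key is the tuple of values of the chosen columns.
        key = tuple(record.get(col) for col in key_columns)
        groups[key].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Hypothetical sample data.
contacts = [
    {"id": 1, "email": "ann@example.com", "phone": "0611111111"},
    {"id": 2, "email": "ann@example.com", "phone": "0622222222"},
    {"id": 3, "email": "bob@example.com", "phone": "0633333333"},
]

# Key = email: contacts 1 and 2 are duplicates.
by_email = find_duplicates(contacts, ["email"])

# Key = email AND phone: no two contacts share both values.
by_email_and_phone = find_duplicates(contacts, ["email", "phone"])
```

With the email-only key, contacts 1 and 2 fall into the same group. With the email-and-phone key, no group contains more than one record, so nothing is flagged.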

How does the merge work?

Each time Octolis identifies duplicate records (based on the deduplication key you set), they are merged together, resulting in a single record in the Dataset.
We take the values of the most recent duplicate record to build the final record, except for Source key columns.
We will soon enable you to use more custom rules to build each column of the final record.
We always preserve a stable association between a deduplicated record and each Source key it was first associated with.
We also automatically add several system columns to the deduplicated records of the Dataset:
  • __master-Id__ to uniquely identify each deduplicated record (stable over time).
  • __modified-At__ to state when each record was last updated in the Sources.
  • __<SourceName>_<SourcekeyColumn>_list__ (for each Source Key column) to list the Source key values of all duplicates that were merged into the deduplicated record.
  • __created-At__ to state when each record was created in the DB table (stable over time).
  • __updated-At__ to state when each record was updated in the DB table.
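
The merge rule described above can be sketched as follows, under stated assumptions. This is a simplified illustration, not the actual Octolis implementation: `merge_duplicates`, the `modified_at` recency column, and the `crm_id` Source key are hypothetical names, the list column is named `crm_id_list` for brevity, and the sketch does not generate a master id or the other system columns.

```python
def merge_duplicates(duplicates, source_key_columns, recency_column="modified_at"):
    """Merge duplicates: the most recent record wins, Source keys are collected."""
    # ISO-formatted dates sort correctly as plain strings.
    ordered = sorted(duplicates, key=lambda r: r[recency_column])
    merged = dict(ordered[-1])  # start from the most recent duplicate
    for col in source_key_columns:
        # Source key columns are not overwritten: every value is kept in a list.
        merged.pop(col, None)
        merged[f"{col}_list"] = [r[col] for r in ordered if r.get(col) is not None]
    return merged


# Hypothetical duplicates coming from a Source whose key column is "crm_id".
duplicates = [
    {"crm_id": "C-1", "email": "ann@example.com", "city": "Paris", "modified_at": "2023-01-10"},
    {"crm_id": "C-7", "email": "ann@example.com", "city": "Lyon", "modified_at": "2023-03-02"},
]

golden = merge_duplicates(duplicates, ["crm_id"])
```

Here `golden` carries the city of the most recent record ("Lyon"), while `crm_id_list` keeps both Source keys, so each original record can still be traced back to the merged one.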

How to use the output of deduplication in my systems?


Thanks to Octolis, you can be sure that only deduplicated records are synced to your systems.
What now? You might want to do some data stewarding to remove duplicate records from your systems.
To do so, we advise mapping, in a Sync towards your system, the __master-Id__ and the list of Source key values of all duplicates that were merged into the deduplicated record.
We will later offer a native capability to tag duplicates as "Duplicate of Id XXX" in a dedicated field.

Pitfalls of multi-key deduplication


Let's imagine that we use the following deduplication key: "Email OR Phone OR (Last name x first name x date of birth)" on our Contacts dataset.
This multi-key deduplication key carries the following risks:
  • Make sure that, for every record, at least one of the fields combined with "OR" is non-empty. Otherwise, the contact will be ignored by deduplication.
  • You need to anticipate unwanted merges caused by repetitive email or phone patterns (e.g. 30 contacts have Phone = 060000000000) and apply data prep rules at the Source level to prevent them (here, nullify 060000000000).
  • Some unwanted merges are unavoidable: e.g. all contacts with an empty last name, an empty first name and the same date of birth will be merged.
  • Multi-key deduplication can merge, after the fact, several contacts that already have a master_id, when the values of some keys are updated in some Sources. If the CDP contacts are synchronized elsewhere, a specific deletion workflow must be created to manage these master_id "disappearances".
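
To make these risks more concrete, here is a minimal sketch of "OR" matching on a simplified "Email OR Phone" key, written with a union-find structure. It is an assumption about how such matching can be modelled, not a description of the Octolis engine: `dedupe_or_keys`, the `placeholders` parameter, and the sample contacts are hypothetical.

```python
def dedupe_or_keys(records, key_columns, placeholders=frozenset()):
    """Cluster records that share a non-empty value on ANY key column."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (column, value) -> index of the first record carrying it
    ignored = []     # records with no usable key at all
    for i, record in enumerate(records):
        usable = False
        for col in key_columns:
            value = record.get(col)
            if value in (None, "") or value in placeholders:
                continue  # empty or nullified values never match anything
            usable = True
            j = first_seen.setdefault((col, value), i)
            parent[find(i)] = find(j)  # union the two clusters
        if not usable:
            ignored.append(i)

    clusters = {}
    for i in range(len(records)):
        if i not in ignored:
            clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), ignored


# Hypothetical sample data.
contacts = [
    {"email": "ann@example.com", "phone": "0611111111"},     # 0
    {"email": "ann@example.com", "phone": "0622222222"},     # 1
    {"email": "bob@example.com", "phone": "0622222222"},     # 2
    {"email": "carl@example.com", "phone": "060000000000"},  # 3
    {"email": "dana@example.com", "phone": "060000000000"},  # 4
    {"email": None, "phone": ""},                            # 5
]

raw_clusters, raw_ignored = dedupe_or_keys(contacts, ["email", "phone"])
clean_clusters, clean_ignored = dedupe_or_keys(
    contacts, ["email", "phone"], placeholders={"060000000000"}
)
```

In this sketch, contacts 0 and 2 share neither an email nor a phone, yet they end up in the same cluster because contact 1 bridges them. The same chaining explains why a key update in one Source can merge contacts that already had their own master_id. Contacts 3 and 4 are merged only because of the placeholder phone number, and nullifying it keeps them apart. Contact 5 has no usable key and is left out of deduplication.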
Final takeaways:
Be aware that multi-key problems are harder to debug than single-key (or composite-key) ones: anticipating complex merge cases is very difficult, and client setup iterations take longer.