Flattening and Deduplication Process
Platforms often deal with large datasets coming from many sources, such as system logs, social media feeds, and transaction records. This data frequently does not fit neatly into a traditional relational schema. To work with that variety of structures and origins, a flexible approach is required to process semi-structured data.
In many scenarios, semi-structured data leads to very large tables. A good practice is to ingest the data incrementally based on the updated_at column. During this process, it is important to preserve data integrity, especially when multiple versions of the same record exist. That is why deduplication is required.
Example
Suppose a full load was executed yesterday at 00:00, bringing all records available at that moment. After switching the pipeline to incremental mode, the next execution identifies every record whose updated_at value is greater than that point in time. If a record was updated, it will be replicated again.
Deduplication process
If the tables have a primary key, deduplication can be performed like this:
SELECT *
FROM TABLE
QUALIFY (ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY updated_at DESC)) = 1;
This query keeps only the most recent version of each record based on the primary key, eliminating duplicates and preserving data integrity.
It is essential to run this deduplication process to guarantee accurate analyses and avoid distortions caused by multiple versions of the same record.