Writing API Results to a Database Without Duplicates


Most APIs are not clean, perfectly incremental data sources.
They resend records, update old data, and deliver results late.

If you assume “each API call equals new data,” you will eventually get:

  • duplicate rows

  • missing records

  • or both

This article explains how to design API ingestion flows in Clockspring that are safe, repeatable, and complete, even when APIs behave poorly.


The Core Rule

Favor completeness over uniqueness at ingest time.

It is always safer to:

  • ingest overlapping data

  • allow duplicates temporarily

  • remove or overwrite duplicates deterministically

Trying to get “exactly once” delivery from an API almost always fails.


Why APIs Commonly Produce Duplicates

Duplicates happen because:

  • APIs resend data intentionally

  • Records are updated, not just created

  • Pagination overlaps across runs

  • Time-based filters are imprecise

  • Backfills arrive late

This is normal behavior, not a bug.

Your flow must assume duplicates will happen.


Avoid Missing Records: Use Overlapping Time Windows

This is one of the most important patterns in API ingestion.

The risky pattern

  • Run the flow every 60 minutes

  • Query the API for records from the last 60 minutes

Why this fails:

  • Records that arrive late can fall between runs

  • Boundary timestamps get skipped

  • Clock drift and rounding cause gaps

  • Backfilled updates may appear after your run finishes

This is how data quietly falls through the cracks.


The recommended pattern

  • Run the flow every 30 minutes

  • Query the API for the last 60 minutes (or last 45)

  • Expect duplicates

  • Handle them safely downstream

You intentionally create overlap so nothing is missed.

Rule of thumb:

  • Your query window should always be larger than your schedule interval

  • Choose overlap based on how late data can realistically arrive

This trades duplicates for reliability. That is a good trade.
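
Here is a minimal Python sketch of the overlapping-window idea, assuming a 30-minute schedule and a 60-minute lookback; the parameter names updated_since and updated_until are placeholders, not real API fields.

    from datetime import datetime, timedelta, timezone

    SCHEDULE_INTERVAL = timedelta(minutes=30)   # how often the flow runs
    LOOKBACK_WINDOW = timedelta(minutes=60)     # how far back each run queries

    def build_query_window(now=None):
        """Return (start, end) timestamps for the next API query.

        The lookback is deliberately larger than the schedule interval,
        so consecutive runs overlap and late records are still caught
        by a later run.
        """
        now = now or datetime.now(timezone.utc)
        return now - LOOKBACK_WINDOW, now

    start, end = build_query_window()
    # Hypothetical query parameters; the real names depend on the API.
    params = {"updated_since": start.isoformat(), "updated_until": end.isoformat()}

The duplicates this overlap produces are then removed downstream, as described in the next section.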


How to Handle Duplicates Safely

There are two practical strategies. Pick one and be consistent.


Option 1: Upsert Using a Stable Key (Preferred)

This is the simplest and most reliable approach.

Requirements:

  • The API provides a stable identifier (id, external_id, etc.)

  • Your database table enforces uniqueness on that key

Behavior:

  • New records insert

  • Existing records update

  • Re-sent records do not create duplicates

This works especially well with overlapping time windows.

Important notes:

  • Make sure the primary or unique key truly represents the record

  • Understand whether updates should overwrite existing values

For most teams, this is the correct default.
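
As a rough sketch, this is what an upsert can look like in Python using SQLite, which shares the ON CONFLICT syntax with PostgreSQL (other databases typically use MERGE). The table and column names (api_records, external_id) are placeholders for your own schema.

    import sqlite3

    conn = sqlite3.connect("ingest.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS api_records (
            external_id TEXT PRIMARY KEY,   -- stable key from the API
            payload     TEXT NOT NULL,
            updated_at  TEXT NOT NULL
        )
    """)

    def upsert(records):
        """Insert new records, overwrite existing ones, never duplicate."""
        conn.executemany(
            """
            INSERT INTO api_records (external_id, payload, updated_at)
            VALUES (:external_id, :payload, :updated_at)
            ON CONFLICT(external_id) DO UPDATE SET
                payload    = excluded.payload,
                updated_at = excluded.updated_at
            """,
            records,   # a list of dicts keyed by the column names
        )
        conn.commit()

Because uniqueness is enforced by the database, records re-sent by overlapping windows simply overwrite the existing row.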


Option 2: Deduplicate Before Writing

This approach filters duplicates before the database write.

Common methods:

  • Track seen IDs in a cache or table

  • Drop records that already exist

  • Pass only new records downstream

This can reduce database load but adds complexity.

Use this when:

  • You cannot upsert

  • The database schema is fixed

  • You need strict insert-only behavior

If you do this, deduplicate before batching, not after.
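
A minimal sketch of dedupe-before-write, assuming the stable key is called external_id; the in-memory set here stands in for whatever cache or tracking table your flow actually uses.

    def dedupe_before_write(records, seen_ids):
        """Pass only records whose key has not been written before."""
        new_records = []
        for record in records:
            key = record["external_id"]   # assumed stable key
            if key in seen_ids:
                continue                  # duplicate: drop it
            seen_ids.add(key)
            new_records.append(record)
        return new_records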


Where Deduplication Should Happen in the Flow

Correct order:

  1. Pull data from the API

  2. Normalize records

  3. Deduplicate or upsert logically

  4. Batch records

  5. Write to the database

Do not:

  • batch first and dedupe later

  • rely on retries to fix duplicates

  • assume pagination prevents overlap
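
Putting the correct order together, a simplified sketch might look like this; fetch_pages, normalize, dedupe, and write_batch are hypothetical helpers standing in for the corresponding parts of your flow.

    BATCH_SIZE = 500   # illustrative batch size

    def ingest(fetch_pages, normalize, dedupe, write_batch):
        """Pull -> normalize -> dedupe -> batch -> write, in that order."""
        records = []
        for page in fetch_pages():               # 1. pull data from the API
            for raw in page:
                records.append(normalize(raw))   # 2. normalize records
        unique = dedupe(records)                 # 3. deduplicate (or rely on upserts)
        for i in range(0, len(unique), BATCH_SIZE):
            write_batch(unique[i:i + BATCH_SIZE])   # 4. batch, then 5. write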


Common Mistakes

  • Running the schedule at the same interval as the query window

  • Assuming APIs return “only new data”

  • Treating duplicates as errors

  • Trying to eliminate duplicates at the API call layer

  • Ignoring late-arriving updates

Most duplicate problems are design issues, not processor issues.


How This Fits with Pagination and Rate Limits

These patterns work together:

  • Pagination loops may resend records

  • Overlapping windows will resend records

  • Rate limiting may cause retries

Your flow should assume:

  • records can appear more than once

  • order is not guaranteed

  • safety comes from keys and logic, not timing


Summary

To ingest API data safely in Clockspring:

  • Use overlapping time windows

  • Expect duplicate records

  • Prevent duplicates with upserts or deduplication

  • Batch records for performance

  • Design for completeness first

If you never miss data, duplicates become a solvable problem.
If you miss data, you may never know it happened.
