Writing API Results to a Database Without Duplicates


Most APIs are not clean, perfectly incremental data sources.
They resend records, update old data, and deliver results late.

If you assume “each API call equals new data,” you will eventually get:

  • duplicate rows

  • missing records

  • or both

This article explains how to design API ingestion flows in Clockspring that are safe, repeatable, and complete, even when APIs behave poorly.


The Core Rule

Favor completeness over uniqueness at ingest time.

It is always safer to:

  • ingest overlapping data

  • allow duplicates temporarily

  • remove or overwrite duplicates deterministically

Trying to get “exactly once” delivery from an API almost always fails.


Why APIs Commonly Produce Duplicates

Duplicates happen because:

  • APIs resend data intentionally

  • Records are updated, not just created

  • Pagination overlaps across runs

  • Time-based filters are imprecise

  • Backfills arrive late

This is normal behavior, not a bug.

Your flow must assume duplicates will happen.


Avoid Missing Records: Use Overlapping Time Windows

This is one of the most important patterns in API ingestion.

The risky pattern

  • Run the flow every 60 minutes

  • Query the API for records from the last 60 minutes

Why this fails:

  • Records that arrive late can fall between runs

  • Boundary timestamps get skipped

  • Clock drift and rounding cause gaps

  • Backfilled updates may appear after your run finishes

This is how data quietly falls through the cracks.


The recommended pattern

  • Run the flow every 30 minutes

  • Query the API for the last 60 minutes (or last 45)

  • Expect duplicates

  • Handle them safely downstream

You intentionally create overlap so nothing is missed.

Rule of thumb:

  • Your query window should always be larger than your schedule interval

  • Choose overlap based on how late data can realistically arrive

This trades duplicates for reliability. That is a good trade.
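
Here is a minimal Python sketch of the overlapping-window idea, assuming a 30-minute schedule and a 60-minute lookback; the parameter names updated_since and updated_until are placeholders, not real API fields.

    from datetime import datetime, timedelta, timezone

    SCHEDULE_INTERVAL = timedelta(minutes=30)   # how often the flow runs
    LOOKBACK_WINDOW = timedelta(minutes=60)     # how far back each run queries

    def build_query_window(now=None):
        """Return (start, end) timestamps for the next API query.

        The lookback is deliberately larger than the schedule interval,
        so consecutive runs overlap and late records are still caught
        by a later run.
        """
        now = now or datetime.now(timezone.utc)
        return now - LOOKBACK_WINDOW, now

    start, end = build_query_window()
    # Hypothetical query parameters; the real names depend on the API.
    params = {"updated_since": start.isoformat(), "updated_until": end.isoformat()}

The duplicates this overlap produces are then removed downstream, as described in the next section.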


How to Handle Duplicates Safely

There are two practical strategies. Pick one and be consistent.


Option 1: Upsert Using a Stable Key (Preferred)

This is the simplest and most reliable approach.

Requirements:

  • The API provides a stable identifier (id, external_id, etc.)

  • Your database table enforces uniqueness on that key

Behavior:

  • New records insert

  • Existing records update

  • Re-sent records do not create duplicates

This works especially well with overlapping time windows.

Important notes:

  • Make sure the primary or unique key truly represents the record

  • Understand whether updates should overwrite existing values

For most teams, this is the correct default.
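
As a rough sketch, this is what an upsert can look like in Python using SQLite, which shares the ON CONFLICT syntax with PostgreSQL (other databases typically use MERGE). The table and column names (api_records, external_id) are placeholders for your own schema.

    import sqlite3

    conn = sqlite3.connect("ingest.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS api_records (
            external_id TEXT PRIMARY KEY,   -- stable key from the API
            payload     TEXT NOT NULL,
            updated_at  TEXT NOT NULL
        )
    """)

    def upsert(records):
        """Insert new records, overwrite existing ones, never duplicate."""
        conn.executemany(
            """
            INSERT INTO api_records (external_id, payload, updated_at)
            VALUES (:external_id, :payload, :updated_at)
            ON CONFLICT(external_id) DO UPDATE SET
                payload    = excluded.payload,
                updated_at = excluded.updated_at
            """,
            records,   # a list of dicts keyed by the column names
        )
        conn.commit()

Because uniqueness is enforced by the database, records re-sent by overlapping windows simply overwrite the existing row.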


Option 2: Deduplicate Before Writing

This approach filters duplicates before the database write.

Common methods:

  • Track seen IDs in a cache or table

  • Drop records that already exist

  • Pass only new records downstream

This can reduce database load but adds complexity.

Use this when:

  • You cannot upsert

  • The database schema is fixed

  • You need strict insert-only behavior

If you do this, deduplicate before batching, not after.
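
A minimal sketch of dedupe-before-write, assuming the stable key is called external_id; the in-memory set here stands in for whatever cache or tracking table your flow actually uses.

    def dedupe_before_write(records, seen_ids):
        """Pass only records whose key has not been written before."""
        new_records = []
        for record in records:
            key = record["external_id"]   # assumed stable key
            if key in seen_ids:
                continue                  # duplicate: drop it
            seen_ids.add(key)
            new_records.append(record)
        return new_records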


Where Deduplication Should Happen in the Flow

Correct order:

  1. Pull data from the API

  2. Normalize records

  3. Deduplicate or upsert logically

  4. Batch records

  5. Write to the database

Do not:

  • batch first and dedupe later

  • rely on retries to fix duplicates

  • assume pagination prevents overlap
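
Putting the correct order together, a simplified sketch might look like this; fetch_pages, normalize, dedupe, and write_batch are hypothetical helpers standing in for the corresponding parts of your flow.

    BATCH_SIZE = 500   # illustrative batch size

    def ingest(fetch_pages, normalize, dedupe, write_batch):
        """Pull -> normalize -> dedupe -> batch -> write, in that order."""
        records = []
        for page in fetch_pages():               # 1. pull data from the API
            for raw in page:
                records.append(normalize(raw))   # 2. normalize records
        unique = dedupe(records)                 # 3. deduplicate (or rely on upserts)
        for i in range(0, len(unique), BATCH_SIZE):
            write_batch(unique[i:i + BATCH_SIZE])   # 4. batch, then 5. write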


Common Mistakes

  • Running the schedule at the same interval as the query window

  • Assuming APIs return “only new data”

  • Treating duplicates as errors

  • Trying to eliminate duplicates at the API call layer

  • Ignoring late-arriving updates

Most duplicate problems are design issues, not processor issues.


How This Fits with Pagination and Rate Limits

These patterns work together:

  • Pagination loops may resend records

  • Overlapping windows will resend records

  • Rate limiting may cause retries

Your flow should assume:

  • records can appear more than once

  • order is not guaranteed

  • safety comes from keys and logic, not timing


Summary

To ingest API data safely in Clockspring:

  • Use overlapping time windows

  • Expect duplicate records

  • Prevent duplicates with upserts or deduplication

  • Batch records for performance

  • Design for completeness first

If you never miss data, duplicates become a solvable problem.
If you miss data, you may never know it happened.
