Provenance Behavior in Clockspring Clusters


When Clockspring runs in a cluster, provenance behaves differently than it does on a single node. These differences affect performance, search behavior, and how lineage is interpreted. This article explains how provenance works in clustered deployments and what to expect under load.


1. Provenance Is Stored Per Node

Each node in the cluster keeps its own provenance repository. There is no shared or centralized store.

This means:

  • A FlowFile’s lineage lives on the node that processed it

  • If a FlowFile moves between nodes, its history is split across repositories

  • Cluster-wide searches have to ask every node for results

Lineage is still complete, but it is distributed.
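
To make "complete but distributed" concrete, here is a minimal Python sketch (node names, timestamps, and event types are invented for illustration) that merges per-node event lists into one lineage timeline, which is the shape of the result the UI presents after a cluster-wide search.

    from operator import itemgetter

    # Hypothetical events for one FlowFile, recorded on two different nodes.
    # Each node only knows about the processing it performed locally.
    node_a_events = [
        {"node": "node-a", "timestamp": 1702300000, "type": "RECEIVE"},
        {"node": "node-a", "timestamp": 1702300005, "type": "ROUTE"},
    ]
    node_b_events = [
        {"node": "node-b", "timestamp": 1702300010, "type": "ATTRIBUTES_MODIFIED"},
        {"node": "node-b", "timestamp": 1702300020, "type": "DROP"},
    ]

    # The full lineage is the union of every node's local events, in time order.
    lineage = sorted(node_a_events + node_b_events, key=itemgetter("timestamp"))
    for event in lineage:
        print(f"{event['timestamp']}  {event['node']:7}  {event['type']}")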


2. Provenance Searches Are More Expensive in a Cluster

When you run a provenance search in the UI:

  • The coordinator sends the query to all nodes

  • Each node scans its local provenance repository

  • Results are merged and shown in the UI

As the cluster grows, this becomes more expensive because:

  • Search cost increases linearly with node count

  • One slow node delays the entire search (see the model after this list)

  • Larger or busier repositories take longer to respond

  • Frequent searches add disk pressure across the cluster
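
To see why one slow node dominates, consider this toy scatter-gather model (the latencies are invented): the coordinator cannot merge results until every node has answered, so the latency users see tracks the slowest node, not the average.

    # Hypothetical per-node response times, in seconds, for one provenance search.
    node_latencies = {"node-a": 0.4, "node-b": 0.5, "node-c": 6.2}

    # The coordinator waits for every node before merging, so the user-visible
    # latency is the maximum node latency, not the mean.
    mean = sum(node_latencies.values()) / len(node_latencies)
    print(f"average node latency: {mean:.1f}s")                          # ~2.4s
    print(f"user-visible latency: {max(node_latencies.values()):.1f}s")  # 6.2s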

Best practice:
Use narrow queries by UUID, processor, or timeframe. Avoid “search everything” during peak load.
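
A narrow query scoped to one FlowFile and a tight time window might look like the sketch below. The host name, endpoint path, request schema, and omitted authentication are all assumptions made for illustration (the sketch follows the NiFi-style REST convention of submitting a query and polling a handle); verify each of them against your deployment's API documentation.

    import requests

    # Assumptions to verify for your deployment: a NiFi-style endpoint at
    # /nifi-api/provenance, this request schema, and no authentication
    # (real clusters will require a token or client certificate).
    BASE_URL = "https://clockspring.example.com:8443/nifi-api"  # hypothetical host

    query = {
        "provenance": {
            "request": {
                # Narrow the search: one FlowFile, a two-hour window, a small cap.
                "searchTerms": {"FlowFileUUID": "11111111-2222-3333-4444-555555555555"},
                "startDate": "12/11/2025 18:00:00 UTC",
                "endDate": "12/11/2025 20:00:00 UTC",
                "maxResults": 100,
            }
        }
    }

    # Provenance queries are asynchronous: the POST returns a query handle
    # that is then polled (GET /provenance/<id>) and deleted when finished.
    response = requests.post(f"{BASE_URL}/provenance", json=query, timeout=30)
    response.raise_for_status()
    query_id = response.json()["provenance"]["id"]
    print(f"submitted provenance query {query_id}")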


3. Node Imbalance Impacts Provenance Load

If most FlowFiles are processed on one node (because of scheduling patterns or uneven data volume):

  • That node produces more provenance events

  • Its repository fills faster

  • Searches involving that node take longer

  • Cleanup and rollover happen more often

Cluster performance is healthiest when work is balanced.
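
One quick way to spot this kind of hot spot is to compare each node's provenance event rate against the cluster mean. In the minimal sketch below, the node names and rates are hypothetical, and how you collect the rates depends on your monitoring setup.

    # Hypothetical provenance event rates per node (events per second),
    # e.g. gathered from your monitoring system.
    event_rates = {"node-a": 4200, "node-b": 380, "node-c": 410}

    mean_rate = sum(event_rates.values()) / len(event_rates)
    HOTSPOT_FACTOR = 2.0  # flag nodes producing more than 2x the cluster mean

    for node, rate in sorted(event_rates.items(), key=lambda kv: -kv[1]):
        flag = "  <-- hot spot" if rate > HOTSPOT_FACTOR * mean_rate else ""
        print(f"{node}: {rate:>5} events/s{flag}")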


4. Provenance Retention Affects Startup and Cleanup

Provenance retention is per node. Large or full repositories increase:

  • Cleanup frequency

  • Disk I/O

  • Startup and shutdown times

This is normal behavior, but it becomes most visible on nodes that receive the bulk of the work.

Retention does not need to be identical across environments. Production typically uses shorter retention to reduce load.
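
For a rough sense of per-node footprint, multiply event rate by average event size by the retention window. The sketch below does exactly that; all three inputs are hypothetical and should be replaced with measurements from your own nodes.

    # Hypothetical inputs -- replace with measurements from your own nodes.
    events_per_second = 1500   # provenance events generated on this node
    avg_event_bytes = 700      # average stored event size, including index overhead
    retention_hours = 24       # per-node retention window

    estimated = events_per_second * avg_event_bytes * retention_hours * 3600
    print(f"estimated repository footprint: {estimated / 1024**3:.1f} GiB per node")

With these example numbers the estimate lands around 85 GiB; a busier node hits the same ceiling sooner, which is why cleanup and rollover run more often there.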


5. Best Practices for Provenance in Clusters

Recommended:

  • Keep retention at a level that supports troubleshooting without storing unnecessary history

  • Use narrow searches instead of broad pattern searches

  • Balance work across nodes to prevent repository hot spots

  • Monitor disk usage on each node separately (a minimal check is sketched after this list)

  • Avoid provenance-heavy operations (splits, clones, large debug pipelines) during peak processing
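
For the per-node disk check, a minimal sketch that runs locally on each node (for example from cron) is shown below; the repository path is an assumption and should point at wherever your provenance repository actually lives.

    import shutil
    from pathlib import Path

    # Assumption: the provenance repository lives here -- adjust for your install.
    PROVENANCE_REPO = Path("/opt/clockspring/provenance_repository")

    # There is no shared repository, so every node must be checked on its own.
    usage = shutil.disk_usage(PROVENANCE_REPO)
    percent_used = usage.used / usage.total * 100
    print(f"{PROVENANCE_REPO}: {percent_used:.1f}% used, "
          f"{usage.free / 1024**3:.1f} GiB free")
    if percent_used > 80:
        print("WARNING: provenance volume above 80% -- review retention settings")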

Avoid:

  • Searching provenance constantly during high-volume processing

  • Large queues that make lineage harder to search

  • Debug or noisy processors in production flows


6. Key Takeaways

  • Provenance is local to each node

  • Cluster-wide searches query every node

  • Larger clusters magnify search cost

  • Node imbalance creates hot spots in provenance volume

  • Retention settings affect performance, cleanup, and startup time

This KB focuses only on provenance behavior in clusters. Replay behavior is covered in its own KB.
