When Clockspring runs in a cluster, provenance behaves differently than on a single node. These differences affect performance, search behavior, and how lineage is interpreted. This article explains how provenance works in clustered deployments and what to expect under load.
1. Provenance Is Stored Per Node
Each node in the cluster keeps its own provenance repository. There is no shared or centralized store.
This means:
A FlowFile’s lineage lives on the node that processed it
If a FlowFile moves between nodes, its history is split across repositories
Cluster-wide searches have to ask every node for results
Lineage is still complete, but it is distributed.
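As a mental model (not a Clockspring API), the sketch below shows what "distributed but complete" looks like for one FlowFile that was received on one node and finished on another: each node holds only the events it recorded, and the full lineage is the time-ordered merge of those per-node lists. The event data is invented for illustration.

```python
from datetime import datetime

# Hypothetical events for one FlowFile: received on node-1, then sent across a
# load-balanced connection and finished on node-2. Each node stores only the
# events it generated; neither repository holds the whole story on its own.
events_by_node = {
    "node-1": [
        {"eventTime": "2025-02-18T09:00:01+00:00", "eventType": "RECEIVE", "component": "ListenHTTP"},
        {"eventTime": "2025-02-18T09:00:02+00:00", "eventType": "SEND", "component": "load-balanced connection"},
    ],
    "node-2": [
        {"eventTime": "2025-02-18T09:00:03+00:00", "eventType": "RECEIVE", "component": "load-balanced connection"},
        {"eventTime": "2025-02-18T09:00:05+00:00", "eventType": "DROP", "component": "PutFile"},
    ],
}

def full_lineage(events_by_node):
    """Merge the per-node event lists into one time-ordered lineage."""
    merged = [dict(event, node=node) for node, events in events_by_node.items() for event in events]
    merged.sort(key=lambda e: datetime.fromisoformat(e["eventTime"]))
    return merged

for event in full_lineage(events_by_node):
    print(event["eventTime"], event["node"], event["eventType"], event["component"])
```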
2. Provenance Searches Are More Expensive in a Cluster
When you run a provenance search in the UI:
The coordinator sends the query to all nodes
Each node scans its local provenance repo
Results are merged and shown in the UI
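The sketch below models that fan-out. The query_node() helper is a hypothetical placeholder for the per-node search, not a real API; the point is that the merge step cannot start until every node has answered, so the slowest node sets the pace for the whole search.

```python
import concurrent.futures
import random
import time

NODES = ["node-1", "node-2", "node-3"]

def query_node(node, query):
    """Stand-in for searching one node's local provenance repository."""
    time.sleep(random.uniform(0.1, 2.0))  # simulated per-node search time
    return []  # a real call would return that node's matching events

def cluster_search(query):
    started = time.monotonic()
    # Fan the query out to every node in parallel...
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        per_node = list(pool.map(lambda node: query_node(node, query), NODES))
    # ...but the merged result is only ready once the slowest responder answers:
    # total latency ~= max(per-node latency), not the average.
    merged = [event for events in per_node for event in events]
    merged.sort(key=lambda e: e["eventTime"])
    print(f"search across {len(NODES)} nodes took {time.monotonic() - started:.1f}s")
    return merged

cluster_search({"flowFileUuid": "example"})
```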
As the cluster grows, this becomes more expensive because:
Search cost increases linearly with node count
One slow node delays the entire search
Larger or busier repositories take longer to respond
Frequent searches add disk pressure across the cluster
Best practice:
Use narrow queries by UUID, processor, or timeframe. Avoid “search everything” during peak load.
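For illustration, here is one way a narrow, UUID- and time-bounded query might be submitted over HTTP instead of from the UI. The host, token, endpoint path, and field names are illustrative assumptions; check your deployment's REST API reference for the exact contract before relying on them.

```python
import time
import requests

# All names below (host, path, field names) are illustrative assumptions.
BASE = "https://coordinator.example.com:8443/api"
HEADERS = {"Authorization": "Bearer <access-token>"}

# Narrow query: one FlowFile UUID, a one-hour window, and a capped result count,
# instead of an open-ended "search everything" request.
query = {
    "provenance": {
        "request": {
            "maxResults": 100,
            "searchTerms": {"FlowFileUUID": "6b1a7d2e-4c3f-4a1e-9b2d-1f0e8c7d6a5b"},
            "startDate": "02/18/2025 09:00:00 UTC",
            "endDate": "02/18/2025 10:00:00 UTC",
        }
    }
}

# Provenance searches run asynchronously: submit, poll until finished, then delete
# the query so it does not keep holding resources on every node.
submitted = requests.post(f"{BASE}/provenance", json=query, headers=HEADERS).json()
query_id = submitted["provenance"]["id"]
try:
    while True:
        status = requests.get(f"{BASE}/provenance/{query_id}", headers=HEADERS).json()
        if status["provenance"]["finished"]:
            for event in status["provenance"]["results"]["provenanceEvents"]:
                print(event["eventTime"], event["eventType"], event["componentName"])
            break
        time.sleep(1)
finally:
    requests.delete(f"{BASE}/provenance/{query_id}", headers=HEADERS)
```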
3. Node Imbalance Impacts Provenance Load
If most FlowFiles run on one node (due to scheduling patterns or uneven volume):
That node produces more provenance events
Its repository fills faster
Searches involving that node take longer
Cleanup and rollover happen more often
Cluster performance is healthiest when work is balanced.
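One rough way to spot a provenance hot spot is to compare event volume per node over the same window. The snippet below is a sketch that assumes you already collect a per-node event count (for example from each node's status or monitoring metrics); the numbers are invented.

```python
# Hypothetical provenance event counts per node over the same time window.
events_per_node = {"node-1": 1_800_000, "node-2": 210_000, "node-3": 190_000}

mean = sum(events_per_node.values()) / len(events_per_node)
for node, count in sorted(events_per_node.items(), key=lambda item: item[1], reverse=True):
    skew = count / mean
    flag = "  <-- provenance hot spot" if skew > 1.5 else ""
    print(f"{node}: {count:>10,} events ({skew:.1f}x the cluster mean){flag}")
```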
4. Provenance Retention Affects Startup and Cleanup
Provenance retention is per node. Large or full repositories increase:
Cleanup frequency
Disk I/O
Startup and shutdown times
This is normal behavior, but it is most noticeable on nodes that receive the majority of the work.
Retention does not need to be identical across environments. Production typically uses shorter retention to reduce load.
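To confirm what a given node is actually configured to keep, one option is to read the retention-related keys out of that node's properties file. The sketch below assumes a Java-properties-style configuration file and a conventional install path; the file location and key names vary by version, so treat both as placeholders and check your configuration reference.

```python
from pathlib import Path

# Placeholder path: point this at the node's actual configuration file.
PROPERTIES_FILE = Path("/opt/clockspring/conf/nifi.properties")

def provenance_retention_settings(path):
    """Return the provenance repository keys that bound retention by time or size."""
    settings = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Key names vary by version; match on the repository prefix plus "storage".
        if "provenance.repository" in key and "storage" in key:
            settings[key] = value.strip()
    return settings

for key, value in provenance_retention_settings(PROPERTIES_FILE).items():
    print(f"{key} = {value}")
```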
5. Best Practices for Provenance in Clusters
Recommended:
Keep retention at a level that supports troubleshooting without storing unnecessary history
Use narrow searches instead of broad pattern searches
Balance work across nodes to prevent repository hot spots
Monitor disk usage on each node separately (see the disk-check sketch after these lists)
Avoid heavy provenance loads (splits, clones, large debug pipelines) during peak processing
Avoid:
Searching provenance constantly during high-volume processing
Large queues that make lineage harder to search
Debug or noisy processors in production flows
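Because each repository fills at its own rate, per-node disk checks are more useful than a cluster-wide average. Below is a minimal sketch for one node, assuming the provenance repository sits at a known local directory; the path is a placeholder for the directory configured on that node.

```python
import shutil
from pathlib import Path

# Placeholder path: use the provenance repository directory configured on this node.
PROVENANCE_DIR = Path("/opt/clockspring/provenance_repository")

usage = shutil.disk_usage(PROVENANCE_DIR)
repo_bytes = sum(f.stat().st_size for f in PROVENANCE_DIR.rglob("*") if f.is_file())

print(f"volume holding the repository: {usage.used / usage.total:.0%} used, "
      f"{usage.free / 1024**3:.1f} GiB free")
print(f"provenance repository size on this node: {repo_bytes / 1024**3:.1f} GiB")
```

Run the same check on every node (or wire it into your monitoring agent) rather than only on the coordinator.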
6. Key Takeaways
Provenance is local to each node
Cluster-wide searches query every node
Larger clusters magnify search cost
Node imbalance creates hotspots in provenance volume
Retention settings affect performance, cleanup, and startup time
This KB focuses only on provenance behavior in clusters. Replay behavior is covered in its own KB.