Provenance Behavior in Clockspring Clusters


When Clockspring runs in a cluster, provenance behaves differently than it does on a single node. These differences affect performance, search behavior, and how lineage is interpreted. This article explains how provenance works in clustered deployments and what to expect under load.


1. Provenance Is Stored Per Node

Each node in the cluster keeps its own provenance repository. There is no shared or centralized store.

This means:

  • A FlowFile’s lineage lives on the node that processed it

  • If a FlowFile moves between nodes, its history is split across repositories

  • Cluster-wide searches have to ask every node for results

Lineage is still complete, but it is distributed.
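
To make "complete but distributed" concrete, here is a minimal Python sketch (node names, timestamps, and event types are invented for illustration) that merges per-node event lists into one lineage timeline, which is the shape of the result the UI presents after a cluster-wide search.

    from operator import itemgetter

    # Hypothetical events for one FlowFile, recorded on two different nodes.
    # Each node only knows about the processing it performed locally.
    node_a_events = [
        {"node": "node-a", "timestamp": 1702300000, "type": "RECEIVE"},
        {"node": "node-a", "timestamp": 1702300005, "type": "ROUTE"},
    ]
    node_b_events = [
        {"node": "node-b", "timestamp": 1702300010, "type": "ATTRIBUTES_MODIFIED"},
        {"node": "node-b", "timestamp": 1702300020, "type": "DROP"},
    ]

    # The full lineage is the union of every node's local events, in time order.
    lineage = sorted(node_a_events + node_b_events, key=itemgetter("timestamp"))
    for event in lineage:
        print(f"{event['timestamp']}  {event['node']:7}  {event['type']}")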


2. Provenance Searches Are More Expensive in a Cluster

When you run a provenance search in the UI:

  • The coordinator sends the query to all nodes

  • Each node scans its local provenance repository

  • Results are merged and shown in the UI

As the cluster grows, this becomes more expensive because:

  • Search cost increases linearly with node count

  • One slow node delays the entire search (see the model after this list)

  • Larger or busier repositories take longer to respond

  • Frequent searches add disk pressure across the cluster
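
To see why one slow node dominates, consider this toy scatter-gather model (the latencies are invented): the coordinator cannot merge results until every node has answered, so the latency users see tracks the slowest node, not the average.

    # Hypothetical per-node response times, in seconds, for one provenance search.
    node_latencies = {"node-a": 0.4, "node-b": 0.5, "node-c": 6.2}

    # The coordinator waits for every node before merging, so the user-visible
    # latency is the maximum node latency, not the mean.
    mean = sum(node_latencies.values()) / len(node_latencies)
    print(f"average node latency: {mean:.1f}s")                          # ~2.4s
    print(f"user-visible latency: {max(node_latencies.values()):.1f}s")  # 6.2s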

Best practice:
Use narrow queries by UUID, processor, or timeframe. Avoid “search everything” during peak load.
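
A narrow query scoped to one FlowFile and a tight time window might look like the sketch below. The host name, endpoint path, request schema, and omitted authentication are all assumptions made for illustration (the sketch follows the NiFi-style REST convention of submitting a query and polling a handle); verify each of them against your deployment's API documentation.

    import requests

    # Assumptions to verify for your deployment: a NiFi-style endpoint at
    # /nifi-api/provenance, this request schema, and no authentication
    # (real clusters will require a token or client certificate).
    BASE_URL = "https://clockspring.example.com:8443/nifi-api"  # hypothetical host

    query = {
        "provenance": {
            "request": {
                # Narrow the search: one FlowFile, a two-hour window, a small cap.
                "searchTerms": {"FlowFileUUID": "11111111-2222-3333-4444-555555555555"},
                "startDate": "12/11/2025 18:00:00 UTC",
                "endDate": "12/11/2025 20:00:00 UTC",
                "maxResults": 100,
            }
        }
    }

    # Provenance queries are asynchronous: the POST returns a query handle
    # that is then polled (GET /provenance/<id>) and deleted when finished.
    response = requests.post(f"{BASE_URL}/provenance", json=query, timeout=30)
    response.raise_for_status()
    query_id = response.json()["provenance"]["id"]
    print(f"submitted provenance query {query_id}")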


3. Node Imbalance Impacts Provenance Load

If most FlowFiles are processed on one node (because of scheduling patterns or uneven data volume):

  • That node produces more provenance events

  • Its repository fills faster

  • Searches involving that node take longer

  • Cleanup and rollover happen more often

Cluster performance is healthiest when work is balanced.
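
One quick way to spot this kind of hot spot is to compare each node's provenance event rate against the cluster mean. In the minimal sketch below, the node names and rates are hypothetical, and how you collect the rates depends on your monitoring setup.

    # Hypothetical provenance event rates per node (events per second),
    # e.g. gathered from your monitoring system.
    event_rates = {"node-a": 4200, "node-b": 380, "node-c": 410}

    mean_rate = sum(event_rates.values()) / len(event_rates)
    HOTSPOT_FACTOR = 2.0  # flag nodes producing more than 2x the cluster mean

    for node, rate in sorted(event_rates.items(), key=lambda kv: -kv[1]):
        flag = "  <-- hot spot" if rate > HOTSPOT_FACTOR * mean_rate else ""
        print(f"{node}: {rate:>5} events/s{flag}")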


4. Provenance Retention Affects Startup and Cleanup

Provenance retention is per node. Large or full repositories increase:

  • Cleanup frequency

  • Disk I/O

  • Startup and shutdown times

This is normal behavior, but it becomes most visible on nodes that receive the bulk of the work.

Retention does not need to be identical across environments. Production typically uses shorter retention to reduce load.
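
For a rough sense of per-node footprint, multiply event rate by average event size by the retention window. The sketch below does exactly that; all three inputs are hypothetical and should be replaced with measurements from your own nodes.

    # Hypothetical inputs -- replace with measurements from your own nodes.
    events_per_second = 1500   # provenance events generated on this node
    avg_event_bytes = 700      # average stored event size, including index overhead
    retention_hours = 24       # per-node retention window

    estimated = events_per_second * avg_event_bytes * retention_hours * 3600
    print(f"estimated repository footprint: {estimated / 1024**3:.1f} GiB per node")

With these example numbers the estimate lands around 85 GiB; a busier node hits the same ceiling sooner, which is why cleanup and rollover run more often there.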


5. Best Practices for Provenance in Clusters

Recommended:

  • Keep retention at a level that supports troubleshooting without storing unnecessary history

  • Use narrow searches instead of broad pattern searches

  • Balance work across nodes to prevent repository hot spots

  • Monitor disk usage on each node separately (a minimal check is sketched after this list)

  • Avoid provenance-heavy operations (splits, clones, large debug pipelines) during peak processing
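
For the per-node disk check, a minimal sketch that runs locally on each node (for example from cron) is shown below; the repository path is an assumption and should point at wherever your provenance repository actually lives.

    import shutil
    from pathlib import Path

    # Assumption: the provenance repository lives here -- adjust for your install.
    PROVENANCE_REPO = Path("/opt/clockspring/provenance_repository")

    # There is no shared repository, so every node must be checked on its own.
    usage = shutil.disk_usage(PROVENANCE_REPO)
    percent_used = usage.used / usage.total * 100
    print(f"{PROVENANCE_REPO}: {percent_used:.1f}% used, "
          f"{usage.free / 1024**3:.1f} GiB free")
    if percent_used > 80:
        print("WARNING: provenance volume above 80% -- review retention settings")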

Avoid:

  • Searching provenance constantly during high-volume processing

  • Large queues that make lineage harder to search

  • Debug or noisy processors in production flows


6. Key Takeaways

  • Provenance is local to each node

  • Cluster-wide searches query every node

  • Larger clusters magnify search cost

  • Node imbalance creates hot spots in provenance volume

  • Retention settings affect performance, cleanup, and startup time

This KB focuses only on provenance behavior in clusters. Replay behavior is covered in its own KB.
