Node Loss, Failover, and Recovery in Clockspring

Modified on Thu, 11 Dec, 2025 at 5:20 PM

Summary

Nodes in a Clockspring cluster can go offline due to maintenance, crashes, or network issues. This article explains what happens when a node is lost, how the cluster behaves, how failover works, and what to expect when the node returns. Most importantly: you do not lose FlowFiles during an outage.


1. Types of Node Loss

Clockspring distinguishes between two types of node loss:

A. Graceful Shutdown

The service is intentionally stopped (restart, patching, maintenance).

B. Unexpected Failure

Power loss, OS crash, network drop, hardware issue, or forced kill.


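The cluster handles these two scenarios very differently: a graceful shutdown offloads the node's queued FlowFiles, while an unexpected failure leaves them on the failed node's disk until it returns.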


2. Graceful Shutdown Behavior

When a node is stopped cleanly:

  • Clockspring automatically offloads all queued FlowFiles from that node

  • Remaining nodes take over the work

  • No FlowFiles remain stranded

  • No data is lost

This is the safest and preferred way to remove a node for maintenance. FlowFiles that are being processed when the shutdown begins will continue to be processed.
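The offload step above can be pictured with a small toy model. This is only an illustrative sketch, not Clockspring code: the Node class, the graceful_shutdown function, and the round-robin placement are all assumptions made for the example. The one point it shows is that the cluster-wide FlowFile count is the same before and after a clean stop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not a Clockspring type.
    name: str
    queue: list = field(default_factory=list)  # FlowFiles queued on this node
    connected: bool = True


def graceful_shutdown(cluster, name):
    """Move a node's queued FlowFiles to the remaining nodes, then stop it."""
    leaving = next(n for n in cluster if n.name == name)
    remaining = [n for n in cluster if n.connected and n is not leaving]
    if not remaining:
        raise RuntimeError("no remaining node to take over the work")
    # Round-robin placement is an assumption made for this sketch only.
    for i, flowfile in enumerate(leaving.queue):
        remaining[i % len(remaining)].queue.append(flowfile)
    leaving.queue = []
    leaving.connected = False


cluster = [Node("node-1", ["a", "b", "c"]), Node("node-2", ["d"]), Node("node-3")]
before = sum(len(n.queue) for n in cluster)
graceful_shutdown(cluster, "node-1")
after = sum(len(n.queue) for n in cluster)
print(before, after)  # the total is unchanged; node-1's queue is now empty
```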

Primary Node Behavior

If the node being shut down is the Primary Node:

  • A new primary is automatically elected

  • Primary-only processors resume on the new primary

No manual intervention is needed.


3. Unexpected Node Failure Behavior

If a node goes down without warning:

  • FlowFiles on that node remain on that node’s disk

  • They are not redistributed

  • Other nodes continue processing their own work

  • The UI still shows the total queue count

  • A new primary is elected if needed

No data is lost.

All FlowFiles remain intact on disk until the node returns. When it does, processing of those FlowFiles picks up where it left off.

What cannot happen

  • Offload cannot occur (the node isn’t running)

  • Other nodes cannot access the missing node’s queues

  • FlowFiles cannot be redistributed automatically

This is expected and safe behavior.


4. Recovery: What Happens When the Node Comes Back

When the failed node comes back online:

  • It immediately rejoins the cluster

  • FlowFiles stored on that node become available again

  • Processing resumes exactly where it left off

  • Any primary-only processors rejoin the scheduler

There is no manual fix required unless corruption or disk failure occurred.
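The crash-and-return sequence from sections 3 and 4 can be sketched the same way. Again, this is a toy model under stated assumptions: Node, total_queued, and processable are hypothetical names, and the real UI counters may be computed differently. It illustrates why queue counts stay high while a node is down and why nothing needs to be repaired when it returns.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not a Clockspring type.
    name: str
    queue: list = field(default_factory=list)  # FlowFiles persisted on local disk
    running: bool = True


def total_queued(cluster):
    """Cluster-wide queue count, including FlowFiles held by a down node."""
    return sum(len(n.queue) for n in cluster)


def processable(cluster):
    """FlowFiles that can be worked on right now (running nodes only)."""
    return sum(len(n.queue) for n in cluster if n.running)


cluster = [Node("node-1", ["a", "b"]), Node("node-2", ["c"])]

cluster[0].running = False  # unexpected failure: no offload takes place
during = (total_queued(cluster), processable(cluster))

cluster[0].running = True  # node returns with its queue intact
after = (total_queued(cluster), processable(cluster))

print(during, after)
```

In this model the total never changes; only the share of FlowFiles that can currently be processed drops while node-1 is down.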


5. How Failover Works

Primary Node Failover

If the primary node goes down:

  • A new primary is selected

  • Processors configured for “Primary Node Only” start running on the new primary

Processor Execution Mode

Execution mode does not move FlowFiles.
If a processor was running on one node:

  • It stops when that node stops

  • Its work resumes only when the node returns (unless another node was also running it)
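A minimal sketch of the two ideas in this section, assuming a simplified election rule (the first available node by name) that is not taken from Clockspring itself. The function names and the "PRIMARY" / "ALL" labels are hypothetical. It shows that a primary-only processor follows the primary role, while FlowFiles are not part of the picture at all.

```python
def elect_primary(nodes, current=None):
    """Keep the current primary if it is still up; otherwise pick a new one."""
    up = sorted(name for name, is_up in nodes.items() if is_up)
    if not up:
        return None
    if current in up:
        return current
    return up[0]  # election rule is an assumption made for this sketch


def where_it_runs(execution, nodes, primary):
    """Nodes on which a processor is scheduled, given its execution setting."""
    if execution == "PRIMARY":
        return [primary] if primary else []
    return sorted(name for name, is_up in nodes.items() if is_up)


nodes = {"node-1": True, "node-2": True, "node-3": True}
primary = elect_primary(nodes)           # node-1 under the assumed rule

nodes["node-1"] = False                  # the primary node goes down
primary = elect_primary(nodes, primary)  # a new primary is selected

print(where_it_runs("PRIMARY", nodes, primary))
print(where_it_runs("ALL", nodes, primary))
```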


6. When You Should Take Action

You only need to intervene if:

A node will NOT return

You must manually offload its queues (on each affected connection) to avoid leaving stranded FlowFiles.
Otherwise, simply removing the node will orphan that data.

The node is stuck in a bad health state

Try:

  • restarting Clockspring

  • rebooting the host

  • checking local disk space and repository health

Work is unevenly distributed and won’t self-correct

This is usually a design issue, not a cluster issue.
The fix is to add load balancing at the appropriate connection.
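The three situations above can be condensed into a small decision helper, for example as part of a runbook script. This is a hypothetical sketch: the function name, its parameters, and the order in which the checks are applied are assumptions, and the returned strings simply restate the guidance in this section.

```python
def recommended_action(node_will_return, health_ok=True, uneven_distribution=False):
    """Map the situations in this section to the suggested response."""
    if not node_will_return:
        return "manually offload the node's queues on each affected connection"
    if not health_ok:
        return "restart Clockspring, reboot the host, check disk space and repository health"
    if uneven_distribution:
        return "add load balancing at the appropriate connection"
    return "no action needed"


# A crashed node that is expected back needs no intervention.
print(recommended_action(node_will_return=True))
```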


7. What You Should NOT Do

  • Do not repeatedly offload just because a node is slow

  • Do not offload to “rebalance” workload; it won’t help

  • Do not panic when queue counts remain high with a node down

  • Do not manually delete FlowFile repository files

  • Do not assume data is lost; it is not

Clockspring’s cluster model is built to survive node loss without losing work.


8. Practical Summary

  • Graceful shutdown = automatic drain, no manual work

  • Crash = FlowFiles stay on that node until it returns

  • No data is lost

  • Primary failover happens automatically

  • Manual offload is only required if a node will NOT return

  • For even distribution, use load balancing, not offload


Related Articles

  • How a Clockspring Cluster Works

  • Queue Load Balancing

  • Offloading Queues

  • Execution Node: All Nodes vs Primary Node

  • Designing Cluster-Safe Flows
