Node Loss, Failover, and Recovery in Clockspring

Modified on Thu, 11 Dec, 2025 at 5:20 PM

Summary

Nodes in a Clockspring cluster can go offline due to maintenance, crashes, or network issues. This article explains what happens when a node is lost, how the cluster behaves, how failover works, and what to expect when the node returns. Most importantly: you do not lose FlowFiles during an outage.

1. Types of Node Loss

Clockspring distinguishes between two types of node loss:

A. Graceful Shutdown

The service is intentionally stopped (restart, patching, maintenance).

B. Unexpected Failure

Power loss, OS crash, network drop, hardware issue, or forced kill.

These two scenarios behave very differently.

2. Graceful Shutdown Behavior

When a node is stopped cleanly:

Clockspring automatically offloads all queued FlowFiles from that node
Remaining nodes take over the work
No FlowFiles remain stranded
No data is lost

This is the safest and preferred way to remove a node for maintenance. FlowFiles that are being processed at the time of shutdown continue to be processed.

Primary Node Behavior

If the node being shut down is the Primary Node:

A new primary is automatically elected
Primary-only processors resume on the new primary

No manual intervention is needed.

3. Unexpected Node Failure Behavior

If a node goes down without warning:

FlowFiles on that node remain on that node’s disk
They are not redistributed
Other nodes continue processing their own work
The UI still shows the total queue count
A new primary is elected if needed

No data is lost.

All FlowFiles remain intact on disk until the node returns. When the node returns, processing of those FlowFiles picks up where it left off.

What cannot happen

Offload cannot occur (the node isn’t running)
Other nodes cannot access the missing node’s queues
FlowFiles cannot be redistributed automatically

This is expected and safe behavior.
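The difference between a clean stop and a crash can be sketched with a toy model. This is only an illustration of the behavior described above, not Clockspring code: the `Node` and `Cluster` classes, their method names, and the round-robin offload are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    queued: int = 0      # FlowFiles held on this node's local disk
    online: bool = True


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)

    def _get(self, name):
        return next(n for n in self.nodes if n.name == name)

    def stored(self):
        """FlowFiles that exist anywhere in the cluster, online or not."""
        return sum(n.queued for n in self.nodes)

    def processable(self):
        """FlowFiles that an online node can work on right now."""
        return sum(n.queued for n in self.nodes if n.online)

    def graceful_stop(self, name):
        """Clean stop: offload the queue to the remaining nodes first."""
        node = self._get(name)
        peers = [n for n in self.nodes if n.online and n is not node]
        if not peers:
            raise RuntimeError("no online node left to receive the offload")
        # Round-robin is an arbitrary choice for this toy model.
        for i in range(node.queued):
            peers[i % len(peers)].queued += 1
        node.queued = 0
        node.online = False

    def crash(self, name):
        """Unexpected failure: the queue stays on the dead node's disk."""
        self._get(name).online = False

    def recover(self, name):
        """Node returns: its FlowFiles become processable again."""
        self._get(name).online = True


if __name__ == "__main__":
    cluster = Cluster([Node("node-1", 10), Node("node-2", 10), Node("node-3", 10)])
    cluster.crash("node-3")
    print(cluster.stored(), cluster.processable())   # 30 20
    cluster.recover("node-3")
    print(cluster.stored(), cluster.processable())   # 30 30
```

In this model a crash never changes `stored()`, only `processable()`. That is why a node outage delays work without losing it.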

4. Recovery: What Happens When the Node Comes Back

When the failed node comes back online:

It immediately rejoins the cluster
FlowFiles stored on that node become available again
Processing resumes exactly where it left off
Any primary-only processors rejoin the scheduler

No manual fix is required unless corruption or a disk failure occurred.

5. How Failover Works

Primary Node Failover

If the primary node goes down:

A new primary is selected
Processors configured for “Primary Node Only” start running on the new primary
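A minimal sketch of the failover rule above. The article does not describe how Clockspring chooses the new primary, so `elect_primary` uses a placeholder rule (lowest node name). Both function names and the `"primary"` / `"all"` labels are hypothetical.

```python
def elect_primary(online_nodes, current=None):
    """Keep the current primary if it is still online, otherwise pick a new one.

    The selection rule (lowest name) is a stand-in for illustration only.
    """
    if current in online_nodes:
        return current
    if not online_nodes:
        return None
    return min(online_nodes)


def nodes_running(execution, online_nodes, primary):
    """Return the set of nodes on which a processor is scheduled."""
    if execution == "primary":
        # "Primary Node Only" processors run on exactly one node.
        return {primary} if primary is not None else set()
    # Every other processor runs on each online node.
    return set(online_nodes)


if __name__ == "__main__":
    online = {"node-1", "node-2", "node-3"}
    primary = elect_primary(online)
    print(primary, nodes_running("primary", online, primary))

    online.discard("node-1")  # the primary goes down
    primary = elect_primary(online, current="node-1")
    print(primary, nodes_running("primary", online, primary))
```

The second call shows the primary-only processor moving to the newly elected node while the scheduling of all-node processors simply shrinks to the surviving nodes.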

Processor Execution Mode

Execution mode does not move FlowFiles.
If a processor was running on one node:

It stops when that node stops
Its work resumes only when the node returns (unless another node was also running it)

6. When You Should Take Action

You only need to intervene if:

A node will NOT return

You must manually offload its queues (on each affected connection) to avoid leaving stranded FlowFiles.
Otherwise, simply removing the node will orphan that data.

The node is stuck in a bad health state

Try:

restarting Clockspring
rebooting the host
checking local disk space and repo health
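For the disk space check, a short script along these lines may help. It uses only Python's standard `shutil.disk_usage`. The 10% threshold and the path are assumptions chosen for the example; substitute the directories where your installation keeps its repositories.

```python
import shutil


def disk_report(path, min_free_fraction=0.10):
    """Report free space on the filesystem that holds `path`.

    `min_free_fraction` is an illustrative threshold, not a Clockspring setting.
    """
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_bytes": usage.free,
        "free_fraction": free_fraction,
        "low": free_fraction < min_free_fraction,
    }


if __name__ == "__main__":
    # "." is a placeholder; point this at each repository directory.
    report = disk_report(".")
    status = "LOW" if report["low"] else "ok"
    print(f"{report['path']}: {report['free_fraction']:.1%} free ({status})")
```

This checks free space only. It says nothing about repository corruption, which still needs to be investigated separately.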

Work is unevenly distributed and won’t self-correct

This is usually a design issue, not a cluster issue.
The fix is to add load balancing at the appropriate connection.
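To illustrate why load balancing at a connection evens out the work, here is a toy comparison. It assumes all FlowFiles originate on a single node and uses round-robin as the balancing strategy. Both are assumptions for the example, and the `distribute` function is hypothetical rather than part of Clockspring.

```python
from itertools import cycle


def distribute(flowfile_count, nodes, source, balanced):
    """Return how many FlowFiles each node ends up holding.

    Without balancing, everything stays on the node that produced it.
    With balancing, FlowFiles are handed out round-robin across all nodes.
    """
    counts = {node: 0 for node in nodes}
    if not balanced:
        counts[source] = flowfile_count
        return counts
    for _, node in zip(range(flowfile_count), cycle(nodes)):
        counts[node] += 1
    return counts


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    print(distribute(9, nodes, source="node-1", balanced=False))
    # {'node-1': 9, 'node-2': 0, 'node-3': 0}
    print(distribute(9, nodes, source="node-1", balanced=True))
    # {'node-1': 3, 'node-2': 3, 'node-3': 3}
```

The unbalanced case is a property of the flow design, which is why offloading does not correct it: new FlowFiles keep arriving on the same node.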

7. What You Should NOT Do

Do not repeatedly offload just because a node is slow
Do not offload to “rebalance” workload; it won’t help
Do not panic when queue counts remain high with a node down
Do not manually delete FlowFile repository files
Do not assume data is lost; it is not

Clockspring’s cluster model is built to survive node loss without losing work.

8. Practical Summary

Graceful shutdown = automatic drain, no manual work
Crash = FlowFiles stay on that node until it returns
No data is lost
Primary failover happens automatically
Manual offload is only required if a node will NOT return
For even distribution, use load balancing, not offload
