Queue Load Balancing in Clockspring

Modified on Thu, 11 Dec, 2025 at 4:42 PM

Summary

Queue load balancing distributes FlowFiles across nodes in a cluster. When enabled on a connection, any FlowFile that enters that queue is redistributed according to the selected strategy. This improves parallel processing but must be used intentionally to avoid unnecessary overhead.


1. What Load Balancing Does

Load balancing is configured on a queue, not on a processor.

When a FlowFile enters a load-balanced queue:

  • It may be transferred to another node

  • Assignment depends on the chosen strategy

  • Downstream processors on all nodes can process the work

Load balancing (LB) affects only FlowFiles that enter the queue after it is enabled.


2. Why Use Load Balancing

Use it when you want:

  • True parallel processing on multiple nodes

  • Even distribution of heavy workloads

  • To fan out work from a Primary-only processor

  • To avoid overloading a single node


3. Load Balancing Strategies

The three strategies and how they behave:


A. Round Robin

Distributes FlowFiles evenly across nodes in sequence.

Good for:

  • General parallelism

  • CPU-heavy steps

  • Simple even distribution
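
For intuition, here is a minimal sketch of how round-robin assignment spreads work across a hypothetical three-node cluster. The node names, FlowFile identifiers, and the round_robin_assignments helper are illustrative assumptions, not Clockspring internals.

    # Conceptual sketch of round-robin distribution (illustrative only; the
    # node names and FlowFile identifiers are made up, not Clockspring internals).
    nodes = ["node-1", "node-2", "node-3"]

    def round_robin_assignments(flowfiles, nodes):
        # Send each FlowFile to the next node in sequence, wrapping around.
        return {ff: nodes[i % len(nodes)] for i, ff in enumerate(flowfiles)}

    flowfiles = [f"flowfile-{n}" for n in range(6)]
    for ff, node in round_robin_assignments(flowfiles, nodes).items():
        print(ff, "->", node)
    # flowfile-0 -> node-1, flowfile-1 -> node-2, flowfile-2 -> node-3,
    # then the cycle repeats, so each node receives an equal share.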


B. Partition by Attribute

Groups FlowFiles by the value of a chosen attribute and routes every FlowFile with the same value to the same node.


Good for:

  • Ensuring related data stays together

  • Avoiding cross-node contention

  • Customer/account/order-level partitioning


Examples: partition by customer_id, account_id, order_id
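
One way to picture this strategy is as a hash of the attribute value mapped to a node, so every FlowFile with the same value lands on the same node. The sketch below uses that idea for illustration only; the actual hash and node selection are internal to Clockspring, and the node names and node_for helper are assumptions.

    # Conceptual sketch of partition-by-attribute (illustrative only; the
    # real hash function and node selection are internal to Clockspring).
    import hashlib

    nodes = ["node-1", "node-2", "node-3"]

    def node_for(attribute_value, nodes):
        # Hash the attribute value deterministically, then map it to a node.
        digest = hashlib.sha256(attribute_value.encode("utf-8")).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    for customer_id in ["cust-001", "cust-002", "cust-001", "cust-003"]:
        print(customer_id, "->", node_for(customer_id, nodes))
    # Both "cust-001" FlowFiles map to the same node, so related data stays
    # together while different customers still spread across the cluster.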


C. Single Node

All FlowFiles entering the queue go to one node.

Important details:

  • You cannot choose which node

  • Clockspring decides internally

  • Still useful when you want all downstream work local to a single node

Common use cases:

  • Merging FlowFiles

  • Ordering-sensitive operations

  • Deduplication

  • Anything requiring “everything in one place”

This behaves like an implicit “gather step.”
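
Continuing the same illustrative model, a sketch of the single-node strategy: the cluster picks one node (you cannot influence which) and every FlowFile in the queue is delivered there, which is what makes it useful as a gather step. The chosen node below is arbitrary and stands in for Clockspring's internal decision.

    # Conceptual sketch of the Single Node strategy (illustrative only; the
    # node choice below is arbitrary, since Clockspring decides internally).
    nodes = ["node-1", "node-2", "node-3"]
    chosen = nodes[0]  # stands in for the cluster's internal choice

    def single_node_assignments(flowfiles, chosen_node):
        # Every FlowFile goes to the same node, regardless of where it came from.
        return {ff: chosen_node for ff in flowfiles}

    print(single_node_assignments([f"flowfile-{n}" for n in range(4)], chosen))
    # All four FlowFiles end up on one node, so a downstream merge or
    # deduplication processor sees the complete set locally.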


4. When Load Balancing Applies

Load balancing affects every FlowFile that enters the queue, regardless of how it was created.

Examples:

  • From GenerateFlowFile

  • From API responses

  • From splits/merges

  • From any processor upstream

  • From Primary-only or All-Nodes execution

What load balancing does not do:

  • Does not move FlowFiles that were already in the queue before LB was enabled

  • Does not rebalance FlowFiles sitting inside processors


5. Avoid Load Balancing on Multiple Consecutive Queues

Enabling LB on several consecutive queues can hurt performance.


Why:

  • Each redistribution costs network bandwidth

  • It increases CPU workload from serialization, compression, and deserialization

  • It provides no benefit after the first balancing step

  • It can cause unnecessary churn and slow down the entire flow

Practical rule:

Load balance once, at the point where parallelism is needed. Do not add LB to every queue.
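
A back-of-the-envelope sketch of why consecutive load-balanced queues add cost: with round robin on an n-node cluster, roughly (n-1)/n of the data crosses the network at each balanced queue, so every extra LB hop repeats that cost without improving the balance. The cluster size and data volume below are made-up numbers.

    # Rough cost model for consecutive load-balanced queues (illustrative
    # assumptions: 3-node cluster, 10 GB of FlowFile content per run).
    nodes = 3
    data_gb = 10.0

    # With round robin, about (nodes - 1) / nodes of the data leaves the
    # node it currently sits on at each load-balanced queue.
    per_hop_gb = data_gb * (nodes - 1) / nodes

    for lb_queues in [1, 2, 3]:
        total_gb = per_hop_gb * lb_queues
        print(f"{lb_queues} load-balanced queue(s): ~{total_gb:.1f} GB over the network")
    # 1 queue  : ~6.7 GB  (needed to spread the work)
    # 2 queues : ~13.3 GB (the extra hop adds cost but no better balance)
    # 3 queues : ~20.0 GB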


6. Execution Mode vs Load Balancing

Execution mode (All Nodes vs Primary Node) does not affect load balancing.

Key points:

  • A Primary-only processor can send to a load-balanced queue

  • An All-Nodes processor can send to a non-balanced queue

  • Execution mode controls where the processor runs

  • Load balancing controls where FlowFiles go next

These behaviors are independent.


7. Compression

When enabling load balancing, you can choose whether to compress FlowFiles in transit.

Benefits:

  • Reduces network transfer size

  • Helpful for large FlowFiles

Tradeoffs:

  • Adds CPU overhead to compress and decompress

  • Slows down overall throughput for small FlowFiles

  • Not helpful for already-compressed data (ZIP, PDF, images)

Rule of thumb:
Use compression when your bottleneck is network bandwidth, not CPU.
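
To make the rule of thumb concrete, here is a small sketch comparing transfer time with and without compression. The bandwidth, compression ratio, and compression throughput are illustrative assumptions, not measured Clockspring figures.

    # Illustrative break-even estimate for in-transit compression.
    # All numbers below are assumptions made for the sake of the example.
    size_mb = 500.0          # FlowFile content to transfer
    bandwidth_mb_s = 50.0    # available network throughput
    ratio = 0.4              # compressed size / original size (text-like data)
    compress_mb_s = 150.0    # CPU-bound compression + decompression rate

    uncompressed_s = size_mb / bandwidth_mb_s
    compressed_s = size_mb / compress_mb_s + (size_mb * ratio) / bandwidth_mb_s

    print(f"uncompressed: {uncompressed_s:.1f} s, compressed: {compressed_s:.1f} s")
    # With a slow network and compressible data, compression wins (~7.3 s vs
    # 10 s here). If the data is already compressed (ratio near 1.0) or the
    # network is fast, the CPU time spent compressing only adds overhead.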


8. When Not to Load Balance

Avoid load balancing when:

  • You require strict local processing

  • You need node-specific behavior

  • Ordering is critical

  • You plan to merge downstream

  • You don’t truly need multi-node parallelism

  • You're unsure — adding LB by habit is a common mistake


Related Articles

  • How a Clockspring Cluster Works

  • Execution Node: All Nodes vs Primary Node

  • Offloading Queues

  • Node Loss, Failover, and Recovery

  • Designing Cluster-Safe Flows
