ZooKeeper Ensemble Without Quorum Causes Clockspring Startup / Cluster Registration Failure

Modified on Tue, 23 Jun at 11:02 AM

Summary

When ZooKeeper is configured as a multi-node ensemble, Clockspring cannot successfully connect and register with ZooKeeper unless the ZooKeeper ensemble has quorum.

In a standard 3-node ZooKeeper ensemble, at least 2 ZooKeeper nodes must be online and participating. If only 1 of the 3 configured ZooKeeper nodes is running, ZooKeeper will not have quorum. In that state, Clockspring may be able to reach the ZooKeeper port, but ZooKeeper is not actually available to service cluster coordination requests.

This commonly appears during Clockspring startup as a leader election warning or a `ConnectionLoss` error in `application.log`.

Why ZooKeeper Requires a Majority

ZooKeeper uses a quorum model to make sure the ensemble agrees on cluster state.

For ZooKeeper to safely serve requests, more than half of the configured voting nodes must be available. This prevents two separated groups of ZooKeeper nodes from each acting as the valid ensemble at the same time (a split-brain condition).

In a 3-node ensemble, the majority is 2 nodes:

3 configured ZooKeeper nodes -> 2 required for quorum
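The majority rule can be written as a small calculation. The following is a minimal shell sketch, not part of Clockspring or ZooKeeper; the function names `quorum_size` and `has_quorum` are hypothetical helpers introduced here for illustration.

```shell
# Majority needed for an ensemble of N voting members: floor(N / 2) + 1.
# (quorum_size is a hypothetical helper name, not a ZooKeeper command.)
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) when the reachable members form a majority.
# Arguments: <configured members> <reachable members>
has_quorum() {
  configured=$1
  reachable=$2
  [ "$reachable" -ge "$(quorum_size "$configured")" ]
}

# Print the majority for a few common ensemble sizes.
for n in 1 3 5; do
  echo "$n configured -> $(quorum_size "$n") required for quorum"
done
```

For example, `has_quorum 3 2` succeeds and `has_quorum 3 1` fails, which matches the 3-node case described in this article.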

If only 1 of the 3 nodes is running or reachable, that node cannot know whether it is truly alone or whether it has been separated from the rest of the ensemble by a network problem. To avoid serving inconsistent cluster state, a ZooKeeper node that is not part of a quorum does not serve client requests.

That is why a single running ZooKeeper node is not enough when ZooKeeper is configured with 3 `server.X` entries.

For Clockspring, the result is simple: without ZooKeeper quorum, Clockspring cannot reliably perform cluster registration or leader election.

Symptoms

During Clockspring startup, `application.log` may show warnings similar to:

WARN [main] o.a.n.f.c.l.z.CuratorLeaderElectionManager Unable to determine the Elected Leader for role 'Cluster Coordinator'; assuming no leader has been elected

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator

The ZooKeeper log may show:

WARN [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@391] - Close of session 0x0

java.io.IOException: ZooKeeperServer not running

Root Cause

This usually means ZooKeeper does not have quorum.

For Clockspring clustered deployments, ZooKeeper is typically configured with a 3-node ensemble by default, using entries similar to:

server.1=<zookeeper-node-1>:2888:3888

server.2=<zookeeper-node-2>:2888:3888

server.3=<zookeeper-node-3>:2888:3888

When ZooKeeper is configured this way, it expects to participate as part of a multi-node ensemble. A single running ZooKeeper node is not enough.

If only 1 ZooKeeper node is running, ZooKeeper does not have quorum. Without quorum, ZooKeeper is not fully available, and Clockspring cannot use it for cluster coordination, leader election, or node registration.

The important point is this:

A ZooKeeper process being up does not mean the ZooKeeper ensemble is healthy. If the ensemble lacks quorum, Clockspring cannot use it.

Why This Breaks Clockspring

Clockspring relies on ZooKeeper for clustered coordination. During startup, Clockspring attempts to connect to ZooKeeper and participate in cluster leader election.

If ZooKeeper does not have quorum, Clockspring cannot reliably create or read the expected coordination paths, such as:

/nifi/leaders/Cluster Coordinator

As a result, Clockspring logs a `ConnectionLossException` and reports that it cannot determine the elected leader.

This is usually a ZooKeeper ensemble availability issue, not a Clockspring application issue.

How to Confirm

Check the ZooKeeper configuration and determine how many `server.X` entries are configured.

Example:

server.1=<zookeeper-node-1>:2888:3888

server.2=<zookeeper-node-2>:2888:3888

server.3=<zookeeper-node-3>:2888:3888

Then confirm how many of those ZooKeeper nodes are actually running and joined to the ensemble.
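As a rough sketch of the first part of this check, the shell helpers below count the `server.X` entries in a ZooKeeper configuration file and list their hostnames. The function names are hypothetical, and the configuration file path is an assumption that must be supplied for your install. The host parsing assumes `host:port:port` entries with hostnames or IPv4 addresses.

```shell
# Count the server.N entries in a ZooKeeper configuration file.
# $1 is the path to the config file; its location is install-specific
# and is an assumption you must fill in.
count_zk_servers() {
  grep -c -E '^[[:space:]]*server\.[0-9]+[[:space:]]*=' "$1"
}

# List the host part of each server.N entry, one per line.
# Commented-out entries (lines starting with #) are ignored.
list_zk_hosts() {
  grep -E '^[[:space:]]*server\.[0-9]+[[:space:]]*=' "$1" |
    sed -E 's/^[^=]*=[[:space:]]*//; s/:.*$//'
}

# Usage (the path is hypothetical):
#   count_zk_servers /path/to/zoo.cfg
#   list_zk_hosts /path/to/zoo.cfg
```

The count gives the configured ensemble size. The number of those hosts that are actually running and reachable still has to be confirmed on each node.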

The clearest sign of this issue is usually in the ZooKeeper logs under:

/opt/zookeeper/logs

When starting one ZooKeeper node while the other configured ensemble members are down or unreachable, the running node may log warnings like:

WARN [QuorumConnectionThread-[myid=1]-1:o.a.z.s.q.QuorumCnxManager@401] - Cannot open channel to 2 at election address cluster2.example.com/10.x.x.x:3888

java.net.ConnectException: Connection refused

And:

WARN [QuorumConnectionThread-[myid=1]-2:o.a.z.s.q.QuorumCnxManager@401] - Cannot open channel to 3 at election address cluster3.example.com/10.x.x.x:3888

java.net.ConnectException: Connection refused

You may also see ZooKeeper repeatedly attempting leader election:

INFO [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@997] - Notification time out: 400 ms

These messages mean the local ZooKeeper node is trying to contact the other configured ensemble members but cannot reach them on the election port, usually `3888`.
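To summarize these warnings quickly, a sketch like the following can list which peer IDs the local node could not reach. It assumes the log lines match the `Cannot open channel to <id>` format shown above. `unreachable_peers` is a hypothetical helper name, and the log file name under `/opt/zookeeper/logs` varies by install, so it is passed as an argument.

```shell
# Print the distinct peer ids this node failed to reach on the election
# port, based on "Cannot open channel to <id>" warnings in a log file.
# $1 is the path to a ZooKeeper log file (name is install-specific).
unreachable_peers() {
  grep -o 'Cannot open channel to [0-9][0-9]*' "$1" |
    awk '{ print $5 }' |
    sort -un
}

# Usage (the file name is hypothetical):
#   unreachable_peers /opt/zookeeper/logs/<zookeeper-log-file>
```

If the output lists every other member of a 3-node ensemble, the local node is alone and the ensemble cannot have quorum.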

In a 3-node ensemble, this is enough to explain the Clockspring startup failure. One ZooKeeper node may be running and listening on port `2181`, but without at least one additional ZooKeeper node available, the ensemble does not have quorum.

For a 3-node ensemble:

Configured ZooKeeper Nodes	Running / Reachable ZooKeeper Nodes	Quorum?	Expected Result
3	3	Yes	Healthy
3	2	Yes	Healthy enough to operate
3	1	No	Clockspring cannot register reliably
3	0	No	ZooKeeper unavailable

The key point is that a local ZooKeeper process can be running but still not be usable by Clockspring. If the logs show that ZooKeeper cannot open election channels to the other configured nodes, and fewer than a majority of the ensemble members are reachable, ZooKeeper does not have quorum.

In that state, Clockspring may show `ConnectionLossException` errors because ZooKeeper is not available for leader election or cluster registration.

Resolution

Start enough ZooKeeper nodes to establish quorum and confirm that the ZooKeeper nodes can communicate with each other.

For the default 3-node Clockspring ZooKeeper ensemble, at least 2 of the 3 ZooKeeper nodes must be running and able to reach each other.

This is important: it is not enough for the ZooKeeper service to be running locally. The ZooKeeper nodes must also be able to communicate with the other configured ensemble members.

In a typical ZooKeeper ensemble, the relevant ports are:

2181 - Client connections from Clockspring to ZooKeeper

2888 - ZooKeeper peer communication (followers connect to the leader on this port)

3888 - ZooKeeper leader election

If firewall rules, host firewalls, network ACLs, routing, or security groups block communication between ZooKeeper nodes on the quorum/election ports, or if the configured hostnames do not resolve correctly, ZooKeeper may fail to establish quorum even though the local ZooKeeper process is running.

For a 3-node ensemble, confirm the following:

Clockspring nodes can reach ZooKeeper on port 2181.

ZooKeeper nodes can reach each other on port 2888.

ZooKeeper nodes can reach each other on port 3888.

At least 2 of the 3 configured ZooKeeper nodes are running and reachable.
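The peer connectivity checks in the list above could be scripted roughly as follows. This is a sketch under assumptions: it relies on `nc` supporting `-z` and `-w` (true for common netcat variants, but worth verifying on your hosts), and `check_port` and `check_zk_peers` are hypothetical helper names. In stock ZooKeeper the quorum port `2888` is typically accepted only by the current leader, so a refused connection on `2888` to a follower does not by itself indicate a firewall problem. The election port `3888` is the more reliable one to probe on every running node.

```shell
# Probe one TCP port. Assumes nc supports -z (connect without sending data)
# and -w (timeout in seconds); adjust for your netcat variant.
check_port() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Probe the same port on each host and report the result per host.
# Arguments: <port> <host> [<host> ...]
# Returns success only if every host accepted the connection.
check_zk_peers() {
  port=$1
  shift
  failures=0
  for host in "$@"; do
    if check_port "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Usage from one ZooKeeper node (host names are placeholders):
#   check_zk_peers 3888 <zookeeper-node-2> <zookeeper-node-3>
```

Run it from each ZooKeeper node against the other members, since a block in one direction only is enough to prevent quorum.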

Once quorum is established, Clockspring should be able to connect to ZooKeeper, participate in leader election, and complete cluster registration.

Additional ZooKeeper Health Check

You can also verify whether a ZooKeeper node is actually serving requests by running the `srvr` four-letter command against the local ZooKeeper client port.

Run this from one of the ZooKeeper nodes:

echo srvr | nc localhost 2181

If ZooKeeper is running but does not have quorum, or is otherwise not ready to serve requests, you may see:

This ZooKeeper instance is not currently serving requests

That means ZooKeeper is not healthy from a client perspective. Even if the ZooKeeper process is running and port `2181` is open, Clockspring will not be able to use that ZooKeeper node for cluster coordination while it is in this state.

If ZooKeeper is healthy and serving requests, the command should return ZooKeeper server details similar to:

Zookeeper version: <version>, built on <date>

Latency min/avg/max: 0/0.0/0

Received: 1

Sent: 0

Connections: 1

Outstanding: 0

Zxid: 0x40000018a

Mode: follower

Node count: 11

The important indicators are:

Zookeeper version: ...

Mode: follower

or:

Zookeeper version: ...

Mode: leader

If the command returns the ZooKeeper version and a mode of `leader` or `follower`, the ZooKeeper node is serving requests.

If the command returns:

This ZooKeeper instance is not currently serving requests

then ZooKeeper is not ready, and Clockspring should not be expected to register successfully.

This check is more useful than only confirming that the ZooKeeper process is running. A running process does not prove that ZooKeeper has quorum or is able to serve Clockspring requests.
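For scripted health checks, the `srvr` output can be classified with a small filter. This is a minimal sketch that assumes the output format shown above, where a serving node prints a `Mode:` line. `zk_srvr_state` is a hypothetical helper name.

```shell
# Read `srvr` output on stdin and report the node's state.
# Prints "serving:<mode>" and returns 0 when a Mode line is present;
# otherwise prints "not-serving" and returns 1.
zk_srvr_state() {
  mode=$(sed -n 's/^Mode:[[:space:]]*//p' | head -n 1)
  if [ -n "$mode" ]; then
    echo "serving:$mode"
  else
    echo "not-serving"
    return 1
  fi
}

# Usage on a ZooKeeper node:
#   echo srvr | nc localhost 2181 | zk_srvr_state
```

Because the function returns a non-zero status when no `Mode:` line is found, it can be used directly in a startup or monitoring script to hold off starting Clockspring until ZooKeeper is serving requests.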

Important Notes

Do not stop troubleshooting after confirming that Clockspring can reach ZooKeeper on port `2181`.

That only proves that something is listening on the client port. It does not prove that ZooKeeper is healthy or that the ensemble has quorum.

For this issue, also confirm:

At least 2 of the 3 ZooKeeper nodes are running.

The ZooKeeper nodes can reach each other on ports 2888 and 3888.

Each ZooKeeper node returns a valid response to `echo srvr | nc localhost 2181`.

If ZooKeeper is configured as a 3-node ensemble and only one node is available, Clockspring should be expected to fail cluster registration. The ZooKeeper process may be running, but the ensemble is not in a usable state.

Bottom Line

If ZooKeeper is configured as a 3-node ensemble, one running ZooKeeper node is not enough.

Clockspring requires ZooKeeper to be operational, and ZooKeeper requires quorum. With 3 configured ZooKeeper nodes, at least 2 must be running and joined to the ensemble before Clockspring can successfully register and participate in clustering.