SSV Into The Future: Asyncrounous BFT Protocols

Discover how SSV uses QBFT and the limitations it presents. Learn about Asynchronous BFT protocols and how they can improve SSV.

Thank you Matheus Franco for helping review and write this article

For an intro on SSV’s design see here
SSV currently uses QBFT as its consensus protocol. QBFT is a partially synchronous BFT protocol which works in rounds of communication until all nodes agree on a single “thing”.
Every round, a leader is selected. A faulty leader is replaced in the next round if it fails to propose a valid “thing”.
If a valid “thing” is proposed but no quorum of honest operators approves it, a new round with a different leader starts.
Every round has a timeout trigger which moves the inner state to the next round.

Partially synchronous BFT protocols are great when a leader is honest and responsive but become slower when the round’s leader is not. This can translate to lower performance in an SSV setup.
A potential solution to the above is a different breed of consensus protocols called Asynchronous BFT (i.e. HoneyBadger, Alea, etc.). Async BFT protocols are leaderless and can reach consensus as fast as network latency allows, they can make SSV faster and less vulnerable to faulty leaders/ sporadic leader latencies.

A generalized async BFT protocol including a reliable broadcast stage followed by a binary agreement stage

Defining BFT network models

BFT protocols have largely 3 Communication model (how many messages sent and to whom)

  • Synchronous — in which all messages arrive within T seconds or do not arrive at all. T is known
  • Partially synchronous — in which all messages arrive within T seoncds or do not arrive at all. T is unknown
  • Asynchronous — No assumption about T is made (you can’t distinguish between a late message and a message that was never sent)

The above communication models influence how a BFT protocol is designed, what types of faults it needs to overcome. In general the easiest communication model to design a BFT protocol for is the synchronous model, then the partially synchronous model and ultimately the asynchronous model.

Partially synchronous protocols can terminate deterministically by forcing periods of synchrony (known as rounds) after which they reset and try again. Such a family of protocols are well tested and used though they present some limitations for applications like SSV.
Most BFT protocols in the partially synchronous model adopt a round based approach for consensus where each round starts with a designated leader which proposes a value to decide on (in SSV case a data structure to sign corresponding to a beacon duty).
Latency sensitive protocols like SSV can see detrimental effects due to a faulty leader, edge cases can see greater latency which affects livness in periods of significant latency.

Asynchronous BFT Basics

Asynchronous communication models have no assumptions on message delay, and can’t differentiate between a message delay and a message that was never sent.

The FLP impossibility — “any protocol solving consensus in the asynchronous model that is resilient to even one crash failure must have an infinite execution”

The famous FLT impossibility seesm to exclude async BFT protcools at first glance but a small change to its deinition gives way for async BFT protocols. Realistically the FLP impossibility includes deterministic protocols, as we see in a bit async BFT are actually not deterministic but rather use randomness to solve for async network models.

Random BFT protocols
Imagine a simple protocol which has the goal of deciding 0/1 between a few players, either everyone agrees on a result or none do.

Protocol steps:

  1. Every player chooses at random 0/1
  2. Each player declares to the other players what they chose
  3. Each player waits for a threshold amount of declarations from other players, if all the declarations are either 0 or 1 then the player sends a confirmation to all other players, otherwise it sends a special “Redo” command.
  4. If a player received a threshold amount of confirmations, it terminates with the value. If it got a threshold amount of “Redo” commands, it starts all over

Intuitively from the above we can conclude that such a protocol will terminate sometimes because there will come a round in which all players chose the same random value.

The probability that a single player gets a 0/1 is ½, the probability of X players getting the same value is 1/(2^X).
This type of protocol is broadly called a binary agreement(BA).

In 1983 Ben Or proposed a binary agreement(BA) protocol that uses randomization within the protocol as a solution for the FLP impossibility. Ben’s BA protocol can tolerate t < N/5 byzantine replicas (arbitrarly deviate from the protocol) or t < N/2 fail-stop process(stop participating all together in the protocol).

In its core, Ben’s BA protocol was inspired by the PBFT protocol. Where it differs was in cases where the protocol reached a point of no decision (did not receive (N+t)/2 decision messages), the protocol would reset itself (bumping the round number) with a random 1 or 0 initial value.
As all N replicas did just that, eventually a quorum will randomly choose the same value and decide the following round.

A simple Go implementation of Ben Or’s protocol can be found here

Reliable Broadcast
The next big leap was the introduction of Bracha’s reliable broadcast protocol to circumvent faulty replicas to just fail-stop, increasing the protocol’s resilience to t < N/3.

Reliable broadcast is a propagation protocol for messages that guarantees:(validity): If the leader is non-faulty then eventually all non-faulty parties will output the leader’s input.
(agreement): If some non-faulty party outputs a value then eventually all non-faulty parties will output the same value.
https://decentralizedthoughts.github.io/2020-09-19-living-with-asynchrony-brachas-reliable-broadcast/

Reliable broadcast achieves the above by forcing peers to consider a message delivered only if all other replicas saw and acknowledged receiving the message, an all or nothing type of message propagation.

A simple Go implementation of Bracha’s protocol can be found here

Modern Async BFT Protocols

Honey Badger BFT
Honey Badger BFT was the first practical asynchronous protocol to emerge. Its main component is called Asynchronous Common Subset (ACS), which is composed of two stages: N reliable broadcasts (one for each replica), in which replicas propose their values, and N asynchronous binary agreements (ABA) (one for each replica), in which they agree on a binary value (0 or 1) that indicates whether to deliver a value of a replica or not. The protocol is driven by rounds (which consist of a single ACS instance). The ABA phase starts when a replica receives N-f values from different broadcasts, where N is the number of replicas and f is the number of possible faulty replicas. After that, the replica inputs 1 for those received and 0 for the others.

The protocol can be analyzed based on three different complexities. Message complexity, which is the expected number of messages generated, communication complexity, which is the expected total bit-length of messages, and time complexity, which is the expected number of communication rounds until the protocol terminates. HBBFT achieves O(N3), O(N3log N) and O(log N) for message, communication and time complexities, respectively. Thus, although HBBFT was the first practical asynchronous protocol, its complexity is a bottleneck for latency and scalability.

Dumbo 1 and Dumbo 2
The main objective of Dumbo is to reduce the number of ABA instances, the most costful phase of HBBFT’s ACS. Dumbo 1 accomplishes this by adding a committee election for each round and performing only one ABA for each committee member. The committee is created in a way that at least one node is honest (not faulty) with a high probability. The committee members, after the N reliable broadcasts, are responsible for broadcasting the IDs of the replicas of the first N-f values received from the reliable broadcast. A node that receives a list from a committee member, inputs 1 to the ABA of this member if it also has received all the values of the replicas in the list. When the first ABA is finished, the other ABAs are also forced to finish. After an ABA is decided for 1, replicas output the values associated with this ABA, ending the round.

Dumbo 2 accomplishes the reduction in a different way. Each node starts by performing a provable reliable broadcast (a reliable broadcasts with an extension for the replica to acquire a proof that at least f+1 nodes received its value). When a replica receives N-f messages alerting that a replica has finished, it starts a multi-valued byzantine agreement phase in which it inputs the list of IDs of replicas from which it received the finished messages. After agreement, replicas output the values agreed on, ending the round.

Dumbo 1 and Dumbo 2 remains with the O(N3) and O(N3log N) message and communication complexities, but, for time complexity, Dumbo 1 achieves O(log k) (where k is the committee size) and Dumbo 2 achieves O(1).

Alea-BFT
Differently from ACS based protocols, Alea-BFT is composed of two parts that run in parallel. The first one consists of a verifiable consistent broadcast protocol (VCBC) and the second one of an asynchronous binary agreement (ABA). The broadcast is triggered whenever a replica receives B (batch size) messages from clients. Each broadcast message includes a priority value which is incremented as new broadcasts are performed. The agreement part runs in parallel and is divided into rounds. Every round has a leader replica elected in a round robin manner (every replica is chosen evenly in a pre-defined order). In an agreement round, an instance of the asynchronous binary agreement protocol is run to decide on delivering the next highest priority value of the leader replica. If the ABA decides on 1 (to deliver the value) and an honest replica voted for 0 (a situation in which it hasn’t received the value yet), it can obtain the value decided through a recovery mechanism by which it receives it by an honest replica that voted for 1.

Alea-BFT achieves O(N2), O(N2) and O(1) for message, communication and time complexity, respectively, outperforming the previous protocols and improving scalability.

Async BFT protocols are on the bleeding edge of technology, although they’ve been around for a while they are not widely used in the blockchain space.
I believe they make perfect sense in an SSV setup, for that end we’ve started working on a research project with Henrique Moniz, Rodrigo Rodrigues and Matheus Franco on POCing AleaBFT in SSV as a substitute for QBFT.

The aim of this research is to testout the integration, real life performance and compatibility to the SSV use case.