Latency-optimal fault-tolerant replication
Piotr Zieliński
Computer Laboratory, University of Cambridge
May 24, 2005
Hotel booking system
Protocol
1. client → server: "book room 5"
2. server → client: "room booked"
Fault tolerance by replication
Problem: a single server crash blocks the entire system.
Solution: introduce many servers; the system is still usable despite some servers being down.
Consistency problems
Problem: messages might reach the replicas in different orders.
- replicas A and B book the room to one client, replica C to another client
- result: unpredictable
Solution: ensure that replicas receive requests in the same order, by using Atomic Broadcast to disseminate requests.
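The consistency problem above can be reproduced in a few lines. This is only an illustrative sketch: the client names and the "first request for a room wins" rule are assumptions, not part of the talk.

```python
def apply_requests(requests):
    """Apply booking requests in order; the first request for a room wins."""
    bookings = {}
    for client, room in requests:
        # A later request for an already booked room is rejected.
        bookings.setdefault(room, client)
    return bookings


# Two hypothetical clients compete for room 5.
req_1 = ("client 1", 5)
req_2 = ("client 2", 5)

state_ab = apply_requests([req_1, req_2])  # arrival order at replicas A and B
state_c = apply_requests([req_2, req_1])   # arrival order at replica C

print(state_ab)  # {5: 'client 1'}
print(state_c)   # {5: 'client 2'}
```

The replicas end up in different states although they processed the same two requests, which is what a common delivery order prevents.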
Outline
1. Atomic Broadcast
   - Chandra-Toueg algorithm
   - Latency considerations
2. Generic Broadcast
   - General approach
   - Infinitely many instances with finite resources
Atomic Broadcast
System model and problem specification
Atomic Broadcast
- clients atomically broadcast messages
- replicas atomically deliver them
- replicas atomically deliver all messages in the same order
- fault-tolerant
Figure: clients atomically broadcast messages to replicas A, B and C, which atomically deliver them.
Our goal
Goal: an Atomic Broadcast algorithm that is as fast as possible when no failures occur.
System assumptions
- processes communicate by sending/receiving messages
- no time bounds for messages, no clocks
- processes can fail by crashing, no malicious faults
- less than a half/third of the servers can crash
- unreliable leader oracle Ω
Consensus
Figure: processes A, B and C propose values x_A, x_B and x_C, and all decide on the same value x (in the pictured example the proposals are 1, 3 and 6, and every process decides 3).
- Validity: Every decision was proposed by some process.
- Agreement: No two processes decide differently.
- Termination: All correct processes eventually decide.
Proposal and decision events can occur at different times.
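The three Consensus properties can be checked mechanically on a finished run. The sketch below is illustrative only: the function name and the dictionary-based run description are assumptions, and Termination, a liveness property, is approximated by asking whether every correct process has decided by the end of the recorded run.

```python
def check_consensus(proposals, decisions, correct):
    """Check one recorded Consensus run.

    proposals: dict process -> proposed value
    decisions: dict process -> decided value (only processes that decided)
    correct:   set of processes that never crashed
    """
    proposed_values = set(proposals.values())
    return {
        # Validity: every decision was proposed by some process.
        "validity": all(v in proposed_values for v in decisions.values()),
        # Agreement: no two processes decide differently.
        "agreement": len(set(decisions.values())) <= 1,
        # Termination (on a finished run): every correct process decided.
        "termination": all(p in decisions for p in correct),
    }


proposals = {"A": 1, "B": 3, "C": 6}
good_run = check_consensus(proposals, {"A": 3, "B": 3, "C": 3}, {"A", "B", "C"})
bad_run = check_consensus(proposals, {"A": 3, "B": 6}, {"A", "B", "C"})
print(good_run)
print(bad_run)
```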
Implementing Consensus
Figure (animation): processes A, B and C propose 1, 3 and 6 and then decide.
Comments
- step 1: servers learn the proposal of the leader
- step 2: majority confirmation round, for fault tolerance
- if no decision is reached, repeat with a new leader
- the leader might not be unique
Chandra-Toueg Atomic Broadcast algorithm
Chandra-Toueg algorithm
- uses a sequence of Consensus instances Cons_1, Cons_2, ...
- in each instance Cons_i, replicas
  1. propose the first i messages
  2. decide on some set {m_1, ..., m_i}
  3. atomically deliver m_1, ..., m_i
- no message is delivered twice
- instances Cons_i can run in parallel
Chandra-Toueg algorithm: Example
Figure: replicas A, B and C run Consensus instances Cons_1 and Cons_2, proposing a one-message set to Cons_1 and a two-message set to Cons_2.
Comments
- Cons_1: all replicas propose the same one-message set, decide on it, and deliver its message
- Cons_2: all replicas propose the same two-message set, decide on it, and deliver the second message
- all replicas deliver the first message before the second (even if failures occur)
Chandra-Toueg algorithm: Code
when a client atomically broadcasts m do
    broadcast m to all replicas
task proposing at every replica is
    for k = 1, 2, ... do
        wait for some message m_k
        propose M_k = {m_1, ..., m_k} to Consensus instance k
task delivery at every replica is
    for k = 1, 2, ... do
        wait until Consensus instance k decides on some M_k
        atomically deliver all undelivered messages in M_k in order
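The proposing and delivery tasks can be tried out in a small single-process simulation. This is a sketch under strong assumptions: `ConsensusOracle` is a hypothetical stand-in that decides the first proposal made to each instance (not a real Consensus algorithm), the two tasks are merged into one `receive` method, and the message names are made up.

```python
class ConsensusOracle:
    """Stand-in for real Consensus: instance k decides the first proposal made to it."""

    def __init__(self):
        self.decided = {}

    def propose(self, k, value):
        # setdefault stores the first proposal and returns the stored decision.
        return self.decided.setdefault(k, value)


class Replica:
    def __init__(self, oracle):
        self.oracle = oracle
        self.received = []   # messages in local arrival order
        self.delivered = []  # atomically delivered messages, in delivery order
        self.next_k = 1      # next Consensus instance to take part in

    def receive(self, m):
        if m not in self.received:
            self.received.append(m)
        # Take part in instance k as soon as k messages are known locally.
        while len(self.received) >= self.next_k:
            proposal = tuple(self.received[:self.next_k])  # M_k
            decision = self.oracle.propose(self.next_k, proposal)
            for d in decision:  # deliver undelivered messages in decided order
                if d not in self.delivered:
                    self.delivered.append(d)
            self.next_k += 1


oracle = ConsensusOracle()
a, b, c = Replica(oracle), Replica(oracle), Replica(oracle)

# Hypothetical run: "m1" reaches A late, so A sees "m2" first.
b.receive("m1")
c.receive("m1")
a.receive("m2")
b.receive("m2")
c.receive("m2")
a.receive("m1")

print(a.delivered, b.delivered, c.delivered)
```

Although A receives the messages in a different order, it delivers them in the order fixed by the decisions of instances 1 and 2, like B and C.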
Chandra-Toueg algorithm
Figure (animation): replicas A, B and C each propose a one-message set to Cons_1 and a two-message set to Cons_2; then all decide and all deliver.
Comments
- a message to A is delayed
- replicas start instances of Consensus at different times
- Cons_1: A proposes one single-message set, B and C propose the other
- Cons_2: all replicas propose the same two-message set (the sets are equal regardless of arrival order), and decide on it
- if Cons_1 decides on one of the single-message sets, then replicas deliver its message, followed by the other message
Latency
Latency: the number of communication steps from atomically broadcasting a message to its atomic delivery.
- Direct Broadcast: 1 step
- Chandra-Toueg: 1 step + Consensus
  - 3 steps with a 2-step Consensus
  - 2 steps with a 1-step Consensus
Latency of Consensus
Two kinds of Consensus algorithms
- 1-3 Consensus: 1 step if the proposals are the same, otherwise 3 steps
- 2-2 Consensus: 2 steps to decide in all cases
If replicas receive client messages in the same order:
- 1-3 Consensus: 1 step Consensus, 2 steps Atomic Broadcast
- 2-2 Consensus: 2 steps Consensus, 3 steps Atomic Broadcast
If replicas receive client messages in different orders:
- 1-3 Consensus: 3 steps Consensus, 4 steps Atomic Broadcast
- 2-2 Consensus: 2 steps Consensus, 3 steps Atomic Broadcast
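The step counts above follow from the rule "Chandra-Toueg: 1 step + Consensus". A minimal sketch of that arithmetic, with hypothetical function names:

```python
def consensus_steps(kind, same_proposals):
    """Communication steps of one Consensus instance of the given kind."""
    if kind == "1-3":
        return 1 if same_proposals else 3
    if kind == "2-2":
        return 2
    raise ValueError(f"unknown Consensus kind: {kind}")


def atomic_broadcast_steps(kind, same_order):
    """One step to reach the replicas, plus Consensus.

    Replicas that receive client messages in the same order make
    identical proposals.
    """
    return 1 + consensus_steps(kind, same_proposals=same_order)


for kind in ("1-3", "2-2"):
    for same_order in (True, False):
        print(kind, same_order, atomic_broadcast_steps(kind, same_order))
```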
1-2 Consensus
Properties
1. decides in one step if the proposals are the same
2. decides in two steps otherwise
Resulting latency
- if replicas receive client messages in the same order: 1 step Consensus, 2 steps Atomic Broadcast
- if replicas receive client messages in different orders: 2 steps Consensus, 3 steps Atomic Broadcast
1-2 Consensus
Implementation
We use three parallel Consensus instances:
1. instance 1 of 1-3 Consensus (to decide in one step)
2. instance 2 of 2-2 Consensus (to decide in two steps)
3. instance L of 1-3 Consensus (to agree on the leader)
1-2 Consensus: Code
when propose(x) by process p do
    broadcast(x, p)
    propose_1(x)
    propose_2(x, p)
    propose_L(leader)
task decide at process p is
    wait until decide_L(leader)
    wait until one of the conditions is true and decide on x
        condition 1: decide_1(x) and receive(x, leader)
        condition 2: decide_2(x, leader)
        condition 3: decide_1(x) and decide_2(y, q) with q ≠ leader
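The decision rule of the decide task can be written as a pure function over what a process has observed so far. This is a sketch, not the talk's implementation: the function name, the argument encoding, the use of `None` for "not yet decided" (so proposed values must not be `None`), and checking the conditions in the listed order are all assumptions.

```python
def decide_1_2(leader, decided_1, decided_2, received):
    """Return the 1-2 Consensus decision, or None if no condition holds yet.

    leader:    process decided by instance L, or None
    decided_1: value decided by instance 1, or None
    decided_2: (value, proposer) decided by instance 2, or None
    received:  dict proposer -> value, filled by broadcast(x, p) messages
    """
    if leader is None:
        return None  # still waiting for decide_L(leader)
    # Condition 1: decide_1(x) and receive(x, leader)
    if decided_1 is not None and received.get(leader) == decided_1:
        return decided_1
    if decided_2 is not None:
        value, proposer = decided_2
        # Condition 2: decide_2(x, leader)
        if proposer == leader:
            return value
        # Condition 3: decide_1(x) and decide_2(y, q) with q != leader
        if decided_1 is not None:
            return decided_1
    return None


print(decide_1_2("A", "v", None, {"A": "v"}))   # condition 1
print(decide_1_2("A", None, ("w", "A"), {}))    # condition 2
print(decide_1_2("A", "v", ("w", "B"), {}))     # condition 3
```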
Condition 1
decide_L(leader) and decide_1(x) and receive(x, leader)
Example with leader = A: decide_L(A) and decide_1(x) and receive(x, A)
Condition 2
decide_L(leader) and decide_2(x, leader)
Example with leader = A: decide_L(A) and decide_2(x, A)
Condition 3
decide_L(leader) and decide_1(x) and decide_2(y, q) with q ≠ leader
Example with q = B and leader = A: decide_L(A) and decide_1(x) and decide_2(y, B) with B ≠ A
Related work
Consensus                         identical proposals   different proposals
Schiper [1997], Lamport [2001]    2 steps               2 steps
Brasileiro et al. [2001]          1 step                3 steps
This work                         1 step                2 steps
Atomic Broadcast                  same order            different orders
Chandra and Toueg [1996]          3 steps               3 steps
Pedone and Schiper [1998]         2 steps               4 steps
This work                         2 steps               3 steps
The latencies of our algorithms are provably optimal.
Generic Broadcast
Generic Broadcast
Figure: clients send reads of x (r) and writes to x (w) to replicas A, B and C, disseminated first with Atomic Broadcast and then with Generic Broadcast; the replicas' delivery sequences of r and w messages are compared (= or ≠).
Observations
- ordering all messages is expensive (Atomic Broadcast)
- not all messages have to be ordered (Generic Broadcast)
Generic Broadcast
Meta-solution
1. Define the conflict relation. Only conflicting messages must be delivered in the same order.
2. Determine the partial order of conflicting messages.
3. Deliver messages in any total order consistent with that partial order.
Figure: example delivery sequences of reads (r) and writes (w).
Example
Figure (animation): a graph over messages m1, ..., m6 whose edges give the order of conflicting messages; the messages are delivered one at a time, in two different total orders.
order 1: m2, m3, m1, m4, m5, m6
order 2: m1, m2, m3, m4, m5, m6
Delivery rule
Deliver a message when all undelivered conflicting messages are its successors.
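The delivery rule can be sketched as a loop that repeatedly delivers a message with no undelivered conflicting predecessor. The edges of the graph in the example are not recoverable from the slide, so the `precedes` set below is hypothetical, chosen only so that both listed orders are consistent with it; the `prefer` tie-breaking list is likewise an assumption that models different replicas picking different deliverable messages.

```python
def deliver_all(messages, precedes, prefer):
    """Return one delivery order allowed by the delivery rule.

    precedes: set of (a, b) pairs, one per conflicting pair, meaning a -> b
    prefer:   tie-breaking list; among deliverable messages the earliest wins
    """
    undelivered = set(messages)
    order = []
    while undelivered:
        # Deliverable: every undelivered conflicting message is a successor,
        # i.e. no undelivered message precedes this one.
        ready = [m for m in prefer
                 if m in undelivered
                 and not any((other, m) in precedes for other in undelivered)]
        if not ready:
            raise ValueError("cycle among conflicting messages")
        order.append(ready[0])
        undelivered.remove(ready[0])
    return order


messages = ["m1", "m2", "m3", "m4", "m5", "m6"]
# Hypothetical order of conflicting pairs (not taken from the slide).
precedes = {("m2", "m3"), ("m1", "m4"), ("m3", "m4"), ("m4", "m5"), ("m5", "m6")}

order_1 = deliver_all(messages, precedes, ["m2", "m3", "m1", "m4", "m5", "m6"])
order_2 = deliver_all(messages, precedes, ["m1", "m2", "m3", "m4", "m5", "m6"])
print(order_1)
print(order_2)
```

The two orders differ only in where the non-conflicting message m1 appears relative to m2 and m3; every conflicting pair is delivered in the same order in both.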
Problems
1. Different processes perceive different orders
   - use a separate Consensus instance for each pair of messages, all executed in parallel
2. Cycles
   - if no failures occur, the leader dictates the order, so there are no cycles
   - if failures occur, a (slow) cycle-resolution algorithm is employed
3. The graph contains all possible messages
   - infinitely many parallel instances of Consensus
   - most of them identical, only finitely many different
   - implementable with finite resources
Determining the order of messages
When a message is sent:
- replicas propose which of two conflicting messages should precede the other
- Consensus decides the order (→) of the pair
- this is done in parallel for all possible messages conflicting with the sent one
- the message is delivered
Figure: timeline for replicas A, B and C; reaching the replicas takes 1 step and Consensus takes 1 or 2 steps.
Latency
Two steps if all conflicting messages arrive at the replicas in the same order, and three otherwise.
Related work
Generic Broadcast             same order   no conflicts   no failures
Chandra and Toueg [1996]      3 steps      3 steps        3 steps
Pedone and Schiper [1998]     2 steps      4 steps        4 steps
Pedone and Schiper [1999]     4 steps      2 steps        4 steps
Aguilera et al. [2000]        4 steps      2 steps        4 steps
This work                     2 steps      2 steps        3 steps
The latency of our algorithm is provably optimal.
Summary of the talk
New algorithms
1. 1-2 Consensus
2. Fast Atomic Broadcast
3. Fast Generic Broadcast
Remarks
- messages delivered in 2 or 3 steps
- provably optimal
Infinitely many instances with finite resources
A single instance
Figure (animation): processes A, B, C and D, with the states of A, B and C as recorded by D.
Comments
- A and B propose one value, and C proposes another
- D only collects information, does not propose
- Consider the state of D as more and more messages arrive
Many instances at the same time
- A proposes one value in instances 1–8 and another in instances 9–15
- B proposes one value in instances 1–4 and another in instances 5–15
- C proposes one value in instances 1–12 and another in instances 13–15
Figure (animation): the states of A, B and C as recorded by D across instances 1, ..., 15.
Grouping
Before
Instances I_1, . . . , I_15 are executed independently.
After
Instances I_k with the same state are simulated by the same interval instance I_{a,b} = { I_k : a < k ≤ b }.
[Figure: the instance axis 0–15 is split per process. A: I_{0,9}, I_{9,15}. B: I_{0,3}, I_{3,15}. C: I_{0,12}, I_{12,15}.]
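The grouping step can be sketched in a few lines of Python. This is only an illustration: the function name `group_into_intervals` and the state labels "x" and "y" are assumptions, not part of the talk. It collapses runs of consecutive instances with equal state into interval instances (a, b].

```python
from itertools import groupby

def group_into_intervals(states):
    """Group consecutive instances with equal state into intervals (a, b].

    states[k - 1] is the state of instance I_k. Each result entry pairs an
    interval I_{a,b} = {I_k : a < k <= b} with the state its members share.
    """
    intervals = []
    a = 0
    for state, run in groupby(states):
        b = a + len(list(run))
        intervals.append(((a, b), state))
        a = b
    return intervals

# Hypothetical states for process A: "x" in I_1..I_9 and "y" in I_10..I_15.
states_a = ["x"] * 9 + ["y"] * 6
print(group_into_intervals(states_a))
```

Fifteen independent instances become two interval instances here, matching I_{0,9} and I_{9,15} for process A.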
How it works
Algorithm
When a new message arrives:
1. split some instances, cloning the state
2. pass the message to the appropriate instances
[Figure: messages from A, B and C arrive at D. D starts with the single interval instance I_{0,15}, which is split at 9 (I_{0,9}, I_{9,15}), then at 3 (I_{0,3}, I_{3,9}, I_{9,15}), then at 12 (I_{0,3}, I_{3,9}, I_{9,12}, I_{12,15}).]
Remarks
finitely many (4) actual instances
created dynamically
generalizations:
  replacing 15 with ∞
  integers with reals
  intervals with other sets
  other algorithms
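A minimal sketch of the two steps (split, then pass), assuming that the state of an instance is simply the list of messages it has received so far. The class name `IntervalInstances`, the method names and the message labels are hypothetical.

```python
import bisect
import copy

class IntervalInstances:
    """Simulates instances I_1..I_n using a few interval instances I_{a,b}."""

    def __init__(self, n):
        # boundaries[i] and boundaries[i + 1] delimit interval i: (a, b]
        self.boundaries = [0, n]
        self.states = [[]]  # one state per interval (assumed: received messages)

    def _split(self, point):
        """Step 1: make `point` a boundary, cloning the state of the split interval."""
        if point in self.boundaries:
            return
        i = bisect.bisect_left(self.boundaries, point)
        self.boundaries.insert(i, point)
        self.states.insert(i, copy.deepcopy(self.states[i - 1]))

    def receive(self, message, a, b):
        """Step 2: pass `message` to every instance I_k with a < k <= b."""
        self._split(a)
        self._split(b)
        for i in range(len(self.states)):
            if a <= self.boundaries[i] and self.boundaries[i + 1] <= b:
                self.states[i].append(message)

# Replaying the picture: D receives messages that split at 9, then 3, then 12.
d = IntervalInstances(15)
d.receive("from A", 0, 9)
d.receive("from B", 0, 3)
d.receive("from C", 0, 12)
print(d.boundaries)  # [0, 3, 9, 12, 15]
print(d.states)
```

Only four interval instances exist at the end, although fifteen logical instances are being simulated. Cloning on split matters: the two halves must evolve independently afterwards.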
Summary of the talk
New algorithms
1. 1-2 Consensus
2. Fast Atomic Broadcast
3. Fast Generic Broadcast
4. Infinitely many parallel instances
Remarks
messages delivered in 2 or 3 steps
provably optimal
[Figure: miniature versions of diagrams from earlier slides.]
References
M. K. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. Thrifty generic broadcast. Lecture Notes in Computer Science, 1914:268–282, 2000.
F. Brasileiro, F. Greve, A. Mostefaoui, and M. Raynal. Consensus in one communication step. Lecture Notes in Computer Science, 2127:42–50, 2001.
Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.
Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, December 2001.
Fernando Pedone and André Schiper. Optimistic atomic broadcast. In Proceedings of the 12th International Symposium on Distributed Computing, September 1998.
Fernando Pedone and André Schiper. Generic broadcast. In Proceedings of the Thirteenth International Symposium on Distributed Computing (DISC'99, formerly WDAG), 1999.
André Schiper. Early Consensus in an asynchronous system with a weak failure detector. Distributed Computing, 10(3):149–157, April 1997.
Atomic Broadcast
Validity: For any message m, every learner delivers m at most once, and only if m was broadcast by a proposer.
Agreement: If some learner delivers message m′ after message m, then every learner delivers m′ only after it has delivered m.
Termination (Validity): If a correct proposer broadcasts a message, then all correct learners will eventually deliver it.
Termination (Agreement): If a learner delivers a message, then all correct learners will eventually deliver it.
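The two safety properties (the first Validity and the first Agreement above) can be checked mechanically on finite delivery logs. The sketch below is illustrative only: the function names and the log representation (one list of messages per learner, in delivery order) are assumptions. The Termination properties are liveness properties and cannot be checked on a finite log.

```python
def satisfies_validity(logs, broadcast):
    """Each learner delivers a message at most once, and only if it was broadcast."""
    return all(
        len(set(log)) == len(log) and set(log) <= set(broadcast)
        for log in logs
    )

def satisfies_agreement(logs):
    """If some learner delivers m' after m, every learner that delivers m'
    has delivered m before it. Assumes each log has no duplicates."""
    for log in logs:
        for j, later in enumerate(log):
            earlier = set(log[:j])  # everything this learner delivered before `later`
            for other in logs:
                if later in other:
                    before = set(other[:other.index(later)])
                    if not earlier <= before:
                        return False
    return True

print(satisfies_agreement([["x", "y"], ["x", "y"], ["x"]]))  # True
print(satisfies_agreement([["x", "y"], ["y", "x"]]))         # False
```

A learner that has delivered only a prefix, as in the first call, does not violate Agreement; a learner that skips or reorders a message does.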
Sequencer-based algorithm
clients broadcast messages to the main replica
the main replica assigns sequence numbers to them and broadcasts them to the other replicas
replicas deliver messages in order
if the main replica fails, another takes over
[Figure: messages labelled with sequence number k flow between replicas A, B and C.]
Sequencer ensures the same order
Example
[Figure: main replica A sends messages numbered 1 and 2 to replicas B and C; B receives message 2 before message 1.]
main replica A assigns 1 to one message and 2 to the other
replicas A and C deliver both messages straight away
replica B delays delivering message 2 until it has delivered message 1
all replicas deliver the two messages in the same order
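Replica B's behaviour in this example amounts to a hold-back buffer. A minimal sketch, assuming sequence numbers start at 1 and each number is used once; the class name `Replica` and the message labels are hypothetical.

```python
class Replica:
    """Delivers numbered messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1    # sequence number expected next
        self.pending = {}    # seq -> message, received but not yet deliverable
        self.delivered = []

    def receive(self, message, seq):
        self.pending[seq] = message
        # Deliver as long as the next expected number is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

b, c = Replica(), Replica()
c.receive("first", 1)
c.receive("second", 2)   # C receives in order and delivers straight away
b.receive("second", 2)   # B receives 2 first and must hold it back
b.receive("first", 1)    # now B delivers 1, then 2
print(b.delivered == c.delivered)  # True
```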
When the sequencer fails
Examples
Case 1: no failures
A assigns 1 to one message and 2 to another
all replicas deliver message 1 followed by message 2
Case 2: Sequencer fails, no message loss
A assigns 1 to a message, and fails
B takes over and assigns 2 to another message
all replicas deliver message 1 before (possibly) message 2
Case 3: Sequencer fails, message loss occurs
A crashes and its messages to the others are lost
B does not know about the message that A numbered 1, so it assigns 1 to a different message
A delivers its own message 1, whereas replicas B and C deliver B's message 1
Case 4: Sequencer is just very slow
A is correct but very slow
B thinks A crashed, so it assigns 1 to a different message
A delivers its own message 1 first, whereas replicas B and C deliver B's message 1 first
Sequencer-based algorithm (3)
when a client executes abcast(m) do
    send m to the main replica a1
task sequencer at the main replica is
    for k = 1, 2, . . . do
        wait for a message m
        broadcast (m, k) to all replicas { including itself }
task delivery at any replica is
    for k = 1, 2, . . . do
        wait for message (m, k)
        abdeliver(m)
Chandra-Toueg
when a client atomically broadcasts m do
    broadcast m to all replicas
task proposing at every replica is
    for k = 1, 2, . . . do
        wait for some message m_k
        propose M_k = {m_1, . . . , m_k} to Consensus instance k
task delivery at every replica is
    for k = 1, 2, . . . do
        wait until Consensus instance k decides on some M_k
        atomically deliver all undelivered messages in M_k in order
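A sequential, failure-free simulation of this reduction might look as follows. It is a sketch under strong assumptions: the `Consensus` class is a stand-in in which the first proposal wins (not a real consensus protocol), "in order" is taken to mean sorted order, and the names `run` and `receive_orders` are hypothetical.

```python
class Consensus:
    """Stand-in for a consensus instance: the first proposal is the decision."""

    def __init__(self):
        self.decision = None

    def propose(self, value):
        if self.decision is None:
            self.decision = value
        return self.decision

def run(receive_orders):
    """receive_orders maps each replica to the order in which it receives broadcasts."""
    n = len(next(iter(receive_orders.values())))
    instances = [Consensus() for _ in range(n)]
    delivered = {replica: [] for replica in receive_orders}
    for k in range(1, n + 1):
        for replica, order in receive_orders.items():
            # Propose M_k = {m_1, ..., m_k}: the first k messages this replica received.
            decided = instances[k - 1].propose(frozenset(order[:k]))
            # Deliver the undelivered messages of M_k in a deterministic (sorted) order.
            for m in sorted(decided):
                if m not in delivered[replica]:
                    delivered[replica].append(m)
    return delivered

# Replicas receive the same three messages in different orders.
result = run({"A": ["x", "y", "z"], "B": ["y", "x", "z"], "C": ["z", "y", "x"]})
print(result)
```

Although the replicas receive the broadcasts in different orders, each consensus instance yields a single decision, so every replica delivers the same sequence.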