Javascript must be enabled to continue!

Fast State-Machine Replication in Geo-Distributed Systems

Modern cloud services rely on geo-distributed infrastructure spanning multiple data centers across the globe. These services must maintain high availability and strong consistency, even in the face of network partitions, node failures, and significant latency variability. A widely adopted approach for ensuring consistency in such systems is state-machine replication (SMR), which models the service as a deterministic state machine replicated across geographically distributed sites. SMR protocols ensure that commands submitted by clients are executed in a consistent order across all replicas, preserving linearizability the guarantee that each command appears to take effect instantaneously at a single point in global time. Classical SMR protocols face fundamental trade-offs between latency, fault tolerance, and complexity. Leader-based protocols such as Paxos offer a well-studied and relatively simple design, but incur high latency and low throughput due to their reliance on a centralized leader. Protocols like Fast Paxos introduce fast paths to reduce latency under favorable conditions, yet they perform poorly in the worst case due to disruptive fallback behavior and state-transfer overhead. More recent leaderless protocols, such as EPaxos, remove the leader bottleneck and enable decentralized decision-making, but introduce substantial complexity in their failure recovery mechanisms and suffer from high tail latency due to execution convoy effects. Moreover, EPaxos the most influential leaderless SMR protocol is ambiguously specified and suffers from nontrivial bugs. This thesis addresses these limitations by presenting two complementary SMR protocols: SwiftPaxos, a new leader-based protocol that lowers best-case latency without introducing performance penalties in the worst case, and EPaxos*, a new leaderless protocol based on EPaxos that is simpler, provably correct, and optimal in number of processes required. SwiftPaxos executes client commands in two-message delays in the absence of conflicts, and in three-message delays otherwise. It introduces a novel double-voting mechanism, which allows replicas to resolve conflicts without restarting the consensus instance or transferring large amounts of state. Experiments across 13 AWS regions demonstrate that SwiftPaxos reduces latency by up to 29% compared to Paxos and achieves up to 2.9× higher throughput than EPaxos. EPaxos*, in contrast, revisits the leaderless approach to SMR. We introduce the notion of fast SMR and define a general class of e-fast, f -resilient protocols that guarantee decisions in two message delays in conflict-free runs with up to e failures. We establish a lower bound of max{2e + f 1, 2f + 1} on the number of processes required for fast SMR, and show that EPaxos* matches this bound, which makes it the first protocol to to achieve this. In addition, EPaxos* fixes both known and newly discovered safety and liveness issues in EPaxos by introducing a novel failure-recovery mechanism. Together, SwiftPaxos and EPaxos* provide a substantial improvements of low-latency, fault- tolerant SMR in geo-distributed systems. RESUMEN Los servicios modernos en la nube dependen de una infraestructura geodistribuida que abarca múltiples centros de datos en todo el mundo. Estos servicios deben mantener una alta disponibilidad y una fuerte consistencia, incluso frente a particiones en la red, fallos de nodos y una variabilidad significativa en la latencia. Un enfoque ampliamente adoptado para garantizar la consistencia en tales sistemas es la replicación de máquinas de estados (SMR, por sus siglas en inglés), que modela el servicio como una máquina de estados determinista replicada en sitios distribuidos geográficamente. Los protocolos SMR aseguran que los comandos enviados por los clientes se ejecuten en un orden consistente en todas las réplicas, preservando la linealizabilidad: la garantía de que cada comando parece tener efecto instantáneamente en un único punto del tiempo global. Los protocolos SMR clásicos enfrentan compensaciones fundamentales entre latencia, tolerancia a fallos y complejidad. Los protocolos basados en un líder, como Paxos, ofrecen un diseño bien estudiado y relativamente simple, pero incurren en alta latencia y bajo rendimiento debido a su dependencia de un líder centralizado. Los protocolos como Fast Paxos introducen caminos rápidos para reducir la latencia en condiciones favorables; sin embargo, tienen un mal desempeño en el peor de los casos debido a comportamientos de retroceso disruptivos y a la sobrecarga dada por la transferencia de estado. Los protocolos sin líder más recientes, como EPaxos, eliminan el cuello de botella del líder y habilitan la toma de decisiones descentralizada, pero introducen una complejidad sustancial en sus mecanismos de recuperación ante fallos y sufren de alta latencia de cola debido a efectos de convoy durante la ejecución. Además, EPaxos está especificado de forma ambigua y sufre de errores no triviales. Esta tesis aborda estas limitaciones mediante dos protocolos SMR complementarios: Swift- Paxos, un nuevo protocolo basado en líder que reduce la latencia en el mejor caso sin penalizar el rendimiento en el peor, y EPaxos*, un protocolo sin líder basado en EPaxos, más simple, demostrablemente correcto y óptimo en el número de procesos requeridos. SwiftPaxos ejecuta comandos de cliente en dos retardos de mensaje en ausencia de conflictos, y en tres retardos de mensaje en caso contrario. Introduce un novedoso mecanismo de doble votación, que permite a las réplicas resolver conflictos sin reiniciar la instancia de consenso ni transferir grandes cantidades de estado. Experimentos en 13 regiones de AWS demuestran que SwiftPaxos reduce la latencia hasta en un 29% en comparación con Paxos y logra hasta 2.9× más rendimiento que EPaxos. EPaxos*, en contraste, reconsidera el enfoque sin líder para SMR. Introducimos la noción de SMR rápido y definimos una clase general de protocolos e-rápidos y f -resilientes que garantizan decisiones en dos retardos de mensaje en ejecuciones sin conflictos con hasta e fallos. Establecemos una cota inferior de max{2e + f 1, 2f + 1} sobre el número de procesos requeridos para SMR rápido, y mostramos que EPaxos* alcanza esta cota, lo que lo convierte en el primer protocolo en lograr esto. Además, EPaxos* corrige tanto problemas de seguridad y progreso ya conocidos como otros descubiertos recientemente en EPaxos, mediante la introducción de un novedoso mecanismo de recuperación ante fallos. En conjunto, SwiftPaxos y EPaxos* mejoran SMR al ofrecer menor latencia y mayor tolerancia a fallos en sistemas geodistribuidos.

Universidad Politecnica de Madrid - University Library

Fedor Ryabinin

2026

Title: Fast State-Machine Replication in Geo-Distributed Systems

Description:

Modern cloud services rely on geo-distributed infrastructure spanning multiple data centers across the globe.

These services must maintain high availability and strong consistency, even in the face of network partitions, node failures, and significant latency variability.

A widely adopted approach for ensuring consistency in such systems is state-machine replication (SMR), which models the service as a deterministic state machine replicated across geographically distributed sites.

SMR protocols ensure that commands submitted by clients are executed in a consistent order across all replicas, preserving linearizability the guarantee that each command appears to take effect instantaneously at a single point in global time.

Classical SMR protocols face fundamental trade-offs between latency, fault tolerance, and complexity.

Leader-based protocols such as Paxos offer a well-studied and relatively simple design, but incur high latency and low throughput due to their reliance on a centralized leader.

Protocols like Fast Paxos introduce fast paths to reduce latency under favorable conditions, yet they perform poorly in the worst case due to disruptive fallback behavior and state-transfer overhead.

More recent leaderless protocols, such as EPaxos, remove the leader bottleneck and enable decentralized decision-making, but introduce substantial complexity in their failure recovery mechanisms and suffer from high tail latency due to execution convoy effects.

Moreover, EPaxos the most influential leaderless SMR protocol is ambiguously specified and suffers from nontrivial bugs.

This thesis addresses these limitations by presenting two complementary SMR protocols: SwiftPaxos, a new leader-based protocol that lowers best-case latency without introducing performance penalties in the worst case, and EPaxos*, a new leaderless protocol based on EPaxos that is simpler, provably correct, and optimal in number of processes required.

SwiftPaxos executes client commands in two-message delays in the absence of conflicts, and in three-message delays otherwise.

It introduces a novel double-voting mechanism, which allows replicas to resolve conflicts without restarting the consensus instance or transferring large amounts of state.

Experiments across 13 AWS regions demonstrate that SwiftPaxos reduces latency by up to 29% compared to Paxos and achieves up to 2.

9× higher throughput than EPaxos.

EPaxos*, in contrast, revisits the leaderless approach to SMR.

We introduce the notion of fast SMR and define a general class of e-fast, f -resilient protocols that guarantee decisions in two message delays in conflict-free runs with up to e failures.

We establish a lower bound of max{2e + f 1, 2f + 1} on the number of processes required for fast SMR, and show that EPaxos* matches this bound, which makes it the first protocol to to achieve this.

In addition, EPaxos* fixes both known and newly discovered safety and liveness issues in EPaxos by introducing a novel failure-recovery mechanism.

Together, SwiftPaxos and EPaxos* provide a substantial improvements of low-latency, fault- tolerant SMR in geo-distributed systems.

RESUMEN Los servicios modernos en la nube dependen de una infraestructura geodistribuida que abarca múltiples centros de datos en todo el mundo.

Estos servicios deben mantener una alta disponibilidad y una fuerte consistencia, incluso frente a particiones en la red, fallos de nodos y una variabilidad significativa en la latencia.

Un enfoque ampliamente adoptado para garantizar la consistencia en tales sistemas es la replicación de máquinas de estados (SMR, por sus siglas en inglés), que modela el servicio como una máquina de estados determinista replicada en sitios distribuidos geográficamente.

Los protocolos SMR aseguran que los comandos enviados por los clientes se ejecuten en un orden consistente en todas las réplicas, preservando la linealizabilidad: la garantía de que cada comando parece tener efecto instantáneamente en un único punto del tiempo global.

Los protocolos SMR clásicos enfrentan compensaciones fundamentales entre latencia, tolerancia a fallos y complejidad.

Los protocolos basados en un líder, como Paxos, ofrecen un diseño bien estudiado y relativamente simple, pero incurren en alta latencia y bajo rendimiento debido a su dependencia de un líder centralizado.

Los protocolos como Fast Paxos introducen caminos rápidos para reducir la latencia en condiciones favorables; sin embargo, tienen un mal desempeño en el peor de los casos debido a comportamientos de retroceso disruptivos y a la sobrecarga dada por la transferencia de estado.

Los protocolos sin líder más recientes, como EPaxos, eliminan el cuello de botella del líder y habilitan la toma de decisiones descentralizada, pero introducen una complejidad sustancial en sus mecanismos de recuperación ante fallos y sufren de alta latencia de cola debido a efectos de convoy durante la ejecución.

Además, EPaxos está especificado de forma ambigua y sufre de errores no triviales.

Esta tesis aborda estas limitaciones mediante dos protocolos SMR complementarios: Swift- Paxos, un nuevo protocolo basado en líder que reduce la latencia en el mejor caso sin penalizar el rendimiento en el peor, y EPaxos*, un protocolo sin líder basado en EPaxos, más simple, demostrablemente correcto y óptimo en el número de procesos requeridos.

SwiftPaxos ejecuta comandos de cliente en dos retardos de mensaje en ausencia de conflictos, y en tres retardos de mensaje en caso contrario.

Introduce un novedoso mecanismo de doble votación, que permite a las réplicas resolver conflictos sin reiniciar la instancia de consenso ni transferir grandes cantidades de estado.

Experimentos en 13 regiones de AWS demuestran que SwiftPaxos reduce la latencia hasta en un 29% en comparación con Paxos y logra hasta 2.

9× más rendimiento que EPaxos.

EPaxos*, en contraste, reconsidera el enfoque sin líder para SMR.

Introducimos la noción de SMR rápido y definimos una clase general de protocolos e-rápidos y f -resilientes que garantizan decisiones en dos retardos de mensaje en ejecuciones sin conflictos con hasta e fallos.

Establecemos una cota inferior de max{2e + f 1, 2f + 1} sobre el número de procesos requeridos para SMR rápido, y mostramos que EPaxos* alcanza esta cota, lo que lo convierte en el primer protocolo en lograr esto.

Además, EPaxos* corrige tanto problemas de seguridad y progreso ya conocidos como otros descubiertos recientemente en EPaxos, mediante la introducción de un novedoso mecanismo de recuperación ante fallos.

En conjunto, SwiftPaxos y EPaxos* mejoran SMR al ofrecer menor latencia y mayor tolerancia a fallos en sistemas geodistribuidos.

Back

Abstract The problem of waste management in landfills which is not resolved will be a threat to the environment and humans. The main cause of water resources pollution in landfil...

Single‐Molecule Optical Replication Mapping (ORM) Suggests Human Replication Timing is Regulated by Stochastic Initiation

DNA replication timing is regulated by the timing of initiation across the genome. However, there is no consensus as to how initiation timing is regulated. Deterministic models con...

Hepatitis C Virus Replication Depends on Endosomal Cholesterol Homeostasis

ABSTRACT Similar to other positive-strand RNA viruses, hepatitis C virus (HCV) causes massive rearrangements of intracellular membranes, resulting in a membranous web (MW...

Identification of 1600 replication origins in S. cerevisiae

Abstract There are approximately 500 known origins of replication in the yeast genome, and the process by which DNA replication initiates at these locations is well understood. In ...

Decoding the multifaceted interventions between human sirtuin 2 and dynamic hepatitis B viral proteins to confirm their roles in HBV replication

The human sirtuin 2 gene (SIRT2) encodes a full-length Sirt2 protein (i.e., the Sirt2 isoform 1), which primarily functions as a cytoplasmic α-tubulin deacetylase, and which promot...

Geological and geomorphological objects of the Ukrainian Carpathians’ Beskid Mountains and their tourist attractiveness

The article explores the geological and geomorphological objects of the Beskidy Ukrainian Carpathians for the further creation of geo-tourist routes. Geo-tourist areas combining se...

“KAIROS OF Biblical Geo-politics for the New Life of the Minjung in the 21st Century”

Theologically speaking, it is in this situation that we begin to sense the imminence of the KAIROTIC GEO-POLITICS. We begin with the Biblical geo-politics of God’s Reign, which wou...

Geo Information Systems &Remote Sensing: Applications in Environmental Systems & Management

Geo Informatics is an interdisciplinary field responsible for spatial information related activities. Geo Informatics is close to the Geo Information Science, Geo Information Syste...

Email:
Password:

Email:

Fast State-Machine Replication in Geo-Distributed Systems

Related Results