"The Beacon Chain has life." On May 11 and 12, 2023, Ethereum suffered two temporary losses of finality that tested its resilience. The network nonetheless kept running and recovered from both incidents on its own. This article examines the two events, their impact, and the fixes made afterwards to prevent similar incidents in the future.
May 11 and 12, 2023, will be remembered as significant dates in Ethereum's history, when its resilience was put to the test. On May 11, at around 20:19 UTC, Ethereum Mainnet saw a sharp drop in block production that delayed finalization by four epochs, a first for Ethereum. The following day a similar incident stretched the delay to nine epochs, long enough to trigger the inactivity leak and its penalties.
During these incidents, a substantial dip in network participation was observed. The first drop occurred in epoch 200,551, resulting in a temporary halt in finalization until epoch 200,555. The second participation drop was seen in epoch 200,750, causing another pause in finalization until epoch 200,759.
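Beacon chain epochs map to wall-clock time deterministically, which makes the epoch numbers above easy to place and the finality gaps easy to size. The sketch below assumes the Mainnet genesis timestamp 1606824023 (December 1, 2020, 12:00:23 UTC) and 12-second slots with 32 slots per epoch; `epoch_start_utc` and `gap_minutes` are illustrative helper names, not part of any client API. It reports epoch boundaries, not the exact moment participation dropped.

```python
from datetime import datetime, timezone

# Assumed Mainnet parameters: beacon chain genesis at 1606824023
# (2020-12-01 12:00:23 UTC), 12-second slots, 32 slots per epoch.
GENESIS_TIME = 1606824023
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
SECONDS_PER_EPOCH = SECONDS_PER_SLOT * SLOTS_PER_EPOCH  # 384 s = 6.4 min


def epoch_start_utc(epoch: int) -> datetime:
    """Wall-clock UTC time at which the given epoch begins."""
    return datetime.fromtimestamp(GENESIS_TIME + epoch * SECONDS_PER_EPOCH, tz=timezone.utc)


def gap_minutes(first_epoch: int, last_epoch: int) -> float:
    """Length of the span between two epoch boundaries, in minutes."""
    return (last_epoch - first_epoch) * SECONDS_PER_EPOCH / 60


# The two finality gaps quoted in the article.
for start, end in [(200_551, 200_555), (200_750, 200_759)]:
    print(f"epoch {start} starts {epoch_start_utc(start):%Y-%m-%d %H:%M:%S} UTC; "
          f"{end - start} epochs ~ {gap_minutes(start, end):.1f} min")
# epoch 200551 starts 2023-05-11 20:06:47 UTC; 4 epochs ~ 25.6 min
# epoch 200750 starts 2023-05-12 17:20:23 UTC; 9 epochs ~ 57.6 min
```

Under these assumptions the first finality gap lasted about 25.6 minutes and the second about 57.6 minutes.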
Despite the initial concerns, Ethereum's network showcased its inherent robustness by autonomously recovering from both incidents. These events not only affirmed the resilience of Ethereum's Beacon Chain but also highlighted areas for potential enhancements. As we progress through this article, we'll explore these areas in greater detail.
During periods of non-finality, Ethereum relies on a crucial mechanism called the "inactivity leak." This feature is built into Ethereum 2.0's proof-of-stake protocol to keep the network functioning through major disturbances, from events like World War III to large-scale natural disasters, that could knock a large share of validators offline and thereby prevent blocks from being finalized.
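If the chain goes more than four consecutive epochs without finalizing (an epoch is 32 slots of 12 seconds, so four epochs is about 25.6 minutes), it enters inactivity leak mode. Under this regime, validators that fail to attest begin to lose part of their staked Ether (ETH). Each missed epoch raises an offline validator's inactivity score, and the per-epoch penalty is proportional to that score, so the penalty grows linearly and the cumulative loss grows quadratically until finalization resumes.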
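The leak logic can be sketched from the Bellatrix-era consensus specs, whose inactivity constants were still in force under Capella in May 2023. This simplified model follows one validator that misses every target vote, assuming a constant 32 ETH effective balance and a starting inactivity score of zero, and ignoring the separate fixed penalties for missed source and target votes. `simulate_offline_validator` is a hypothetical helper; the constants and the score and penalty rules mirror the spec's `process_inactivity_updates` and `get_inactivity_penalty_deltas`.

```python
from typing import Iterable, List, Tuple

# Spec constants (Bellatrix/Capella Mainnet).
MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_SCORE_RECOVERY_RATE = 16
INACTIVITY_PENALTY_QUOTIENT_BELLATRIX = 2**24
GWEI_PER_ETH = 10**9


def is_in_inactivity_leak(finality_delay: int) -> bool:
    """finality_delay = previous_epoch - finalized_epoch (normally 2)."""
    return finality_delay > MIN_EPOCHS_TO_INACTIVITY_PENALTY


def simulate_offline_validator(
    finality_delays: Iterable[int],
    effective_balance_gwei: int = 32 * GWEI_PER_ETH,
) -> List[Tuple[int, int]]:
    """Return (inactivity_score, penalty_gwei) per epoch for a validator
    that never casts a correct target vote."""
    score = 0
    history = []
    for delay in finality_delays:
        score += INACTIVITY_SCORE_BIAS  # missed the target vote this epoch
        if not is_in_inactivity_leak(delay):
            # Outside a leak, scores decay quickly, so they stay at zero here.
            score -= min(INACTIVITY_SCORE_RECOVERY_RATE, score)
        penalty = (effective_balance_gwei * score) // (
            INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT_BELLATRIX
        )
        history.append((score, penalty))
    return history


# Finality delay climbs from 2 to 12 epochs: 3 normal epochs, then 8 leak epochs.
penalties = [p for _, p in simulate_offline_validator(range(2, 13))]
print(penalties, sum(penalties))
```

Under these assumptions the first three epochs cost nothing, then the eight leak epochs cost 1,907, 3,814, 5,722, 7,629, 9,536, 11,444, 13,351 and 15,258 gwei, about 68,661 gwei (roughly 0.00007 ETH) in total: the per-epoch amount rises linearly while the running total rises quadratically.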
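The leak acts as a two-part deterrent. First, it switches off rewards for attestations, so even validators that keep attesting earn nothing for their votes. Second, it imposes penalties on non-participating validators that escalate with the length of their inactivity. Over a long enough outage, these penalties shrink the offline validators' share of the total stake until the validators still online again hold two-thirds of it and finality can resume. The mechanism therefore pushes validators to stay active, speeds up network recovery, and is a cornerstone of network integrity during major disruptions.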
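The reward side of that deterrent can be sketched after the spec's `get_flag_index_deltas`, using the Altair flag weights (source 14, target 26, head 14, out of 64). `flag_delta` is a hypothetical helper, and `base_reward` and the participating-stake fraction are passed in directly rather than derived from a full beacon state. The escalating part of the deterrent is the separate inactivity penalty, which is not modeled here.

```python
from fractions import Fraction

# Altair+ attestation flag weights from the consensus specs.
TIMELY_SOURCE_WEIGHT = 14
TIMELY_TARGET_WEIGHT = 26
TIMELY_HEAD_WEIGHT = 14
WEIGHT_DENOMINATOR = 64

FLAG_WEIGHTS = {
    "source": TIMELY_SOURCE_WEIGHT,
    "target": TIMELY_TARGET_WEIGHT,
    "head": TIMELY_HEAD_WEIGHT,
}


def flag_delta(flag: str, participated: bool, in_leak: bool,
               base_reward: int, participating_fraction: Fraction) -> int:
    """Signed gwei change for one attestation flag in one epoch (simplified)."""
    weight = FLAG_WEIGHTS[flag]
    if participated:
        if in_leak:
            return 0  # during the inactivity leak, correct votes earn nothing
        # Reward scales with the share of stake that also got this flag right.
        return int(base_reward * weight * participating_fraction // WEIGHT_DENOMINATOR)
    if flag != "head":
        # Missed source/target votes are penalized whether or not a leak is active.
        return -(base_reward * weight // WEIGHT_DENOMINATOR)
    return 0  # a missed head vote is not penalized
```

For example, with a hypothetical base reward of 15,000 gwei and 90% participation, a correct target vote earns 5,484 gwei outside a leak and nothing during one, while a missed target vote costs 6,093 gwei in both cases.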
You can read more about this in the comprehensive guide to Ethereum 2.0 incentives and mechanisms at Eth2Book.
For Network Participants (Validators):
As per the estimates provided by Ben Edgington, assuming an 8-epoch leak with 65% of validators offline, the inactivity leak burned approximately 28 ETH. Spread across roughly 390,000 offline validators (65% of a set of about 600,000), this works out to about 0.00007 ETH per offline validator.
In addition, attestation rewards dropped to zero during the outages, forgoing around 50 ETH that would otherwise have been issued. Taken together, inactivity penalties and lost attestation rewards cost validators an estimated 78 ETH.
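The ~28 ETH figure can be sanity-checked against the spec's inactivity penalty. Assuming every offline validator has a 32 ETH effective balance $B$, starts the leak with an inactivity score of zero, and gains 4 points (`INACTIVITY_SCORE_BIAS`) per leak epoch, the per-epoch penalty $B \cdot s / (4 \cdot 2^{24})$ grows linearly, so the cumulative loss after $n$ leak epochs is:

$$
P_n = \sum_{k=1}^{n} \frac{B \cdot 4k}{4 \cdot 2^{24}} = \frac{B\,n(n+1)}{2^{25}},
\qquad
P_8 = \frac{32 \cdot 72}{2^{25}}\ \text{ETH} \approx 6.87 \times 10^{-5}\ \text{ETH}
$$

Multiplied by roughly 390,000 offline validators (an assumed 65% of about 600,000), this gives about 26.8 ETH, the same order as the ~28 ETH estimate; adding the ~50 ETH of forgone attestation rewards gives the ~78 ETH total.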
End users, by contrast, were barely affected. Fewer blocks meant less block space and lower transaction processing capacity, but gas prices did not spike dramatically and stayed below the daily peak. More importantly, the network maintained liveness throughout both incidents.
Ethereum therefore kept processing transactions without major disruption, and users could carry on with their activity on the network largely uninterrupted, a clear sign of the system's robustness under stress.
At the heart of the Prysm issue was the missing cache for block replays (regenerating an older beacon state by replaying blocks on top of a stored state). Without it, system load rose, excessive goroutines were spawned, and CPU usage climbed. In some instances new replays started before previous ones had finished, straining the system further.
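Prysm is written in Go, and its actual fix is not reproduced here. The following Python sketch only illustrates the general technique the paragraph says was missing: memoizing expensive state replays in a bounded cache, and collapsing concurrent requests for the same state into a single in-flight replay. `ReplayCache` and `replay_fn` are hypothetical names, and the replay itself is stubbed out.

```python
import threading
from collections import OrderedDict
from typing import Callable, Dict, Hashable


class ReplayCache:
    """Hypothetical bounded LRU cache for states rebuilt by replaying blocks.

    Concurrent requests for the same key share one in-flight replay
    instead of each starting its own.
    """

    def __init__(self, replay_fn: Callable[[Hashable], object], capacity: int = 8):
        self._replay_fn = replay_fn  # expensive: rebuilds a state from blocks
        self._capacity = capacity
        self._cache: "OrderedDict[Hashable, object]" = OrderedDict()
        self._inflight: Dict[Hashable, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> object:
        while True:
            with self._lock:
                if key in self._cache:  # cache hit: no replay at all
                    self._cache.move_to_end(key)
                    return self._cache[key]
                done = self._inflight.get(key)
                is_leader = done is None
                if is_leader:  # first requester performs the replay
                    done = threading.Event()
                    self._inflight[key] = done
            if not is_leader:
                done.wait()  # wait for the leader's replay,
                continue     # then re-check the cache
            try:
                state = self._replay_fn(key)
                with self._lock:
                    self._cache[key] = state
                    self._cache.move_to_end(key)
                    if len(self._cache) > self._capacity:
                        self._cache.popitem(last=False)  # evict least recently used
                return state
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # wake any waiters, even if the replay failed
```

A caller would use `cache.get(checkpoint_root)` wherever a replay would otherwise start. This sketch only deduplicates per key; a production client would presumably also cap the total number of concurrent replays.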
A compounding factor was that Prysm processed attestations from previous epochs that it should have disregarded. This inefficiency, combined with suboptimal use of the head state, stressed the system, particularly amid a surge in deposits and a growing validator registry.
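The article does not say exactly which check Prysm was missing, and Prysm's own fix is not reproduced here. As an illustration of the principle of discarding stale data before doing any expensive work, here is a sketch of two gossip-validation rules for attestations from the phase0 networking spec, still in effect under Capella: an attestation is ignored unless its slot falls within the last `ATTESTATION_PROPAGATION_SLOT_RANGE` (32) slots, and rejected unless its target epoch matches the epoch of its slot. The `gossip_precheck` helper is hypothetical, and the spec's clock-disparity allowance is omitted for brevity.

```python
from dataclasses import dataclass

# Phase0 constants (unchanged through Capella).
SLOTS_PER_EPOCH = 32
ATTESTATION_PROPAGATION_SLOT_RANGE = 32


@dataclass(frozen=True)
class AttestationData:
    slot: int
    target_epoch: int


def compute_epoch_at_slot(slot: int) -> int:
    return slot // SLOTS_PER_EPOCH


def gossip_precheck(data: AttestationData, current_slot: int) -> str:
    """Cheap checks to run before any state lookup or block replay."""
    if not (data.slot <= current_slot <= data.slot + ATTESTATION_PROPAGATION_SLOT_RANGE):
        return "IGNORE"  # too old (or from the future): drop without further work
    if data.target_epoch != compute_epoch_at_slot(data.slot):
        return "REJECT"  # the target epoch must match the attestation's own epoch
    return "CONTINUE"  # only now pay for shuffling and state lookups
```

The design point is ordering: both checks need only integers already present in the message, so stale or inconsistent attestations are dropped before they can trigger a replay.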
The incidents also unveiled key differences in the strategies adopted by different Ethereum clients. While Lighthouse chose to drop attestations to maintain network liveness, Prysm and Teku, among others, defaulted to generating blocks using old attestations when faced with execution client issues.
Despite the challenges, these incidents yielded critical insights into software inefficiencies, design choices, and network conditions, insights that make the Ethereum network more robust. The events caused no permanent damage; instead, they reinforced the resilience and versatility of Ethereum's network design.
The resilience of Ethereum's Beacon Chain was truly tested during these incidents, and it passed with flying colors. The Ethereum Beacon Chain seems to be alive, healing itself.
A key factor in the successful recovery was the diversity of clients on the Ethereum network. Having multiple clients, each handling load in its own way, proved to be a boon. For instance, while Prysm and Teku struggled under the load of old attestations, Lighthouse's strategy of dropping them kept part of the network live and functional.
In essence, much of Ethereum's resilience comes from its client diversity, which played a crucial role in letting the network recover on its own without any manual intervention.
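The arithmetic behind client diversity follows from Casper FFG's two-thirds rule: a checkpoint is justified only when attestations carry at least two-thirds of the total active stake. The sketch below uses hypothetical client names and made-up stake shares, not measured Mainnet figures from May 2023, to show that any single client holding more than one-third of the stake can stall finality on its own, while smaller clients cannot. Block production is unaffected by this threshold; only finality depends on it.

```python
from fractions import Fraction
from typing import Dict, Iterable, List

# Justification needs attesting stake >= 2/3 of total active stake
# (spec form: target_balance * 3 >= total_balance * 2).
JUSTIFICATION_THRESHOLD = Fraction(2, 3)


def finality_possible(stake_share: Dict[str, Fraction], impaired: Iterable[str]) -> bool:
    """True if the clients that are not impaired still hold >= 2/3 of the stake."""
    impaired = set(impaired)
    healthy = sum((share for client, share in stake_share.items()
                   if client not in impaired), Fraction(0))
    return healthy >= JUSTIFICATION_THRESHOLD


def single_points_of_failure(stake_share: Dict[str, Fraction]) -> List[str]:
    """Clients whose failure alone would push participation below 2/3."""
    return sorted(client for client, share in stake_share.items()
                  if share > 1 - JUSTIFICATION_THRESHOLD)


# Hypothetical stake shares, for illustration only.
SHARES = {
    "client_a": Fraction(40, 100),
    "client_b": Fraction(30, 100),
    "client_c": Fraction(20, 100),
    "client_d": Fraction(10, 100),
}
```

With these made-up shares, an outage of `client_a` alone stops finality, whereas an outage of any other single client leaves at least 60%... more precisely at least 70% of the stake online and finality intact.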
- Testnet vs Mainnet: The incidents underscored the discrepancies between testnet environments and the Mainnet. With over 600,000 validators and a significant volume of withdrawal operations on the Mainnet, it's clear that the complexities and unpredictability of a live network often exceed those of test environments. This signals the need for more rigorous stress testing to better prepare for real-world network conditions.
- Inactivity Leak Penalties: The incidents demonstrated on Mainnet that inactivity leak penalties work as intended. These penalties play an essential role in promoting active validator participation, maintaining network liveness, and enabling network recovery.
- The Importance of Liveness: The incidents underscored the vital role of liveness in a blockchain network. Thanks to the LMD-GHOST fork-choice rule, Ethereum stayed live throughout, so users experienced minimal impact. Unlike blockchains that halt during network issues, Ethereum keeps producing blocks even when finality stalls, accepting reduced throughput rather than downtime. This approach protects users and the proper functioning of the network: without liveness, network functionality and user security are compromised, regardless of throughput.
- Importance of Client Diversity: The recovery process emphasized the value of having a diverse client base. Different Ethereum clients have unique responses to network incidents, contributing to the overall resilience and robustness of the network.
- Network Resilience: The incidents served as a powerful testament to the Ethereum network's resilience. Despite significant challenges, the network self-recovered and bounced back stronger, embodying the concept of antifragility in complex systems. This resilience sets a strong precedent for the broader crypto ecosystem and signifies the robustness of Ethereum's underlying architecture and design principles.
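The liveness point above rests on LMD-GHOST. As a simplified sketch (it leaves out the justified-checkpoint filtering, proposer boost, and equivocation handling of the real fork-choice spec), the rule starts from a known block and repeatedly steps into the child whose subtree carries the most stake from validators' latest votes. Because it needs no two-thirds quorum, it still returns a head when participation drops, which is one reason blocks kept being produced while finality stalled. `lmd_ghost_head` and its arguments are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Hashable


def lmd_ghost_head(parent: Dict[str, str], root: str,
                   latest_votes: Dict[Hashable, str],
                   stake: Dict[Hashable, int]) -> str:
    """Simplified LMD-GHOST. `parent` maps each block to its parent;
    every voted block is assumed to descend from `root`."""
    children = defaultdict(list)
    for block, par in parent.items():
        children[par].append(block)

    # Credit each validator's stake to its latest-voted block and every ancestor up to root.
    weight = defaultdict(int)
    for validator, block in latest_votes.items():
        while True:
            weight[block] += stake[validator]
            if block == root:
                break
            block = parent[block]

    # Greedily descend into the heaviest child; ties go to the larger block id.
    head = root
    while children[head]:
        head = max(children[head], key=lambda b: (weight[b], b))
    return head
```

For example, with blocks `a` and `c` built on `genesis` and `b` built on `a`, votes for `b`, `a` and `c` from three equal-stake validators select `b`; if only the validator voting for `c` is online, the rule still picks a head (`c`) instead of stalling.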
The incidents of May 11 and 12, 2023, were pivotal moments in Ethereum's journey. They provided tangible proof of the Beacon Chain's vitality, even amid challenging circumstances. As Ethereum continues to evolve, it builds on these experiences, growing not only more robust but also more antifragile, ready to forge ahead in its journey of decentralization and beyond.