Why Most Parallel Execution Designs Fail
Part 1 of The Execution Layer Truth
Every new L1 ships with the same marketing slide: a bar chart showing 100K TPS next to Ethereum's 15. Parallel execution is now a checkbox — claim it or get ignored.
I've been deep in blockchain infrastructure for a while now — consensus layers, state management under load, execution engine profiling across multiple L1 and L2 stacks. The pattern repeats: parallel execution designs break down under real DeFi workloads. Not theoretically. Measurably. And the industry keeps optimizing for the wrong metric.
The Promise vs The Reality
The theoretical model is seductive. Identify independent transactions, execute them on separate cores, and throughput scales linearly with hardware. 16 cores = 16x throughput. Simple.
Except it's not.
Real DeFi workloads create dense dependency graphs. A single block containing Uniswap swaps, Aave liquidations, and MEV bot activity creates a web of state dependencies that no static analysis can fully predict. Chains advertising "parallel execution" deliver 2-3x improvement in practice. Not 16x.
The gap between synthetic benchmarks and production reality is the dirty secret of the parallel execution narrative.
Failure Mode 1: State Access Prediction Is Broken
The most popular approach — optimistic parallel execution, popularized by systems like Monad and formalized in Block-STM — works like this: assume transactions are independent, execute them in parallel, detect conflicts after the fact, re-execute conflicting transactions.
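The execute, validate, re-execute loop described above can be sketched in a few lines. This is a deliberately simplified, single-threaded illustration, not Block-STM's or Monad's actual implementation: `View`, `execute_block`, and `transfer` are hypothetical names, and the speculative phase runs sequentially here where a real engine would spread it across cores.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Result:
    reads: Dict[str, int] = field(default_factory=dict)   # slot -> value observed
    writes: Dict[str, int] = field(default_factory=dict)  # slot -> value to commit


class View:
    """Records every slot a transaction reads or writes during one execution."""

    def __init__(self, base: State) -> None:
        self.base = base
        self.result = Result()

    def read(self, slot: str) -> int:
        if slot in self.result.writes:           # read-your-own-write
            return self.result.writes[slot]
        value = self.base.get(slot, 0)            # missing slots default to 0
        self.result.reads[slot] = value
        return value

    def write(self, slot: str, value: int) -> None:
        self.result.writes[slot] = value


Tx = Callable[[View], None]


def run(tx: Tx, state: State) -> Result:
    view = View(state)
    tx(view)
    return view.result


def execute_block(txs: List[Tx], state: State) -> int:
    """Commit txs in block order; return how many had to be re-executed."""
    # Phase 1: speculate. Every tx runs against the same pre-block snapshot.
    # (Sequential here; this is the phase a real engine parallelizes.)
    snapshot = dict(state)
    speculative = [run(tx, snapshot) for tx in txs]

    # Phase 2: validate in block order; re-execute any tx whose reads went stale.
    reexecutions = 0
    for tx, result in zip(txs, speculative):
        stale = any(state.get(slot, 0) != seen for slot, seen in result.reads.items())
        if stale:
            result = run(tx, state)
            reexecutions += 1
        state.update(result.writes)
    return reexecutions


def transfer(src: str, dst: str, amount: int) -> Tx:
    """Hypothetical transaction: move `amount` from slot `src` to slot `dst`."""
    def tx(view: View) -> None:
        view.write(src, view.read(src) - amount)
        view.write(dst, view.read(dst) + amount)
    return tx
```

Feeding this eight transfers between disjoint accounts produces no re-executions. Pointing the same eight transfers at a single shared `pool` slot forces every transaction after the first to run twice, which is the contention pattern a popular liquidity pool creates.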
The core assumption: conflicts are rare, which in practice means a transaction's state accesses are predictable enough that speculative results seldom have to be thrown away.
This assumption fails in DeFi.
Consider a swap on a DEX aggregator. The optimal route depends on current pool reserves, which change with every preceding transaction. The state access pattern is data-dependent — you can't know what state the transaction touches until you've already executed the routing logic.
Or take an Aave liquidation: whether a position is liquidatable depends on the current oracle price and the borrower's health factor. The transaction might read 2 state slots or 200. Depends on the outcome.
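A toy router makes the data dependence concrete. The sketch below assumes a fee-less constant-product quote and hypothetical slot names (`direct.in`, `hop1.out`, and so on); it is not any real aggregator's logic. The only point is that the set of slots read depends on the reserves found at execution time.

```python
from typing import Dict, Set, Tuple


def quote(reserve_in: int, reserve_out: int, amount_in: int) -> int:
    """Constant-product quote with no fee (assumption: x * y = k pool)."""
    return (amount_in * reserve_out) // (reserve_in + amount_in)


def route_swap(state: Dict[str, int], amount_in: int, min_out: int) -> Tuple[int, Set[str]]:
    """Return (amount_out, slots_read). Read-only: no reserves are updated."""
    touched: Set[str] = set()

    def read(slot: str) -> int:
        touched.add(slot)
        return state[slot]

    # Try the direct pool first.
    direct_out = quote(read("direct.in"), read("direct.out"), amount_in)
    if direct_out >= min_out:
        return direct_out, touched

    # Direct pool is too shallow: fall back to a two-hop route,
    # which reads four additional slots.
    mid = quote(read("hop1.in"), read("hop1.out"), amount_in)
    hop_out = quote(read("hop2.in"), read("hop2.out"), mid)
    return max(direct_out, hop_out), touched


balanced = {
    "direct.in": 1_000_000, "direct.out": 1_000_000,
    "hop1.in": 1_000_000, "hop1.out": 1_000_000,
    "hop2.in": 1_000_000, "hop2.out": 1_000_000,
}
# Same transaction, but a preceding swap has skewed the direct pool.
skewed = dict(balanced, **{"direct.in": 2_000_000, "direct.out": 500_000})

_, reads_before = route_swap(balanced, amount_in=1_000, min_out=990)
_, reads_after = route_swap(skewed, amount_in=1_000, min_out=990)
```

With balanced reserves the swap reads two slots; after the direct pool is skewed, the identical transaction reads six. No scheduler can know which case applies without running the routing logic against the state left by the preceding transactions.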
The result is a cascade of re-executions. In high-contention scenarios — exactly when throughput matters most — re-execution overhead can exceed the parallelism gains. You're running 16 cores but spending 60% of cycles on redundant work.
How bad? Flashbots found that ~35% of Ethereum transactions conflicted in 2017 blocks — and that was before DeFi Summer, before Uniswap v3, before the MEV arms race. Amber Group's analysis suggests conflict rates above 50% during NFT or memecoin frenzies. Sei Research measured 35% dependent transactions on recent Ethereum data and calculated that even with perfect hardware, Amdahl's Law caps parallel speedup at ~2-3x when a third of transactions are serial.
That's not an optimization. That's a tax.
Failure Mode 2: Static Declaration Doesn't Scale
The alternative approach — requiring developers to declare state access paths upfront (as in Move/Aptos) — trades runtime flexibility for predictability. The VM knows exactly which resources each transaction touches before execution begins.
Works beautifully for simple transfers. Breaks down for composability.
When your contract calls another contract, and that contract calls a third, the total state access footprint is a runtime composition of all three. You can't statically declare what you don't statically know. This forces developers into conservative over-declaration (killing parallelism) or runtime discovery (killing predictability).
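The over-declaration problem can be shown with a small sketch. It assumes a hypothetical registry of per-contract resource declarations (`DECLARED`) and a may-call graph (`MAY_CALL`); this is not Move's or Aptos's actual declaration mechanism, and every declared resource is treated as a write for simplicity.

```python
from typing import Dict, List, Optional, Set

# Hypothetical declarations: resources each contract touches directly.
DECLARED: Dict[str, Set[str]] = {
    "router": {"router.config"},
    "dex": {"pool.A", "pool.B"},
    "lending": {"market.X", "oracle.price"},
    "vault": {"vault.shares"},
}

# Hypothetical call graph: contracts each contract *may* call.
MAY_CALL: Dict[str, List[str]] = {
    "router": ["dex", "lending"],
    "vault": ["lending"],
    "dex": [],
    "lending": [],
}


def declared_footprint(contract: str, seen: Optional[Set[str]] = None) -> Set[str]:
    """Conservative footprint: everything reachable through the call graph."""
    seen = set() if seen is None else seen
    if contract in seen:            # guard against call cycles
        return set()
    seen.add(contract)
    resources = set(DECLARED[contract])
    for callee in MAY_CALL[contract]:
        resources |= declared_footprint(callee, seen)
    return resources


def conflicts(a: Set[str], b: Set[str]) -> bool:
    # Assumption: any shared resource is a conflict (all accesses are writes).
    return bool(a & b)


router_declared = declared_footprint("router")
vault_declared = declared_footprint("vault")

# What one particular pair of transactions actually touches at runtime:
# the router call only swaps on pool A, the vault call only updates market X.
router_runtime = {"router.config", "pool.A"}
vault_runtime = {"vault.shares", "market.X"}
```

Here `conflicts(router_declared, vault_declared)` is true because both contracts may reach the lending market, so a scheduler working from declarations serializes the pair. `conflicts(router_runtime, vault_runtime)` is false: the two transactions could have run in parallel.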
Developer experience suffers too. In a world where we're trying to abstract complexity, forcing engineers to reason about access paths at the resource level is a regression.
Failure Mode 3: Benchmark Theater
The numbers everyone cites.
Block-STM — the parallel execution engine behind Aptos — reports 170K TPS on synthetic benchmarks. Impressive. Until you look at what's actually running on mainnet.
Solana's stress test hit 107K TPS with trivial no-op transactions. Real mainnet throughput? ~900-1000 TPS. 100x gap between the lab and the street.
Arbitrum's theoretical ceiling is ~40K TPS. Practical throughput: ~400 TPS. Even Avalanche, one of the more optimized L1s, sustains only ~33 TPS on-chain.
| Chain/Engine | Benchmark TPS | Real TPS | Gap |
|---|---|---|---|
| Block-STM (Aptos) | 170,000 | 50,000-80,000 (contended) | 2-3x |
| Solana | 107,000 (stress) | ~900-1,000 | 100x |
| Arbitrum | ~40,000 (theoretical) | ~400 | 100x |
| Avalanche | N/A | ~33 | — |
| Ethereum L1 | N/A | ~20-40 | — |
The industry has collectively agreed to measure performance at the best case and ignore the worst. It's like benchmarking a car's top speed on a straight road and ignoring that it has to drive in a city.
Replace those synthetic transfers with multi-hop DEX swaps, leveraged position management with oracle updates, cross-protocol composability — and that 100K TPS drops to single-digit thousands. Sometimes less. Throughput under adversarial workloads — exactly the conditions MEV bots create — is even worse.
Failure Mode 4: Composability Is the Enemy
The fundamental architectural tension: DeFi composability and parallel execution are at odds.
Composability means contracts can atomically compose — swap on Uniswap, deposit into Aave, borrow against it, swap again. Each step depends on the previous step's state changes. Sequential dependency chains that no amount of parallelism can optimize away.
Amdahl's Law puts a hard number on this. Sei Research's analysis shows that if ~65% of transactions can run in parallel (P=0.65), a 64-core system achieves only ~2.7x throughput. Not 64x. Not 16x. Under 3x.
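The arithmetic is easy to check. Amdahl's Law gives the speedup on N cores as 1 / ((1 - P) + P / N), where P is the parallelizable fraction. A minimal sketch, with `amdahl_speedup` as an illustrative helper name:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - P) + P / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# How speedup grows with core count when 65% of the work is parallelizable.
for cores in (4, 16, 64, 1024):
    print(f"{cores:>5} cores -> {amdahl_speedup(0.65, cores):.2f}x")
```

With P = 0.65 the result stays under 3x at 64 cores, and no core count can push it past the ceiling of 1 / (1 - P), roughly 2.86x. Going from 16 to 64 cores buys only a fraction of a multiple.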
And that's optimistic. In DeFi, composable apps — trades, liquidations, arbitrage — create hotspots. A single popular liquidity pool can serialize a significant chunk of a block. The real P value for complex DeFi blocks is likely lower, dragging speedups closer to 1-2x.
Chains that optimize for parallelism tend to sacrifice composability (sharded execution, isolated VMs). Chains that optimize for composability tend to sacrifice parallelism. Getting both at full strength is architecturally difficult — and the industry hasn't been honest about this tradeoff.
What Actually Works
Abandon parallel execution? No. But be smarter about where we apply it.
1. Adaptive conflict detection. Instead of optimistic execution with blind re-execution, use lightweight pre-analysis to identify likely conflict zones and schedule accordingly. Block-STM's multi-version data structure detects conflicts at the memory-access level rather than the transaction level, reducing wasted work. But even Block-STM's speedup collapses from 17x to 2-3x once contention exceeds 30%.
2. Read/write path separation. Read-heavy transactions (queries, simulations, price checks) can be massively parallelized with zero conflicts. Write-heavy transactions need careful ordering. Current EVM execution treats them identically — a massive inefficiency. Separating read replicas from write masters is standard practice in distributed databases. Blockchain execution should adopt the same pattern.
3. Application-aware scheduling. Protocols can annotate transactions with coarse-grained intent (e.g., "this interacts with lending markets") to enable smarter scheduling without full static declaration. Sits between the extremes of "declare every storage slot" and "figure it out at runtime." The annotation doesn't need to be precise — even knowing "this touches Uniswap V3 pools" is enough to avoid scheduling two swaps on the same pool in parallel.
4. Hierarchical execution. Local parallelism within protocol boundaries, global serialization at cross-protocol boundaries. Matches how DeFi actually composes — mostly within protocols, occasionally across them. A block of 100 transactions might have 80 that are intra-protocol (highly parallelizable) and 20 that are cross-protocol (serial). Optimize for the common case.
5. State access optimization over CPU parallelism. Flashbots reports that over 70% of Ethereum transaction processing time is spent in state reads and writes — Merkle trie I/O, not computation. Robert Leifke's 2026 analysis confirms: blockchain performance is data-movement bound, not compute bound. Adding execution cores is fighting the wrong battle. The real wins: better state storage models (flat key-value vs. Merkle trees), in-memory caching for hot state, hardware-aware I/O scheduling. Some teams are exploring specialized database backends (MDBX, RocksDB tuning) over generic LevelDB — early results show 3-5x improvement in state access latency.
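Application-aware scheduling (item 3) is the easiest of these to sketch. The example below assumes each transaction carries a set of coarse, hypothetical tags such as `univ3:ETH/USDC`, treats any shared tag as a potential write conflict, and greedily packs transactions into batches. Batches run one after another; transactions inside a batch are assumed safe to run in parallel. It illustrates the idea and is not a description of any chain's scheduler.

```python
from typing import Dict, FrozenSet, List, Tuple

TaggedTx = Tuple[str, FrozenSet[str]]  # (tx id, coarse-grained tags)


def schedule(txs: List[TaggedTx]) -> List[List[str]]:
    """Greedy batching that preserves block order between txs sharing a tag."""
    last_batch_for_tag: Dict[str, int] = {}
    batches: List[List[str]] = []
    for tx_id, tags in txs:
        # Earliest batch strictly after every earlier tx that shares a tag.
        index = 1 + max(
            (last_batch_for_tag[t] for t in tags if t in last_batch_for_tag),
            default=-1,
        )
        if index == len(batches):
            batches.append([])
        batches[index].append(tx_id)
        for tag in tags:
            last_batch_for_tag[tag] = index
    return batches


block: List[TaggedTx] = [
    ("swap1", frozenset({"univ3:ETH/USDC"})),
    ("swap2", frozenset({"univ3:ETH/USDC"})),
    ("deposit1", frozenset({"aave:USDC"})),
    ("swap3", frozenset({"univ3:WBTC/ETH"})),
    ("loop1", frozenset({"univ3:ETH/USDC", "aave:USDC"})),
]

batches = schedule(block)
# batches == [["swap1", "deposit1", "swap3"], ["swap2"], ["loop1"]]
```

The two swaps on the same pool never share a batch, the unrelated deposit and swap run alongside the first one, and the cross-protocol transaction waits for everything it might touch. Coarser tags mean more false conflicts and fewer, smaller batches; finer tags drift back toward full static declaration.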
The Uncomfortable Truth
Here's what nobody wants to hear: execution speed is not the bottleneck.
State growth is the bottleneck. State I/O is the bottleneck. The cost of reading from and writing to state storage dominates execution time for most real workloads. A transaction that touches 50 storage slots spends more time waiting for state reads than executing instructions.
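To see why hot-state caching matters more than extra cores, here is a minimal sketch of an LRU cache sitting in front of a slow state backend. `CachedState` is a hypothetical class, the backend is a plain dict standing in for trie or database reads, and the workload (100 transactions that each read five shared hot slots plus one unique cold slot) is invented for illustration.

```python
from collections import OrderedDict
from typing import Dict


class CachedState:
    """Write-through LRU cache over a slow backend; counts backend reads."""

    def __init__(self, backend: Dict[str, int], capacity: int) -> None:
        self.backend = backend
        self.capacity = capacity
        self.cache: "OrderedDict[str, int]" = OrderedDict()
        self.backend_reads = 0

    def _remember(self, slot: str, value: int) -> None:
        self.cache[slot] = value
        self.cache.move_to_end(slot)          # mark as most recently used
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used

    def read(self, slot: str) -> int:
        if slot in self.cache:
            self.cache.move_to_end(slot)
            return self.cache[slot]
        self.backend_reads += 1               # the expensive path
        value = self.backend.get(slot, 0)
        self._remember(slot, value)
        return value

    def write(self, slot: str, value: int) -> None:
        self.backend[slot] = value            # write-through
        self._remember(slot, value)


def run_workload(state: CachedState, tx_count: int = 100) -> int:
    """Each tx reads 5 hot slots and 1 cold slot; return total backend reads."""
    for i in range(tx_count):
        for h in range(5):
            state.read(f"hot{h}")
        state.read(f"cold{i}")
    return state.backend_reads


with_cache = run_workload(CachedState({}, capacity=8))
without_cache = run_workload(CachedState({}, capacity=0))
```

In this toy workload an eight-entry cache cuts backend reads from 600 to 105, because the five hot slots are fetched once and then served from memory. Real gains depend on the access distribution and on invalidation rules this sketch ignores.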
The industry should be optimizing state access patterns, storage models, and caching strategies — not adding more execution cores. But "parallel state access optimization" doesn't make for a sexy marketing headline.
Conclusion
Most parallel execution designs fail because they optimize for the wrong thing: theoretical throughput on synthetic workloads. The real world is messier — data-dependent state access, dense dependency graphs, adversarial conditions, and the fundamental tension between composability and parallelism.
The winners won't be the chains with the highest benchmark numbers. They'll be the ones honest about the tradeoffs, adaptive in their scheduling, and focused on the actual bottleneck: state.
Next time someone shows you a 100K TPS benchmark, ask one question: what does that number look like when the chain is actually being used?
Part 1 of The Execution Layer Truth series. Next: AI Still Cannot Design Correct Ledger Systems.
References
- Flashbots, "Speeding up the EVM" (2023) — 35% conflict rate, 70% state I/O
- Saraph & Herlihy, "Speculative Concurrency in Ethereum" (2019, Dagstuhl) — 20-34% abort rate
- Gelashvili et al., Block-STM / Aptos (2022) — 170K TPS synthetic, 17x speedup
- Sei Research, "64.85% of Ethereum TXs Can Be Parallelized" (2024) — Amdahl P=0.649
- Amber Group, "Parallel Power Unlocked" (2022) — conflict rate analysis
- Robert Leifke, "State Access is the Bottleneck" (2026) — I/O bound analysis
- Solana stress test data via AInvest (2025) — 107K vs ~900 TPS
- Chainspect Dashboard — Avalanche ~33 TPS