DCIPs/EIPS/eip-2926.md

162 lines
12 KiB
Markdown

---
eip: 2926
title: Chunk-Based Code Merkleization
author: Sina Mahmoodi (@s1na), Alex Beregszaszi (@axic)
discussions-to: https://ethereum-magicians.org/t/eip-2926-chunk-based-code-merkleization/4555
status: Stagnant
type: Standards Track
category: Core
created: 2020-08-25
requires: 161, 170, 2584
---
## Abstract
Code merkleization, along with binarification of the trie and gas cost bump of state accessing opcodes, are considered as the main levers for decreasing block witness sizes in stateless or partial-stateless Eth1x roadmaps. Here we specify a fixed-sized chunk approach to code merkleization and outline how the transition of existing contracts to this model would look like.
## Motivation
Bytecode is currently the [second contributor](https://github.com/mandrigin/ethereum-mainnet-bin-tries-data) to block witness size, after the proof hashes. Transitioning the trie from hexary to binary reduces the hash section of the witness by 3x, thereby making code the first contributor. By breaking contract code into chunks and committing to those chunks in a merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.
## Specification
This specification assumes that [EIP-2584](./eip-2584.md) is deployed, and the merkleization rules and gas costs are proposed accordingly. What follows is structured to have two sections:
1. How a given contract code is split into chunks and then merkleized
2. How to merkleize all existing contract codes during a hardfork
### Constants and Definitions
#### Constants
- `CHUNK_SIZE`: 32 (bytes)
- `KEY_LENGTH`: 2 (bytes)
- `MAX_CHUNK_COUNT`: `0xfffc`
- `VERSION_KEY`: `0xfffd`
- `CODELEN_KEY`: `0xfffe`
- `CODEHASH_KEY`: `0xffff`
- `VERSION`: 0
- `EMPTY_CODE_ROOT`: `0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470` (==`keccak256('')`)
- `HF_BLOCK_NUMBER`: to be defined
#### Definitions
- `BE(x, N)`: casts `x` to an unsigned integer of `N` bytes and returns its big-endian representation
### Code merkleization
For an account record `A` with code `C`, the field `A.codeHash` is replaced with `codeRoot`. `codeRoot` is `EMPTY_CODE_ROOT` if `C` is empty. Otherwise it contains the root of `codeTrie`, a [binary trie](https://hackmd.io/uCWOpSrUQaytBgcO0MVkTQ) with the following leaves:
- Key: `VERSION_KEY`, value: `BE(VERSION, 1)`
- Key: `CODELEN_KEY`, value: `BE(length(C), 4)`
- Key: `CODEHASH_KEY`, value: `keccak256(C)`
In addition to the above, `codeTrie` commits to a list of code `chunks = [(FIO_0, code_0), ..., (FIO_n, code_n)]` which are derived from `C` in a way that:
- `n < MAX_CHUNK_COUNT`.
- `code_0 || ... || code_n == C`.
- `length(code_i) == CHUNK_SIZE` where `0 <= i < n`.
- `length(code_n) <= CHUNK_SIZE`.
- `FIO_i` is the offset of the first instruction within the chunk. It should only be greater than zero if the last instruction in `code_i-1` is a multi-byte instruction (like `PUSHN`) crossing the chunk boundary. It is set to `CHUNK_SIZE` in the case where all bytes of a chunk are data.
The `i`th element of `chunks` is stored in `codeTrie` with:
- Key: `BE(i, KEY_LENGTH)`
- Value: `BE(FIO_i, 1) || code_i`, where `||` stands for byte-wise concatenation
#### Contract creation gas cost
As of now there is a charge of 200 gas per byte of the code stored in state by contract creation operations, be it initiated via `CREATE`, `CREATE2`, or an external transaction. This per byte cost is to be increased from `200` to `TBD` to account for the chunking and merkleization costs.
### Updating existing code (transition process)
The transition process involves reading all contracts in the state and applying the above procedure to them. A benchmark showing how long this process will take is still pending, but intuitively it should take longer than the time between two blocks (in the order of hours). Hence we recommend clients to pre-process the changes before the EIP is activated.
Code has the nice property that it is (mostly) static. Therefore clients can maintain a mapping of `[accountAddress -> codeRoot]` which stores the results for the contracts they have already merkleized. During this pre-computation phase, whenever a new contract is created its `codeRoot` is computed, and added to the mapping. Whenever a contract self-destructs, its corresponding entry is removed.
At block `HF_BLOCK_NUMBER` when the EIP gets activated, before executing any transaction the client must update the account record for all contracts with non-empty code in the state to replace the `codeHash` field with the pre-computed `codeRoot`. EoA accounts will keep their `codeHash` value as `codeRoot`. *Accounts with empty code will keep their `codeHash` value as `codeRoot`.*
## Rationale
### Hexary vs binary trie
The Ethereum mainnet state is encoded as of now in a hexary Merkle Patricia Tree. As part of the Eth1x roadmap, a transition to a [binary trie](https://ethresear.ch/t/binary-trie-format/7621) has been [investigated](https://medium.com/@mandrigin/stateless-ethereum-binary-tries-experiment-b2c035497768) with the goal of reducing witness sizes. Because code chunks are also stored in the trie, this EIP would benefit from the witness size reduction offered by the binarification. Therefore we have decided to explicitly state [EIP-2584](./eip-2584.md) to be a requirement of this change. Note that if code merkleization is included in a hard-fork beforehand, then all code must be re-merkleized after the binary transition.
### Chunk size
The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher [chunk utilization](https://ethresear.ch/t/some-quick-numbers-on-code-merkelization/7260/3)), but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.
### First instruction offset
The `firstInstructionOffset` fields allows safe jumpdest analysis when a client doesn't have all the chunks, e.g. a stateless clients receiving block witnesses.
Note: there could be an edge case when computing FIO for the chunks where the data bytes at the end of a bytecode (last chunk) resemble a multi-byte instruction. This case can be safely ignored.
### Gas cost of code-accessing opcodes
How merkleized code is stored in the client database affects the performance of code-accessing opcodes, i.e: CALL, CALLCODE, DELEGATECALL, STATICCALL, EXTCODESIZE, EXTCODEHASH, and EXTCODECOPY. Storing the code trie with all intermediate nodes in the database implies multiple look-ups to fetch the code of the callee, which is more than the current one look-up (excluding the trie traversal to get the account) required. Note CODECOPY and CODESIZE are not affected since the code for the current contract has already been loaded to memory.
The gas cost analysis in this section assumes a specific way of storing it. In this approach clients only merkleize code once during creation to compute `codeRoot`, but then discard the chunks. They instead store the full bytecode as well as the metadata fields in the database. We believe per-chunk metering for calls would be more easily solvable by witness metering in the stateless model.
### Different chunking logic
We have considered an alternative option to package chunks, where each chunk is prepended with its `chunkLength` and would only contain complete opcodes (i.e. any multi-byte opcode not fitting the `CHUNK_SIZE` would be deferred to the next chunk).
This approach has downsides compared to the one specified:
1) Requires a larger `CHUNK_SIZE` -- at least 33 bytes to accommodate the `PUSH32` instruction.
2) It is more wasteful. For example, `DUP1 PUSH32 <32-byte payload>` would be encoded as two chunks, the first chunk contains only `DUP1`, and the second contains only the `PUSH32` instruction with its payload.
3) Calculating the number of chunks is not trivial and would have to be stored explicitly in the metadata.
Additionally we have reviewed many other options (basic block based, Solidity subroutines (requires determining the control flow), EIP-2315 subroutines). This EIP however only focuses on the chunk-based option.
### RLP and SSZ
To remain consistent with the binary transition proposal we avoid using RLP for serializing the leaf values. We have furthermore considered SSZ for both serializing data and merkleization and remain open to adopting it, but decided to use the binary trie format for simplicity.
### Metadata fields
The metadata fields `version`, `codeLen` and `codeHash` are added mostly to facilitate a cheaper implementation of `EXTCODESIZE` and `EXTCODEHASH` in a stateless paradigm. The version field allows for differentiating between bytecode types (e.g. [EVM1.5/EIP-615](./eip-615.md), [EIP-2315](./eip-2315.md), etc.) or code merkleization schemes (or merkleization settings, such as larger `CHUNK_SIZE`) in future.
Instead of encoding `codeHash` and `codeSize` in the metadata, they could be made part of the account. In our opinion, the metadata is a more concise option, because EoAs do not need these fields, resulting in either additional logic (to omit those fields in the accounts) or calculation (to include them in merkleizing the account).
An alternative option to the version field would be to add an account-level field: either following [EIP-1702](./eip-1702.md), or by adding an `accountKind` field (with potential options: `eoa`, `merkleized_evm_chunk32`, `merkleized_eip2315_chunk64`, etc.) as the first member of the account. One benefit this could provide is omitting `codeHash` for EoAs.
### The keys in the code trie (and `KEY_LENGTH`)
As explained in the specification above, the keys in the code trie are indices of the `chunks` array. Having a key length of 2 bytes means the trie can address 65536 - 3 (minus metadata fields) chunks, which corresponds to ~2Mb code size. That allows for roughly ~85x increase in the code size limit in future without requiring a change in merkleization.
### Alternate values of codeRoot for EoAs
This proposal changes the meaning of the fourth field (`codeHash`) of the account. Prior to this change, that field represents the Keccak-256 hash of the bytecode, which is logically hash of an empty input for EoAs.
Since `codeHash` is replaced with `codeRoot`, the root hash of the code trie, the value would be different for EoAs under the new rules: the root of the `codeTrie(metadata=[codeHash=keccak256(''), codeSize=0])`. An alternative would be simply using the hash of an empty trie. Or to avoid introducing yet another constant (the result of the above), one could also consider using `codeRoot = 0` for EoAs.
However, we wanted to maintain compatibility with [EIP-1052](./eip-1052.md) and decided to keep matters simple by using the hash of empty input (`c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470`) for EoAs.
## Backwards Compatibility
From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.
Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.
## Test Cases
TBD
Show the `codeRoot` for the following cases:
1. `code=''`
2. `code='PUSH1(0) DUP1 REVERT'`
3. `code='PUSH32(-1)'` (data passing through a chunk boundary)
## Implementation
The implementation of the chunking and merkleization logic in Typescript can be found [here](https://github.com/ewasm/biturbo/blob/merklize-mainnet-blocks/src/relayer/bytecode.ts#L172), and in Python [here](https://github.com/hugo-dc/code-chunks/). Please note neither of these examples currently use a binary tree.
## Security Considerations
TBA
## Copyright
Copyright and related rights waived via [CC0](../LICENSE.md).