DCIPs/EIPS/eip-3540.md

306 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
eip: 3540
title: EOF - EVM Object Format v1
description: EOF is an extensible and versioned container format for EVM bytecode with a once-off validation at deploy time.
author: Alex Beregszaszi (@axic), Paweł Bylica (@chfast), Andrei Maiboroda (@gumb0), Matt Garnett (@lightclient)
discussions-to: https://ethereum-magicians.org/t/evm-object-format-eof/5727
status: Review
type: Standards Track
category: Core
created: 2021-03-16
requires: 3541, 3860, 4750, 5450
---
## Abstract
We introduce an extensible and versioned container format for the EVM with a once-off validation at deploy time. The version described here brings the tangible benefit of code and data separation, and allows for easy introduction of a variety of changes in the future. This change relies on the reserved byte introduced by [EIP-3541](./eip-3541.md).
To summarise, EOF bytecode has the following layout:
```
magic, version, (section_kind, section_size)+, 0, <section contents>
```
## Motivation
On-chain deployed EVM bytecode contains no pre-defined structure today. Code is typically validated in clients to the extent of `JUMPDEST` analysis at runtime, every single time prior to execution. This poses not only an overhead, but also a challenge for introducing new or deprecating existing features.
Validating code during the contract creation process allows code versioning without an additional version field in the account. Versioning is a useful tool for introducing or deprecating features, especially for larger changes (such as significant changes to control flow, or features like account abstraction).
The format described in this EIP introduces a simple and extensible container with a minimal set of changes required to both clients and languages, and introduces validation.
The first tangible feature it provides is separation of code and data. This separation is especially beneficial for on-chain code validators (like those utilised by layer-2 scaling tools, such as Optimism), because they can distinguish code and data (this includes deployment code and constructor arguments too). Currently, they a) require changes prior to contract deployment; b) implement a fragile method; or c) implement an expensive and restrictive jump analysis. Code and data separation can result in ease of use and significant gas savings for such use cases. Additionally, various (static) analysis tools can also benefit, though off-chain tools can already deal with existing code, so the impact is smaller.
A non-exhaustive list of proposed changes which could benefit from this format:
- Including a `JUMPDEST`-table (to avoid analysis at execution time) and/or removing `JUMPDEST`s entirely.
- Introducing static jumps (with relative addresses) and jump tables, and disallowing dynamic jumps at the same time.
- Requiring the execution of a code section ends with a terminating instruction. (Assumptions like this can provide significant speed improvements in interpreters, such as a speed-up of ~7% seen in evmone (ethereum/evmone#295).
- Multibyte opcodes without any workarounds.
- Representing functions as individual code sections instead of subroutines.
- Introducing special sections for different use cases, notably Account Abstraction.
## Specification
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 and RFC 8174.
In order to guarantee that every EOF-formatted contract in the state is valid, we need to prevent already deployed (and not validated) contracts from being recognized as such format. This is achieved by choosing a byte sequence for the *magic* that doesn't exist in any of the already deployed contracts.
### Remarks
The *initcode* is the code executed in the context of the *create* transaction, `CREATE`, or `CREATE2` instructions. The *initcode* returns *code* (via the `RETURN` instruction), which is inserted into the account. See section 7 ("Contract Creation") in the Yellow Paper for more information.
The opcode `0xEF` is currently an undefined instruction, therefore: *It pops no stack items and pushes no stack items, and it causes an exceptional abort when executed.* This means *initcode* or already deployed *code* starting with this instruction will continue to abort execution.
Unless otherwised specified, all integers are encoded in big-endian byte order.
### Code validation
We introduce *code validation* for new contract creation. To achieve this, we define a format called EVM Object Format (EOF), containing a version indicator, and a ruleset of validity tied to a given version.
At `block.number == HF_BLOCK` new contract creation is modified:
- if *initcode* or *code* starts with the `MAGIC`, it is considered to be EOF formatted and will undergo validation specified in the following sections,
- else if *code* starts with `0xEF`, creation continues to result in an exceptional abort (the rule introduced in EIP-3541),
- otherwise code is considered *legacy code* and the following rules do not apply to it.
For a create transaction, if *initcode* or *code* is invalid, the contract creation results in an exceptional abort. Such a transaction is valid and may be included in a block. Therefore, the transaction sender's nonce is increased.
For the `CREATE` and `CREATE2` instructions, if *initcode* or *code* is invalid, instructions' execution ends with the result `0` pushed on stack. The *initcode* validation happens just before its execution and validation failure is observable as if execution results in an exceptional abort. I.e. in case *initcode* or returned *code* is invalid the caller's nonce remains increased and all creation gas is deducted.
### Container specification
EOF container is a binary format with the capability of providing the EOF version number and a list of EOF sections.
The container starts with the EOF header:
| description | length | value | |
|-------------|----------|------------|--------------------|
| magic | 2-bytes | 0xEF00 | |
| version | 1-byte | 0x010xFF | EOF version number |
The EOF header is followed by at least one section header. Each section header contains two fields, `section_kind` and either `section_size` or `section_size_list`, depending on the kind. `section_size_list` is a list of size values when multiple sections of this kind are allowed.
| description | length | value | |
|-------------------|---------|---------------|-------------------|
| section_kind | 1-byte | 0x010xFF | `uint8` |
| section_size | 2-bytes | 0x00000xFFFF | `uint16` |
| section_size_list | dynamic | n/a | `uint16, uint16+` |
The list of section headers is terminated with the *section headers terminator byte* `0x00`. The body content follows immediately after.
#### Container validation rules
1. `version` MUST NOT be `0`.[^1](#eof-version-range-start-with-1)
2. `section_kind` MUST NOT be `0`. The value `0` is reserved for *section headers terminator byte*.
3. There MUST be at least one section (and therefore section header).
4. Section content size MUST be equal to size declared in its header.
5. Stray bytes outside of sections MUST NOT be present. This includes trailing bytes after the last section.
### EOF version 1
EOF version 1 is made up of 5 EIPs, including this one: [EIP-3540](./eip-3540.md), [EIP-3670](./eip-3670.md), [EIP-4200](./eip-4200.md), [EIP-4750](./eip-4750.md), and [EIP-5450](./eip-5450.md). Some values in this specification are only discussed briefly. To understand the full scope of EOF, it is necessary to review each EIP in-depth.
#### Container
The EOF version 1 container consists of a `header` and `body`.
```
container := header, body
header := magic, version, kind_type, type_size, kind_code, num_code_sections, code_size+, kind_data, data_size, terminator
body := type_section, code_section+, data_section
type_section := (inputs, outputs, max_stack_height)+
```
*note: `,` is a concatenation operator and `+` should be interpreted as "one or more" of the preceding item*
##### Header
| name | length | value | description |
|-------------------|----------|---------------|------------------------------------------------------------------------------------|
| magic | 2 bytes | 0xEF00 | EOF prefix |
| version | 1 byte | 0x01 | EOF version |
| kind_type | 1 byte | 0x01 | kind marker for EIP-4750 type section header |
| type_size | 2 bytes | 0x0004-0xFFFC | uint16 denoting the length of the type section content, 4 bytes per code segment |
| kind_code | 1 byte | 0x02 | kind marker for code size section |
| num_code_sections | 2 bytes | 0x0001-0x0400 | uint16 denoting the number of the code sections |
| code_size | 2 bytes | 0x0001-0xFFFF | uint16 denoting the length of the code section content |
| kind_data | 1 byte | 0x03 | kind marker for data size section |
| data_size | 2 bytes | 0x0000-0xFFFF | uint16 integer denoting the length of the data section content |
| terminator | 1 byte | 0x00 | marks the end of the header |
##### Body
| name | length | value | description |
|------------------|----------|--------------|----------------------------------------------------|
| type_section | variable | n/a | stores EIP-4750 and EIP-5450 code section metadata |
| inputs | 1 byte | 0x00-0x7F | number of stack elements the code section consumes |
| outputs | 1 byte | 0x00-0x7F | number of stack elements the code section returns |
| max_stack_height | 2 bytes | 0x0000-0x3FF | max height of operand stack during execution |
| code_section | variable | n/a | arbitrary bytecode |
| data_section | variable | n/a | arbitrary sequence of bytes |
See [EIP-4750](./eip-4750.md) for more information on the type section content.
#### EOF version 1 validation rules
1. In addition to general validation rules above, EOF version 1 bytecode conforms to the rules specified below:
- Exactly one type section header MUST be present immediately following the EOF version. Each code section MUST have a specified type signature in the type body.
- Exactly one code section header MUST be present immediately following the type section. A maximum of 1024 individual code sections are allowed.
- Exactly one data section header MUST be present immediately following the code section.
2. Any version other than `0x01` is invalid.
(*Remark:* Contract creation code SHOULD set the section size of the data section so that the constructor arguments fit it.)
### Changes to execution semantics
For clarity, the *container* refers to the complete account code, while *code* refers to the contents of the code section only.
1. Execution starts at the first byte of the first code section, and PC is set to 0.
2. Execution stops if `PC` goes outside the code section bounds.
3. `PC` returns the current position within the *code*.
4. `CODECOPY`/`CODESIZE`/`EXTCODECOPY`/`EXTCODESIZE`/`EXTCODEHASH` keeps operating on the entire *container*.
5. The input to `CREATE`/`CREATE2` is still the entire *container*.
6. The size limit for deployed code as specified in [EIP-170](./eip-170.md) and for initcode as specified in [EIP-3860](./eip-3860.md) is applied to the entire *container* size, not to the *code* size. This also means if initcode validation fails, it is still charged the EIP-3860 `initcode_cost`.
(*Remark:* Due to [EIP-4750](./eip-4750.md), `JUMP` and `JUMPI` are disabled and therefore are not discussed in relation to EOF.)
### Changes to contract creation semantics
For clarity, the *EOF prefix* together with a version number *n* is denoted as the *EOFn prefix*, e.g. *EOF1 prefix*.
1. If *initcode's container* has EOF1 prefix it MUST be valid EOF1 code.
2. If *code's container* has EOF1 prefix it MUST be valid EOF1 code.
3. If *initcode's container* is valid EOF1 code the resulting *code's container* MUST be valid EOF1 code (i.e. it MUST NOT be empty and MUST NOT produce legacy code).
4. If `CREATE` or `CREATE2` instruction is executed in an EOF1 code the instruction's *initcode* MUST be valid EOF1 code (i.e. EOF1 contracts MUST NOT produce legacy code).
See [Code validation](#code-validation) above for specification of behaviour in case one of these conditions is not satisfied.
## Rationale
EVM and/or account versioning has been discussed numerous times over the past years. This proposal aims to learn from them.
See "Ethereum account versioning" on the Fellowship of Ethereum Magicians Forum for a good starting point.
### Execution vs. creation time validation
This specification introduces creation time validation, which means:
- All created contracts with *EOFn* prefix are valid according to version *n* rules. This is very strong and useful property. The client can trust that the deployed code is well-formed.
- In the future, this allows to serialize `JUMPDEST` map in the EOF container and eliminate the need of implicit `JUMPDEST` analysis required before execution.
- Or to completely remove the need for `JUMPDEST` instructions.
- This helps with deprecating EVM instructions and/or features.
- The biggest disadvantage is that deploy-time validation of EOF code must be enabled in two hard-forks. However, the first step ([EIP-3541](./eip-3541.md)) is already deployed in London.
The alternative is to have execution time validation for EOF. This is performed every single time a contract is executed, however clients may be able to cache validation results. This *alternative* approach has the following properties:
- Because the validation is consensus-level execution step, it means the execution always requires the entire code. This makes *code merkleization impractical*.
- Can be enabled via a single hard-fork.
- Better backwards compatibility: data contracts starting with the `0xEF` byte or the *EOF prefix* can be deployed. This is a dubious benefit, however.
### Contract creation restrictions
The [Changes to contact creation semantics](#changes-to-contract-creation-semantics) section defines
minimal set of restrictions related to the contract creation: if *initcode* or *code* has the EOF1
container prefix it must be validated. This adds two validation steps in the contract creation,
any of it failing will result in contract creation failure.
Moreover, it is not allowed to create legacy contracts from EOF1 ones. And the EOF version of *initcode* must match the EOF version of the produced *code*.
The rule can be generalized in the future: EOFn contract must only create EOFm contracts, where m ≥ n.
This guarantees that a cluster of EOF contracts will never spawn new legacy contracts.
Furthermore, some exotic contract creation combinations are eliminated (e.g. EOF1 contract creating new EOF1 contract with legacy *initcode*).
Finally, create transaction must be allowed to contain legacy *initcode* and deploy legacy *code* because otherwise there is no transition period allowing upgrading transaction signing tools. Deprecating such transactions may be considered in the future.
### The MAGIC
1. The first byte `0xEF` was chosen because it is reserved for this purpose by [EIP-3541](./eip-3541.md).
2. The second byte `0x00` was chosen to avoid clashes with three contracts which were deployed on **Mainnet**:
- `0xca7bf67ab492b49806e24b6e2e4ec105183caa01`: `EFF09f918bf09f9fa9`
- `0x897da0f23ccc5e939ec7a53032c5e80fd1a947ec`: `EF`
- `0x6e51d4d9be52b623a3d3a2fa8d3c5e3e01175cd0`: `EF`
3. No contracts starting with `0xEF` bytes exist on public testnets: Goerli, Ropsten, Rinkeby, Kovan and Sepolia at their London fork block.
### EOF version range start with 1
The version number 0 will never be used in EOF, so we can call legacy code *EOF0*.
Also, implementations may use APIs where 0 version number denotes legacy code.
### Section structure
We have considered different questions for the sections:
- Streaming headers (i.e. `section_header, section_data, section_header, section_data, ...`) are used in some other formats (such as WebAssembly). They are handy for formats which are subject to editing (adding/removing sections). That is not a useful feature for EVM. One minor benefit applicable to our case is that they do not require a specific "header terminator". On the other hand they seem to play worse with code chunking / merkleization, as it is better to have all section headers in a single chunk.
- Whether to have a header terminator or to encode `number_of_sections` or `total_size_of_headers`. Both raise the question of how large of a value these fields should be able to hold. A terminator byte seems to avoid the problem of choosing a size which is too small without any perceptible downside, so it is the path taken.
- Whether to encode `section_size` as a fixed 16-bit value or some kind of variable length field (e.g. LEB128). We have opted for fixed size, because it simplifies client implementations, and 16-bit seems enough, because of the currently exposed code size limit of 24576 bytes (see [EIP-170](./eip-170.md) and [EIP-3860](./eip-3860.md)). Should this be limiting in the future, a new EOF version could change the format. Besides simplifying client implementations, not using LEB128 also greatly simplifies on-chain parsing.
### Data-only contracts
The EOF prevents deploying contracts with arbitrary bytes (data-only contracts: their purpose is to store data not execution). **EOF1 requires** presence of a **code section** therefore the minimal overhead EOF data contract consist of a data section and one code section with single instruction. We recommend to use `INVALID` instruction in this case. In total there are 20 additional bytes required.
```
EF0001 010004 020001 0001 03 <data size (2 bytes)> 00 00000000 FE <data>
```
It is possible in the future that this data will be accessible with data-specific opcodes, such as `DATACOPY` or `EXTDATACOPY`. Until then, callers will need to determine the data offset manually.
### PC starts with 0 at the code section
The value for `PC` is specified to start at 0 and to be within the active *code* section. We considered keeping `PC` to operate on the whole *container* and be consistent with `CODECOPY`/`EXTCODECOPY` but in the end decided otherwise. This also feels more natural and easier to implement in EVM: the new EOF EVM should only care about traversing *code* and accessing other parts of the *container* only on special occasions (e.g. in `CODECOPY` instruction).
## Backwards Compatibility
This is a breaking change given that any code starting with `0xEF` was not deployable before (and resulted in exceptional abort if executed), but now some subset of such codes can be deployed and executed successfully.
The choice of `MAGIC` guarantees that none of the contracts existing on the chain are affected by the new rules.
## Test Cases
### Contract creation
All cases should be checked for creation transaction, `CREATE` and `CREATE2`.
- Legacy init code
- Returns legacy code
- Returns valid EOF1 code
- Returns invalid EOF1 code, contract creation fails
- Returns 0xEF not followed by EOF1 code, contract creation fails
- Valid EOF1 init code
- Returns legacy code, contract creation fails
- Returns valid EOF1 code
- Returns invalid EOF1 code, contract creation fails
- Returns 0xEF not followed by EOF1 code, contract creation fails
- Invalid EOF1 init code
### Contract execution
- EOF code containing `PC` opcode - offset inside code section is returned
- EOF code containing `CODECOPY/CODESIZE` - works as in legacy code
- `CODESIZE` returns the size of entire container
- `CODECOPY` can copy from code section
- `CODECOPY` can copy from data section
- `CODECOPY` can copy from the EOF header
- `CODECOPY` can copy entire container
- `EXTCODECOPY/EXTCODESIZE/EXTCODEHASH` with the EOF *target* contract - works as with legacy target contract
- `EXTCODESIZE` returns the size of entire target container
- `EXTCODEHASH` returns the hash of entire target container
- `EXTCODECOPY` can copy from target's code section
- `EXTCODECOPY` can copy from target's data section
- `EXTCODECOPY` can copy from target's EOF header
- `EXTCODECOPY` can copy entire target container
- Results don't differ when executed inside legacy or EOF contract
## Security Considerations
With the anticipated EOF extensions, the validation is expected to have linear computational and space complexity.
We think that the validation cost is sufficiently covered by:
- [EIP-3860](./eip-3860.md) for *initcode*,
- high per-byte cost of deploying *code*.
## Copyright
Copyright and related rights waived via [CC0](../LICENSE.md).