BSV Academy Logo over digital globe

Ch8: The Bitcoin white paper - Reclaiming disk space

Nov 24, 2022

The BSV Academy’s free introduction to Bitcoin Theory course covers the design of Bitcoin as a system as prescribed by Satoshi Nakamoto. This course is open to anyone who is interested in Bitcoin and is the beginner course in this series. Some technical experience would be helpful to complete the course, however, it is open to anyone regardless of experience.

The course goes through the Bitcoin white paper section by section elaborating on the concepts contained within each. This section focuses on reclaiming disk space. When the latest transaction in a coin is buried under enough blocks, the spent transactions before it can be discarded to save disk space.

To make it as effortless as possible for you to have access to this educational material, we are publishing the entire course here on our blog. Stay tuned for a section-by-section release, and remember that you are still welcome to enrol in the BSV Academy to gain a certificate of completion to add to your resume.

Reclaiming disk space on the Bitcoin blockchain

Once the latest transaction in a coin is buried under enough blocks, the spent transactions before it can be discarded to save disk space. To facilitate this without breaking the block's hash, transactions are hashed in a Merkle Tree, with only the root included in the block's hash. Old blocks can then be compacted by stubbing off branches of the tree. The interior hashes do not need to be stored.

Reclaiming Disk Space
A block header with no transactions would be about 80 bytes. If we suppose blocks are generated every 10 minutes, 80 bytes * 6 * 24 * 365 = 4.2MB per year. With computer systems typically selling with 2GB of RAM as of 2008, and Moore's Law predicting current growth of 1.2GB per year, storage should not be a problem even if the block headers must be kept in memory.

Spent Bitcoin transactions

Once the latest transaction in a coin is buried under enough blocks, the spent transactions before it can be discarded to save disk space.
- Satoshi Nakamoto, Bitcoin Whitepaper

Saving Disk Space

When a UTXO is created in a transaction, the outputs that were spent as inputs to the transaction are consumed by the action and cease to exist as live spendable tokens on the network. 

Once the transaction has been ‘mined’ into a block which the network has then expended work building upon, it can be said that the transaction record is immutable. This means that anyone with a copy of the transaction can prove that the transaction was created before the block timestamp.

Once the spent outputs from old transactions, which are referred to in the new transaction's inputs, have been made immutable, nodes are free to remove them from their copy of the blockchain. They are not required for the process of creating new blocks and consume hard disk storage space. At this point, it becomes the responsibility of the people who made the transactions to keep their own copies of them. This can be managed through archive services, private storage and more.
 

The Merkle Tree

To facilitate this without breaking the block's hash, transactions are hashed in a Merkle Tree, with only the root included in the block's hash.
- Satoshi Nakamoto, Bitcoin Whitepaper

The Merkle Tree

In order for nodes to be able to remove individual transactions that have been placed in a block without impacting the integrity of the block hash, a data structure known as a Merkle Tree is used.

Merkle Tree is a hash structure that can hold an unbounded number of data items in a particular order, and identify the set using a single ‘Merkle root’ value.

Imagine a Merkle root like an inverted family tree. We start with a long list of individuals, and each is partnered and generates one child who partners with the child of another pair. This leads down to a single person who is the subject of the family tree.

The Merkle root is a single hash which represents the ordered list of all of the transaction ID’s processed by the node during the construction of their block. This Merkle root is embedded in the block header which forms the blockchain DAG. Because every transaction included in the block is effectively an input into the final hash that produces the Merkle root, changing any detail of any individual transaction, or even the order of the transactions would produce a different Merkle root.

We can exploit this property to provide a mathematical proof that a transaction is part of a Merkle tree with a given root, but we do not need the data for the entire tree in order to do so. The required data for such a “proof of inclusion” is very compact.

Merkle trees allow Bitcoin blocks to grow at an unbounded rate with minimal impact on the user experience.

A Merkle tree enables a node to remove or ‘prune’ individual transactions from their record of a given block, and to retain only hashes of the transaction, or even hashes of the branch the transaction was in. With just a block header, a node can provably show that any transaction for which they have a corresponding hash and merkle path is contained within a block. This allows nodes to keep efficient, high-speed systems at the forefront of the network, making the most of dynamic memory which is expensive to keep.

Merkle tree

Learn more about the Merkle Tree here

Compacting Bitcoin blocks

Old blocks can then be compacted by stubbing off branches of the tree. The interior hashes do not need to be stored.
- Satoshi Nakamoto, Bitcoin Whitepaper

Compacting Bitcoin Blocks

Like the original generation parents in the family tree, the Merkle tree starts with a list of transactions which are hashed to form the base level. As long as the node has copies of all of the transactions, they can validate the whole structure.

Transactions are hashed to form their transaction IDs, which are then hashed in pairs to create the hashes at the parent branch, effectively halving the number of data items at the next level in the tree. This is akin to the parents at the top each producing a unique child.

Parent branches are hashed with their adjacent parent branch creating another level half the size and so on and so forth until the two deepest branches are combined to form the Root Hash of the Merkle tree.

This is like each generation halving the number of people, and those people’s pairing with each other resulting in a next generation half the size. This tree gets smaller and smaller until we reach the root of the tree, or the ‘subject’ of the family tree.

It is this root hash that is used in the block header which is subjected to Proof of Work, and it is this value which must remain provably linked to the tree the node is storing.

Thanks to this structure, nodes are able to streamline their operation by cutting back the number of spent transaction records they store and focusing mainly on storing live UTXOs as needed for the operation of the network. As transactions are removed, the parent hash is generated and stored so that the node can still establish the authenticity of the remaining records they have kept. As more transactions are removed, the stubs that are stored move higher into the branches of the tree, further reducing the size of the block record.

Imagine that the subject decides information about one of the original generations is no longer needed, and they cut it from the tree. As long as they keep information about the children, we can find a path from the subject back to the original generation. If enough parents are discarded, the system can purge their children from the tree as well. Records of these middle generations are only needed to ensure that the link between the subject and the original parent generation is able to be proven.

Nodes role in the operation of the network does not involve storing a full copy of the transaction records in every block forever. This information is not required to validate transactions or to build blocks. This realisation allows the role of storing complete copies of every block to be pushed out from the central core of the network to archiving systems that can create business models around selling access to old, spent transaction records.

This also creates an incentive for users to take responsibility for their own data. Records can be kept locally or in cold storage, removing the need to access any kind of archive. This also allows users to attach additional details to transactions which may not be stored on the ledger.

Further information can be found here.

Bitcoin block headers

A block header with no transactions would be about 80 bytes. If we suppose blocks are generated every 10 minutes, 80 bytes * 6 * 24 * 365 = 4.2MB per year. With computer systems typically selling with 2GB of RAM as of 2008, and Moore's Law predicting current growth of 1.2GB per year, storage should not be a problem even if the block headers must be kept in memory.
- Satoshi Nakamoto, Bitcoin Whitepaper

Bitcoin Block Headers

The record of a block’s existence is its header. This is like the birth certificate of the subject of the family tree. From the header, a node can see which block it builds upon, what time it was discovered and can validate the proof of work done by the node that discovered it. The contents of the merkle tree, which are all of the transactions that the block includes, are not part of the header and are not needed to prove the existence of a block whose transactions were validated by the network a long time ago.

Using the equations above, it becomes easy to see that the amount of data needed to store a full record of the chain of proof of work and the live UTXO set, plus a full set of all transactions from recently mined blocks is a much more efficient way to manage an operating node rather than having to store and manage transactions which will never again be referenced in the generation of a new transaction or block.

The list of block headers grows linearly at a rate of around 4.2MB per year while the capability of computing systems scales exponentially. This makes it possible for the system to scale to accommodate the needs of a global financial network and for end-user wallets which only require this small subset of Bitcoin data to operate efficiently on consumer-grade devices such as mobile phones or even embedded IoT systems.

Ryan Brothwell
Ryan Brothwell

Deputy Editor, Bitcoin Association