Content Addressing is Magic

It is conjuring from the ether, it is wishing things into your hands, it is just saying a thing's name and having it appear. It is pure magic. And it may become a very important part of our future, in the Decentralized Web and beyond.

“The great thing of the web is that now knowledge has an address,” said Peter Lyman, the University Librarian of UC Berkeley, 20 years ago of the URL, which means that people can build easily on others’ knowledge. Now we can add something: “Content addressability means that knowledge has a name.” A name can be better than an address because addresses sometimes become obsolete. (Peter Lyman was one of the first board members of the Internet Archive and, I believe, came up with the term “Born-Digital” to describe materials that had never been printed, a new thing in 1996.)

What am I talking about? This might sound too simple to support these big claims, but bear with me. This is one of the big things I have learned from the Decentralized Web work.

Content Addressing starts by processing a digital file into a “hash”, which is 32 bytes, usually written as a 64-character string of hexadecimal digits (using SHA-256). This hash has amazing properties: given a hash, you can confirm that a digital file matches it; further, given only a hash, it is very, very difficult to create the digital file. And, here is the kicker, given a hash it is almost impossible to create a second digital file that matches it but is not exactly the same as the original.
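As a minimal sketch of that first step, here is how a name can be computed with Python's standard `hashlib`. The helper names `sha256_hex` and `sha256_file` are made up for illustration; only the `hashlib` calls are real.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multigigabyte file never has to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


original = b"Content Addressing is Magic"
edited = b"Content Addressing is magic"  # one letter changed

name = sha256_hex(original)
print(len(name))                    # 64 hex characters, i.e. 32 bytes
print(name == sha256_hex(original)) # True: same bytes, same name
print(name == sha256_hex(edited))   # False: any change gives a different name
```

The name depends only on the bytes of the file, not on where the file lives or who hands it to you.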

Therefore, a “hash” is a name for a file in the sense that if you have a hash you are looking for, and someone hands you a file, you can confirm it, and you do not have to trust whoever gave it to you: they cannot fake or counterfeit the file. The file either has the same hash or it does not.
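The "you do not have to trust who gave it to you" part can be sketched in a few lines. This is an illustration under assumptions, not any real network client: `fetch_verified`, `honest_peer`, and `forger_peer` are hypothetical, and each "peer" is just a Python function that returns bytes for a requested hash.

```python
import hashlib


def fetch_verified(expected_hex, peers):
    """Ask each untrusted peer in turn; accept only bytes whose hash matches."""
    for peer in peers:
        data = peer(expected_hex)
        if data is not None and hashlib.sha256(data).hexdigest() == expected_hex:
            return data
    raise LookupError("no peer returned content matching the hash")


document = b"knowledge has a name"
wanted = hashlib.sha256(document).hexdigest()


def forger_peer(_hash):
    # Hypothetical dishonest peer: returns something else entirely.
    return b"knowledge has a different name"


def honest_peer(_hash):
    # Hypothetical honest peer: returns the real bytes.
    return document


result = fetch_verified(wanted, [forger_peer, honest_peer])
print(result == document)  # True: the forged copy was rejected, the real one accepted
```

The check is the same whether the bytes came from the original publisher, a library, or a stranger.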

That the hash is very short, like 64 characters, and can name a multigigabyte file means that moving around these hashes is very efficient. The Internet Archive has 17 petabytes of web data, but all of the hashes are only 22 terabytes. Therefore, giving every web object a unique name takes only about 0.1% of the size.
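The arithmetic behind that figure is short enough to check directly. This sketch assumes decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes), which the post does not specify.

```python
# Assumption: decimal (SI) units.
PB = 10**15
TB = 10**12

archive_bytes = 17 * PB  # web data, per the post
hash_bytes = 22 * TB     # all of the hashes, per the post

overhead = hash_bytes / archive_bytes
print(f"{overhead:.2%}")  # 0.13%
```

So the names cost a little over one part in a thousand of the data they name.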

So, with a hash, one can address content directly: ask for it by name, and confirm that what you are given matches. The most common application of this is the BitTorrent system, but it is widely used elsewhere. In BitTorrent, one can start with a “magnet link”, which carries a hash, and ask the “distributed hash table” (DHT), and it will help you retrieve the file that matches that hash, in this case a “torrent” file. A torrent file, in turn, contains a list of hashes of pieces of files that can then be retrieved through the BitTorrent protocol, and after this magic is done, you will have a set of files on your hard drive that came from tens or thousands of others all over the net.
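The list-of-piece-hashes idea can be sketched as follows. This is a toy, not the BitTorrent wire format: real BitTorrent (version 1) uses SHA-1 piece hashes inside a bencoded metadata file, while this sketch swaps in SHA-256 and a plain Python list. The names `split_pieces`, `make_manifest`, and `assemble` are hypothetical, and the piece size is tiny so the demo is readable.

```python
import hashlib

PIECE_SIZE = 4  # bytes; absurdly small, for demonstration only


def split_pieces(data, size=PIECE_SIZE):
    """Cut the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def make_manifest(data, size=PIECE_SIZE):
    """The toy 'torrent file': one hash per piece, in order."""
    return [hashlib.sha256(p).hexdigest() for p in split_pieces(data, size)]


def assemble(manifest, offers):
    """Rebuild the file from (index, piece) offers made by untrusted peers."""
    pieces = [None] * len(manifest)
    for index, piece in offers:
        if not 0 <= index < len(manifest) or pieces[index] is not None:
            continue  # out of range, or we already hold a verified copy
        if hashlib.sha256(piece).hexdigest() == manifest[index]:
            pieces[index] = piece  # matches the manifest, so keep it
    if any(p is None for p in pieces):
        raise LookupError("some pieces are still missing")
    return b"".join(pieces)


data = b"knowledge has a name"
manifest = make_manifest(data)

# Pieces arrive out of order from different peers, with one forgery mixed in.
offers = [(0, b"FAKE")] + list(reversed(list(enumerate(split_pieces(data)))))
print(assemble(manifest, offers) == data)  # True
```

Because every piece is checked against the manifest, it does not matter which peer supplied which piece or in what order they arrived.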

Therefore, if there are others on the peer-to-peer network that are serving files, and you have a hash, then you can ask the network to give you that matching file or piece of a file, and there can be no counterfeiting.

Magic.

Why is this important? Materials can be served from many places, served from libraries and archives, and kept permanently available long after the original server is gone. I think of it as a way to have the same book be in many libraries, so that even if the publisher goes away, and several of the libraries merge, you still have a chance to get the book. This is different from a website, where if the website goes away, you are either out of luck or, if something like the Wayback Machine has a copy, you are saved, but you have to trust us. So in a way, this hash idea is bringing back some nice features from the printed era. A much more reliable system of digital publishing is possible in this way.

(This is how IPFS, Zeronet, DAT, and just about every decentralized system works, but I think it is still under-appreciated magic. The next miracle I will describe is how cryptographically signed files can bring us the next step: updatable digital files that are served from everywhere and nowhere.)

4 Responses to Content Addressing is Magic

  1. AKFörster says:

    And when you fix a typo, it’s a completely new hash. Another typo, another hash… Does that scale?
    What about a simple identifier like a “name”? Or an URN?

    • URNs are good for things that change or might change, but have the disadvantage that you need an authoritative URN resolver to say which documents are valid. This can be somewhat fixed by having signed documents, and having the URN hold the public key so anyone can check whether a document is valid, but this is not clear, never mind, I will take another run at this later.

      But Content Addressing has the advantage over URNs in that you can get the document from anywhere and trust that you have the right thing. This is because the hash of the retrieved document can be easily checked to see if it matches the address you used to retrieve it, and (this is the magic part) it is very hard to make a second document that matches a hash, so it is difficult to forge documents. And when cryptographers say “hard” they seem to mean it would take a computer the age of the universe.

      In a decentralized system, this “get it from anywhere” is very important. It is also important in archives.

  2. Zooko says:

    Good blog post! I agree that it is a deeply important idea. Nitpick: use BLAKE2 instead of SHA256. It is more efficient, easier to use safely, and more future-proof. At the Decentralized Web Summit there was a breakout group to discuss “what secure hash function should we standardize on?” and BLAKE2 was the answer.
