A proposal for links among sites published using git


📅 Published 2021-10-26


Alongside the recent discussions regarding email, Solderpunk proposed decentralized publishing using git repositories. This post primarily deals with one set of questions from this scheme--addressing and linking--but first a quick summary of my thoughts on this idea. Most of this echoes what Solderpunk has already laid out, and I suggest reading that post first.


decentralized publishing using git repositories


Pros:


From a decentralization perspective, git solves a lot of problems regarding synchronization

Git is already offline-first, which is what I'm looking for in a future network

Git supports networking, for those who have access to it

Secure Scuttlebutt supports Git through Git-SSB

Git is designed to support multiple collaborators; there are many ways to do this, including emailed patches

Git supports versioning

Git supports signing, and therefore supports integrity protection and author verifiability via web of trust

Git is widely supported on many platforms


Git-SSB


Cons:


Git is a complex beast under the hood

Repositories with lots of activity have extra overhead (try cloning the Linux kernel repo; even with sparse checkout it's a monster)

You can't just pull down one specific file from a git repo unless the repo is served by a frontend that allows it

Git doesn't provide a way to uniquely identify a given repository, a problem when a mystery repo has come to you via USB drive

There is currently no standard for "linking" to a specific file in a given git repo


The last two are big problems when you're dealing with publishing hypertext. To build a true decentralized-first, resilient publishing platform on Git, you've got to address hyperlinks, at least if you want to retain the advantages of hypertext.


Why can't we just use a normal hyperlink?


These days, most git repositories are located on a central server: GitHub, Sourcehut, GitLab, private servers, or some other place. Individual developers pull down copies of the repositories in order to do work, and often they choose to retain a full copy of the repository and its branches. What many people don't realize is that these copies are identical to the centralized repository, as long as they are kept up-to-date. If GitHub were to disappear tomorrow, a developer could upload their repository to another service and keep running as if nothing had happened. If *all* hosting services were to disappear tomorrow, they could switch over to `git-send-email` or even just pass around USB drives. That's why git is considered **decentralized** when compared to earlier systems like CVS: it is designed to support development workflows that do not depend on a central gatekeeper.


This decentralization provides us with many advantages, but it also prevents us from just pointing a link to `https://github.com/my-name/my-cool-repo/blog-post.gmi`. (This would lose many of the advantages of decentralized publishing.)


The problem with addressing specific content in a decentralized system like this is that you can't depend on there being a single "server" that you can point people to, as with a standard hyperlink. There may be more than one server, and servers may move from time to time. There may be no server at all, and the repository may live entirely as *samizdat* passed from person-to-person. Complicating the picture further, the contents of one server may "drift" from another as different patches are applied in different places.


Fortunately, there is a way to deal with this situation.


Magnet links as prior art


Magnet links may be familiar to you if you've ever used Bittorrent or other file sharing networks. Like git, Bittorrent is a distributed system that often relies on centralized servers ("trackers") to help downloaders and seeders find each other. Trackers often allow users to search for and identify files they are looking for. But trackers are vulnerable to all kinds of legal and technical attacks, and you can't always rely on them to have indexed the file you're seeking.


Magnet links


Magnet links were devised to solve this problem. They uniquely identify a file by providing a cryptographic hash of the file, which some clients can use to locate a seed directly--without even touching a tracker. This allows people to distribute information about a file using a short text string, without sharing a .torrent and thereby placing themselves at the mercy of copyright attorneys. The magnet link tells you that the file exists and what it looks like, without passing along the file itself--and this is enough to locate it, thanks to some additional systems that we'll hand-wave away for now.


But wait--there's a much more respectable example to draw from, and it has nothing to do with the underground world of copyright infringement. I'm talking about *books*.


Copies of a specific book, like copies of a git repository, exist in homes and libraries all over the world. When I cite a passage from a book, there is no need for me to specify the library or bookstore from which I obtained it. Academic citation formats will tell you all you need to know to identify and locate the book--author, title, publisher, date, and so on. With that information, you can locate the book yourself using the systems provided by your library (or perhaps your local bookseller).


And yet there's an even more modern and efficient way to do this: an ISBN, or International Standard Book Number. ISBNs should uniquely identify a modern book, as long as the author has purchased an ISBN for their work (yes, this part is unfortunate, but we can ignore it). Going one step further, RFC 3187 defined a way to represent ISBNs as URNs, or Uniform Resource Names. These should look vaguely familiar: an example from the RFC is `urn:isbn:0-395-36341-1`. Doesn't that look a little bit like a hyperlink URL?


International Standard Book Number

RFC 3187

Uniform Resource Names


(Note, RFC 3187 has been obsoleted by a more complex RFC that doesn't significantly change the format. But that's not important here.)


URNs versus URLs


The URLs that we all know and love, like `https://thispersondoesnotexist.com`, are a subset of a broader standard: the URI, or Uniform Resource Identifier standard. Allow me to quote Wikipedia:


Uniform Resource Identifier


> A Uniform Resource Identifier (URI) is a unique sequence of characters that identifies a logical or physical resource used by web technologies. URIs may be used to identify anything, including real-world objects, such as people and places, concepts, or information resources such as web pages and books.


This is exactly what we're looking for, it seems. URLs go a step further:


> Some URIs provide a means of locating and retrieving information resources on a network (either on the Internet or on another private network, such as a computer filesystem or an Intranet); these are Uniform Resource Locators (URLs). A URL provides the location of the resource. A URI identifies the resource by name at the specified location or URL.


We would do this if we could, but we can't (probably). Instead, we need to look to URNs:


> Other URIs provide only a unique name, without a means of locating or retrieving the resource or information about it, these are Uniform Resource Names (URNs).


Yes. If I'm looking for a specific file in a specific repository, I don't care *how* I get it, as long as I eventually get it. I just want to know that this repository that you just handed me is a valid copy of the repository I'm looking for.
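To make the distinction concrete, here is a small Python sketch that looks only at the syntax of the two examples used above. It relies on the standard library's `urllib.parse.urlsplit`; the `describe` helper is hypothetical and only illustrates the point that a URL names a host to contact while a URN names a thing.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> str:
    """Classify a URI as a locator or a name, judged only by its syntax."""
    parts = urlsplit(uri)
    if parts.scheme == "urn":
        # A URN carries a namespace and a name, but no network location.
        namespace, _, name = parts.path.partition(":")
        return f"name: namespace={namespace!r}, name={name!r}"
    if parts.netloc:
        # A URL tells you which host to contact and what to ask it for.
        return f"locator: host={parts.netloc!r}, path={parts.path!r}"
    return "unknown"


print(describe("https://thispersondoesnotexist.com"))
print(describe("urn:isbn:0-395-36341-1"))
```

Nothing in the second result says where to go looking; that job is left to whatever resolves the name.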


Clarifying our needs


So we are dealing with git repositories here, not books or bootleg copies of "Tommy Boy." What would a URN look like if it were pointing to a specific file in a specific git repository?


Here's a short list of requirements, as I see them:


URNs should identify a *repository* (and all its descendants), not a network location or file path; as Git is a distributed protocol, repos may be passed along through multiple channels.

URNs should be able to identify a specific version of a file, for use in automated builds or academic and journalistic citations.

As long as the referenced file is still present, URNs should continue to function (mostly) even if the commit I was looking for has been rebased out of existence; the specific version we were seeking has been lost, but I may still want to find the current version as a fallback.

URNs should be resilient in the face of possible git hash collisions, whether accidental or manufactured.

If possible, URNs should give us a hint on how to find the repository.


I'll talk more about each of these below.


Building a unique identifier


I've thought about the first point over and over, and as far as I can tell, there is only one thing git provides natively which can be used to uniquely identify a repository's full lineage: the initial commit hash.


Unless the very first commit is rebased--and you could argue that this act creates a brand new repository--the initial commit hash will not change. It's not a perfect solution by any means, but this is about the best we can do, and it's not altogether terrible. I want this URN to work for plain old git repos, which is why I didn't consider schemes like placing special files, tags, or other objects in the repository to help with identification.
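As a rough sketch of how a tool might look up that identifier, the Python below shells out to `git rev-list --max-parents=0 HEAD`, which lists the commits reachable from HEAD that have no parents. The function names are hypothetical, and the code assumes `git` is on the PATH and that `repo_dir` points at a working copy.

```python
import subprocess


def parse_root_commits(rev_list_output: str) -> list[str]:
    """Turn `git rev-list` output into a sorted list of commit hashes."""
    return sorted(
        line.strip() for line in rev_list_output.splitlines() if line.strip()
    )


def root_commits(repo_dir: str) -> list[str]:
    """Return the parentless commits reachable from HEAD in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--max-parents=0", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git reports a failure
    )
    return parse_root_commits(result.stdout)
```

One wrinkle: a repository that has merged unrelated histories can report more than one parentless commit, so a real tool would need a rule for choosing among them. The sketch simply returns all of them.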


Identifying a file version and path


This part is easy; simply specify a commit hash and a file path and you have the version of the file you're looking for! This also provides the requested fallback in case of a rebase; if the commit is missing, we can use the file's path to find another version of the file if the user requests it.
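The lookup-then-fallback behaviour could be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: `git_show` wraps the real `git show <rev>:<path>` command, while `fetch_file` and its `allow_fallback` flag are hypothetical names, and "current version" is taken here to mean the file at HEAD.

```python
import subprocess
from typing import Callable, Optional, Tuple

Reader = Callable[[str, str], Optional[bytes]]


def git_show(repo_dir: str, rev: str, path: str) -> Optional[bytes]:
    """Return the file's bytes at rev, or None if git cannot find it."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{rev}:{path}"],
        capture_output=True,
    )
    return result.stdout if result.returncode == 0 else None


def fetch_file(read: Reader, commit: str, path: str,
               allow_fallback: bool = False) -> Tuple[bytes, bool]:
    """Return (content, exact); exact is False when HEAD was used instead."""
    content = read(commit, path)
    if content is not None:
        return content, True
    if allow_fallback:
        # The requested commit is gone (rebased away, say); try the
        # current version of the same path, but only if the user asked.
        content = read("HEAD", path)
        if content is not None:
            return content, False
    raise LookupError(f"{path} not found at {commit}")
```

A caller would plug in the git-backed reader with something like `fetch_file(lambda rev, p: git_show("/path/to/repo", rev, p), commit, "README.md", allow_fallback=True)`, and should tell the user when `exact` comes back false.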


Protecting against tampering


Again, git has us covered here. As long as the author signed the commit, we can include the fingerprint of their signing key in the URN to prevent tampering. Even if someone can manufacture a commit with a hash collision in the repository (highly unlikely), this provides an extra layer of protection that makes tampering all but impossible.
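Here is one way such a check might look in Python. It assumes a reasonably recent git whose pretty-format supports `%G?` (signature status) and `%GF` (fingerprint of the signing key), and it assumes the signer's public key is already in the local keyring. The function names are hypothetical, and accepting the `U` status (good signature, unknown validity) is a policy choice made for this sketch.

```python
import subprocess
from typing import Tuple


def parse_signature_info(line: str) -> Tuple[str, str]:
    """Split '<status> <fingerprint>' as printed by --format='%G? %GF'."""
    status, _, fingerprint = line.strip().partition(" ")
    return status, fingerprint.upper()


def signature_matches(line: str, expected_fingerprint: str) -> bool:
    """True if the signature is good and made by the expected key."""
    status, fingerprint = parse_signature_info(line)
    # 'G' is a good signature; 'U' is good but of unknown validity.
    return status in {"G", "U"} and fingerprint == expected_fingerprint.upper()


def commit_signed_by(repo_dir: str, commit: str, expected_fingerprint: str) -> bool:
    """Ask git for the commit's signature info and compare fingerprints."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "-1", "--format=%G? %GF", commit],
        capture_output=True,
        text=True,
        check=True,
    )
    return signature_matches(result.stdout, expected_fingerprint)
```

An unsigned commit reports status `N` with an empty fingerprint, so it never matches.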


Hints for finding the repository


There is a provision in RFC 8141 for something called an "r-component" inside of a URN. R-components are intended for passing parameters to a "resolution service" that can help you find the resource you are seeking. For how this might work, consider the earlier example regarding books.


RFC 8141


An ISBN URN does not tell you where to find a book; you must use a "resolution service" such as a library catalog system, bookstore portal, or other search engine to find it. Well, there's no reason that this service couldn't be a program running on your computer. Once r-components are standardized, you could use them to pass a "location hint" to that service; in the case of our URNs, you could pass one or more known network locations for the repository to make it really easy.


Unfortunately, r-components are not yet standardized; in fact, the RFC tells you that you SHOULD NOT use them yet. In the meantime, I can imagine some generous person running a service to catalog known public repos to make access easier; with local caching of repository information, that should be enough. But once they are ready, r-components can provide a long-term solution that decreases reliance on centralized services.
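The local-caching half of that idea can be very small. The Python sketch below is purely illustrative: the `Catalog` class and its methods are hypothetical, and the locations in the usage note are made-up examples. It just remembers where a repository with a given initial commit hash was last seen.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Catalog:
    """Maps an initial commit hash to places that repository was seen."""

    locations: Dict[str, List[str]] = field(default_factory=dict)

    def remember(self, root_hash: str, location: str) -> None:
        """Record a location (URL, local path, ...) for a repository."""
        known = self.locations.setdefault(root_hash.lower(), [])
        if location not in known:
            known.append(location)

    def resolve(self, root_hash: str) -> List[str]:
        """Return known locations, oldest first; empty if none are known."""
        return list(self.locations.get(root_hash.lower(), []))
```

For example, after `catalog.remember(root, "https://example.org/repo.git")` and `catalog.remember(root, "/media/usb/repo")`, a call to `catalog.resolve(root)` returns both candidates for a client to try in turn.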


The proposal


Cutting to the chase, here's the proposed URN spec in ABNF:


namestring        =  "git-resource" ":" [ commit ] [ commit ]
                     path [ fragment ]

commit            =  commit-hash [ fingerprint ] ":"
commit-hash       =  sha1 / sha256
fingerprint       =  "!" fingerprint-hash
fingerprint-hash  =  md5 / sha1 / sha256
md5               =  32HEXDIG
sha1              =  40HEXDIG
sha256            =  64HEXDIG

fragment          =  "#" line  ; other types TBD
line              =  "line" "=" %x31-39 *DIGIT  ; 1 or greater, no leading zeros

path              =  *(unreserved / pct-encoded / "/")
unreserved        =  ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded       =  "%" HEXDIG HEXDIG

For the non-CS nerds out there, here's what this might look like:


`git-resource:178647b4bcee74ce75b09f0330ec4a25ec8643a9:README.md`


This refers to a README.md file in the root of the repository that has `178647b4bcee74ce75b09f0330ec4a25ec8643a9` as its initial commit hash. (Leading slashes are unnecessary and will be ignored.) Not so bad: the hash is not human-friendly, but overall it's not too complicated.


Now for a link that points to a specified file version:


`git-resource:178647b4bcee74ce75b09f0330ec4a25ec8643a9:7da783e8536c0a4b21427f9d3d0f4134e30b7907:README.md`


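This adds a bit of noise to the path, but it does identify the version. Now here is the same URN with key fingerprints (shown here as 16 hex digits, which is shorter than the 32-, 40-, or 64-digit fingerprint hash the grammar calls for):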


`git-resource:178647b4bcee74ce75b09f0330ec4a25ec8643a9!9EDF3F9D9286FA20:7da783e8536c0a4b21427f9d3d0f4134e30b7907!9EDF3F9D9286FA20:README.md`


Super secure! And again with a specific line fragment:


`git-resource:178647b4bcee74ce75b09f0330ec4a25ec8643a9!9EDF3F9D9286FA20:7da783e8536c0a4b21427f9d3d0f4134e30b7907!9EDF3F9D9286FA20:README.md#line=5`


If you look closely at the grammar, you'll see that both commit hashes are optional. With no commit hashes, this becomes an INTERNAL link (points to a file in the same repository):


`git-resource:README.md`
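For anyone who wants to experiment, the grammar translates fairly directly into a regular expression. The Python below is a minimal sketch of a parser, not a reference implementation: the `GitResource` class and `parse` function are hypothetical names, a single commit component is read as the initial commit (as in the first example), and the fingerprint lengths follow the grammar strictly, so a 16-digit fingerprint is rejected.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote

HASH = r"(?:[0-9A-Fa-f]{64}|[0-9A-Fa-f]{40})"                 # sha256 / sha1
FPR = r"(?:[0-9A-Fa-f]{64}|[0-9A-Fa-f]{40}|[0-9A-Fa-f]{32})"  # sha256 / sha1 / md5

URN_RE = re.compile(
    r"git-resource:"
    rf"(?:(?P<repo>{HASH})(?:!(?P<repo_fpr>{FPR}))?:)?"
    rf"(?:(?P<commit>{HASH})(?:!(?P<commit_fpr>{FPR}))?:)?"
    r"(?P<path>(?:[A-Za-z0-9._~/-]|%[0-9A-Fa-f]{2})*)"
    r"(?:#line=(?P<line>[1-9][0-9]*))?"
)


@dataclass(frozen=True)
class GitResource:
    path: str
    repo: Optional[str] = None                # initial commit hash
    repo_fingerprint: Optional[str] = None
    commit: Optional[str] = None              # commit naming the file version
    commit_fingerprint: Optional[str] = None
    line: Optional[int] = None


def parse(urn: str) -> GitResource:
    """Parse a git-resource name; raise ValueError if it does not match."""
    match = URN_RE.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a valid git-resource name: {urn!r}")
    line = match["line"]
    return GitResource(
        # Percent-decode, then drop leading slashes, which are ignored.
        path=unquote(match["path"]).lstrip("/"),
        repo=match["repo"],
        repo_fingerprint=match["repo_fpr"],
        commit=match["commit"],
        commit_fingerprint=match["commit_fpr"],
        line=int(line) if line else None,
    )
```

Because the path rule does not allow a literal `:`, the commit components can be split off unambiguously; an internal link such as `git-resource:README.md` simply parses with `repo` and `commit` left as `None`.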


So this addresses a variety of use cases while still meeting the requirements. The hashes aren't all that friendly to human eyes, but I don't know how else to accomplish this without sacrificing something important. If there is any interest in this, I would definitely make a `git-urn` tool that can build these URNs for you automatically, as a way of bootstrapping usage.
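The string-building core of such a `git-urn` tool could be as small as the Python sketch below. Everything here is an assumption for illustration: the `build_urn` name, its parameters, and the decision to reject a version commit without an initial commit (since a lone commit component is read as the repository identifier). It does not validate hash or fingerprint lengths.

```python
from typing import Optional
from urllib.parse import quote


def build_urn(path: str,
              repo: Optional[str] = None,
              commit: Optional[str] = None,
              repo_fingerprint: Optional[str] = None,
              commit_fingerprint: Optional[str] = None,
              line: Optional[int] = None) -> str:
    """Assemble a git-resource name from its parts."""
    if commit and not repo:
        raise ValueError("a versioned link also needs the initial commit hash")

    def component(commit_hash: str, fingerprint: Optional[str]) -> str:
        # commit = commit-hash [ "!" fingerprint-hash ] ":"
        return commit_hash + (f"!{fingerprint}" if fingerprint else "") + ":"

    urn = "git-resource:"
    if repo:
        urn += component(repo, repo_fingerprint)
    if commit:
        urn += component(commit, commit_fingerprint)
    # Percent-encode everything outside the unreserved set, keeping "/".
    urn += quote(path.lstrip("/"), safe="/")
    if line is not None:
        if line < 1:
            raise ValueError("line numbers start at 1")
        urn += f"#line={line}"
    return urn
```

A command-line wrapper would mostly be a matter of filling in `repo` and `commit` from the working copy and passing the file path through.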


Conclusion


What about forks? Web of trust concerns? Other issues? Believe me, I have plenty of thoughts here, but I've met my quota of writing for a time.


Please let me know if you have thoughts to add, either by email or through a gemlog response (I check Antenna daily).


---


Comments? Email the author: mntn at mntn.xyz

