
Viewing Cached Gemini Content (or: gemini.circumlunar.space is offline and that is super annoying, so I fixed it)

2022-02-21 | #search #kennedy #delorean | @Acidus


Today I was trying to reply to an email about how Kennedy, my Gemini search engine, handles robots.txt rules. ~Solderpunk has a companion spec for using robots.txt in Gemini space: it's essentially the original robots.txt spec, with really primitive "Disallow" rules and some guidance on how to handle user-agents.
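

As a rough illustration (not taken from the spec itself; the paths here are invented), a capsule's robots.txt looks much like the classic web version, served from the capsule root:

```
# Hypothetical robots.txt at gemini://example.org/robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```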


Unfortunately, I couldn't link to this content, because gemini.circumlunar.space is currently offline. (The DNS resolves, but nothing is listening on port 1965.) It never occurred to me that one of the original capsules, containing the specifications for Gemini, would go offline, so I never thought to cache it. While people like ~ew keep a local cached copy of content when replying to it on their gemlogs, that wouldn't help me here, since I had never cached this page myself.
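

For the curious, this is easy to verify from a shell (a generic check, nothing Kennedy-specific):

```
# The name resolves, but the TLS handshake never starts because
# no process accepts the TCP connection on port 1965:
openssl s_client -connect gemini.circumlunar.space:1965
```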


🔭 Viewing cached content on Kennedy


Luckily, Kennedy's crawler keeps a local copy of documents in Gemini space. I do this so I can try different indexing and search strategies without having to do an entire re-crawl. This made it trivially easy to add a great new feature: View Cached Content.
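

A minimal sketch of the idea (this is not Kennedy's actual code, and the storage layout is an assumption): key each fetched document by its URL, so the indexer can be re-run entirely from local copies.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("crawl-cache")  # hypothetical location

def cache_path(url: str) -> Path:
    # Hash the URL so any URL is safe to use as a filename.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return CACHE_DIR / digest

def store(url: str, meta: str, body: bytes) -> None:
    # Keep only the most recent copy: each crawl overwrites the last.
    path = cache_path(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(meta.encode("utf-8") + b"\r\n" + body)

def load(url: str) -> bytes | None:
    # Return the stored response, or None if this URL was never cached.
    path = cache_path(url)
    return path.read_bytes() if path.exists() else None
```

Re-indexing then walks the cache directory instead of hitting the network.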


So I have added "Cached copy" links to Kennedy search results, which allow you to view the copy of a URL as it existed when Kennedy crawled it.
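

In gemtext, each result simply gains an extra link line; the cache path below is invented for illustration:

```
=> gemini://example.org/docs/robots.gmi Gemini robots.txt companion spec
=> /cached?url=gemini%3A%2F%2Fexample.org%2Fdocs%2Frobots.gmi Cached copy
```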


Screenshot of Kennedy results with "Cached copy" link.

Kennedy results for "robots.txt"


This is super helpful. Sometimes clicking on a search result gives you an error because that capsule is offline or otherwise unavailable. With the cached copy, you can still read the content.


For example, here is the cached version of the robots.txt companion spec from gemini.circumlunar.space:


Kennedy's cached copy of the robots.txt companion spec page


🏎 DeLorean: Cache by URL


While having an option to view cached content is great when looking at search results, it doesn't help when you want the cached copy of something specific. For example, I know that gemini.circumlunar.space is offline. I should be able to pull up the cached contents directly by URL, without having to find the page via search results and follow the "Cached copy" link from there.


So I built another feature, which I'm calling DeLorean, after the DeLorean time machine from Back to the Future. DeLorean lets you provide a URL and view its cached contents, if a cached copy exists in the search database.
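

Conceptually, DeLorean is a thin lookup over the same crawl cache. A hedged sketch (the handler shape is an assumption; the status codes are standard Gemini):

```python
from urllib.parse import unquote

def delorean(query: str) -> tuple[str, bytes]:
    # Returns a (meta line, body) pair for a Gemini response.
    if not query:
        # Gemini status 10: prompt the client for input.
        return ("10 Enter a URL to view its cached copy", b"")

    url = unquote(query)
    cached = load(url)  # the cache lookup sketched earlier
    if cached is None:
        # Gemini status 51: not found.
        return ("51 No cached copy of that URL", b"")

    # The original meta line was stored in front of the body.
    meta, _, body = cached.partition(b"\r\n")
    return (meta.decode("utf-8"), body)
```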


🏎 DeLorean: View Cached Gemini content


Limitations


These new features are not the same as the Internet Archive's Wayback Machine. I only keep a local cached copy for content that would appear in Kennedy search results. This means:

Only the most recent copy of content is cached. A versioned archive is not available.

Only content with a text/gemini or text/plain MIME type is cached (see the sketch after this list).

Content excluded from the search engine by robots.txt is not cached (since it is never crawled to begin with).

I'm not transforming any links in the cached copy, so clicking a link in a cached page will probably fail if the source capsule is still unavailable.
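

Taken together, the caching rules above reduce to a small predicate. A sketch under those assumptions (the robots.txt decision is passed in, since excluded URLs are never fetched at all):

```python
CACHEABLE_TYPES = ("text/gemini", "text/plain")

def should_cache(meta: str, allowed_by_robots: bool) -> bool:
    # Content excluded by robots.txt is never fetched, so never cached.
    if not allowed_by_robots:
        return False
    # A successful meta line looks like "20 text/gemini; charset=utf-8".
    status, _, mime = meta.partition(" ")
    if status != "20":
        return False
    return mime.split(";")[0].strip() in CACHEABLE_TYPES
```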


Future directions?


Building DeLorean into something more like the Wayback Machine would be a bit involved. Funnily enough, the robots.txt companion spec I couldn't link to describes different virtual user-agents: a search engine should follow the "indexer" rules, while something like the Wayback Machine should follow the "archiver" rules. Right now, I'm just using the Kennedy search database and only keeping the latest copy, so I feel that following the "indexer" rules is probably OK. If I do more, I will need to start honoring the "archiver" rules, which could be different. Other challenges:


Overhead of balancing what is allowed to be crawled vs. allowed to be archived (see the example robots.txt after this list).

Handling cases where I have cached content, and the original capsule goes away. How long do I keep it?

Handling when content goes away, but the original capsule still exists. How long do I keep it?

Process for people removing things from the archive.

Dealing with backlash from people who don't want anything in the archive, but don't have a robots.txt, and are yelling at me.
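

On that first point, the companion spec lets a capsule draw this line itself. A hypothetical robots.txt (paths invented) that permits indexing but refuses archiving entirely:

```
User-agent: indexer
Disallow: /journal/drafts/

User-agent: archiver
Disallow: /
```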


For now, I have a lot on my plate, so creating a Wayback-style archive isn't a priority at all. However, I'm very open to feedback about this. What would you want to see in a Gemini archive?
