Gemini Archive

🤔 📜️ 🤗

About this Collection

This is an early attempt to archive all publicly facing gemini:// servers for historical preservation. Even though the gemini community is still young, I believe that it's important to record the ecosystem while many of the original servers are still accessible. This archive comprises three separate gemini crawls made between September 2020 and November 2020.

Each crawl is slightly different as bugs were discovered and the crawling software was refined. Gemini servers were checked for robots.txt files and exclusion rules were obeyed. All gemini requests were recorded verbatim (including malformed & invalid responses) using the WARC 1.1 standard format. The raw log output from the crawling tool was also saved, which includes additional metadata for URLs that failed to download for reasons like TLS errors and robots.txt rules.
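
As a rough illustration of the robots.txt check described above, here is a minimal Python sketch. It is not the crawler's actual code: the function name is_allowed, the "archiver" user agent, and the example rules are all assumptions made for this example. It reuses Python's standard robots.txt parser, which only looks at the path portion of the URL and so also accepts gemini:// URLs.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "archiver") -> bool:
    """Return True if the given robots.txt body permits fetching url.

    The "archiver" user agent is an assumption for this sketch; the
    archive description does not say which agent name the crawler used.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt body as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt body, as it might be served from
# gemini://example.org/robots.txt
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

A crawler built this way would fetch each host's robots.txt once, cache the body, and call is_allowed() before queueing any URL on that host.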

Details

Summary

Crawl           | September  | October    | November
---             | ---        | ---        | ---
Date            | 2020-09-24 | 2020-10-31 | 2020-11-07
Size            | 9.3 GB     | 12.9 GB    | 13.5 GB
Domains seen    | 283        | 276        | 314
Total Responses | 51,995     | 71,632     | 65,347
2x Responses    | 43,425     | 61,771     | 56,680

September Crawl (1 of 3)

=>https://archive.org/details/mozz-gemini-crawl-2020-1

This was my first attempt at a global crawl of geminispace. The crawling software crashed after about 3 hours due to an out-of-memory error, and unfortunately I was unable to resume it after that. However, a significant number of URLs were successfully scraped during that window. I also noticed afterwards that some domains were downloaded twice: once with and once without the default port ":1965" in the URL. I changed this behavior so that later crawls would remove the default port number from request URLs.
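
The default-port fix can be sketched in a few lines of Python. This is only an illustration of the idea, not the crawler's real implementation; the function name strip_default_port is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_GEMINI_PORT = 1965


def strip_default_port(url: str) -> str:
    """Drop an explicit ":1965" so both spellings of a URL compare equal."""
    parts = urlsplit(url)
    if parts.scheme == "gemini" and parts.port == DEFAULT_GEMINI_PORT:
        # The port is the last colon-separated piece of the netloc.
        host_only = parts.netloc.rsplit(":", 1)[0]
        parts = parts._replace(netloc=host_only)
    return urlunsplit(parts)
```

With this in place, "gemini://mozz.us:1965/" and "gemini://mozz.us/" normalize to the same string, so a visited-URL set sees them as one page. Non-default ports such as ":1966" are left untouched.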

October Crawl (2 of 3)

=>https://archive.org/details/mozz-gemini-crawl-2020-2

This was my second attempt at a global crawl of geminispace. Changes were made to the software to make it more resilient against unexpected crashes. This time, the crawler was able to finish successfully. It got tripped up by a few infinite redirect loops and unresponsive domains, which I had to block manually. There was also a bug in the software that caused root URLs to be marked as duplicates. For example, if "gemini://mozz.us" redirected to "gemini://mozz.us/", the latter URL was marked as a duplicate and skipped. This caused the crawler to miss several important gemini home pages.
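
Both problems from this crawl (redirect loops and the root URL duplicate bug) come down to how redirects are tracked. The Python sketch below shows one possible approach and is not taken from the actual crawler: the names resolve, RedirectLoopError, and MAX_REDIRECTS, the cap of 5 hops, and the fetch callback are all assumptions. For simplicity it also assumes that redirect targets are absolute URLs.

```python
from typing import Callable, Tuple

MAX_REDIRECTS = 5  # assumed cap for this sketch


class RedirectLoopError(Exception):
    """Raised when a redirect chain loops or grows too long."""


def resolve(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],
    max_redirects: int = MAX_REDIRECTS,
) -> Tuple[str, int]:
    """Follow 3x redirects and return (final_url, final_status).

    fetch(url) is a caller-supplied function returning (status, meta).
    URLs are compared as exact strings, so "gemini://mozz.us" and
    "gemini://mozz.us/" count as different hops and the second one is
    still fetched instead of being skipped as a duplicate.
    """
    seen = {url}
    for _ in range(max_redirects + 1):
        status, meta = fetch(url)
        if status // 10 != 3:
            return url, status
        if meta in seen:
            raise RedirectLoopError("redirect loop at " + meta)
        seen.add(meta)
        url = meta
    raise RedirectLoopError("too many redirects starting from " + url)
```

Keeping the per-chain "seen" set separate from the global visited set is the design choice that matters here: a redirect target is always fetched at least once, while a chain that revisits any of its own URLs is cut off.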

November Crawl (3 of 3)

=>https://archive.org/details/mozz-gemini-crawl-2020-3

This was my third attempt at a global crawl of geminispace. All known bugs from the previous two crawls were fixed. There was a noticeable increase in TLS handshake errors this time around, which I attribute to gemini server admins who were playing around with their TLS settings.

Live View

(2020-12-14 Update: The mirror is currently offline to save hosting costs. Stay tuned for future updates!)

I'm temporarily hosting a mirror of this archive online. The mirror works by leveraging the "proxy" feature of the gemini protocol. The server will listen for any gemini:// URL, and will then attempt to replay the saved response from the archive.

=>gemini://mozz.us:1966

You can connect to it using any gemini client that supports defining a proxy server. Example request (using gemget):

$ gemget --proxy mozz.us:1966 -o - gemini://gemini.circumlunar.space/capcom/

=>https://github.com/makeworld-the-better-one/gemget gemget - Command line downloader for the Gemini protocol
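
For readers without a proxy-capable client, the request above can be approximated with a short Python sketch. A Gemini request is just the absolute URL followed by CRLF, sent over TLS; proxying means opening the connection to the proxy host instead of the host named in the URL. The function names below are made up for this example, and certificate verification is disabled as a simplification because many gemini servers use self-signed certificates.

```python
import socket
import ssl
from typing import Tuple


def build_request(url: str) -> bytes:
    """A Gemini request line: the absolute URL followed by CRLF."""
    return url.encode("utf-8") + b"\r\n"


def parse_header(line: bytes) -> Tuple[int, str]:
    """Split a response header such as b"20 text/gemini" into (status, meta)."""
    status, _, meta = line.decode("utf-8").rstrip("\r\n").partition(" ")
    return int(status), meta


def fetch_via_proxy(url: str, proxy_host: str, proxy_port: int) -> Tuple[int, str, bytes]:
    """Send the request for url to the proxy rather than to the URL's own host."""
    context = ssl.create_default_context()
    # Simplification for this sketch: skip certificate checks entirely.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((proxy_host, proxy_port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=proxy_host) as tls:
            tls.sendall(build_request(url))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    # The header is everything up to the first CRLF; the rest is the body.
    header, _, body = b"".join(chunks).partition(b"\r\n")
    status, meta = parse_header(header)
    return status, meta, body
```

Calling fetch_via_proxy("gemini://gemini.circumlunar.space/capcom/", "mozz.us", 1966) would correspond to the gemget command shown earlier, although it will only succeed while the mirror is online.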
