Mirror of Drew DeVault's capsule available

2022-08-15 | #mirrors | @Acidus


As mentioned on Station, Drew DeVault's capsule has been offline for a month or so:


>Seems drewdevault finally pulled the plug on their capsule. Sad to see it go

@smokey on missing capsule

>I wrote to Drew DeVault asking if he can restore his capsule... Will he ever reply to me?

@freezr on missing capsule


Luckily, I was able to reconstruct most of Drew's capsule using content saved from older crawls by Kennedy, my Gemini search engine. I rewrote the internal hyperlinks to be relative links, so you can read the capsule online or off.


[Update: Drew's capsule, and any others, are now available via Delorean Time Machine]

Archive of Drew DeVault's capsule


The capsule had some CGIs which obviously won't function. Also, my captured data predated Kennedy's image search feature, so I was only storing responses with a "text/*" MIME type. There are about 30 images on his capsule that I don't have a copy of. All in all, I salvaged 110 pages.


How I did this


Usually a search engine has a centralized database of results, which the crawler uses to determine what content should be visited, refreshing those results over and over again, continuously. I tend to make a lot of changes to Kennedy: how the crawler works, what data it collects, and how that data is stored. This is true today and was certainly true in the first few months of building Kennedy, as I was organically figuring it all out. So from the very beginning, I wrote Kennedy's crawler to always start fresh. Each time the crawler runs, it produces a new search database and a data store containing saved copies of all the responses.
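
To make that concrete, here's a rough sketch of what a self-contained crawl output could look like. This is illustrative Python, not Kennedy's actual code, and the directory layout and schema are invented for the example:

```python
# Illustrative only: an invented layout, not Kennedy's real storage format.
# Each crawl run gets its own directory holding a search database plus a
# data store with a saved copy of every response body.
import hashlib
import pathlib
import sqlite3
import time


def new_crawl(root="crawls"):
    """Create a fresh, self-contained output directory for one crawl run."""
    crawl_dir = pathlib.Path(root) / time.strftime("%Y-%m-%d-%H%M%S")
    (crawl_dir / "responses").mkdir(parents=True)
    db = sqlite3.connect(str(crawl_dir / "search.db"))
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, status INTEGER, mime TEXT, body_file TEXT)")
    return crawl_dir, db


def save_response(crawl_dir, db, url, status, mime, body):
    """Save the raw response body and index it in this crawl's own database."""
    name = hashlib.sha256(url.encode()).hexdigest() + ".dat"
    (crawl_dir / "responses" / name).write_bytes(body)
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)", (url, status, mime, name))
    db.commit()
```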


This self-contained approach turned out to be super helpful:

If a crawl has a problem, I just delete it

Updating Kennedy's search database is just a configuration change to point to new data files

I can run analysis on Gemini space, without waiting for a new crawl, by just iterating over all the saved responses
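
For example, an offline analysis pass against that hypothetical layout is just a query and a loop, with no crawler involved:

```python
# Again a sketch against the made-up layout above: count how many successful
# text responses one old crawl saved for each capsule, entirely offline.
import collections
import pathlib
import sqlite3
from urllib.parse import urlparse


def pages_per_capsule(crawl_dir):
    db = sqlite3.connect(str(pathlib.Path(crawl_dir) / "search.db"))
    counts = collections.Counter()
    for (url,) in db.execute("SELECT url FROM pages WHERE status = 20 AND mime LIKE 'text/%'"):
        counts[urlparse(url).hostname] += 1
    return counts


# e.g. pages_per_capsule("crawls/2022-06-15-120000")["drewdevault.com"]
```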


A side benefit of this approach is that I tend to have older copies of the search database and data store scattered around, including a copy from mid-June that had ~140 files from Drew's capsule.


> Oh hot damn! Last week @freezr posted about trying to get Drew DeVault's capsule back online. I went looking at data from old Kennedy crawls and found I had visited 124 URLs on his capsule in mid June. Back then I only cached text content, which returned a status of 20. So I have 104 gemtext pages from Drew's Capsule. I need to write some code to export that (maybe make it a gempub as well) and then I'll post it back on line! Saving full bodies, FTW!

Me, on Station


I wrote code that pulled all this content out and saved it to files. I've done similar work with website data in the past. Usually there are problems when characters in a URL aren't allowed in file names. Things like query strings are especially annoying, and file systems often have limits on the maximum length of a path, which makes it difficult to have a clean URL-to-file mapping. Luckily, most capsules tend not to use query strings, and the URLs are fairly simple.
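
The sketch below shows the general shape of that mapping (simplified, and not the exact scheme I used): keep simple path segments, swap out characters file systems reject, and fall back to hashing when a query string or an over-long segment turns up.

```python
# A rough sketch of mapping gemini:// URLs onto relative file paths; the
# character list and limits here are arbitrary choices for the example.
import hashlib
import re
from urllib.parse import urlparse

MAX_SEGMENT = 100  # arbitrary cap, just to stay under typical path-length limits


def url_to_relative_path(url):
    """Turn a gemini:// URL into a relative file path inside the mirror."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if path.endswith("/"):
        path += "index.gmi"                        # directory URLs get an index file
    segments = []
    for seg in path.strip("/").split("/"):
        seg = re.sub(r'[<>:"\\|?*]', "_", seg)     # swap out characters file systems dislike
        if len(seg) > MAX_SEGMENT:
            seg = hashlib.sha1(seg.encode()).hexdigest()
        segments.append(seg)
    if parsed.query:                               # query strings don't map cleanly, so hash them
        segments[-1] += "_" + hashlib.sha1(parsed.query.encode()).hexdigest()[:8]
    return "/".join(segments)
```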


The surprisingly hard part of this project was writing the code to rewrite the links in Drew's gemtext into relative links. This was critical to letting a reader navigate around the extracted pages.
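
Roughly, it boils down to this: for each "=>" line, resolve the target against the page's own URL, and if it points back into the same capsule, swap in the relative path between the two extracted files. Here's a simplified sketch (it reuses the hypothetical url_to_relative_path() helper from above, and isn't the code I actually ran):

```python
# Sketch of rewriting gemtext "=>" links to point at extracted files instead
# of gemini:// URLs. Assumes the url_to_relative_path() helper sketched above.
import posixpath
import re
import urllib.parse
from urllib.parse import urljoin, urlparse

# urljoin() only resolves relative references for schemes it knows about,
# so teach it about gemini:// first.
urllib.parse.uses_relative.append("gemini")
urllib.parse.uses_netloc.append("gemini")

LINK_LINE = re.compile(r"^=>\s*(\S+)(.*)$")


def rewrite_links(gemtext, page_url, capsule_host):
    """Rewrite links that stay within capsule_host as relative file paths."""
    out = []
    for line in gemtext.splitlines():
        m = LINK_LINE.match(line)
        if m:
            target = urljoin(page_url, m.group(1))      # resolves relative links too
            if urlparse(target).hostname == capsule_host:
                here = posixpath.dirname("/" + url_to_relative_path(page_url))
                there = "/" + url_to_relative_path(target)
                line = "=> " + posixpath.relpath(there, start=here) + m.group(2)
        out.append(line)
    return "\n".join(out)
```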

I'll still consider creating a gempub of the content at some point. Besides Lagrange, I don't know of any clients that support it, and work on the spec seems to have stalled:
Gempub specification


Does anyone use a client that supports it?
