🦊🛰 Full Mailing List with Numbers Restored!

2023-12-09 | #mailinglist #mbox | @Acidus


I recently created a gemtext version of the Gemini Mailing List:


🛰🦊❤️ Orbital Fox Redux: Complete Mirror of Gemini Mailing List

Complete, threaded archive of the Gemini Mailing List


Orbital Fox (🦊🛰), the server that hosted the mailing list before it died, also provided an HTML archive. People posting to the mailing list would often include hyperlinks to this HTML archive to point to previous discussions or decisions. Links to previous mailing list messages took the form:


https://lists.orbitalfox.eu/archives/gemini/[YEAR]/[6 DIGIT NUMBER].html

I really wanted to be able to use the same numbering convention in my gemtext archive, so that I could rewrite the hyperlinks to point to the correct message in my archive. As I mentioned in a previous post, the six-digit number was *mostly* increasing, but there were some odd jumps. Message #125 would appear chronologically *before* message #124, and some numbers wouldn't be used at all.


There were enough oddities that, over the entire 7700+ messages, the numbers would be way off by the end. I couldn't figure out this crazy logic that seemed to create the number series, so I couldn't use the same numbers for my archive. Super bummer. I wrote about my frustrations:


Help wanted: Recovering the actual message numbers from the Mailing List archive


This generated some ideas from the community, but mostly it was stuff I had already tried (don't order with timezones, try UTC, etc.), which didn't work. Besides, no one had an idea that would explain the missing numbers at all. So I was stuck.


And as much as I tried to move on, I kept coming back to figure out what was going on.


Wrong Assumptions


While hacking on this problem, I noticed something odd. Orbital Fox's 2019.mbox file, which you can download from the Wayback Machine, has 294 messages in it. But the saved HTML archive page only has 289 messages for all of 2019...


Archive 2019 mbox with 294 messages

Orbital Fox's HTML page for 2019 showing only 289 messages


Turns out I made 2 wrong assumptions. The first was that an email message would appear only once in an mbox file. It's not true! Looking at the 2019 mbox file, I found that it actually contains duplicate emails: entire emails which appear twice, verbatim, including headers like the Message-ID header.


How many duplicate messages? 5. And 294 messages - 5 duplicates = 289 messages. So the HTML view is showing the unique messages, which makes sense. So which messages appear multiple times in the mbox file? It turns out they are the same messages that appear out of order or leave gaps in the numbering!
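The duplicate check can be sketched in a few lines of Python using the standard library's mailbox module. This is only a rough illustration, not the code I actually ran; the function names are made up for this example, and it assumes every message carries a Message-ID header.

```python
import mailbox
from collections import Counter


def message_ids_from_mbox(path):
    """Yield each message's Message-ID header, in the order stored in the file."""
    # create=False avoids silently creating an empty file if the path is wrong.
    for message in mailbox.mbox(path, create=False):
        yield message["Message-ID"]


def duplicate_message_ids(message_ids):
    """Map each Message-ID seen more than once to its number of copies."""
    counts = Counter(message_ids)
    return {mid: n for mid, n in counts.items() if n > 1}


def unique_count(message_ids):
    """Number of distinct messages once repeated copies are collapsed."""
    return len(set(message_ids))
```

Collecting the IDs once with list(message_ids_from_mbox("2019.mbox")) and passing that list to both helpers should show whether the gap between the raw count and the unique count matches the number of duplicated Message-IDs.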


The actual algorithm


Load an mbox from Orbital Fox

Start reading the messages, one at a time, in the order they appear *IN THE FILE*.

Each message read is assigned a number, starting with 0

When assigning a number to a message, if a message with that same Message-ID header already exists and has been given a number, drop that original message. Whatever number was given to the later copy of the message is used.
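The steps above can be sketched in Python roughly like this. It is a minimal sketch, assuming the Message-ID values have already been read from the mbox in file order; assign_numbers and unused_numbers are hypothetical names for this example, and messages lacking a Message-ID are not handled.

```python
def assign_numbers(message_ids, start=0):
    """Number messages in file order; a later copy replaces an earlier copy's number."""
    numbers = {}
    for number, message_id in enumerate(message_ids, start=start):
        # Plain overwrite: if this Message-ID was seen before, its earlier
        # number is dropped and the later copy's number is kept.
        numbers[message_id] = number
    return numbers


def unused_numbers(numbers, total, start=0):
    """List the numbers that were handed out but ended up dropped."""
    used = set(numbers.values())
    return [n for n in range(start, start + total) if n not in used]
```

Every position in the file consumes a number, even when the message at that position is a repeat. That is what leaves holes in the series: the earlier copy's number is never reused.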


Here is an example. The email from Jason McBrayer, sent on 2019-09-07 at 21:38:43 UTC, with Message-ID "878sqzrgdo.fsf@cassilda.carcosa.net" appears twice in the mbox file. The first time it appears, it is given the message number #121. However the same email appears again in the mbox file, as message number #125. According to the algorithm, we drop the first email and just use the message number (#125) from the second copy. This is why #121 is not used in the Orbital Fox archive, and message #125 appears immediately after #120.


Different Sources


In retrospect, that seems pretty obvious. Why didn't I see these duplicate messages and their effect earlier? That was my second mistaken assumption: I assumed all the different mbox files that people had saved or made available contained the same messages!


Some of the mbox files you can find are not the original mbox files, or even the original mbox files all concatenated together into a single "complete" mbox file. Some of them were created by importing the original mbox files into a mail client and then exporting the messages as a new mbox. Depending on the mail client, this process de-dups the messages. So, depending on the mbox file I was working with, I would get different message counts and ordering.


I was mostly working with an "all.mbox" file, which I assumed was the same as the individual mboxes. It was only when I switched to the original mbox files from Orbital Fox, while trying to troubleshoot the numbering, that the duplicates showed up.


A Shortcut? 🙈


In hindsight I probably could have avoided all of this by just looking more at the software Orbital Fox used to manage the mailing list and generate the HTML archive. It used GNU Mailman 2.x. By looking at the code, or even running the source mboxes through Mailman, I probably could have avoided all this work. But that would not have been as much fun.


Final Result


Regardless of how I got here:


I was able to reconstruct all the Orbital Fox message numbers.

I used those message numbers to renumber the filenames in my Gemini archive, mirroring what was used in the original HTML archive.

I have rewritten any hyperlinks in the original messages that pointed to the Orbital Fox HTML archive so they instead point to the same message in my Gemini archive. This lets readers follow any references the original authors made.
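For the curious, the link rewriting could look something like the Python sketch below. The regular expression follows the Orbital Fox URL form shown earlier. The target template (gemini://example.org/...) is a placeholder invented for this example, not the real layout of my archive, and rewrite_links is a hypothetical name.

```python
import re

# Matches https://lists.orbitalfox.eu/archives/gemini/[YEAR]/[6 DIGIT NUMBER].html
ORBITAL_FOX_LINK = re.compile(
    r"https://lists\.orbitalfox\.eu/archives/gemini/"
    r"(?P<year>\d{4})/(?P<number>\d{6})\.html"
)


def rewrite_links(text, template="gemini://example.org/archive/{year}/{number}.gmi"):
    """Replace Orbital Fox archive links with links built from the template."""
    def replace(match):
        # Reuse the year and the six-digit message number unchanged.
        return template.format(year=match.group("year"), number=match.group("number"))

    return ORBITAL_FOX_LINK.sub(replace, text)
```

Because the message numbers in the new archive match the old ones, the rewrite is a pure string substitution with no lookup table needed.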


