Help wanted: Recovering the actual message numbers from the Mailing List archive

2023-10-14 | #gemini #mailinglist | @Acidus


I've gotten some nice feedback on my Gemini-first archive of the Gemini Mailing List that I released a few days ago. I even implemented some suggestions like masking email addresses.


🛰🦊❤️ Orbital Fox Redux: Complete Mirror of Gemini Mailing List


However, there was one thing I really wanted to do that I was not able to accomplish: Using the same message numbering scheme. You see, Orbital Fox hosted a public HTML archive of all the mailing list messages. Each message on the mailing list was given a number and was accessible via a URL like this:


Format:
https://lists.orbitalfox.eu/archives/gemini/[YEAR]/[6 DIGIT NUMBER].html

Example:
https://lists.orbitalfox.eu/archives/gemini/2019/000046.html

Threads were represented in the HTML interface using nested lists, with links to each specific message.

Wayback machine archive of Orbital Fox's HTML threaded view for 2019


These hyperlinks to specific messages were SUPER IMPORTANT! They allowed people on the list to include hyperlinks to previous discussions or decisions when someone would ask questions or propose changes. They also help you track how ideas evolved over time. If I could use the same message numbering as Orbital Fox did, then people could still follow these hyperlinks to the message that an author was referencing.


In other words, if I could use the same numbering scheme, then the message:

https://lists.orbitalfox.eu/archives/gemini/2019/000046.html

would be available in my archive at, say:

gemini://gemi.dev/gemini-mailing-list/2019/000046.gmi

I could even *rewrite* references in the archive to point to my Gemini links, so the references would be preserved! That would be awesome and would really help the reading experience!
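Here is a minimal sketch of that rewrite step, assuming the numbers matched. The Gemini path is just the "say:" example from above, and the function name is made up for illustration.

```python
import re

# Matches message URLs in the Orbital Fox format: a 4-digit year, then a 6-digit number.
ORBITAL_FOX_URL = re.compile(
    r"https://lists\.orbitalfox\.eu/archives/gemini/(\d{4})/(\d{6})\.html"
)

# Assumed target layout, copied from the example Gemini URL above.
GEMINI_TEMPLATE = r"gemini://gemi.dev/gemini-mailing-list/\1/\2.gmi"


def rewrite_references(text: str) -> str:
    """Replace every Orbital Fox message URL in text with the matching Gemini URL."""
    return ORBITAL_FOX_URL.sub(GEMINI_TEMPLATE, text)
```

This only works if the year and the 6-digit number carry over unchanged, which is exactly the part that turns out to be hard.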


Numbering Madness


So, just use the same numbers from Orbital Fox's HTML interface, right? That can't be hard. Only it is, because the way messages were numbered in Orbital Fox's HTML archive is just madness and I can't seem to figure it out.


I assumed that these message numbers were assigned, starting with 000000, to each message on the mailing list, based on when the message was received. Open this Wayback machine version of the Threaded view from 2019:


Wayback machine archive of Orbital Fox's HTML threaded view for 2019


The first 4 messages in the first thread of the mailing list ("Let's get this list started") use the numbers 000000, 000001, 000002, and 000004. Where is message 000003? Well, message 000003 is the first message of the 2nd thread (the absolutely insane read that starts as "Text reflow woes (or: I want bullets back!)"), because it was sent before the 4th message of the first thread (which is message 000004).


So, that seems to match what I expected: Message numbers just increment. So all I need to do is sort the messages by date sent, and then assign them incremental numbers starting at 000000 right? No.
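For reference, the naive approach looks roughly like this. It is a sketch, not the code behind my archive: it assumes every message has a parseable Date header with a real UTC offset, and the function name is made up.

```python
from email.utils import parsedate_to_datetime


def assign_naive_numbers(date_headers):
    """Map each message's input position to a 6-digit number, in send order."""
    # Parse RFC 5322 Date headers into timezone-aware datetimes,
    # remembering each message's original position.
    parsed = [(parsedate_to_datetime(h), i) for i, h in enumerate(date_headers)]
    parsed.sort()
    # Number the messages 000000, 000001, ... in chronological order.
    return {i: f"{n:06d}" for n, (_, i) in enumerate(parsed)}
```

By construction this never skips a number and never goes backwards, so it cannot reproduce a numbering that does either.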


To see why, look at this Wayback machine copy of Orbital Fox's HTML archive's "date" view, which shows you all messages in a year, in the order they were received:


Orbital Fox's HTML date view for 2019


That *should* simply be a list of messages, starting at 000000.html, increasing by 1 each message. Only it's not. There are 289 messages in 2019. But the message number of the last message is 000293. Wait, what? Things seem fine up until message 000120 ("CGI suport for Gemini" from solderpunk). The message after that is 000125 ("Text format proposal (was Re: Text reflow woes (or: I want bullets back!)y)" from Jason McBrayer). After 000125 comes... 000124? And after that comes message 000126?!?! 🤬 🤬 🤬 Where are messages 000121, 000122, or 000123? Yeah, they don't seem to exist. These jumps and out-of-order numbers happen multiple times in the 2020 and 2021 archives too.
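A quick sanity check on the totals: if 2019 starts at 000000 and 000293 is also the highest number used, that is 294 slots for 289 messages, so 5 numbers went unused. The sketch below lists the gaps and backwards steps in a date-view listing. It assumes the numbers have already been pulled out of the page as integers, in page order; the function name is made up.

```python
def audit_numbers(numbers):
    """Return (missing numbers, out-of-order pairs) for a date-view listing."""
    seen = set(numbers)
    # Every number between the smallest and largest that never appears.
    missing = [n for n in range(min(numbers), max(numbers) + 1) if n not in seen]
    # Every adjacent pair where the number goes down instead of up.
    out_of_order = [
        (prev, cur) for prev, cur in zip(numbers, numbers[1:]) if cur < prev
    ]
    return missing, out_of_order
```

Run on the four numbers quoted above (120, 125, 124, 126), it reports 121, 122, and 123 as missing and (125, 124) as out of order.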


Help Needed!


After banging my head against the wall for several hours, I was getting nowhere:


I have no idea how the message numbers used in the HTML archive for Orbital Fox map to the email messages in the mbox files.

There is just enough craziness in the numbers that I can't simply assume they increment by 1 and then go back and manually fix a few weird ones. With over 7700 messages, that just isn't reasonable.

The archived "Date view" from the Orbital Fox HTML interface only has message number, subject, and sender display name. It doesn't show the date of the message. So I don't have enough information to match the mbox mail messages with the HTML numbers!

I could try and pull a copy of every single HTML message via the Wayback machine, extract the date, author, content, and subject, and then try and use that to map the message number from the HTML filename back to the message in the mbox. But that sounds like a crazy amount of work, and the Wayback machine doesn't actually have a copy of every message, so while some of the message numbers in my archive would map, others wouldn't, and I can't get a complete mapping.

If the mapping isn't complete, then sometimes the numbers will be right and sometimes they will be wrong. And if the mapping isn't reliable, well then what's the point?
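To put a number on how incomplete a mapping would be, here is a rough sketch that matches on the only two fields the date view gives me: subject and sender display name. It assumes both sides are already reduced to simple tuples. The tuple layouts and the function name are made up for illustration.

```python
from collections import defaultdict


def match_rows(html_rows, mbox_msgs):
    """html_rows: (number, subject, sender). mbox_msgs: (msg_id, subject, sender).

    Returns (matched, ambiguous): matched maps an HTML number to a message id
    when the (subject, sender) pair is unique on both sides; ambiguous lists
    the HTML numbers that could not be pinned to exactly one message.
    """
    def key(subject, sender):
        # Collapse whitespace and ignore case so trivial differences still match.
        return (" ".join(subject.split()).lower(), " ".join(sender.split()).lower())

    html_by_key = defaultdict(list)
    for number, subject, sender in html_rows:
        html_by_key[key(subject, sender)].append(number)

    mbox_by_key = defaultdict(list)
    for msg_id, subject, sender in mbox_msgs:
        mbox_by_key[key(subject, sender)].append(msg_id)

    matched, ambiguous = {}, []
    for k, numbers in html_by_key.items():
        ids = mbox_by_key.get(k, [])
        if len(numbers) == 1 and len(ids) == 1:
            matched[numbers[0]] = ids[0]
        else:
            ambiguous.extend(numbers)
    return matched, sorted(ambiguous)
```

Any thread where the same person replied twice under the same "Re:" subject lands in the ambiguous pile, which is why subject and sender alone don't get me a reliable mapping.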


Other random thoughts:


At first I thought this might be a "messages sorted by their UTC time" vs "messages sorted without regard to timezone" issue, but that doesn't explain the missing numbers.

It's a mailing list, so it's not like people could "delete" a message. I have no idea why there would be missing numbers.
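The timezone idea is easy to test in isolation. The sketch below sorts the same Date headers two ways, assuming each header has a real UTC offset; the function names are made up. The two orders can disagree, but both are permutations of the same messages, so neither one can make a number disappear.

```python
from email.utils import parsedate_to_datetime


def order_by_utc(date_headers):
    """Indices of the headers sorted by absolute (offset-aware) send time."""
    return sorted(
        range(len(date_headers)),
        key=lambda i: parsedate_to_datetime(date_headers[i]),
    )


def order_by_wall_clock(date_headers):
    """Indices sorted by the clock time written in each header, ignoring the offset."""
    return sorted(
        range(len(date_headers)),
        key=lambda i: parsedate_to_datetime(date_headers[i]).replace(tzinfo=None),
    )
```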


So, please, if you were involved in the early mailing list, or have any ideas on how I can map all the 6-digit numbers used by Orbital Fox's HTML interface to the actual messages I've extracted from the mbox files, PLEASE let me know!
