Oddities with Gemini, Gemtext, and Geminispace


The specs for Gemini and Gemtext are short and easy to read. However, there are some quirks that I think are easy to overlook. Additionally, there is what the specification says, and what capsules actually do in Geminispace. This is a collection of strange things that have caused me problems as the author of a crawler, search engine, and archive.


Gemini Protocol


For success responses, the MIME Type is optional. So while most responses look like this:

20 text/gemini\r\n

This is also valid:

20 \r\n

(NOTE: That is 20, followed by a space (0x20), followed by a CRLF).


If you don't receive a MIME type, the assumed MIME type is "text/gemini".


For error responses (4x, 5x, 6x), the error message is optional. So this is a valid error response:

40 \r\n
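
As a rough illustration of the two cases above, here is a minimal Python sketch of a tolerant response header parser. The name `parse_header` is my own and not part of any library, and the choice to strip all surrounding whitespace from the meta field is an assumption about how lenient you want to be.

```python
def parse_header(line: bytes):
    """Split a Gemini response header into (status, meta).

    A missing meta on a 2x response falls back to "text/gemini".
    A missing meta on any other response is returned as an empty string.
    """
    text = line.decode("utf-8").rstrip("\r\n")
    status, meta = text[:2], text[2:].strip()
    if len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError(f"bad status code in header: {text!r}")
    if status[0] == "2" and not meta:
        meta = "text/gemini"  # default MIME type for success responses
    return int(status), meta
```

With this, both `20 \r\n` and `40 \r\n` parse without special casing at the call site.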

3x responses are redirects. While most capsules use absolute URLs, the spec allows both relative and absolute URLs. Some capsules send protocol-relative URLs (e.g. "//example.com/foo" instead of "gemini://example.com/foo"). Be prepared to handle all sorts of URLs. Some capsules will use a redirect to send you to a non-Gemini URL. This is allowed, and you should account for it.
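
Here is one way the redirect handling could be sketched in Python. One gotcha: `urllib.parse.urljoin` only resolves relative references for schemes it knows about, and "gemini" is not one of them, so this sketch temporarily borrows the "http" scheme for the join. The function name `resolve_redirect` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve_redirect(base_url: str, target: str) -> str:
    """Resolve a 3x redirect target against the URL that was requested."""
    base = urlsplit(base_url)
    # Borrow "http" so urljoin will actually resolve relative references.
    http_base = urlunsplit(("http",) + tuple(base)[1:])
    joined = urlsplit(urljoin(http_base, target))
    if not urlsplit(target).scheme:
        # Relative or protocol-relative target: restore the original scheme.
        joined = joined._replace(scheme=base.scheme)
    return urlunsplit(joined)
```

The caller still has to check the scheme of the result, since a redirect to "https://" or "mailto:" comes back unchanged and should not be fetched as Gemini.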


Legacy Gemini Servers


Some older Gemini servers (such as bleyble.com) were built while the Gemini protocol was still in flux. These send a tab (0x09) between the status code and the meta field in a response line. Most clients like Lagrange seem to handle this OK. You should handle it as well.


Some servers will throw an error message if you include the port number as part of the URL. So for those, sending a request like this

gemini://example.com/

will work, but a request like this

gemini://example.com:1965/

throws an error. I have no idea why this happens. It is probably a bug in a commonly used server. While yes, that should be fixed, if you want to access those systems you need to work around it. As such I don't send a port number as part of the URL in the Gemini request if the port is the default Gemini port of 1965.
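
A minimal sketch of that workaround in Python, assuming the URL has already been validated. The name `request_url` is made up for this example. Note that it rebuilds the host portion from the parsed hostname, so any userinfo is dropped and the hostname is lowercased.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORT = 1965

def request_url(url: str) -> str:
    """Return the URL to send in the request, omitting the default port."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # put the brackets back on IPv6 literals
    if parts.port not in (None, DEFAULT_PORT):
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

The request itself is then `(request_url(url) + "\r\n").encode("utf-8")`.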


MIME types


The vast majority of content in Geminispace uses the MIME types text/gemini or text/plain. These are usually correct. Once you start getting into image/* MIME types, or other esoteric types, the given MIME type is often incorrect. Don't assume file format from MIME type. Use a real file format parsing library, based on magic numbers, like the `file` command.
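
To show the idea of checking magic numbers, here is a tiny hand-rolled Python sketch. It only knows a handful of well-known signatures and is not a substitute for a real detection library. The names `MAGIC` and `sniff_mime` are hypothetical.

```python
# A few well-known file signatures. A real library knows thousands.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]

def sniff_mime(body: bytes, declared: str) -> str:
    """Prefer the type implied by the leading bytes over the declared one."""
    for signature, mime in MAGIC:
        if body.startswith(signature):
            return mime
    return declared  # nothing recognized, fall back to what the server said
```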


Language Parameter


The Gemini spec recommends using the "lang=" parameter on the MIME type to specify a language. Most capsules don't do this. You will find content written in French or Russian, that was not sent with a lang parameter.


How a page should specify that it contains content in multiple languages is not well defined. Some capsules send a comma-separated list of languages like "text/gemini;lang=de,en". The comma character is not allowed inside an unquoted MIME type parameter value, so this is invalid as far as strict MIME parsers are concerned, and many MIME parsing libraries will throw an error.


The best way to know the language of the content is to use a detection algorithm like ngrams. Note that this can fail on short text, or when run against preformatted sections.


Misconfigured capsules exist that will send "lang=" parameters on content that doesn't make sense, like "image/png;lang=en". This can break naive MIME type parsing code.
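
Since strict MIME libraries choke on these, a small hand-written parser can be more forgiving. This is only a sketch: the name `parse_meta` is mine, and dropping "lang" and "charset" from non-text types is a policy choice I am assuming, not something the spec mandates.

```python
def parse_meta(meta: str):
    """Leniently split a meta field into (mime, params, langs)."""
    parts = [p.strip() for p in meta.split(";")]
    mime = parts[0].lower()
    params = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:  # skip fragments that have no "="
            params[name.strip().lower()] = value.strip().strip('"')
    if not mime.startswith("text/"):
        # These parameters make no sense on e.g. image/png, so ignore them.
        params.pop("lang", None)
        params.pop("charset", None)
    # Accept the comma-separated form some capsules send.
    langs = [tag.strip() for tag in params.get("lang", "").split(",") if tag.strip()]
    return mime, params, langs
```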


Charset Parameter


In Gemini, the default character encoding for all text/* MIME types is UTF-8. Most modern content is written in UTF-8, and most "text/gemini" content is in fact using UTF-8.


If content uses another encoding the server is supposed to specify this via the "charset=" parameter on the MIME type. Many sites fail to do this. For example, there are numerous Gemini mirrors of Textfiles.com. These files date from the 1980s and use extended ASCII or other character sets which don't render properly if assumed to be UTF-8.


Automatic charset encoding detection is a well-researched and difficult problem. There are no silver bullets, only trade-offs. This makes it very difficult to reliably parse or index text/plain files.


Gemipedia: Charset detection
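
Short of real detection, a crawler can at least avoid crashing on bad bytes. The Python sketch below tries the declared charset, then UTF-8, then a last-resort encoding. The function name `decode_text` is hypothetical, and using "latin-1" as the fallback is an arbitrary assumption on my part: it never fails, but it will render DOS-era code pages incorrectly.

```python
def decode_text(body: bytes, charset=None, fallback="latin-1"):
    """Return (text, encoding_used) without ever raising on bad bytes."""
    for encoding in (charset, "utf-8"):
        if not encoding:
            continue
        try:
            return body.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong bytes for this charset, or a charset name Python
            # does not know. Try the next candidate.
            continue
    return body.decode(fallback), fallback
```

Returning the encoding that was actually used lets an indexer flag documents that needed the fallback.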


Misconfigured capsules exist that will send charset parameters for content that doesn't make sense, like "image/png;charset=utf-8". This can break naive MIME type parsing code.


Slow Down Weirdness


For "44 Slow Down" responses, the spec says the meta field should be an integer number of seconds to wait before sending again. However, some servers instead send a human-readable error message. Don't blindly assume the meta will be a number.

Most servers don't utilize "44 Slow Down". They either get overloaded, or they start throttling or banning your IP address. If you are creating a crawler, you should implement your own throttling with exponential backoff.
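
Both points can be combined in one small helper. This is a sketch under my own assumptions: the name `slow_down_delay`, the 1 second base, and the 300 second cap are all made up for illustration.

```python
def slow_down_delay(meta: str, attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait after a "44" response, or after any failed attempt.

    Uses the server's number when the meta really is an integer, otherwise
    falls back to exponential backoff: base * 2**attempt, capped at `cap`.
    """
    meta = meta.strip()
    if meta.isascii() and meta.isdigit():
        return float(meta)
    return min(cap, base * 2 ** attempt)
```

A crawler would call this with an increasing `attempt` per host and sleep for the returned value before retrying.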


Gemtext


Leading space inconsistencies


Block quote lines don't have to have a leading space:

>This is valid
> So is this

List item lines *DO* have to have a leading space:

* this is a valid list line
*this is not

Header lines don't have to have a space:

#Here is a valid header
## Here is a valid header as well
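
The rules above could be sketched as a line classifier in Python. The name `classify` is hypothetical, and the sketch deliberately ignores preformatted blocks, which a real parser has to track before looking at line prefixes.

```python
def classify(line: str):
    """Return (kind, text) for one gemtext line outside a preformatted block."""
    if line.startswith("=>"):
        return "link", line[2:].strip(" \t")
    if line.startswith("* "):      # the space is mandatory for list items
        return "list", line[2:].strip()
    if line.startswith(">"):       # the space after ">" is optional
        return "quote", line[1:].strip()
    for level in (3, 2, 1):        # check the longest prefix first
        if line.startswith("#" * level):
            return f"heading{level}", line[level:].strip()
    return "text", line
```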

Link Lines


Link lines don't have to have a leading space.

=>gemini://example.com/ this is totally valid

This might be the most poorly worded part of the spec. It defines whitespace as "any non-zero number of consecutive spaces or tabs", which would make you think that at least one whitespace character is required between `=>` and the URL. However, in the definition of a link line, the whitespace is enclosed in square brackets, and the spec then says "Square brackets indicate that the enclosed content is optional." So zero or more whitespace characters are allowed. Yet all the examples of link lines in the spec ALWAYS use whitespace between the `=>` and the URL, which reinforces the idea that at least one character is required. Like I said, it's confusing!

Just ensure your code can properly parse link lines with zero or more whitespace characters before the URL, since Gemtext using no whitespace exists out there. Also, unlike other specs that allow uncommon characters like vertical tabs to count as whitespace, Gemini only allows whitespace to be a space (0x20) or a tab (0x09). So:


=>\t\t\t\t\t\t\t\tgemini://example.com/ lots of leading tabs, still works as a link
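
A regular expression covers this nicely. The Python sketch below spells out space and tab explicitly instead of using `\s`, since `\s` would also match vertical tabs and other characters that Gemini does not treat as whitespace. The names `LINK_RE` and `parse_link` are my own.

```python
import re

# "=>", optional spaces/tabs, the URL, then optionally whitespace and a label.
LINK_RE = re.compile(r"^=>[ \t]*([^ \t]+)(?:[ \t]+(.*))?$")

def parse_link(line: str):
    """Return (url, label) for a link line, or None if it is not one.

    The label is None when the line has no link text.
    """
    match = LINK_RE.match(line)
    if not match:
        return None
    label = (match.group(2) or "").strip(" \t")
    return match.group(1), label or None
```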

Header nesting


Header lines are not required to be in any order. While uncommon, you will find gemtext with out-of-order header lines like this:

hello
### first header, but at a depth of 3
blah
# now a "higher" header?
yep, stuff is
##crazy
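
If you build a table of contents from headers, your code should not assume that a level 1 header comes first. One possible approach, sketched in Python with the hypothetical name `build_outline`, is to attach each header to the nearest earlier header with a smaller level, and to the top of the outline when there is none.

```python
def build_outline(headings):
    """Nest (level, title) pairs into a tree, tolerating any ordering."""
    root = {"level": 0, "title": None, "children": []}
    stack = [root]
    for level, title in headings:
        # Pop back to the nearest ancestor with a smaller level.
        # The root has level 0, so it is never popped.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"level": level, "title": title, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

For the example above, the level 3 header becomes a top-level entry of its own, and "crazy" nests under the level 1 header that precedes it.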

Tabs in Preformatted text


A tab is a single character, ASCII code 9. Text programs are free to decide how to render a tab. Historically, on computers, tabs were rendered as 8 space characters (ASCII code 32), but more modern programs allow you to change this, or default to another number of characters like 4. As such, you can't depend on how tabs will be rendered in preformatted text blocks. Something may look fine in your text editor, but be rendered differently:


Amfora using 4 characters when rendering a tab

Lagrange using 8 characters when rendering a tab

cat using 8 characters when rendering a tab

TextMate using 4 characters when rendering a tab


If you leave tabs in your preformatted text, it isn't really preformatted anymore, since you don't know how it will be rendered. Just use spaces instead.


Tabs and Spaces test page
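
If you are publishing gemtext, converting the tabs before you upload is straightforward. Here is a small Python sketch that expands tabs only inside preformatted blocks. The function name `detab_preformatted` and the default tab size of 8 are my own choices.

```python
FENCE = "`" * 3  # the preformatting toggle marker

def detab_preformatted(gemtext: str, tabsize: int = 8) -> str:
    """Replace tabs with spaces inside preformatted blocks only."""
    out, in_pre = [], False
    for line in gemtext.split("\n"):
        if line.startswith(FENCE):
            in_pre = not in_pre          # toggle lines are left untouched
        elif in_pre:
            line = line.expandtabs(tabsize)
        out.append(line)
    return "\n".join(out)
```

`str.expandtabs` pads to the next tab stop rather than inserting a fixed number of spaces, so columns stay aligned the way an 8-column terminal would show them.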


GeminiSpace


Robots.txt


Gemini specifies that only a subset of Robots.txt is valid. Specifically it does not support:

Allow Rules

Disallow Rules with wildcard characters in the middle

Crawl-Delay directives


Many capsules will attempt to use these. That behavior is undefined.
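
Here is a sketch of a parser that sticks to the supported subset, written in Python. It reads only "User-agent" and "Disallow" lines and silently ignores everything else, which is my own assumption about how to treat the undefined cases. Wildcards are not interpreted, so a rule containing one is matched as a literal prefix. The names `parse_disallows` and `is_allowed` are hypothetical.

```python
def parse_disallows(robots_txt: str, agents=("*",)):
    """Collect Disallow prefixes from groups matching any of `agents`.

    `agents` should be given in lowercase.
    """
    disallowed, active, prev_was_agent = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines belong to the same group.
            matches = value.lower() in agents
            active = matches or (prev_was_agent and active)
            prev_was_agent = True
            continue
        prev_was_agent = False
        if field == "disallow" and active and value:
            disallowed.append(value)
        # Allow, Crawl-delay and anything else are ignored on purpose.
    return disallowed

def is_allowed(path: str, disallowed) -> bool:
    """Plain prefix matching, with no wildcard handling."""
    return not any(path.startswith(prefix) for prefix in disallowed)
```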


TLS 1.3


Many capsules are TLS 1.3 only. Make sure you are using a TLS library that supports it.
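
In Python, for example, you can check for TLS 1.3 support and build a client context with the standard `ssl` module. This is only a sketch with assumptions of mine baked in: the function names are hypothetical, the minimum version of TLS 1.2 is my choice, and certificate verification is turned off because many capsules use self-signed certificates, which means any certificate pinning is left to the caller.

```python
import ssl

def supports_tls13() -> bool:
    """True when the linked TLS library can negotiate TLS 1.3."""
    return ssl.HAS_TLSv1_3

def make_gemini_context() -> ssl.SSLContext:
    """Client context that accepts self-signed certificates."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed lower bound
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The context is then used with `ctx.wrap_socket(sock, server_hostname=host)` so that the hostname is still sent via SNI.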

-- Page fetched on Sun May 12 22:28:16 2024