IDNs in Lagrange v0.13


Those who follow the Gemini mailing list may have noticed a message or two about IDNs and IRIs. This is the first time I'm taking a deeper look at this stuff, so here is what I've learned.


i18n


When it comes to Internationalized Domain Names (IDNs), I had been blissfully unaware that they rely on a kludge: a complicated, special-purpose encoding (Punycode, RFC 3492) that converts Unicode domain names into a compact ASCII representation. RFC 3492 is 17 years old, so surely this happens under the hood as a minor implementation detail in the OS? Alas, internationalization has been left to the application layer, so it needs to be handled manually.


Since Gemini allows UTF-8 encoded URLs, implementing RFC 3492 is virtually a requirement. Otherwise, one cannot make DNS lookups if the domain name contains non-ASCII characters.
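
To make the conversion concrete, here is a minimal Python sketch (not Lagrange's actual code). It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules on top of RFC 3492 Punycode; the function names and the example domain are made up for illustration.

```python
def host_to_ascii(host: str) -> str:
    """Convert a Unicode host name to the ASCII form that DNS can resolve."""
    # The "idna" codec Punycode-encodes each non-ASCII label and prefixes
    # it with "xn--"; labels that are already ASCII pass through unchanged.
    return host.encode("idna").decode("ascii")


def host_to_unicode(host: str) -> str:
    """Reverse the conversion, e.g. for showing the name to the user."""
    return host.encode("ascii").decode("idna")


print(host_to_ascii("bücher.example"))  # xn--bcher-kva.example
```

The ASCII form is what gets passed to the resolver; the Unicode form is only for people to read.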


As to the rest of the URL, the story is a bit simpler: normalization and escaping reserved characters. The former is needed because Unicode has multiple ways to represent the same character; "é", for example, can be a single code point or an "e" followed by a combining acute accent. Applications that deal with UTF-8 already need some sort of Unicode library to actually conform to the standard. Such a library should have routines for normalization, so that's one problem that's easy to deal with. (Lagrange uses GNU libunistring.) The other issue is handled by percent-encoding reserved characters, which is also straightforward.
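
As a small illustration of those two steps, here is a sketch using only the Python standard library (unicodedata and urllib.parse rather than libunistring). The function name is hypothetical, and the sketch assumes the input path contains no existing percent escapes.

```python
import unicodedata
from urllib.parse import quote


def normalize_and_escape_path(path: str) -> str:
    # NFC composes sequences such as "e" + U+0301 into the single
    # code point U+00E9, so equivalent spellings become identical.
    nfc = unicodedata.normalize("NFC", path)
    # quote() encodes the text as UTF-8 and percent-escapes every byte
    # that is not an unreserved ASCII character; "/" is kept as-is.
    return quote(nfc, safe="/")


decomposed = "/cafe\u0301"   # "e" followed by a combining acute accent
precomposed = "/caf\u00e9"   # single precomposed code point
print(normalize_and_escape_path(decomposed))   # /caf%C3%A9
print(normalize_and_escape_path(precomposed))  # /caf%C3%A9
```

Without the normalization step, the two spellings would produce different escaped paths even though they look the same on screen.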


All these encodings and translations should happen automatically and transparently.


Have some URLs with ❤️


Lagrange v0.13 embraces Unicode in both domain names and URL paths:


In the user interface, Unicode characters are shown wherever URLs are displayed: the URL bar, history, bookmark editor, etc.

blekksprut.net with CJK characters (screenshot)

You can disable URL decoding with a new setting in Preferences. This will show you all non-ASCII characters as percent-encoded UTF-8 (as was done in prior versions).

The full URL is NFC normalized before sending it to a server.

Domain names with non-ASCII characters are encoded to Punycode before doing a DNS lookup. The Punycode version of the domain name is sent to the server in the request URL, and also used for verifying the server certificate.

Paths are percent-encoded as usual before sending requests to a server.
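
The steps listed above can be sketched end to end as follows. This is an illustrative Python approximation, not how Lagrange itself is implemented: the function names are hypothetical, the host conversion relies on Python's IDNA 2003 codec, and userinfo and fragments are ignored for brevity.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit


def _rebuild(parts, host, path, query):
    # Reassemble scheme, host, optional port, path and optional query.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    url = f"{parts.scheme}://{netloc}{path}"
    return f"{url}?{query}" if query else url


def to_wire_url(url: str) -> str:
    """URL as typed by the user -> ASCII form sent to the server."""
    parts = urlsplit(unicodedata.normalize("NFC", url))  # NFC first
    host = parts.hostname.encode("idna").decode("ascii")  # Punycode host
    # "%" is kept safe so existing escapes are not escaped a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return _rebuild(parts, host, path, query)


def to_display_url(url: str) -> str:
    """Wire form -> Unicode form for the URL bar, history, and so on."""
    parts = urlsplit(url)
    host = parts.hostname.encode("ascii").decode("idna")
    return _rebuild(parts, host, unquote(parts.path), unquote(parts.query))


def request_line(url: str) -> bytes:
    # A Gemini request is the URL followed by CRLF.
    return to_wire_url(url).encode("ascii") + b"\r\n"


print(to_wire_url("gemini://bücher.example/cafe\u0301"))
# gemini://xn--bcher-kva.example/caf%C3%A9
```

Note that unquote() also decodes escaped reserved characters such as %2F, so a real client would need to be more selective when decoding for display.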


Text rendering


Speaking of Unicode, actually rendering it on screen is not straightforward at all. Lagrange uses custom text rendering routines that currently only support left-to-right text. A small number of special Unicode codepoints are recognized and handled (such as soft hyphens) but many are just ignored, for example variation selectors.


Version 0.13 has a bunch of improvements for text rendering:


There is a new monospace font (Iosevka) that has a more retro/terminal-like design and improved Unicode coverage compared to Fira Mono. It is also a bit more compact, allowing more content to fit horizontally.

When Emojis are used in monospace text, the spacing is relaxed a bit so wide Emojis don't overlap each other. The original spacing is restored after whitespace so text stays aligned.

Unavailable Emoji variants (e.g., color) fall back to the available ones. Currently Lagrange uses a monochrome Emoji font.

I made further tweaks to clean up box-drawing and other full-height characters. Previously, depending on text scaling, consecutive lines could overlap by one pixel or have a gap between them.
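
The Emoji variant fallback mentioned above could look roughly like this. It is a hypothetical Python sketch, not Lagrange's rendering code: the set of available glyph sequences stands in for whatever the font lookup really does, and only the two presentation selectors U+FE0E and U+FE0F are considered.

```python
# U+FE0E requests text presentation, U+FE0F requests emoji (color) presentation.
VARIATION_SELECTORS = {"\ufe0e", "\ufe0f"}


def resolve_glyphs(text: str, available: set) -> list:
    """Return the glyph keys to draw for text.

    available is a hypothetical set of sequences the font has glyphs for.
    """
    glyphs = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt in VARIATION_SELECTORS:
            # Use the requested variant if the font has it,
            # otherwise fall back to the base character.
            glyphs.append(ch + nxt if ch + nxt in available else ch)
            i += 2
        elif ch in VARIATION_SELECTORS:
            i += 1  # stray selector with no base character: ignore it
        else:
            glyphs.append(ch)
            i += 1
    return glyphs


# A monochrome font with only the plain heart still renders U+2764 U+FE0F.
print(resolve_glyphs("\u2764\ufe0f", {"\u2764"}) == ["\u2764"])  # True
```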




skyjake

📅 2020-12-13

🏷 Lagrange

CC-BY-SA 4.0


