Unidecode in a font


Is there such a thing as "reverse ligatures"? That is, can one character be rendered by multiple glyphs, rather than combining multiple characters into a single glyph? I'm not sure how to research this. My searches haven't been helpful.


The motivation for this question is Unidecode. Basically, could Unidecode be embedded within any arbitrary font that supports ascii characters?


Unidecode


Back in 2001, before unicode support was widespread, Sean M. Burke released Unidecode. It was basically a substitution table that reduced many (ideally any) unicode characters to 7-bit ascii representations.


The result is far from perfect. The article linked above describes some of the compromises and limitations of Unidecode. But the goal wasn't a perfect representation, just a "good enough" representation that a fluent speaker of the language could probably manage to make sense of.


Although it's better to simply provide broad unicode support, I agree with its author that reducing to ascii is still somewhat better than displaying "?????" or raw character codes, which were common alternative ways of handling unsupported characters.


Of course, Unidecode is implemented as a library of code that has to process and convert the text before rendering it. But my question is whether you need an external library at all. Can we just embed the logic of Unidecode within the font itself by using the ligature feature?


As I understand from reading the article, Unidecode is simply a substitution table that converts one non-ascii character into zero or more ascii characters:

"\x{0788}" -> "v"

"\x{0799}" -> "m"

"\x{5317}" -> "Bei"

"\x{4EB0}" -> "Jing"


As I understand it, the substitution process that Unidecode is built around is not so different from what a ligature does. To support a ligature, the font stores a special glyph for (for example) "ffi", and when it sees those three characters, it replaces them with its special glyph instead of rendering "f", "f" and "i" separately. It could just as easily substitute in the other direction, replacing one glyph with a sequence of glyphs, right?


So, can we map, for example, the unicode character at hex code point 5317 to be rendered with the existing "B", "e" and "i" glyphs? Or do we have to actually create a distinct, self-contained "Bei" glyph if we want to use it in a ligature-style substitution? I haven't looked hard enough to get an answer to this question.
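

For what it's worth, OpenType's GSUB table seems to support both directions: the familiar ligature lookup (several glyphs rendered by one) and a "multiple substitution" lookup (one glyph rendered by a sequence). Here is a rough sketch of both kinds of rule in feature syntax, compiled into a font with the fontTools library. The font path is a placeholder, the glyph names ("f_f_i", "uni5317", etc.) are only illustrative, and every glyph named in a rule has to already exist in the font, so the non-ascii character still needs at least a placeholder glyph of its own.

```
from fontTools.ttLib import TTFont
from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

# AFDKO feature syntax: "liga" holds the classic many-to-one ligature,
# "ccmp" holds a one-to-many ("multiple") substitution.
FEATURES = """
feature liga {
    sub f f i by f_f_i;    # three glyphs rendered by one ligature glyph
} liga;

feature ccmp {
    sub uni5317 by B e i;  # one glyph rendered by three existing glyphs
} ccmp;
"""

font = TTFont("SomeFont.ttf")                  # placeholder path
addOpenTypeFeaturesFromString(font, FEATURES)  # builds the GSUB lookups
font.save("SomeFont-with-fallbacks.ttf")
```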


The goal of this exercise would be to have a program that can apply Unidecode to any font that at least supports the 7-bit ascii range. It would simply scan the font and, for each unsupported character, introduce the appropriate ligature-style mapping back to the ascii range. Perhaps the Unidecode representation would carry a small furigana-style annotation showing the original code point, so that someone who is interested can unambiguously identify the original characters.
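

Here is a rough sketch of that scan in Python, using fontTools to read the font's character map and the unidecode package to compute the ascii fallback. The font path and the code point range are placeholders, and actually writing the substitution rules into the font is left out:

```
from fontTools.ttLib import TTFont
from unidecode import unidecode

font = TTFont("SomeFont.ttf")            # placeholder path
covered = set(font.getBestCmap())        # code points the font already maps to glyphs

for codepoint in range(0x00A0, 0x3000):  # arbitrary slice of the BMP, for illustration
    if codepoint in covered:
        continue
    fallback = unidecode(chr(codepoint))
    if fallback.strip():
        # A real tool would add the one-to-many substitution here,
        # mapping this code point back to the existing ascii glyphs.
        print(f"U+{codepoint:04X} -> {fallback!r}")
```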


In this way, any font could have a fallback for displaying unsupported characters. Of course, this assumes that Unidecode's coverage is broader than the unicode coverage otherwise available. That may have been true in 2001, but I don't know how accurate it is today. Certainly there are still many individual fonts that implement only a relatively small range of the unicode space. But modern operating systems seem to be good at falling back to a different font for any unknown character, so as long as the system as a whole has broad unicode support, individual fonts don't have to support everything.


emptyhallway

2022-08-19

