UTF-8 as Memory Unit


While half awake, I had a vision of a programming language—maybe an assembly language, or a forth—that uses UTF-8-style code points as its unit of memory.


What does this even mean? I'm not entirely sure myself, but, half-asleep, it felt fascinating.


For example, in a forth-like language, the stack might be a series of bytes. A byte isn't a lot of data, so you probably have to figure out how to deal with multiple bytes (or your language of choice has abstracted that away for you). Okay, but what if the stack was a series of UTF-8-style code points instead? Each element on the stack can now hold a lot more data, while still being one conceptual "item" on the stack. That's cool, right?
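
To make that concrete, here is a tiny Python sketch of my own (not a real forth, and the names are made up): a stack of raw bytes has to split a large value across several slots, while a stack of code-point-sized items keeps it as one element.

```python
# A stack of raw bytes: a value over 255 has to be split across slots,
# and the stack itself doesn't record where one value ends.
byte_stack = []
value = 100_000
byte_stack.extend(value.to_bytes(3, "big"))   # three items: [1, 134, 160]

# A stack of "code points": each slot holds one whole value, however
# large, and popping returns it as a single item.
codepoint_stack = []
codepoint_stack.append(value)                 # one item: [100000]

print(byte_stack)       # [1, 134, 160]
print(codepoint_stack)  # [100000]
```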


What is UTF-8?

You will get a better answer if you just look it up on Wikipedia, but here's some background on where I'm coming from.


A byte is considered eight bits these days: 00000000 to 11111111. That allows for 256 values, which isn't a whole lot.


Unicode has over a million code points to keep track of, so its designers needed a way to store larger numbers. The obvious solution is to just use more bytes for each code point. You could store every Unicode code point in three bytes. The downside is that, if every character is three bytes, you waste a lot of space on frequently used characters, and you break backwards compatibility with the ASCII character set. More importantly, if you naively use three bytes per character and read the data byte by byte, then getting out of sync with your data means you start turning the wrong three bytes into characters, and the result is nonsense.
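
Here's a quick Python illustration (my own, just to show the sync problem): encode some text at a fixed three bytes per character, then read it back one byte out of alignment.

```python
# Naive fixed-width encoding: three bytes per character.
data = b"".join(ord(c).to_bytes(3, "big") for c in "hello")

# In sync: reading three bytes at a time recovers the code points.
print([int.from_bytes(data[i:i + 3], "big") for i in range(0, len(data), 3)])
# [104, 101, 108, 108, 111]

# One byte out of sync: every three-byte group is now garbage,
# and nothing in the data tells the reader it has gone wrong.
print([int.from_bytes(data[i:i + 3], "big") for i in range(1, len(data) - 2, 3)])
# [26624, 25856, 27648, 27648]
```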


So UTF-8 is a clever solution that preserves ASCII characters, uses fewer bytes for some code points than for others, and clearly identifies where new code points start, so any data stream can stay in sync.


UTF-8 uses marker bits in each byte to indicate how the data is organized. The actual data is filled into the remaining bits. A code point in the range 0–127 is stored in a single byte. The highest bit is 0, and the remaining bits are the data.

1 byte: 0xxxxxxx


A code point in the range 128–2047 is stored in two bytes. The three highest bits in the first byte are 110, and the two highest bits in the second byte are 10.

2 bytes: 110xxxxx 10xxxxxx


Higher code points use more bytes. In each case, the first byte starts with a number of 1 bits equal to the total number of bytes in the code point. All remaining bytes start with the bits 10.

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
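
To make the marker bits concrete, here is a small Python sketch of my own that encodes a single code point by hand, following the patterns above (in practice you would of course just call chr(cp).encode("utf-8")):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    elif cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    else:                               # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])

assert utf8_encode(ord("é")) == "é".encode("utf-8")   # b'\xc3\xa9'
assert utf8_encode(ord("€")) == "€".encode("utf-8")   # b'\xe2\x82\xac'
```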


What's "UTF-8 style" then?

UTF-8 is narrowly defined for use with Unicode. It's invalid to encode data that isn't valid Unicode. There is also a "forbidden region" (the surrogate range reserved for UTF-16) that is not valid to encode in UTF-8. UTF-8 only needs four bytes to encode all Unicode code points, and doesn't define an encoding for any higher numbers.


In my proposed "UTF-8 style" language, we are encoding arbitrary binary data, not just Unicode. There is no reason to be bound by the domain-specific limits imposed by UTF-8. We can keep building longer code points by following the same pattern started in UTF-8: the first byte indicates how many bytes are used (and holds the high data bits), and the remaining bytes carry the rest of the data.

5 bytes: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

6 bytes: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

7 bytes: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

8 bytes: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
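
Since GTF-8 is just my name for a hypothetical format, here is a rough Python sketch of what a generic encoder and decoder might look like, following the patterns above. The function names, and the way the 8-byte case drops the trailing 0 marker, are my own choices, not anything standardized.

```python
# Data bits available at each encoded length, following the patterns above.
DATA_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31, 7: 36, 8: 42}

def gtf8_encode(value: int) -> bytes:
    """Encode a non-negative integer as a single GTF-8 'code point'."""
    for length, bits in DATA_BITS.items():
        if value < (1 << bits):
            break
    else:
        raise ValueError("value needs more than 42 bits")

    if length == 1:
        return bytes([value])                       # 0xxxxxxx

    # Lead byte: `length` one-bits, a zero (except in the 8-byte form),
    # then whatever high data bits fit.
    lead_data_bits = max(8 - length - 1, 0)
    lead_marker = (0xFF << (8 - length)) & 0xFF
    out = [lead_marker | (value >> (bits - lead_data_bits))]

    # Continuation bytes: 10xxxxxx, six data bits each.
    for shift in range(bits - lead_data_bits - 6, -1, -6):
        out.append(0b10000000 | ((value >> shift) & 0b111111))
    return bytes(out)

def gtf8_decode(blob: bytes) -> int:
    """Decode a single GTF-8 'code point' (length taken from len(blob))."""
    lead = blob[0]
    if lead < 0x80:
        return lead
    lead_data_bits = max(8 - len(blob) - 1, 0)
    value = lead & ((1 << lead_data_bits) - 1)
    for b in blob[1:]:
        value = (value << 6) | (b & 0b111111)
    return value

# Round-trip values of very different sizes.
for v in (42, 100_000, 2**41):
    assert gtf8_decode(gtf8_encode(v)) == v
print(gtf8_encode(2**41).hex())   # ffa0808080808080 (8 bytes, 42 data bits)
```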


In this variation on UTF-8, the eight-byte form carries seven continuation bytes of six data bits each, for 42 bits of data. That means a single conceptual unit can store over 4 trillion values (2^42 = 4,398,046,511,104).


It seems likely that someone else has extended UTF-8 in this way, and probably named it, but my half-hearted searches didn't turn up anything. For the rest of this article, I'll call it GTF-8 (as in "Generic Transformation Format", in contrast to UTF-8's "Unicode Transformation Format").


How, exactly?

So, the premise is that the language architecture is written to understand these GTF-8 data blobs. There is no other representation of a "byte" available to the user. Bit-shift operations, for example, only affect the data bits, not the structural marker bits.
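
I can only guess at the semantics, but here is one way to picture it (a Python sketch of my own, with truncation at 42 bits as an arbitrary choice): operations act on the decoded data bits, so the structural marker bits are never part of the value being operated on.

```python
GTF8_DATA_BITS = 42   # payload of the largest (8-byte) unit

def gtf8_shift_left(value: int, n: int) -> int:
    """Shift only the data bits. The marker bits aren't part of the
    value at all, so they can't be disturbed; bits pushed past 42
    simply fall off (one possible choice among several)."""
    return (value << n) & ((1 << GTF8_DATA_BITS) - 1)

print(gtf8_shift_left(0b1011, 3))   # 88: an ordinary shift
print(gtf8_shift_left(1, 42))       # 0: shifted out of the 42-bit payload
```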


Sadly, I don't have the skill or perseverance to write a proof of concept implementation. I don't have any idea how this would work in reality.


Why, again?

Computers store everything as binary data, you know? Whether a blob of bits is a number, or a letter, or a code instruction, or a bitmap—or something else entirely—depends on how the code interprets it. This is more apparent in low-level environments, like assembly languages or forth-like languages. There's something I love about this multi-facetedness of data.


English-speaking programmers (and those with compatible alphabets) can use ASCII encoding to store all of the letters they need within the basic memory element provided by most languages: the 8-bit byte. By making a GTF-8 code point the basic unit of memory, this same convenience is automatically extended to speakers of all the languages that Unicode supports (kind of).


And just in general, it increases the range of data that a "simple" container can store, which means that the same container can be used for many things more easily. Dates and timestamps, tiny graphics, all kinds of things can be stored in the 42 bits of a GTF-8 blob. Building on UTF-8 allows small data to remain small in memory, while allowing fairly large data to keep the same conceptual footprint.


I guess I don't really know how to communicate this, because these attempts at explanation don't seem to explain anything at all. Ah, well. That is often the way of dreams.


emptyhallway

2021-05-28


Follow-up 2021-08-08:

See also this project with different goals but a parallel result:

base65536

