gemini://gemini.conman.org/boston/2022/05/25.1

I've fallen into a rabbit hole of URI (Uniform Resource Identifier) encoding and decoding, so why not publish my results here, where I at least know I can look them up again? And who knows? Maybe someone else will find this useful.

There are two standards to contend with. The first, RFC-3986, is from the IETF (Internet Engineering Task Force) and is what most non-browser software that deals with URIs uses. The second is from the WHATWG (Web Hypertext Application Technology Working Group; I always read that as “What Working Group?” which gives away my opinion of this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).

> Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

> When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

> Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
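The rules quoted above can be tried out with a short Python sketch. This is only an illustration using the standard `urllib.parse` module; the sample strings and the `example.com` URL are made up for the demonstration.

```python
from urllib.parse import quote, unquote, urlsplit

# A literal percent sign used as data must travel as "%25".
encoded = quote("100%", safe="")
print(encoded)                      # 100%25

# Encode once, decode once. Decoding a second time is a bug: the "%41"
# that was plain data gets misread as the escape sequence for "A".
data = "%41"                        # three characters of data, not an escape
on_wire = quote(data, safe="")      # "%2541"
once = unquote(on_wire)             # "%41", the original data
twice = unquote(once)               # "A", which is wrong

# Split into components first and decode afterwards, so an encoded "/"
# (%2F) inside a path segment is not mistaken for a delimiter.
parts = urlsplit("http://example.com/a%2Fb/c?x=1")
segments = [unquote(s) for s in parts.path.split("/")[1:]]
print(segments)                     # ['a/b', 'c']
```

Had the path been decoded before splitting, it would have yielded three segments instead of two.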

But you do have to read the ABNF (Augmented Backus-Naur Form) carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow, as it describes in all-too-verbose English the algorithm for encoding and decoding a URI, but it does cover what to encode and what not to encode. Having gone through both standards and several other sources (links below), I've created the following table of which characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):

Table: URL percent-encoding chart (per RFC-3986)

    char   class       scheme  auth  path  query  fragment  note
    ------------------------------------------------------------
    SPACE              -       Y     Y     Y      Y
    !      sub-delim   -       m     m     m      m
    "                  -       Y     Y     Y      Y
    #      gen-delim   -       m     m     m      m         4
    $      sub-delim   -       m     m     m      m
    %      escape      -       Y     Y     Y      Y
    &      sub-delim   -       m     m     m      m
    '      sub-delim   -       m     m     m      m
    (      sub-delim   -       m     m     m      m
    )      sub-delim   -       m     m     m      m
    *      sub-delim   -       m     m     m      m
    +      sub-delim   N       m     m     m      m
    ,      sub-delim   -       m     m     m      m
    -      unreserved  N       N     N     N      N
    .      unreserved  N       N     N     N      N
    /      gen-delim   -       m     m     N      N
    0      unreserved  N       N     N     N      N
    1      unreserved  N       N     N     N      N
    2      unreserved  N       N     N     N      N
    3      unreserved  N       N     N     N      N
    4      unreserved  N       N     N     N      N
    5      unreserved  N       N     N     N      N
    6      unreserved  N       N     N     N      N
    7      unreserved  N       N     N     N      N
    8      unreserved  N       N     N     N      N
    9      unreserved  N       N     N     N      N
    :      gen-delim   -       m     N     N      N         2
    ;      sub-delim   -       m     m     m      m
    <                  -       Y     Y     Y      Y
    =      sub-delim   -       m     m     m      m
    >                  -       Y     Y     Y      Y
    ?      gen-delim   -       m     m     N      N
    @      gen-delim   -       m     N     N      N
    A      unreserved  N       N     N     N      N
    B      unreserved  N       N     N     N      N
    C      unreserved  N       N     N     N      N
    D      unreserved  N       N     N     N      N
    E      unreserved  N       N     N     N      N
    F      unreserved  N       N     N     N      N
    G      unreserved  N       N     N     N      N
    H      unreserved  N       N     N     N      N
    I      unreserved  N       N     N     N      N
    J      unreserved  N       N     N     N      N
    K      unreserved  N       N     N     N      N
    L      unreserved  N       N     N     N      N
    M      unreserved  N       N     N     N      N
    N      unreserved  N       N     N     N      N
    O      unreserved  N       N     N     N      N
    P      unreserved  N       N     N     N      N
    Q      unreserved  N       N     N     N      N
    R      unreserved  N       N     N     N      N
    S      unreserved  N       N     N     N      N
    T      unreserved  N       N     N     N      N
    U      unreserved  N       N     N     N      N
    V      unreserved  N       N     N     N      N
    W      unreserved  N       N     N     N      N
    X      unreserved  N       N     N     N      N
    Y      unreserved  N       N     N     N      N
    Z      unreserved  N       N     N     N      N
    [      gen-delim   -       m     m     m      m         2,3,4
    \                  -       Y     Y     Y      Y         1
    ]      gen-delim   -       m     m     m      m         2,3,4
    ^                  -       Y     Y     Y      Y         2,3,4
    _      unreserved  -       N     N     N      N
    `                  -       Y     Y     Y      Y         3
    a      unreserved  N       N     N     N      N
    b      unreserved  N       N     N     N      N
    c      unreserved  N       N     N     N      N
    d      unreserved  N       N     N     N      N
    e      unreserved  N       N     N     N      N
    f      unreserved  N       N     N     N      N
    g      unreserved  N       N     N     N      N
    h      unreserved  N       N     N     N      N
    i      unreserved  N       N     N     N      N
    j      unreserved  N       N     N     N      N
    k      unreserved  N       N     N     N      N
    l      unreserved  N       N     N     N      N
    m      unreserved  N       N     N     N      N
    n      unreserved  N       N     N     N      N
    o      unreserved  N       N     N     N      N
    p      unreserved  N       N     N     N      N
    q      unreserved  N       N     N     N      N
    r      unreserved  N       N     N     N      N
    s      unreserved  N       N     N     N      N
    t      unreserved  N       N     N     N      N
    u      unreserved  N       N     N     N      N
    v      unreserved  N       N     N     N      N
    w      unreserved  N       N     N     N      N
    x      unreserved  N       N     N     N      N
    y      unreserved  N       N     N     N      N
    z      unreserved  N       N     N     N      N
    {                  -       Y     Y     Y      Y         3,4
    |                  -       Y     Y     Y      Y         2
    }                  -       Y     Y     Y      Y         3,4
    ~      unreserved  -       N     N     N      N
    ------------------------------------------------------------
    char   class       scheme  auth  path  query  fragment  note

Table: Character classes as defined by RFC-3986

    unreserved  characters that never need to be encoded
    gen-delim   characters defined as general use delimiters
    sub-delim   characters defined as a potential delimiter for subcomponents in a URI
    escape      character defined to escape other characters
    (blank)     characters not otherwise defined, and thus must be escaped
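As a cross-check on the classes, here is a small Python sketch that sorts printable ASCII into them and lists what is left over. The set names and the `char_class` helper are my own illustrative names; the set contents follow the unreserved, gen-delims and sub-delims rules in the RFC-3986 ABNF.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
ESCAPE = {"%"}

def char_class(ch: str) -> str:
    """Return the class name the table uses for a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch in ESCAPE:
        return "escape"
    return ""  # not otherwise defined, so it must be escaped

# Printable ASCII (codes 32 to 126) that falls into no class at all.
undefined = [chr(c) for c in range(32, 127) if char_class(chr(c)) == ""]
print(len(undefined))   # 10
print(undefined)        # [' ', '"', '<', '>', '\\', '^', '`', '{', '|', '}']
```

The leftover list is the 10 characters you otherwise only find by reading the ABNF carefully.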

Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
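Put together, the table can be turned into a per-component encoder. What follows is a minimal sketch, not a definitive implementation: `encode_component`, `ALWAYS_SAFE` and the `keep` parameter are hypothetical names of mine, and encoding characters at 128 and above as UTF-8 bytes is an assumption (the table only says they must be escaped). The "m" characters are encoded unless the caller lists them in `keep`, meaning they are being used as delimiters rather than data.

```python
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

# Extra characters the table marks "N" (never encode) in each component.
ALWAYS_SAFE = {
    "auth": "",
    "path": ":@",
    "query": ":@/?",
    "fragment": ":@/?",
}

def encode_component(text: str, component: str, keep: str = "") -> str:
    """Percent-encode text for one URI component.

    keep lists the "m" characters that the caller is using as delimiters
    and that must therefore stay literal (hypothetical parameter).
    """
    safe = UNRESERVED | set(ALWAYS_SAFE[component]) | set(keep)
    out = []
    for byte in text.encode("utf-8"):   # assumption: UTF-8 for non-ASCII
        ch = chr(byte)
        if byte < 128 and ch in safe:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)

print(encode_component("a b/c", "path"))                # a%20b%2Fc
print(encode_component("a b/c", "query"))               # a%20b/c
print(encode_component("k=v&x=é", "query", keep="=&"))  # k=v&x=%C3%A9
```

The design choice here is to err on the side of encoding: anything the caller has not explicitly marked as a delimiter is treated as data.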

URI encoding

References