URI encoding

I've fallen into a rabbit hole of URI (Uniform Resource Identifier) encoding and decoding, so why not publish my results here? That way I at least have a place where I know I can look them up again. And who knows? Maybe someone else will find this useful.


Anyway, there are two standards that define URIs:


1. RFC-3986: Uniform Resource Identifier (URI): Generic Syntax [1]

2. URL (Uniform Resource Locator): Living Standard [2]



The first is from the IETF (Internet Engineering Task Force) and is what most non-browser software that deals with URIs uses. The second is from the WHATWG (Web Hypertext Application Technology Working Group) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari). (I always read WHATWG as “What Working Group?”, which gives away my opinions on this group, truth be told.)


RFC-3986 is quite clear on when to encode and decode characters:


> Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

>

> When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

>

> Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

>

> “RFC-3986, section 2.4: When to Encode or Decode”

>
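The ordering rules in that section can be sketched in a few lines of Python. This is only an illustration, assuming the standard library's urllib.parse module; the host example.com and the sample segment are made up for the example.

```python
from urllib.parse import quote, unquote, urlsplit

# Producing the URI: encode each piece of data while assembling it.
# The "/" and "%" inside `segment` are data, not delimiters.
segment = "50%/50 split"
uri = "https://example.com/reports/" + quote(segment, safe="")

# Dereferencing: split on the delimiters first, then decode each part once.
path = urlsplit(uri).path
segments = [unquote(part) for part in path.split("/")]

# Decoding before splitting turns the encoded "/" into a delimiter.
wrong_segments = unquote(path).split("/")

# Decoding twice misreads a literal "%" as the start of an escape.
once = unquote("%2541")
twice = unquote(once)

print(uri)             # https://example.com/reports/50%25%2F50%20split
print(segments)        # ['', 'reports', '50%/50 split']
print(wrong_segments)  # ['', 'reports', '50%', '50 split']
print(once, twice)     # %41 A
```

Splitting first keeps the original three-part path intact, while decoding first invents a fourth segment.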


But you do have to read the ABNF (Augmented Backus-Naur Form) carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow, as it describes the algorithm for encoding and decoding a URI in all-too-verbose English, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):


Table: URL percent-encoding chart (per RFC-3986)
		scheme	auth	path	query	fragment	note
------------------------------
SPACE		-	Y	Y	Y	Y
!	sub-delim	-	m	m	m	m
"		-	Y	Y	Y	Y
#	gen-delim	-	m	m	m	m	4
$	sub-delim	-	m	m	m	m
%	escape	-	Y	Y	Y	Y
&	sub-delim	-	m	m	m	m
'	sub-delim	-	m	m	m	m
(	sub-delim	-	m	m	m	m
)	sub-delim	-	m	m	m	m
*	sub-delim	-	m	m	m	m
+	sub-delim	N	m	m	m	m
,	sub-delim	-	m	m	m	m
-	unreserved	N	N	N	N	N
.	unreserved	N	N	N	N	N
/	gen-delim	-	m	m	N	N
0	unreserved	N	N	N	N	N
1	unreserved	N	N	N	N	N
2	unreserved	N	N	N	N	N
3	unreserved	N	N	N	N	N
4	unreserved	N	N	N	N	N
5	unreserved	N	N	N	N	N
6	unreserved	N	N	N	N	N
7	unreserved	N	N	N	N	N
8	unreserved	N	N	N	N	N
9	unreserved	N	N	N	N	N
:	gen-delim	-	m	N	N	N	2
;	sub-delim	-	m	m	m	m
<		-	Y	Y	Y	Y
=	sub-delim	-	m	m	m	m
>		-	Y	Y	Y	Y
?	gen-delim	-	m	m	N	N
@	gen-delim	-	m	N	N	N
A	unreserved	N	N	N	N	N
B	unreserved	N	N	N	N	N
C	unreserved	N	N	N	N	N
D	unreserved	N	N	N	N	N
E	unreserved	N	N	N	N	N
F	unreserved	N	N	N	N	N
G	unreserved	N	N	N	N	N
H	unreserved	N	N	N	N	N
I	unreserved	N	N	N	N	N
J	unreserved	N	N	N	N	N
K	unreserved	N	N	N	N	N
L	unreserved	N	N	N	N	N
M	unreserved	N	N	N	N	N
N	unreserved	N	N	N	N	N
O	unreserved	N	N	N	N	N
P	unreserved	N	N	N	N	N
Q	unreserved	N	N	N	N	N
R	unreserved	N	N	N	N	N
S	unreserved	N	N	N	N	N
T	unreserved	N	N	N	N	N
U	unreserved	N	N	N	N	N
V	unreserved	N	N	N	N	N
W	unreserved	N	N	N	N	N
X	unreserved	N	N	N	N	N
Y	unreserved	N	N	N	N	N
Z	unreserved	N	N	N	N	N
[	gen-delim	-	m	m	m	m	2,3,4
\		-	Y	Y	Y	Y	1
]	gen-delim	-	m	m	m	m	2,3,4
^		-	Y	Y	Y	Y	2,3,4
_	unreserved	-	N	N	N	N
`		-	Y	Y	Y	Y	3
a	unreserved	N	N	N	N	N
b	unreserved	N	N	N	N	N
c	unreserved	N	N	N	N	N
d	unreserved	N	N	N	N	N
e	unreserved	N	N	N	N	N
f	unreserved	N	N	N	N	N
g	unreserved	N	N	N	N	N
h	unreserved	N	N	N	N	N
i	unreserved	N	N	N	N	N
j	unreserved	N	N	N	N	N
k	unreserved	N	N	N	N	N
l	unreserved	N	N	N	N	N
m	unreserved	N	N	N	N	N
n	unreserved	N	N	N	N	N
o	unreserved	N	N	N	N	N
p	unreserved	N	N	N	N	N
q	unreserved	N	N	N	N	N
r	unreserved	N	N	N	N	N
s	unreserved	N	N	N	N	N
t	unreserved	N	N	N	N	N
u	unreserved	N	N	N	N	N
v	unreserved	N	N	N	N	N
w	unreserved	N	N	N	N	N
x	unreserved	N	N	N	N	N
y	unreserved	N	N	N	N	N
z	unreserved	N	N	N	N	N
{		-	Y	Y	Y	Y	3,4
|		-	Y	Y	Y	Y	2
}		-	Y	Y	Y	Y	3,4
~	unreserved	-	N	N	N	N
------------------------------
		scheme	auth	path	query	fragment	note


1. WHATWG: “\” is treated as a “/” in path segment

2. WHATWG: character not encoded in path

3. WHATWG: character not encoded in query

4. WHATWG: character not encoded in fragment



Table: Encoding Key
Y	always encode
N	never encode
m	only encode when not used for their defined purpose (URI scheme dependent)
-	not allowed, even escaped
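
One way to turn the chart into code is sketched below. It is a conservative reading rather than a definitive implementation: it assumes the text being encoded is pure data, so every character marked "m" is encoded along with those marked "Y", and only the "N" characters are left alone. The names NEVER_ENCODE and encode_data are hypothetical. It relies on Python's urllib.parse.quote, which (since Python 3.7) already leaves the unreserved characters untouched.

```python
from urllib.parse import quote

# Characters marked "N" in each column, beyond the unreserved set,
# which quote() already leaves alone (letters, digits, "-", ".", "_", "~").
NEVER_ENCODE = {
    "auth": "",
    "path": ":@",
    "query": "/:?@",
    "fragment": "/:?@",
}

def encode_data(text: str, component: str) -> str:
    """Percent-encode text that is pure data for the given component."""
    return quote(text, safe=NEVER_ENCODE[component])

print(encode_data("a/b?c:d@e", "auth"))   # a%2Fb%3Fc%3Ad%40e
print(encode_data("a/b?c:d@e", "path"))   # a%2Fb%3Fc:d@e
print(encode_data("a/b?c:d@e", "query"))  # a/b?c:d@e
print(encode_data("x y&z", "query"))      # x%20y%26z
```

A scheme that uses some of the "m" characters as real delimiters (for example "&" and "=" in a query string) would encode each piece of data this way and then join the pieces with the literal delimiters.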


Table: Character classes as defined by RFC-3986
unreserved	characters that never need to be encoded
gen-delim	characters defined as general use delimiters
sub-delim	characters defined as a potential delimiter for subcomponents in a URI
escape	character defined to escape other characters
(blank)	characters not otherwise defined, which must therefore be escaped
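
As a cross-check of these classes, here is a small Python sketch that classifies every printable ASCII character using the sets from the RFC-3986 ABNF. The function name char_class is made up for this example. The characters that fall into no class are the ones the ABNF leaves unmentioned.

```python
import string

# Character sets as given in the RFC-3986 ABNF.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

def char_class(ch: str) -> str:
    """Return the class name used in the chart, or "" if the RFC defines none."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch == "%":
        return "escape"
    return ""

# Printable ASCII (codes 32 to 126) with no class at all.
undefined = [chr(code) for code in range(32, 127) if char_class(chr(code)) == ""]
print(len(undefined))  # 10
print(undefined)
```

The ten characters it finds are the space, the double quote, "<", ">", the backslash, "^", the backtick, "{", "|" and "}", which matches the rows with a blank class in the chart.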


Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
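
A minimal sketch of that last rule follows. It assumes the text is first converted to UTF-8 bytes, which is an assumption on my part since the rule above only talks about character codes. The function name encode_outside_table is hypothetical, and it deliberately leaves every character that is in the table alone.

```python
def encode_outside_table(text: str) -> str:
    """Percent-encode only the bytes the chart does not cover (0-31, 127 and up)."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 as the byte encoding
        if byte < 0x20 or byte >= 0x7F:
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)

print(encode_outside_table("na\u00efve\ttab"))  # na%C3%AFve%09tab
print(encode_outside_table("plain-text"))       # plain-text
```

The characters that are in the table still need the per-component treatment from the chart; this only handles the control characters, DEL, and everything outside ASCII.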


References


Uniform Resource Identifier Schemes [3]

URL Interop [4]

URL Specification [5]

A practical guide to URI encoding and URI decoding [6]

(Please) Stop Using Unsafe Characters in URLs [7]

Exploiting URL Parsers: The Good, Bad, And Inconsistent [8]



[1] https://www.ietf.org/rfc/rfc3986.txt

[2] https://url.spec.whatwg.org/

[3] https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

[4] https://github.com/bagder/docs/blob/master/URL-interop.md

[5] https://alwinb.github.io/url-specification/

[6] https://qqq.is/research/a-practical-guide-to-URI-encoding-and-URI-decoding

[7] https://perishablepress.com/stop-using-unsafe-characters-in-urls/

[8] https://claroty.com/wp-content/uploads/2022/01/Exploiting-URL-Parsing-Confusion.pdf



-- Page fetched on Fri Jun 14 01:25:56 2024