gemini://git.thebackupbox.net/gemini-spec/commit/20090e57b960729bdc37e8759d10d72d2d52bf0e

git.thebackupbox.net

gemini-spec

git://git.thebackupbox.net/gemini-spec

commit 20090e57b960729bdc37e8759d10d72d2d52bf0e
Author: Sean Conner <spc@conman.org>
Date:   Tue Mar 2 19:38:27 2021 -0500

    Start editing of the protocol specification.

    * added links to the IETF specifications being referenced.
    * removed the text/gemini section (section 5) as that will be its own
      document.

diff --git a/specification.gmi b/specification.gmi

index 6f0133c3ebd0bd5c2bd71106e635d9b6b0c8e18d..7a3ea136698888492868715152d3348d65df6e9c 100644

--- a/specification.gmi
+++ b/specification.gmi
@@ -2,7 +2,7 @@

 ## Speculative specification

-v0.14.3, November 29th 2020
+v0.15.0, March 2nd, 2021

 This is an increasingly less rough sketch of an actual spec for Project
 Gemini.  Although not finalised yet, further changes to the specification
@@ -19,6 +19,20 @@ keep notes.
 Feedback on any part of this is extremely welcome, please email
 solderpunk@posteo.net.

+This specification relies upon the following standards documents from the
+IETF:
+
+=> https://tools.ietf.org/rfc/bcp/bcp14.txt [BCP14] Key words for use in RFCs to Indicate Requirement Levels
+=> https://tools.ietf.org/rfc/rfc2045.txt   [RFC2045] Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies
+=> https://tools.ietf.org/rfc/rfc3987.txt   [RFC3987] Internationalized Resource Identifiers (IRIs)
+=> https://tools.ietf.org/rfc/std/std63.txt [STD63] UTF-8, a transformation format of ISO 10646
+=> https://tools.ietf.org/rfc/std/std66.txt [STD66] Uniform Resource Identifier (URI): Generic Syntax
+=> https://tools.ietf.org/rfc/std/std68.txt [STD68] Augmented BNF for Syntax Specifications: ABNF
+
+(TODO: The selection of [STD66] or [RFC3987] is still being decided.  Right
+now the specification is based on [STD66]; one of these two links will be
+removed prior to finalization.)
+
 # 1 Overview

 Gemini is a client-server protocol featuring request-response transactions,
@@ -45,27 +59,54 @@ S:   Sends response body (text or binary data) (see 3.3)
 S:   Closes connection
 C:   Handles response (see 3.4)

-## 1.2 Gemini URI scheme
+## 1.2 Gemini request and reply BNF
+
+This specification uses the Augmented Backus-Naur Form (ABNF) notation of
+[STD68], and uses syntax rules from the following sources: [STD68],
+[RFC2045], [STD63] and [STD66] (or [RFC3987]).  The syntax rules used are:
+
+* CRLF
+* DIGIT
+* SP
+* VCHAR
+* OCTET
+* WSP
+
+* type
+* subtype
+* parameter
+
+* UTF8-3
+* UTF8-4
+* UTF8-tail
+
+* absolute-URI
+* URI-reference
+
+## 1.3 Gemini URI scheme
+
+(NOTE: the use of "[STD66]/[RFC3987]" does not imply both are being used for
+this specification---one of them will be removed upon finalization of the
+document)

 Resources hosted via Gemini are identified using URIs with the scheme
 "gemini".  This scheme is syntactically compatible with the generic URI
-syntax defined in RFC 3986, but does not support all components of the
-generic syntax.  In particular, the authority component is allowed and
-required, but its userinfo subcomponent is NOT allowed.  The host
+syntax defined in [STD66]/[RFC3987], but does not support all components of
+the generic syntax.  In particular, the authority component is allowed and
+required, but its userinfo subcomponent MUST NOT be used.  The host
 subcomponent is required.  The port subcomponent is optional, with a default
 value of 1965.  The path, query and fragment components are allowed and have
 no special meanings beyond those defined by the generic syntax.  Spaces in
-gemini URIs should be encoded as %20, not +.
+gemini URIs should be encoded as '%20', not '+'.
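
The constraints in the paragraph above (authority and host required, userinfo forbidden, port defaulting to 1965) can be sketched as a small check.  This is only an illustrative sketch: the function name `check_gemini_url` is hypothetical and not part of the specification, and Python's `urllib.parse` is assumed to be an adequate stand-in for the generic URI syntax.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 1965  # default port given in the specification text


def check_gemini_url(url):
    """Return (host, port) for an acceptable gemini URL, else raise ValueError.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "gemini":
        raise ValueError("scheme must be 'gemini'")
    # The authority component and its host subcomponent are required.
    if not parts.hostname:
        raise ValueError("a host is required")
    # The userinfo subcomponent MUST NOT be used.
    if parts.username is not None or parts.password is not None:
        raise ValueError("userinfo must not be used")
    # The port is optional and defaults to 1965.
    port = parts.port if parts.port is not None else DEFAULT_PORT
    return parts.hostname, port
```

Path, query and fragment are deliberately left unchecked here, since the text gives them no special meaning beyond the generic syntax.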

 # 2 Gemini requests

-Gemini requests are a single CRLF-terminated line with the following
-structure:
+A Gemini request will conform to the following syntax:

-<URL><CR><LF>
-
-<URL> is a UTF-8 encoded absolute URL, including a scheme, of maximum length
-1024 bytes.
+```
+request = absolute-URI CRLF
+```
+The <absolute-URI> MUST be 1024 bytes or less.
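
As a rough illustration of the request rule above, a client might build the request line like this.  The name `build_request` is hypothetical, and encoding the URI as UTF-8 before measuring it is an assumption carried over from the wording this commit removes.

```python
MAX_URI_BYTES = 1024  # limit stated for <absolute-URI>


def build_request(uri):
    """Return the bytes of a request line: the URI followed by CRLF.

    Hypothetical helper for illustration only.
    """
    encoded = uri.encode("utf-8")  # assumption: the limit is measured in encoded bytes
    if len(encoded) > MAX_URI_BYTES:
        raise ValueError("URI exceeds 1024 bytes")
    # A bare CR or LF inside the URI would terminate the request early.
    if b"\r" in encoded or b"\n" in encoded:
        raise ValueError("URI must not contain CR or LF")
    return encoded + b"\r\n"
```

For example, `build_request("gemini://example.org/")` yields the bytes of that URI followed by CR and LF, ready to be written to the TLS connection.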

 Sending an absolute URL instead of only a path or selector is effectively
 equivalent to building in a HTTP "Host" header.  It permits virtual hosting
@@ -78,36 +119,68 @@ resources at their own domain(s).

 # 3 Gemini responses

-Gemini response consist of a single CRLF-terminated header line, optionally
-followed by a response body.
-
-## 3.1 Response headers
+A Gemini response will conform to the following syntax:

-Gemini response headers look like this:
-
-<STATUS><SPACE><META><CR><LF>
+```
+reply    = input / okay / redirect / tempfail / permfail / auth
+
+input    = '1' DIGIT SP prompt        CRLF
+okay     = '2' DIGIT SP mimetype      CRLF body
+redirect = '3' DIGIT SP URI-reference CRLF
+tempfail = '4' DIGIT SP errormsg      CRLF
+permfail = '5' DIGIT SP errormsg      CRLF
+auth     = '6' DIGIT SP errormsg      CRLF
+
+prompt   = 1*(UVCHAR / SP)
+mimetype = type '/' subtype *(';' parameter)
+errormsg = 1*(UVCHAR / SP)
+body     = *OCTET
+
+UVCHAR   = VCHAR / UTF8-2v / UTF8-3 / UTF8-4
+UTF8-2v  = %xC2 %xA0-BF          ; no C1 control set
+         / %xC3-DF UTF8-tail
+```

-<STATUS> is a two-digit numeric status code, as described below in 3.2 and
-in Appendix 1.
+## 3.1 Response headers

-<SPACE> is a single space character, i.e. the byte 0x20.
+Generally speaking, the generic response syntax looks like:

-<META> is a UTF-8 encoded string of maximum length 1024 bytes, whose meaning
-is <STATUS> dependent.
+```
+response = DIGIT DIGIT SP 1*(UVCHAR / SP) CRLF
+```

-<STATUS> and <META> are separated by a single space character.
+That is: a two-digit response code, a space, then text (excluding any
+control characters), followed by the control codes CR and LF.  Only codes
+between 10 and 69 inclusive are used as status codes; any two-digit value
+outside this range MUST be rejected by the client and an error MUST be
+displayed to the user.  The meaning of the text after the status code
+depends upon the status code.  Only responses with codes in the range 20-29
+(SUCCESS) include additional data sent to the client.
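
The generic header rule described above could be checked on the client side roughly as follows.  This is a sketch under stated assumptions, not the specification's own algorithm: `parse_header` is a hypothetical name, and "control characters" is taken to mean the C0 set, DEL and the C1 set, which is how the UVCHAR rule reads.

```python
def parse_header(line):
    """Split a raw header line (bytes) into (status, text).

    Hypothetical helper for illustration; raises ValueError when malformed.
    """
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    head = line[:-2]
    # bytes.isdigit() accepts ASCII digits only, matching the DIGIT rule.
    # At least one character of text must follow the space.
    if len(head) < 4 or not head[:2].isdigit() or head[2:3] != b" ":
        raise ValueError("expected two digits, a space, then text")
    status = int(head[:2])
    if not 10 <= status <= 69:
        raise ValueError("status code outside 10-69")
    text = head[3:].decode("utf-8")  # UnicodeDecodeError is a ValueError
    for ch in text:
        cp = ord(ch)
        # Reject C0 controls, DEL and the C1 control set.
        if cp < 0x20 or 0x7F <= cp <= 0x9F:
            raise ValueError("control characters are not allowed")
    return status, text
```

A caller would catch `ValueError`, close the connection and report the error to the user; only when the returned status is in 20-29 should it go on to read a response body.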

-If <STATUS> does not belong to the "SUCCESS" range of codes, then the server
-MUST close the connection after sending the header and MUST NOT send a
-response body.
+XXX - further edits warranted
+XXX If <STATUS> does not belong to the "SUCCESS" range of codes, then the
+server MUST close the connection after sending the header and MUST NOT send
+a response body.

-If a server sends a <STATUS> which is not a two-digit number or a <META>
+XXX If a server sends a <STATUS> which is not a two-digit number or a <META>
 which exceeds 1024 bytes in length, the client SHOULD close the connection
 and disregard the response header, informing the user of an error.

+(NOTE: define the <META> portion better, and pick a better name)
+
 ## 3.2 Status codes

-Gemini uses two-digit numeric status codes.  Related status codes share the same first digit.  Importantly, the first digit of Gemini status codes do not group codes into vague categories like "client error" and "server error" as per HTTP.  Instead, the first digit alone provides enough information for a client to determine how to handle the response.  By design, it is possible to write a simple but feature complete client which only looks at the first digit.  The second digit provides more fine-grained information, for unambiguous server logging, to allow writing comfier interactive clients which provide a slightly more streamlined user interface, and to allow writing more robust and intelligent automated clients like content aggregators, search engine crawlers, etc.
+Gemini uses two-digit numeric status codes.  Related status codes share the
+same first digit.  Importantly, the first digit of Gemini status codes does
+not group codes into vague categories like "client error" and "server error"
+as per HTTP.  Instead, the first digit alone provides enough information for
+a client to determine how to handle the response.  By design, it is possible
+to write a simple but feature complete client which only looks at the first
+digit.  The second digit provides more fine-grained information, for
+unambiguous server logging, to allow writing comfier interactive clients
+which provide a slightly more streamlined user interface, and to allow
+writing more robust and intelligent automated clients like content
+aggregators, search engine crawlers, etc.

 The first digit of a response code unambiguously places the response into
 one of six categories, which define the semantics of the <META> line.
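
To illustrate the claim that a simple client only needs the first digit, here is a minimal sketch of such a dispatch.  The category labels reuse the ABNF rule names introduced in this commit (input, okay, redirect, tempfail, permfail, auth); the function name `category` is hypothetical.

```python
# First digit of the status code -> category, labelled with the ABNF rule names.
CATEGORIES = {
    1: "input",
    2: "okay",
    3: "redirect",
    4: "tempfail",
    5: "permfail",
    6: "auth",
}


def category(status):
    """Return the category for a two-digit status code, using its first digit only."""
    try:
        return CATEGORIES[status // 10]
    except KeyError:
        raise ValueError("status code outside 10-69") from None
```

A client built this way treats 51 and 52 identically (both "permfail"); the second digit only matters to clients that want the finer distinctions.
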
@@ -120,9 +193,9 @@ The requested resource accepts a line of textual user input.  The <META>
 line is a prompt which should be displayed to the user.  The same resource
 should then be requested again with the user's input included as a query
 component.  Queries are included in requests as per the usual generic URL
-definition in RFC3986, i.e.  separated from the path by a ?.  Reserved
-characters used in the user's input must be "percent-encoded" as per
-RFC3986, and space characters should also be percent-encoded.
+definition in [STD66]/[RFC3987], i.e.  separated from the path by a "?".
+Reserved characters used in the user's input must be "percent-encoded" as
+per [STD66]/[RFC3987]; clients MUST (SHOULD?) encode spaces as '%20'.
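
A sketch of how a client might build the follow-up request after a 1x response: the user's input is percent-encoded and attached as the query component.  The helper name `input_url` is hypothetical, and encoding every reserved character (rather than a chosen subset) is an assumption made for simplicity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def input_url(resource_url, user_input):
    """Return resource_url with user_input as its percent-encoded query.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(resource_url)
    # safe="" percent-encodes every reserved character; quote() writes a
    # space as %20 (unlike quote_plus(), which would write '+').
    query = quote(user_input, safe="")
    return urlunsplit(parts._replace(query=query, fragment=""))
```

For instance, the input `hello world` for `gemini://example.org/search` gives `gemini://example.org/search?hello%20world`.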

 ### 3.2.2 2x (SUCCESS)

@@ -148,7 +221,7 @@ automatically updating bookmarks.  There is no response body.
 Status codes beginning with 4 are TEMPORARY FAILURE status codes, meaning:

 The request has failed.  There is no response body.  The nature of the
-failure is temporary, i.e.  an identical request MAY succeed in the future.
+failure is temporary, i.e.  an identical request MAY succeed in the future.
 The contents of <META> may provide additional information on the failure,
 and should be displayed to human users.

@@ -268,7 +341,7 @@ TOFU stands for "Trust On First Use" and is public-key security model
 similar to that used by OpenSSH.  The first time a Gemini client connects to
 a server, it accepts whatever certificate it is presented.  That
 certificate's fingerprint and expiry date are saved in a persistent database
-(like the .known_hosts file for SSH), associated with the server's hostname.
+(like the .known_hosts file for SSH), associated with the server's hostname.
 On all subsequent connections to that hostname, the received certificate's
 fingerprint is computed and compared to the one in the database.  If the
 certificate is not the one previously received, but the previous
@@ -295,25 +368,25 @@ simple notion of client identity with several applications:
   voluntarily by the client, and once the client deletes a certificate and
   its matching key, the server cannot possibly "resurrect" the same value
   later (unlike so-called "super cookies").
-
+
 * Long-lived client certificates can reliably identify a user to a
   multi-user application without the need for passwords which may be
   brute-forced.  Even a stolen database table mapping certificate hashes to
   user identities is not a security risk, as rainbow tables for certificates
   are not feasible.
-
+
 * Self-hosted, single-user applications can be easily and reliably secured
   in a manner familiar from OpenSSH: the user generates a self-signed
   certificate and adds its hash to a server-side list of permitted
   certificates, analogous to the .authorized_keys file for SSH).
-
+
 Gemini requests will typically be made without a client certificate.  If a
 requested resource requires a client certificate and one is not included in
 a request, the server can respond with a status code of 60, 61 or 62 (see
 Appendix 1 below for a description of all status codes related to client
 certificates).  A client certificate which is generated or loaded in
 response to such a status code has its scope bound to the same hostname as
-the request URL and to all paths below the path of the request URL path.
+the request URL and to all paths below the path of the request URL path.
 E.g.  if a request for gemini://example.com/foo returns status 60 and the
 user chooses to generate a new client certificate in response to this, that
 same certificate should be used for subsequent requests to
@@ -324,266 +397,6 @@ clients for human users are strongly recommended to make such actions easy
 and to generally give users full control over the use of client
 certificates.

-# 5 The text/gemini media type
-
-## 5.1 Overview
-
-In the same sense that HTML is the "native" response format of HTTP and
-plain text is the native response format of gopher, Gemini defines its own
-native response format - though of course, thanks to the inclusion of a MIME
-type in the response header Gemini can be used to serve plain text, rich
-text, HTML, Markdown, LaTeX, etc.
-
-Response bodies of type "text/gemini" are a kind of lightweight hypertext
-format, which takes inspiration from gophermaps and from Markdown.  The
-format permits richer typographic possibilities than the plain text of
-Gopher, but remains extremely easy to parse.  The format is line-oriented,
-and a satisfactory rendering can be achieved with a single pass of a
-document, processing each line independently.  As per gopher, links can only
-be displayed one per line, encouraging neat, list-like structure.
-
-Similar to how the two-digit Gemini status codes were designed so that
-simple clients can function correctly while ignoring the second digit, the
-text/gemini format has been designed so that simple clients can ignore the
-more advanced features and still remain very usable.
-
-## 5.2 Parameters
-
-As a subtype of the top-level media type "text", "text/gemini" inherits the
-"charset" parameter defined in RFC 2046.  However, as noted in 3.3, the
-default value of "charset" is "UTF-8" for "text" content transferred via
-Gemini.
-
-A single additional parameter specific to the "text/gemini" subtype is
-defined: the "lang" parameter.  The value of "lang" denotes the natural
-language or language(s) in which the textual content of a "text/gemini"
-document is written.  The presence of the "lang" parameter is optional.
-When the "lang" parameter is present, its interpretation is defined entirely
-by the client.  For example, clients which use text-to-speech technology to
-make Gemini content accessible to visually impaired users may use the value
-of "lang" to improve pronunciation of content.  Clients which render text to
-a screen may use the value of "lang" to determine whether text should be
-displayed left-to-right or right-to-left.  Simple clients for users who only
-read languages written left-to-right may simply ignore the value of "lang".
-When the "lang" parameter is not present, no default value should be assumed
-and clients which require some notion of a language in order to process the
-content (such as text-to-speech screen readers) should rely on user-input to
-determine how to proceed in the absence of a "lang" parameter.
-
-Valid values for the "lang" parameter are comma-separated lists of one or
-more language tags as defined in RFC4646.  For example:
-
-* "text/gemini; lang=en" Denotes a text/gemini document written in English
-
-* "text/gemini; lang=fr" Denotes a text/gemini document written in French
-
-* "text/gemini; lang=en,fr" Denotes a text/gemini document written in a
-  mixture of English and French
-
-* "text/gemini; lang=de-CH" Denotes a text/gemini document written in Swiss
-  German
-
-* "text/gemini; lang=sr-Cyrl" Denotes a text/gemini document written in
-  Serbian using the Cyrllic script
-
-* "text/gemini; lang=zh-Hans-CN" Denotes a text/gemini document written in
-  Chinese using the Simplified script as used in mainland China
-
-## 5.3 Line-orientation
-
-As mentioned, the text/gemini format is line-oriented.  Each line of a
-text/gemini document has a single "line type".  It is possible to
-unambiguously determine a line's type purely by inspecting its first three
-characters.  A line's type determines the manner in which it should be
-presented to the user.  Any details of presentation or rendering associated
-with a particular line type are strictly limited in scope to that individual
-line.
-
-There are 7 different line types in total.  However, a fully functional and
-specification compliant Gemini client need only recognise and handle 4 of
-them - these are the "core line types", (see 5.4).  Advanced clients can
-also handle the additional "advanced line types" (see 5.5).  Simple clients
-can treat all advanced line types as equivalent to one of the core line
-types and still offer an adequate user experience.
-
-## 5.4 Core line types
-
-The four core line types are:
-
-### 5.4.1 Text lines
-
-Text lines are the most fundamental line type - any line which does not
-match the definition of another line type defined below defaults to being a
-text line.  The majority of lines in a typical text/gemini document will be
-text lines.
-
-Text lines should be presented to the user, after being wrapped to the
-appropriate width for the client's viewport (see below).  Text lines may be
-presented to the user in a visually pleasing manner for general reading, the
-precise meaning of which is at the client's discretion.  For example,
-variable width fonts may be used, spacing may be normalised, with spaces
-between sentences being made wider than spacing between words, and other
-such typographical niceties may be applied.  Clients may permit users to
-customise the appearance of text lines by altering the font, font size, text
-and background colour, etc.  Authors should not expect to exercise any
-control over the precise rendering of their text lines, only of their actual
-textual content.  Content such as ASCII art, computer source code, etc.
-which may appear incorrectly when treated as such should be enclosed between
-preformatting toggle lines (see 5.4.3).
-
-Blank lines are instances of text lines and have no special meaning.  They
-should be rendered individually as vertical blank space each time they
-occur.  In this way  they are analogous to <br/> tags in HTML.  Consecutive
-blank lines should NOT be collapsed into a fewer blank lines.  Note also
-that consecutive non-blank text lines do not form any kind of coherent unit
-or block such as a "paragraph": all text lines are independent entities.
-
-Text lines which are longer than can fit on a client's display device SHOULD
-be "wrapped" to fit, i.e.  long lines should be split (ideally at whitespace
-or at hyphens) into multiple consecutive lines of a device-appropriate
-width.  This wrapping is applied to each line of text independently.
-Multiple consecutive lines which are shorter than the client's display
-device MUST NOT be combined into fewer, longer lines.
-
-In order to take full advantage of this method of text formatting, authors
-of text/gemini content SHOULD avoid hard-wrapping to a specific fixed width,
-in contrast to the convention in Gopherspace where text is typically wrapped
-at 80 characters or fewer.  Instead, text which should be displayed as a
-contiguous block should be written as a single long line.  Most text editors
-can be configured to "soft-wrap", i.e.  to write this kind of file while
-displaying the long lines wrapped at word boundaries to fit the author's
-display device.
-
-Authors who insist on hard-wrapping their content MUST be aware that the
-content will display neatly on clients whose display device is as wide as
-the hard-wrapped length or wider, but will appear with irregular line widths
-on narrower clients.
-
-### 5.4.2 Link lines
-
-Lines beginning with the two characters "=>" are link lines, which have the
-following syntax:
-
-```
-=>[<whitespace>]<URL>[<whitespace><USER-FRIENDLY LINK NAME>]
-```
-
-where:
-
-* <whitespace> is any non-zero number of consecutive spaces or tabs
-* Square brackets indicate that the enclosed content is optional.
-* <URL> is a URL, which may be absolute or relative.
-
-All the following examples are valid link lines:
-
-```
-=> gemini://example.org/
-=> gemini://example.org/ An example link
-=> gemini://example.org/foo	Another example link at the same host
-=> foo/bar/baz.txt	A relative link
-=> 	gopher://example.org:70/1 A gopher link
-```
-
-URLs in link lines must have reserved characters and spaces percent-encoded
-as per RFC 3986.
-
-Note that link URLs may have schemes other than gemini.  This means that
-Gemini documents can simply and elegantly link to documents hosted via other
-protocols, unlike gophermaps which can only link to non-gopher content via a
-non-standard adaptation of the `h` item-type.
-
-Clients can present links to users in whatever fashion the client author
-wishes, however clients MUST NOT automatically make any network connections
-as part of displaying links whose scheme corresponds to a network protocol
-(e.g.  links beginning with gemini://, gopher://, https://, ftp:// , etc.).
-
-### 5.4.3 Preformatting toggle lines
-
-Any line whose first three characters are "```" (i.e.  three consecutive
-back ticks with no leading whitespace) are preformatted toggle lines.  These
-lines should NOT be included in the rendered output shown to the user.
-Instead, these lines toggle the parser between preformatted mode being "on"
-or "off".  Preformatted mode should be "off" at the beginning of a document.
-The current status of preformatted mode is the only internal state a parser
-is required to maintain.  When preformatted mode is "on", the usual rules
-for identifying line types are suspended, and all lines should be identified
-as preformatted text lines (see 5.4.4).
-
-Preformatting toggle lines can be thought of as analogous to <pre> and
-</pre> tags in HTML.
-
-Any text following the leading "```" of a preformat toggle line which
-toggles preformatted mode on MAY be interpreted by the client as "alt text"
-pertaining to the preformatted text lines which follow the toggle line.  Use
-of alt text is at the client's discretion, and simple clients may ignore it.
-Alt text is recommended for ASCII art or similar non-textual content which,
-for example, cannot be meaningfully understood when rendered through a
-screen reader or usefully indexed by a search engine.  Alt text may also be
-used for computer source code to identify the programming language which
-advanced clients may use for syntax highlighting.
-
-Any text following the leading "```" of a preformat toggle line which
-toggles preformatted mode off MUST be ignored by clients.
-
-### 5.4.4 Preformatted text lines
-
-Preformatted text lines should be presented to the user in a "neutral",
-monowidth font without any alteration to whitespace or stylistic
-enhancements.  Graphical clients should use scrolling mechanisms to present
-preformatted text lines which are longer than the client viewport, in
-preference to wrapping.  In displaying preformatted text lines, clients
-should keep in mind applications like ASCII art and computer source code: in
-particular, source code in languages with significant whitespace (e.g.
-Python) should be able to be copied and pasted from the client into a file
-and interpreted/compiled without any problems arising from the client's
-manner of displaying them.
-
-## 5.5 Advanced line types
-
-The following advanced line types MAY be recognised by advanced clients.
-Simple clients may treat them all as text lines as per 5.4.1 without any
-loss of essential function.
-
-### 5.5.1 Heading lines
-
-Lines beginning with "#" are heading lines.  Heading lines consist of one,
-two or three consecutive "#" characters, followed by optional whitespace,
-followed by heading text.  The number of # characters indicates the "level"
-of header;  #, ## and ### can be thought of as analogous to <h1>, <h2> and
-<h3> in HTML.
-
-Heading text should be presented to the user, and clients MAY use special
-formatting, e.g.  a larger or bold font, to indicate its status as a header
-(simple clients may simply print the line, including its leading #s, without
-any styling at all).  However, the main motivation for the definition of
-heading lines is not stylistic but to provide a machine-readable
-representation of the internal structure of the document.  Advanced clients
-can use this information to, e.g.  display an automatically generated and
-hierarchically formatted "table of contents" for a long document in a
-side-pane, allowing users to easily jump to specific sections without
-excessive scrolling.  CMS-style tools automatically generating menus or
-Atom/RSS feeds for a directory of text/gemini files can use first heading in
-the file as a human-friendly title.
-
-### 5.5.2 Unordered list items
-
-Lines beginning with "* " are unordered list items.  This line type exists
-purely for stylistic reasons.  The * may be replaced in advanced clients by
-a bullet symbol.  Any text after the "* " should be presented to the user as
-if it were a text line, i.e.  wrapped to fit the viewport and formatted
-"nicely".  Advanced clients can take the space of the bullet symbol into
-account when wrapping long list items to ensure that all lines of text
-corresponding to the item are offset an equal distance from the left of the
-screen.
-
-### 5.5.3 Quote lines
-
-Lines beginning with ">" are quote lines.  This line type exists so that
-advanced clients may use distinct styling to convey to readers the important
-semantic information that certain text is being quoted from an external
-source.  For example, when wrapping long lines to the the viewport, each
-resultant line may have a ">" symbol placed at the front.
-
 # Appendix 1. Full two digit status codes

 ## 10 INPUT
@@ -592,7 +405,7 @@ As per definition of single-digit code 1 in 3.2.

 ## 11 SENSITIVE INPUT

-As per status code 10, but for use with sensitive input such as passwords.
+As per status code 10, but for use with sensitive input such as passwords.
 Clients should present the prompt as per status code 10, but the user's
 input should not be echoed to the screen to prevent it being read by
 "shoulder surfers".
@@ -648,7 +461,7 @@ As per definition of single-digit code 5 in 3.2.
 ## 51 NOT FOUND

 The requested resource could not be found but may be available in the
-future.  (cf HTTP 404) (struggling to remember this important status code?
+future.  (cf HTTP 404) (struggling to remember this important status code?
 Easy: you can't find things hidden at Area 51!)

 ## 52 GONE
@@ -681,7 +494,7 @@ itself, which may be authorised for other resources.

 ## 62 CERTIFICATE NOT VALID

-The supplied client certificate was not accepted because it is not valid.
+The supplied client certificate was not accepted because it is not valid.
 This indicates a problem with the certificate in and of itself, with no
 consideration of the particular requested resource.  The most likely cause
 is that the certificate's validity start date is in the future or its expiry


-- Page fetched on Sun Jun 2 16:52:51 2024