EOL Story

This has come up a few times on the tilde #gemini IRC channel: surprise that CR LF must be used at the end of a line. CR LF line endings are a rather old convention, as far as the internet goes.

> During the early ARPAnet research days (~1970-1972), this end-of-line diversity among operating systems made network communication between diverse host systems difficult. After some discussion (recorded in early RFCs), the researchers adopted a single convention:

> ASCII text transmitted across the network *must* use the two-character sequence: CR LF.

https://www.rfc-editor.org/old/EOLstory.txt

Therefore if you are doing something like

    fmt.Println("20 text/plain")

you may actually instead need

    fmt.Println("20 text/plain\r")

to generate a correct response header.

    00000000: 3230 2074 6578 742f 706c 6169 6e0d 0a    20 text/plain..

However! There can be complications. Check very carefully what Println (or its equivalent) actually does, as software might (helpfully) generate CR LF for you on DOS. In that case the code above would emit an incorrect CR CR LF on DOS while appearing to be correct on unix. Tests might help surface such portability problems. At the very least, run the output through a hex dumper to confirm what is going on.
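One way to sidestep the question of what Println does is to never ask it for a newline at all, and to write the CR LF explicitly. The following is a minimal sketch along those lines, not a definitive implementation; the names header and endsWithSingleCRLF are made up for illustration, and the check is the sort of thing a test could run on every platform.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"os"
)

// header (hypothetical helper) builds a response header line that
// always ends in an explicit CR LF, so the result does not depend on
// any newline the print function might add.
func header(status int, meta string) []byte {
	return []byte(fmt.Sprintf("%d %s\r\n", status, meta))
}

// endsWithSingleCRLF (hypothetical helper) reports whether b ends in
// exactly one CR LF. It rejects a bare LF and a doubled CR (CR CR LF).
func endsWithSingleCRLF(b []byte) bool {
	if !bytes.HasSuffix(b, []byte("\r\n")) {
		return false
	}
	return !bytes.HasSuffix(b, []byte("\r\r\n"))
}

func main() {
	h := header(20, "text/plain")
	// Hex dump to stderr to confirm what is going on.
	fmt.Fprint(os.Stderr, hex.Dump(h))
	// Write the raw bytes; no print function gets to add a newline.
	os.Stdout.Write(h)
}
```

Writing the bytes with os.Stdout.Write keeps the formatting layer out of the picture, though anything between the program and the wire could still translate newlines, so the hex dump remains worth a look.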

DOS used CR LF from the get-go. Unix has always used LF. Mac OS (the olden one, before the unix reboot) used CR. Unicode has a bunch of things including U+2028 LINE SEPARATOR that one probably will not see much of in the wild.

Mostly this is transparent to the user, except when it is not.

Terminal Woes

The terminal has several flags that change how CR and LF are handled; see termios(4) for details. Linux puts that page in section 3, as termios(3).

    $ man termios | egrep 'CR|LF'
    DESCRIPTION
      CR      Special character on input and is recognized if the ICANON flag
              ICANON and ICRNL are set and IGNCR is not set, this character is
      The NL and CR characters cannot be changed.  The values for all the
            INLCR    /* map NL into CR */
            IGNCR    /* ignore CR */
            ICRNL    /* map CR to NL (ala CRMOD) */
      If INLCR is set, a received NL character is translated into a CR
      character.  If IGNCR is set, a received CR character is ignored (not
      read).  If IGNCR is not set and ICRNL is set, a received CR character is
            ONLCR   /* map NL to CR-NL (ala CRMOD) */
            OCRNL   /* map CR to NL */
            ONOCR   /* No CR output at column 0 */
            ONLRET  /* NL performs the CR function */
      If ONLCR is set, newlines are translated to carriage return, linefeeds.
      If OCRNL is set, carriage returns are translated to newlines.
      If ONOCR is set, no CR character is output when at column 0.
      If ONLRET is set, NL also performs CR on output, and reset current column

Symptoms of getting it wrong usually involve text stair-stepping across the page, or text being overwritten, which can make debugging difficult. But again this is mostly transparent, because programmers mostly fix these sorts of issues pretty quickly. A hex dump might help.

    $ printf "test\rfoobar\r\n"
    foobar
    $ printf "test\rfoobar\r\n" | od -bc
    0000000  164 145 163 164 015 146 157 157 142 141 162 015 012
               t   e   s   t  \r   f   o   o   b   a   r  \r  \n
    0000015
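The same trick can be done from inside a program, for when the suspicious string never reaches a shell pipeline. A small sketch, assuming Go and its standard strconv package; the name visible is made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// visible (hypothetical helper) returns s as a quoted string with
// control characters such as CR and LF written out as escape
// sequences, so the terminal cannot hide or act on them.
func visible(s string) string {
	return strconv.Quote(s)
}

func main() {
	// The CR no longer moves the cursor; it is printed as \r instead.
	fmt.Println(visible("test\rfoobar\r\n"))
}
```

This is handy in debug logs, where a raw CR would otherwise overwrite the start of the log line.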

Implementation Gotchas

Implementations may accidentally support unix NL in addition to CR NL. One way this might happen is to read a line with a call such as getline(3) and then strip any trailing CR or NL from it. Lines without the CR are then accepted as well.

    // unstrict - this has poor CR NL handling, do not use
    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
        char *line = NULL;
        size_t linesize = 0;
        ssize_t linelen;
        while ((linelen = getline(&line, &linesize, stdin)) != -1) {
            // strip every trailing CR or NL; the length check stops
            // the index from running off the front of the buffer
            while (linelen > 0 &&
                   (line[linelen - 1] == '\r' || line[linelen - 1] == '\n')) {
                line[--linelen] = '\0';
            }
            fprintf(stderr, ">%s<\n", line);
        }
        free(line);
        return 0;
    }

Accepting a bare NL opens the door to clients that send only unix NL. That habit may become established, and then break against a new server that has a correct implementation.

    $ printf '' | ./unstrict
    $ printf 'foo\r\n' | ./unstrict
    >foo<
    $ printf 'foo\n' | ./unstrict
    >foo<
    $ printf 'foo\r\r\r\r\r\r\n' | ./unstrict
    >foo<

Allowing a bare NL should probably be avoided, especially in these "moving away from Postel" days.

Reverse LISP Notation

Another interesting point is that the newline typically goes up front in LISP, backwards of unix. This has various advantages and drawbacks.

https://clisp.sourceforge.io/impnotes/newline-convention.html

    $ cat nl.lisp
    (write 'foo)
    (write 'bar)
    (fresh-line)
    (write 42)
    (format t "~&and that's all, folks!")
    $ sbcl --script nl.lisp
    FOOBAR
    42
    and that's all, folks!$

tags #eolstory

A legacy web search engine helpfully included search results for "tolstoy".

-- Page fetched on Tue May 21 15:57:10 2024