Protocol Specification

or, "how can I print a string as hex?"


"String as hex" is rather vague, and computers can be mighty picky about such details. Humans who have dealt with computers can also be picky, especially if they expect to have to support, fix, or write the code involved, and suspect that the goalposts will move as the details emerge of what exactly a "string" is and what exactly "as hex" means.


Management may label such "Luther, with the Questions" individuals as problematic.


Anyways, are these Perl strings?


    "test"
    "a\nb"
    "\N{SNOWMAN}"
    "x/y:z"
    "\0\0"
    ""

Technically yes--but, sometimes, no: a string may need to have a non-zero length to be valid for the protocol, or, if the string is passed to C, any embedded \0 can be deeply problematic because C treats it as the end of the string. What about non-Unicode (or even !UTF-8) encodings? What happens to those? Will the string be used as a filename? Some operating systems do not take kindly to / or : in filenames.


As for the "as hex" part, this is also vague; it could go in any of several different directions, such as 16-bit (hex) containers, perhaps for some embedded purpose?


    $ perl -e 'print pack "S*", map ord, split //, "CAT"' | od -hc
    0000000     0043    0041    0054
               C  \0   A  \0   T  \0
    0000006
    $ perl -e 'print pack "S*", map ord, split //, "CAT"' | xxd
    00000000: 4300 4100 5400                           C.A.T.

Here we see that od(1) and xxd(1) do not agree on the representation of whatever it is that the "S*" argument to the Perl pack function is producing: xxd shows the bytes in file order, while od -h groups them into 16-bit words in the host's byte order. Did the protocol call for the words to be in little, big, or PDP-11 byte order? It did not specify? Uh-oh.


    $ perl -e 'print pack "S>*", map ord, split //, "CAT"' | od -hc
    0000000     4300    4100    5400
              \0   C  \0   A  \0   T
    0000006
    $ perl -e 'print pack "S<*", map ord, split //, "CAT"' | od -hc
    0000000     0043    0041    0054
               C  \0   A  \0   T  \0
    0000006

Anyways, hex here obviously means ASCII hexadecimal. This is still vague. Are we talking about C?


    0xFF, 0xFF, 0xCC, 0xCC, 0xEE, 0xEE

CSS colors?


    #FCE

A Perl string?


    "\xFF\xFF\xCC\xCC\xEE\xEE"

Why assume two, or maybe one hex character per unit? Why not four?


    "\x{2603}"

What happens when three, or five hex characters show up?


And are consumers of this specification supposed to support all these varied forms? Being permissive in what you accept [RFC 760] has fallen out of favor in recent years, probably because of the large amount of code required to support too many different input forms; too much code is generally a breeding ground for edge cases, undefined behavior, and security problems.


This is where unit tests help; a programmer may look at all the unit tests they would have to write and wonder if the protocol could be made less vague. Simpler.


This is all too complicated, just show me some code


Maybe you're in a hurry to debug something, in which case the following may get the job done.


    $ perl -e 'printf "0x%*vx\n", ",0x", "CAT\0"'
    0x43,0x41,0x54,0x0

This generates invalid hex from the empty string, though:


    $ perl -e 'printf "0x%*vx\n", ",0x", ""'
    0x

And maybe the width should be mandated, to make parsing more consistent:


    $ perl -e 'printf "0x%*v02x\n", ",0x", "CAT\0"'
    0x43,0x41,0x54,0x00

And should we be consistent about the case?


    $ perl -e 'printf "%v02x\n", "lo"'
    6c.6f
    $ perl -e 'printf "%v02X\n", "lo"'
    6C.6F

Maybe we do not want to generate invalid hex from the empty string:


    $ perl -E 'say "CAT\0" =~ s/(.)/sprintf "\\x%02X", ord $1/egrs'
    \x43\x41\x54\x00

But this method leaves a trailing comma, which C accepts in an array initializer but not in, say, a function argument list:


    $ perl -E 'say "CAT\0" =~ s/(.)/sprintf "0x%02X,", ord $1/egrs'
    0x43,0x41,0x54,0x00,

Maybe something longer?


    $ perl -E 'say join ",", map { sprintf "0x%02X", ord $_} split //, "CAT\0"'
    0x43,0x41,0x54,0x00

And what about encoding? Whoops, here's some four-character hex; hopefully anything parsing that input is okay with that.


    $ perl -E 'say join ",", map { sprintf "0x%02X", ord $_ } split //, "\N{SNOWMAN}"'
    0x2603

Should that instead have been 0xE2,0x98,0x83, or something else? If UTF-8, is 0xe2,0x98,0x83 also okay? Or should everything be normalized to upper case?


    $ perl -MEncode=encode -E 'say join ",", map { sprintf "0x%02X", ord $_ } split //, encode("UTF-8", "\N{SNOWMAN}")'
    0xE2,0x98,0x83

That is getting pretty long for a one-liner; I would write a script for this if it came up a lot. And maybe Encode::FB_CROAK should be set?


tags #perl #hex
