Geminifying talk(1)

Background: Unix talk

Unix talk has long been my favourite communication method. Two users type at each other over the network; characters typed by one appear immediately on the screen of the other. It combines the textual purity of line-based instant messaging (e.g. IRC) with the immediacy of spoken communication. You can convey a remarkable amount of extra-linguistic information by means of pauses, variation in typing speed, and backspacing, quite analogous to speech.

D. Hofstadter on why a Turing test would be more effective with talk than with a line-based system.

Problems

Much as with Gopher, talk shows its age.

It's an unencrypted and unauthenticated protocol. We cannot be so trusting now.

It doesn't support Unicode.

It writes directly to a user's terminal as the only way to notify the user of an incoming call.

Talk can still be used safely if it only accepts connections from local users, and I have been using it this way for decades. But I can understand why this is rare.

Geminification

I think it's past time for an updated version of talk. Gemini has shown that this kind of endeavour can succeed, and there are some lessons to take from it:

Keep the protocol utterly simple, modulo encryption and authentication.

Use TLS for encryption and authentication, because it's well-supported, familiar, and future-proof.

Prefer TOFU with long-lived self-signed certificates to the CA system.

Use UTF-8 as the standard character encoding.

Avoid extensibility.

Focus on privacy.

Sketch

With this in mind, here's the system I envisage.

Protocol

Basically just TLS!

Client certificates for two-way authentication. No further negotiation or headers: all application data in each direction is understood as a UTF-8 character stream. This stream is to be interpreted as a sequence of lines terminated by \n. All other control characters are ignored, except \b (backspace, U+0008), which erases the last character of the current line (if any), and NAK (U+0015, i.e. Ctrl-U), which erases the entire current line.
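To make the line-editing rules concrete, here is a minimal Python sketch of how a receiver might interpret the decoded stream. The function name apply_stream is made up for illustration, and two things are assumptions of mine rather than part of the sketch above: "control character" is taken to mean Unicode general category Cc, and "last character" is taken to mean the last code point.

```python
import unicodedata

BACKSPACE = "\b"  # U+0008: erase the last character of the current line
NAK = "\x15"      # U+0015 (Ctrl-U): erase the entire current line
NEWLINE = "\n"    # terminates the current line


def apply_stream(text, lines=None, current=""):
    """Feed decoded characters through the line-editing rules.

    Returns (completed_lines, current_line). Passing the previous
    result back in lets the caller process the stream piece by piece.
    """
    lines = [] if lines is None else list(lines)
    for ch in text:
        if ch == NEWLINE:
            lines.append(current)
            current = ""
        elif ch == BACKSPACE:
            current = current[:-1]  # no-op if the line is already empty
        elif ch == NAK:
            current = ""
        elif unicodedata.category(ch) == "Cc":
            continue  # assumption: "control character" means category Cc
        else:
            current += ch
    return lines, current


if __name__ == "__main__":
    print(apply_stream("helo\bp\x15hi\x07!\nyo"))
```

Erasing by code point means a backspace after a combining sequence removes only the combining mark; a real client might prefer to erase whole grapheme clusters.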

That's it.

EDIT: Problem

After writing this, I realised that there's one big problem with this naive protocol. For the non-textual communication which is crucial to the talk experience, it's important to have precise timing, with each character appearing "as it is typed". That would mean one packet (one TLS application record) per character, which would be rather expensive: TLS itself adds some bytes (up to 36 it seems, with no padding), then the TCP and IPv4 headers add another 40. One could argue that this isn't worth worrying about in itself. I'm not sure that the original talk didn't have the same kind of problem, though I think it at least sometimes used UDP, which would mean only 28 bytes of overhead. But there are further problems: network jitter would manifest directly and disrupt communication, and (what worries me most) the timing information could quite possibly be used by a snooper to identify users.
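To put rough numbers on that overhead, here is a small Python calculation. The constants are the estimates quoted above (the TLS figure in particular is only the upper guess from the text, not a measured value), and the function name bytes_on_wire is invented for this example.

```python
TLS_OVERHEAD = 36      # upper estimate from the text, without padding
TCP_IPV4_HEADERS = 40  # 20-byte TCP header + 20-byte IPv4 header, no options
UDP_IPV4_HEADERS = 28  # 8-byte UDP header + 20-byte IPv4 header


def bytes_on_wire(payload_len, header_overhead, tls_overhead=0):
    """Total bytes sent for one packet carrying payload_len bytes of text."""
    return payload_len + tls_overhead + header_overhead


if __name__ == "__main__":
    # One ASCII character per TLS record over TCP/IPv4.
    print(bytes_on_wire(1, TCP_IPV4_HEADERS, TLS_OVERHEAD))
    # One ASCII character in a plain UDP datagram.
    print(bytes_on_wire(1, UDP_IPV4_HEADERS))
    # Ten buffered ASCII characters in a single TLS record over TCP/IPv4.
    print(bytes_on_wire(10, TCP_IPV4_HEADERS, TLS_OVERHEAD))
```

Under these estimates a single typed character costs 77 bytes over TLS, against 29 bytes for an unencrypted UDP datagram, while sending ten characters in one record brings the cost down to 8.6 bytes per character.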

So we may need a slightly more complicated protocol, where multiple characters are buffered and sent at once with precise timing information (much like audio frames). Hmm.

This shouldn't complicate things too much. I'll need to think about it, but one plausible solution is just to alternate Unicode characters with 16-bit integers indicating a delay. Though in some ways it would be nicer if the stream were still a valid UTF-8 character stream...
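As one possible reading of the "alternate characters with delays" idea, here is a Python sketch. Every detail of the wire format is an assumption of mine, not a decision: the delay is a big-endian unsigned 16-bit count of milliseconds since the previous character, it comes before the character it applies to, and longer pauses are clamped to 65535 ms. The names encode_events, decode_events and utf8_length are invented for this example.

```python
import struct

MAX_DELAY_MS = 0xFFFF  # assumption: delays are unsigned 16-bit milliseconds


def encode_events(events):
    """Encode (delay_ms, character) pairs as delay, then UTF-8 bytes."""
    out = bytearray()
    for delay_ms, ch in events:
        if len(ch) != 1:
            raise ValueError("expected exactly one code point per event")
        out += struct.pack(">H", min(delay_ms, MAX_DELAY_MS))
        out += ch.encode("utf-8")
    return bytes(out)


def utf8_length(first_byte):
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def decode_events(data):
    """Inverse of encode_events; assumes data holds only whole events."""
    events = []
    pos = 0
    while pos < len(data):
        (delay_ms,) = struct.unpack_from(">H", data, pos)
        pos += 2
        n = utf8_length(data[pos])
        events.append((delay_ms, data[pos:pos + n].decode("utf-8")))
        pos += n
    return events


if __name__ == "__main__":
    frame = encode_events([(0, "h"), (120, "é"), (70000, "€")])
    print(frame)
    print(decode_events(frame))
```

Note that a stream framed this way is generally not valid UTF-8, which is exactly the drawback mentioned above; keeping the stream valid would need a different way of carrying the delays.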

It might also then be worth mandating TLS-1.3 record padding to reduce the amount of information we leak.

Usage

Is this really enough? Initially I had in mind that we need to replicate talk's ability to look up a local user of the target host, which would complicate things dramatically. One could go even further with a federated system like XMPP, allowing for the target user to not even be a Unix user on the host. But... one thing Gemini has shown is that many people are able to self-host, despite NAT routers and so forth, and I think promoting this further is worthy in itself, and it lets us keep the protocol truly simple. We also sidestep the serious privacy concerns involved with relying on a third-party host -- we'd need not only end-to-end encryption, but a way to ensure metadata privacy so the third party can't tell who is talking to whom, and that's a really hard problem.

So. Here's how I see this working.

Each user creates a long-lived self-signed certificate, or e.g. reuses the one they use on their Gemini server. This will be used in both directions.

One user starts a server process on their machine. This listens at a standard port (or a user-specified one, so this can still be used on a multiuser system if the users can agree on how to allocate ports). They inform others about the existence of the server however they like -- preferably including the certificate hash for extra MitM protection. Another user connects to it, providing their own certificate as a TLS client certificate. The first user is notified with information about the caller -- a petname if the user has set one for this certificate, else just the certificate information, to which they can then assign a petname. They can accept the call with an interactive client which presents the text and accepts user input.
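The petname lookup could be as simple as a table keyed by certificate hash. Here is a minimal Python sketch, assuming the hash is SHA-256 over the DER-encoded certificate and that petnames live in an in-memory dict; both are my assumptions, and the function names are invented for this example. How the DER bytes are obtained from the TLS library is left out.

```python
import hashlib


def fingerprint(der_cert):
    """Hex SHA-256 of a DER-encoded certificate (assumed hash choice)."""
    return hashlib.sha256(der_cert).hexdigest()


def describe_caller(der_cert, petnames):
    """Return the petname for this certificate, or its hash if unknown."""
    fp = fingerprint(der_cert)
    name = petnames.get(fp)
    if name is not None:
        return name
    return "unknown caller, SHA-256 " + fp


def assign_petname(der_cert, name, petnames):
    """Remember a petname for this certificate (trust on first use)."""
    petnames[fingerprint(der_cert)] = name


if __name__ == "__main__":
    petnames = {}
    cert = b"placeholder bytes standing in for a DER certificate"
    print(describe_caller(cert, petnames))
    assign_petname(cert, "alice", petnames)
    print(describe_caller(cert, petnames))
```

The same fingerprint function would serve the caller's side too: the hash published alongside the server address can be compared with the hash of the certificate the server actually presents.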

Upshot: to accept calls, you have to host a server. This might mean fiddling with port-forwarding and DNS, or setting up an onion hidden service, or convincing a pubnix to let you use a port. But to make calls you need only a client.

There are various ways the UI could work. One possibility would be an integrated program which acts as server as well as multiplexing active talk sessions. That might make sense in a GUI context. Personally, I'd prefer to split things up, with a background daemon (running as the user) which uses user-configurable means (e.g. email, write(1), notify-send(1)) to notify the user, something similar but foregrounded for initiating a call, and finally a TUI/GUI (which can talk to the other process over a socket, so it doesn't even need to know TLS) for the talk session itself.

Further remarks

What about non-textual communication? Shouldn't we have something analogous to Gemini's use of MIME types? No. That way lies a profusion of pairwise-incompatible clients each supporting different functionality. There's no reason that the same basic structure couldn't be used for e.g. voice or video, but that can be a separate protocol on a separate port.

What about a name? The obvious name is 'stalk', for "secure talk", which makes a nice command. I'm torn on whether to use it though... perhaps it's going too far in planting this as a talk(1) successor, when its single-user design makes it actually quite different.

EDIT: Additional thoughts

What about multi-user chat à la ytalk? This could be layered on top (as it was with ytalk), by having each user connect to each other user.

What about surveillance? Using direct IP connections has the downside that those who are monitoring network traffic can observe who you are talking to. But I think it's reasonable to consider this out-of-scope -- it's a general problem needing general solutions. Tor possibly is one, or at least aims to be one, and it should be easy to run this via Tor.

Feedback

I'd welcome your thoughts. Can you imagine using something like this? See any problems with the design?

If there's enough interest, I could set up a mailing list.

My current plan is to go ahead and write a simple implementation, which should be straightforward enough. I haven't decided on a language yet... maybe Haskell for a first prototype, intending to rewrite in C for greater portability. Or maybe I'll make this my excuse to learn Go...