
shit.cx


Weirdness Redirecting to /dev/udp


2020-11-27T19:39


I don't think I've mentioned that during the day I'm an SRE for a pretty busy website. A few weeks ago I received a report that a system I maintain, hummingbird, was sending malformed StatsD messages to our Datadog StatsD agent.


The purpose of this service is to sustain a rate of requests across the network, a very rapid heartbeat if you will. The receiving service is so fast that the measured response duration is 90% network latency. This data tells us about HTTP performance and reliability across various paths of our Kubernetes cluster. Any problem with the network, HTTP load balancers, Kubernetes ingress controllers, service routing, pod scheduling, etc. is detectable in these metrics. Even subtle things like uneven pod spread between AZs can be seen. It's quite helpful.


The StatsD logs showed it was sometimes receiving messages with unexpected line breaks in the middle, but it was almost always fine. The errors were evenly distributed between hosts, AZs, and pods. Everything pointed to a problem in the code. This surprised me because it is pretty basic. I've stripped out the details, but this is it:


vegeta attack -rate <n>r/s <host> \
    | vegeta encode \
    | jq -r '[.get, .interesting, .fields] | @tsv' \
    | awk '{ <convert tsv into valid statsd message> }' \
    > /dev/udp/<statsd_host>/<statsd_port>

To explain each line:


Sustain a rate of HTTP requests to an endpoint.

Encode the binary output into JSON lines.

Select the useful data from JSON, and convert to TSV.

Transform the data into a valid StatsD message (see the sketch after this list).

Pipe it to another host over UDP.
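
For a concrete sense of the awk stage, here's a rough sketch. The field names and the metric name are invented for illustration; the output follows the DogStatsD format the Datadog agent accepts (metric:value|type|#tags):

# hypothetical fields: $1=path, $2=status, $3=latency in nanoseconds
awk -F'\t' '{ printf "hummingbird.request.ms:%.3f|ms|#path:%s,status:%s\n", $3 / 1000000, $1, $2 }'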


Doing it like this provides a steady stream of data with minimal buffering.
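
If you haven't come across the /dev/udp trick before: it isn't a real file. Bash itself intercepts paths of that form and opens a socket. You can try it with a one-liner (the address here is just an example):

# bash opens a UDP socket when it sees this path; nothing exists on disk
echo "test.metric:1|c" > /dev/udp/127.0.0.1/8125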


Looking hard at the code and twiddling a few things didn't help. The errors were frequent enough that I captured a few in a ten-minute tcpdump. This corroborated what the StatsD logs showed, only it added that StatsD messages sometimes spanned packets. This only happened to the biggest packets, those with a few StatsD messages within. I suppose that rather than buffering each line and writing it once the length is known, bash just does its best to dump as much data into the packet as it can. I don't think bash UDP redirects are really supposed to be *that* robust, so it's fine that it does this.
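
For reference, the capture was something along these lines (8125 is the conventional DogStatsD port, and the exact flags are an approximation):

# capture ten minutes of StatsD traffic for offline inspection (may need root)
timeout 600 tcpdump -n -i any -w statsd.pcap udp port 8125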


I replaced the UDP redirect with a pipe to OpenBSD netcat, which really is built for this problem. That fixed the truncated messages, but it introduced a new problem. I'll talk about that in another post.
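
The end of the pipeline became something like this. nc's -u flag switches it to UDP, and OpenBSD netcat sends each chunk it reads from the pipe as its own datagram:

    | awk '{ <convert tsv into valid statsd message> }' \
    | nc -u <statsd_host> <statsd_port>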







The content for this site is CC-BY-SA-4.0.
