help debugging likely disk write issue?


I have a server service running on a FreeBSD server I manage, connected via a pipe to a logging service that writes to disk. Periodically (generally after a few weeks), the server entirely stops doing anything. It doesn't respond to any requests, and even after a restart it doesn't bind to any ports and none of the usual startup log messages come through. The only way I've found to get it un-stuck is to restart the logging service in the service manager (runit), and then SIGKILL the old logging process.
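
For reference, the unstick procedure is roughly the following; the service path here is just an example, not my real layout:

```
# rough sketch of the unstick procedure; the service path is an example
sv restart /var/service/myserver/log    # ask runit to restart the logging service
pgrep -fl svlogd                        # the old svlogd sticks around; find its pid
kill -9 <old-svlogd-pid>                # SIGKILL it so the fresh logger can take over
```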


Thus, my hypothesis is that the logger loses its ability to write to disk (for some reason - this is still the mystery). That then eventually affects the server through back-pressure: subsequent log line writes fill up the pipe and those writes start blocking. Once blocked on a logging write, the server won't do anything.
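
The blocking behavior itself is easy to see in isolation. Here's a toy sketch (not my actual services) where the writer fills the kernel pipe buffer and then blocks because the reader never drains it:

```
# toy demonstration of pipe back-pressure, not the real services:
# `yes` fills the kernel pipe buffer (commonly 64 KiB) and then blocks
# in write(2), because the "reader" never actually reads.
yes "a log line" | sh -c 'exec sleep 3600' &
sleep 2
# the writer should now be asleep on the pipe; on FreeBSD the wait channel
# for a blocked pipe write typically shows up as "pipewr"
ps -axo pid,comm,state,wchan | grep yes
```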


I feel like this hypothesis sufficiently explains the behavior, but now I'm lacking the FreeBSD knowledge to debug how the logger gets stuck. Can anyone point me to resources, or does anyone have other ideas that might be helpful? For a little more context: the service manager is runit (which has a parent supervisor process that owns the pipe, so that the server and logging services on either end can be restarted without losing data), and the logger is svlogd.
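
To illustrate the arrangement (directory and binary names here are made up, not my real ones):

```
# typical runit arrangement for this setup; names are made up.
# runsv(8) creates and holds the pipe, so either side can be restarted
# without the other end noticing.
mkdir -p /var/service/myserver/log

cat > /var/service/myserver/run <<'EOF'
#!/bin/sh
exec /usr/local/bin/myserver 2>&1      # stdout+stderr feed the pipe
EOF

cat > /var/service/myserver/log/run <<'EOF'
#!/bin/sh
exec svlogd -tt /var/log/myserver      # svlogd reads the other end
EOF

chmod +x /var/service/myserver/run /var/service/myserver/log/run
```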


Posted in: s/FreeBSD

🌊 tjp

2023-11-07 · 6 months ago


8 Comments ↓


🌊 tjp [OP] · 2023-11-07 at 15:55:

Please just let me know if this isn't the right place for a post like this.


📡 byte · 2023-11-07 at 16:43:

First thing that comes to mind is that you have some internal deadlock, or that you're running out of file descriptors.


Either way, try to reproduce it separately from the server part and print out errno and strerror(errno). Also, it might be better to read/write the pipe in a non-blocking, asynchronous way, so the server's speed isn't affected by the logger.


📡 byte · 2023-11-08 at 13:15:

@tjp did it work out?


โ˜•๏ธ Morgan ยท 2023-11-09 at 14:16:

Usually with this kind of issue I would try to add any kind of alternative log output possible ... aka "printf debugging" :) ... to learn more the next time it happens. In cases where the printf output isn't visible, I would append to a file somewhere.
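
E.g. something as crude as appending timestamped checkpoints to a side file (the path and message are arbitrary):

```
# crude checkpoint logging to a side file; path and message are arbitrary
echo "$(date '+%Y-%m-%d %H:%M:%S') got past bind()" >> /var/tmp/myserver-debug.log
```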


🌊 tjp [OP] · 2023-11-09 at 15:59:

Thanks for the suggestions @byte. I'd be really surprised if there were a deadlock in svlogd, since it's a heavily tested and widely used program (it's the logger shipped with runit). I also don't think it should run out of file descriptors (it'd have to leak them), but is there a way I can see file descriptor counts for a process in FreeBSD? That's the kind of debugging I'm looking for.


🌊 tjp [OP] · 2023-11-09 at 16:02:

Nothing new to report, as it hasn't happened again since I first posted here (it generally runs for a few weeks before this lock-up occurs). I'm hoping to gain a little knowledge for when it does happen again: how can I inspect a running process, for example to see how many file descriptors it has open, whether it's currently blocked on a file write call, or whether it has some other syscall in flight that hasn't completed?
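
A few stock FreeBSD tools seem to cover exactly these questions (the pid below is a placeholder):

```
# inspecting a live (or stuck) process on FreeBSD; <pid> is a placeholder
procstat -f <pid>                 # open file descriptors (count with | wc -l)
fstat -p <pid>                    # open files for the process, older interface
procstat -kk <pid>                # kernel stack per thread, e.g. shows pipe_write
ps -o pid,state,wchan -p <pid>    # wait channel; "pipewr" = blocked writing a pipe
truss -p <pid>                    # attach and trace syscalls as they happen
```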


📡 byte · 2023-11-09 at 16:17:

Just strip everything out of the server part and write some logging calls in a loop there; that way you can reproduce it without waiting for weeks, and actually try to debug it.
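
i.e. take the real server out of the picture and just flood the logger, roughly like this (the directory is a placeholder):

```
# flood svlogd directly instead of going through the real server;
# the log directory is a placeholder
mkdir -p /tmp/logtest
( while :; do date; done ) | svlogd -tt /tmp/logtest
```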


🌊 tjp [OP] · 2023-11-13 at 18:21:

I finally figured this out.


Based on how often the problem arose, I suspected it might be connected to svlogd's log rotation, and the svlogd manpage mentions the possibility of it locking up when an external log file processor fails. So to induce it more quickly, I reduced the maximum log file size and generated a bunch of traffic to my server with a shell script, roughly like the sketch below.
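
The forcing part looked something like this; the size, paths, and URL are illustrative rather than my real config:

```
# shrink the rotation size so rotation (and the !processor) fires constantly;
# the values, paths, and URL are illustrative
printf 's4096\nn5\n!gzip\n' > /var/log/myserver/config
sv restart /var/service/myserver/log
# then hammer the server in a loop so it logs a lot
while :; do fetch -q -o /dev/null http://localhost:8080/; done
```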


I was using the !processor functionality of svlogd to run rotated log files through a script (`!gzip`), but I was also starting svlogd with `chpst -p 1`, which sets a soft limit of one process; that limit apparently kept svlogd from running the processor when it rotated. I removed the limit and it works fine now.
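
For anyone who hits the same thing, the change amounted to this in the log run script (paths illustrative):

```
#!/bin/sh
# log/run after the fix; paths are illustrative
# previously: exec chpst -p 1 svlogd -tt /var/log/myserver
# the -p 1 soft process limit apparently kept svlogd from forking the
# !gzip processor at rotation time, and it hung there
exec svlogd -tt /var/log/myserver
```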
