gemini://thrig.me/blog/2023/08/15/is-it-utf8.gmi

Another method is to brute force the problem with a Monte Carlo simulation, which at best will waste lots of CPU. However, where Monte Carlo is practical this may help confirm that a better approach has not gone off the rails. Or, that both the Monte Carlo and the better approach are wrong in ways that produce the same results, which maybe isn't so likely? Richard Feynman said things about actually verifying results in his 1974 commencement address at Caltech.

// istuf8 - what are the odds of random bytes being valid UTF-8? #include <assert.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> // doas pkg_add libutf8proc #include <libutf8proc/utf8proc.h> double byteaudit(size_t len, size_t trials) { size_t pass = 0; assert(len > 0); utf8proc_int32_t ref; utf8proc_uint8_t *in = malloc(len); if (!in) abort(); for (size_t i = 0; i < trials; i++) { // PORTABILITY this may need libbsd or something else arc4random_buf(in, len); utf8proc_uint8_t *ip = in; utf8proc_ssize_t left = (utf8proc_ssize_t) len; while (1) { utf8proc_ssize_t ret = utf8proc_iterate(ip, left, &ref); if (ret < 1) break; ip += ret; left -= ret; if (left == 0) { pass++; break; } } } free(in); return (double) pass / (double) trials * 100.0; } int main(void) { for (size_t i = 1; i <= 8; i++) printf("%lu %.2lf\n", i, byteaudit(i, 10000000U)); }

$ CFLAGS=`pkg-config --cflags --libs libutf8proc` make isutf8 cc -I/usr/local/include -DUTF8PROC_EXPORTS -L/usr/local/lib -lutf8proc -o isutf8 isutf8.c $ ./isutf8 1 49.97 2 27.95 3 15.79 4 8.93 5 5.05 6 2.85 7 1.61 8 0.91

These results agree fairly well with those from the original article, even if they do chew up a lot more CPU getting there. Someone with actual stats chops could doubtless do better with distributions and p-values than my "fairly well" line.

Length P(valid UTF-8) Monte Carlo ------------------------------------- 1 50 % ~ 49.97% 2 ~ 27.9% ~ 27.95% 3 ~ 15.8% ~ 15.79% 4 ~ 8.9% ~ 8.93%

My numbers were wildly wrong until I got all the obvious bugs out of the code. For instance code that answers "can valid UTF-8 be found at the beginning of a buffer?" will get rather different results. A program might stop reading at the first NUL or '\0' and thus ignore gibberish beyond that that another program would fail on having read past that NUL. So it will depend on the data, how it is parsed before reaching any Unicode library, and then how that library and the code that calls the library deals with what the initial parse did. In other words, it can be complicated.

The above code works. However, it is hilariously inefficient, especially for shorter byte lengths where one could simply test each value. Also fewer trials could be used to obtain good enough results, especially if you have older and slower hardware, and enough statistical chops to be confident that the results are indeed good enough.

$ CFLAGS=`pkg-config --cflags --libs libutf8proc` make isutf82 ... $ ./isutf82 > out.txt $ sed 3q out.txt 50.00 27.93 15.72 8.96 5.29 2.79 1.58 0.94 50.00 27.93 15.28 9.09 5.11 2.85 1.50 0.93 50.00 27.93 16.46 9.51 5.18 2.81 1.75 0.82

This version of the code checks each value when the length is 1 or 2. Otherwise, the number of trials is lower, but the check is done 30 times for each length. This allows the mean and standard deviation of the results to be calculated, here with R:

$ r-fu colsumm out.txt V1 V2 V3 V4 V5 V6 V7 Count 30 30.00 30.0000000 30.0000000 30.0000000 30.0000000 30.0000000 Range 0 0.00 1.7600000 1.1400000 1.0100000 0.6200000 0.4600000 Mean 50 27.93 15.7496667 8.9846667 5.0646667 2.8093333 1.5923333 Sdev 0 0.00 0.3648097 0.3219952 0.2227219 0.1508124 0.1064047 Min. 50 27.93 14.7000000 8.3900000 4.6100000 2.4400000 1.3500000 1st Qu. 50 27.93 15.5600000 8.8100000 4.8700000 2.7200000 1.5100000 Median 50 27.93 15.7350000 8.9400000 5.1300000 2.8100000 1.5850000 3rd Qu. 50 27.93 15.9600000 9.2100000 5.1900000 2.9200000 1.6600000 Max. 50 27.93 16.4600000 9.5300000 5.6200000 3.0600000 1.8100000 V8 Count 30.00000000 Range 0.34000000 Mean 0.92933333 Sdev 0.08780595 Min. 0.74000000 1st Qu. 0.85000000 Median 0.93500000 3rd Qu. 0.98000000 Max. 1.08000000

Much less CPU, and still pretty good results. Again, someone with better statistical chops could make better use of the numbers here.

Length P(valid UTF-8) Monte Carlo MCII Mean ------------------------------------------------- 1 50 % ~ 49.97% 50 % 2 ~ 27.9% ~ 27.95% ~ 27.93% 3 ~ 15.8% ~ 15.79% ~ 15.75% 4 ~ 8.9% ~ 8.93% ~ 8.98%

Is it UTF-8?

Less Terrible Code