
How Can We Determine File Types and Text File Encodings?


Determining File Types


I have a security question. How can we verify that a UTF-8 file contains only UTF-8 encoded bytes?


Running iconv all the time (the preferred solution) isn't appropriate in every situation, and it only pushes the question back: how does iconv perform the verification? Other proposals suggest pushing text through UTF-8 language tools, like `read().decode('UTF-8')` in Python, but, again, the /how/ remains mysterious. Rust is particularly good here in that strings that aren't valid UTF-8 simply cannot be constructed, unless you use a workaround. That provides some built-in validation. But, again, it doesn't tell me /how/ the validation is done.
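For what it's worth, the Python route can be made explicit. A minimal sketch (the function name is my own), assuming a strict whole-input decode is what counts as verification:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if every byte of `data` belongs to a
    well-formed UTF-8 sequence; strict mode rejects anything else."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# A valid multi-byte sequence passes; a broken one fails.
print(is_valid_utf8("héllo".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc3\x28"))  # False: 0x28 is not a continuation byte
```

Of course, this just delegates the /how/ to CPython's decoder; the conceptual rules it applies are the real answer.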


The reason I want to know this in more detail is that I am concerned about polyglot filetypes. I want to handle files correctly. I figure that if I can wrap my head around UTF-8, I'll be halfway toward understanding the problem generally.


After investigating this problem, it seems to me that files don't have an objective type. Sure, the file extension can give us a clue. Or we can check the "magic bytes" at the start of the file, which the Linux command `file` uses. But these do not /determine/ the filetype. The extension can be wrong, the magic bytes can be falsified. Or, in the case of polyglot files, the "correct" indications serve only to obscure a malicious payload of another file nested inside.


What I gather from all this is that all files are just sequences of bytes. There is no underlying type that has any ontological reality. The "type" of a file has meaning only in the interaction between that particular sequence of bytes and a particular piece of software that proposes to interpret it. A file type is "verified" when the software's output meets expectations for that filetype. For example, a sequence of bytes verifies that `my-image.jpg` is of the image/jpeg type when an image viewer properly renders that sequence as an image. If I change the file extension to .txt and try to open it in a text editor, the result will not be satisfactory, and thus the file will not be viewed as a valid text file.


That's about as far as anyone can get, and it doesn't protect us against polyglot filetypes.


Since UTF-8 is the default text encoding of the Gemini protocol, it would be good to have /a more detailed/ understanding of how UTF-8 can be validated /on a conceptual level/. I will continue poking around to find my answer, but I hope someone here might know.
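On the conceptual level, the well-formedness rules of RFC 3629 can be written out directly: each leader byte fixes how many continuation bytes must follow and constrains their ranges, which also rules out overlong forms, UTF-16 surrogates, and code points above U+10FFFF. This is an illustrative sketch of those rules, not any particular library's actual code, but it is roughly what iconv, Python, and Rust all implement internally:

```python
def validate_utf8(data: bytes) -> bool:
    """Validate per RFC 3629 by checking leader-byte ranges and
    continuation bytes, rejecting overlongs and surrogates."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # 1-byte: plain ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:              # 2-byte leader (0xC0/0xC1 would be overlong)
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                    # 3-byte, tightened to reject overlongs
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:                    # 3-byte, tightened to reject UTF-16 surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:            # remaining 3-byte leaders
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:                    # 4-byte, tightened to reject overlongs
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:            # 4-byte leaders
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                    # 4-byte, capped at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                              # 0x80-0xC1 and 0xF5-0xFF never start a sequence
            return False
        if i + need >= n:                  # sequence truncated at end of input
            return False
        if not (lo <= data[i + 1] <= hi):  # first continuation byte, constrained range
            return False
        for j in range(2, need + 1):       # remaining continuation bytes: 0x80-0xBF
            if not (0x80 <= data[i + j] <= 0xBF):
                return False
        i += need + 1
    return True
```

Note that this is a pure byte-level check: it says the sequence is /well-formed/, not that it is meaningful text, which is as far as "validation" can go.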


Determining Textfile Encodings


A related problem I'm having is determining the encoding of textfiles. It disturbs me that this appears to be impossible to do reliably. I've run various checkers and have obtained remarkably different, and often wrong, results. I'm working with millions of vintage textfiles, and I'm left with tens of thousands of ambiguous files that have to be checked manually.


So a related question: how can we know that a textfile is encoded with WINDOWS-1252? The answer cannot be "use chardet" because (a) chardet is not always right, (b) invoking chardet all the time is not always appropriate, and (c) it only pushes back the question: how does chardet determine that a textfile is WINDOWS-1252 encoded? And more importantly, why does it sometimes get it wrong?


It seems to me that it works on a statistical basis, taking samples of the text rather than checking the whole thing, and there's no way to force chardet to check the whole document. If that's true, then a probable solution is to check the /entire/ document. And yet nowhere do I find this proposed as a solution to chardet's error rate. 99% of people, including developers, simply don't care about getting text conversion right. Good enough is good enough.
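A brute-force version of that idea is easy to sketch: strictly decode the whole document under each candidate encoding and keep only the candidates that accept every byte. (The candidate list here is my own choice.) It also shows the limit of the idea: some encodings, like latin-1, define all 256 byte values, so exhaustive checking can narrow the field but can never always pick a single winner.

```python
def viable_encodings(data: bytes,
                     candidates=("utf-8", "cp1252", "latin-1")) -> list:
    """Strictly decode the ENTIRE document under each candidate
    encoding and return those that accept every byte."""
    viable = []
    for enc in candidates:
        try:
            data.decode(enc, errors="strict")
            viable.append(enc)
        except UnicodeDecodeError:
            pass
    return viable

# A lone 0xE9 is invalid UTF-8, so UTF-8 is eliminated; but cp1252
# and latin-1 both accept it, so the ambiguity remains.
print(viable_encodings(b"caf\xe9"))  # ['cp1252', 'latin-1']
```

That remaining ambiguity is exactly where tools like chardet fall back on statistics: which encoding makes the text look like plausible language.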


I'm not satisfied with this. I have responsibilities to the past, and I do not want to push mangled text forward into the future. I want to get my text conversions right 100% of the time.


I'm willing to look into the chardet code. But does anyone here have insight to make my journey a bit more efficient?


Both of my concerns are really the same concern in the end: how can we determine that this sequence of bytes is of this type or this encoding? So far, I've found no reliable answer. And because in general we have no reliable answer, we are no longer concerned with securing software, but with sandboxing it. We know every file is a potential weapon against our systems, and all we can do is build barriers around the next completely predictable explosion.


Have you also come to this conclusion?


Thanks.


https://en.wikipedia.org/wiki/Polyglot_(computing)#Security_implications

https://medium.com/swlh/polyglot-files-a-hackers-best-friend-850bf812dd8a

https://infosecwriteups.com/polyglot-files-the-cybersecurity-chameleon-threat-29890e382b59?gi=37232c23a5b1


Edit: Here's a link to the official description of how chardet works:

https://chardet.readthedocs.io/en/latest/how-it-works.html


It provides a good overview of the complexity of determining even the simplest facts about sequences of bytes.


Here's another link:

A composite approach to language/encoding detection (2008)

https://www-archive.mozilla.org/projects/intl/universalcharsetdetection


Notably, my browser renders black-diamond question-mark replacement characters all over that document, indicating that the browser fails to detect its encoding properly, proving that the nightmare never ends.


Posted in: s/Gemini

🚀 blah_blah_blah

Apr 04 · 6 weeks ago


7 Comments ↓


โ˜•๏ธ mozz ยท Apr 04 at 18:43:

I understand the challenge. At the end of the day, all you can do is make a best guess based on the file extension and the surrounding context, like which operating system the file was originally saved on.


But why do you think a polyglot file is a security issue? I don't see how it would be more insecure than any other untrusted file.


💎 istvan · Apr 04 at 21:59:

UTF-8 isn't a file type: it's a scheme for encoding text. The text is then connected with glyphs stored in a font.


You can include as many different encodings as you wish in a text file. For example, I can make you a single text file containing UTF-16LE, UTF-8, Japanese Shift-JIS, Japanese EUC, Chinese Big5, Chinese GB2312, and ASCII encodings.


Anything that opens it will crap its pants and show mostly garbage, because a document is assumed to use a single text encoding. There is nothing in plain text to hint at which encoding should be used. At best, a text editor can make a heuristic guess, but non-UTF encodings often still need to be configured manually.


💎 istvan · Apr 04 at 22:04:

If you are asking about file types, which is a completely different question, there is typically some form of magic bytes that can be used to make a guess.


Ultimately, it's the responsibility of the software to figure this out.


If you replace the magic bytes of a PNG with a JPEG's, your OS might guess it is a JPEG and pass it to an image editor. The image editor will attempt to parse the JPEG, find that the data just doesn't work, and complain that you passed a broken/invalid JPEG.


So the problem is the final processing end's to solve. MIME and magic are just shorthands to help guess which software to pass the file to for further processing.
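That routing-hint role is easy to see in a toy sniffer. The signature bytes below are real (PNG, JPEG, PDF); the function itself is a hypothetical sketch of what `file`-style tools do. A spoofed header is happily misrouted, and only the downstream parser exposes the lie:

```python
# Real magic-byte signatures; the sniffer itself is an illustrative sketch.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"%PDF-": "application/pdf",
}

def sniff(data: bytes) -> str:
    """Guess a MIME type from leading magic bytes; fall back to
    the generic binary type when nothing matches."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"

# A JPEG header glued onto non-JPEG data still sniffs as a JPEG;
# only the image decoder, failing to parse it, reveals the spoof.
print(sniff(b"\xff\xd8\xff" + b"\x89PNG leftovers"))  # image/jpeg
```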


๐Ÿ™ norayr ยท Apr 05 at 17:02:

i guess you know about the 'file' utility.


🚂 MrSVCD · Apr 05 at 22:01:

To make your life a little easier, you can make a utility that detects ASCII and UTF-8 text; the rest you can't automate, since there is no real way to distinguish between different codepages besides having a human check whether the text looks correct.


🚀 blah_blah_blah [OP] · Apr 10 at 00:04:

@mozz

> But why do you think a polygot file is a security issue? I don't see how it would be more insecure than any other untrusted file.


Secure software has to presume that user input is hostile. One form of hostility is the polyglot file, which appears to be one thing while, under certain circumstances, also being something else.


🚀 blah_blah_blah [OP] · Apr 10 at 00:44:

The responses to my post confirm my view that the final determinant of a file's type or encoding is human judgment about whether the expected software chokes on the data or not. I guess I'm the only one who finds this an intriguing topic, or an alarming one.
