Ambiguous names in Java due to non-normalised Unicode - but all OK in Python

In Java and several other languages, identifiers (e.g. method names) are allowed to contain Unicode characters.

Unfortunately, some sequences of Unicode characters are logically identical. For example, á (one character: Latin Small Letter A with Acute, U+00E1) is the same as á (two characters: Latin Small Letter A, U+0061, followed by Non-spacing Acute Accent, U+0301, which current Unicode calls Combining Acute Accent). These sequences are not just similar: Unicode defines them as canonically equivalent, so they are identical by definition.
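As a quick check of that equivalence, here is a small sketch in Python, chosen only because its standard unicodedata module keeps the comparison short. The variable names are illustrative. The two strings are written with escape sequences so that nothing can silently normalise them on the way to your editor.

```python
import unicodedata

precomposed = "\u00e1"   # one code point: a with acute
decomposed = "a\u0301"   # two code points: a, then the combining acute accent

# As raw code point sequences the two strings are different...
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# ...but normalising either one yields the other.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)                # True True
```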

Java does not do any normalisation on your code before compiling it, so two identifiers containing equivalent but different Unicode sequences are considered different (ref: JLS 7 section 3.8).

    $ cat U.java
    public class U {
        static String \u00e1() { return "A WITH ACUTE"; }
        static String a\u0301() { return "A + NON-SPACING ACUTE"; }
        public static void main(String[] a) {
            System.out.println(á());
            System.out.println(á());
        }
    }
    $ javac U.java && java U
    A WITH ACUTE
    A + NON-SPACING ACUTE

We can define and use two methods called á and á, and they are totally independent entities.

Python 3 also allows Unicode characters in identifiers, but it avoids the above problem by normalising them: all identifiers are converted to the normal form NFKC while parsing (ref: Python 3 Reference, section 2.3):

(Legend: A WITH ACUTE, A + NON-SPACING ACUTE)
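A minimal sketch of this behaviour, assuming Python 3. Rather than typing the two look-alike names directly (editors and copy-paste can silently normalise them), it builds the source text from escape sequences and runs it with exec. The function bodies mirror the Java example, and the names source, namespace, names and results are illustrative.

```python
precomposed = "\u00e1"   # a with acute as one code point
decomposed = "a\u0301"   # a followed by a combining acute accent

# Source text defining two functions whose names differ only in Unicode form.
source = (
    f"def {precomposed}():\n"
    "    return 'A WITH ACUTE'\n"
    f"def {decomposed}():\n"
    "    return 'A + NON-SPACING ACUTE'\n"
)

namespace = {}
exec(source, namespace)

# Only one function name survives, because both spellings normalise to it.
names = [name for name in namespace if name != "__builtins__"]
print(len(names))  # 1

# Calling by either spelling reaches the same (second) definition.
results = (
    eval(f"{precomposed}()", namespace),
    eval(f"{decomposed}()", namespace),
)
print(results)  # ('A + NON-SPACING ACUTE', 'A + NON-SPACING ACUTE')
```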

The second definition overwrites the first because the two names are considered identical. You can call the function using either spelling of its name.

Both ways of working are scary, but I'd definitely choose the Python 3 way if I had to.

Originally posted at 2016-08-11 07:47:40+00:00.