Ambiguous names in Java due to non-normalised Unicode - but all OK in Python


In Java and several other languages, identifiers (e.g. method names) are allowed to contain Unicode characters.


Unfortunately, some combinations of unicode characters are logically identical. For example, á (one character: Latin Small Letter a with Acute U+00E1) is the same as á (two characters: Latin Small Letter A U+0061, and Non-spacing Acute Accent U+0301). These combinations are not just similar - they are identical by definition.
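The equivalence described above can be checked with Python's standard unicodedata module. This is a minimal sketch: the variable names are only illustrative, and it assumes canonical normalisation (NFC/NFD) is the notion of "identical" meant here.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE (one code point)
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT (two code points)

# The raw strings are different code point sequences, so plain comparison fails.
raw_equal = precomposed == decomposed

# NFC composes and NFD decomposes; either form makes the two spellings compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(len(precomposed), len(decomposed))
print(raw_equal, nfc_equal, nfd_equal)
```

The strings only compare equal once both have been brought to the same normal form, which is exactly the step a compiler has to choose to perform or skip.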



Java does not normalise your code before compiling it, so two identifiers that contain equivalent but differently encoded Unicode sequences are considered different (ref: JLS 7 section 3.8).


$ cat U.java
public class U {
   static String \u00e1() { return "A WITH ACUTE"; }
   static String a\u0301() { return "A + NON-SPACING ACUTE"; }
   public static void main(String[] a) {
       System.out.println(á());
       System.out.println(á());
   }
}
$ javac U.java && java U
A WITH ACUTE
A + NON-SPACING ACUTE



We can define and use two functions called á and á and they are totally independent entities.


But don't do this.
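One way to catch this kind of clash before it causes confusion is to compare names after normalisation. The helper below is a hypothetical sketch, not part of any Java tooling: find_normalisation_collisions is an invented name, and it assumes the identifier names have already been extracted as strings.

```python
import unicodedata
from collections import defaultdict


def find_normalisation_collisions(names):
    """Group names that are written differently but match after NFC normalisation."""
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    # Keep only the groups where more than one distinct spelling was seen.
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Hypothetical list of method names, mirroring the U.java example.
method_names = ["\u00e1", "a\u0301", "main"]
collisions = find_normalisation_collisions(method_names)
print(collisions)
```

Any non-empty result means two identifiers would look the same to a reader while remaining distinct to the compiler.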


Python 3 also allows Unicode characters in identifiers, but it avoids the above problem by normalising them: all identifiers are converted to normal form NFKC while parsing (ref: Python 3 Reference, section 2.3):



$ cat U.py
#!/usr/bin/env python3

def á():
    print("A WITH ACUTE")

def á():
    print("A + NON-SPACING ACUTE")

á()
á()

$ hexdump -C U.py
23 21 2f 75 73 72 2f 62  69 6e 2f 65 6e 76 20 70  |#!/usr/bin/env p|
79 74 68 6f 6e 33 0a 0a  64 65 66 20 c3 a1 28 29  |ython3..def ..()|
3a 0a 20 20 20 20 70 72  69 6e 74 28 22 41 20 57  |:.    print("A W|
49 54 48 20 41 43 55 54  45 22 29 0a 0a 64 65 66  |ITH ACUTE")..def|
20 61 cc 81 28 29 3a 0a  20 20 20 20 70 72 69 6e  | a..():.    prin|
74 28 22 41 20 2b 20 4e  4f 4e 2d 53 50 41 43 49  |t("A + NON-SPACI|
4e 47 20 41 43 55 54 45  22 29 0a 0a c3 a1 28 29  |NG ACUTE")....()|
0a 61 cc 81 28 29 0a 0a                           |.a..()..|
$ ./U.py
A + NON-SPACING ACUTE
A + NON-SPACING ACUTE


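(Legend: in the hexdump, c3 a1 is the UTF-8 encoding of A WITH ACUTE (U+00E1), and 61 cc 81 is the UTF-8 encoding of A + NON-SPACING ACUTE (U+0061 followed by U+0301).)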
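The two byte sequences that appear in the hexdump can be reproduced by encoding each spelling as UTF-8. A small sketch, with illustrative variable names (bytes.hex with a separator needs Python 3.8 or later):

```python
precomposed = "\u00e1"   # A WITH ACUTE
decomposed = "a\u0301"   # A + NON-SPACING ACUTE

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(precomposed_bytes.hex(" "))  # c3 a1
print(decomposed_bytes.hex(" "))   # 61 cc 81
```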


The second definition overwrites the first because, after normalisation, the two names are identical. You can call the function using either spelling of its name.
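The overwriting can also be observed programmatically. The sketch below builds the same two definitions from escape sequences (so the source is unambiguous), executes them with exec, and inspects the resulting namespace; the variable names are assumptions made for illustration.

```python
PRECOMPOSED = "\u00e1"   # A WITH ACUTE
DECOMPOSED = "a\u0301"   # A + NON-SPACING ACUTE

# Two function definitions whose names differ only in their Unicode spelling.
source = (
    f"def {PRECOMPOSED}():\n"
    "    return 'A WITH ACUTE'\n"
    f"def {DECOMPOSED}():\n"
    "    return 'A + NON-SPACING ACUTE'\n"
)

namespace = {}
exec(source, namespace)

# Only one name survives, stored in its normalised (precomposed) form.
function_names = [name for name in namespace if name != "__builtins__"]
result = namespace[PRECOMPOSED]()

print(function_names == [PRECOMPOSED])
print(result)
```

Note that the normalisation applies to identifiers in source code. A plain dictionary lookup with the decomposed string, such as namespace[DECOMPOSED], is an ordinary string comparison and does not find the function.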


Both ways of working are scary, but I'd definitely choose the Python 3 way if I had to.


Originally posted at 2016-08-11 07:47:40+00:00. Automatically generated from the original post: apologies for any errors introduced.

