Tue, 25 Sep 2007
Syntax Error in a ... Comment?
A follow-up to yesterday's post about object-oriented code in Java. In fact, I have been considering entering this year's FIbot competition. The only drawback is that it is in Java. I tried to compile their sample code, and here is another blog post on how Java is bad:
[javac] Tile.java:4: unmappable character for encoding UTF8
[javac] * Created on 6. ?ervenec 2007, 18:14
[javac]                 ^
So even though Java people reinvent their own XML-happy wheel (ant; see the further reading on replacing make), they still cannot XML-encode the information about the charset of the source code into their Makefile-alike. Or better, into the source code itself (Perl has been doing this for years; you can even use multiple charsets in different parts of the source file, just notify the parser with the appropriate "use encoding" pragma).
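The failure above can be reproduced with Java's own charset API. A minimal sketch, assuming the comment contains the Czech letter "č" stored as the single ISO-8859-2 byte 0xE8 (the byte values here are my illustration, not the actual contents of Tile.java):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    public static void main(String[] args) throws Exception {
        // 'č' from "červenec", as a single ISO-8859-2 byte:
        byte[] comment = { (byte) 0xE8 };
        try {
            // A UTF-8 decoder in strict mode rejects the lone 0xE8,
            // which in UTF-8 can only start a multi-byte sequence:
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(comment));
            System.out.println("decoded fine");
        } catch (CharacterCodingException e) {
            System.out.println("unmappable/malformed under UTF-8");
        }
        // The very same byte is a perfectly valid ISO-8859-2 character:
        System.out.println(new String(comment, "ISO-8859-2")); // "č"
    }
}
```

The same byte sequence is valid in one charset and malformed in the other, which is the whole dispute in miniature.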
Anyway, this is the first language I have seen which barfs with a syntax error inside a comment. Talk about brain damage.
The absurdity of the situation is even greater when you realize that this comment (stating the file creation date in Czech) is one of those superfluous comments, presumably created by some stupid IDE: it does not add any useful information (such information would be valuable inside a version control system, maybe, but not in the source code itself), and it apparently is a syntax error in some locales.
BTW, anybody interested in forming a FIbot team with me? Preferably someone who can tolerate Java.
9 replies for this story:
Milan Zamazal wrote:
Citing from Java language specification (http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1): "Programs are written using the Unicode character set." I really, really don't understand why: 1. Unicode is insufficient; 2. character set used in the source code should be declared outside the source code file; 3. the character set should be declared at all when it is already specified in the language definition; 4. the parser should accept invalid input.
"Unicode character set" does not specify which one (UTF-8? UTF-7? UTF-16? UCS-32?). ad 1: allowing different character sets inside (different parts of) the source code file increases flexibility. ad 2: I would accept the charset being declared even in the source code file, but even this was not the case. ad 3: it isn't - see above. ad 4: there definitely are _some_ parsers which accepts this input, the parser which the authors of the code presumably use (which I suspect is the official one - JDK). And also "be conservative in what you send, and be liberal in what you accept".
Milan Zamazal wrote:
As for the character coding, see the link I provided. Ad 1: (i) I repeat: I don't understand why Unicode is insufficient; (ii) actually mixing codings within one file is a completely mad idea (why should one do it?? which editor can handle it?? how does such a mess interact with grep and other tools??); ad 4: I don't understand why one should create invalid source code input and then expect the parser to perform guesswork instead of rejecting it. Bashing Java instead of pointing out its real flaws makes no sense.
Yenya wrote:
The link you mentioned says nothing about the actual encoding of the source code (section 3.1 mentions UTF-16, but not as a requirement for the source code; it merely describes how UTF-16 works). Ad 1: Unicode is sufficient, but sometimes it is easier to mix charsets inside one file (think source code generated from mixed sources; yes, it _can_ be recoded, but sometimes it is easier not to, even though grep and friends then do not work). However, this is not relevant to the blog post. Ad 4: Java definitely _is_ stupid for allowing syntax errors inside comments. Either way, it is the fault of either the authors of the code, or the authors of the reference implementation (which I suppose the authors use) for allowing such faulty code (if it is faulty) to be compiled.
Milan Zamazal wrote:
This is not a syntax error, it's invalid input (input data can't be decoded).
Yenya wrote:
Invalid input counts as a syntax error as well. NB: I have re-read the whole discussion and discovered that I forgot to add an important piece of information: the javac compiler's behaviour depends on the LC_CTYPE locale. When I compile the above code with LC_CTYPE=cs_CZ.ISO-8859-2, it compiles the source code correctly. So is the developer's compiler out of spec for accepting non-UTF-8 input, or is my compiler out of spec for not accepting it (or for depending on the system locale)? I have always thought "source code" meant something like "a portable and human-readable way of describing the machine instructions". Apparently, in Java this is not the case.
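A sketch of why the same compiler behaves differently on different machines: without an explicit encoding, Java falls back to the platform default charset, which on Unix is derived from the locale (LC_CTYPE/LANG). The byte value below is my illustration:

```java
import java.nio.charset.Charset;

public class LocaleDefault {
    public static void main(String[] args) throws Exception {
        // This is what implicit decoding falls back to, and it is
        // exactly what the locale environment influences:
        System.out.println("platform default: " + Charset.defaultCharset());

        byte[] czech = { (byte) 0xE8 };                      // 'č' in ISO-8859-2
        // An explicit charset gives a locale-independent result:
        System.out.println(new String(czech, "ISO-8859-2")); // "č"
        // new String(czech) would instead decode with the default
        // charset above, varying with the machine's locale.
    }
}
```

The same mechanism is why compiling under LC_CTYPE=cs_CZ.ISO-8859-2 succeeds while a UTF-8 locale fails.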
Yenya wrote: Perl
FWIW, perl accepts invalid characters in comments without a problem. Which is how it should be: a comment is something that the compiler should not look at.
Milan Zamazal wrote:
The compiler has to look at comments too. There is no other way to find where they end :-)
Yenya wrote:
Well, the end-of-comment characters are ASCII (either a newline or a */ sequence), and since UTF-8 is ASCII-compatible, there is no need for the compiler to parse UTF-8 inside the comment.
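The claim above can be sketched as code: every byte of a UTF-8 multi-byte sequence (and every ISO-8859-2 accented letter) is >= 0x80, so '*' (0x2A) and '/' (0x2F) can never occur inside one, and a scanner can find the ASCII */ terminator on raw, undecoded bytes. This is a hypothetical scanner to illustrate the point, not javac's actual implementation:

```java
public class CommentScan {
    // Return the index just past "*/", or -1 if the comment never ends.
    // Works on undecoded bytes: no byte of a multi-byte UTF-8 sequence,
    // and no ISO-8859-2 accented letter, can equal '*' or '/'.
    static int skipBlockComment(byte[] src, int from) {
        for (int i = from; i + 1 < src.length; i++) {
            if (src[i] == '*' && src[i + 1] == '/') return i + 2;
        }
        return -1;
    }

    public static void main(String[] args) {
        // "/* č */" with 'č' as the ISO-8859-2 byte 0xE8 (invalid UTF-8):
        byte[] src = { '/', '*', ' ', (byte) 0xE8, ' ', '*', '/' };
        System.out.println(skipBlockComment(src, 2)); // 7: comment skipped fine
    }
}
```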