Fri, 07 Oct 2005
The joy of XML
I am reworking a mechanism for exporting data from IS MU to an external database. Because the data form a tree-like structure, I have decided to use XML as the data-interchange format. I knew the basic facts about XML, but this was the first time I got my feet wet with it. So far it has been a pretty unpleasant experience.
First, it is necessary to describe the structure of the data. The XML world offers about half a dozen different, incompatible ways to do it, for example DTD, XML Schema, or RELAX NG. Each of those technologies shows pretty well the main problem of declarative definitions of anything: they are either not strong enough to express what you want (DTD, server-side includes in HTML, etc.), or they are bound to contain something close to a programming language (PHP has evolved almost into a programming language, and XML Schema, for example, lets you write regular expressions or specify the minimum and maximum value of an integer). Why are they trying to be both declarative and expressive enough? The result simply has to be ugly, yet still not expressive enough for some cases.
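To make the "regular expressions and minimum/maximum values" point concrete, here is a minimal sketch of what such constraints look like inside an XML Schema. The type names StudyYear and CourseCode are made up for illustration and are not taken from the actual IS MU export. The Python code does not validate anything (the standard library has no XML Schema validator); it only reads the schema as an ordinary XML document and lists the constraints it finds.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; the type names are invented for this example.
XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="StudyYear">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="6"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="CourseCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}[0-9]{3}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

# ElementTree reports namespaced names as "{namespace}local".
XS = "{http://www.w3.org/2001/XMLSchema}"


def list_facets(xsd_text):
    """Return {type name: [(facet, value), ...]} for every simpleType."""
    root = ET.fromstring(xsd_text)
    facets = {}
    for simple_type in root.iter(XS + "simpleType"):
        restriction = simple_type.find(XS + "restriction")
        facets[simple_type.get("name")] = [
            (facet.tag[len(XS):], facet.get("value")) for facet in restriction
        ]
    return facets


FACETS = list_facets(XSD)
print(FACETS)
```

Even this small fragment mixes a purely declarative shell with a regular expression, which is the tension described above.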
I have decided to go with XML Schema. So the next question is what a valid XML Schema is. Interestingly enough, XML Schema can itself be defined as an XML schema. So what happens if you try to check whether the XMLSchema.xsd file is a valid XML Schema? I have tried the XML::Validator::Schema Perl module and some web-based validator which I cannot remember now. Both had problems with the validity of the XML Schema definition.
My other complaint is that XML is too verbose. Why should the tag name be repeated at the end of the element? And why do they often use the namespace prefix in the schema definition, even though it is the sole namespace in the whole document? This makes the whole XML file both harder to write and harder to check for syntactic and logical errors. And why is the namespace usually labeled as a URL, even though it is an arbitrary string and the referenced URL does not even have to exist? It looks like a URL, but it is not. Another point against readability. And after all, they do not use the URL directly; instead it is immediately mapped to an abbreviation, like xsi: or something like that.
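A small sketch of the namespace behaviour described above, using Python's standard ElementTree parser. The namespace name and the element names are hypothetical. The .invalid domain is reserved and can never resolve, which shows that the parser never fetches the "URL". The sketch also shows that the prefix is only a local abbreviation, and that the name is compared as a plain string.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name; nothing is ever downloaded from it.
NS = "http://example.invalid/export"

# Same namespace, once bound to a prefix and once declared as the default.
DOC_PREFIXED = f'<ex:student xmlns:ex="{NS}"><ex:name>Jan</ex:name></ex:student>'
DOC_DEFAULT = f'<student xmlns="{NS}"><name>Jan</name></student>'
# One extra character in the name gives a different namespace.
DOC_SLASH = f'<student xmlns="{NS}/"><name>Jan</name></student>'


def expanded_tags(xml_text):
    """Return the parser's view of each element name: '{namespace}local'."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


TAGS_PREFIXED = expanded_tags(DOC_PREFIXED)
TAGS_DEFAULT = expanded_tags(DOC_DEFAULT)
TAGS_SLASH = expanded_tags(DOC_SLASH)

print(TAGS_PREFIXED)
print(TAGS_PREFIXED == TAGS_DEFAULT)  # the prefix itself carries no meaning
print(TAGS_SLASH == TAGS_DEFAULT)     # the name is an opaque string
```

The first two documents are equivalent to the parser, so the prefix in a single-namespace document is pure typing overhead. The third is a different vocabulary, although a web server would most likely treat both addresses the same.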
Another problem is the XML-handling libraries. There are many, such as libxml2 or expat (not to mention Java-based solutions). And each of them has its own Perl front-end, usually with an API mapped 1:1 from C instead of an API designed for ease of use. For example, it requires a lot of not-so-pretty code to let your element handler throw an exception that is then reported together with the line number in the source XML file.
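For illustration, here is roughly what the "exception plus line number" plumbing looks like with Python's binding to expat. This is not the Perl code from the export itself; it is a sketch of the same idea in another language, and the names HandlerError, with_line_number and student_ids, as well as the sample document, are assumptions made for this example.

```python
import xml.parsers.expat


class HandlerError(Exception):
    """An application-level error tagged with its line in the XML source."""

    def __init__(self, line, cause):
        super().__init__(f"line {line}: {cause}")
        self.line = line


def with_line_number(parser, handler):
    """Wrap an expat callback so any exception it raises carries the line."""
    def wrapped(*args):
        try:
            return handler(*args)
        except Exception as exc:
            # Inside a callback, CurrentLineNumber refers to the event
            # that is currently being handled.
            raise HandlerError(parser.CurrentLineNumber, exc) from exc
    return wrapped


def student_ids(xml_text):
    """Collect the numeric id of every <student> element (made-up format)."""
    ids = []

    def start_element(name, attrs):
        if name == "student":
            ids.append(int(attrs["id"]))  # ValueError or KeyError on bad input

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = with_line_number(parser, start_element)
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return ids


SAMPLE = """<students>
  <student id="1"/>
  <student id="x2"/>
</students>"""

REPORTED_LINE = None
try:
    student_ids(SAMPLE)
except HandlerError as error:
    REPORTED_LINE = error.line
    print(error)
```

The wrapper has to be applied to every callback separately, because the line number is captured while the callback is still running. That per-callback wrapping is the kind of boilerplate a friendlier API could hide.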
That said, XML also has a few good features: the concept of well-formedness, the character set specified inside the file itself, XSLT transforms, the fact that XML Schema and XSLT definitions are themselves XML files, etc. However, I can imagine a less verbose and less bloated technology that would serve the same purposes as XML.
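Two of these features, well-formedness and the in-file character set, can be checked without any schema at all. The following is a minimal sketch using Python's standard library; the helper name check_well_formed and the sample data are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(data):
    """Return (True, root) if data parses, else (False, (line, column))."""
    try:
        return True, ET.fromstring(data)
    except ET.ParseError as error:
        return False, error.position


# The declaration names the encoding, so the byte 0xE9 is decoded as "é".
LATIN1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Ren\xe9</name>'
# Same bytes without a declaration: UTF-8 is assumed and 0xE9 is invalid there.
NO_DECLARATION = b"<name>Ren\xe9</name>"
# Overlapping tags are rejected no matter what vocabulary is in use.
OVERLAP = "<a><b></a></b>"

for sample in (LATIN1, NO_DECLARATION, OVERLAP):
    ok, result = check_well_formed(sample)
    print(ok, result.text if ok else result)
```

Any XML parser can make these checks on a document it has never seen before, which is a large part of what makes the format usable for interchange.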
4 replies for this story:
I'm not an XML expert, but I have an opinion ;). "Why should the tag name be repeated at the end of the element? And why do they often use the namespace prefix in the schema definition, even though it is the sole namespace in the whole document? This makes the whole XML file both harder to write..." AFAIK the original idea was that XML should not be handwritten but machine-generated (by programs or by domain-specific editors, xmlmind for example). "And why is the namespace usually labeled as a URL, even though it is an arbitrary string and the referenced URL does not even have to exist? It looks like a URL, but it is not." Maybe because a URL is a nice, unique way to structure this information. And as a bonus, you can make your schema accessible via the web and validate against it this way.
Well, I think it is better to write, for example, an XML schema by hand than to go looking for some half-baked OS-specific solution (if there is any). Most of the namespaces I have seen (for example those at w3cschools.com) have a name in URL form, but the URL itself is either not accessible or does not contain the XML schema at all. They use the URL as a label, but they do not put anything meaningful at the URL itself. If you want a unique prefix or whatever, just use your DNS domain name. Why add the "http://" prefix in front of it when it is not really a URL that contains anything meaningful?
Yenya wrote: One more thought
If you think XML was not made to be handwritten (or human-readable), why is it a text format? Why not use something truly universal, yet compact, such as ASN.1/DER/BER? It is an ideal format for vendor-neutral data interchange, provided you don't want it to be human-readable.
"If you think XML was not made to be handwritten (or human-readable), why is it a text format?" Maybe because experience shows that textual protocols are a good way to do things. Most of the time you don't telnet to port 80 either ;). Also notice that handwritten and human-readable are not the same thing.