Monday, May 25, 2009
XML is not a universal data language.
After HTML became ubiquitous, its cousin XML was created and billed as a universal data language. Unfortunately, XML is not a data language.
If you have a struct in C++ or C# or Java, the struct definition is not sufficient for a third party to parse the struct as XML. Both when reading and writing, "glue code" is required. There is no specification for glue code - it is ad hoc. Each programming team maintains its own, which is generally mutually incompatible with every other team's.
Imagine if this were true of image formats. Imagine if, to read a GIF file, one needed not just a GIF parser, but also ad hoc "glue code" from the same team whose tools generated the image. You would not use such a format, because it is broken.
Why can't XML encode structures from programming languages without glue code?
It is because XML does not support numbers or arrays. This claim may seem outlandish, but it is not. The XML standard says nothing whatsoever about integers or floating-point numbers, or any other kind of number. The notion that ASCII "123" should correspond to integer value 123 is entirely absent from XML.
How does one represent the following C++ struct as XML?
struct foo{float bar[5];};
At one very large game publisher where I worked, it went like this:
<foo>
<bar>1<!-- oh yes, a comment is an array delimiter!-->2<!-- -->3<!-- -->4<!-- -->5</bar>
</foo>
This was clever because comments are XML nodes, and so the XML parser would present each number as a node. It was fragile because comments were no longer comments - comments had semantics! On the Internet, you can find dozens of other mutually incompatible representations, such as
<foo>
<bar><float>1</float><float>2</float><float>3</float><float>4</float><float>5</float></bar>
</foo>
...or...
<foo>
<bar>1 2 3 4 5</bar> <!-- present the array as raw ASCII and let each user parse ASCII by hand! -->
</foo>
It is literally impossible to support them all. Strangers who intend to share structs via XML will fail, unless they agree to one of these many variations beforehand. There exist auxiliary XML technologies for encoding structures, but how many can you name? How many have you actually used? Few map perfectly to the problem domain, and even fewer have robust implementations on many platforms.
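To make the incompatibility concrete, here is a rough Python sketch of the glue code each of the three layouts would need. It uses the standard library's xml.dom.minidom; the function and constant names are hypothetical, invented for this illustration, and the documents are the three variations shown above with line breaks removed.

```python
from xml.dom import minidom

# The three layouts from above, as literal documents.
COMMENT_DELIMITED = (
    "<foo><bar>1<!-- oh yes, a comment is an array delimiter!-->"
    "2<!-- -->3<!-- -->4<!-- -->5</bar></foo>"
)
CHILD_ELEMENTS = (
    "<foo><bar><float>1</float><float>2</float><float>3</float>"
    "<float>4</float><float>5</float></bar></foo>"
)
WHITESPACE_SEPARATED = "<foo><bar>1 2 3 4 5</bar></foo>"


def _bar_element(xml_text):
    """Parse the document and return its first <bar> element."""
    return minidom.parseString(xml_text).getElementsByTagName("bar")[0]


def read_comment_delimited(xml_text):
    """Glue for layout 1: every text node between comments is one number."""
    bar = _bar_element(xml_text)
    return [float(n.data) for n in bar.childNodes if n.nodeType == n.TEXT_NODE]


def read_child_elements(xml_text):
    """Glue for layout 2: every <float> child element holds one number."""
    bar = _bar_element(xml_text)
    return [float(e.firstChild.data) for e in bar.getElementsByTagName("float")]


def read_whitespace_separated(xml_text):
    """Glue for layout 3: split the element's text on whitespace."""
    bar = _bar_element(xml_text)
    return [float(token) for token in bar.firstChild.data.split()]


if __name__ == "__main__":
    print(read_comment_delimited(COMMENT_DELIMITED))
    print(read_child_elements(CHILD_ELEMENTS))
    print(read_whitespace_separated(WHITESPACE_SEPARATED))
    # The wrong reader does not fail loudly; it returns a truncated array.
    print(read_whitespace_separated(COMMENT_DELIMITED))
```

Each reader recovers all five values only from its own layout. Feeding the comment-delimited document to the whitespace reader yields a one-element list rather than an error, which is the kind of silent mismatch that strangers exchanging XML would run into.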
Because XML has no support for numbers, it's unclear even how to represent a float in the above struct. Each team must write a float parser in isolation, according to no particular standard. Will every parser deal with the following cases correctly?
| +nan | -nan | +inf | -inf | 3.9265e+2 |
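As an illustration of how easily a hand-written float parser can miss these cases, the Python sketch below compares a deliberately naive reader with Python's built-in float(). The naive reader is hypothetical, written only for this comparison. Python's float() is just one possible interpretation of the text; nothing in XML says it is the right one.

```python
import math
import re

# Hypothetical hand-rolled grammar: optional sign, digits, optional fraction.
_NAIVE_FLOAT = re.compile(r"[+-]?\d+(\.\d+)?")


def naive_parse(text):
    """Return a float if the text matches the naive grammar, else None."""
    if _NAIVE_FLOAT.fullmatch(text) is None:
        return None
    return float(text)


CASES = ["+nan", "-nan", "+inf", "-inf", "3.9265e+2"]

# What the naive reader makes of each case, next to Python's own reading.
naive_results = {case: naive_parse(case) for case in CASES}
builtin_results = {case: float(case) for case in CASES}

if __name__ == "__main__":
    for case in CASES:
        print(case, naive_results[case], builtin_results[case])
```

The naive reader rejects all five inputs, while float() accepts every one of them. Two teams making these two choices would disagree about whether the same document is even valid.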
A text markup language like XML is not enough to safely transmit data among strangers. A data language is required for this. Many exist. The oldest in wide use is the S-expression (sexp), which is decades old and derives from Lisp.
In the last five years, there has been a resurgence of interest in data languages. The related data languages JSON and YAML have been the most popular.
JSON is a subset of JavaScript, and almost a subset of Python. What in C++ looks like this:
foo f = { { 0, 1e10, 2, 3, 4 } };
Looks like this in JSON, YAML, JavaScript or Python:
{ "f" : { "bar" : [ 0, 1e10, 2, 3, 4 ] } }
Given the struct definition above, this is the only obvious way to write it as JSON. That is what a data language must do in order to be useful. XML cannot do this.
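For comparison, here is a minimal Python sketch that reads the JSON text above with the standard json module. No format-specific glue is involved: the array and the numbers come back already typed. The variable names are chosen only for this example.

```python
import json

# The JSON text from above, unchanged.
text = '{ "f" : { "bar" : [ 0, 1e10, 2, 3, 4 ] } }'

data = json.loads(text)
bar = data["f"]["bar"]

# Writing the value out and reading it back gives the same structure.
round_tripped = json.loads(json.dumps(data))

if __name__ == "__main__":
    print(bar)
    print(round_tripped == data)
```

Here bar is an ordinary Python list of numbers, with 1e10 parsed as a float and the other elements as integers, and the round trip compares equal to the original.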
We expect this much from image formats and movie formats. We should expect this of data languages, too.