shibboleth-dev - Concerning the ?xml thing
- From: "Howard Gilbert" <>
- To: <>
- Subject: Concerning the ?xml thing
- Date: Thu, 26 May 2005 22:00:30 -0400
[This is a rather long rant about some background information that everyone
who works with XML should know (to avoid magic) but most don't. It is not a
complaint about anything or anyone. Everything is fine and none of this
matters much. But in the few cases where this creeps in, or if someone
decides to "correct" what is there, this is something people should know. I
am being a bit defensive to explain things fully because there is a lot of
misinformation, some printed in books.]
It is generally recommended, but not required, that xml files begin with
some version of the magic incantation
<?xml version="1.1" encoding="ISO-8859-1"?>
This is sufficiently obscure that most people don't notice it. It is not
always included in our files, and when it is, it is not always coded
correctly. What follows is a background explanation of this declaration.
First, if we had it to do all over again, we would certainly improve the
information stored in the directories of OS file systems. Most systems don't
even distinguish text from binary files. Text files are just stored as a
string of bytes without any encoding information. Of course, for fortunate
Americans, it doesn't matter. Anything we want to express can be said quite
nicely in 7 bit ASCII.
Even then there are nuanced differences between different types of ASCII
files. Is a *.java source written to JDK 1.4 or J2SE 5 standards? Since the
filesystem doesn't say, we have to hope that we use the right compiler
thanks to some associated build file. In languages where this sort of thing
matters, there is typically a "pragma" statement to express it. Well, that
is what <?xml?> is: it is the pragma of XML.
Any file ending in *.xml is presumably an XML file. However, there are two
levels of the standard. XML 1.1 adds namespaces and corrects some errors
where XML 1.0 failed to conform to the rules of the Unicode/ISO-10646
character sets (which were in turn inherited from all the other character
set standards going back to ISO646/ASCII).
Unfortunately, a number of Blog-Morons who know nothing about character
standards decided to bad-mouth XML 1.1 declaring that it was some kind of
cave-in to pressure from IBM (who was just insisting that something built on
the Unicode character standard had to follow the Unicode character
standard). The end result has been a lot of bad advice telling people to say
that their XML is 1.0. Well, ours isn't 1.0, because we use Namespaces and
DOM 3. This is XML 1.1, and it ought to be declared as such if we want to be
accurate. I believe that no current editors or tools will complain, and if
you want to use 1.0 for old time's sake I have no objection. But anyone who
says 1.0 for ideological or political reasons is dead wrong.
[If anyone really wants to discuss the nuances of the C1 control character
quadrant and the history of alternative line end sequences, we can take that
discussion off-line.]
The Encoding part is even worse, but the story is a bit longer.
In 1963 there was an ASCII standard for 6 bit graphic characters and 32
control characters in a 7 bit unit. Five years later the lower case letters
were added.
For several decades, other countries were required to substitute their own
characters for the particular ASCII codes that the ISO646 standard called
"national use". Characters like ~`@#$%^\ could be replaced by whatever the
French or Spanish needed.
An intermediate standard defined a mechanism for transmitting 8 bit
characters over a 7 bit communications line (for ASCII terminals connected
to timesharing systems). This mechanism forced all subsequent standards to
reserve the first 32 additional code points (0x80 to 0x9F) as a second
control area, bringing the total number of control characters to 64 (the C0
original and the new C1 set), and the graphic characters were then two sets
of 94 (leaving 0x7F or DEL as reserved and putting 0xFF forever in
controversy).
By the late 80's it had become clear that there were 8 bits to a character
instead of 7. The ISO-8859 family of 8 bit character sets was developed.
They were now forced to accept the previous standard and reserve 0x80 to
0x9f as control characters even though there was no rational reason for this
in modern data communications. The extra 94 characters were enough to handle
one additional alphabet or the accented characters of a group of nations.
8859 versions were developed for Greek, Hebrew, Arabic, and Russian, while
the default, 8859-1, supported "Latin-1" for the Western European area. [After
fitting in the obvious countries, there was room for either Iceland or
Turkey. So of course, they decided to include Icelandic (pop. 250,000) and
leave out Turkey (pop. 66 million).]
Chinese, Japanese, and Korean have too many characters to fit in any 8-bit
set. They developed a wide range of legacy double-byte encoding systems for
their data.
Unfortunately, there is nothing in the Unix or even the NTFS file system to
store the text character encoding. Before the Internet this wasn't a big
problem because each computer was located somewhere and its owners probably
stored all their files in one encoding they adopted for the local language.
However, when files are shipped across national boundaries we can make no
assumptions about encoding standards, so HTTP included a charset field in
its Content-Type header. Since no file system allows this decision to be
made file by file, common practice is for the Web server to be configured to
slam onto the end of everything it sends out a constant charset value that
represents the default encoding of all the files on the server, based on
local conventions and the tools and editors that were used to generate the
files.
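For example, a server set up that way might attach something like the
following to every response, whatever the individual file actually contains
(the value shown is just a typical default, not anything specific to our
servers):

Content-Type: text/html; charset=ISO-8859-1

In Apache this is commonly done with a single server-wide directive such as
AddDefaultCharset ISO-8859-1.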
Then Unicode came along. The good news is that it provided code values for
all of the alphabets and most of the Far East languages. The bad news is
that it required multiple bytes per character. So encoding schemes like
UTF-8 were developed so that ASCII characters could still be stored and
transmitted in one byte, while less commonly encountered characters (at
least from an American point of view) took more bytes to store or send.
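To make the byte arithmetic concrete, here is a tiny illustration (my own
example, not project code; the strings and class name are invented) of how
the counts change once you leave ASCII:

import java.io.UnsupportedEncodingException;

public class EncodingSizes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String plain = "metadata";              // pure ASCII
        String accented = "r\u00e9sum\u00e9";   // "resume" with two accented e's
        // ASCII text is one byte per character in every one of these encodings.
        System.out.println(plain.getBytes("US-ASCII").length);      // 8
        System.out.println(plain.getBytes("UTF-8").length);         // 8
        // The accented characters take one byte in ISO-8859-1 but two in UTF-8.
        System.out.println(accented.getBytes("ISO-8859-1").length); // 6
        System.out.println(accented.getBytes("UTF-8").length);      // 8
    }
}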
All of our project's files are ASCII. We have no examples with French
accents, let alone Hebrew. Now the good thing about ASCII is that it is the
same 7-bit character set no matter what you declare the Encoding to be. All
encodings have been carefully defined so that if all the bytes of data are
in the code range 0x00 to 0x7F, then each byte is an ASCII character.
Strictly speaking then, we are claiming more than is true if we mark any XML
file with an encoding more powerful than ASCII. However, there is no problem
extending this to the 8-bit Western European default of ISO-8859-1, because
that is the default of most editors that deal with more than just ASCII.
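That claim is easy to verify; the following quick check (again my own
illustration, with an invented class name) shows that a pure-ASCII string
comes out byte-for-byte identical under US-ASCII, ISO-8859-1, and UTF-8:

import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) throws Exception {
        // The declaration itself is pure ASCII, so every encoding agrees on its bytes.
        String decl = "<?xml version=\"1.1\" encoding=\"ISO-8859-1\"?>";
        byte[] ascii = decl.getBytes("US-ASCII");
        System.out.println(Arrays.equals(ascii, decl.getBytes("ISO-8859-1"))); // true
        System.out.println(Arrays.equals(ascii, decl.getBytes("UTF-8")));      // true
    }
}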
However, if you declare that your Encoding is UTF-8, as a lot of XML files
do (not ours), then you are on dangerous ground. This is only guaranteed
to be accurate if everyone uses the same multibyte character editors and
file management tools, and that simply doesn't happen in this or any other
programming project.
Here is how Encoding is supposed to work. There is a blob of bytes on disk
or coming in from the network. It is text in some encoding, but we don't
know which one, because all the stupid file systems don't have a way to
specify encoding as a file attribute. So we read the blob with an editor,
XML parser, Web Server, Web Browser, or whatever tool you like. If the file
ends in *.xml, then the tool reads the first set of bytes from the disk or
network as 8-bit bytes. Is the first byte an ASCII "<" (or are the first two
bytes a "<" in a UCS Unicode encoding)? Is the next byte "?", and so on
through "x", "m", and "l"? If so, then this is a <?xml declaration at the
start of the file. Now the good news is that there can be no foreign
language characters in the rest of this declaration up to the ending ">". So
any tool can continue reading one byte at a time through to the end of the
encoding= attribute. At this point the tool stops and looks at the encoding
value. If it is UTF-8, then it switches from ASCII to multibyte sequences
according to the UTF-8 standard. After this point, any character whose high
order bit is on will be turned into a multibyte Unicode character.
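A stripped-down version of that dance might look like the sketch below. This
is only an illustration of the idea, not what Xerces or any real parser does
(a real parser also handles byte-order marks, UTF-16, and whitespace around
the "="), and the class and method names are mine:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class PrologSniffer {

    // Read the <?xml ... ?> declaration one byte at a time as ASCII and pull
    // out the encoding= value, or return null if there is no declaration or
    // no encoding attribute.
    static String sniffEncoding(InputStream in) throws IOException {
        StringBuilder prolog = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && prolog.length() < 200) {
            prolog.append((char) b);   // every byte of the declaration is plain ASCII
            if (b == '>') {
                break;                 // end of the <?xml ... ?> declaration
            }
        }
        if (!prolog.toString().startsWith("<?xml")) {
            return null;
        }
        int i = prolog.indexOf("encoding=");
        if (i < 0) {
            return null;
        }
        char quote = prolog.charAt(i + 9);                        // opening ' or "
        int end = prolog.indexOf(String.valueOf(quote), i + 10);  // matching close quote
        return prolog.substring(i + 10, end);
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = new BufferedInputStream(new FileInputStream(args[0]));
        String declared = sniffEncoding(raw);
        raw.close();

        // Re-open the file and read the whole thing with the declared encoding,
        // the way a parser switches decoders once it has seen the declaration.
        Charset cs = Charset.forName(declared != null ? declared : "UTF-8");
        Reader reader = new InputStreamReader(new FileInputStream(args[0]), cs);
        System.out.println("Declared encoding: " + declared + ", reading as " + cs);
        reader.close();
    }
}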
In a rational world, we would know what the encoding of the file is before
we open it. Since this is not universally possible, the initial declaration
gives people in other countries a chance to declare in the initial bytes of
the file what encoding mechanism their editors and tools typically use.
Unfortunately, a lot of XML is written using ordinary text editors like vi
and Notepad that know nothing about XML and nothing about encodings. Even
when an editor can handle encodings, unless the <?xml declaration is
GENERATED by the editor rather than typed by the person doing the editing,
getting the header to match the editor settings is a manual chore.
So specifying encoding="UTF-8" in your XML is claiming more than you can
actually deliver. Yes, it is true that if the world were perfect, and all the
editors and tools did the right thing, this would be the best choice. In
practice, if anyone accidentally drops a high-bit character into the file,
this declaration will screw things up in ways that are hard to diagnose
(particularly if the high-bit character is homographic to a standard ASCII
character, making the difference almost impossible to notice).
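Here is a small demonstration of that failure mode (my own example; the
class name and byte values are just illustrative). A single ISO-8859-1 byte
such as 0xE9, an accented e, is not a complete UTF-8 sequence, so a lenient
UTF-8 decoder silently substitutes a replacement character while a strict
one rejects the input outright:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrayByte {
    public static void main(String[] args) throws Exception {
        // "caf" plus an accented e, written out as Latin-1: 0x63 0x61 0x66 0xE9
        byte[] latin1 = "caf\u00e9".getBytes("ISO-8859-1");

        // A lenient decode quietly turns the bad byte into the replacement character.
        System.out.println(new String(latin1, "UTF-8"));

        // A strict decoder reports the problem instead of hiding it.
        CharsetDecoder strict = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(latin1));
        } catch (CharacterCodingException e) {
            System.out.println("Rejected: " + e);   // MalformedInputException
        }
    }
}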
Thus a version of 1.1 and an encoding of ISO-8859-1 is technically the
correct thing to say if you code the <?xml?> declaration at all.