shibboleth-dev - XML API changes in OpenSAML and Shibboleth about to be checked in
- From: "Howard Gilbert" <>
- To: <>
- Subject: XML API changes in OpenSAML and Shibboleth about to be checked in
- Date: Mon, 24 Jan 2005 21:16:31 -0500
JAXP 1.3 Changes to OpenSAML and Shibboleth
Feedback from the problems uncovered deploying XML applications drives the
evolution of the W3C standards. New versions of the standards solve real
problems. Thus the migration of code to new versions of XML support may be
driven by necessity rather than a desire to pick up neat new features.
Applications that are centered entirely on XML and are controlled by external
schemas and specifications (such as OpenSAML and Shibboleth) are forced to
keep up to date. Things would be simpler if the W3C produced new standards
that were compatible with their previous standards. Unfortunately, they have
adopted a policy of replacing the definition of each interface with a new
version that keeps the same interface name but adds methods. This means that
the bundle of interfaces (associated with one version of the standard) is
tightly coupled to a separate Jar file containing versions of the
implementing classes that support the new methods.

One of
the basic programming interfaces is the DOM (Document Object Model). The DOM
interfaces are defined by packages of the form org.w3c.dom.* and define a set
of objects and methods that provide operations on the objects. A DOM 2
standard was developed years ago, and DOM 3 component standards are now
released. Driven by requirements emerging from the SAML standard of XML
syntax, OpenSAML and Shibboleth require DOM 3 support.

The
Apache Xerces project was formed from submissions from IBM (XML4J) and Sun
(ProjectX). It represents a common codebase to which all parties can submit
bugfixes and new features. Apache distributes versions of Xerces directly,
but Sun distributes a version of the same code with slightly different
packaging.

Starting with Java 1.4, Sun decided that XML was so important that it should
be a standard part of the J2SE runtime library. However, Sun's standards
require that all XML requests filter through the JAXP API, just as all
database requests go through JDBC and all directory requests go through JNDI.
The Apache code contains some programming interfaces with concrete classes
left over from the old IBM XML4J days. So although Sun's distribution is
based on Apache Xerces, they tend to rename some of the classes to require
everyone to go through the public JAXP interface.

Unfortunately,
Sun decided to freeze the features and standards at major release boundaries.
When Java 1.4.0 came out in Feb. 2002, the standards were DOM 2 and JAXP 1.2.
So although bugs were fixed, these versions of the standards remained the
basis for the Sun library through releases of 1.4.1 and 1.4.2 (up to
1.4.2_06). The only way to override this type of built-in function is to use
the "endorsed" library function of Java, and the only other version of code
reasonably available was the distribution from Apache.

The current version of the Xerces XML support distributed by Apache contains
interface definitions based on the old DOM 2 standard, and classes that
implement that standard. Apache provides an Ant build option to create a
version of its current Xerces release with the DOM 3 interfaces and
implementations, but it regards a library built this way as experimental
Beta code. The plan is to convert to DOM 3 support in the 2.7.0 release of
Xerces, which currently has no planned release date.

In the
Summer of 2004, Sun finally released a new major release. Designated as 1.5
under the old system, or as J2SE 5.0 in a new naming convention, this release
includes as standard both support for DOM 3 and JAXP 1.3. In November they
also released a version of the same XML library for use on earlier Java
releases.

So at this moment, Sun has leapfrogged ahead of Apache. Eventually Apache
will release 2.7.0 and catch up, but even then the Sun version of the code
will have the advantage that it is built into Java (at least if you are
running J2SE 5.0). It provides all the function needed for OpenSAML and
Shibboleth, and some useful new features, but will require some conversion.

The
proposal is to convert the OpenSAML and Shibboleth projects to use the new
Sun version of these libraries rather than the older Apache version. If a
customer is using J2SE 5.0 as his JRE, then no libraries are needed and
everything will work with just the standard Java runtime. For older JREs, the
five Sun jar files replace the previously distributed two Apache jar files in
the /endorsed library.

This requires converting some existing code to use the JAXP factory standard
instead of using "new" to directly create instances of Apache classes.

There is a major benefit to the conversion, because XSD schema files used
extensively in OpenSAML and Shibboleth become first class programming
objects. The changes to the code have been made, and this paper explains them
before they are checked in.

The Libraries
Currently, OpenSAML and Shibboleth ship with two Jar files
(dom3-xercesImpl-2.x and dom3-xml-apis-2.x, where "x" is somewhere between
"5.0" and "6.2" depending on what level the authors prefer to support right
now).

A customer who uses J2SE 5.0 as his JRE (and a Servlet container such as
Tomcat 5.5 that supports it) has the desired level of XML support and
requires no libraries.

A customer using some version of Java 1.4.x requires the Sun distribution of
new XML support for old Java systems. If this is checked into the current
OpenSAML and Shibboleth projects, the /endorsed directory would now have five
Jar files replacing the previous two jar files:
Essentially, Sun breaks the Apache xml-apis library of interfaces into three
separate Jar files representing the three different interface standards (DOM,
SAX, and JAXP) from three separate organizations. This seems like a sensible
piece of housekeeping.

The implementing classes (org.apache.*) then have their packages renamed to
com.sun.org.apache.*. This causes some existing OpenSAML and Shibboleth code
to break, and that is exactly what Sun intends. Direct use of Apache
implementing classes bypasses JAXP. It is essentially the same thing as using
an Oracle database class directly instead of going through JDBC. Since Sun
has to maintain the same classes as Apache, they did not want to change the
source. However, by renaming the packages they could be sure that any code
that makes direct use of an Apache class would have to be converted.

DocumentBuilderFactory (not "new DOMParser")
The Sun approach to functional libraries is to create a factory interface
with pluggable providers. JAXP is the factory interface for XML. Sun provides
a set of implementing classes, but I suppose you might find an alternate
source of classes to implement one or more of the XML standards.

Apache used to expose some concrete classes to perform specific functions.
Some Shibboleth and OpenSAML source includes the following statement to
define the concrete class that provides XML to DOM parsing:

    import org.apache.xerces.parsers.DOMParser;
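For contrast with that import, here is a minimal sketch of the same XML to DOM
parse written against the JAXP factory interface only, with no org.apache.*
class named anywhere. The class name JaxpParseSketch and the sample document
are made up for illustration; the converted OpenSAML and Shibboleth code may
well differ in its details.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: parses XML through JAXP only, whichever provider is installed. */
final class JaxpParseSketch {

    /** Parses a string of XML into a DOM tree using the parser that JAXP locates. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // SAML depends on namespaces, and the JAXP default is namespace-unaware.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a:root xmlns:a='urn:example:a'><a:child/></a:root>");
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```

Because nothing here names an implementation class, the same source should
compile against either the Apache or the Sun packaging.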
Sun doesn't want you to use direct classes, so it renamed the packages. There
is still a DOMParser class, but when Sun distributes it, it is
com.sun.org.apache.xerces.internal.parsers.DOMParser. If you convert from
Apache to Sun libraries, then the old import statements and direct use of
DOMParser and a few other concrete classes will not compile.

To correct such statements, replace the direct use of classes with the JAXP
factory interface. The first step is to create a DocumentBuilderFactory
object. This object is then parameterized with information about the type of
XML parser you want (especially the XSD Schemas it should use). Then, the
DocumentBuilderFactory can be called to create one or more DocumentBuilder
objects. DocumentBuilder is almost the same as DOMParser, though a few method
details are different.

There is a similar Transformer factory interface to get an object that will
convert DOM back to a string of characters (serialize the XML).

Although there are some rough one-to-one translations between old classes and
new factories, the details of methods and properties are important. The
existing code contains some optimizations, and the same things need to be
expressed with a new semantic.

XSD Schemas
XSD Schemas were added to XML after Xerces and its predecessors were already
distributed. As a result, versions of Schema support prior to the current Sun
JAXP 1.3 API give the clear impression that they were added as an
afterthought and squeezed in wherever they could go.

There are two views of Schemas. In one view, useful for a generic XML editor,
the XSD Schema file is identified by an attribute in the root XML element of
the file being parsed. This allows processing of any XML file referencing any
Schema. However, OpenSAML is only interested in XML from the SAML grammars,
and Shibboleth is only interested in SAML and in the XML formats of its own
configuration files. In both cases a set of Schema files is known in advance
and is distributed in the /schemas directory of the application.

Old Xerces supported two API techniques. One, suited for the first case where
the XSD schema file name was part of the XML, used a SAX interface called the
EntityResolver. This callback routine received the string from the XML, and
located the XSD schema file to which the string referred. An alternate
approach was available in which an array of open XSD Schema files was passed
to the DOMParser in advance. Although this second approach was better suited
to the SAML and Shibboleth program model, the EntityResolver was more
commonly used in the code.

JAXP 1.3 promotes XSD Schemas to first class objects. One or more XSD files
are compiled by a SchemaFactory into a Schema object that represents their
combined syntax. That object can then be associated with XML parsers. Once
the new level of code is available, it makes sense to replace the older code.
In most cases this makes the code much smaller, because relatively
complicated EntityResolver internal classes and callback routines can be
replaced with a single statement.

OpenSAML XML.ParserPool
In the old Xerces DOMParser programming interface, an application created an
XML parser and then associated it with XSD Schema files. Some XSD file names
could be presented to the DOMParser first. Then if you used the
EntityResolver interface, schema file references in the file itself could
dynamically add XSD files to the grammar in the middle of the file
processing.

There was a not terribly well documented property to cache results generated
by processing XSD files. This saves the overhead of some redundant
processing. However, there are inherent inefficiencies when you try to manage
schemas as a dynamically changing collection of file names maintained as a
property of an individual parser object.

Therefore, OpenSAML maintained a pool of DOMParser objects through the
ParserPool internal class of the XML class. A pool implies that all the
parser objects are interchangeable. Since the XSD files can be dynamically
added to a parser object, the pool only works if the XSD file names are all
added to the pool before the first DOMParser is created. The XML class
configures an initial set of file names, but it exposes a public method
through which other components (Shibboleth) can add additional file names
during initialization before the first parser is created.

The JAXP 1.3 standard has to support some of the legacy properties, but it
provides much better support for Schemas as first class objects. In the new
programming model, the path to creating a parser object involves different
steps:
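As a rough sketch of one such path (an assumption based on the standard JAXP
1.3 calls, not a copy of the OpenSAML code): compile XSD sources into a Schema
with a SchemaFactory, attach that Schema to a DocumentBuilderFactory, and ask
the factory for parsers. The class name SchemaParserSketch and the tiny inline
XSD are hypothetical stand-ins for the real classes and the SAML schema files.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class SchemaParserSketch {

    // Hypothetical one-element grammar standing in for the real XSD files.
    public static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:demo' elementFormDefault='qualified'>"
        + "<xs:element name='greeting' type='xs:string'/>"
        + "</xs:schema>";

    /** Compiles one or more XSD texts into a single Schema object. */
    public static Schema compile(String... xsdTexts) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Source[] sources = new Source[xsdTexts.length];
        for (int i = 0; i < xsdTexts.length; i++) {
            sources[i] = new StreamSource(new StringReader(xsdTexts[i]));
        }
        return sf.newSchema(sources);
    }

    /** Creates a parser that validates against the given Schema while building the DOM. */
    public static DocumentBuilder newValidatingBuilder(Schema schema) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        // Validation problems are reported to the ErrorHandler; throw so callers see them.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        return builder;
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder parser = newValidatingBuilder(compile(XSD));
        Document doc = parser.parse(new InputSource(
            new StringReader("<greeting xmlns='urn:example:demo'>hello</greeting>")));
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

The sketch installs an ErrorHandler that throws because a DocumentBuilder
reports schema violations to its handler rather than failing on its own; with
no handler installed, an invalid document may still come back as a DOM tree.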
You can repeat steps 5 or 6 as many times as you want. You can use the
DocumentBuilderFactory to create parsers associated with one Schema, then
change the Schema and create a bunch of different parsers. Because all the
work is done creating the Schema object, DocumentBuilder parsers are fairly
lightweight compared to the old DOMParsers.

It is not clear if this was forced by the new API, or if it is just a good
idea. The old ParserPool maintained one pool of homogeneous parsers that are
all associated with the same list of XSD files. In the new code, ParserPool
maintains a separate collection of DocumentBuilder objects for every Schema
object it encounters.

The parent OpenSAML XML class creates a set of Schema objects for SAML 1.0
and SAML 1.1 XSD files, and it is prepared to build a SAML 2.0 (or combined
SAML 1.1 and 2.0) Schema as soon as one is required. One of these Schema
objects is then configured as the default and will be used whenever the
caller doesn't provide an explicit Schema parameter.

Other code, notably Shibboleth, can create its own Schema object with other
namespaces (Metadata, Trust, AAP, ...). It can then use the ParserPool logic
by passing that Schema as a parameter to get() requests to obtain a parser
from the pool.

Shibboleth
Unlike OpenSAML, Shibboleth did not create a single class to handle XML. On
the other hand, Shibboleth requires that OpenSAML be in the path, so it has
clear access to the services of XML.ParserPool and doesn't need to duplicate
that function.

A number of Shibboleth classes, particularly test cases, created their own
DOMParser object and used it through a sequence of operations. Each case has
to be evaluated on its own, but overall there appear to be two questions
raised by the conversion of existing code to use a JAXP 1.3 library.

Grammar Scope
There are pluggable components that process a file with a specific subset of
Shibboleth XML semantics. For example, a pluggable Metadata provider doesn't
need access to AAP syntax. In a few cases the current code creates a
DOMParser associated with a small list of specific XSD files. Typically this
parser is used during the test case and then discarded (garbage collected).
Such a parser is so specific that it could have no other use and would not
benefit from pooling.

Alternatively, one can create a Schema from all the useful files in the
/schemas resource directory of Shibboleth. This composite object supports the
parsing of Metadata, or Trust, or AAP, or SAML statements, because all the
namespaces are defined in it. You can then pool such parser objects and use
them anywhere in Shibboleth when you need a parser.

Normally you get improved integrity when an object is carefully tailored to
its specific use. However, the syntax of XSD files is inherently lax. An XML
source conforms to the schema if its top level root element is any of the top
level elements declared in the XSD file or in any secondary XSD file it
imports. You always have to test to be sure that the top level element you
got is the one you were expecting. Once you are forced to make that test,
then using a composite Shibboleth-wide Schema object for all parsing adds no
additional checks and is no more uncertain.

Recycle
XML.ParserPool has methods to get a parser object (associated with a Schema)
and to return it. If a DocumentBuilder object associated with the Schema is
in the pool, it is reused. If not, a new DocumentBuilder object is created
from the DocumentBuilderFactory and Schema.

However, there is no physical tie between the ParserPool and the
DocumentBuilder objects. If someone gets a parser from the pool and does not
return it, then if it becomes unreferenced it will be garbage collected like
any other Java object and the ParserPool will simply replace it as needed
with a newly created object.

It is perfectly logical to use the ParserPool as a customized
DocumentBuilderFactory. Rather than creating your own code to handle the JAXP
API, take an object from the pool, use it, and then just leave it around.
However, there is a strong aesthetic objection (not a practical or technical
objection) that every client that takes an object from anything claiming to
be a "pool" has a moral responsibility to at least attempt to return it.

A widespread example of this dilemma is represented by the JUnit test cases.
Some of these test case classes previously created a DOMParser, then reused
it across a number of tests, and finally left it for GC. The problem with
JUnit code is that it can end abruptly if any test fails. So the only way to
ensure that a parser obtained from the pool is returned is to wrap the entire
test in a try-finally block, and that is slightly counter to the normal JUnit
aesthetic.

There is absolutely no concrete, technical, or professional reason for
Shibboleth to use DocumentBuilderFactory directly. Everything it needs can be
obtained from OpenSAML XML.ParserPool. However, to avoid the icky feeling of
getting an object from a "pool" that you have absolutely no intention of ever
returning to the pool, a small amount of code duplication was allowed.
Shibboleth provides some cover convenience classes that provide a one-for-one
direct replacement for the previous statements that created a "new DOMParser"
for local use. However, in all the non-test-case mainline code that parses
configuration files and such, the logic borrows a Shibboleth Schema
DocumentBuilder from OpenSAML XML.ParserPool, uses it to parse the file, and
then returns it to the pool.

BucketOSchemas
In the end, Shibboleth code builds on a single composite Schema object and
uses it for all parsing. So it would have been enough to simply read in all
the files in the /schemas resource directory of Shibboleth and pass the
entire bunch to the SchemaFactory for compilation. However, before this final
decision was reached I had to consider the possibility of creating more
specific Schema objects based on configured subsets of the files. In future
releases, or in response to other requirements, this may still be necessary.

The solution used in the OpenSAML XML class is to create different schemas
from different lists of file names. This is required by the unfortunate fact
that SAML 1.0 and SAML 1.1 XSD files provide incompatible definitions of the
same XML namespace. A Schema object can be based on 1.0 or 1.1 source files,
but not both, and the only way to distinguish them is by filename.

Shibboleth only supports SAML 1.0 (and SAML 2.0 fortunately introduces
entirely new namespaces). So it doesn't have the problem that OpenSAML has.
If you take a fresh look at the problem of selecting subsets from the XSD
files of a directory to compile using the JAXP 1.3 SchemaFactory interface,
then an alternate approach makes better technological sense.

Each XSD schema file defines an XML namespace designated by the
"targetNamespace" attribute of the root <schema> element. Consider the SAML
2.0 Metadata schema. It is distributed as a file named
sstc-saml-schema-metadata-2.0.xsd

When the file is processed, the information that the Schema compiler uses is
found in the root element:

    <schema targetNamespace="urn:oasis:names:tc:SAML:2.0:metadata" ...
The SchemaFactory processing and the Schema object created would be the same
if the file was renamed to "foo.xsd". There is absolutely no semantic content
to the filename. The real identifier for this file is
"urn:oasis:names:tc:SAML:2.0:metadata", but this is illegal as a filename in
most operating systems.

However, one can imagine some future XML native storage system in which these
current XSD files could be stored and retrieved based on their
targetNamespace attribute. That doesn't exist now, but with a couple of dozen
extra statements we can simulate it through an internal class that I
whimsically call BucketOSchemas.

Given a resource directory (/schemas), BucketOSchemas reads in all the *.xsd
resource files in the directory. Each file is run through a non-validating
DocumentBuilder parser (which only guarantees that it is well formed XML of
unknown syntax) to produce a DOM tree. The code checks each root document
Element for a type of "schema" and looks for a "targetNamespace" attribute on
the Element. The namespace becomes the key and the DOM tree becomes the value
for a Map.

In a disk directory, filename is the key. In this Map, namespace is the key.
We did some extra processing to convert the XSD files into DOMs, but that
processing is not wasted. As it turns out, one of the ways to pass an XSD to
the SchemaFactory compiler is to pass in a DOM source instead of a file or
stream. So the work to parse the file into memory will be recovered by work
saved when the Schema object is created. If more than one Schema object is
created, then there is even a net improvement in efficiency.

With BucketOSchemas, you can now create customized subset Schema objects from
a list of namespaces that you want to include (which are the true keys in the
XML standards) rather than filenames (which are semantically meaningless and
arbitrary).

Of course, this work would only be useful if you intended for Shibboleth to
have more than one Schema object with specific subsets of namespaces. After
kicking this question back and forth, I propose that there be only one
composite Schema object. Thus BucketOSchemas is not currently used for any
really useful purpose. However, having written it, there was no reason to
discard it. So the code will be checked in and available if the function is
needed later.
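The namespace-keyed bucket described above can be sketched roughly as follows.
This is a hypothetical stand-in, not the BucketOSchemas class itself: the name
SchemaBucketSketch and its add/compile methods are invented here, and it takes
XSD text directly rather than scanning a /schemas resource directory.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical illustration: XSD documents stored by targetNamespace, not by filename. */
final class SchemaBucketSketch {

    private final Map<String, Document> byNamespace = new HashMap<String, Document>();
    private final DocumentBuilder plainParser;

    public SchemaBucketSketch() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // non-validating: only checks that the XSD is well formed
        plainParser = dbf.newDocumentBuilder();
    }

    /** Parses one XSD into a DOM tree and files it under its targetNamespace. */
    public String add(String xsdText) throws Exception {
        Document dom = plainParser.parse(new InputSource(new StringReader(xsdText)));
        Element root = dom.getDocumentElement();
        if (!XMLConstants.W3C_XML_SCHEMA_NS_URI.equals(root.getNamespaceURI())
                || !"schema".equals(root.getLocalName())) {
            throw new IllegalArgumentException("root element is not an XSD <schema>");
        }
        // getAttribute returns "" when the schema declares no targetNamespace.
        String namespace = root.getAttribute("targetNamespace");
        byNamespace.put(namespace, dom);
        return namespace;
    }

    /** Compiles a Schema from the stored DOM trees of the requested namespaces. */
    public Schema compile(String... namespaces) throws Exception {
        Source[] sources = new Source[namespaces.length];
        for (int i = 0; i < namespaces.length; i++) {
            Document dom = byNamespace.get(namespaces[i]);
            if (dom == null) {
                throw new IllegalArgumentException("no schema for namespace " + namespaces[i]);
            }
            sources[i] = new DOMSource(dom); // reuse the parsed tree instead of re-reading a file
        }
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return sf.newSchema(sources);
    }
}
```

A caller would add() every XSD once and then request a subset, for example
compile("urn:oasis:names:tc:SAML:2.0:metadata"), assuming that namespace had
been added and that any schemas it imports can be resolved.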