Subject: Shibboleth Developers

XML API changes in OpenSAML and Shibboleth about to be checked in

From: "Howard Gilbert" <>
To: <>
Subject: XML API changes in OpenSAML and Shibboleth about to be checked in
Date: Mon, 24 Jan 2005 21:16:31 -0500

JAXP 1.3 Changes to OpenSAML and Shibboleth

Feedback from the problems uncovered deploying XML applications drives the evolution of the W3C standards. New versions of the standards solve real problems. Thus the migration of code to new versions of XML support may be driven by necessity rather than a desire to pick up neat new features. Applications that are centered entirely on XML and are controlled by external schemas and specifications (such as OpenSAML and Shibboleth) are forced to keep up to date.

Things would be simpler if the W3C produced new standards that were compatible with their previous standards. Unfortunately, they have adopted a policy of replacing each interface definition with a new version that keeps the same interface name but adds methods. This means that the bundle of interfaces (associated with one version of the standard) is tightly coupled to a separate Jar file containing versions of the implementing classes that support the new methods.

One of the basic programming interfaces is the DOM (Document Object Model). The DOM interfaces are defined by packages of the form org.w3c.dom.* and define a set of objects and methods that provide operations on the objects. A DOM 2 standard was developed years ago, and DOM 3 component standards are now released. Driven by requirements emerging from the SAML standard of XML syntax, OpenSAML and Shibboleth require DOM 3 support.

The Apache Xerces project was formed from submissions from IBM (XML4J) and Sun (ProjectX). It represents a common codebase to which all parties can submit bugfixes and new features. Apache distributes versions of Xerces directly, but Sun distributes a version of the same code with slightly different packaging.

Starting with Java 1.4, Sun decided that XML was so important that it should be a standard part of the J2SE runtime library. However, Sun's standards require that all XML requests filter through the JAXP API, just as all database requests go through JDBC and all directory requests go through JNDI. The Apache code contains some programming interfaces with concrete classes left over from the old IBM XML4J days. So although Sun's distribution is based on Apache Xerces, Sun renames some of the classes to require everyone to go through the public JAXP interface.

Unfortunately, Sun decided to freeze the features and standards at major release boundaries. When Java 1.4.0 came out in Feb. 2002, the standards were DOM 2 and JAXP 1.2. So although bugs were fixed, these versions of the standards remained the basis for the Sun library through releases of 1.4.1 and 1.4.2 (up to 1.4.2_06). The only way to override this type of built in function is to use the "endorsed" library function of Java, and the only other version of code reasonably available was the distribution from Apache.

The current version of the Xerces XML support distributed by Apache contains interface definitions based on the old DOM 2 standard, and classes that implement that standard. Apache provides an Ant build option to create a version of its current Xerces release with the DOM 3 interfaces and implementations, but it regards a library built this way as experimental Beta code. The plan is to convert to DOM 3 support in the 2.7.0 release of Xerces, which currently has no planned release date.

In the Summer of 2004, Sun finally released a new major release. Designated as 1.5 under the old system, or as J2SE 5.0 in a new naming convention, this release includes as standard both support for DOM 3 and JAXP 1.3. In November they also released a version of the same XML library for use on earlier Java releases.

So at this moment, Sun has leapfrogged ahead of Apache. Eventually Apache will release 2.7.0 and catch up, but even then the Sun version of the code will have the advantage that it is built into Java (at least if you are running J2SE 5.0). It provides all the function needed for OpenSAML and Shibboleth, and some useful new features, but will require some conversion.

The proposal is to convert the OpenSAML and Shibboleth projects to use the new Sun version of these libraries rather than the older Apache version. If a customer is using J2SE 5.0 as his JRE, then no libraries are needed and everything will work with just the standard Java runtime. For older JREs, the five Sun jar files replace the two previously distributed Apache jar files in the /endorsed library.

This requires converting some existing code to use the JAXP factory standard instead of using "new" to directly create instances of Apache classes. There is a major benefit to the conversion, because XSD schema files used extensively in OpenSAML and Shibboleth become first class programming objects. The changes to the code have been made, and this paper explains them before they are checked in.

The Libraries

Currently, OpenSAML and Shibboleth ship with two Jar files (dom3-xercesImpl-2.x and dom3-xml-apis-2.x), where "x" is somewhere between "5.0" and "6.2" depending on what level the authors prefer to support right now.

A customer who uses J2SE 5.0 as his JRE (and a Servlet container such as Tomcat 5.5 that supports it) has the desired level of XML support and requires no libraries.

A customer using some version of Java 1.4.x requires the Sun distribution of new XML support for old Java systems. If this is checked into the current OpenSAML and Shibboleth projects, the /endorsed directory would now have five Jar files replacing the previous two jar files:

dom.jar (contains the org.w3c.dom interface packages)
sax.jar (contains the org.xml.sax interface packages)
jaxp-api.jar (contains the javax.xml interface packages)
xercesImpl.jar (Xerces, but with the packages renamed as com.sun.org.apache.xerces...)
xalan.jar (Xalan, but with the packages renamed as com.sun.org.apache.xalan...)

Essentially, Sun breaks the Apache xml-apis library of interfaces into three separate Jar files representing the three different interface standards (DOM, SAX, and JAXP) from three separate organizations. This seems like a sensible piece of housekeeping.

The implementing classes (org.apache.*) then have their packages renamed to com.sun.org.apache.*. This causes some existing OpenSAML and Shibboleth code to break, and that is exactly what Sun intends. Direct use of Apache implementing classes bypasses JAXP. It is essentially the same thing as using an Oracle database class directly instead of going through JDBC. Since Sun has to maintain the same classes as Apache, they did not want to change the source. However, by renaming the packages they could be sure that any code that makes direct use of an Apache class would have to be converted.

DocumentBuilderFactory (not "new DOMParser")

The Sun approach to functional libraries is to create a factory interface with pluggable providers. JAXP is the factory interface for XML. Sun provides a set of implementing classes, but I suppose you might find an alternate source of classes to implement one or more of the XML standards.

Apache used to expose some concrete classes to perform specific functions. Some Shibboleth and OpenSAML source includes the following statement to define the concrete class that provides XML to DOM parsing:

import org.apache.xerces.parsers.DOMParser;

Sun doesn't want you to use concrete classes directly, so it renamed the packages. There is still a DOMParser class, but in Sun's distribution it is com.sun.org.apache.xerces.internal.parsers.DOMParser. If you convert from Apache to Sun libraries, then the old import statements and direct use of DOMParser and a few other concrete classes will not compile.

To correct such statements, replace the direct use of classes with the JAXP factory interface. The first step is to create a DocumentBuilderFactory object. This object is then parameterized with information about the type of XML parser you want (especially the XSD Schemas it should use). Then, the DocumentBuilderFactory can be called to create one or more DocumentBuilder objects. DocumentBuilder is almost the same as DOMParser, though a few method details are different.
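As a rough illustration of that conversion, here is a minimal sketch of parsing through the JAXP factory rather than a concrete Xerces class. The class and method names (JaxpParseSketch, parse) are made up for this example and are not the names used in OpenSAML or Shibboleth; only the javax.xml.parsers calls are standard JAXP.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class JaxpParseSketch {

    /** Parses an XML string into a DOM tree using only JAXP interfaces. */
    static Document parse(String xml) {
        try {
            // Old style, which no longer compiles against the Sun packaging:
            //   DOMParser parser = new DOMParser();
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The factory is not namespace aware by default; SAML needs namespaces.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Note that the factory, not the parser, carries the configuration, so every DocumentBuilder it creates is set up the same way.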

There is a similar Transformer factory interface to get an object that will convert DOM back to a string of characters (serialize the XML).
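A minimal sketch of that serialization path, assuming an identity transform (a Transformer created with no stylesheet) is all that is needed. The names JaxpSerializeSketch, serialize, and sample are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, for illustration only.
class JaxpSerializeSketch {

    /** Converts a DOM node back into a string of characters. */
    static String serialize(Node node) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize DOM", e);
        }
    }

    /** Builds a tiny document so the sketch can be run on its own. */
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("greeting");
            root.setTextContent("hello");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```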

Although there are some rough one-to-one translations between old classes and new factories, the details of methods and properties are important. The existing code contains some optimizations, and the same things need to be expressed with a new semantic.

XSD Schemas

XSD Schemas were added to XML after Xerces and its predecessors were already distributed. As a result, versions of Schema support prior to the current Sun JAXP 1.3 API give the clear impression that they were added as an afterthought and squeezed in wherever they could go.

There are two views of Schemas. In one view, useful for a generic XML editor, the XSD Schema file is identified by an attribute in the root XML element of the file being parsed. This allows processing of any XML file referencing any Schema. However, OpenSAML is only interested in XML from the SAML grammars, and Shibboleth is only interested in SAML and in the XML formats of its own configuration files. In both cases a set of Schema files is known in advance and is distributed in the /schemas directory of the application.

Old Xerces supported two API techniques. One, suited for the first case where the XSD schema file name was part of the XML, used a SAX interface called the EntityResolver. This callback routine received the string from the XML, and located the XSD schema file to which the string referred. An alternate approach was available in which an array of open XSD Schema files was passed to the DOMParser in advance. Although this second approach was better suited to the SAML and Shibboleth program model, the EntityResolver was more commonly used in the code.

JAXP 1.3 promotes XSD Schemas to first class objects. One or more XSD files are compiled by a SchemaFactory into a Schema object that represents their combined syntax. That object can then be associated with XML parsers. Once the new level of code is available, it makes sense to replace the older code. In most cases this makes the code much smaller, because relatively complicated EntityResolver internal classes and callback routines can be replaced with a single statement.
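To make the "first class object" idea concrete, here is a small sketch that compiles two toy schemas into one Schema object and checks documents against it with a Validator. The two inline XSD strings and their urn:example namespaces are invented stand-ins for the real files in /schemas, and the class name SchemaCompileSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class SchemaCompileSketch {

    // Toy schemas standing in for real XSD files (assumed, not from SAML).
    static final String NOTE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:note' elementFormDefault='qualified'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";
    static final String COUNT_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:count' elementFormDefault='qualified'>"
        + "<xs:element name='count' type='xs:int'/></xs:schema>";

    /** Compiles several XSD texts into one Schema covering all their namespaces. */
    static Schema compile(String... xsdTexts) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Source[] sources = new Source[xsdTexts.length];
        for (int i = 0; i < xsdTexts.length; i++) {
            sources[i] = new StreamSource(new StringReader(xsdTexts[i]));
        }
        try {
            return factory.newSchema(sources);
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema did not compile", e);
        }
    }

    /** Returns true when the XML text conforms to the compiled Schema. */
    static boolean isValid(Schema schema, String xml) {
        try {
            // A Validator with no ErrorHandler throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The Schema returned by compile() is immutable, so the same object can be handed to any number of validators or parsers.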

OpenSAML XML.ParserPool

In the old Xerces DOMParser programming interface, an application created an XML parser and then associated it with XSD Schema files. Some XSD file names could be presented to the DOMParser first. Then if you used the EntityResolver interface, schema file references in the file itself could dynamically add XSD files to the grammar in the middle of the file processing.

There was a not terribly well documented property to cache results generated by processing XSD files. This saves overhead of some redundant processing. However, there are inherent inefficiencies when you try to manage schemas as a dynamically changing collection of file names maintained as a property of an individual parser object.

Therefore, OpenSAML maintained a pool of DOMParser objects through the ParserPool internal class of the XML class. A pool implies that all the parser objects are interchangeable. Since the XSD files can be dynamically added to a parser object, the pool only works if the XSD file names are all added to the pool before the first DOMParser is created. The XML class configures an initial set of file names, but it exposes a public method through which other components (Shibboleth) can add additional file names during initialization before the first parser is created.

The JAXP 1.3 standard has to support some of the legacy properties, but it provides a much better support for Schemas as a first class object. In the new programming model, the path to creating a parser object involves different steps:

1. Create a SchemaFactory.
2. Collect a list of *.xsd schema source files for XML grammar you want to parse. You could have a set of files for SAML 1.0, one for just SAML 1.1, one for Shibboleth 1.2, and so on. When new and old schema files are compatible, they can be combined. Thus you can include both SAML 1.1 and SAML 2.0 in the same list.
3. Use the SchemaFactory to compile each set of XSD files into a corresponding thread safe shareable Schema object. This Schema object understands the namespaces defined by all the input files. One Schema object can be shared by all parsers that support that particular combination of namespaces.
4. Create a DocumentBuilderFactory.
5. Associate a Schema with the DocumentBuilderFactory.
6. Ask the factory to create a DocumentBuilder (parser). The parser will be associated with the Schema object associated with the DocumentBuilderFactory when it was created.

You can repeat steps 5 or 6 as many times as you want. You can use the DocumentBuilderFactory to create parsers associated with one Schema, then change the Schema and create a bunch of different parsers. Because all the work is done creating the Schema object, DocumentBuilder parsers are fairly light weight compared to the old DOMParsers.
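The six steps can be sketched roughly as follows. This is an illustration under assumptions, not the code being checked in: the inline TICKET_XSD schema and the names SchemaParserSketch, factoryFor, and parses are invented. One detail worth noting is that a DocumentBuilder does not reject a schema-invalid document unless you install an ErrorHandler that throws.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
class SchemaParserSketch {

    // Toy schema standing in for the real XSD files (assumed).
    static final String TICKET_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:ticket' elementFormDefault='qualified'>"
        + "<xs:element name='ticket' type='xs:int'/></xs:schema>";

    /** Steps 1 to 5: compile the schemas and attach the result to a factory. */
    static DocumentBuilderFactory factoryFor(String... xsdTexts) {
        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);   // step 1
        Source[] sources = new Source[xsdTexts.length];                      // step 2
        for (int i = 0; i < xsdTexts.length; i++) {
            sources[i] = new StreamSource(new StringReader(xsdTexts[i]));
        }
        Schema schema;
        try {
            schema = schemaFactory.newSchema(sources);                       // step 3
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema did not compile", e);
        }
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();   // step 4
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);                                               // step 5
        return dbf;
    }

    /** Step 6: create a parser and report whether the XML parses and validates. */
    static boolean parses(DocumentBuilderFactory dbf, String xml) {
        try {
            DocumentBuilder builder = dbf.newDocumentBuilder();              // step 6
            // Without a handler, validation errors are reported but not thrown.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```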

It is not clear if this was forced by the new API, or if it is just a good idea. The old ParserPool maintained one pool of homogeneous parsers that were all associated with the same list of XSD files. In the new code, ParserPool maintains a separate collection of DocumentBuilder objects for every Schema object it encounters.

The parent OpenSAML XML class creates a set of Schema objects for SAML 1.0 and SAML 1.1 XSD files, and it is prepared to build a SAML 2.0 (or combined SAML 1.1 and 2.0) Schema as soon as one is required. One of these Schema objects is then configured as the default and will be used whenever the caller doesn't provide an explicit Schema parameter.

Other code, notably Shibboleth, can create its own Schema object with other namespaces (Metadata, Trust, AAP, ...). It can then use the ParserPool logic by passing that Schema as a parameter to get() requests to obtain a parser from the pool.
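For readers who want a picture of a pool keyed by Schema, here is a simplified sketch. It is not the OpenSAML XML.ParserPool source; the class SchemaKeyedPoolSketch, its get/put signatures, and the toy schemas are assumptions made for illustration (in particular, this put() takes the Schema as a parameter so the sketch does not have to ask the parser which Schema it belongs to).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical pool, for illustration only.
class SchemaKeyedPoolSketch {

    // Toy schemas so the sketch runs on its own (assumed, not real SAML files).
    static final String XSD_A = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:a'><xs:element name='a' type='xs:string'/></xs:schema>";
    static final String XSD_B = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:b'><xs:element name='b' type='xs:string'/></xs:schema>";

    // Schema does not override equals(), so each Schema object is its own key.
    private final Map<Schema, Deque<DocumentBuilder>> idle = new HashMap<>();

    /** Returns an idle parser for this Schema, or creates a new one. */
    synchronized DocumentBuilder get(Schema schema) {
        Deque<DocumentBuilder> parsers = idle.get(schema);
        if (parsers != null && !parsers.isEmpty()) {
            return parsers.pop();
        }
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setSchema(schema);
            return dbf.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hands a parser back so a later get() for the same Schema can reuse it. */
    synchronized void put(Schema schema, DocumentBuilder builder) {
        builder.reset(); // drop any per-use ErrorHandler or EntityResolver
        idle.computeIfAbsent(schema, key -> new ArrayDeque<>()).push(builder);
    }

    /** Convenience for the sketch: compile one XSD text into a Schema. */
    static Schema compile(String xsdText) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsdText)));
        } catch (SAXException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```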

Shibboleth

Unlike OpenSAML, Shibboleth did not create a single class to handle XML. On the other hand, Shibboleth requires that OpenSAML be in the path, so it has clear access to the services of XML.ParserPool and doesn't need to duplicate that function.

A number of Shibboleth classes, particularly test cases, created their own DOMParser object and used it through a sequence of operations. Each case has to be evaluated on its own, but overall there appear to be two questions raised by the conversion of existing code to use a JAXP 1.3 library.

Grammar Scope

There are pluggable components that process a file with a specific subset of Shibboleth XML semantics. For example, a pluggable Metadata provider doesn't need access to AAP syntax. In a few cases the current code creates a DOMParser associated with a small list of specific XSD files. Typically this parser is used during the test case and then discarded (garbage collected). Such a parser is so specific that it could have no other use and would not benefit from pooling.

Alternatively, one can create a Schema from all the useful files in the /schemas resource directory of Shibboleth. This composite object supports the parsing of Metadata, or Trust, or AAP, or SAML statements, because all the namespaces are defined in it. You can then pool such parser objects and use them anywhere in Shibboleth when you need a parser.

Normally you get improved integrity when an object is carefully tailored to its specific use. However, the syntax of XSD files is inherently lax. An XML source conforms to the schema if its root element is any of the top level elements declared in the XSD file or in any secondary XSD file it imports. You always have to test to be sure that the top level element you got is the one you were expecting. Once you are forced to make that test, then using a composite Shibboleth-wide Schema object for all parsing adds no additional checks and is no more uncertain.
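The test for the expected top level element is short. The sketch below assumes a namespace-aware parse; the helper names (RootElementCheckSketch, requireRoot) are invented, and EntityDescriptor in the SAML 2.0 Metadata namespace is used only as a familiar example of an expected root.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class RootElementCheckSketch {

    /** Returns the root element, or fails if it is not the one the caller expects. */
    static Element requireRoot(Document doc, String namespace, String localName) {
        Element root = doc.getDocumentElement();
        boolean matches = namespace.equals(root.getNamespaceURI())
            && localName.equals(root.getLocalName());
        if (!matches) {
            throw new IllegalArgumentException("expected {" + namespace + "}" + localName
                + " but found {" + root.getNamespaceURI() + "}" + root.getLocalName());
        }
        return root;
    }

    /** Namespace-aware parse, so getNamespaceURI() and getLocalName() are populated. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would parse with the composite Schema and then call requireRoot() before trusting the content.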

Recycle

XML.ParserPool has methods to get a parser object (associated with a Schema) and to return it. If a DocumentBuilder object associated with the Schema is in the pool it is reused. If not, a new DocumentBuilder object is created from the DocumentBuilderFactory and Schema.

However, there is no physical tie between the ParserPool and the DocumentBuilder objects. If someone gets a parser from the pool and does not return it, then once it becomes unreferenced it will be garbage collected like any other Java object, and the ParserPool will simply replace it as needed with a newly created object.

It is perfectly logical to use the ParserPool as a customized DocumentBuilderFactory. Rather than creating your own code to handle the JAXP API, take an object from the pool, use it, and then just leave it around. However, there is a strong aesthetic objection (not a practical or technical objection) that every client that takes an object from anything claiming to be a "pool" has a moral responsibility to at least attempt to return it.

A widespread example of this dilemma is represented by the JUnit test cases. Some of these test case classes previously created a DOMParser, then reused it across a number of tests, and finally left it for GC. The problem with JUnit code is that it can end abruptly if any test fails. So the only way to ensure that a parser obtained from the pool is returned is to wrap the entire test in a try-finally block, and that is slightly counter to the normal JUnit aesthetic.
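The try-finally shape looks roughly like this. To keep the sketch self-contained it uses a tiny stand-in pool (TinyPool) rather than the real XML.ParserPool, and a plain method in place of a JUnit test body; all names here are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example, for illustration only.
class BorrowReturnSketch {

    /** Minimal stand-in for a parser pool; counts how many parsers came back. */
    static final class TinyPool {
        private final Deque<DocumentBuilder> idle = new ArrayDeque<>();
        int returned = 0;

        DocumentBuilder get() {
            if (!idle.isEmpty()) {
                return idle.pop();
            }
            try {
                DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                // DefaultHandler rethrows fatal errors instead of printing them.
                builder.setErrorHandler(new DefaultHandler());
                return builder;
            } catch (ParserConfigurationException e) {
                throw new IllegalStateException(e);
            }
        }

        void put(DocumentBuilder builder) {
            returned++;
            idle.push(builder);
        }
    }

    /** Borrows a parser, uses it, and returns it even if parsing fails. */
    static String rootName(TinyPool pool, String xml) {
        DocumentBuilder builder = pool.get();
        try {
            return builder.parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getNodeName();
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("not well formed", e);
        } finally {
            pool.put(builder); // runs on success and on failure alike
        }
    }
}
```

In a JUnit test the body of the try block would hold the assertions, so a failed assertion still reaches the finally clause.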

There is absolutely no concrete, technical, or professional reason for Shibboleth to use DocumentBuilderFactory directly. Everything it needs can be obtained from OpenSAML XML.ParserPool. However, to avoid the icky feeling of getting an object from a "pool" that you have absolutely no intention of ever returning to the pool, a small amount of code duplication was allowed. Shibboleth provides some cover convenience classes that provide a one-for-one direct replacement for the previous statements that created a "new DOMParser" for local use. However, in all the non-test-case mainline code that parses configuration files and such, the logic borrows a Shibboleth Schema DocumentBuilder from OpenSAML XML.ParserPool, uses it to parse the file, and then returns it to the pool.

BucketOSchemas

In the end, Shibboleth code builds a single composite Schema object and uses it for all parsing. So it would have been enough to simply read in all the files in the /schemas resource directory of Shibboleth and pass the entire bunch to the SchemaFactory for compilation. However, before this final decision was reached I had to consider the possibility of creating more specific Schema objects based on configured subsets of the files. In future releases, or in response to other requirements, this may still be necessary.

The solution used in the OpenSAML XML class is to create different schemas from different lists of file names. This is required by the unfortunate fact that SAML 1.0 and SAML 1.1 XSD files provide incompatible definitions of the same XML namespace. A Schema object can be based on 1.0 or 1.1 source files, but not both, and the only way to distinguish them is by filename.

Shibboleth only supports SAML 1.0 (and SAML 2.0 fortunately introduces entirely new namespaces). So it doesn't have the problem that OpenSAML has. If you take a fresh look at the problem of selecting subsets from the XSD files of a directory to compile using the JAXP 1.3 SchemaFactory interface, then an alternate approach makes better technological sense.

Each XSD schema file defines an XML namespace designated by the "targetNamespace" attribute of the root <schema> element. Consider the SAML 2.0 Metadata schema. It is distributed as a file named

sstc-saml-schema-metadata-2.0.xsd

When the file is processed, the information that the Schema compiler uses is found in the root element:

<schema targetNamespace="urn:oasis:names:tc:SAML:2.0:metadata" ...

The SchemaFactory processing and the Schema object created would be the same if the file was renamed to "foo.xsd". There is absolutely no semantic content to the filename. The real identifier for this file is "urn:oasis:names:tc:SAML:2.0:metadata", but this is illegal as a filename in most operating systems.

However, one can imagine some future XML native storage system in which these current XSD files could be stored and retrieved based on their targetNamespace attribute. That doesn't exist now, but with a couple of dozen extra statements we can simulate it through an internal class that I call whimsically BucketOSchemas.

Given a resource directory (/schemas), BucketOSchemas reads in all the *.xsd resource files in the directory. Each file is run through a non-validating DocumentBuilder parser (which only guarantees that it is well formed XML of unknown syntax) to produce a DOM tree. The code checks each root document Element for a type of "schema" and looks for a "targetNamespace" attribute on the Element. The namespace becomes the key and the DOM tree becomes the value for a Map.

In a disk directory, filename is the key. In this Map, namespace is the key. We did some extra processing to convert the XSD files into DOMs, but that processing is not wasted. As it turns out, one of the ways to pass an XSD to the SchemaFactory compiler is to pass in a DOM source instead of a file or stream. So the work to parse the file into memory will be recovered by work saved when the Schema object is created. If more than one Schema object is created, then there is even a net improved efficiency.

With BucketOSchemas, you can now create customized subset Schema objects from a list of namespaces that you want to include (which are the true keys in the XML standards) rather than filenames (which are semantically meaningless and arbitrary).
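The idea can be sketched in a few dozen lines. This is not the BucketOSchemas source that will be checked in; the class name NamespaceBucketSketch and its methods are invented, and for simplicity the sketch is fed XSD text directly instead of scanning a resource directory. The two toy schemas are also assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class, for illustration only.
class NamespaceBucketSketch {

    static final String NOTE_XSD = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:note' elementFormDefault='qualified'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";
    static final String COUNT_XSD = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:count' elementFormDefault='qualified'>"
        + "<xs:element name='count' type='xs:int'/></xs:schema>";

    // targetNamespace -> parsed XSD document
    private final Map<String, Document> byNamespace = new HashMap<>();

    /** Parses one XSD without validation and files it under its targetNamespace. */
    void add(String xsdText) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // well-formedness only, no schema attached
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xsdText)));
            Element root = doc.getDocumentElement();
            String target = root.getAttribute("targetNamespace");
            if (!XMLConstants.W3C_XML_SCHEMA_NS_URI.equals(root.getNamespaceURI())
                    || !"schema".equals(root.getLocalName()) || target.isEmpty()) {
                throw new IllegalArgumentException("not an XSD with a targetNamespace");
            }
            byNamespace.put(target, doc);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Compiles a Schema from the stored DOM trees for the requested namespaces. */
    Schema compile(String... namespaces) {
        Source[] sources = new Source[namespaces.length];
        for (int i = 0; i < namespaces.length; i++) {
            Document doc = byNamespace.get(namespaces[i]);
            if (doc == null) {
                throw new IllegalArgumentException("unknown namespace " + namespaces[i]);
            }
            sources[i] = new DOMSource(doc); // reuse the tree parsed in add()
        }
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(sources);
        } catch (SAXException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns true when the XML text conforms to the given Schema. */
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```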

Of course, this work would only be useful if you intended for Shibboleth to have more than one Schema object with specific subsets of namespaces. After kicking this question back and forth, I propose that there be only one composite Schema object. Thus BucketOSchemas is not currently used for any really useful purpose. However, having written it, there was no reason to discard it. So the code will be checked in and available if the function is needed later.
