shibboleth-dev - Re: DOMParser

Subject: Shibboleth Developers


From: Walter Hoehn <>
To: "Howard Gilbert" <>
Cc: <>
Subject: Re: DOMParser
Date: Fri, 19 Nov 2004 10:49:55 -0600

I'm in favor of changes along these lines. I've had it on my TODO list for a long time to come up with a unified way of doing parsing and validation for the entire code-base. Currently there is much ad hockery.

A while back I wrote some extensions to Apache's Catalog API that would allow schemas to be loaded from the Java classpath. I intended for this to address some of the mess Howard has described, but after trying multiple times I was never able to get an answer back from the author.

For Shibboleth, this is a no-brainer, regardless of the JDK 1.5 carrot. OpenSAML is the place where advanced parsing features are used and where we might run into problems.

-Walter

On Nov 19, 2004, at 10:34 AM, Howard Gilbert wrote:

For the last two days I have been getting more and more entangled in a cleanup issue.

The original idea was simple. We depend on DOM3. DOM3 is standard in Java 1.5. So, suppose one compiles this code and runs it with Java 1.5 but without separate Xerces DOM3 libraries.

The problem is that neither Shibboleth nor OpenSAML uses the JAXP API to get a DOM parser. This is perfectly reasonable given the long history of Xerces development and the slow evolution of JAXP. However, since the need for DOM3 forces you to include a version of Xerces that does support JAXP, there is no longer a supported configuration where JAXP would not work if used.

The old way of doing business, used in current code, imports a Xerces class directly:

import org.apache.xerces.parsers.DOMParser;

Then it creates an object of the class:

            private DOMParser parser = new DOMParser();

Then it sets features using an Apache syntax:

            parser.setFeature("http://xml.org/sax/features/validation", true);
            parser.setFeature("http://apache.org/xml/features/validation/schema", true);

The new JAXP approach uses a DocumentBuilderFactory:

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);

A few features have their own methods. Others take URI-named attributes as before, but note that the method name changes (setAttribute rather than setFeature) and the attribute names may be slightly different, following a Sun rather than an Apache naming scheme:

    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";

    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

        try {
            // Say we are using XSD, not DTD
            dbf.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        } catch (IllegalArgumentException x) {
            log.error("Unable to obtain usable XML parser from environment");
            return null;
        }

        // Set the Schema file
        dbf.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));
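To make the factory steps above concrete, here is a minimal, self-contained sketch of the same configuration. It is not the Shibboleth code: the class name JaxpSchemaSketch and the parse helper are hypothetical, and the schema is passed as an in-memory stream instead of a File purely so the example runs without any files on disk. It also installs a small error handler that rethrows validation errors, since otherwise a validating parser may only report them and carry on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class JaxpSchemaSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    /** Parses xml against the given XSD text; returns null on any failure. */
    static Document parse(String xml, String xsd) {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        try {
            // Say we are using XSD, not DTD. This must be set before the source.
            dbf.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            // Assumption: schema supplied in memory rather than as a File.
            dbf.setAttribute(JAXP_SCHEMA_SOURCE,
                new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8)));

            DocumentBuilder parser = dbf.newDocumentBuilder();
            // Turn validation errors into exceptions instead of console output.
            parser.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (IllegalArgumentException | ParserConfigurationException
                 | SAXException | IOException e) {
            return null;
        }
    }
}
```

Because a stream can only be read once, this sketch builds a fresh factory per call; a File or URI String as the schema source is probably the better fit for a factory that is kept around.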

The schema source value can be a String (filename/URI), a File as in the line above, or an array of such values.

This may not be the time to raise it, but there is a philosophical dispute between me and prior coders. Some believe that the XML file should nominate its own Schema: the config file contains an XSD reference which is resolved by the EntityResolver. I, however, believe that the code expects the XML file to conform to a prespecified schema (Metadata, AAP, whatever). I think the code should force the XML file to conform to the schema that the code expects, and the file should not be allowed to reference its own schema file, which may not be what the code expects.

When you adopt this model, you either specify the base schema file with JAXP_SCHEMA_SOURCE and let the secondary files be found with the EntityResolver, or you specify all the schema files with an array and then don't let the EntityResolver find anything. However, let's table that discussion for a while because it is a tangent off what I want to talk about.
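As one way to picture the "don't let the EntityResolver find anything unexpected" option, here is a minimal sketch of a resolver that serves only a fixed set of known schemas and refuses everything else. The class name FixedSchemaResolver and the idea of keeping schema text in a map are assumptions made for illustration; the actual Resolver class in the code base is not shown in this message and may work quite differently.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical resolver: maps known system IDs to schema text held in memory.
final class FixedSchemaResolver implements EntityResolver {
    private final Map<String, String> known; // systemId -> schema text

    FixedSchemaResolver(Map<String, String> known) {
        this.known = known;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException {
        String text = (systemId == null) ? null : known.get(systemId);
        if (text == null) {
            // Refuse anything the code did not expect instead of fetching it.
            throw new SAXException("Refusing to resolve unexpected entity: " + systemId);
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setSystemId(systemId);
        return source;
    }
}
```

An instance would be handed to the parser with parser.setEntityResolver(...). Passing an empty map gives the strict variant in which nothing is resolved at all.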

Once you set the features you want, you ask the factory to give you a parser. Note that "DocumentBuilder" is simply an updated name for "DOMParser". Errors at this point mean that the JAXP environment is set up incorrectly.



        DocumentBuilder parser;

        try {
            parser = dbf.newDocumentBuilder();
            parser.setErrorHandler(new SimpleErrorHandler());
            parser.setEntityResolver(new Resolver());
        } catch (ParserConfigurationException e) {
            log.error("Unable to obtain usable XML parser from environment");
            return null;
        }
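The error handler passed to setErrorHandler is not shown in this message. As a rough idea of what such a class could look like, here is a minimal sketch that records warnings and rethrows errors and fatal errors so that parse() fails. The name StrictErrorHandler and the warnings list are hypothetical; this is not the SimpleErrorHandler from the code base.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler, for illustration only.
final class StrictErrorHandler implements ErrorHandler {
    // Warnings are kept for later inspection instead of aborting the parse.
    final List<String> warnings = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        warnings.add(e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        // Validation errors arrive here; rethrowing makes parse() fail.
        throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness problems arrive here.
        throw e;
    }
}
```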



Finally, you parse the file.

        try {
            doc = parser.parse(ins);
        } catch (SAXException e1) {
            log.error("Error in XML configuration file: " + e1);
            return null;
        } catch (IOException e1) {
            log.error("Error accessing XML configuration file: " + e1);
            return null;
        }

Those intimately familiar with the prior interface will note that the Document object is returned from parser.parse() here, whereas before you had to call parser.getDocument() after parsing.
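Putting the last two steps together without any validation, a minimal sketch of the JAXP call sequence looks like the following. The class name ParseReturnSketch and the choice to parse from a String are assumptions for the sake of a self-contained example; the point is only that parse() hands back the Document directly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class ParseReturnSketch {
    /** Returns the parsed Document, or null if the parser or the input is unusable. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder parser = dbf.newDocumentBuilder();
            // DefaultHandler throws on fatal errors and keeps them off the console.
            parser.setErrorHandler(new DefaultHandler());
            // The Document comes straight back from parse(); no getDocument() call.
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return null;
        }
    }
}
```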

So what is the fallout? Well, this turns out to be a bigger problem than I thought. I started with the Shibboleth /src, but now I find that there are direct uses of DOMParser in OpenSAML and in the test cases. I still think there is light at the end of the tunnel, but before continuing I had better ask the list and decide whether to proceed or to stop and roll back the transaction.

If I do roll back, then the result is that we will continue to require a specific Xerces DOM3 library to be added to the Tomcat common/endorsed library even when a suitable DOM3-compliant JAXP parser is present in the environment. It's not a big deal now, but this is an area of Eternal Install Anguish.

If I proceed, then I end up with an update that hits some code (in the Origin and Target, some test cases, and OpenSAML). It is a sloppy commit, but once done we have removed an explicit org.apache.xerces import/new and switched to a standard Java Factory interface.

I believe at this point that there are no features or options that behave differently or cannot be accessed through the new interface. This is subject to testing and verification. Unfortunately, when stuff moves into the Java standard you have to "Render unto Sun that which is Sun's", and Sun doesn't always see 100% the same as Apache.

So here is the question. If I proceed, finish the edit, and run some successful tests, will the consensus allow me to check in the changes so we can run some more tests? If there is an aesthetic objection to flipping over to JAXP, then it makes no sense to continue what has become a non-trivial effort.

Note that moving to the JAXP interface doesn't necessarily stick us with any particular Sun implementation. If Xerces 3 provides extra function, you can always add some later Xerces library to /endorsed and make it a prerequisite. It's just that we will be accessing that Xerces library using the JAXP interface and not directly through the org.apache.xerces classes.
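One quick way to see which parser implementation the JAXP lookup actually picked in a given environment is to print the concrete factory class. This is only a diagnostic sketch with a hypothetical class name; the system property it mentions is the standard JAXP override key.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic helper, for illustration only.
final class JaxpProviderSketch {
    // Standard JAXP system property that, if set, names the factory class to use.
    static final String FACTORY_PROPERTY = "javax.xml.parsers.DocumentBuilderFactory";

    /** Name of the concrete factory class JAXP selected at runtime. */
    static String factoryImplementation() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Explicit override from the system property, or null if none is set. */
    static String configuredOverride() {
        return System.getProperty(FACTORY_PROPERTY);
    }
}
```

Logging factoryImplementation() at startup would show whether the bundled parser or a Xerces library placed in /endorsed is the one in use.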


DOMParser, Howard Gilbert, 11/19/2004
- RE: DOMParser, Scott Cantor, 11/19/2004
  - RE: DOMParser, Howard Gilbert, 11/19/2004
    - RE: DOMParser, Scott Cantor, 11/22/2004
      - RE: DOMParser, Howard Gilbert, 11/23/2004
- Re: DOMParser, Walter Hoehn, 11/19/2004
