GenXDM Concepts


Table of Contents

Introduction
GenXDM Design
Solutions
Applications
Extensions
Processors
State
Bridges
Building Bridges
Model

Introduction

GenXDM is an XDM application programming interface (API) for analyzing, creating, and manipulating XML in Java. GenXDM embodies the XQuery Data Model, and is consequently a tree-oriented API, but it does not introduce a new tree model. Instead, it is intended to run over existing tree models, and to permit the introduction of new, specialized models optimized for a particular purpose.

GenXDM enables applications to write code that uses and manipulates XML trees without being tied to a particular XML tree representation like DOM, DOM4J, AXIOM, or any other. It also prods developers towards an immutable view of XML trees, which makes it easier and faster to work with XML across multiple cores and multiple processors.

GenXDM Design

GenXDM makes extensive use of Java generics to enable the API to run over any arbitrary tree model for which a "bridge" has been created. Bridges are provided for W3C DOM, for Apache Axiom, and for a simple "reference model" intended to aid others to develop bridges as well.

The API divides naturally into bridges and processors, which are usually organized by applications.

A bridge, as an implementation of the GenXDM handles, connects the GenXDM API to the underlying Data Model. The bridge provides the abstraction over which applications and processors operate, including the model, input/output, and a context that associates related tree-specific functions.

Operating over the data model described in the bridge, a GenXDM processor is a code library that performs a specific, well-described function over XML. For example, most processors can be described with a single word or phrase: "serializer," "parser," "converter," "validator," "transformer," "signer," and so on.

An application is a design pattern that performs a larger task, usually consisting of several smaller processes that are organized by the application. In other words, the application orchestrates the behavior of one or more processors. Additionally, an application usually manages the bridges, selecting the appropriate bridge for the incoming node type. Most often, the ProcessingContextFactories are instantiated at the application level.

A bridge exposes an instance of the Data Model (typed or untyped, mutable or immutable). A processor operates over the data model; several examples are provided, some for their interesting functionality, some for their core utility, and some as samples to inspire further development.

See Bridges and Processors for additional information.

The GenXDM design rests on four pillars: the Handle/Body design pattern, Java generics, the XQuery Data Model, and immutability for XML processing as a paradigm.

Handle/Body Pattern

GenXDM uses the Handle/Body pattern (also sometimes called the Bridge pattern). This pattern provides a well-defined set of operations over an abstraction (the handle), which may then be adapted to specific implementations (the body). For GenXDM, the primary "handles" are the:

  • Model or Cursor

  • Processing Context

  • Node Factory in the mutable API

  • Type and typed-value (Atom) Bridges in the schema-aware API

The GenXDM use of the Handle/Body pattern for XML tree models can be compared to the similar pattern used for database drivers in the Java Database Connectivity (JDBC) API. Each bridge may be viewed as equivalent to a vendor-specific driver.

Because applications and processors need not write separate code paths for different tree models, Handle/Body models can be injected very late, even at runtime. As a result, the available models can be compared against the application's or processor's requirements, and the tree model best suited to the problem at hand can be chosen. This in turn allows for more rigorous testing and can lead to model improvements.

GenXDM and the Handle/Body model also allow developers to choose a model based on technical merits without considering the importance of the network effect for the DOM. With GenXDM, "niche" tree models for XML can be designed and optimized for particular use cases. In other words, by always using these handles for access, special-purpose bodies become more practical.

Note

The Handle/Body pattern is different from a wrapper. Whereas a wrapper encases every node in the tree, the Handle/Body pattern presents applications and processors with just one new abstraction, or handle, represented by a single instance. This allows GenXDM to add very little weight to the existing tree model compared to the weight added by a wrapper covering each node in the tree.

Although there is a cost (in memory and performance) to using the handles rather than directly manipulating the bodies, the benefits in flexibility and capability more than compensate for it. In exchange for a memory/performance impact measured in low single-digit percentages (for most tree model APIs), an application or processor gains the ability to manipulate all supported tree model APIs.
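The difference between a handle and a per-node wrapper can be sketched in a few lines. The following is a minimal illustration, not GenXDM code: TinyModel, DomTinyModel, and HandleBodySketch are hypothetical names, and the interface keeps only two operations whose names are borrowed from the Model listings in this document. One stateless handle instance serves every DOM node, and the processor method never mentions the DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical, trimmed-down handle: two operations over an opaque node type N. */
interface TinyModel<N> {
    String getLocalName(N node);
    Iterable<N> getChildElements(N node);
}

/** A stateless body adapter for W3C DOM; the nodes it returns are plain DOM nodes, not wrappers. */
final class DomTinyModel implements TinyModel<Node> {
    public String getLocalName(Node node) {
        return node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
    }

    public Iterable<Node> getChildElements(Node node) {
        List<Node> out = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                out.add(c);
            }
        }
        return out;
    }
}

final class HandleBodySketch {
    /** Written only against the handle, so it works for any N that has a TinyModel. */
    static <N> int countElementsNamed(TinyModel<N> model, N node, String name) {
        int count = name.equals(model.getLocalName(node)) ? 1 : 0;
        for (N child : model.getChildElements(node)) {
            count += countElementsNamed(model, child, name);
        }
        return count;
    }

    /** Parses a string with the JDK's DOM parser and returns the document element. */
    static Node parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyModel<Node> model = new DomTinyModel(); // one shared handle for the whole tree
        Node root = parseRoot("<a><b/><c><b/></c></a>");
        System.out.println(countElementsNamed(model, root, "b")); // prints 2
    }
}
```

Swapping in another tree model would mean supplying another TinyModel implementation; countElementsNamed itself would not change.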

Java Generics

GenXDM makes extensive use of Java generics to enable the API to run over any arbitrary tree model for which a "bridge" has been created.

In addition to using built-in Java generics, GenXDM defines two common parameters, N and A:

  • N is the "node" handle.

  • A is the "atom" or "atomic value" handle.

In GenXDM, APIs that accept or return collections typically use Iterable in their signatures, rather than counts, specialized objects with pseudo-iterators, single-use iterators, or arrays.

Java generics provide interoperability. By defining the node and atom handle parameters, each of the tree models can be viewed and manipulated through the lens of the XQuery Data Model. As a result, GenXDM developers can ignore the effect created by the existence of parsers, processors, and applications that understand no model but the DOM, regardless of its fitness for their domain of operation. Because GenXDM provides a DOM bridge, it is able to leverage that network effect. Each bridge added increases the network effect. However, note that conversion from model to model remains expensive so single document conversion is not efficient.
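As a rough illustration of how the N parameter and Iterable-based signatures combine, the sketch below runs one generic method over two unrelated "bodies". Everything here is assumed for the example: ChildModel is a one-method stand-in for a GenXDM handle (only the getChildAxis name comes from the listings in this document), and the two toy tree representations stand in for real tree models.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a GenXDM-style handle, parameterized on the node type N. */
interface ChildModel<N> {
    Iterable<N> getChildAxis(N node);
}

/** Body 1: a tiny purpose-built tree class. */
final class Branch {
    final List<Branch> kids;

    Branch(Branch... kids) {
        this.kids = Arrays.asList(kids);
    }
}

final class BranchModel implements ChildModel<Branch> {
    public Iterable<Branch> getChildAxis(Branch node) {
        return node.kids;
    }
}

/** Body 2: nested java.util.List values used as a tree, with no dedicated node class. */
final class NestedListModel implements ChildModel<List<?>> {
    public Iterable<List<?>> getChildAxis(List<?> node) {
        List<List<?>> out = new ArrayList<>();
        for (Object o : node) {
            if (o instanceof List) {
                out.add((List<?>) o);
            }
        }
        return out;
    }
}

final class GenericsSketch {
    /** One code path for every tree model: N is bound by the caller's choice of model. */
    static <N> int depth(ChildModel<N> model, N node) {
        int deepest = 0;
        for (N child : model.getChildAxis(node)) { // Iterable plugs straight into for-each
            deepest = Math.max(deepest, depth(model, child));
        }
        return 1 + deepest;
    }

    public static void main(String[] args) {
        Branch tree = new Branch(new Branch(), new Branch(new Branch()));
        List<?> lists = Arrays.asList(Arrays.asList(), Arrays.asList(Arrays.asList()));
        System.out.println(depth(new BranchModel(), tree));
        System.out.println(depth(new NestedListModel(), lists));
    }
}
```

Returning Iterable keeps the caller free to iterate more than once and leaves each model free to compute the axis lazily, neither of which an array or a single-use iterator would allow.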

XQuery Data Model

GenXDM provides a Java API that embodies the XQuery Data Model (XDM). The XDM is conceptually complete, and defined in a context that permits type definition, navigation operations, and more advanced functions. This rigorous, well-defined specification was adopted as the basis for the GenXDM API, and represents GenXDM's answer to the problem of variability. Any property or concept that exists in the XDM specification is present in GenXDM. If the concept is not in the XDM specification, then either it should not be exposed in the GenXDM API, or it should be compatible with the well-specified API. For example, the entire mutable GenXDM API was added as an extension because XQuery does not define operations that modify trees.

The XQuery Data Model also provides the first well-integrated access to XML Schema information. GenXDM defines a common model for XML Schema, compatible with the XDM's definition and use of XML Schema types and typed values, as a standard extension.

Although GenXDM is not the only model to provide support for XML Schema, the schema-aware extensions in GenXDM can be implemented for any tree model, and are exposed through APIs that are clearly related to, and usually extensions of, the core GenXDM APIs. In other words, by addressing the problem of variability by adhering to and conforming with the XQuery Data Model Specification, GenXDM enables the development of a "next wave" of XML processing technologies, based on XPath 2.0, XSLT 2.0, and XQuery 1.0 (including the new generation of XQuery-conformant databases).

Immutable Paradigm

GenXDM promotes a paradigm in which a received or generated XML document is input, and the document is then transformed and supplied to other processes, wherever those processes are. The nodes of any given XML instance are never modified during processing. As a result, GenXDM allows an immutable paradigm. In combination with the enabling of custom, potentially domain-specific XML tree models accessed through a GenXDM bridge, the immutable paradigm (over an immutable tree model) can achieve optimizations not possible for a tree model in which the existence of mutability prevents caching, compaction, and deferred loading.

Note

To ease migration, the MutableContext extension permits mutability.
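The immutable paradigm can be illustrated without any GenXDM types. In the sketch below, Elem is a hypothetical immutable node class invented for the example; a transformation returns a new tree and leaves its input untouched, so several threads can read the same input concurrently without locks or defensive copies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical immutable element node: all fields final, child list unmodifiable. */
final class Elem {
    final String name;
    final List<Elem> children;

    Elem(String name, Elem... children) {
        this.name = name;
        this.children = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(children)));
    }
}

final class ImmutableSketch {
    /** A "transformation" builds a new tree; the input tree is never modified. */
    static Elem rename(Elem in, String from, String to) {
        Elem[] kids = new Elem[in.children.size()];
        for (int i = 0; i < kids.length; i++) {
            kids[i] = rename(in.children.get(i), from, to);
        }
        return new Elem(in.name.equals(from) ? to : in.name, kids);
    }

    /** Counts the elements with the given name in the subtree rooted at e. */
    static int count(Elem e, String name) {
        int n = e.name.equals(name) ? 1 : 0;
        for (Elem c : e.children) {
            n += count(c, name);
        }
        return n;
    }

    /** Runs the same read-only query from several threads against one shared tree. */
    static long parallelTotal(Elem shared, String name, int tasks) {
        return IntStream.range(0, tasks).parallel()
                .mapToLong(i -> count(shared, name))
                .sum();
    }

    public static void main(String[] args) {
        Elem doc = new Elem("a", new Elem("b"), new Elem("c", new Elem("b")));
        Elem renamed = rename(doc, "b", "x");
        System.out.println(count(doc, "b") + " " + count(renamed, "b") + " " + count(renamed, "x"));
        System.out.println(parallelTotal(doc, "b", 8));
    }
}
```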

Solutions

GenXDM offers solutions to a number of issues with the XML tree models in Java:

  • Multiplicity

    Java developers working with XML have a variety of XML tree models to choose from, including DOM, JDOM, DOM4J, XOM, and AxiOM. Innumerable private models are also available. In most cases, applications and processors written for one of these models are not usable with other models.

  • Interoperability

    Document Object Model (DOM) was the first tree model offered, and as such had a first mover advantage. Subsequent tree models were developed to address the shortcomings of the DOM, but not to interoperate with it. Although other models may have technical advantages that make them more suitable than the DOM for a given application, in order to use those new models efficiently within the JVM, all parts of the application need to use the same tree model. Developers must solve a cruel equation in which the marginal benefits of switching from the DOM are typically low, whereas the marginal costs are always high. The alternatives seem to be to write multiple code paths to achieve the same purpose (with different tree models), or to wrap each node of each tree model in an application-specific abstraction.

  • Variability

    Each of the various tree models exposes different specifications and property sets, or abstractions (node types). Additionally, in each tree model the boundaries between lexical, syntactic, and semantic may be drawn at different points. One consequence of this variability is that it is difficult or awkward to add support for specifications "higher in the stack." Such specifications are most commonly handled as extensions. For example, XPath 1.0 and XSLT 1.0 work as external tools. AxiOM is an entire XML tree model built largely so that the SOAP abstractions could be represented cleanly as extensions.

  • Weight

    The DOM requires considerably more memory than the XML itself requires. Newer tree models are better, but still weighty.

Applications

An application is a design pattern that performs a larger task, usually consisting of several smaller processes that are organized by the application. In other words, the application orchestrates the behavior of one or more processors. Additionally, an application usually manages the bridges, selecting the appropriate bridge for the incoming node type. Most often, the ProcessingContextFactories are instantiated at the application level.

Extensions

The GenXDM API is completed with two extensions in the core ProcessingContext to permit bridges to signal support for optional functionality. The mutable extension adds mutability by adding methods to the base interfaces, or by adding new interfaces. The schema-aware or typed extension adds schema awareness, again by adding methods to base interfaces, or by adding new interfaces; the typed extension also introduces the "atom" parameter.

Mutable Extensions

MutableContext permits mutability. Although immutability provides important benefits for XML processing, all currently available tree models are mutable, and nearly all processors and applications expect mutability. To ease migration, ProcessingContext provides a method, getMutableContext(), which permits the bridge to signal that it supports mutability by returning an implementation of the MutableContext extension.
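A caller can treat getMutableContext() as a capability check. The sketch below assumes, as the bridge-building steps in this document suggest, that a bridge without mutability support returns null. The interfaces are trimmed stand-ins with hypothetical names (SketchProcessingContext, SketchMutableContext), and appendChild is an invented operation used only to make the example do something.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the mutable extension; appendChild is a hypothetical operation. */
interface SketchMutableContext<N> {
    void appendChild(N parent, N child);
}

/** Stand-in for ProcessingContext: only the capability-signalling method is shown. */
interface SketchProcessingContext<N> {
    SketchMutableContext<N> getMutableContext(); // null means "mutability not supported"
}

/** Toy node type for the example. */
final class Box {
    final List<Box> kids = new ArrayList<>();
}

/** A bridge that does not support mutation. */
final class ReadOnlyContext implements SketchProcessingContext<Box> {
    public SketchMutableContext<Box> getMutableContext() {
        return null;
    }
}

/** A bridge that signals support by returning an implementation. */
final class ReadWriteContext implements SketchProcessingContext<Box> {
    public SketchMutableContext<Box> getMutableContext() {
        return (parent, child) -> parent.kids.add(child);
    }
}

final class MutabilityCheckSketch {
    /** Appends in place when the bridge allows it; reports false otherwise. */
    static <N> boolean tryAppend(SketchProcessingContext<N> pcx, N parent, N child) {
        SketchMutableContext<N> mutable = pcx.getMutableContext();
        if (mutable == null) {
            return false; // caller falls back to building a new tree instead
        }
        mutable.appendChild(parent, child);
        return true;
    }

    public static void main(String[] args) {
        Box parent = new Box();
        System.out.println(tryAppend(new ReadOnlyContext(), parent, new Box()));
        System.out.println(tryAppend(new ReadWriteContext(), parent, new Box()));
        System.out.println(parent.kids.size());
    }
}
```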

Typed Extensions

TypedContext provides the XDM-defined schema-aware properties and manipulations. Most notably, the typed context introduces an additional parameter, the <A>tom handle. The base and mutable interfaces deal only with string values for text node and attribute content (in XDM terms, actually untyped atomic). The XQuery Data Model defines the concept of "atom", which corresponds to a typed value or list of typed values. Atoms are inherently sequences of atoms (a single atom is a one-element list); "sequence" is also introduced in the schema-aware API, but unlike atom, is not represented by an independent common parameter.

Processors

Operating over the data model described in the bridge, a GenXDM processor is a code library that performs a specific, well-described function over XML. For example, most processors can be described with a single word or phrase: "serializer," "parser," "converter," "validator," "transformer," "signer," and so on.

A processor is distinguished from an application, which may create (generate), destroy (consume), modify, and otherwise manipulate XML in multiple steps. Where a processor contributes special functionality to the performance of a goal, the application oversees and orchestrates achievement of the goal from receipt to completion.

Because of the wide variety of valid processor types, no specific interface or contract is specified for XML processors designed for use with GenXDM. This is in contrast with bridges, for which an extensive API exists in GenXDM. While some processors might be defined to have a method with the signature N process(N, Model<N>), for others this is entirely inappropriate. Even for processors that might reasonably process a node, their function is more clearly expressed if they "transform" or "extract" or "enhance", or otherwise mark their processing by its specific name, not the more general one.

State

GenXDM processors may be divided into two classes: stateful and stateless. Here, state refers to the processor's need to maintain state in the form of any of the parameters specialized by a particular bridge implementation (node and atom), disregarding maintenance of state unrelated to GenXDM parameters.
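One way to picture the distinction, under the assumption that "state" means holding on to values of type N between calls: a stateless processor can expose generic methods and be shared across bridges, while a stateful one has to carry N as a class-level type parameter. All names below (NameModel, ElementCounter, ElementCollector, Tag) are hypothetical and exist only for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical trimmed handle with two operations named after the Model listings. */
interface NameModel<N> {
    String getLocalName(N node);
    Iterable<N> getChildElements(N node);
}

/** Stateless: holds no N between calls, so one instance can serve every bridge. */
final class ElementCounter {
    <N> int count(NameModel<N> model, N node) {
        int total = 1;
        for (N child : model.getChildElements(node)) {
            total += count(model, child);
        }
        return total;
    }
}

/** Stateful: keeps nodes between calls, so the class itself is parameterized on N. */
final class ElementCollector<N> {
    private final NameModel<N> model;
    private final List<N> found = new ArrayList<>();

    ElementCollector(NameModel<N> model) {
        this.model = model;
    }

    void collect(N node, String name) {
        if (name.equals(model.getLocalName(node))) {
            found.add(node);
        }
        for (N child : model.getChildElements(node)) {
            collect(child, name);
        }
    }

    List<N> results() {
        return found;
    }
}

/** Toy body and its model, used only to exercise the two processors. */
final class Tag {
    final String name;
    final List<Tag> kids;

    Tag(String name, Tag... kids) {
        this.name = name;
        this.kids = Arrays.asList(kids);
    }
}

final class TagModel implements NameModel<Tag> {
    public String getLocalName(Tag t) { return t.name; }
    public Iterable<Tag> getChildElements(Tag t) { return t.kids; }
}

final class ProcessorStateSketch {
    public static void main(String[] args) {
        Tag doc = new Tag("a", new Tag("b"), new Tag("c", new Tag("b")));
        System.out.println(new ElementCounter().count(new TagModel(), doc));
        ElementCollector<Tag> collector = new ElementCollector<>(new TagModel());
        collector.collect(doc, "b");
        System.out.println(collector.results().size());
    }
}
```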

Bridges

A bridge, as an implementation of the GenXDM handles, connects the GenXDM API to the underlying Data Model. The bridge provides the abstraction over which applications and processors operate, including the model, input/output, and a context that associates related tree-specific functions.

Bridges are provided for:

  • Document Object Model (DOM)

  • Axis Object Model (AxiOM)

  • Cx: a simple reference model that can be used to develop additional custom bridges.

These bridges, included in the GenXDM source tree, provide examples of finished bridges. The development process is easily described. Note, however, that most tree models present unique challenges when adapted to the XQuery Data Model. Development time may be primarily consumed in handling these impedance mismatches.

Building Bridges

Before creating a new bridge, review the existing bridges, particularly the Cx bridge. The Cx reference bridge was created specifically to provide an example for bridge developers.

To create a new base bridge (untyped, immutable) for an as-yet unsupported tree model:

  1. Implement ProcessingContext and Model. Decide what the node abstraction must be.

    For instance: the DOM defines <N> as Node. AxiOM defines it as Object (AxiOM does not have a single base interface that marks all node types). The Cx reference model bridge uses XmlNode.

  2. Use the bridgekit module to get a simple, generic implementation of Cursor (over the custom Model).

    The bridgekit module is a collection of utilities intended to help bridge developers. It includes, for instance, an implementation of the XML Schema model (SmSchema) and the XmlAtom typed-value implementation, as well as the CursorOnModel helper.

  3. Implement FragmentBuilder.

    The FragmentBuilder interface has five methods for creating Text, Attribute, Namespace, Comment, and Processing Instruction node types, and an additional two each (start and end) for the container node types, Element and Document.

  4. Use the generic implementation of DocumentHandler from the input-output processor.

    The generic DocumentHandler in the input-output module is not terribly mature or robust, but can do the job for an initial implementation.

  5. If desired, implement mutability or schema awareness.

    See Adding Mutability and Adding Schema Awareness for details.

  6. Use the bridgetest module to verify equivalence with existing bridges.

    The bridgetest module is designed to make implementation easy; enabling each test requires only that the bridge implement the single abstract method, which returns the bridge's implementation of ProcessingContext (from which all other abstractions can be reached). Adding a test implementation is thus mostly a mechanical task.
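The shape of the FragmentBuilder described in step 3 (five leaf-node methods plus a start/end pair for each container) can be sketched as an event interface. The method signatures below are assumptions made for illustration; only the names attribute(), startElement(), and text() appear in this document, and the way the real interface hands back constructed nodes is not described here, so the sketch records an indented outline instead of building nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shape only: five leaf-node methods plus start/end pairs for containers. */
interface SketchFragmentBuilder {
    void startDocument();
    void endDocument();
    void startElement(String namespaceURI, String localName, String prefix);
    void endElement();
    void attribute(String namespaceURI, String localName, String prefix, String value);
    void namespace(String prefix, String namespaceURI);
    void text(String data);
    void comment(String data);
    void processingInstruction(String target, String data);
}

/** Records an indented outline of the events instead of building real nodes. */
final class OutlineBuilder implements SketchFragmentBuilder {
    final List<String> lines = new ArrayList<>();
    private int depth = 0;

    private void add(String entry) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        lines.add(sb.append(entry).toString());
    }

    public void startDocument() { add("document"); depth++; }
    public void endDocument() { depth--; }
    public void startElement(String ns, String local, String prefix) { add("element " + local); depth++; }
    public void endElement() { depth--; }
    public void attribute(String ns, String local, String prefix, String value) { add("@" + local + "=" + value); }
    public void namespace(String prefix, String uri) { add("xmlns:" + prefix + "=" + uri); }
    public void text(String data) { add("text " + data); }
    public void comment(String data) { add("comment " + data); }
    public void processingInstruction(String target, String data) { add("pi " + target); }
}

final class FragmentBuilderSketch {
    /** Drives the builder the way a parser or another processor might. */
    static List<String> buildSample() {
        OutlineBuilder builder = new OutlineBuilder();
        builder.startDocument();
        builder.startElement("", "greeting", "");
        builder.attribute("", "lang", "", "en");
        builder.text("hello");
        builder.endElement();
        builder.endDocument();
        return builder.lines;
    }

    public static void main(String[] args) {
        for (String line : buildSample()) {
            System.out.println(line);
        }
    }
}
```

A bridge's own builder would create nodes of its N type at each event; the container pairs are what let it track the current parent.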

Adding Mutability

Note

The GenXDM approach to mutability is more restricted than most current tree APIs. For example, the GenXDM mutable API does not support changing the value of a text or attribute node. Leaf nodes remain immutable; container nodes (document and element) are mutable in content (contained nodes) only.

To add support for mutability:

  1. Implement MutableContext and return it from ProcessingContext instead of null.

    MutableContext provides access to the NodeFactory, MutableModel, and MutableCursor implementations.

  2. Implement MutableModel as an extension of the base Model from the bridge created previously.

    MutableModel adds methods to set attributes and namespaces, to add, remove, and replace children.

  3. Use the bridgekit module to base the bridge's MutableCursor on its MutableModel.

    The bridgekit implementations are reasonable starting points, though optimization is likely to require a custom implementation.

  4. Implement NodeFactory.

    NodeFactory contains methods to create each node type, where MutableModel establishes the relationships between nodes.

  5. Add tests from the bridgetest module.

Adding Schema Awareness

To add support for schema-awareness:

  1. Implement TypedContext and return it from ProcessingContext instead of null; note that TypedContext is-a SmSchema. Decide what the <A> (atom) abstraction must be.

    Current implementations all define <A> as XmlAtom. This is not required.

  2. Implement TypedModel as an extension of the base Model from the bridge created previously.

    The TypedModel interface adds only five methods to Model, all related to the introduction of type names and typed values. Ensuring that the type annotations and typed values are associated with the nodes in the tree is one of the most challenging tasks in implementation.

  3. Use the bridgekit module to base the bridge's TypedCursor on its TypedModel.

    CursorOnTypedModel extends CursorOnModel as expected.

  4. Implement or reuse from the bridgekit module an AtomBridge (typed value support).

    If the chosen <A>tom is XmlAtom, the XmlAtomBridge already exists.

  5. Implement or reuse from the bridgekit module a MetaBridge (type support).

    Again, if the <A>tom is XmlAtom, a MetaBridge exists in the bridgekit.

  6. Implement SequenceBuilder as an extension of the FragmentBuilder from above.

    SequenceBuilder adds overrides for the attribute(), startElement(), and text() methods (adding type names and typed values), plus methods to create an atom and to start and end a sequence.

  7. Add the typed tests from the bridgetest module.

    As with the standard tests, these are easy to implement, following the same pattern.
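The test pattern mentioned in these steps (shared checks in a base class, with each bridge supplying one method that returns its ProcessingContext) might look roughly like the following. This is not the bridgetest API: BridgeCheckBase, CheckContext, CheckModel, the newProcessingContext method name, and the newSampleDocument helper are all hypothetical, and a W3C DOM document stands in for a bridge.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Stand-in for Model, trimmed to the two navigation methods the checks need. */
interface CheckModel<N> {
    N getParent(N node);
    N getFirstChild(N node);
}

/** Stand-in for ProcessingContext; newSampleDocument is a hypothetical helper. */
interface CheckContext<N> {
    CheckModel<N> getModel();
    N newSampleDocument(); // returns a document node that has one element child
}

/** Shared checks live in the base class; a bridge supplies exactly one method. */
abstract class BridgeCheckBase<N> {
    protected abstract CheckContext<N> newProcessingContext();

    final boolean documentHasNoParent() {
        CheckContext<N> pcx = newProcessingContext();
        return pcx.getModel().getParent(pcx.newSampleDocument()) == null;
    }

    final boolean childPointsBackToParent() {
        CheckContext<N> pcx = newProcessingContext();
        CheckModel<N> model = pcx.getModel();
        N doc = pcx.newSampleDocument();
        return doc.equals(model.getParent(model.getFirstChild(doc)));
    }
}

/** Enabling the checks for DOM: implement the single abstract method. */
final class DomBridgeCheck extends BridgeCheckBase<Node> {
    protected CheckContext<Node> newProcessingContext() {
        return new CheckContext<Node>() {
            public CheckModel<Node> getModel() {
                return new CheckModel<Node>() {
                    public Node getParent(Node n) { return n.getParentNode(); }
                    public Node getFirstChild(Node n) { return n.getFirstChild(); }
                };
            }

            public Node newSampleDocument() {
                try {
                    Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
                    d.appendChild(d.createElement("root"));
                    return d;
                } catch (ParserConfigurationException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    }

    public static void main(String[] args) {
        DomBridgeCheck check = new DomBridgeCheck();
        System.out.println(check.documentHasNoParent());
        System.out.println(check.childPointsBackToParent());
    }
}
```

Because every check goes through the context, the same base class can be subclassed once per bridge, which is what makes comparing a new bridge against existing ones largely mechanical.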

Model

The core of the GenXDM paradigm is an abstraction called Model. Because this is an example of the Handle/Body design pattern (and is stateless), only one instance of Model is needed for navigation and investigation for any and all instances of the XML tree model for which the particular Model is specialized.

Model is composed from three interfaces, reflecting three different forms of information that might be obtained from an XQuery Data Model:

  • NodeInformer reports information about the content/state of a particular node in context.

  • NodeNavigator permits one to obtain a different node given a particular starting node.

  • AxisNavigator supplies iteration over the standard XPath/XQuery axes, starting from a particular origin node.

public interface Model<N>
    extends Comparator<N>, NodeInformer<N>, NodeNavigator<N>, AxisNavigator<N> {
    void stream(N node, boolean copyNamespaces, ContentHandler handler) throws GxmlException;
}

public interface NodeInformer<N> {
    Iterable<QName> getAttributeNames(N node, boolean orderCanonical);

    String getAttributeStringValue(N parent, String namespaceURI, String localName);

    URI getBaseURI(N node);

    URI getDocumentURI(N node);

    String getLocalName(N node);

    Iterable<NamespaceBinding> getNamespaceBindings(N node);

    String getNamespaceForPrefix(N node, String prefix);
    
    Iterable<String> getNamespaceNames(N node, boolean orderCanonical);

    String getNamespaceURI(N node);

    Object getNodeId(N node);

    NodeKind getNodeKind(N node);

    String getPrefix(N node);

    String getStringValue(N node);

    boolean hasAttributes(N node);

    boolean hasChildren(N node);

    boolean hasNamespaces(N node);

    boolean hasNextSibling(N node);

    boolean hasParent(N node);

    boolean hasPreviousSibling(N node);

    boolean isAttribute(N node);

    boolean isElement(N node);

    boolean isId(N node);

    boolean isIdRefs(N node);

    boolean isNamespace(N node);

    boolean isText(N node);

    boolean matches(N node, NodeKind nodeKind, String namespaceURI, String localName);

    boolean matches(N node, String namespaceURI, String localName);
}

public interface NodeNavigator<N> {
    N getAttribute(N node, String namespaceURI, String localName);

    N getElementById(N context, String id);

    N getFirstChild(N origin);

    N getFirstChildElement(N node);

    N getFirstChildElementByName(N node, String namespaceURI, String localName);

    N getLastChild(N node);

    N getNextSibling(N node);

    N getNextSiblingElement(N node);

    N getNextSiblingElementByName(N node, String namespaceURI, String localName);

    N getParent(N origin);

    N getPreviousSibling(N node);

    N getRoot(N node);
}

public interface AxisNavigator<N> {
    Iterable<N> getAncestorAxis(N node);

    Iterable<N> getAncestorOrSelfAxis(N node);

    Iterable<N> getAttributeAxis(N node, boolean inherit);

    Iterable<N> getChildAxis(N node);

    Iterable<N> getChildElements(N node);

    Iterable<N> getChildElementsByName(N node, String namespaceURI, String localName);

    Iterable<N> getDescendantAxis(N node);

    Iterable<N> getDescendantOrSelfAxis(N node);

    Iterable<N> getFollowingAxis(N node);

    Iterable<N> getFollowingSiblingAxis(N node);

    Iterable<N> getNamespaceAxis(N node, boolean inherit);

    Iterable<N> getPrecedingAxis(N node);

    Iterable<N> getPrecedingSiblingAxis(N node);
}
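To show how the navigation methods listed above might be used together, here is a small sketch over W3C DOM. NavModel keeps only four of the listed methods, and DomNavModel is a hypothetical stand-in rather than the DOM bridge shipped with GenXDM. Returning null from getLocalName for non-element nodes is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Four methods borrowed from the listings above; not the full Model interface. */
interface NavModel<N> {
    String getLocalName(N node);
    N getParent(N origin);
    N getFirstChildElement(N node);
    N getNextSiblingElement(N node);
}

/** Hypothetical DOM-backed implementation of the trimmed interface. */
final class DomNavModel implements NavModel<Node> {
    public String getLocalName(Node n) {
        if (n.getNodeType() != Node.ELEMENT_NODE) {
            return null; // assumption: only elements are named in this sketch
        }
        return n.getLocalName() != null ? n.getLocalName() : n.getNodeName();
    }

    public Node getParent(Node n) { return n.getParentNode(); }
    public Node getFirstChildElement(Node n) { return firstElementFrom(n.getFirstChild()); }
    public Node getNextSiblingElement(Node n) { return firstElementFrom(n.getNextSibling()); }

    /** Skips text, comments, and other non-element siblings. */
    private static Node firstElementFrom(Node n) {
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n;
    }
}

final class ModelUsageSketch {
    /** Builds a /a/c/b style path by following getParent up to the top of the tree. */
    static <N> String pathOf(NavModel<N> model, N node) {
        StringBuilder path = new StringBuilder();
        for (N n = node; n != null; n = model.getParent(n)) {
            String name = model.getLocalName(n);
            if (name != null) {
                path.insert(0, "/" + name);
            }
        }
        return path.toString();
    }

    /** Depth-first search using only first-child and next-sibling navigation. */
    static <N> N findFirst(NavModel<N> model, N node, String name) {
        if (name.equals(model.getLocalName(node))) {
            return node;
        }
        for (N c = model.getFirstChildElement(node); c != null; c = model.getNextSiblingElement(c)) {
            N hit = findFirst(model, c, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Parses a string with the JDK's DOM parser and returns the document node. */
    static Node parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NavModel<Node> model = new DomNavModel();
        Node doc = parse("<a><c> <b/></c></a>");
        System.out.println(pathOf(model, findFirst(model, doc, "b")));
    }
}
```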