Level: Introductory
Parand Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services
01 Oct 2001
Based on an analysis of several large XML projects, this article examines how to make effective and efficient use of DOM. Developer/author Tony Darugar provides a set of usage patterns and a library of functions to make DOM robust and easy to use. Though the DOM offers a flexible and powerful means for creating, processing, and manipulating XML documents, some aspects of DOM make it awkward to use and can lead to brittle and buggy code. This article suggests ways to avoid the pitfalls. Perl code samples demonstrate the techniques.
The Document Object Model (DOM) is a platform- and language-neutral interface for dynamically accessing and updating the content, structure, and style of XML documents. DOM defines a standard set of interfaces for representing documents,
a standard model of how these objects can be combined, and a standard set of methods for accessing and manipulating them. DOM is a W3C Recommendation, which makes it a recognized Web standard. Implementations are available for a wide variety
of languages, including Perl, C, C++, Java, Tcl, and Python.
As I'll demonstrate in this article, DOM is an excellent choice for XML handling when stream-based models (such as SAX) are not sufficient. Unfortunately, several aspects of the specification, such as its language-neutral interface and its use of the "everything-is-a-node" abstraction, make it difficult to use and prone to generating brittle code. This was particularly evident in my company's recent review of several large DOM projects that were developed by a variety of developers over the past year. The common problems, and their remedies, are discussed below.
Exploring the DOM
The DOM specification is designed to be usable with any programming language.
Therefore, it attempts to use a common, core set of features which are available in
all languages. The DOM specification also attempts to remain neutral in its interface definitions. Because of this, Perl programmers can apply their DOM knowledge when working with Java, and vice versa.
The specification also treats every part of the document as a node consisting
of a type and a value. This provides an elegant conceptual framework for dealing
with all aspects of the document. As an example, the following XML fragment
<paragraph align="left">the <it>Italicized</it> portion.</paragraph>
is represented via the following DOM structure:
Figure 1: DOM Representation of an XML Document
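The node structure can also be inspected programmatically. The following is a minimal sketch, assuming the XML::DOM module is installed; describe_tree and %TYPE_NAME are hypothetical names introduced for illustration and are not part of the DOM API. It walks the parsed fragment and prints one line per node, listing attributes separately because they are not children of their element.

```perl
use strict;
use warnings;
use XML::DOM;    # exports node type constants such as ELEMENT_NODE

# Readable names for the node types that appear in the fragment.
my %TYPE_NAME = (
    DOCUMENT_NODE() => 'Document',
    ELEMENT_NODE()  => 'Element',
    TEXT_NODE()     => 'Text',
);

# Hypothetical helper: return one indented line per node under $node.
sub describe_tree {
    my ($node, $depth) = @_;
    $depth ||= 0;
    my $indent = '  ' x $depth;
    my $type   = $TYPE_NAME{ $node->getNodeType() } || 'Other';
    my $line   = "$indent$type: " . $node->getNodeName();

    # Only Text nodes carry character data directly.
    if ( $node->getNodeType() == TEXT_NODE ) {
        $line .= ' = "' . $node->getData() . '"';
    }
    my @lines = ($line);

    # Attributes live in a NamedNodeMap, not in the child list.
    if ( $node->getNodeType() == ELEMENT_NODE ) {
        my $attrs = $node->getAttributes();
        for my $i ( 0 .. $attrs->getLength() - 1 ) {
            my $attr = $attrs->item($i);
            push @lines, "$indent  Attr: " . $attr->getName()
                       . ' = "' . $attr->getValue() . '"';
        }
    }

    push @lines, describe_tree( $_, $depth + 1 ) for $node->getChildNodes();
    return @lines;
}

my $fragment = '<paragraph align="left">the <it>Italicized</it> portion.</paragraph>';
my $doc      = XML::DOM::Parser->new()->parse($fragment);
print "$_\n" for describe_tree($doc);
```

Running the sketch shows that even this one-line fragment produces a Document node, two Element nodes, an Attr node, and several Text nodes.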

Each of the Document, Element, Text, and Attr pieces of the tree is a DOM::Node.
Design issues
The downside of DOM's language neutrality is that the methodologies and patterns that
are normally used in each programming language cannot be employed. For example,
the attributes of an XML node would naturally be represented in Perl as a hash,
since they are a set of unique name-value pairs. With DOM, however, they are
represented as a set of nodes, and the value of each is accessed via a separate
function call. Instead of using a simple hash, the programmer must learn to use a number of new
data structures and access methods. These minor inconveniences add up to unusual coding
practices and an increase in lines of code. They also force the programmer to learn the DOM method
of doing things in place of the way she would handle it intuitively.
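To make the contrast concrete, here is a small sketch, assuming the XML::DOM module, that copies an element's attributes into an ordinary Perl hash. The helper name attributes_as_hash is hypothetical and introduced only for illustration. Note how each attribute is a node in a NamedNodeMap and each name and value needs its own method call.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: copy an element's attributes into a plain hash.
sub attributes_as_hash {
    my ($element) = @_;
    my %attrs;

    # Only Element nodes have an attribute map worth reading.
    return %attrs
        unless defined $element && $element->getNodeType() == ELEMENT_NODE;

    my $map = $element->getAttributes();    # NamedNodeMap of Attr nodes
    for my $i ( 0 .. $map->getLength() - 1 ) {
        my $attr = $map->item($i);
        $attrs{ $attr->getName() } = $attr->getValue();
    }
    return %attrs;
}

my $doc  = XML::DOM::Parser->new()->parse('<paragraph align="left" lang="en">text</paragraph>');
my %attr = attributes_as_hash( $doc->getDocumentElement() );
print "$_ = $attr{$_}\n" for sort keys %attr;
```

Once the values are in a hash, the usual Perl idioms (exists, keys, slices) apply again.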
The everything-is-a-node abstraction, while quite elegant, leads to awkward coding situations, such as the attribute node example above. This also occurs when accessing the value contained within an XML tag. Consider the XML fragment: <tagname>Value</tagname>. You may think the text value would be accessible by calling a getValue or similar method on the tagname node. In fact, the text is treated as one or more child nodes under the tagname node. Thus, in order to get the text value, you need to traverse the children of tagname, collating them into a string. There is good reason for this: tagname may contain other embedded XML tags. If tagname does contain embedded XML tags, getting its text value makes less sense. In the real world, however, we have seen very frequent coding errors caused by this lack of convenient functions.
The everything-is-a-node abstraction also loses some value because of the number of node types
that exist and because of the lack of uniformity in their access methods. For example,
methods such as insertData are used to change the value of
CharacterData nodes, while the value of Attr
(attribute) nodes is set by direct access to a value field. Presenting
different interfaces for the different nodes diminishes the uniformity and elegance of the model and
steepens the learning curve.
Common coding problems
An analysis of several large XML projects revealed some common problems
in working with the DOM. A few of these are presented below.
Code bloat
In all of the projects that we looked at in our review, an overarching problem presented itself: it took many lines of code to do simple things. In one example, 16 lines
of code were used to check the value of an attribute. But the same task, with
improved robustness and error handling, can be accomplished in three lines of code.
Three factors contributed to the increase in the number of code lines: the low-level
nature of the DOM API, incorrect application of methods and programming patterns, and
lack of knowledge of the full API. The following sections present specific instances of these issues.
Traversing the DOM
In the code we examined, the most common task was to traverse or search the DOM. Here is a condensed version of the code required to find a node called "header" under the config section of the document:
my $document_root = $dom_document->getDocumentElement();
my $config_node = $document_root->getFirstChild();
foreach my $node ( $config_node->getChildNodes() ) {
    if ( $node->getNodeName() eq "header") {
        # do something
    }
}
The document is traversed from the root by getting the top element, getting
its first child (config_node), and finally
examining each of config_node's children individually.
Unfortunately, this method is not only verbose but also fragile
and prone to bugs.
As an example, the second line of the code gets the intermediate node
using the getFirstChild method. Already, a
multitude of potential problems exist. The first child of the root node may
not actually be the config_node the user is
searching for. By blindly following the first child, we have ignored the
actual name of the tag and will potentially be searching the incorrect part
of the document. A frequent error in this scenario occurs when the source
XML document contains whitespace or a carriage return after the root node;
the first child of the root node is actually a DOM::Text
node, not the intended node. To correctly navigate to our intended node, we
need to examine each of document_root's child nodes
until we find one that is not a Text node and that has the name we are looking for.
We are also ignoring the possibility that the document may have a different structure from
what we are expecting. If the document_root
doesn't have any child nodes, for example, config_node
will be set to undef, and the third line of the example will
raise an error. Therefore, to properly navigate the document, not only do we have to examine
each child node individually and check for the appropriate name, but at every step we
also have to check to make sure each method call returned a valid value. Writing
robust, error-free code that can handle arbitrary input requires both a great deal of
attention to detail and many lines of code.
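The checks described above can be gathered into one small function. This is a sketch under stated assumptions, not the code from the reviewed projects: it assumes the XML::DOM module, and find_child_element is a hypothetical helper name. It skips Text nodes (including whitespace), compares names, and returns undef instead of failing when the expected structure is missing.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: return the first child element of $parent
# whose tag name is $name, or undef if there is none.
sub find_child_element {
    my ($parent, $name) = @_;
    return undef unless defined $parent && defined $name;

    for my $child ( $parent->getChildNodes() ) {
        # Skip whitespace Text nodes, comments, and other non-elements.
        next unless $child->getNodeType() == ELEMENT_NODE;
        return $child if $child->getNodeName() eq $name;
    }
    return undef;
}

my $xml = <<'XML';
<root>
  <other/>
  <config>
    <header>settings</header>
  </config>
</root>
XML

my $doc    = XML::DOM::Parser->new()->parse($xml);
my $config = find_child_element( $doc->getDocumentElement(), 'config' );

# Safe even if $config is undef: the helper checks its arguments.
my $header = find_child_element( $config, 'header' );
print defined $header ? "found header\n" : "no header element\n";
```

Because every lookup tolerates an undefined parent, calls can be chained without a separate validity check after each step.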
Retrieving the text value within a tag
After DOM traversal, the second most common task was to retrieve the text value contained in a tag.
Consider the XML fragment <sometag>The Value</sometag>.
Having navigated our way to the sometag node, how do we
capture its text value (The Value)? An intuitive implementation may be:
$sometag->getData();
As you may have guessed, the above code will not
perform the desired action. We cannot call a getData
or a similar function on the sometag node because the actual
text is stored as one or more child nodes. A better approach would be:
$sometag->getFirstChild()->getData();
The problem here is that the value may not actually be contained in the first
child; processing instructions or other embedded nodes may be found within
sometag, or the text value may be contained in
several child nodes instead of in just one. Recall that whitespace is frequently
represented as a text node, so the call to
$sometag->getFirstChild() may get you only
the carriage return between the tag and its value. In fact, we need to traverse
all of the children, checking for nodes of type Text,
and collating their values until we have the complete value.
getElementsByTagName
The DOM interface includes a method for finding child nodes with a given
name. For example, the call:
my @results = $document_root->getElementsByTagName("name");
will return an array (or a NodeList) of tags called
name from within the document. This is certainly
more convenient than the traversal methods we discussed above. It is also the
cause of a common set of bugs.
The problem is that getElementsByTagName
recursively traverses the document, returning all matching nodes. Suppose you
have a document containing customer information, company information, and
product information. All three of these items can potentially have a
name tag within them. If you call
getElementsByTagName searching for customer names
and end up with product and company names as well, your program will likely
misbehave. Calling the function on a subtree of the document can diminish
the risks. However, XML's flexible nature makes it quite difficult to ensure
the subtree you are operating on has the structure you are expecting, and
doesn't have spurious child nodes with the name you are searching on.
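The following sketch, assuming the XML::DOM module, illustrates the effect on a made-up order document. The document content and the helper name names_under are hypothetical. The same search returns three names when started at the root and one name when started at the customer subtree.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: text of every <name> element found by a
# recursive getElementsByTagName search starting at $node.
sub names_under {
    my ($node) = @_;
    my @names;
    for my $el ( $node->getElementsByTagName('name') ) {
        # Join the Text children rather than trusting the first child.
        push @names, join '',
            map  { $_->getData() }
            grep { $_->getNodeType() == TEXT_NODE } $el->getChildNodes();
    }
    return @names;
}

my $xml = <<'XML';
<order>
  <customer><name>Ada</name></customer>
  <company><name>Acme</name></company>
  <product><name>Widget</name></product>
</order>
XML

my $doc  = XML::DOM::Parser->new()->parse($xml);
my $root = $doc->getDocumentElement();

# Searching from the root mixes customer, company, and product names.
print "from root: ", join( ', ', names_under($root) ), "\n";

# Searching from the customer subtree narrows the result.
my ($customer) = $root->getElementsByTagName('customer');
print "from customer: ", join( ', ', names_under($customer) ), "\n";
```

Narrowing the search helps only as long as the subtree itself never contains a nested name element that belongs to something else.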
Effective use of the DOM
Given the limitations imposed by DOM's design constraints, how can you use the specification effectively and efficiently? We present a few basic principles and guidelines for DOM usage, and create a library of functions to make life easier.
Basic principles
Your experience using DOM will be significantly improved if you follow a
few basic principles:
- Do not use DOM to traverse the document
- Whenever possible, use XPath to find nodes or traverse the document
- Use a library of higher-level functions to make DOM use easier
These principles are derived directly from examination of common problems.
DOM traversal, as discussed above, is a leading cause of errors. It is
also, however, one of the most commonly needed functionalities. How do we traverse the
document without using the DOM?
XPath
XPath is a language for addressing, searching, and matching pieces of
the document. It is a W3C Recommendation, which makes it an accepted
standard, and it is implemented in most languages and XML packages. Chances
are your DOM package supports XPath either directly or via an add-on.
XPath provides an excellent means by which to traverse and search the document.
It uses a path notation, similar to that used in file systems and URLs,
to specify and match pieces of the document. For example, the XPath:
/x/y/z searches the document for a root node
of x, under which resides the node y, under which resides the node z. This statement returns all
nodes that match the specified path structure.
More complex matchings are possible both in terms of the structure of the document,
and the values of the nodes and their attributes. The
statement /x/y/* returns all nodes under
any node y with the parent x. /x/y[@name='a']
matches all y nodes that have the parent x and an attribute called
name with the value a.
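The three expressions above can be tried against a small document. This sketch assumes the XML::DOM::XPath add-on is installed, which, as far as I know, adds a findnodes method to XML::DOM nodes; the listings in this article use a different query method, xql. The sample document and the helper name count_matches are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;
use XML::DOM::XPath;    # assumption: add-on that provides findnodes()

# Hypothetical helper: number of nodes matching $xpath, starting at $node.
sub count_matches {
    my ($node, $xpath) = @_;
    my @hits = $node->findnodes($xpath);
    return scalar @hits;
}

my $xml = <<'XML';
<x>
  <y name="a"><z>first</z></y>
  <y name="b"><z>second</z><w>third</w></y>
</x>
XML

my $doc = XML::DOM::Parser->new()->parse($xml);

# Report how many nodes each expression from the text selects.
for my $xpath ( '/x/y/z', '/x/y/*', q{/x/y[@name='a']} ) {
    printf "%-18s matches %d node(s)\n", $xpath, count_matches( $doc, $xpath );
}
```

Whitespace between tags does not affect these results, because the path steps select elements by name and ignore Text nodes.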
A full examination of XPath and its usage is beyond the scope of this article. See Resources for links to some excellent tutorials. Take a little time to learn XPath, and you will be rewarded with much easier handling of XML documents.
Library of functions
One of the surprising aspects of our examination of the DOM projects was the
amount of copy-and-paste code that was present. Pieces of code from one file would
be copied and pasted into many others to implement similar pieces of functionality. Why would
experienced developers who otherwise employ good programming practices engage in copy-and-paste
methods instead of creating helper libraries? We believe this is because most programmers are
not DOM experts, and they will happily grab the first piece of code that does what they need. They
do not feel confident enough in their DOM skills to produce the canonical functions that make up the
helper library.
It is quite easy to create and use helper libraries to implement common functionalities; it only requires
a small amount of discipline. Below are some basic helper functions that will get you started.
getValue
The most commonly performed action when working with XML documents is
looking up the value of a given node. As discussed above, this can
present difficulties both in traversing the document to find the desired node and in retrieving the
value of the node. The traversal can be simplified using XPath, and the retrieval of the value can be
coded once and then reused. We have implemented the
getValue function with the help of two
lower-level functions: findNode, which
finds and returns the first node that matches the given XPath
expression, and getTextContents, which
non-recursively returns the concatenated values of the text nodes
under the passed-in node, as shown in Listing 2.
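use XML::DOM;        # exports TEXT_NODE and CDATA_SECTION_NODE
use XML::XQL;
use XML::XQL::DOM;   # adds the xql() query method used by findNode

# Concatenate the text directly under $node (non-recursive).
# If $strip is true, leading and trailing whitespace is removed.
sub getTextContents {
    my ($node, $strip) = @_;
    my $contents = '';
    if (! $node )
    {
        return undef;
    }
    for my $child ($node->getChildNodes()) {
        # Collect only text; skip elements, comments, and the like
        my $type = $child->getNodeType();
        if ( $type == TEXT_NODE || $type == CDATA_SECTION_NODE ) {
            $contents .= $child->getData();
        }
    }
    if ($strip) {
        $contents =~ s/^\s+//;
        $contents =~ s/\s+$//;
    }
    return $contents;
}

# Return the first node matching $xpath under $node, or undef.
sub findNode {
    my ($node, $xpath) = @_;
    if (! defined($node) || ! defined($xpath) )
    {
        return undef;
    }
    my $match = ($node->xql($xpath))[0];
    if (! $match )
    {
        return undef;
    }
    return $match;
}

# Return the text value of the first node matching $xpath, or undef.
sub getValue {
    my ($node, $xpath) = @_;
    my $match = findNode( $node, $xpath );
    if (! defined($match) )
    {
        return undef;
    }
    return getTextContents( $match );
}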
getValue is called by passing in both a node from which to start the search, and an XPath statement that specifies the node we're searching for. The function finds the first node to match the given XPath and extracts its text value.
setValue
Another common action is to set the value of a node to a desired value, as shown in Listing 3.
# Replace the contents of the first node matching $xpath with $value.
sub setValue {
    my ($node, $xpath, $value) = @_;
    my $match = findNode( $node, $xpath );
    if (! defined($match) )
    {
        return undef;
    }
    # Remove existing text and child elements
    foreach my $child ( $match->getChildNodes() )
    {
        $match->removeChild ($child);
    }
    $match->addText($value);
    return $match;
}
This function takes a starting node and an XPath statement, just like getValue, plus the string
to assign to the matching node. It finds the desired node using findNode, removes all of
its children (thereby removing any text and other elements contained within it), and sets its text contents to the passed-in string.
appendNode
While some programs look up and modify the values contained in XML documents, others modify the structure of the document itself by adding and removing nodes. This helper function simplifies the addition of a node to the document, as shown in Listing 4.
# Append a new element named $nodename under the node matching $xpath.
sub appendNode {
    my ($doc, $nodename, $xpath, $value) = @_;
    if (! defined($nodename) || ($nodename eq "") ) {
        return undef;
    }
    my $match = findNode( $doc, $xpath );
    if (! defined($match) )
    {
        return undef;
    }
    my $newnode;
    # createElement may die, for example on an invalid name
    eval {
        $newnode = $doc->createElement( $nodename );
    };
    if ($@ || (! defined($newnode) )) {
        return undef;
    }
    $match->appendChild( $newnode );
    if ( defined($value) ) {
        $newnode->addText($value);
    }
    return $newnode;
}
The parameters to this function are the DOM document, the name of the
node to add, the XPath statement specifying the node to add it under
(that is, what the parent node of the new node is), and, optionally, the
text value of the node. The new node is appended to the specified parent
node, and its value is set to the passed-in string.
copySubTree
Copying a section of a document into another location or document, while
not a very common operation, was the cause of much confusion and gave
rise to various inventive copy procedures. As Listing 5 illustrates, it is, in fact, fairly simple
to implement.
# Deep-copy $sourcenode and append the copy under $destnode.
sub copySubTree
{
    my ($sourcenode, $destnode) = @_;
    my $copy_node = $sourcenode->cloneNode(1);    # 1 = deep copy
    # Re-home the copy if the destination is in another document
    if ( $sourcenode->getOwnerDocument() ne $destnode->getOwnerDocument() )
    {
        $copy_node->setOwnerDocument( $destnode->getOwnerDocument() );
    }
    $destnode->appendChild($copy_node);
    return $copy_node;
}
This function takes the source node and copies it over as a child
under the destination node. The destination node may be in another
document, in which case the subtree is copied between documents.
Conclusion
The DOM has been maligned as a difficult and nonintuitive way of manipulating XML documents. In fact, it forms a very effective base on which easy-to-use systems can be built by following a few simple principles. DOM has already been implemented and optimized on most platforms, and is a very good choice for applications that need to search and manipulate XML
documents in complex processes.
Resources
About the author
Parand Tony Darugar is the head of architecture for Yahoo! Search Marketing Services (formerly Overture). His interests include Web services and Service Oriented Architectures (SOA), XML, high-performance business systems, distributed architectures, and artificial intelligence. You can reach him at tdarugar@yahoo.com.