Level: Introductory
Brett McLaughlin (brett@oreilly.com), Author, O'Reilly and Associates
26 Mar 2003
This tip details the problems associated with outputting large XML documents, starting with an examination of the options for XML output. It then looks at DOM and XML output, along with possible solutions to the memory consumption associated with extended DOM usage. You'll get an understanding of why outputting XML is so tricky, and a solid grasp of the output alternatives that are available.
One of the most common problems in the XML domain is outputting large documents. While the process of reading in XML is
fairly well understood, there is little in the way of best practices for the output of XML. In cases where the output
is fairly small, say less than 1,000 records, this is not a significant problem; developers use APIs like DOM and JDOM, or
output the XML in raw character-based content using I/O streams. However, as datasets being output grow to hold
thousands, or even tens of thousands, of members, these solutions begin to break down. This tip examines these problems, explores the available alternatives, and lays out a plan for exhaustively covering XML output.
Output alternatives
You have several alternatives for outputting XML. Before looking into a solution for output, it's worth detailing
some of the solutions that you shouldn't use. Here are the most common ways to output XML:
- SAX
- DOM
- JAXP
- Another in-memory API, like JDOM or dom4j
- Raw I/O streams
I'll look at each in turn before laying out a solution that I'll examine through the next several tips in this series.
SAX
The first option, SAX, is really a non-option. I've included it in the list because most developers
getting started with XML hear about SAX and how quick it is for XML processing. While SAX is traditionally
considered the fastest and slimmest API for XML, it does not have the ability to output XML (or anything else,
for that matter). In fact, if you examine the SAX package (org.xml.sax), you won't find
a single output method. It is designed from the ground up to read XML, rather than write it.
Note: It is possible to modify incoming XML by using an XMLFilter. (I'll talk
a lot more about filters later in this tip and in future tips.) However, this is still not outputting XML. It's also
possible to use raw I/O streams within SAX callbacks to output XML -- but that's really just a variant of the last option in the
list above, so I'll deal with it in Raw I/O streams.
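To make the note above concrete, here is a minimal sketch of an XMLFilter that modifies incoming XML. The class names SaxFilterSketch and UpperCaseFilter and the method elementNames are hypothetical, chosen only for illustration; XMLFilterImpl, DefaultHandler, and SAXParserFactory are the standard SAX and JAXP classes. Notice that the filter only changes the events passed to the handler; nothing here writes XML out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFilterSketch {

    /** Hypothetical filter: upper-cases element names as the events pass through. */
    static class UpperCaseFilter extends XMLFilterImpl {
        UpperCaseFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, localName.toUpperCase(Locale.ROOT),
                    qName.toUpperCase(Locale.ROOT), atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement(uri, localName.toUpperCase(Locale.ROOT),
                    qName.toUpperCase(Locale.ROOT));
        }
    }

    /** Parses the XML through the filter and returns the element names the handler saw. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            UpperCaseFilter filter = new UpperCaseFilter(reader);
            // The handler sits behind the filter, so it only sees the modified events.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) {
                    names.add(qName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }
}
```

Calling SaxFilterSketch.elementNames with a small document returns the upper-cased names in document order. The handler receives callbacks only; there is still no SAX method you can call to write the result anywhere.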
DOM
The Document Object Model, DOM, is by far the most commonly used API for XML output. DOM is an in-memory model of
XML, meaning that it stores each element, attribute, character fragment, and XML construct in memory. You can read an XML
document or stream into a DOM tree, or build a tree from scratch.
It's equally easy to write out a DOM tree, and most
parser software packages offer utility classes to do just this. For example, Apache Xerces comes with several samples,
including dom.Writer which takes in a DOM Node and prints out the
XML representation of that Node. Listing 1 shows a portion of that code, which handles
the bulk of the printing logic.
Listing 1. Printing a DOM tree
public void write(Node node) {
    // is there anything to do?
    if (node == null) {
        return;
    }
    short type = node.getNodeType();
    switch (type) {
        case Node.DOCUMENT_NODE: {
            Document document = (Document)node;
            if (!fCanonical) {
                fOut.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                fOut.flush();
                write(document.getDoctype());
            }
            write(document.getDocumentElement());
            break;
        }
        case Node.DOCUMENT_TYPE_NODE: {
            DocumentType doctype = (DocumentType)node;
            fOut.print("<!DOCTYPE ");
            fOut.print(doctype.getName());
            String publicId = doctype.getPublicId();
            String systemId = doctype.getSystemId();
            if (publicId != null) {
                fOut.print(" PUBLIC '");
                fOut.print(publicId);
                fOut.print("' '");
                fOut.print(systemId);
                fOut.print('\'');
            }
            else {
                fOut.print(" SYSTEM '");
                fOut.print(systemId);
                fOut.print('\'');
            }
            String internalSubset = doctype.getInternalSubset();
            if (internalSubset != null) {
                fOut.println(" [");
                fOut.print(internalSubset);
                fOut.print(']');
            }
            fOut.println('>');
            break;
        }
        case Node.ELEMENT_NODE: {
            fOut.print('<');
            fOut.print(node.getNodeName());
            Attr attrs[] = sortAttributes(node.getAttributes());
            for (int i = 0; i < attrs.length; i++) {
                Attr attr = attrs[i];
                fOut.print(' ');
                fOut.print(attr.getNodeName());
                fOut.print("=\"");
                normalizeAndPrint(attr.getNodeValue());
                fOut.print('"');
            }
            fOut.print('>');
            fOut.flush();
            Node child = node.getFirstChild();
            while (child != null) {
                write(child);
                child = child.getNextSibling();
            }
            break;
        }
        case Node.ENTITY_REFERENCE_NODE: {
            if (fCanonical) {
                Node child = node.getFirstChild();
                while (child != null) {
                    write(child);
                    child = child.getNextSibling();
                }
            }
            else {
                fOut.print('&');
                fOut.print(node.getNodeName());
                fOut.print(';');
                fOut.flush();
            }
            break;
        }
        case Node.TEXT_NODE: {
            normalizeAndPrint(node.getNodeValue());
            fOut.flush();
            break;
        }
        case Node.PROCESSING_INSTRUCTION_NODE: {
            fOut.print("<?");
            fOut.print(node.getNodeName());
            String data = node.getNodeValue();
            if (data != null && data.length() > 0) {
                fOut.print(' ');
                fOut.print(data);
            }
            fOut.println("?>");
            fOut.flush();
            break;
        }
    }
    if (type == Node.ELEMENT_NODE) {
        fOut.print("</");
        fOut.print(node.getNodeName());
        fOut.print('>');
        fOut.flush();
    }
}
I won't go into any real detail about this code, but notice that each and every node in the document is iterated
over, and each exists in memory. So in a DOM tree that holds 1,000, 2,000, or even 10,000 nodes, you may never even
get to this printing code; if the heap is too small for the tree, you'll get out-of-memory errors well before your
DOM tree is built. Storing every node is a memory-consumptive process, and a machine with limited memory will choke. Also consider that most data
is going to require two, three, or even more nodes; each element is one node, the data within that element is another
node, and there will be one additional node per attribute. So a document with 10,000 individual pieces of data could
actually have to store 20,000, 30,000, or even upwards of 50,000 individual nodes to represent that data. Needless
to say, DOM simply cannot handle this amount of data, let alone output the data to a file.
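The node arithmetic above can be checked with a short sketch. It builds a DOM tree in which every piece of data is an element with one attribute and one text child, then counts the nodes. The class name DomNodeCount, the methods buildRecords and countNodes, and the records/record/id names are all hypothetical, chosen only for illustration; the DOM and JAXP calls are the standard ones.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class DomNodeCount {

    /** Builds <records> holding n <record id="i">value i</record> children, all in memory. */
    static Document buildRecords(int n) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Element root = doc.createElement("records");
        doc.appendChild(root);
        for (int i = 0; i < n; i++) {
            Element record = doc.createElement("record");
            record.setAttribute("id", Integer.toString(i));        // one attribute node
            record.appendChild(doc.createTextNode("value " + i));  // one text node
            root.appendChild(record);                              // one element node
        }
        return doc;
    }

    /** Counts the node itself, its attributes, and all of its descendants. */
    static int countNodes(Node node) {
        int count = 1;
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes != null) {
            count += attributes.getLength();
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = buildRecords(10_000);
        System.out.println(countNodes(doc.getDocumentElement()) + " nodes for 10,000 records");
    }
}
```

With this layout, n records cost 3n + 1 nodes: an element, an attribute, and a text node per record, plus the root element. Whether a tree of that size fits in memory depends on the heap available to your JVM and on the DOM implementation in use.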
JAXP
The Java API for XML Processing (JAXP) is another red herring, so to speak. JAXP is not itself
an API for parsing; it is merely a wrapper API for adding a convenience layer on top of SAX and DOM. Therefore, it is these
underlying APIs that control the behavior of JAXP. In other words, using SAX through JAXP has the same non-write problems
as SAX alone does, and using DOM through JAXP has the same memory-consumption problems that DOM alone does. JAXP doesn't provide you any real option that I haven't already discussed.
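As a rough illustration of the wrapper relationship, the sketch below uses JAXP factories only to obtain parsers; what comes back is an ordinary DOM Document or SAX XMLReader. The class name JaxpWrapperSketch and its two methods are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.XMLReader;

class JaxpWrapperSketch {

    /** JAXP locates a parser; the result is a plain DOM Document held fully in memory. */
    static Document parseToDom(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** JAXP locates a parser; the result is a plain SAX XMLReader, which only reads. */
    static XMLReader newSaxReader() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            return parser.getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }
}
```

Once you hold the Document or the XMLReader, you are working with DOM or SAX directly, with the same trade-offs described in the sections above.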
In-memory models
Several APIs in addition to DOM and SAX have become popular in recent years. Two notable examples of these
are JDOM and dom4j. There are others, but all are
generally similar in terms of how they operate: They use various types of in-memory models. While many are not
as memory-heavy as most DOM implementations, they still hold at least some data in memory for every piece of data in the
XML tree. This means that you eventually run into the same problems that you do with DOM: memory overflow. You may
get more mileage out of these APIs, but ultimately your hardware is going to limit your ability to load and write data.
Raw I/O streams
The final option is to use raw I/O streams. In Java code, for example, you could use
java.io.OutputStream or java.io.Writer to spit out characters
that happen to be XML-conformant. For example, Listing 1 has several statements in which XML characters are written
directly to an output stream. While this is a viable option in that it doesn't require all your XML to be represented
in memory, it does have a lot of problems of its own. First, using raw streams means that you have to be
very fastidious about escaping characters like ampersands, angle brackets, apostrophes, and quotation marks. This often makes for some very ugly
output, and creates a lot of room for error. Additionally, you have to keep up with the structure of the tree yourself -- you can't work in terms of elements, sub-elements, and attributes, so matching every start tag with its end tag is up to you.
This creates even more room for error. In other words, I/O streams are an option, but not a very good one.
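Here is a minimal sketch of the raw-stream approach and the bookkeeping it pushes onto you. The class name RawXmlWriterSketch, its methods, and the records/record/id names are hypothetical, chosen only for illustration. The escape method covers the five characters that have predefined XML entities; it does not reject characters that are illegal in XML, which is one more thing you would have to handle yourself.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class RawXmlWriterSketch {

    /** Escapes the five characters that have predefined XML entities. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Streams records one at a time; only the current value is held in memory. */
    static void writeRecords(Writer out, Iterable<String> values) throws IOException {
        out.write("<records>");
        int id = 0;
        for (String value : values) {
            // Every start tag written here must be closed by hand below.
            out.write("<record id=\"" + id++ + "\">");
            out.write(escape(value));
            out.write("</record>");
        }
        out.write("</records>");
        out.flush();
    }

    /** Convenience wrapper for small inputs: collects the output in a String. */
    static String toXml(Iterable<String> values) {
        StringWriter out = new StringWriter();
        try {
            writeRecords(out, values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In real use you would pass writeRecords a buffered Writer over a file or socket rather than a StringWriter. Memory use stays flat, but forgetting one call to escape or one closing tag produces a document that is not well-formed, and nothing in the API warns you.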
What to do?
So, while we haven't looked at any solution code yet, you should be pretty clear on how poor the options are for handling large
XML datasets and outputting that data without incurring a tremendous amount of work and error-checking. In this series of tips, I'll present an entirely different option, one based on some SAX extensions that allow both filtering and
output. Once you make it through the next five tips, you'll be comfortably handling large XML documents, all without taxing your memory resources one bit.
While you're waiting on the next tip, you should find application code of your own that uses one of the options
detailed above. Think about profiling that code, and see how much of a memory hog it is. In other words, prepare
to compare your old code with some new techniques that I'll lay out in the next several weeks. And until the next tip,
I'll see you online!
About the author
Brett McLaughlin has been working in computers since the Logo days (Remember the little triangle?). He currently specializes in building application infrastructure using Java-related technologies. He has spent the last several years implementing these infrastructures at Nextel Communications and Allegiance Telecom, Inc. Brett is one of the co-founders of the Java Apache project Turbine, which builds a reusable component architecture for Web application development using Java servlets. He is also a contributor of the EJBoss project, an open source EJB application server, and Cocoon, an open source XML Web-publishing engine.