 | Level: Introductory Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
01 Oct 2002 Generators are a very powerful new language feature of Python 2.2. In this tip, Uche Ogbuji presents a set of techniques for using generators for fast and lucid XML processing patterns in Python.
Of the numerous features recently added to Python, generators truly stand out. Fredrik Lundh, a Python insider well known for wry commentary, put it best: Comparing generators to other recent changes in Python, he said, "Generators [open] a new universe; the rest is more like moving the furniture around." In this tip, I show you how generators can be used to make XML DOM processing with Python even smoother.
Note: In his excellent series on generators in the IBM developerWorks Linux zone (see Resources), David Mertz discusses the basics of generators and some neat applications. If you are not familiar with generators, I recommend that you review at least his first article on the topic. You will need Python 2.2 to use generators.
Smart and fast iterators
Node iterators are a very common design pattern in XML, and are a fixture of DOM. A node iterator visits every node in an XML document in turn, following the standard document order. Node iterators are notorious for being slow and fiddly, but generators in Python offer a wonderfully concise and efficient implementation of node iterators. See Listing 1.
Listing 1. A simple node iterator using generators (nodeiter1.py)
#Required in Python 2.2, and must be the first import
from __future__ import generators
from xml.dom import Node
def doc_order_iter(node):
"""
Iterates over each node in document order,
returning each in turn
"""
#Document order returns the current node,
#then each of its children in turn
yield node
for child in node.childNodes:
#Create a generator for each child,
#Over which to iterate
for cn in doc_order_iter(child):
yield cn
return
#Parse a document then print the Python representation
#of each node object in document order
DOC = """\
<memo xmlns='http://spam.com'>
<date form="iso-8601">2002-08-14</date>
This is just to <strong>say</strong>:
I ate the eggs you left in the fridge
And were probably saving for breakfast.
Do you know? They were <emph>quite</emph> rotten.
</memo>
"""
from xml.dom import minidom
doc = minidom.parseString(DOC)
for node in doc_order_iter(doc):
print node
|
Traversing trees is one of the classic use-cases for generators (such a traversal is the main example in the official specification for Python generators). This means that XML's natural tree structure is very well suited to generator idioms. You can copy the doc_order_iter function to your code and use it any time you need to iterate over a document's nodes (a fairly common requirement in XML processing). The function starts by yielding the given node right away. Remember that this suspends its operation and returns control to the caller. The next time it is called, it initiates a loop over its children. The function recursively sets up a generator for each of its child nodes. This has the effect of traversing down the tree such that each node is visited first, and then each of its children, from left to right. The return at the end signals the complete termination of the generator's operation. The output is shown in Listing 2:
Listing 2. The output
$ python nodeiter1.py
<xml.dom.minidom.Document instance at 0x8267a1c>
<DOM Element: memo at 136739676>
<DOM Text node "
">
<DOM Element: date at 136742132>
<DOM Text node "2002-08-14">
<DOM Text node "
">
<DOM Text node "This is ju...">
<DOM Element: strong at 136744780>
<DOM Text node "say">
<DOM Text node ":">
<DOM Text node "
">
<DOM Text node "I ate the ...">
<DOM Text node "
">
<DOM Text node "And were p...">
<DOM Text node "
">
<DOM Text node "Do you kno...">
<DOM Element: emph at 136748228>
<DOM Text node "quite">
<DOM Text node " rotten.">
<DOM Text node "
">
|
How about attributes?
doc_order_iter does not iterate over attribute nodes because they are not considered children of their parent nodes in the DOM. It's not likely that you'll need to do a full iteration that includes attributes, but as an illustration, Listing 3 is a function that does so.
Listing 3. full_doc_order_iter also iterates over attributes
from xml.dom import Node
def full_doc_order_iter(node):
"""
Returns each node in document order,
including attributes, which are visited
immediately after their element
"""
yield node
if node.nodeType == Node.ELEMENT_NODE:
#Attributes are stored in a dictionary and
#have no set order. The values() call
#gets a list of actual attribute node objects
#from the dictionary
for attr in node.attributes.values():
yield attr
for child in node.childNodes:
for cn in full_doc_order_iter(child):
yield cn
return
|
After the given node is yielded, if it is an element node, its attributes are yielded as well (in arbitrary order).
Warning: This function is written as if attribute nodes have no children (and therefore there is no need to spawn a new generator for each attribute node). In the DOM proper, attributes can have one or more children, which must be text nodes. This is something of an oddity and annoyance in the DOM, and in practice, most people treat attributes atomically, even in DOM processing.
Being selective
The iterators presented so far return all the nodes in the document, without discrimination. But sometimes you want to be more selective. You can easily modify these generators to apply a criteria against each node in order to decide whether to return it. Listing 4 shows a function, doc_order_iter_filter, which is a general framework for iterating over all nodes in a document (not including attribute nodes) and returning only those that pass a filter function given to the generator.
Listing 4: doc_order_iter_filter iterates over non-attribute nodes, returning those that pass a filter
def doc_order_iter_filter(node, filter_func):
if filter_func(node):
yield node
for child in node.childNodes:
for cn in doc_order_iter_filter(child, filter_func):
yield cn
return
|
This time a filter function is passed in. This function takes a node object and returns 0 or 1. You can make this function as simple or complex as you please in its selection of nodes. Before yielding the given node, it is checked against the filter function. This check is applied again for each of the values returned from the recursive generators that are spawned.
As an example, Listing 5 uses this framework to generate only the elements in the document.
Listing 5. Using the framework to generate only the elements in a document
def get_elements(node):
elem_filter = lambda n: n.nodeType == Node.ELEMENT_NODE
return doc_order_iter_filter(node, elem_filter)
|
The filter setup does a simple check of the node type. Then a generator is created and returned, prepped to iterate over all nodes with the given filter applied. It is important to remember that the last line is not an ordinary function call. It creates a generator -- a function in suspended animation, in a way, and the calling code can then treat this return result as an iterator. This iterator, in turn, returns a new, yielded value each time it is called until the generator hits the return statement. This means that the order of control is not typical of procedural languages: In effect, the doc_order_iter_filter generator outlives the function (get_elements) from which it is invoked.
Summary
Generators are a true boon to XML processing in Python. I have found myself using them so frequently in XML processing that it's hard for me to imagine how I did without them before. Use generators when you have to perform some operation repeatedly over all or selected nodes in the document. It does not have to be in document order. You can arrange the flow of control however you need to yield the results you require.
Resources
About the author  | 
|  | Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact him at
uche@ogbuji.net. |
Rate this page
|  |