Level: Intermediate
David Mertz (mertz@gnosis.cx), Facilitator, Gnosis Software, Inc.
26 Mar 2003
RELAX NG schemas provide a more powerful, concise, and semantically straightforward means of describing classes of valid XML instances than do W3C XML Schemas. In this installment, David continues the discussion of RELAX NG begun in part 1 of this series by addressing a few additional semantic issues and looking at tools for working with RELAX NG.
In the last installment
I gave you a fairly complete overview of
both the syntax and semantics of RELAX NG schemas. However, a
few issues were glossed over, and are worth looking at more closely.
Both DTDs and W3C XML Schemas allow for infoset augmentation,
while RELAX NG does not. James Clark, one of the creators of
RELAX NG (and many other widely used XML tools), argues vehemently
that infoset augmentation violates modularity in the roles of
XML instance documents and schemata. In other words, for
Clark, RELAX NG has a feature where DTDs and W3C Schemas have a
bug. My own feelings on the matter are mixed, but I can understand
his intuition.
Let's backtrack a little and look at what this infoset stuff
is about. Basically, you can ask an XML instance what data
it contains. If you parse the instance without validation, the
answer depends solely on what values occur in its
attributes and element bodies. If you value modularity, a
schema should only tell you whether an instance is valid or not;
it should not change the actual information in a document.
However, such modularity is violated if you use a DTD or W3C XML
Schema for validation. For example, consider the following
DTD:
Listing 1. curious.dtd
<!ELEMENT foo EMPTY>
<!ATTLIST foo bar CDATA "curious"
baz CDATA #FIXED "curiouser">
And this XML instance:
Listing 2. curious.xml
<?xml version="1.0"?>
<!DOCTYPE foo SYSTEM "curious.dtd">
<foo/>
A non-validating parser will find a different set of information in
this document than a validating parser would. Contrast the
non-validating utility xmlcat with the validating 4xml
(both echo whatever they encounter back to the console):
Listing 3. Infosets with validating and non-validating parsers
% ./xmlcat curious.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<foo></foo>
% 4xml -p curious.xml
<?xml version="1.0" encoding="utf-8"?>
<foo bar="curious" baz="curiouser"/>
In a W3C XML Schema, the default and fixed attributes of the
<xsd:attribute> and <xsd:element> tags have similar effects.
The argument in favor of defaulting is that it allows XML
instance minimization. I have used defaults (or, more likely,
#FIXED attributes) for this very purpose. But I can also see
dangers, both of malice and of debugging nightmares, if the very
content of a local XML document depends upon a remote
(and perhaps spoofable) URI, and even upon an absence of network
interruptions during parsing.
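The augmentation effect can be reproduced with nothing but the Python standard library. The sketch below is only an illustration: it copies the declarations from curious.dtd into an internal DTD subset (my assumption, so that no external file needs to be fetched) and compares what the same parser reports for the bare instance and for the instance that carries the declarations.

```python
import xml.etree.ElementTree as ET

# The same <foo/> instance twice: once bare, once with the declarations
# of curious.dtd placed in an internal DTD subset (an assumption made
# for this sketch; the article's DTD lives in an external file).
BARE = '<?xml version="1.0"?>\n<foo/>'
WITH_DTD = '''<?xml version="1.0"?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo bar CDATA "curious"
              baz CDATA #FIXED "curiouser">
]>
<foo/>'''

def attributes(xml_text):
    """Return the attributes the parser reports on the root element."""
    return dict(ET.fromstring(xml_text).attrib)

print(attributes(BARE))      # nothing was written on <foo/>
print(attributes(WITH_DTD))  # defaults supplied by the declarations
```

The element text is identical in both cases; only the presence of the attribute declarations changes the information the application receives.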
RELAX NG does not perform any infoset augmentation.
Well, almost -- I think Clark overstates this point. If you impose
a data type on an element or attribute, you still change the
content of the value in an important way. The value of the string
"1.0" is different from the value of the float "1.0", even though
the two are represented in exactly the same way in an XML
instance.
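A tiny Python sketch of the same point, using Python's float as a stand-in for a schema float type and two lexical forms chosen purely for illustration:

```python
# Two lexical forms as they might appear in an attribute or element body.
a, b = "1.0", "1.00"

same_as_text = (a == b)                 # compared as untyped strings
same_as_float = (float(a) == float(b))  # compared as typed floats

print(same_as_text, same_as_float)
```

Whether the two values count as equal depends on the type imposed on them, not on the characters in the instance.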
Stating cardinality
W3C XML Schemas have better means of requiring
occurrence cardinalities than do DTDs or RELAX NG schemas. If you
want a <foo> element to occur between 5 and 30 times within the
<bar> element, you can declare this in W3C Schemas with a
straightforward rule:
Listing 4. W3C XML Schema cardinality rule
<xsd:element name="bar">
<xsd:element name="foo" minOccurs="5" maxOccurs="30"/>
</xsd:element>
The same cardinality rule can be stated in a DTD, but very clumsily:
Listing 5. DTD cardinality rule
<!ELEMENT bar
(foo, foo, foo, foo, foo, foo?,foo?,foo?,foo?,foo?,
foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?,
foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?,foo?) >
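Nobody wants to type such a content model by hand, so it is natural to generate it. The following Python sketch uses a hypothetical helper, dtd_cardinality(), that is not part of any tool mentioned here. As a design choice of mine, it nests the optional occurrences, as in (foo, (foo)?)?, rather than emitting a flat run of foo? particles; the two forms accept the same instances, but some validating parsers may flag the flat form as a non-deterministic content model.

```python
def dtd_cardinality(parent, child, minimum, maximum):
    """Build a DTD element declaration that requires `child` to occur
    between `minimum` and `maximum` times inside `parent`.
    Hypothetical helper; assumes 1 <= minimum <= maximum."""
    required = [child] * minimum
    optional = ""
    for _ in range(maximum - minimum):
        # Wrap from the inside out: (child, <previous optional part>)?
        inner = ", " + optional if optional else ""
        optional = "(%s%s)?" % (child, inner)
    parts = required + ([optional] if optional else [])
    return "<!ELEMENT %s (%s)>" % (parent, ", ".join(parts))

print(dtd_cardinality("bar", "foo", 1, 3))
decl = dtd_cardinality("bar", "foo", 5, 30)
print(decl)
```

The first call shows the shape on a small case; the second produces a declaration with the same 5-to-30 rule as the listing above.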
What I would like for RELAX NG would be an explicit
<cardinality> tag, so that you could (hypothetically) write
something like:
Listing 6. Hypothetical RELAX NG 2.0 cardinality rule
<element name="bar" xmlns="http://relaxng.org/ns/structure/1.0">
<cardinality min="5" max="30">
<element name="foo"/>
</cardinality>
</element>
Unfortunately, in the current version of RELAX NG, the only
cardinalities you get are <zeroOrMore>, <oneOrMore>, and
<optional>. However, named patterns can at least be used to
make spelling out cardinalities slightly less painful. In
compact syntax, for example:
Listing 7. Actual RELAX NG compact syntax cardinality rule
start = element bar { fivefoo, upto25foo }
fivefoo = element foo { empty }, element foo { empty },
element foo { empty }, element foo { empty },
element foo { empty }
maybefoo = element foo { empty }?
upto25foo =
fivefoo?, fivefoo?, fivefoo?, fivefoo?,
maybefoo, maybefoo, maybefoo, maybefoo, maybefoo
I confess that this sort of naming is not perfect, but at least
it is possible to name large numbers by effectively raising to
powers through repetition of named patterns.
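To check that the named patterns really add up to "between 5 and 30", you can enumerate the totals they allow. This Python sketch models each pattern only by the set of occurrence counts it permits; the counts() helper is my own illustration and not part of any RELAX NG tool.

```python
from itertools import product

def counts(*choices):
    """Each argument is the set of occurrence counts one pattern allows;
    return every total reachable by using the patterns in sequence."""
    return {sum(combo) for combo in product(*choices)}

fivefoo = {5}          # exactly five <foo> elements
opt_fivefoo = {0, 5}   # fivefoo?
maybefoo = {0, 1}      # element foo { empty }?

# upto25foo = fivefoo? (four times), maybefoo (five times)
upto25foo = counts(*([opt_fivefoo] * 4 + [maybefoo] * 5))
# start = element bar { fivefoo, upto25foo }
bar = counts(fivefoo, upto25foo)

print(sorted(bar))
```

Because every child is the same foo element, only the totals matter, so a set of integers is enough to represent each pattern.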
Transformations and validations
A variety of tools are available for working with RELAX NG
schemas. These tools are predominantly implemented in the Java language,
but some tools and libraries have been
written in Python, C#, and Visual Basic. Surprisingly, I have not
found any libraries written in other languages that seem to be
good fits, such as Perl, Ruby, or C/C++.
One obvious class of RELAX NG application is validators. Just
as with validating parsers that work with DTDs or W3C XML
Schemas, a number of command line, online, and library parsers
are available for RELAX NG. A slightly less obvious class of
application is tools to transform schemas into each other.
Sun's RELAX NG Converter and James Clark's trang and
DTDinst let you convert among RELAX NG (XML and compact
syntax), DTDs, and W3C XML Schemas. I plan to write a less
ambitious Python tool (compact2xml.py) in time for the next
installment of this column, which will allow 4Suite and xvif to utilize the
RELAX NG compact syntax (the authors of each have expressed an
interest in including such a tool).
Transformations are worth looking at in a bit more detail.
Part 1 looked at ways in which
RELAX NG is strictly more powerful than W3C XML Schemas, and
looking at some best-effort transformations helps illustrate
this point. For example, the previous installment presented a
schema for a library patron, which is expressed in compact
syntax as:
Listing 8. Library patron compact syntax
element patron {
element name { text } &
element id-num { text } &
element book {
attribute isbn { text } |
attribute title { text }
}*
}
See Part 1 for the XML syntax version, which is
semantically identical but more verbose. trang makes a
good attempt at turning this into a W3C XML Schema. The file
extensions of the input and output file are used to guess types
(or may be overridden with switches):
Listing 9. Transforming RELAX NG to W3C XML Schema
% java -jar trang.jar patron.rnc patron.xsd
% cat patron.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" version="1.0">
<xsd:element name="patron">
<xsd:complexType>
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element ref="name"/>
<xsd:element ref="id-num"/>
<xsd:element ref="book"/>
</xsd:choice>
</xsd:complexType>
</xsd:element>
<xsd:element name="name">
<xsd:complexType mixed="true"/>
</xsd:element>
<xsd:element name="id-num">
<xsd:complexType mixed="true"/>
</xsd:element>
<xsd:element name="book">
<xsd:complexType>
<xsd:attribute name="isbn"/>
<xsd:attribute name="title"/>
</xsd:complexType>
</xsd:element>
</xsd:schema>
To the credit of trang, I think this W3C Schema is
the best that can be done for the situation. Every XML
instance that's accepted by the RELAX NG schema is also accepted by
the W3C XML Schema, and many invalid instances are rejected by both. The
problem is that there is a distinct class of XML instances
that are not really valid according to the desired rule, but
that pass validation with the W3C Schema. For example:
Listing 10. Limits of W3C XML Schema discernment
% cat patron-i1.xml
<?xml version="1.0" encoding="UTF-8"?>
<patron>
<book isbn="0-528-84460-X"/>
<name>John Doe</name> <!-- repeats name subelement -->
<name>Second Name</name>
<id-num>12345678</id-num>
<book title="Why RELAX is Clever"/>
</patron>
% cat patron-i2.xml
<?xml version="1.0" encoding="UTF-8"?>
<patron>
<name>John Doe</name>
<id-num>12345678</id-num>
<!-- Too many and too few attributes of book element -->
<book title="Why RELAX is Clever" isbn="0-528-84460-X"/>
<book/>
</patron>
% cat patron-i3.xml
<?xml version="1.0" encoding="UTF-8"?>
<patron/> <!-- No required subelements -->
Of course, even though the three examples above
wrongly pass validation, the W3C XML Schema still rejects XML instances with
entirely disallowed elements or attributes, or ones that nest
elements in improper ways
(for example, <book> inside <name>
rather than as a sibling).
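The gap between the two schemas can be made concrete with a hand-written check. The Python sketch below is only an approximation under my own assumptions: loose_ok() roughly mimics what the generated W3C XML Schema enforces on child elements, strict_ok() encodes the rule the RELAX NG schema states, and both function names are hypothetical. Neither function is a real schema validator.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def loose_ok(patron):
    """Roughly the generated W3C Schema: any number of the known
    children, in any order; book attributes are both optional."""
    return (patron.tag == "patron"
            and all(c.tag in ("name", "id-num", "book") for c in patron))

def strict_ok(patron):
    """The RELAX NG rule: exactly one name and one id-num, interleaved
    with any number of books, each with exactly one of isbn or title."""
    if not loose_ok(patron):
        return False
    tags = Counter(c.tag for c in patron)
    if tags["name"] != 1 or tags["id-num"] != 1:
        return False
    return all(set(b.attrib) in ({"isbn"}, {"title"})
               for b in patron.findall("book"))

# The three instances from the listing above, minus comments.
SAMPLES = {
    "patron-i1": '<patron><book isbn="0-528-84460-X"/><name>John Doe</name>'
                 '<name>Second Name</name><id-num>12345678</id-num>'
                 '<book title="Why RELAX is Clever"/></patron>',
    "patron-i2": '<patron><name>John Doe</name><id-num>12345678</id-num>'
                 '<book title="Why RELAX is Clever" isbn="0-528-84460-X"/>'
                 '<book/></patron>',
    "patron-i3": '<patron/>',
}
results = {name: (loose_ok(ET.fromstring(x)), strict_ok(ET.fromstring(x)))
           for name, x in SAMPLES.items()}
for name, (loose, strict) in results.items():
    print(name, "loose:", loose, "strict:", strict)
```

Each sample passes the loose check and fails the strict one, which is exactly the class of instances that slips through the converted schema.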
As far as validation tools go, I find that jing does a good
job of producing useful error messages when validation fails.
The Python XML library 4Suite incorporates a version of the
xvif library, and also performs validation (the latter is
also accessible online -- see Resources). But compare the errors:
Listing 11. Validation error messages with jing
% java -jar ../trang/jing.jar patron.rng patron-i3.xml
Error at URL "file:/.../patron-i3.xml",
line number 2: unfinished element
% java -jar ../trang/jing.jar patron.rng patron-i1.xml
Error at URL "file:/.../patron-i1.xml",
line number 5: element "name" not allowed in this context
Listing 12. Validation error messages with 4Suite
% 4xml --rng=patron.rng patron-i1.xml
Traceback (most recent call last):
...
File "/.../site-packages/Ft/Xml/_4xml.py", line 89, in Run
raise RngInvalid(result)
Ft.Xml.Xvif.RngInvalid: Qname {None}name not exected
% 4xml --rng=patron.rng patron-i3.xml
Traceback (most recent call last):
...
Ft.Xml.Xvif.RngInvalid
Of course, in an application context, the choice of the
programming language that will utilize the libraries outweighs
differences in the messages produced.
Compiled validators
One category of tool that I have not seen much of outside of RELAX
NG contexts is a single-schema validator. Take a look at the
RELAX NG home page for links to such tools, including Bali and
RelaxNGCC. These frameworks automatically emit code for
specialized validation of a particular RELAX NG schema.
Presumably, such a specialized validator runs faster than a
general purpose one. Such tools are possible -- or at
least much more straightforward than the same thing would be
relative to W3C XML Schemas -- because the design of RELAX NG
is extremely well grounded in algorithmic analysis.
RELAX NG-enhanced XML editors
Unfortunately, XML editors do not yet support RELAX NG as
widely as they do W3C XML Schemas. Of course, DTDs remain much
more widely supported than either of these schema styles.
This is a shame, since the simple conceptual framework of
RELAX NG validation would actually make it far easier to build
editor customizations around RELAX NG.
Ideally, a custom XML editor would utilize a RELAX NG schema to
direct and assist a user in the insertion of attributes and
elements in ways that maintain validity.
One compromise would be to use a tool like trang
to convert a RELAX NG schema into a W3C XML Schema or DTD that
approximates it, then use those within a GUI XML editor. But
doing so would help only to a limited extent.
One XML editor is built around RELAX NG -- the Java technology-based
XML Operator (see the RELAX NG home page in Resources).
I played with it a little and found that it could potentially be useful, but it
would fall on the low end of the XML editors I have previously
reviewed (see Resources). XML Operator implements just a few features here and there,
and provides neither the huge array of tools of XML Spy nor the
simple elegance of oXygen.
Until next time
In part 1 and here in part 2, I have looked at most of the elements
of RELAX NG, and included a summary of tools for working with it.
The third and final installment will touch briefly on how RELAX NG lets
you include external schemas in your schema, and selectively
merge the specifications of different schemas. But part 3 will primarily look at the RELAX
NG compact syntax in more detail, and explain the exact
correspondences between compact syntax and XML syntax.
Resources
- Participate in the discussion forum.
- Check out the home page for RELAX NG, which contains numerous useful links to articles, tools, and so on. Of particular note is the excellent tutorial written by two great luminaries of XML technologies, James Clark and Murata Makoto (OASIS, December 2001).
- James Clark wrote a
discussion of the algorithmic principles
behind RELAX NG validation. Interestingly, his sample code is
provided in Haskell, which has advantages that I've described
in my XML Matters installment on the HaXml library (developerWorks, October 2001).
- Use this online tool to validate an XML instance document against a RELAX NG schema.
A RELAX NG schema itself is validated during the process, as well. This tool is based on Eric
van der Vlist's xvif tool, which is written in Python.
This online validator lets you select from a set of test cases, as well as check your own examples. The test cases are also available (and briefly discussed).
- Download the xvif library.
Alternatively, 4Suite
is a somewhat more polished tool that incorporates xvif for RELAX NG validation.
The command-line tool 4xml validates against both RELAX NG and DTDs, with various options.
4Suite includes many other tools and libraries for working with many XML-related technologies.
- For background and comparison, use
this online tool
to validate an XML instance document against a W3C XML Schema, and try
this tool to check W3C XML
Schemas against the Approved Recommendation.
- trang and jing are complementary tools for transformation between
schemata and validation against RELAX NG schemas. The former depends on the
latter, but
both can be downloaded.
- You will need to obtain an implementation of the Java API for
XML Processing (JAXP) to use trang. If you run a Java 1.4
JVM, you are fine; otherwise, you can download crimson.
- DTDinst
is a Java technology-based tool for converting DTDs to XML instance
document format, including handling of parameter entities. The DTDinst XML format is of limited utility by itself, since
nothing else works with it. However, an XSLT style sheet
is available to transform this format into RELAX NG (with a few
caveats). You will need an XSLT tool to utilize it.
- Find a collection of schemata and test XML instance documents for the library patron example discussed in this article.
- Read David Mertz's roundup of XML editors: Part 1 (developerWorks, August 2002) examines Java and MacOS applications,
while Part 2 looks at Windows-based products (developerWorks, September 2002).
- You'll find all of the previous installments of the XML Matters column.
- Find more XML resources on the developerWorks XML zone.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
About the author
David Mertz, in his gnomist aspirations, wishes he had coined the observation that the great thing about standards is that there are so many to choose from. But then, he is also fuzzy on OS design. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.