Level: Intermediate
Edd Dumbill (edd@xml.com), Editor and publisher, xmlhack.com
24 Mar 2004
Edd Dumbill continues the development of a vocabulary for describing open source software projects, looking at existing software registries and examining the problem of constraining property values.
In the first
article in this series, I introduced the project to build DOAP ("Description of a project"),
an RDF/XML vocabulary for describing open source projects. DOAP
will meet the needs of project maintainers who find they must
register their software at myriad Web sites, and of anyone seeking
to exchange such data. That article outlined existing
work in this area, and defined the boundaries of the project.
This time, I will distill a set of terms that are candidates for inclusion in this vocabulary and talk
about some of the difficulties inherent in specifying it. I will show you that the
admirable aim of being able to share DOAP descriptions
globally has some consequences for the design of this
vocabulary.
Condensing terms
Table 1 shows a survey of metadata terms used in various
software directory Web sites and also in the open source metadata
framework. For some terms, I have made a closest-approximation
categorisation -- for example equating "lead developer" with
"maintainer." I have also excluded items relevant to software
releases, which are out of scope for this stage of DOAP. The table
is useful as a broad-brush survey of terms in circulation.
Table 1. Commonly used metadata terms in open source software directories
| Term | Freshmeat | OMF | Advogato | GNOME | Sourceforge | KDE-apps.org |
| --- | --- | --- | --- | --- | --- | --- |
| bug tracker | | | | y | y | |
| category | y | y | | y | y | |
| creation date | y | y | | y | y | |
| cvs repository | | | | y | y | |
| demo site | y | | | | | |
| description | y | y | y | y | | y |
| development stage | y | | | | y | |
| download page | y | | | y | y | y |
| freshmeat url | y | | y | | | |
| homepage | y | | y | y | y | y |
| intended audience | y | | | | y | |
| license | y | y | y | | y | y |
| mailing list | | | | y | y | |
| mirror site | y | | | | | |
| natural language | | y | | | y | |
| operating system | y | | | | y | |
| programming language | y | | | | y | |
| purchase link | y | | | | | |
| relationships | | | y | | y | |
| rel contributor | | | y | | | |
| rel developer | | | y | | y | |
| rel documenter | | y | y | | | |
| rel helper | | | y | | | |
| rel lead developer | | y | y | y | y | |
| screenshots | y | | | y | | y |
| short description | y | | | y | y | |
| title | y | y | y | y | y | y |
| version | y | y | | y | y | y |
In addition to the terms found in Table 1, here are a few other
items to consider. They are commonly used in open source projects
and are mostly of a more social nature:
- Wikis: Often used to host development documentation.
- Other kinds of source repositories: Subversion, Arch, and BitKeeper
are commonly used in addition to CVS.
- Additional project roles: Include at least "translator" and
"tester."
- PGP public key: Many software releases are digitally signed with
PGP to give some guarantee of authenticity.
Most of the sites surveyed in Table 1 employ a very simple model
where the project forms the sole entity. The metadata items are
then simple properties of this entity. Later, you will see that a case
can be made for some of those property values to be
complex entities themselves. An example of this is properties whose
range (the collection of permissible values for the property, called its domain in database terminology) is
people. It is clearly desirable to record more than just a name to identify a person. However, it's important to strike a balance
between completeness and overcomplicating the vocabulary.
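The difference between a simple property and a property whose value is itself an entity can be sketched in a few lines of Python. This is only an illustration: the class names and fields (`Person`, `Project`, `email`, `maintainers`) are assumptions made for the example, not terms that DOAP has settled on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """A person described by more than a bare name (fields are hypothetical)."""
    name: str
    email: str = ""
    homepage: str = ""


@dataclass
class Project:
    """The central entity; most metadata items remain simple properties."""
    title: str
    homepage: str
    description: str = ""
    # The value of this property is a list of entities, not a list of strings.
    maintainers: List[Person] = field(default_factory=list)


project = Project(
    title="XML Parser",
    homepage="http://example.org/xmlparser",
    maintainers=[Person(name="Jane Doe", email="jane@example.org")],
)

# The maintainer carries its own properties.
print(project.maintainers[0].email)
```

Keeping `Person` down to a handful of fields reflects the balance mentioned above: enough to identify someone, without turning the vocabulary into a general-purpose address book.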
Figure 1 shows a partial entity-relationship diagram for the
vocabulary. These diagrams can prove helpful in cases that have multiple interacting entities. An alternative to
entity-relationship modeling is to use UML, the Unified Modeling
Language. Several other articles have examined in
depth the application of UML to creating XML vocabularies,
including those using W3C XML Schema (see Resources).
In this case, you have few entities and many attributes. You can
probably manage well without constructing a complete set of
diagrams. Due to the simple nature of the task, most of the
challenges lie not in the modeling itself, but in making DOAP easy to create and process.
Figure 1. A partial entity relationship diagram for the new vocabulary.
So far you have accumulated a set of candidate terms to be
included in your vocabulary. Choosing which to use is a matter
of design, trial-and-error, and personal preference. In due course,
you will need to construct some example usages in order to get a
feel for how the vocabulary will work. You'll also need to test out your design
ideas. Before then you must consider some of the problems you'll come
up against while data modelling.
Uniquely identifying projects
To efficiently manipulate the data expressed in your vocabulary,
you need to nominate at least one property as identifying a project.
This is analogous to making a column in a database a primary key.
Unlike in the case of a database, however, a locally unique key will not do.
The project key must be globally unique if DOAP descriptions are to
be shared on the Web. Yet how can you administer this? One of the
basic principles of DOAP is that it is decentralized. Descriptions
can be created and distributed without registering on a particular
Web site.
On the Web, a common way of globally identifying an item is to
give it a URI. As every software project has a home page on the Web,
it seems sensible to nominate the home page URI as the
identifying property for a project. The only other major contender
for the identifying property is the project's name. The weakness of
using the name lies in the lack of an authority to appeal to in the
case of duplicates. It is not uncommon for projects to choose the
same name. In such cases, confusion often arises and the conflict may have no
happy conclusion. With home page URIs (that is, URLs), the
global authority of the DNS ensures that names do not clash.
Using home page URLs has one obvious disadvantage. In the
ideal world, cool URIs don't change (see Resources). In the real
world they change all the time. A project maintainer may change ISPs
or host institutions. The project might get a new maintainer with
different resources. Or it may just be that a Web site is
reorganised. Clearly you do not want all DOAP descriptions to be
invalidated if such a thing happens.
To solve this problem you need an old home page
property. A project can have more than one of these properties, which can be
added whenever the site is moved. You can then consider the old home
page also to be an identifying property. The constraint is that no
other project may ever use the old home page address.
How does this work out? Imagine you have descriptions of the same
project contained in two independent DOAP files:
- The first gives the home page property as http://example.org/xmlparser
- The second gives the home page
property as http://example.org/projects/xml/parser and includes the URL http://example.org/xmlparser as
an old home page property
Any processing agent can then figure out that the two descriptions refer to the same project.
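The matching step a processing agent performs can be sketched as follows. This is a minimal illustration in Python rather than an RDF implementation, and the dictionary keys `homepage` and `old_homepages` are hypothetical stand-ins for whatever property names DOAP eventually adopts.

```python
def identifying_urls(description):
    """Collect every URL that identifies a project: current and old home pages."""
    urls = set(description.get("old_homepages", []))
    homepage = description.get("homepage")
    if homepage:
        urls.add(homepage)
    return urls


def same_project(a, b):
    """Two descriptions match if they share at least one identifying URL."""
    return not identifying_urls(a).isdisjoint(identifying_urls(b))


# The two independent descriptions from the example above.
first = {"homepage": "http://example.org/xmlparser"}
second = {
    "homepage": "http://example.org/projects/xml/parser",
    "old_homepages": ["http://example.org/xmlparser"],
}

print(same_project(first, second))
```

The check only works if the constraint stated earlier holds: once a URL has served as a project's home page, no other project may reuse it.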
This plan has been shown to work well in the FOAF
(Friend-of-a-friend) project for expressing personal information
and social networks. Find more details under the heading
"Merging FOAF descriptions" in my article "Finding friends with XML and RDF."
Constraining property values
The desire for DOAP to be decentralized and global raises issues
in areas other than the unique identification of projects. The
range of values that properties can take must also be predictable
in some way in order to perform useful processing over the global
collection of DOAP information. To illustrate this, I'll take a look at the license property.
Data design 101. Humans can tell that there is no difference in
intended meaning between GPL2, GNU General Public License,
Version 2, and even http://www.gnu.org/licenses/gpl.html.
Computers obviously cannot. The conventional database-inspired
solution to a problem like this is to settle on an agreed-upon set of
codes or abbreviations for the various licenses. Additionally, you
will need an extension mechanism for when a custom license is
used.
Unremarkable so far. Here's where an interesting aspect of using
RDF/XML comes into play. In RDF/XML, the property may take
two kinds of values: one is a resource, identified
by a URI, and the other is a string literal. These literals may be
datatyped, so you could define a W3C XML Schema enumeration to
govern the permissible values (see Resources). The license property
could then be one of, for example, GPL, BSD, Apache, and so on. If it
is "Other" then you could add an extra text field to describe the
alternative license.
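To make the enumeration approach concrete, here is a rough Python analogue. It uses a Python `Enum` in place of a W3C XML Schema enumeration, so it illustrates the idea rather than the schema machinery itself. The codes are the ones mentioned above, and the `parse_license` helper is hypothetical.

```python
from enum import Enum


class License(Enum):
    """The closed set of permitted license codes."""
    GPL = "GPL"
    BSD = "BSD"
    APACHE = "Apache"
    OTHER = "Other"


def parse_license(code, other_text=""):
    """Validate a license code; "Other" must come with a free-text description."""
    license_ = License(code)  # raises ValueError for codes outside the enumeration
    if license_ is License.OTHER and not other_text:
        raise ValueError('"Other" requires a description of the license')
    return license_, other_text


print(parse_license("GPL"))
print(parse_license("Other", "Acme Open Source License"))
```

A value such as "GPL2" is rejected outright. That is the point of the enumeration, and it is also why every processor needs access to the list of permitted codes.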
The disadvantage of this approach is that an extra burden is
placed upon those who process a DOAP file. They now need to take
into account the presence of an extra schema and import the heavy
machinery required to do W3C XML Schema validation. Even then, all
they get is an opaque string that must be augmented with extra
information if it is to be useful to a human observer. Using an
enumeration also creates extra overhead for the maintainers of
the DOAP vocabulary, as there is now an extra schema to take care
of and distribute.
You gain extra flexibility if you use a resource instead of a
literal. You can then allocate URIs in space you control to denote
licenses. For example, http://example.org/doap/licenses/GPL could
be used for the GNU General Public License. (The domain
"example.org" is used illustratively here.) You can also put a Web
page at that location with further information about the GPL,
including its full text. As an additional courtesy, you can publish
the complete list of licenses you support at
http://example.org/doap/licenses/. This adds no extra overhead
for a DOAP processor. It is as easy to look for the string
http://example.org/doap/licenses/GPL as it is to look for the string GPL.
You can make things even easier for processors if you create an RDF
file hosted at the .../doap/licenses/ URL that contains a
computer-processible license list, augmented with handy data such
as labels and descriptions for each license.
This technique also neatly solves the extensibility problem.
Imagine that you, Acme Corp., create your own Acme Open Source
License. All you need to do is guarantee that you control a URI similar
to http://acme.com/license/AOSL and use that as the value of the
license property in DOAP descriptions. And if you're a good citizen
you'll put an explanatory Web page at that URI.
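A processor built around license URIs might look something like the sketch below. The label table is hard-coded here purely for illustration; in practice it could be loaded from the RDF license list described above. The BSD URI, the function names, and the dictionary keys are assumptions made for the example.

```python
# Labels for the licenses the vocabulary maintainers publish (illustrative only).
LICENSE_LABELS = {
    "http://example.org/doap/licenses/GPL": "GNU General Public License",
    "http://example.org/doap/licenses/BSD": "BSD License",  # hypothetical URI
}


def license_label(uri):
    """Return a human-readable label, falling back to the URI itself."""
    return LICENSE_LABELS.get(uri, uri)


def projects_with_license(projects, uri):
    """Matching on a URI is a plain string comparison, just as with a short code."""
    return [p["title"] for p in projects if p.get("license") == uri]


projects = [
    {"title": "XML Parser", "license": "http://example.org/doap/licenses/GPL"},
    {"title": "Acme Widgets", "license": "http://acme.com/license/AOSL"},
]

for p in projects:
    print(p["title"], "-", license_label(p["license"]))
```

The Acme license needs no change to the processor or to any central list. Its URI is simply treated as an opaque identifier that a human can follow for more information.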
Using resource URIs in this way has two disadvantages.
The first is the simple matter that it's easier to type "GPL" than
the full URI suggested above. This is not a large problem and can
be ameliorated somewhat by providing shortcut syntaxes or tool
support later in the project. In RDF, labels can be used to provide
human-readable interpretations of resource URIs. It's certainly
less of a problem than either having a free-for-all string or the
burden of bringing in schema validation.
The second and more serious disadvantage is the legitimate concern
that you as DOAP's maintainer might lose control of the URI-space
http://example.org/doap. While in the short term the URIs could
continue to be used without invalidating their status as opaque
identifiers, considerable confusion could arise if the
content at that URI is changed or removed. Two common
means of addressing this are available today:
- Use a service such as purl.org (see Resources) that makes some
warranty of longevity for URIs registered with it, or affiliate
DOAP with a standards organisation such as OASIS that
can make a similar guarantee.
- Use a Uniform Resource Name (URN -- see Resources) rather than a
URL.
URNs provide a managed namespace through the
Internet Assigned Numbers Authority (IANA). The portion of the URN namespace
allotted to DOAP can then be managed through documents submitted to
the Internet Engineering Task Force (IETF). Unfortunately, this
process is indescribably unwieldy and probably unsuitable for a
project such as this.
Finally, be aware that it may not always be best to use
a resource to represent such constants. Resources are best suited
to situations where extensibility and further investigation are
helpful. Occasionally, you may trade this off against the convenience
of using short strings.
It is useful to look at the approach taken by the Creative
Commons project (see Resources). Creative Commons allows the
application of flexible licenses to electronic media, intended to
expand the body of creative work available for others to build on
and share. It takes the approach of denoting a license through a URI
as advocated above.
Conclusion
This article has attempted to address some common issues in designing
a globally shared vocabulary, inspired in part by experience with the FOAF vocabulary. Taking
the Web-wide perspective that RDF brings has both advantages and
disadvantages. This situation can be characterised
generally as a trade-off between verbosity and flexibility. Yet as
any programmer knows, it would be a mistake to optimise
prematurely. I will follow the Web-wide ethos to its conclusion,
and then show you what may be done to lower the barrier to entry for new
DOAP users.
The next article in this series will continue the design of the
vocabulary to a point where you can start experimenting with tools
and test data.
Resources
- Get a good overview of the use of UML in creating XML data models with Dave Carlson's article "Modeling XML Vocabularies with UML."
- Will Provost's article "UML For W3C XML
Schema Design" goes into more detail about how to use UML with the W3C's XML Schema language.
- Learn more about the development of a W3C XML Schema-based vocabulary for financial
reporting in "Design
of the XBRL specification," a paper delivered at XML Europe by David Vun Kannon and Yufei Wang. Rational Rose was used as a modeling tool.
- Check out Tim Berners-Lee's "Cool URIs don't
change" article as he argues passionately that "URIs don't change: people change them."
- Read Edd Dumbill's article "Finding friends with XML and RDF" (developerWorks, June 2002), which explains how identifying properties can be used to merge independent descriptions.
- Find out how W3C XML Schema data types can be used in conjunction with RDF.
- Discover more about persistent URLs provided as a service on purl.org, which is managed by the Online Computer Library Center (OCLC).
OCLC is committed to the longevity of the service, which permits the registration of a Persistent URL with a redirect to its current "real home" on the Web.
- Learn about URNs as defined by a series of specifications submitted to the IETF.
- Take a look at the Creative Commons project, which has a framework for describing and applying licenses to digital media. They use URIs to denote a license, and also use RDF to express the rights attached to a media item. Uche Ogbuji examines Creative Commons in his Thinking XML column, "The commons of creativity" (developerWorks, May 2003).
- Review other articles in this series: Part 1 introduces the DOAP project (developerWorks, February 2004), while Part 3 presents a schema for the new vocabulary and example project descriptions (developerWorks, June 2004).
- Find more XML resources on the developerWorks XML zone. Read previous installments in the XML Watch column series.