Level: Intermediate
Edd Dumbill (edd@usefulinc.com), Chair, XTech Conference
26 Feb 2004
In this installment, Edd Dumbill starts the development of a vocabulary to describe open source software projects, setting goals and deciding among XML and RDF schema technologies.
One of the great things about open source software is its
essential democracy: Anyone can easily start their own project, and
they often do! Unfortunately, it can be difficult for users to locate
software that suits their purposes. This need has
been met over time by different software registries. Perhaps the
best known and longest-running of these is Freshmeat, but there are
many more, often meeting more specialized needs. For example, the
Free Software Foundation's FSF/UNESCO Free Software Directory, the
GNOME Software Map, or the BioInformatics Software Map (see
Resources for links to all of these).
So many registries now exist that keeping them up to date has become
a real problem. The release cycle for diligent software maintainers
often involves visits to several Web sites to keep the information
up to date, not to mention updating their own Web sites. However, such
maintainers are few and far between, and it's not uncommon
to find out-of-date information in a registry. That this data gets
out of date is unsurprising when you consider the aspects that many
modern software projects involve: mailing lists, IRC channels, Web
sites, wikis, CVS repositories, and so on.
This article starts the development of a solution that meets the
need of keeping software project information up to date: a
vocabulary that can be used in an XML document for the Web-wide exchange
of project details. In this first installment, I will
outline the scope of the project, make implementation technology
choices, and look at relevant existing work.
Goals, scope, and strategy
Every project needs a name. I have chosen to take inspiration
from FOAF (Friend-of-a-friend) and christen the project DOAP, an
abbreviation for "description of a project". Now that 90% of the
difficult choices have been made, on to the rest!
A project such as this can easily get out of hand, with adverse
results. If implementing what you create takes as much effort as
the status quo currently requires, or more, you are unlikely to
succeed, whatever benefits your XML vocabulary may bring. The Web is
littered with failed projects that
tried to do too much. It's worth limiting yourself to a
realistically small set of goals.
The limited requirements for the first iteration of the
vocabulary will include the following:
- Internationalizable description of a software project and its
associated resources, including participants and Web resources
- Basic tools to enable the easy creation and consumption of such
descriptions
- Interoperability with other popular Web metadata projects (RSS,
FOAF, Dublin Core)
- The ability to extend the vocabulary for specialist purposes
Specifically not in scope for the first iteration is the
description of software releases. Work on this can be investigated
as a follow-up initiative. Additionally, planning data internal to
the project such as task assignments or milestones is out of scope.
You don't want to go so far as to reinvent Microsoft Project!
Use cases for project descriptions include:
- Easy importing of projects into software directories
- Data exchange between software directories
- Automatic configuration for resources such as shared CVS
repositories or bug trackers
- Assisting package maintainers who bundle software for distributors
Technology choices
Despite many years of vocabulary development, the choice of
technology remains an open question. Various popular vocabularies
that have found widespread deployment have employed different ways
of specifying the terms. Take a look at some of these to see if you
can glean good practice, or any useful warnings. See Resources
for links to all these specifications.
- Dublin Core Metadata Element Set: This popular library metadata
application uses a technology-independent means of expression,
with accompanying specifications stating how the elements could be
expressed in RDF/XML, HTML meta tags, and W3C XML Schema. Dublin
Core has been very successful, but has some ambiguity in the
interpretation of the semantics of its terms, leading to some
interoperability issues. For example, Creator is specified
as "an entity primarily responsible for making the content of the
resource. Examples of a Creator include a person, an organisation,
or a service. Typically, the name of a Creator should be used to
indicate the entity." For computing purposes, the term "name" may
have a pretty broad interpretation, and the definition above is
only really effective for human consumption of the metadata.
Neither is it clear whether the creating entity must be, for example, a human or a
collection of humans.
- RSS (RDF Site Summary/Really Simple Syndication): The many
flavors of this specification have chosen different routes: Version
0.91 used an XML DTD with additional prose; version 1.0 used prose
plus examples, with an informative RDF schema; and 2.0 is specified
as prose with examples. Underspecification has been a persistent
problem in RSS interoperability.
- ebXML: Vocabularies from this electronic business project
typically use a large amount of prose in their formal
specification; examples, XML DTDs, and schemas are also
provided.
- HTML: Enormously successful by any standard, HTML proliferated
largely on the back of ad-hoc examples. By the time it was more
formally specified, it was too late. The cleanup operation has
taken years. Poor interoperability has cost dearly in terms of time
wasted on testing Web sites in multiple browsers, and indeed in
supporting legacy behavior in the browsers themselves.
As DOAP is primarily intended for computer consumption, it seems
plain that some kind of machine-readable schema for the vocabulary
will be required. On the other hand, as humans will create the data,
it is equally important that the human-readable documentation be
thorough enough to avoid interoperability problems caused by
underspecification. One of DOAP's explicit goals is to serve as an
interchange vocabulary, so it's important to minimize data loss
through incompatible use of the terms. If you've ever tried to
synchronize vCard data between devices, you will know that, despite
the data ostensibly conforming to the specification, each
implementation has its own quirks that need workarounds.
XML or RDF?
One of the really attractive aspects of Dublin Core (DC) is its
mapping to a variety of expressions in RDF, XML, and HTML. Such
generality warms the heart of any software developer. That
notwithstanding, it is probably fair to say that the majority of DC
deployment on the Web has been within RDF.
The example of ebXML demonstrates that where interoperability and
interchange are critical, a well-defined serialization is a must. This presents
a thorny choice -- whether to choose an
RDF or XML representation. For metadata applications, RDF is
generally considered the first-choice language. RDF, unfortunately
and undeservedly, has a reputation as a bit of a bogeyman due to its
additional constraints over XML: You can't just write a tag soup
with RDF and expect it to work, and you don't get the full benefit
of using RDF unless you use RDF-aware processing tools. Many
battles were fought over this in the development of RSS 1.0, which
as a result tries to hide away its RDF-ness as much as
possible.
A straight XML serialization has its difficulties, too. You have a
choice of schema languages with which to define your document
structure, each with different levels of expressivity and tool
support. DTDs, while arguably still the most widespread, do not offer
a very expressive means of defining a document, and are generally
held to be yesterday's technology. W3C XML Schema (WXS) is
more flexible, but is a heavyweight solution whose
acceptance is highest in the commercial software world -- it is
certainly not human readable. RELAX NG is a promising newcomer,
perhaps more understandable than WXS, and boasts easy conversion to
WXS. It also has a human-readable compact syntax, making it more
easily written by hand. Should an XML route be taken, RELAX NG
seems the best bet, as it is readily converted to the other two,
and easier to understand.
XML-only serialization presents some difficulties in the
specification area. Whereas XML defines the syntax of a
document well, it says nothing about the semantics of the elements. RDF
Schema (and its bigger brother OWL, the W3C Web Ontology Language)
allows you to say that a software project's maintainer property is a
sub-property of the Dublin Core term "creator". Any RDF application
that knew how to handle Dublin Core could then make at least basic
use of DOAP data. By contrast, a straight XML document would have no
meaning for an application that didn't have explicit code to
process the DOAP namespace, even if it had the corresponding
schema.
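The inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: RDF statements are modeled as plain (subject, predicate, object) tuples rather than with a real RDF toolkit, the `http://example.org/` URIs and the `maintainer` term are hypothetical placeholders (DOAP has no defined terms yet), and the only rule applied is the RDF Schema one for `rdfs:subPropertyOf`.

```python
# Real namespaces for Dublin Core and RDF Schema.
DC = "http://purl.org/dc/elements/1.1/"
RDFS_SUBPROPERTY = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# Hypothetical placeholders, invented for this sketch.
DOAP = "http://example.org/doap#"
PROJECT = "http://example.org/projects/widget"

# The schema says: every "maintainer" statement is also a "creator" statement.
schema = {
    (DOAP + "maintainer", RDFS_SUBPROPERTY, DC + "creator"),
}

# The data uses only the hypothetical DOAP term.
data = {
    (PROJECT, DOAP + "maintainer", "Jane Example"),
}


def infer_subproperties(triples):
    """Apply the subPropertyOf rule repeatedly until nothing new is added."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = {(s, o) for (s, p, o) in inferred if p == RDFS_SUBPROPERTY}
        for (s, p, o) in list(inferred):
            for (sub, sup) in pairs:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


def creators(triples, resource):
    """What a Dublin Core-only application would look for."""
    return sorted(o for (s, p, o) in triples
                  if s == resource and p == DC + "creator")


graph = infer_subproperties(schema | data)
print(creators(data, PROJECT))   # [] : the raw data never mentions dc:creator
print(creators(graph, PROJECT))  # ['Jane Example']
```

The `creators` function knows nothing about the hypothetical DOAP namespace, yet it finds an answer once the schema has been taken into account. A plain XML consumer has no equivalent hook.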
Lastly, a big unsolved problem remains in XML -- the namespace-mixing issue. Given two arbitrary vocabularies from
different namespaces, how might these be mixed to create a compound
vocabulary? This problem has no general solution, meaning
that except where the solution is explicitly stated by means of
another schema combining the two, each XML vocabulary remains an
island. On the other hand, RDF has a well-specified solution. So,
if you count mixing DOAP with other namespaces a priority, RDF may
be a preferable solution.
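To make the contrast concrete, here is a minimal sketch of why mixing vocabularies is straightforward in RDF. It models statements as (subject, predicate, object) tuples in plain Python rather than using an RDF library, and it deliberately ignores blank nodes, which a real merge has to keep distinct. The `http://example.org/` URIs and the sample values are invented for illustration; the Dublin Core and FOAF namespace URIs are the real ones.

```python
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical resources, invented for this sketch.
PROJECT = "http://example.org/projects/widget"
PERSON = "http://example.org/people/jane"

# One source describes the project using Dublin Core terms...
project_graph = {
    (PROJECT, DC + "title", "Widget"),
    (PROJECT, DC + "creator", PERSON),
}

# ...and another describes the person using FOAF terms.
people_graph = {
    (PERSON, FOAF + "name", "Jane Example"),
}

# Merging is set union: statements about the same URI simply accumulate,
# whichever vocabulary each statement happens to come from.
merged = project_graph | people_graph


def creator_names(triples, resource):
    """Follow dc:creator from a resource, then read each creator's foaf:name."""
    found = {o for (s, p, o) in triples
             if s == resource and p == DC + "creator"}
    return sorted(o for (s, p, o) in triples
                  if s in found and p == FOAF + "name")


print(creator_names(merged, PROJECT))  # ['Jane Example']
```

No third schema was needed to say how the two vocabularies combine. The shared URI for the person is what joins the two descriptions.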
To summarize: Should you choose straight XML, which may be simpler
for people to understand, or RDF, with its flexibility and
accompanying constraints?
Regular readers of my columns may already have guessed that my
proclivities lie with the use of RDF. That is indeed the way that
this project will proceed, since RDF is so well-suited to metadata
applications. However, the problems outlined above will not be
forgotten, and along the way I will look for ways to mitigate the
perceived complexity of using RDF. It will definitely be
advantageous if DOAP can be processed using normal XML tools.
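As a rough idea of what processing with normal XML tools could look like, the sketch below reads a made-up project description with Python's standard `xml.etree.ElementTree` module and no RDF toolkit. Everything about the document is an assumption made for illustration: the `http://example.org/doap#` namespace, the `Project` and `homepage` names, and the sample values are placeholders, since the vocabulary has not been defined yet. The approach also assumes that descriptions are always written in this one fixed RDF/XML shape. RDF/XML permits many equivalent spellings of the same statements, and a plain XML consumer like this one would not recognize the others.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "doap": "http://example.org/doap#",  # hypothetical placeholder namespace
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# A made-up description; element names are assumptions, not a specification.
DOCUMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:doap="http://example.org/doap#">
  <doap:Project rdf:about="http://example.org/projects/widget">
    <dc:title xml:lang="en">Widget</dc:title>
    <dc:title xml:lang="fr">Bidule</dc:title>
    <doap:homepage rdf:resource="http://example.org/widget/"/>
  </doap:Project>
</rdf:RDF>
"""


def read_projects(xml_text):
    """Extract project URI, per-language titles, and homepage from the XML."""
    root = ET.fromstring(xml_text)
    projects = []
    for node in root.findall("doap:Project", NS):
        homepage = node.find("doap:homepage", NS)
        projects.append({
            "uri": node.get(RDF_ABOUT),
            "titles": {t.get(XML_LANG): t.text
                       for t in node.findall("dc:title", NS)},
            "homepage": None if homepage is None else homepage.get(RDF_RESOURCE),
        })
    return projects


projects = read_projects(DOCUMENT)
print(projects[0]["titles"]["fr"])  # Bidule
```

Using `xml:lang` on repeated elements is one conventional way to carry descriptions in several languages, which touches on the internationalization goal listed earlier.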
For the purposes of automated consumption, RDF Schema will be
used to specify the DOAP vocabulary. It will be augmented by prose
as much as possible. The FOAF specification (see Resources) takes
this route, with some success.
Existing work
Now that the technology choices have been made, it's important to see
what existing work relates to the goals of the project.
Having reviewed this work, I'll make a start on the definition of
the vocabulary in the next article in this series. Links to this work can be
found in Resources, and are recommended
reading.
- Freshmeat XML export: The Freshmeat.net software registry
provides an XML export of all its data, updated daily. They also
provide a DTD for the XML format used. Leigh Dodds has done some
work transforming this export into data using terms from
FOAF.
- Open Source Metadata Framework: This project focuses on metadata
for documentation for open source projects, and thus shares some
significant goals with DOAP. It is in wide deployment as part of
the ScrollKeeper Open Documentation Cataloging Project.
- PRJ Project Vocabulary: This vocabulary, created by Danny Ayers,
is actually aimed at being a general project management vocabulary,
irrespective of domain.
- CPAN2FOAF: CPAN, the Comprehensive Perl Archive Network, is a
large repository of Perl software. Dan Brickley has worked on
converting authorship metadata into FOAF/RDF.
- Description of a Software Project: This is the beginning of a
DOAP-like vocabulary, created by Max Völkel.
- RPMFind: This software location service uses RDF descriptions of
software packaged using the RPM format. The metadata is very
detailed about each software release.
Resources