Level: Introductory
Juan Huerta (huerta@us.ibm.com), Research Staff Member, T. J. Watson Research Center, IBM
David Lubensky (davidlu@us.ibm.com), Manager, Advanced Conversational Services, T. J. Watson Research Center, IBM
David Nahamoo (nahamoo@us.ibm.com), DGM Human Language Technologies, T. J. Watson Research Center, IBM
Roberto Pieraccini (rpieracc@us.ibm.com), Multilingual NL components, T. J. Watson Research Center, IBM
T.V. Raman (tvraman@us.ibm.com), Speech/Voice Recognition Research Human Language Technologies, T. J. Watson Research Center, IBM
Charlie Wiecha (wiecha@us.ibm.com), Senior Manager, Interaction Middleware and Standards for Portal Server, T. J. Watson Research Center, IBM
07 Oct 2004
Speech application development is evolving to dynamically generated VoiceXML. Now companies can cost-effectively add speech to Web apps and not sacrifice the quality of the resulting Voice User Interface. Reusable Dialog Components, a component framework based on JavaServer Pages, are central to this evolution. Explore this roadmap for driving down the overall cost of creating, deploying, and managing speech solutions. Also, learn how complex speech applications built with today's technologies can interoperate with speech-enabled Web applications for a smooth transition and a seamless user experience.
The speech-technology industry took its first step toward the adoption
of a Web programming model by standardizing VoiceXML, Version 2.0.
First-generation voice-enabled Web applications were mostly built of static
VoiceXML pages.
The next step is a move to complex applications deployed on standard Web servers and implemented through programs that deliver dynamically generated VoiceXML markup. Bringing speech-enabled Web applications into the mainstream requires uniform programming models for creating and deploying them.
Jumpstart the RDC effort
IBM is spearheading this effort with the initial donation of a set of reference Reusable Dialog Components (RDCs) and the supporting framework integrated within the Java 2 Platform, Enterprise Edition (J2EE) and JavaServer Pages (JSP) programming models. The reference
implementation of this framework is available as open source through
the Apache Jakarta Taglibs project (see Resources).
The initial set of components and the underlying framework are the start of a community effort toward the
evolution of a common programming model for voice interaction based
on J2EE and JSP technology. With this framework, developers can avoid
compartmentalizing speech into its own application-development niche.
Members of the community supporting this initiative can use the framework
and reference implementation according to their business models.
An appeal to Web developers
Unifying the visual and voice Web can lead to a common framework for
collecting and presenting information. Visual Web applications
perform user interaction through widgets assembled on HTML pages. Specialized components that deal with specific types of data, such as money, time, dates, and addresses, help reduce the cost of implementing complex interactive applications and greatly accelerate the development of visual Web applications through the process of customization and reuse.
On the visual Web, creating sophisticated user interaction is mediated by component libraries that ease the generation of complex HTML pages. The move to dynamically generated VoiceXML requires
similar component libraries that capture best practices in Voice User Interface (VUI) design.
Mainstreaming voice access to the Web changes today's practice of developing entire speech applications to a model where voice access is achieved by replacing the visual view layer with a high-quality VUI. In this model, you develop Web applications using standard application frameworks such as Struts; you achieve voice access by creating appropriate views that are assembled from a set of reusable and configurable components. You need to create such components within a framework that encourages interoperability across components to help unify the speech applications market.
Whereas the visual Web can rely on a persistent visual display backed by
error-free user input, the speech medium is temporal and nonpersistent.
Speech interaction is characterized by a sequence of turns where requests
or pieces of information are alternately spoken by the system and by the
user. Although it is advancing at a fast pace, speech-recognition technology
is still error-prone and needs to be backed up by confirmation, correction
and reprompting. With prepackaged dialog components, Web developers can more efficiently handle these aspects of conversational interaction and ease the overall task of speech enablement.
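As a rough illustration of the turn handling described above, the following Java sketch shows one way a component might decide among accepting, confirming, and reprompting on each turn. The class names (RecognitionResult, TurnPolicy, TurnAction) and the threshold values are hypothetical and chosen only for this example; they are not part of the RDC framework.

```java
/** Hypothetical recognition result: the recognized text and an engine confidence from 0.0 to 1.0. */
final class RecognitionResult {
    final String text;
    final double confidence;

    RecognitionResult(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

/** The system's next move for one dialog turn. */
enum TurnAction { ACCEPT, CONFIRM, REPROMPT }

/** Illustrative policy. The thresholds are assumptions, not recommended values. */
final class TurnPolicy {
    private final double rejectBelow;
    private final double confirmBelow;

    TurnPolicy(double rejectBelow, double confirmBelow) {
        if (rejectBelow > confirmBelow) {
            throw new IllegalArgumentException("rejectBelow must not exceed confirmBelow");
        }
        this.rejectBelow = rejectBelow;
        this.confirmBelow = confirmBelow;
    }

    /** No input or very low confidence: reprompt. Middling confidence: confirm. Otherwise accept. */
    TurnAction decide(RecognitionResult result) {
        if (result == null || result.confidence < rejectBelow) {
            return TurnAction.REPROMPT;
        }
        if (result.confidence < confirmBelow) {
            return TurnAction.CONFIRM;
        }
        return TurnAction.ACCEPT;
    }
}

final class TurnPolicyDemo {
    public static void main(String[] args) {
        TurnPolicy policy = new TurnPolicy(0.3, 0.7);
        RecognitionResult[] turns = {
            new RecognitionResult("boston", 0.9),
            new RecognitionResult("austin", 0.5),
            new RecognitionResult("???", 0.1)
        };
        for (RecognitionResult turn : turns) {
            System.out.println(turn.text + " -> " + policy.decide(turn));
        }
    }
}
```

Packaging a decision like this inside a component means the application developer configures thresholds rather than reimplementing the logic for every field.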
For effective use by nonspeech specialists, speech components must embed much of the specific knowledge that enables the creation of high quality speech interfaces. Thus, you must incorporate grammars, prompts, confirmation, and correction strategies into these components. You must also ensure that the components are sufficiently configurable to allow reuse within a wide range of applications. Finally, you should be able to put together sophisticated components from simpler ones.
The Reusable Dialog Component (RDC) framework embodies all of these features. RDCs are interoperable
components within the J2EE and JSP framework that offer a means to bring speech-specific knowledge to Web developers. Each RDC is composed of a data model, speech-specific assets such as grammars and prompts, configuration files, and the dialog logic needed to collect a piece of information. The component implementation generates the VoiceXML that performs the VUI. A developer writes an application by instantiating these components and specifying their run-time behaviors through component attributes and configuration files. The data model is where components store the values collected from the user interaction; components also handle data validation and normalization.
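To make the split between data model and component logic concrete, here is a minimal Java sketch of a digit-string component. It is not taken from the RDC reference implementation: DigitsModel, DigitsComponent, and the minLength and maxLength settings are hypothetical names standing in for a real component's bean and attributes.

```java
import java.io.Serializable;

/** Hypothetical data model bean: stores the value collected from the caller. */
class DigitsModel implements Serializable {
    private static final long serialVersionUID = 1L;
    private String value;
    private boolean valid;

    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
    public boolean isValid() { return valid; }
    public void setValid(boolean valid) { this.valid = valid; }
}

/** Hypothetical component logic: normalizes and validates input, then fills the model. */
class DigitsComponent {
    private final int minLength;   // assumed configuration attribute
    private final int maxLength;   // assumed configuration attribute
    private final DigitsModel model = new DigitsModel();

    DigitsComponent(int minLength, int maxLength) {
        this.minLength = minLength;
        this.maxLength = maxLength;
    }

    /** Normalization: drop whitespace and hyphens that may separate digit groups. */
    static String normalize(String raw) {
        return raw == null ? "" : raw.replaceAll("[\\s-]", "");
    }

    /** Validation: digits only, with a length inside the configured bounds. */
    boolean validate(String normalized) {
        return normalized.matches("\\d+")
                && normalized.length() >= minLength
                && normalized.length() <= maxLength;
    }

    /** Stores the normalized value when it is valid; otherwise clears the model value. */
    DigitsModel collect(String rawInput) {
        String normalized = normalize(rawInput);
        boolean ok = validate(normalized);
        model.setValue(ok ? normalized : null);
        model.setValid(ok);
        return model;
    }
}
```

For example, `new DigitsComponent(4, 6).collect("12 34-5")` leaves the model holding `12345`, while an input such as `12a4` is marked invalid so the dialog logic can reprompt.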
Component data models are implemented as Java beans. Each component implements a set of tasks including data collection, confirmation, validation and disambiguation. Component authors can provide custom implementations for all of these tasks. Atomic RDCs collect simple data values such as a time, the name of a place, or an alphanumeric string; you can put atoms together to form composite RDCs. You also can aggregate composite and atomic RDCs to form more complex
components. The resulting composite RDCs are structured in the same way as atomic ones. They also have a data model, implement sets of tasks, and
include speech-specific assets. Their behavior is specified by attributes and
configuration files. The framework provides a container tag to facilitate the
construction of composite RDCs. The container implementation invokes a
pluggable dialog-management strategy that controls the activation of the
constituent RDCs. The framework provides a default directed-dialog strategy
that a developer can override.
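The container-plus-strategy arrangement can be sketched in plain Java as follows. This is only an illustration of the idea under simplifying assumptions: ChildComponent, DialogStrategy, DirectedDialogStrategy, and CompositeComponent are hypothetical names, and the real container tag and strategy interface in the framework may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical constituent component: has an id and knows whether its value is collected. */
class ChildComponent {
    private final String id;
    private boolean done;

    ChildComponent(String id) { this.id = id; }
    String getId() { return id; }
    boolean isDone() { return done; }
    void markDone() { done = true; }
}

/** Pluggable strategy: chooses the child to activate next, or null when all are done. */
interface DialogStrategy {
    ChildComponent next(List<ChildComponent> children);
}

/** Assumed default behavior: directed dialog that visits children in declaration order. */
class DirectedDialogStrategy implements DialogStrategy {
    @Override
    public ChildComponent next(List<ChildComponent> children) {
        for (ChildComponent child : children) {
            if (!child.isDone()) {
                return child;
            }
        }
        return null;
    }
}

/** Hypothetical container: delegates activation decisions to the current strategy. */
class CompositeComponent {
    private final List<ChildComponent> children = new ArrayList<ChildComponent>();
    private DialogStrategy strategy = new DirectedDialogStrategy();

    CompositeComponent add(ChildComponent child) {
        children.add(child);
        return this;
    }

    /** Lets a developer override the default strategy. */
    void setStrategy(DialogStrategy strategy) {
        this.strategy = strategy;
    }

    ChildComponent activate() {
        return strategy.next(children);
    }

    boolean isDone() {
        return activate() == null;
    }
}
```

Because the container only calls `DialogStrategy.next`, a developer could swap in a different ordering (for instance, one that asks for the arrival city first) without touching the constituent components.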
Remove the cost from development of speech solutions
Building on standardized programming models creates the opportunity to
develop mainstream tools for speech enablement. This section outlines
the roadmap for how IBM sees today's world of speech-oriented applications and the
evolution toward a world where speech enablement is just another aspect of
overall application development.
Adopt the Web programming model for voice interaction
The speech-technology industry took one of its first steps toward
integrating voice interaction with mainstream applications when it adopted
VoiceXML and the associated Web programming model built around HTTP and distributed resources that are identified through URLs. This adoption allowed the speech-technology industry to move away from speech applications written as executable programs that link directly to the underlying speech engines. Today you can develop voice applications using standards-compliant VoiceXML, Version 2.0, which avoids tying the final application to any specific vendor's engine application programming interfaces (APIs).
From static to dynamic VoiceXML
To continue this evolution means creating Web applications that emit
standards-compliant VoiceXML. This follows the same evolutionary pattern as seen on the visual Web; static HTML pages have been replaced over time by server-side Web application frameworks that emit HTML.
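As a small illustration of what emitting VoiceXML from the server side can look like, the Java sketch below builds a one-field VoiceXML 2.0 form as a string. In a real deployment a JSP page or servlet would write this markup to the HTTP response; the class name VoiceXmlPage, the field name, the grammar file, and the submit path are all assumptions made for the example.

```java
/** Minimal sketch of server-side VoiceXML generation. */
final class VoiceXmlPage {
    private VoiceXmlPage() {}

    /** Escapes characters that are significant in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds a VoiceXML 2.0 document with one form and one field.
     * fieldName is assumed to be a valid ECMAScript identifier, as VoiceXML expects.
     */
    static String fieldForm(String fieldName, String prompt, String grammarUri, String submitUri) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n");
        sb.append("  <form id=\"").append(escape(fieldName)).append("Form\">\n");
        sb.append("    <field name=\"").append(escape(fieldName)).append("\">\n");
        sb.append("      <prompt>").append(escape(prompt)).append("</prompt>\n");
        sb.append("      <grammar src=\"").append(escape(grammarUri))
          .append("\" type=\"application/srgs+xml\"/>\n");
        sb.append("      <filled>\n");
        sb.append("        <submit next=\"").append(escape(submitUri))
          .append("\" namelist=\"").append(escape(fieldName)).append("\"/>\n");
        sb.append("      </filled>\n");
        sb.append("    </field>\n");
        sb.append("  </form>\n");
        sb.append("</vxml>\n");
        return sb.toString();
    }
}

final class VoiceXmlDemo {
    public static void main(String[] args) {
        // Hypothetical values: a city field, its grammar file, and a search endpoint.
        System.out.print(VoiceXmlPage.fieldForm("city", "Which city?", "city.grxml", "/flights/search"));
    }
}
```

The point of the sketch is that the prompt, grammar, and submit target become run-time parameters, just as server-side frameworks parameterize the HTML they emit.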
The creation of standardized Web programming models that abstract the details of back-end integration, as well as the underlying business logic that determines the transitions among different stages in an application, has facilitated server-side deployment of Web applications. These standardized models help developers integrate user tasks into ever-larger applications. As the speech-enabled Web evolves in an analogous manner, voice application development moves from today's voice-specific programming model and associated tools to one in which voice interaction is authored as a specialized view that binds to a common underlying Web application.
Tools for voice enablement
Tools for speech-enabling Web applications can integrate seamlessly with
mainstream Web application tools. An example is the Struts builder available
within the IBM WebSphere® Studio Application Developer tool, as shown
in Figure 1.
Figure 1. The Struts builder
With this Struts builder, speech specialists can focus on the task of creating high-quality voice user interaction without having to develop the complete application. These VUI components can incorporate best practices of VUI design and help ensure that speech-enabling Web applications do not sacrifice the quality of the user experience. Finally, during this transition period, you can still integrate existing speech-enabled applications created within today's
voice-centric programming models into the overall
application flow by using the underlying Web framework defined by HTTP. As an example, a voice-enabled financial portal created by binding a VUI to an underlying Web application might choose to invoke a pre-existing speech bank
application through a URL, or more generally, as a Web service. (Struts allows the separation of the presentation layer from the underlying application flow. To produce the voice view, you can voice-enable Struts applications by replacing visual-view JSP pages with RDC-based JSP pages.)
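The idea of keeping one application flow and swapping only the view can be sketched without any Struts dependency. In the Java example below, ViewResolver is a hypothetical stand-in for the forward mappings that a Struts configuration would hold, and the JSP paths are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** The access channel a request arrives on. */
enum Channel { VISUAL, VOICE }

/** Hypothetical stand-in for forward mappings: one logical forward, one view per channel. */
class ViewResolver {
    private final Map<String, String> views = new HashMap<String, String>();

    void register(Channel channel, String forward, String jspPath) {
        views.put(channel + ":" + forward, jspPath);
    }

    /** Returns the JSP path for the given channel and logical forward name. */
    String resolve(Channel channel, String forward) {
        String path = views.get(channel + ":" + forward);
        if (path == null) {
            throw new IllegalArgumentException("No view registered for " + channel + ":" + forward);
        }
        return path;
    }
}

final class ViewResolverDemo {
    public static void main(String[] args) {
        ViewResolver resolver = new ViewResolver();
        // Invented paths: the same logical step has an HTML view and a voice view.
        resolver.register(Channel.VISUAL, "enterAmount", "/html/enterAmount.jsp");
        resolver.register(Channel.VOICE, "enterAmount", "/voice/enterAmount.jsp");
        System.out.println(resolver.resolve(Channel.VOICE, "enterAmount"));
    }
}
```

The application logic refers only to the logical forward name, so adding voice access in this sketch amounts to registering an additional view rather than writing a separate application.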
The goal: Drive cost out of voice applications
When the transition to speech-enabling Web applications is complete, IBM expects the overall cost of voice-enablement to be significantly reduced from today's levels. Each link in the overall end-to-end value chain of speech application deployment can focus on a specific core competency.
Value propositions and business opportunities
Next, we outline how the mainstreaming of speech solutions by adding speech-enablement to the overall portfolio of Web technologies creates
new business opportunities for different segments of the speech industry. The end-to-end value chain that makes up the creation, deployment, and delivery of voice applications comprises several parts. At present, vendors play in more than one part of this value chain -- some of them in at least two or three neighboring sectors. IBM's long-term goal is to help each class of vendors focus on their particular core competencies, while relying on interoperability that comes from using standards.
Voice platform vendors
The momentum behind VoiceXML Version 2.0 has created an exponential growth in the software industry, and IBM expects this trend to be enhanced by the speech-enablement of J2EE Web applications using a standardized programming model that provides robust access, while controlling overall total cost of ownership (TCO). The ability of the mainstream Web programmer to generate high-quality VUIs expressed in VoiceXML can significantly enhance the value of robust VoiceXML browsers.
Hosting
A standardized deployment environment based on the widely used and
tested J2EE Web application architecture helps control the overall cost
of hosting and maintaining speech-enabled applications.
Speech-recognition and text-to-speech (TTS) engines
J2EE Web developers can leverage the evolution of speech technologies to deliver on-demand spoken access to Web services. This can increase market demand for speech technology, which can become part of the standard assets for Web applications. Engine vendors might be enticed to add advanced functionality and improve their core technologies to support advanced requirements defined by component creators and Web developers.
Development tools
Tools that are consistent with interoperable components encourage developers to create libraries of speech-enabling building blocks. These libraries can lead to
rapid application development (RAD) and free developers to focus on
more-sophisticated user interactions.
Enterprises and service providers
As developers bring speech to standard J2EE Web applications, using the widely available skill set of J2EE and JSP Web development, they can add spoken access to businesses quickly and cost-effectively while helping to control TCO.
Application developers
As developers create dynamic voice access to Web applications and services, a standardized Web-programming model and associated tools help reduce
the cost of developing on-demand voice-enabled solutions. Speech-enabling
J2EE applications through JSP technology and use of dialog components can
create demand for application development services based on this standard
programming model.
In conclusion
Speech-recognition technology is mature, and mainstream deployment of
speech solutions can drive down costs in key areas like customer care.
To reduce the cost of creating, managing, and deploying mainstream speech
applications, developers must build on standardized Web-programming
models. This can turn speech-enablement into yet another access channel
to mainstream Web applications. To enable this evolution without sacrificing
the overall quality of the user experience requires the packaging of
speech-interaction expertise into standardized components that can be
integrated into mainstream Web development environments.
Resources
About the authors
Juan Huerta writes articles for IBM developerWorks.
David Lubensky writes articles for IBM developerWorks.
David Nahamoo writes articles for IBM developerWorks.
Roberto Pieraccini writes articles for IBM developerWorks.
T.V. Raman writes articles for IBM developerWorks.
Charlie Wiecha writes articles for IBM developerWorks.