Versatile Multimodal Solutions

User interaction is about creating an effective man-machine conversation that leads to rapid task completion. With this in mind, we can factor the typical application into the data model that holds the current interaction state, user interface components that render this state, and interaction behavior that is determined by the active event handlers.

Today, Web interaction is authored in markup languages like XHTML, with the host document providing the data model and user interface components, and the host browser implementing the DOM2 eventing loop for bringing the Web interaction to life. The W3C architecture exemplifies this separation in its current developments:

  • Widespread adoption of CSS separates content from presentation.
  • W3C XForms separates the data model from the user interaction.
  • W3C XML Events exposes the DOM2 Events interface to the XML author, thereby making the original vision of "the document is the interface" a reality.

    Notice that in this architecture, the richness of end-user interaction is a function of the user interface events that are raised and the event handlers available to respond to them. As the next evolution in user interfaces, multimodal interaction can be integrated into this evolving framework by enabling the Web author to attach rich voice handlers that implement spoken dialogs to aid in rapid task completion. A means to achieve this end was first detailed in XHTML+Voice (X+V). This article describes X+V 1.1, an update to X+V that integrates the results of more than two years of experience gained by implementing multimodal solutions using this framework.

    The remaining sections of this article summarize the additions to X+V and illustrate their use in creating multimodal interaction that leverages mixed-initiative VoiceXML dialogs. Formal descriptions of these additions can be found in the X+V 1.1 specification; here we'll focus on motivating these additions and explaining their use.

    Aural CSS - Speaking in Style
    Aural CSS (ACSS) - part of W3C Cascading Style Sheets (CSS) - enables the XHTML author to specify style rules for aural presentation. X+V 1.1 leverages Aural CSS by allowing the XHTML author to attach an aural style rule to a CSS class. In the following example, we illustrate a stylesheet fragment that attaches aural styles to XHTML p elements having class romeo or juliet.

    p.romeo { voice-family: male; volume: loud; pause-before: 20ms; }
    p.juliet { voice-family: female; volume: soft; }

    Create the content in style. Identify p elements in the XHTML document using class romeo or juliet as shown below.

    <body ev:event="load" ev:handler="#sayHello">
    <p id="hello_romeo" class="juliet">
    Romeo, Romeo, where art thou?
    </p>
    <p id="hello_juliet" class="romeo">
    I am here.
    </p>
    </body>

    Finally, define the voice handler invoked in the above fragment to speak the contents of the p elements. Contents from the XHTML document are accessed from within the vxml:prompt element using new X+V attribute xv:src. Using attribute src in this manner is consistent with the forthcoming W3C XHTML 2.0, which uses this attribute to allow authors to specify the contents of elements indirectly; X+V 1.1 extends element vxml:prompt with attribute xv:src to enable equivalent functionality when authoring multimodal interaction.

    <vxml:form id="sayHello">
    <vxml:block>
    <vxml:prompt xv:src="#hello_romeo"/>
    <vxml:prompt xv:src="#hello_juliet"/>
    </vxml:block>
    </vxml:form>

    See Listing 1 for the complete example.
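
Listing 1 is not reproduced in this text. As a rough guide only, the three fragments above might be assembled into a single X+V document along the following lines. This is a sketch under assumptions, not the author's listing: the namespace URIs, the title, and the placement of the voice form in the head should be checked against the X+V 1.1 specification. The prompt elements carry the vxml: prefix to match the vxml:prompt name used in the text, and the CSS selectors are lowercase because element names in XHTML are case-sensitive.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only, not the article's Listing 1. Namespace URIs are assumed. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Romeo and Juliet</title>
    <!-- Aural style rules, one per speaker class -->
    <style type="text/css">
      p.romeo  { voice-family: male; volume: loud; pause-before: 20ms; }
      p.juliet { voice-family: female; volume: soft; }
    </style>
    <!-- Voice handler: speaks the two paragraphs by reference via xv:src -->
    <vxml:form id="sayHello">
      <vxml:block>
        <vxml:prompt xv:src="#hello_romeo"/>
        <vxml:prompt xv:src="#hello_juliet"/>
      </vxml:block>
    </vxml:form>
  </head>
  <!-- The load event on body invokes the sayHello handler -->
  <body ev:event="load" ev:handler="#sayHello">
    <p id="hello_romeo" class="juliet">Romeo, Romeo, where art thou?</p>
    <p id="hello_juliet" class="romeo">I am here.</p>
  </body>
</html>
```

Because each prompt points at a p element rather than repeating its text, the spoken and visual content cannot drift apart, and the aural style applied is the one attached to the class of the referenced paragraph.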

    Two-Way Synchronization - One Handler to Bind Them
    In a multimodal interaction, user input obtained via a given interaction modality needs to be made available to all participating interaction modalities. In practice, this reduces to two-way synchronization between the visual and auditory modalities when authoring multimodal interaction using X+V and the HTML forms module. Note that the transition to W3C XForms will enable the synchronization of more than two modalities since XForms provides an explicit data model that records interaction state. When we created X+V 1.0, we wanted to provide a smooth transition for today's Web authors using HTML forms; to ease this transition, we enabled implicit two-way synchronization between the visual and spoken interaction states. Experience in authoring applications using such implicit synchronization showed that there is value to exposing this two-way synchronization to the XML author; to this end, X+V 1.1 defines declarative construct sync, which can be used to associate components of the visual and aural interaction state. The declarative nature of element sync is particularly important when deploying X+V solutions to thin clients that may not be able to support a full scripting environment.

    Element sync can be thought of as a declarative event handler that synchronizes the two interaction state components being linked. We describe the use of element sync in the remainder of this section, with illustrative code fragments taken from the complete example shown in Listing 2.

    Element sync uses attributes input and field to specify the two interaction state components to be synchronized. The attribute names were chosen to match the interaction components most commonly used in the visual and aural modalities, respectively. The value of attribute input is the value of attribute name on the HTML form control being synchronized. Notice that in the absence of a data model in HTML forms, this name attribute also names the location where the control stores user input. Attribute field holds a reference to the id of the VoiceXML field to be synchronized.

    In the mixed-initiative example shown in Listing 2, we've declared that the visual input controls that collect the hotel and city should be synchronized with the corresponding VoiceXML fields via the following statements:

    <xv:sync input="city" field="#field_city"/>
    <xv:sync input="hotel" field="#field_hotel"/>

    Notice that the value of attribute field is a URI, i.e., as in the rest of X+V, the VoiceXML elements may appear either within the XHTML document, or in a separate XML file. Using this addition to X+V, the author can enable several interaction metaphors that will be described in the remainder of this section.
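
Listing 2 is not reproduced in this text. The sketch below shows one way the pieces named in this section could fit together: two HTML form controls, a mixed-initiative VoiceXML form with two fields, and the two sync elements. Everything beyond the two sync statements is an illustrative assumption rather than a detail of the original listing: the handler id hotel_dialog, the grammar file names, the prompt wording, the form action, the use of an id attribute on vxml:field, the placement of the sync elements in the head, and the namespace URIs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only, not the article's Listing 2. Ids, file names, and prompts are hypothetical. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Hotel reservation</title>
    <!-- Mixed-initiative dialog: the form-level grammar may fill both fields at once -->
    <vxml:form id="hotel_dialog">
      <vxml:grammar src="hotel_and_city.grxml" type="application/srgs+xml"/>
      <vxml:initial name="start">
        <vxml:prompt>Where would you like to stay?</vxml:prompt>
      </vxml:initial>
      <!-- The id attributes are what the sync elements point at (assumed form) -->
      <vxml:field name="field_city" id="field_city">
        <vxml:grammar src="city.grxml" type="application/srgs+xml"/>
        <vxml:prompt>Which city?</vxml:prompt>
      </vxml:field>
      <vxml:field name="field_hotel" id="field_hotel">
        <vxml:grammar src="hotel.grxml" type="application/srgs+xml"/>
        <vxml:prompt>Which hotel?</vxml:prompt>
      </vxml:field>
    </vxml:form>
    <!-- Declarative binding between an HTML control name and a VoiceXML field -->
    <xv:sync input="city" field="#field_city"/>
    <xv:sync input="hotel" field="#field_hotel"/>
  </head>
  <body>
    <!-- Assumed trigger: the dialog starts when the form receives focus -->
    <form action="reserve" ev:event="focus" ev:handler="#hotel_dialog">
      <p>City: <input type="text" name="city"/></p>
      <p>Hotel: <input type="text" name="hotel"/></p>
    </form>
  </body>
</html>
```

Note that no script appears anywhere in the sketch: the association between each text box and its voice field is carried entirely by the two sync elements.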

    Spoken Input with Visual Confirmation
    The mixed-initiative VoiceXML dialog shown in Listing 2 is activated when the XHTML form gets focus. The mixed-initiative dialog collects the hotel and city names when the user speaks:

    "I would like to stay at the Chicago Airport Hilton"

    and the values are synchronized with the visual interaction components. The user gets immediate visual confirmation as a result of the values being synchronized.

    Now suppose that, for the same utterance, the voice browser recognizes the hotel but fails to identify the city. Tapered prompts within the VoiceXML handler lead the user through the rest of the task. Synchronizing the fields gives the user immediate feedback about the portion of the task that remains to be completed.

    Talk and Type
    Notice that resorting to tapered prompts when the mixed-initiative dialog fails to get all the desired values turns the mixed-initiative dialog into a directed dialog. In a multimodal interaction, a mixed-initiative dialog can also turn into a directed dialog if the user switches to the keyboard or stylus after providing initial speech input. To continue the example, after speaking the previously mentioned utterance, the user might explicitly move the focus to one of the visual input controls using the keypad or pointing device. Such user action can be thought of as an implicit escape from the mixed-initiative dialog. The X+V execution model specifies that only one voice handler is active at a given time; we leverage this fact in the example shown in Listing 2 by attaching simple VoiceXML directed dialogs to each of the visual input controls for the focus event. As a result, if the mixed-initiative dialog fails to perform satisfactorily for a given user, that user can provide input via the keypad or pointing device to transition to a directed dialog. To complete the example, tabbing to the hotel input control first cancels the mixed-initiative dialog and then activates the directed dialog for that field.
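
As a hedged illustration of the bindings just described, the sketch below attaches one directed dialog to each visual control for the focus event. It is not taken from Listing 2: the handler ids ask_city and ask_hotel, the grammar files, the prompt text, the form action, and the namespace URIs are all hypothetical, and the mixed-initiative dialog and the sync elements are left out to keep the fragment short. The count attribute on vxml:prompt is the standard VoiceXML mechanism for tapered prompts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only. Handler ids, grammar files, and prompt text are hypothetical. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Hotel reservation: directed dialogs</title>
    <!-- Directed dialog for the city control, with tapered prompts -->
    <vxml:form id="ask_city">
      <vxml:field name="field_city">
        <vxml:grammar src="city.grxml" type="application/srgs+xml"/>
        <vxml:prompt count="1">Which city?</vxml:prompt>
        <vxml:prompt count="2">Please say the name of a city, for example Chicago.</vxml:prompt>
      </vxml:field>
    </vxml:form>
    <!-- Directed dialog for the hotel control -->
    <vxml:form id="ask_hotel">
      <vxml:field name="field_hotel">
        <vxml:grammar src="hotel.grxml" type="application/srgs+xml"/>
        <vxml:prompt count="1">Which hotel?</vxml:prompt>
        <vxml:prompt count="2">Please say the name of a hotel.</vxml:prompt>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="reserve">
      <!-- Moving focus to a control starts its directed dialog. Since only one
           voice handler is active at a time, a running dialog is cancelled first. -->
      <p>City: <input type="text" name="city"
                      ev:event="focus" ev:handler="#ask_city"/></p>
      <p>Hotel: <input type="text" name="hotel"
                       ev:event="focus" ev:handler="#ask_hotel"/></p>
    </form>
  </body>
</html>
```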

    Talk or Type
    In certain situations, the author might wish to create a multimodal experience in which the user is initially prompted with a voice prompt, along with the activation of a mixed-initiative dialog. If the user starts providing input via the keypad, the author might wish to keep the mixed-initiative dialog active, rather than resorting to a directed dialog where the user hears a spoken prompt for each field. This form of interaction can be enabled by simply dropping the binding of the directed dialog handlers to the individual visual controls.

    Cancel - Speech Is Silver, but Silence Is Golden
    We described the implicit canceling of a running VoiceXML dialog in the previous section. This cancel action is made available to the XHTML author via handler cancel. We use this in Listing 2 where we attach handler cancel to XHTML control reset. The effect is that if the user resets the form via the nonspeech modality, the currently active voice handler is canceled, and the user interaction returns to its initial state.

    Deploying X+V Multimodal Solutions
    Reuse of content assets is one of X+V's strongest features. X+V enables speech interaction specialists to create high-quality spoken dialogs that can be easily plugged into well-designed visual interfaces. In this context, it is worth noting that multimodal solutions need to be deployed in a variety of device and network environments, in contrast to traditional desktop Web interfaces that adhere to the "one browser fits all" model. What follows is an overview of the various deployments enabled by the X+V architecture.

    PDA
    A PDA capable of local speech processing can execute the voice handlers shown in Listing 2 locally on the device. Visual and spoken modalities can be tightly synchronized to provide a rich multimodal user experience. Figure 1 shows a fat-client architecture in which the multiple modalities are processed locally on the client.

    Mixed-mode client
    A PDA or smart phone may not always have sufficient resources to perform every speech processing task. In this case, complex speech processing, and consequently, the execution of the associated voice handlers can be off-loaded to a voice server running on the network. Notice that the X+V design is particularly well-suited to this kind of deployment since the voice handlers do not have to reside in the same document as the rest of the XHTML markup. In Figure 2, an overview of this architecture, the mixed-mode client off-loads some processing to a network server. The X+V markup can be distributed as needed.

    This feature is a consequence of reusing XML Events within X+V for creating event bindings. XML Events has been designed to operate in the hypertext environment of the Web, and a key design feature is that event handlers can be referred to via a URI. We emphasize this point in this article because the X+V specification itself does not highlight this fact, since we got the feature for free by virtue of reusing XML Events. Since the publication of X+V 1.0, authoring of remote voice handlers has been one of the most frequently asked questions from X+V partners, and we highlight the solution here for this reason.

    Thin client
    Consider a cellphone that has sufficient speech processing capability to implement only simple command and control navigation. All other speech functions are performed by off-loading the processing to a network-based voice server. In such an environment, the X+V document illustrated in Listing 2 can be factored into one XML document that holds the visual XHTML markup and event bindings, and a separate XML document that holds the voice handlers. The event bindings in the XHTML markup can be updated to point to the network location where the XML document containing the voice handlers will be processed. Figure 3 shows an overview of this architecture. A distributed multimodal client can carry out processing of user input at different points on the network. The markup solution provided by X+V allows for the relevant markup to be transmitted and cached where it will eventually be used. This can save valuable bandwidth and XML processing on thin clients.

    This has the advantage that only the visual markup is transmitted to the thin client, i.e., the deployment respects the separation of concerns inherent in the X+V document by not transmitting the voice markup to the thin client, which saves valuable bandwidth and processing cycles there. The voice handlers are preloaded into the voice server, which can consequently be extremely responsive when the thin client requests that one of the specified voice handlers be invoked. XML Events bridges the two execution environments, i.e., the visual browser on the thin client and the voice browser on the network.
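
The factoring described above might look roughly as follows from the thin client's side. This is a sketch under assumptions: the host voice.example.com, the file name hotel-voice.xml, the fragment ids, the form action, and the namespace URIs are all hypothetical, and how the client and the voice server actually exchange events is outside what the markup shows. The only point being illustrated is that handler and field references are URIs, so they can name elements in a document held elsewhere.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the visual document sent to the thin client.
     The host voice.example.com and the file hotel-voice.xml are hypothetical. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Hotel reservation</title>
    <!-- No voice markup here: the voice handlers live on the voice server -->
    <xv:sync input="city"
             field="http://voice.example.com/hotel-voice.xml#field_city"/>
    <xv:sync input="hotel"
             field="http://voice.example.com/hotel-voice.xml#field_hotel"/>
  </head>
  <body>
    <!-- The event binding names a handler in the remote voice document -->
    <form action="reserve"
          ev:event="focus"
          ev:handler="http://voice.example.com/hotel-voice.xml#hotel_dialog">
      <p>City: <input type="text" name="city"/></p>
      <p>Hotel: <input type="text" name="hotel"/></p>
    </form>
  </body>
</html>
```

The voice document at the hypothetical URL would hold the VoiceXML forms unchanged, so the same dialogs could serve a fat client that embeds them and a thin client that only references them.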

    Finally, notice that the completely declarative nature of the X+V document is a major win in reliably deploying the multimodal application to devices such as cellphones which typically lack support for a full scripting environment.

    References

  • XML Events: www.w3.org/TR/xml-events
  • CSS: www.w3.org/TR/CSS2
  • W3C XForms: www.w3.org/TR/xforms
  • XHTML+Voice: www.w3.org/TR/xhtml+voice
  • X+V 1.1: www-3.ibm.com/software/pervasive/multimodal
  • W3C XHTML 2.0: www.w3.org/TR/xhtml2
    About T.V. Raman
    T. V. Raman is an accomplished computer scientist with over eight years of industry experience in research and advanced technology development. During this time, he has authored two books and several scientific publications; his work on auditory interfaces was profiled in the September 1996 issue of Scientific American. He has leading-edge expertise in auditory interfaces, scripting languages, and Internet technologies, including Web server applications and Web standards. Raman participates in numerous W3C working groups and authored Aural CSS (ACSS); in 1996 he wrote the first ACSS implementation. Raman has been actively participating in defining XML specifications for the next-generation multimodal WWW, including XForms, XML Events, and XHTML+Voice.
