VoiceXML
Grammars For VoiceXML Applications
The previous article in this series, "Tools for Developing VoiceXML Applications" (Vol. 2, issue 3), reviewed tools that can aid the development and testing of VoiceXML applications. Now we dive into the mechanisms of writing the grammars and review the standards being developed around the ways of representing them.
The last article pointed out that some tools/components, such as the Nuance Grammar Builder and Tellme Studio's Grammar tools, focused specifically on developing the grammars to be used by VoiceXML applications. Such an application is as rich as the grammar it supports. Support for voice input-based interactions is one of the key differentiators between the traditional interactive voice response (IVR) applications (which are based primarily on touch-tone phone inputs) and the next-generation, highly interactive voice-based applications, with support for near-natural language-based conversations.
Designing and developing grammars for your VoiceXML application can be an intricate and challenging task, but the effort will be well spent and rewarded. You might want to take small steps - develop a simple grammar that recognizes phrases first, then progressively mature your application to understand near-natural language conversations.
What Is a Grammar?
According to the VoiceXML specification (www.voicexml.org/specs/VoiceXML100.pdf), a speech grammar specifies a set of utterances that a user may speak to perform an action or supply information, and provides a corresponding string value or set of attribute-value pairs to describe the information or action (Section 10.1 of the specification). VoiceXML lets developers create speech grammars for spoken input and DTMF grammars for input through touch-tone key presses.
The grammar for an input element is specified by a <grammar> and/or <dtmf> tag. Grammars can be specified for <field>, <form>, <link>, <choice>, <record>, and <transfer> elements. The current specification doesn't require support for any particular grammar format; any format that meets its requirements can be used. In practice, the Java Speech Grammar Format (JSGF) and the Nuance Grammar Specification Language (GSL) are the formats most commonly supported by existing VoiceXML implementation platforms. As discussed later, the W3C Voice Browser Activity (www.w3.org/Voice/) has also introduced a working draft of the Speech Recognition Grammar Specification for the W3C Speech Interface Framework.
VoiceXML supports both inline (within the VoiceXML document) and external grammars. An implementation may support both complete and incomplete versions of an inline grammar (as demonstrated in the code snippets). Examples of <grammar> tag:
- Complete inline grammar
<field name="cmd">
  <grammar type="application/x-gsl">
    <![CDATA[
      ... complete inline grammar ...
    ]]>
  </grammar>
  ...
</field>
- External grammar
<field name="cmd">
  <grammar type="application/x-jsgf" src="PortalCmd.gram" caching="safe"/>
  ...
</field>
- Incomplete inline link grammar
<link next="Transfer2Operator.vxml">
<grammar type="application/x-jsgf">
operator | help
</grammar>
<dtmf>0</dtmf>
</link>
- DTMF grammar
<dtmf type="application/x-jsgf">
1 {email} | 2 {calendar} | 3 {tasks} | 4 {contacts}
</dtmf>
Built-in Grammars
VoiceXML supports a number of basic built-in types: Boolean (yes/no), date, digits, currency, number, phone, time. All have corresponding DTMF input associated with them as well. Some built-in types, particularly digits and Boolean, support parameters as well. Examples using built-in types:
<field name="credit_card" type="digits?length=12">
<prompt>Please enter your 12 digit credit card number</prompt>
...
</field>
and
<dtmf src="builtin:dtmf/boolean?y=1"/>
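To make the parameter syntax concrete, here is a minimal Python sketch of how a type string such as digits?length=12 might be split into a base type and its parameters, and how a digit string could then be checked against them. The function names are hypothetical and this is not how any particular platform implements built-in types; it assumes parameters are separated by semicolons, as in boolean?y=7;n=9.

```python
def parse_builtin_type(spec):
    """Split a built-in type string such as 'digits?length=12' into
    its base type and a dict of parameters."""
    base, _, query = spec.partition("?")
    params = {}
    if query:
        # Assumption: parameters are separated by semicolons.
        for pair in query.split(";"):
            name, _, value = pair.partition("=")
            params[name] = value
    return base, params


def check_digits(value, params):
    """Check a digit string against the optional length, minlength,
    and maxlength parameters."""
    if not value.isdigit():
        return False
    if "length" in params and len(value) != int(params["length"]):
        return False
    if "minlength" in params and len(value) < int(params["minlength"]):
        return False
    if "maxlength" in params and len(value) > int(params["maxlength"]):
        return False
    return True
```

With the credit card field above, parse_builtin_type("digits?length=12") yields the base type digits and a single length parameter, and check_digits accepts only strings of exactly 12 digits.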
Requirements for Grammars
Before we review the mechanisms that allow us to create grammars for VoiceXML applications, let me explain how we define the requirements for our grammar. I'll walk you through an example of a voice portal application that allows an employee to log into a corporate portal and interact with the company's messaging system. The following is the simple initial document for our voice portal VoiceXML application:
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="main">
    <field name="cmd">
      <prompt>Welcome to your Voice Portal. What can I do for you?</prompt>
      <grammar src="PortalCmd.gram"/>
      <filled>
        <submit next="VoicePortalMain.jsp"/>
      </filled>
    </field>
  </form>
</vxml>
Here is the flow of the application:
- System: Welcome to your Voice Portal. What can I do for you?
- User: <speaks - "Could you check my e-mail?">
- System: <goes to the next step>
To understand what grammar we need to define, we need to list what the user can say:
- Could you check my e-mail?
- Go to my calendar
- Address book
- Please check my contacts
- Go to my tasks
Essentially, we're watching for five key terms: e-mail, calendar, address book, contacts, and tasks. In addition, we've established some decorative phrases that need to be recognized with these terms. Without worrying how the actual format may look, an abstract representation of our "grammar" consists of three segments (one followed by another):
- Optional <Polite Command>: Please, could you, and the like
- Optional word: My
- One of the key terms: E-mail, calendar, address book, contacts, tasks
In the next section we'll take what we've learned and put it into a grammar representation format.
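Before committing to a grammar format, it can help to check the abstract representation against the sample utterances. The following Python sketch does that with a regular expression; it is only an illustration of the three segments, not a speech recognizer. Folding the action words "go to" and "check" from the sample utterances into an extra optional segment is our assumption, and the function name is hypothetical.

```python
import re

KEY_TERMS = ["e-mail", "calendar", "address book", "contacts", "tasks"]

PATTERN = re.compile(
    r"^(?:(?:please|could you)\s+)?"   # optional polite command
    r"(?:(?:go to|check)\s+)?"         # optional action words (assumption)
    r"(?:my\s+)?"                      # optional word "my"
    r"(" + "|".join(re.escape(t) for t in KEY_TERMS) + r")$"  # key term
)


def key_term(utterance):
    """Return the key term found in the utterance, or None if the
    utterance doesn't fit the abstract grammar."""
    # Lowercase, drop punctuation other than hyphens, collapse spaces.
    text = re.sub(r"[^a-z\- ]", "", utterance.lower())
    text = " ".join(text.split())
    match = PATTERN.match(text)
    return match.group(1) if match else None
```

All five sample utterances reduce to their key term, while anything outside the grammar is rejected.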
Grammar Elements
A grammar is essentially a collection of rules and their corresponding expressions, which describe how the rule is evaluated. Rule expressions use a series of operators that build up a group of possible utterances the user can say to be recognized. For example, the rule "Cmd" can recognize the utterances "e-mail", "calendar", "tasks", "contacts", and "address book".
Cmd [
email
calendar
tasks
contacts
(address book)
]
This could be a possible grammar for an employee portal scenario that would provide the employee with the capability of interacting with a corporate messaging system. However, this would require the user to remember the "commands" so as to be able to use the system. Combining multiple rules and operators can enrich rule expressions. Let's look at an enriched version of the foregoing grammar as an example:
Sentence [
(?Polite ?Prefix ?my Cmd)
]
Polite [
please
(could you)
]
Prefix [
go
(go to)
check
]
Cmd [
email
...
]
This grammar allows the user to say phrases such as "please check my calendar" or "go to my e-mail", yet it still recognizes bare commands such as "address book". This improves interactivity and allows more natural, conversation-like exchanges between the voice application and the human user. Let's look at the expression for the rule "Sentence":
Sentence [
(?Polite ?Prefix ?my Cmd)
]
The expression consists of an optional Polite, followed by an optional Prefix, followed by an optional literal my, and, finally, a mandatory Cmd. The key to recognition here is still the "command," but it's now enriched with a list of phrases we expect the user to employ around it. Table 1 summarizes the operators typically used to create rule expressions. (Note: These operators may be represented differently by different grammar formats.) The rules are applicable both for literals and rule names (references).
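To see how quickly optional items multiply the set of recognizable utterances, the rules above can be modeled in a few lines of Python. This is a rough sketch rather than a GSL parser: the rules are transcribed by hand into a dictionary, and the names RULES, SENTENCE, choices, and expand are ours.

```python
from itertools import product

# Each rule maps to its alternatives (one phrase per alternative).
RULES = {
    "Polite": ["please", "could you"],
    "Prefix": ["go", "go to", "check"],
    "Cmd": ["email", "calendar", "tasks", "contacts", "address book"],
}

# The Sentence rule as (item, optional) pairs; "my" is a literal.
SENTENCE = [("Polite", True), ("Prefix", True), ("my", True), ("Cmd", False)]


def choices(item, optional):
    """Alternatives for one item, plus the empty string if optional."""
    alternatives = RULES.get(item, [item])  # unknown name means a literal
    return ([""] if optional else []) + alternatives


def expand(sentence):
    """Return every phrase the sequence of items can recognize."""
    slots = [choices(item, optional) for item, optional in sentence]
    return [" ".join(w for w in combo if w) for combo in product(*slots)]
```

Calling expand(SENTENCE) produces 3 x 4 x 2 x 5 = 120 phrases, from the bare "address book" to "could you go to my address book", all from four short rules.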
Java Speech Grammar Format
Sun Microsystems' platform- and vendor-independent format for use in speech recognition, popularly known as JSGF, is part of the Java Speech API initiative (http://java.sun.com/products/java-media/speech/) and is now available as a W3C Note at www.w3.org/TR/jsgf/. JSGF borrows the Java programming language's naming conventions for grammar and package names and, like Java, provides documentation comments. JSGF grammars can define both public and private rules. Unless declared public, a rule is implicitly private and can be referenced only within the local grammar. JSGF allows the reuse of existing and predefined grammars through a Java-like import declaration. It allows rule composition using sequences, alternatives, weights, groupings, and traditional expression operators such as "*" and "+". JSGF also allows grammar developers to attach tags to rule definitions; these tags are returned to the application.
Listing 1 shows the basic voice portal grammar defined according to the JSGF V1.0 specification.
Nuance GSL
Nuance GSL is the format used by the popular Nuance automatic speech recognition (ASR) system. GSL has been incorporated by VoiceXML development platforms such as BeVocal Café, Nuance Voyager, Tellme Studio, and others. More information on this format is available from the Nuance Developer Network (http://extranet.nuance.com/developer/). GSL allows rule composition using constructs such as optional items, sequences, alternatives, and probabilities, and expression operators such as "*" and "+". VoiceXML implementation platforms such as BeVocal Café (http://cafe.bevocal.com/docs/grammar/index.html) and Tellme Studio (http://studio.tellme.com/grammarref/) have also documented their specific usage of and support for GSL. The example in the "Grammar Elements" section above used Nuance GSL as the format. Figure 1 shows Nuance V-Builder (discussed later in the article in the tools section) editing the voice portal grammar.
Speech Recognition Grammar Specification
As part of the effort to develop standards to allow speech-based interaction with Web-based applications, W3C has established a voice browser activity. The activity has the charter to establish standards and specifications for speech grammars, voice dialogs, speech synthesis, natural language representation, multimodal systems, and reusable dialog components. More information on the browser activity is available from www.w3.org/Voice/.
One of the early works of the activity is a working draft on the Speech Recognition Grammar Specification for W3C Speech Interface Framework (although the W3C hasn't yet established an acronym for the grammar specification, we'll refer to it from now on as SRGS), which defines syntax for use in speech recognition systems. The syntax of the grammar is described in two formats: an XML-based syntax (with an associated DTD) and a traditional augmented BNF (ABNF) syntax. Similar to JSGF, SRGS also allows the creation of rule definitions, which can be used both locally and externally. Rule expansions can be created through a combination of sequences, alternatives, weights, counts (optional, zero or more, one or more), and tags. As part of the current working draft, an XSLT stylesheet that can convert the XML grammars to their corresponding ABNF form is also defined.
Nothing works better than an example. Let's look at Listing 2 to see how our voice portal grammar can be defined using the SRGS. Because the SRGS is based on XML, we can use XSLT and other XML-generation mechanisms to generate SRGS grammars dynamically.
From Phrases to Near-Natural Language Conversations
Thus far in our examples we've only seen grammars that recognize primarily one key input value from the user. It's possible, of course, to create chained forms/interactions through which the application can request one input after another. For instance, in an employee directory application (similar to the one developed in the February XML-J [Vol. 2, issue 2]), it's possible to include two interactions, such as "please say the name of the person" and then "please say mobile or direct," to connect to that person's required number. However, to enable near-natural language-like conversations, support for an interaction model such as "please call hitesh on his mobile phone" would be the key. In Listing 3 there are two input values: the name of the person and the type of phone you want to call.
Depending on the platform and grammar language supported, you can return a single concatenated string containing the name-value pairs, or the platform might support returning multiple tags. The following return statement can be used when a platform supports the creation of a set of tags:
Sentence [
  (?Polite ?call Person:person
   ?[at on] ?[his her]
   Phone_Type:type
   ?phone ?number)
  {return([<id $person> <type $type>])}
]
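The effect of this rule can be approximated in Python to show what the application receives. The sketch below is a text-only stand-in for the recognizer: the person list is a hypothetical placeholder for the Person rule, the phone types come from the "mobile or direct" prompt above, and the id=...;type=... string layout is just one possible convention for platforms that can return only a single string.

```python
import re

PERSONS = ["hitesh"]                # hypothetical directory entries
PHONE_TYPES = ["mobile", "direct"]  # from the "mobile or direct" prompt

PATTERN = re.compile(
    r"^(?:(?:please|could you)\s+)?"              # ?Polite
    r"(?:call\s+)?"                               # ?call
    r"(?P<id>" + "|".join(PERSONS) + r")\s+"      # Person:person
    r"(?:(?:at|on)\s+)?"                          # ?[at on]
    r"(?:(?:his|her)\s+)?"                        # ?[his her]
    r"(?P<type>" + "|".join(PHONE_TYPES) + r")"   # Phone_Type:type
    r"(?:\s+phone)?(?:\s+number)?$"               # ?phone ?number
)


def interpret(utterance):
    """Return the id and type slots as a dict, or None if no match."""
    match = PATTERN.match(" ".join(utterance.lower().split()))
    return match.groupdict() if match else None


def as_string(slots):
    """Pack both slots into one string (layout is an assumption)."""
    return ";".join(f"{name}={slots[name]}" for name in ("id", "type"))
```

For "please call hitesh on his mobile phone", interpret returns both slots at once, so the application no longer needs a second prompt to ask for the phone type.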
DTMF Entries
A key feature of the VoiceXML standard is the ability to recognize both voice interactions and DTMF entries made through a telephone keypad. Some platforms allow the same grammar to be extended to recognize DTMF by using the literal dtmf-n (where n is between 0 and 9) as an alternative. For example, we can expand our voice portal Cmd grammar by including dtmf-1, dtmf-2, and so on as shortcuts for e-mail, calendar, and so on, respectively:
Sentence [
(?Polite ?Prefix ?my Cmd)
]
...
Cmd [
[dtmf-1 email] {return(email)}
[dtmf-2 calendar] {return(calendar)}
[dtmf-3 tasks] {return(tasks)}
[dtmf-4 contacts] {return(contacts)}
[dtmf-5 (address book)] {return(contacts)}
]
Note: VoiceXML also supports a separate <dtmf> element for specifying DTMF-based grammars.
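The mixed Cmd rule amounts to a lookup table in which a key press and a spoken phrase lead to the same return value. A short Python sketch makes that explicit; the table mirrors the rule line by line, and the function names are hypothetical.

```python
# Each entry mirrors one line of the Cmd rule: the accepted inputs
# (a key press or a spoken phrase) and the value returned.
CMD_RULE = [
    (["dtmf-1", "email"], "email"),
    (["dtmf-2", "calendar"], "calendar"),
    (["dtmf-3", "tasks"], "tasks"),
    (["dtmf-4", "contacts"], "contacts"),
    (["dtmf-5", "address book"], "contacts"),
]

# Flatten the rule into a single input -> return value table.
LOOKUP = {inp: value for inputs, value in CMD_RULE for inp in inputs}


def interpret(token):
    """Return the command for a spoken phrase or a 'dtmf-n' literal."""
    return LOOKUP.get(token)


def key_press(digit):
    """Return the command for a keypad digit, or None if unassigned."""
    return interpret(f"dtmf-{digit}")
```

Note that "address book", "contacts", key 4, and key 5 all return contacts, so the application logic sees a single value regardless of how the user asked.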
Dynamic Grammars
Similar to the VoiceXML documents, grammars can be generated dynamically as well. For example, scripting frameworks such as JavaServer Pages, Active Server Pages, PHP, Perl, and/or XML/XSLT can be used to create a grammar dynamically from a database or other server-side objects. This grammar can then be referenced by the VoiceXML application using the <grammar src=""/> tag, a mechanism provided by VoiceXML to reference external grammars. For example, the Person rule in the employee directory can dynamically reference the names and IDs from a corporate employee directory.
Listing 4 shows how the above employee directory grammar (static) can be modified to incorporate a dynamically generated grammar based on the employee directory contained in an RDBMS.
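The same idea can be sketched independently of any particular scripting framework. The Python function below (not the article's Listing 4) builds the text of a GSL Person rule from rows of employee IDs and names. The rows shown in the test are hypothetical; in a real application they would come from a query against the employee directory, and the generated text would be served at the URL named in the grammar's src attribute.

```python
def person_rule(employees):
    """Build GSL text for a Person rule from (employee_id, name) rows.

    Each name becomes one alternative that returns the employee ID.
    """
    lines = ["Person ["]
    for employee_id, name in employees:
        words = name.lower().split()  # GSL words are lowercase
        if not words:
            continue  # skip rows with an empty name
        # Multiword names are wrapped in parentheses as a sequence.
        phrase = words[0] if len(words) == 1 else "(" + " ".join(words) + ")"
        lines.append(f"  {phrase} {{return({employee_id})}}")
    lines.append("]")
    return "\n".join(lines)
```

Because the rule is regenerated on each request, new employees become callable by name without any change to the VoiceXML document itself.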
Grammar Tools
In the previous article we reviewed the development tools available around VoiceXML. In this section we revisit Nuance V-Builder and Tellme Studio; both tools have components focused on developing and testing grammars for VoiceXML applications.
Nuance V-Builder
As demonstrated in the last issue, V-Builder is one of the first WYSIWYG class tools for developing VoiceXML-based applications. Grammar Builder - a key component of the tool (also available as a separate tool) - is focused specifically on the visual creation of grammars based on the Nuance Grammar Specification Language. Figure 2 represents the mixed-mode grammar we developed in the previous article for an assisted telephone directory.
Tellme Studio Grammar Tools
Tellme Studio, a hosted VoiceXML platform (also reviewed in the previous article), provides Web-based tools to validate a grammar (either stored in a temporary scratchpad or loaded from a URL) and to test it by parsing utterances, entered as text, into name-value pairs. The grammar tools section of the studio also includes two more tools:
- A grammar phrase generator, which can generate full-coverage or random phrases that the grammar recognizes (up to 100 phrases, as shown in Figure 3)
- A DTMF generator, which generates the DTMF input corresponding to each word in a list and reports any conflicts
The Tellme Studio platform supports Nuance GSL for writing grammars. A grammar reference is available (in both online/HTML and PDF formats) from Tellme Studio's site at http://studio.tellme.com/grammarref/.
Conclusion
Effective grammars are the key to success for any VoiceXML application. At present, multiple formats are available to represent grammars. Whichever format you choose, the simple examples in this article show that effective, rich grammars readily add strong interactivity to VoiceXML applications. Grammars are the key to differentiating interactive VoiceXML applications from the traditional touch-tone-based IVR applications. Continuing with our hands-on theme, my next article will revisit developing applications with VoiceXML, this time with the upcoming Microsoft .NET platform.
References
Specifications/Drafts
- VoiceXML 1.0: www.voicexml.org/specs/VoiceXML100.pdf
- Java Speech Grammar Format (W3C Note): www.w3.org/TR/jsgf/
- Java Speech Grammar Format Specification: http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html
- Speech Recognition Grammar Specification for the W3C Speech Interface Framework (W3C Working Draft): www.w3.org/TR/speech-grammar/
Tools
- Tellme Studio Grammar Tools: http://studio.tellme.com
- Nuance Grammar Builder: www.nuance.com/index.htma?SCREEN=grammar_builder1
Documentation
- Tellme Studio Grammar Reference: http://studio.tellme.com/grammarref/
- BeVocal Café Grammar Reference: http://cafe.bevocal.com/docs/grammar/index.html
About Hitesh Seth
Hitesh Seth is chief technology officer of ikigo, Inc., a provider of XML-based web-services monitoring and management software. A freelance writer and well-known speaker, he regularly writes for technology publications on VoiceXML, Web Services, J2EE and Microsoft .NET, Wireless Computing & Enterprise/B2B Integration. He is the conference chair for VoiceXML Planet Conference & Expo.