Searching XML Files with XSLT

XSLT is generally used to transform XML files, but with a few more advanced techniques it can also search any XML document, or list of documents, for specific elements or attributes.

While developing a search tool, I learned how to replace strings and translate characters (uppercase to lowercase) in XSL. After a little research, it took only a few hours to put together what I needed. The examples in this article show how to implement a simple search mechanism to search a DocBook file and display the search results in HTML format.

When XML first appeared, much of the hype was about how well suited it is for searching documents (and it is - but I haven't seen many implementations). I'd been working on a Web site and wanted to add a search box. Using XML, searching the site should be easy enough. The site itself is in one DocBook XML document. Articles (posts) can be posted to the Web site as well. These are each stored in a separate DocBook Article XML file. Yet another XML file on the server contains a list of these article files, which more or less functions as an index for the directory.

More specifically, the requirements I came up with were:

  1. Allow the user to enter a keyword into a text box.
  2. Choose to search for any keyword, just author names, or just titles.
  3. Scan the Web site content or scan the articles.
  4. Display the results, showing text around the found keyword and a link to the page or article.
  5. Show how many times the keyword was found.
  6. Highlight the keywords in the above results.

Design Overview
For those who've seen the examples for Microsoft's Index Server, I was going after the same thing: a search result showing where the keyword was found, with a link to the document (or chapter and section in the case of the DocBook). The final result of searching the Web site is shown in Figure 1, which displays the search box with the options to search for content or titles (searching by author is for the "Article" search).

Since the Web site uses both the DocBook Book and the DocBook Article formats, I separated the functionality into two search templates. This article focuses mainly on the DocBook Book format and the showSearch template (see Listing 1). However, both formats have a similar structure below the root element, so the Author, Title, and Paragraph templates (see Listing 2) can be used with either search. On the home page, the main search box searches the site's content or just its page titles (the DocBook Book). When a user is viewing a list of articles in a category, a second search box searches all articles.

Designing the XSLT Templates
The first template accepts four parameters, everything we need to start searching: what part of the document to search, a keyword, a path, and a filename. Declared at the top of the first XSL stylesheet, these parameters are available to any template whether listed in the current file or in an included stylesheet (see Listing 3). Next, the showSearch template performs the search using apply-templates and the XPath query:
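
Listing 3 is not reproduced on this page, so the following is only a minimal sketch of what such top-level declarations might look like. The names searchString and searchType come from the queries in this article; the parameter names path and fileName, all default values, and the use of document() to fill $work are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Top-level parameters are set by the calling page and are visible
       in every template, including templates in included stylesheets. -->
  <xsl:param name="searchType" select="'para'"/>    <!-- element name: author, title, or para -->
  <xsl:param name="searchString" select="''"/>      <!-- keyword or exact phrase -->
  <xsl:param name="path" select="''"/>              <!-- hypothetical name: directory of the file -->
  <xsl:param name="fileName" select="'book.xml'"/>  <!-- hypothetical name and default -->

  <!-- Assumption: $work is loaded with document(); the article does not
       say how the variable is populated. -->
  <xsl:variable name="work" select="document(concat($path, $fileName))"/>

</xsl:stylesheet>
```

Because the parameters are declared as children of xsl:stylesheet rather than inside a template, the host page can override them before running the transformation.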

$work/book/descendant::*[contains(text(),$searchString) and name()=$searchType]

Here the variable $work contains our document. The rest of the XPath query selects every element under <book> whose text contains the search keyword. The name()=$searchType test restricts the match to the kind of element we specify (author, title, or paragraph elements). Searching the entire document is the job of the descendant::* step, which visits every element below <book> at any depth. Note that in XPath 1.0, contains(text(), ...) tests only an element's first text node, so a keyword that appears after an inline child element is not matched.

What is an "axis"? An axis names the direction in which an XPath step moves from the current node, and complex XSLT processing usually requires several of them. This example is fairly simple: matching nodes under the <book> node. Other axes select following nodes, preceding nodes, the current node (self), and the parent node. So, if you want to return only books or articles with authors (assuming some don't have an author), you can use a template to match authors and use the parent axis to retrieve the book's title. Some axes have shorthand counterparts, such as ".." for the parent node and "." for the current node. In my experience, however, writing the axis out by name has worked better than the shorthand.
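
As a rough illustration of the parent axis, the sketch below lists every work that has an author. It assumes the title is a sibling of the author element (as when both sit inside a bookinfo or articleinfo element); a document that keeps its title elsewhere would need a different path.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Start at the root and visit every author element at any depth. -->
  <xsl:template match="/">
    <ul>
      <xsl:apply-templates select="descendant::author"/>
    </ul>
  </xsl:template>

  <xsl:template match="author">
    <li>
      <!-- parent::* steps up to the element that encloses the author;
           its title child is assumed to be the title of the work. -->
      <xsl:value-of select="parent::*/title"/>
      <xsl:text> by </xsl:text>
      <!-- self::node() is the long form of "." -->
      <xsl:value-of select="normalize-space(self::node())"/>
    </li>
  </xsl:template>

</xsl:stylesheet>
```

Works without an author never reach the second template, so they are left out of the list.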

The XPath is the only difference between searching articles and searching books. For article searches the XPath simply needs to be changed to $work/article/descendant::*.

With the above query we can return results, but XML and XSLT are case sensitive, so searching for "cad" won't return elements containing "CAD". After a little research I came across the XPath translate() function, which replaces each character found in one set with the character at the same position in another. In this case all uppercase letters are translated to lowercase before the comparison is made. To use translate(), two variables containing characters are created, one lowercase and one uppercase, and placed at the top of the stylesheet.
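
A minimal sketch of the two variables follows. The names $lcletters and $ucletters are the ones used in the query later in this article; their contents are assumed here to be the 26 unaccented Latin letters, so accented characters would not be folded.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Both strings must list the letters in the same order, because
       translate() maps characters by position. -->
  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>

  <!-- Demonstration only: lowercases a literal string. -->
  <xsl:template match="/">
    <xsl:value-of select="translate('CAD Drawings', $ucletters, $lcletters)"/>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, the demonstration template writes "cad drawings".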

The XPath query and the translate function use these variables, swapping any character from one set (the uppercase) to the other (lowercase) for both the search term and the element text (see the declarations in Listing 4). The query is run twice, once to get the count of how many matches were found and once as the select expression of the apply-templates element. Doing this type of operation takes more processing power, but I've been surprised by how quickly the query is performed.
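
The sketch below shows one way to obtain the count. It is a variation rather than the approach described above: the matching nodes are stored in a variable (named $hits here, a hypothetical name) so the predicate is written only once. It also assumes that the input document is the DocBook book itself, and it counts matching elements rather than individual occurrences of the keyword.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:param name="searchString" select="'cad'"/>
  <xsl:param name="searchType" select="'para'"/>

  <!-- Assumption: the input document is the book being searched. -->
  <xsl:variable name="work" select="/"/>
  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>

  <xsl:template match="/">
    <xsl:call-template name="showSearch"/>
  </xsl:template>

  <xsl:template name="showSearch">
    <!-- Elements of the requested type whose lowercased text
         contains the lowercased keyword. -->
    <xsl:variable name="hits" select="$work/book/descendant::*[
        name() = $searchType and
        contains(translate(normalize-space(text()), $ucletters, $lcletters),
                 translate($searchString, $ucletters, $lcletters))]"/>
    <p>
      <xsl:value-of select="count($hits)"/>
      <xsl:text> matching element(s) for "</xsl:text>
      <xsl:value-of select="$searchString"/>
      <xsl:text>"</xsl:text>
    </p>
    <xsl:apply-templates select="$hits" mode="search"/>
  </xsl:template>

  <!-- Placeholder display template so the sketch runs on its own. -->
  <xsl:template match="*" mode="search">
    <p><xsl:value-of select="normalize-space(.)"/></p>
  </xsl:template>

</xsl:stylesheet>
```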

The final XPath query in the apply-templates element looks like:

<xsl:apply-templates
    select="$work/book/descendant::*[
        contains(translate(normalize-space(text()), $ucletters, $lcletters),
                 translate($searchString, $ucletters, $lcletters))
        and name()=$searchType]"
    mode="search"/>

Displaying the Search Results
Getting the apply-templates to match the correct elements was only half the battle. Along the way, I started to build templates to display what was being returned. The XSL templates in Listing 2 match and display the text content of each Author, Title, and Paragraph where a keyword was found. The structure of DocBook is probably well suited to searching since all the content is laid out in a fairly strict format. I've kept things simple for the examples in this article; as more elements are introduced, these templates will become more complex.
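
Listing 2 is not reproduced on this page. As a hedged sketch, display templates in the "search" mode might look like the following. The HTML layout, the 200-character excerpt length, and the link that relies on chapters carrying an id attribute are all assumptions made for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- A matching title is shown as a small heading. -->
  <xsl:template match="title" mode="search">
    <h4><xsl:value-of select="normalize-space(.)"/></h4>
  </xsl:template>

  <!-- A matching author is shown as a labeled line. -->
  <xsl:template match="author" mode="search">
    <p>
      <b>Author: </b>
      <xsl:value-of select="normalize-space(.)"/>
    </p>
  </xsl:template>

  <!-- A matching paragraph is shown as a link to its chapter followed
       by a short excerpt. Assumption: chapters have an id attribute
       that the site can use as a link target. -->
  <xsl:template match="para" mode="search">
    <p>
      <a href="#{ancestor::chapter[1]/@id}">
        <xsl:value-of select="ancestor::chapter[1]/title"/>
      </a>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="substring(normalize-space(.), 1, 200)"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Since every template is tied to the "search" mode, a stylesheet like this can be included next to the templates that render the pages normally without the two sets interfering.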

Highlighting Keywords
The next step (and the second new XSLT technique I had to implement) was to highlight keywords displayed within the search results. The replace-string template (see Listing 5) is recursive: it calls itself on the remaining text until every occurrence of the keyword has been replaced.

I've learned that recursion is a common and powerful technique in XSLT. The whole idea of transforming a document tree relies heavily on recursive structures and implementations.

Placed at the beginning of the stylesheet document (under all the other variables), the variable $myReplacedText holds the complete HTML text that is substituted for each keyword. To keep things simple, the application can accept only one keyword or exact phrase per search. Since $myReplacedText is declared at the top level of the stylesheet, it can be used in any template throughout the XSL stylesheet. After this variable is added, the completed XSL declarations look like the snippet in Listing 4.

In Listing 5 the recursive string-replace template replaces keywords found in the document with the keyword itself, surrounded by an HTML <font> tag.
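
Listing 5 is not reproduced on this page. The sketch below is one possible recursive replace template, not the original listing. It writes the <font> element inline instead of using $myReplacedText, the parameter names text and keyword and the red color are assumptions, and it matches case-insensitively by locating the keyword in a lowercased copy of the text and then cutting the original text at the same positions.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>

  <xsl:template name="replace-string">
    <xsl:param name="text"/>
    <xsl:param name="keyword"/>
    <!-- Lowercased copies are used only to locate the keyword. -->
    <xsl:variable name="lcText" select="translate($text, $ucletters, $lcletters)"/>
    <xsl:variable name="lcKeyword" select="translate($keyword, $ucletters, $lcletters)"/>
    <xsl:choose>
      <xsl:when test="$lcKeyword != '' and contains($lcText, $lcKeyword)">
        <!-- Number of characters in front of the first occurrence. -->
        <xsl:variable name="before"
            select="string-length(substring-before($lcText, $lcKeyword))"/>
        <xsl:variable name="len" select="string-length($keyword)"/>
        <!-- Text in front of the keyword, copied unchanged. -->
        <xsl:value-of select="substring($text, 1, $before)"/>
        <!-- The keyword as it appears in the original text, highlighted. -->
        <font color="red">
          <xsl:value-of select="substring($text, $before + 1, $len)"/>
        </font>
        <!-- Recurse on whatever follows the keyword. -->
        <xsl:call-template name="replace-string">
          <xsl:with-param name="text" select="substring($text, $before + $len + 1)"/>
          <xsl:with-param name="keyword" select="$keyword"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- No further occurrence: emit the rest and stop recursing. -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Demonstration only: highlights "cad" in a literal string. -->
  <xsl:template match="/">
    <p>
      <xsl:call-template name="replace-string">
        <xsl:with-param name="text" select="'CAD tools and cad files'"/>
        <xsl:with-param name="keyword" select="'cad'"/>
      </xsl:call-template>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

In the demonstration both "CAD" and "cad" are wrapped in the <font> element, each keeping its original capitalization. The test for an empty keyword matters because contains() is true for an empty string, which would otherwise make the recursion endless.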

Now that the XSL templates are done, the fun can begin. In an ASP or JSP file, load the XML document, create an XSLT processor, and set the parameters for $searchString and $searchType. In my final application I used another XML file and XSL template that listed the documents in a directory containing Articles. The first template looped through the list of articles and, for each one, called a showArticleSearch template that performed the same search, passing it the parameters and the article's document.
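
The article loop described above might be sketched as follows. The format of the list file (an articles root element with one article element per file, each carrying a file attribute) is hypothetical, as are the parameter names passed to showArticleSearch and the variable name $hits; the real index file and template are not shown in this article.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:param name="searchString" select="''"/>
  <xsl:param name="searchType" select="'para'"/>
  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>

  <!-- Hypothetical list format:
       <articles><article file="post1.xml"/> ... </articles> -->
  <xsl:template match="/articles">
    <xsl:for-each select="article">
      <xsl:call-template name="showArticleSearch">
        <!-- document() loads each file relative to the list file. -->
        <xsl:with-param name="work" select="document(@file)"/>
        <xsl:with-param name="fileName" select="@file"/>
      </xsl:call-template>
    </xsl:for-each>
  </xsl:template>

  <xsl:template name="showArticleSearch">
    <xsl:param name="work"/>
    <xsl:param name="fileName"/>
    <!-- Same query as the book search, rooted at article. -->
    <xsl:variable name="hits" select="$work/article/descendant::*[
        name() = $searchType and
        contains(translate(normalize-space(text()), $ucletters, $lcletters),
                 translate($searchString, $ucletters, $lcletters))]"/>
    <!-- Only articles with at least one match are listed. -->
    <xsl:if test="$hits">
      <h3>
        <a href="{$fileName}">
          <xsl:value-of select="$work/article/descendant::title[1]"/>
        </a>
        <xsl:text> (</xsl:text>
        <xsl:value-of select="count($hits)"/>
        <xsl:text>)</xsl:text>
      </h3>
      <xsl:apply-templates select="$hits" mode="search"/>
    </xsl:if>
  </xsl:template>

  <!-- Placeholder display template so the sketch runs on its own. -->
  <xsl:template match="*" mode="search">
    <p><xsl:value-of select="normalize-space(.)"/></p>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet is run against the list file rather than against an individual article, so a single transformation covers the whole directory.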

Summary
With the translate function, descendant axis, and replace template, building a small search utility gave me some exposure to how more advanced XSLT techniques can be used. I've implemented this type of searching on a couple of Web sites and the performance is pretty good. It isn't a complete solution, but it adds a nice feature. The listings and examples here are taken from a larger solution. If anyone would like a complete set of working files, please send me an e-mail.

About Roy Hoobler
Roy Hoobler has been developing custom Web applications since 1996. After completing his MCSD certification, he spent the mid-'90s at a large consulting firm focused on intranet/extranet applications for Fortune 1000 companies. In 1998 Roy joined Net@Work (www.netatwork.com) as director of Internet technologies, specializing in systems architecture, project management, and research into emerging programming methods.
