Streamline Your XML Searches
An index-free approach to managing data

Imagine a customer has hired you to put together a solution for managing a huge quantity of XML information. The firm's team is using XML because it gives them flexibility in how the data is structured. They like the fact that they do not need to specify a given record structure up front, and they can change the XML structure of records whenever they need to. Still, the question remains, "How do you manage and search for records?"

One choice, of course, would be to put the data into a relational database. This approach is not always convenient because you need to decide ahead of time what the schema of the tables will be. Such a database will need to be managed, and any changes in requirements will demand changes in the schema as well as migration of the data. A second option is to put the data into the XML column of a relational database. This will work for moderate to large quantities of data, but what if you have truly huge amounts of data?

A New Approach
There is now an alternative to a traditional database for storing and searching huge quantities of XML-encoded data. It is based on a grid paradigm that eliminates the need for indices and makes it possible to guarantee the search response time.

The concept of searching XML data without using an index may well sound counter-intuitive. After all, we learned in Computer Science 101 that an index is a way of making searches go faster. The accepted practice is that if your queries are taking too long, you reexamine your schema and figure out how to construct the right index. In practice, judicious index construction can often make certain queries run much faster. However, several challenges arise when the volume of data extends into the terabyte range.

Consider first that, no matter how carefully a database is coded, the time needed to maintain the index grows faster than the data itself. Whenever a record is added to the database, the proper place in the index must be found, and an entry must be added, so the cost of each update rises as the index gets larger (on the order of log n per entry for a typical tree-structured index, and more when entries have to be shifted or pages rebalanced). Whenever a record is removed, the index must be similarly modified to remove the entry. Of course, the ability to search on any part of an XML record then requires that every part must be indexed. The amount of extra processing involved to keep the indices up to date can be very significant if the index is large.

The second effect of an index is one of partial results. If the query involves several different fields, and if one of the values being searched for is commonly found in the data, the result set for that part of the query could be huge. The "and" operation in a search query can effectively cause the database to generate two huge partial result sets and then find the intersection of those sets, all in order to end up with what might be a relatively small final result set. The partial result sets must be managed in and out of memory, which can demand a great deal of processing, and the index itself can take up a lot of disk space, requiring even more I/O operations.
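The partial-result effect can be seen in a toy inverted index. The following is a minimal Python sketch, not any particular database's implementation; the field names, values, and record count are made up for illustration.

```python
# Toy inverted index: maps (field, value) -> set of record ids.
# All names and data here are hypothetical, for illustration only.
from collections import defaultdict


def build_index(records):
    """Index every field of every record."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for field, value in rec.items():
            index[(field, value)].add(rec_id)
    return index


def and_query(index, conditions):
    """Return (matching ids, sizes of the partial result sets)."""
    partials = [index.get(cond, set()) for cond in conditions]
    sizes = [len(p) for p in partials]
    if not partials:
        return set(), sizes
    # Every partial set is fully materialized before it is intersected.
    return set.intersection(*partials), sizes


records = {
    i: {
        "state": "NY" if i % 2 == 0 else "TX",
        "type": "house" if i % 3 == 0 else "condo",
        "bedrooms": str(i % 7),
    }
    for i in range(10_000)
}
index = build_index(records)
matches, partial_sizes = and_query(
    index, [("state", "NY"), ("type", "house"), ("bedrooms", "5")]
)
print(partial_sizes, len(matches))
```

With this made-up data, the partial sets for the common values hold thousands of record ids, while the final intersection holds only a few hundred.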

It doesn't stop there. Once the result set is found from the indices, the database must go back to the disk and read those records into memory, causing yet more random I/O. This random-access pattern of retrieving data slows response times further. A disk can be read sequentially about 10 times faster than it can be read at random sectors, because sequential reading does not require the head to seek and reposition across the disk.

The last effect of the index is to couple the data together. An index by its nature must be an ordered set of references to all the records, so the data cannot easily be split into parts that are managed independently. An index over part of the records is just not as useful for searching as a complete index.

The solution is remarkably simple: eliminate the challenges of using indices by starting with the idea that data can be searched without an index. One benefit of this is that nothing has to be set up in advance. The XML is written to files on a disk, so it can be searched immediately.

At the same time, offer the ability to add and search XML data of any record structure, thereby eliminating the need to convert or prepare the XML records. If over time the application is modified to handle additional data, the new records can be added to the database without changing any of the existing records.
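As a rough illustration of searching records of any structure without a schema, the sketch below scans each record for an element with a given name and value, wherever it appears. It uses Python's standard xml.etree.ElementTree module; the function name and the sample records are hypothetical, and this is not the product's actual search mechanism.

```python
import xml.etree.ElementTree as ET


def record_matches(xml_text, tag, value):
    """True if any element named `tag`, at any depth, has text equal to `value`."""
    root = ET.fromstring(xml_text)
    return any((el.text or "").strip() == value for el in root.iter(tag))


# Three records with three different structures (made-up data).
records = [
    "<listing><city>Albany</city><state>NY</state></listing>",
    "<ad><item>bicycle</item><location><state>NY</state></location></ad>",
    "<listing><city>Austin</city><state>TX</state><pool>yes</pool></listing>",
]

hits = [r for r in records if record_matches(r, "state", "NY")]
print(len(hits))
```

The second record nests the state element one level deeper than the first, yet both are found by the same query, and neither needed to be converted beforehand.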

Performance
Simplicity and flexibility are nice, but how do you actually get the required performance? Two techniques are being employed today.

The first is to spread the data across any number of low-cost Linux blade servers. A controlling service and a set of search services are deployed across these servers. Since there is no index, there is no need for any of these servers to know anything about the other servers; each blade server is given its part of the data, and it searches that part alone. In this scenario, the controlling service collects the search queries and submits them in parallel to each of the search servers. It then collects all of the results from the search servers, merges them together, and returns them to the requester.
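The scatter-and-merge pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads in one process stand in for the separate blade servers, and the names search_shard and controlling_service are hypothetical, not part of any product API.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def search_shard(shard, tag, value):
    """Stand-in for one search server: scan only its own records."""
    hits = []
    for xml_text in shard:
        root = ET.fromstring(xml_text)
        if any((el.text or "").strip() == value for el in root.iter(tag)):
            hits.append(xml_text)
    return hits


def controlling_service(shards, tag, value):
    """Submit the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, tag, value), shards)
        return [hit for part in partials for hit in part]


# Made-up data: three shards that know nothing about each other.
shards = [
    ["<home><state>NY</state><price>300000</price></home>",
     "<home><state>TX</state><price>200000</price></home>"],
    ["<home><state>NY</state><price>450000</price></home>"],
    ["<home><state>TX</state><price>150000</price></home>"],
]

results = controlling_service(shards, "state", "NY")
print(len(results))
```

Because no shard depends on any other, adding a shard only means adding one more entry to the list handed to the controlling service. In a real deployment the call to search_shard would be a network request to a search server.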

One of the great advantages of this approach is that new search servers can be added at any time. If the response time gets longer, deploy more servers and the entire search operation speeds up again. There are no dependencies between the search servers, so there is almost no limit to how much this can be scaled. More important, this property can be used to guarantee response time. As the quantity of data grows, the amount of processing power can grow, guaranteeing that the response time remains constant.
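The scaling argument amounts to simple arithmetic: if every server scans its share of the data at a fixed rate, the response time is roughly the data per server divided by that rate. The sketch below assumes perfect parallelism and ignores the time to merge results; the data sizes and the scan rate are hypothetical figures, not measurements.

```python
import math


def servers_needed(total_bytes, scan_bytes_per_sec, target_seconds):
    """Servers required so each one can scan its share within the target time."""
    return math.ceil(total_bytes / (scan_bytes_per_sec * target_seconds))


# Hypothetical figures: 2 TB of XML, 100 MB/s sequential scan per server,
# and a 10-second response-time target.
today = servers_needed(2 * 10**12, 100 * 10**6, 10)

# If the data doubles, doubling the servers holds the response time constant.
later = servers_needed(4 * 10**12, 100 * 10**6, 10)
print(today, later)
```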

Breaking the data out across many search servers would still leave the response time unacceptable if it were not for the second technique. The controlling service collects hundreds or thousands of queries and submits them as a batch, so that many queries are evaluated in the same pass through the data without slowing down the search.

Imagine you have 20 queries looking for home listings in New York and another 20 queries looking for homes in Texas. As soon as a record is tested and found to be about a house in New York, the search mechanism continues processing on the first 20 queries, but the queries for houses in Texas can be ignored for this record, and cause no additional overhead. The search through a large amount of data might take 5 to 10 seconds. However, the ability to return 1,000 result sets from a single pass through the data means you can maintain very high speed, even without an index.

Here's how it works. When a record is found that matches a query condition, the entire record is already in memory, so it can immediately be added to the result set without additional I/O. This adds up to significant I/O savings. Since there is no index to bring in and out of memory, there are no huge partial result sets to manage, and there is no need to go back and retrieve the record contents separately. Instead, the disk is read sequentially, which is the fastest way to read a disk.
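The single-pass idea can be sketched as follows. This is a minimal Python illustration, not the patented mechanism: it assumes queries are simple field-equals-value conditions and groups them by a hypothetical state field, so that a record about New York never touches the queries about Texas. The function names and sample data are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def field_values(xml_text):
    """Parse one record into {tag: text} for its leaf elements."""
    root = ET.fromstring(xml_text)
    return {el.tag: (el.text or "").strip() for el in root.iter() if len(el) == 0}


def scan_once(records, queries):
    """Answer every query in a single sequential pass over the records.

    `queries` maps a query id to a dict of required field values.
    """
    by_state = defaultdict(list)
    for qid, conds in queries.items():
        by_state[conds.get("state")].append((qid, conds))

    results = {qid: [] for qid in queries}
    for xml_text in records:  # one pass; each record is parsed once
        fields = field_values(xml_text)
        # Queries with no state condition apply to every record.
        candidates = list(by_state.get(None, []))
        state = fields.get("state")
        if state is not None:
            candidates += by_state.get(state, [])
        for qid, conds in candidates:
            if all(fields.get(k) == v for k, v in conds.items()):
                # The record is already in memory, so no extra I/O is needed.
                results[qid].append(xml_text)
    return results


records = [
    "<home><state>NY</state><bedrooms>3</bedrooms></home>",
    "<home><state>TX</state><bedrooms>3</bedrooms></home>",
    "<home><state>NY</state><bedrooms>2</bedrooms></home>",
]
queries = {
    "q1": {"state": "NY"},
    "q2": {"state": "NY", "bedrooms": "3"},
    "q3": {"state": "TX"},
    "q4": {"bedrooms": "3"},
}
results = scan_once(records, queries)
print({qid: len(hits) for qid, hits in results.items()})
```

Each record is read and parsed once no matter how many queries are in the batch, which is why adding queries to a pass costs far less than running each query separately.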

Additionally, XML records can be added and removed at any time. There is none of the overhead associated with updating an index, which in traditional index-based solutions can demand a significant amount of processing in certain situations. Therefore, when a record is added, it becomes immediately available for searching by the next query. Similarly, records can be removed quickly and easily.

The combined techniques for streamlining and speeding XML searches can be used in many situations. However, the greatest benefit comes when a large number of users are submitting queries at the same time, which makes the approach ideally suited for a large Web site or a Web service. Furthermore, if the Web site is collecting records, the only preparation needed is to structure the records as XML before storing them in the database, which is something a Web application or a Web service can do easily.

There also are significant performance advantages for Web applications that are mostly searching and displaying data, with some need for adding and removing records. Examples of this include:

  • Large classified ad sites where people can add new advertisements while others search for ads that match certain criteria.
  • A Web forum where many people are accessing, adding messages, reading other messages, and searching for messages on a particular topic.
  • A large trading or bartering site.
All these uses have some common features. They have a large amount of data; they add and remove records but typically don't do a lot of record updates; and there are large numbers of people searching for records.

. . .

The real benefit of this index-free approach is that it's easy to manage. There is no need to set up a schema beforehand. The record structure can be changed at any time, or it can be mixed with any other combination of records. There is no need to ever migrate from one schema to another. Additionally, it's easy to add more processing power to improve or maintain performance. Moreover, this approach does not require extensive IT resources to manage the data.

The fact that patented technology around these approaches is commercially available today means the power to quickly and effectively search huge stores of XML data is now available to small, medium, and large enterprises alike.

About Keith Swenson
Keith Swenson, chief architect and director of development, began his tenure at Fujitsu in 1991, developing TeamWARE Flow. He returned to Fujitsu Software Corporation in 2002 to direct the development of the Interstage family of products. Keith is currently working on standards such as WS-CAF and ASAP. He holds both a master's degree in computer science and a bachelor's degree in physics from the University of California, San Diego.
