By Keith Swenson
June 2, 2004 12:00 AM EDT
Imagine a customer has hired you to put together a solution for managing a huge quantity of XML information. The firm's team is using XML because it gives them flexibility in how the data is structured. They like the fact that they do not need to specify a given record structure up front, and they can change the XML structure of records whenever they need to. Still, the question remains, "How do you manage and search for records?"
One choice, of course, would be to put the data into a relational database. This approach is not always convenient because you need to decide ahead of time what the schema of the tables will be. Such a database will need to be managed, and any changes in requirements will demand changes in the schema as well as migration of the data. A second option is to put the data into the XML column of a relational database. This will work for moderate to large quantities of data, but what if you have truly huge amounts of data?
A New Approach
Now there is an alternative to a traditional database for storing and searching huge quantities of XML-encoded data. It is based on a grid paradigm that eliminates the need for indices and makes it possible to guarantee the search response time.
The concept of searching XML data without using an index may well sound counter-intuitive. After all, we learned in Computer Science 101 that an index is a way of making searches go faster. The accepted practice is that if your query performance times are taking too long, reexamine your schema and figure out how to construct the right index. In practice, judicious index construction can often make certain queries run much faster. However, several challenges arise when the number of records extends into the terabyte range.
Consider first that, no matter how carefully a database is coded, the total time needed to maintain the index grows faster than linearly with the number of records: on the order of n log n for a balanced tree index, and higher for less favorable structures. Whenever a record is added to the database, the proper place in the index must be found, and an entry must be added. Whenever a record is removed, the index must be similarly modified to remove the entry. Of course, the ability to search on any part of an XML record then requires that every part must be indexed. The amount of extra processing involved to keep the indices up to date can be very significant if the index is large.
The second effect of an index is one of partial results. If the query involves several different fields, and if one of the values being searched for is commonly found in the data, the result set for that part of the query could be huge as well. The "and" operation in a search query can effectively cause the database to generate two huge partial result sets and then find the intersection of those two large sets, all in order to end up with what might be a relatively small final result set. The partial result sets must be managed in and out of memory, which can demand a great deal of processing, and the index itself can take up a lot of disk space, requiring even more I/O operations.
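To make the partial-result cost concrete, here is a minimal sketch of how an index-based "and" query behaves. The inverted-index layout, the field names, and the toy data are all assumptions made for illustration; they do not describe how any particular database stores its indices.

```python
from collections import defaultdict

def build_index(records):
    """Map each (field, value) pair to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, value in fields.items():
            index[(field, value)].add(rec_id)
    return index

# Hypothetical data: 10,000 listings; both search values are common.
records = {
    i: {"state": "NY" if i < 6000 else "TX",
        "type": "house" if i >= 5990 else "condo"}
    for i in range(10_000)
}
index = build_index(records)

ny = index[("state", "NY")]         # first partial result set: 6,000 ids
houses = index[("type", "house")]   # second partial result set: 4,010 ids
matches = ny & houses               # final result set: only 10 ids

print(len(ny), len(houses), len(matches))
```

Both partial sets have to be materialized before the intersection can shrink them to the ten records that actually satisfy the query.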
It doesn't stop there. Once the result set is found from the indices, the database must go back to the disk and read those records into memory, causing yet more random I/O. This random-access pattern of retrieving data slows response times further. A disk can be read sequentially about 10 times faster than it can be read by random sector, because sequential reads do not need to seek and reposition the head across the disk.
The last effect of the index is to couple the data together. An index by its nature must be an ordered set of references to all the records, so the records cannot be managed as independent pieces. An index over only part of the records is just not as useful for searching as a complete index.
The solution is remarkably simple: eliminate the challenges of using indices by starting with the idea that data can be searched without an index. One benefit of this is that nothing has to be set up in advance. The XML is written to files on a disk, so it can be searched immediately.
At the same time, offer the ability to add and search XML data of any record structure, thereby eliminating the need to convert or prepare the XML records. If over time the application is modified to handle additional data, the new records can be added to the database without changing any of the existing records.
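As a rough illustration of searching records of any structure with nothing set up in advance, the sketch below tests each record for a matching element wherever it happens to appear. The record contents, the tag names, and the `matches` helper are hypothetical, and Python's standard `xml.etree.ElementTree` module stands in for whatever parser a real product would use.

```python
import xml.etree.ElementTree as ET

# Three records with three different shapes; no schema is declared anywhere.
RECORDS = [
    "<listing><city>New York</city><price>450000</price></listing>",
    "<ad><item>bicycle</item><location><city>New York</city></location></ad>",
    "<listing><city>Austin</city><bedrooms>3</bedrooms></listing>",
]

def matches(xml_text, tag, value):
    """Return True if any element named `tag`, at any depth, has text `value`."""
    root = ET.fromstring(xml_text)
    return any(elem.text == value for elem in root.iter(tag))

hits = [r for r in RECORDS if matches(r, "city", "New York")]
print(len(hits))
```

The second record nests `city` one level deeper than the first, yet both are found by the same query, and a record with a new structure could be appended to the list without touching the others.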
Performance
Simplicity and flexibility are nice, but how do you actually get the required performance? Two techniques are being employed today.
The first is to spread the data across any number of low-cost Linux blade servers, with a controlling service and a set of search services deployed on them. Since there is no index, there is no need for any of these servers to know anything about the other servers; each blade server is given its part of the data, and it searches that part alone. The controlling service collects the search queries and submits them in parallel to each of the search servers. It then collects all of the results from the search servers, merges them together, and returns them to the requester.
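The fan-out and merge performed by the controlling service can be sketched as follows. This is only an approximation under stated assumptions: threads in one process stand in for separate blade servers, in-memory lists stand in for each server's share of the data, and the function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(partition, predicate):
    """Stand-in for one search server: scan only its own slice of the data."""
    return [record for record in partition if predicate(record)]

def controlled_search(partitions, predicate):
    """Stand-in for the controlling service: fan out in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(
            lambda part: search_partition(part, predicate), partitions
        )
        merged = []
        for part in partial_results:
            merged.extend(part)
    return merged

# Hypothetical data, already split across three "servers".
partitions = [
    [{"id": 1, "state": "NY"}, {"id": 2, "state": "TX"}],
    [{"id": 3, "state": "NY"}],
    [{"id": 4, "state": "CA"}, {"id": 5, "state": "NY"}],
]
result = controlled_search(partitions, lambda r: r["state"] == "NY")
print([r["id"] for r in result])
```

No partition consults any other, which is why adding a fourth list here would require no change to the existing three.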
One of the great advantages of this approach is that new search servers can be added at any time. If the response time gets longer, deploy more servers and the entire search operation speeds up again. There are no dependencies between the search servers, so there is almost no limit to how much this can be scaled. More important, this property can be used to guarantee response time. As the quantity of data grows, the amount of processing power can grow, guaranteeing that the response time remains constant.
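The scaling argument can be put into back-of-the-envelope arithmetic. The sketch below assumes an idealized model in which every server scans its share at the same fixed rate and the cost of merging results is ignored; the 50 MB/s scan rate and the 10-second target are made-up figures, not measurements.

```python
def servers_needed(data_mb, scan_rate_mb_per_s, target_seconds):
    """Smallest number of servers that can scan `data_mb` within the target time."""
    per_server_capacity = scan_rate_mb_per_s * target_seconds
    return -(-data_mb // per_server_capacity)  # integer ceiling division

one_tb = servers_needed(1_000_000, 50, 10)   # about 1 TB of records
two_tb = servers_needed(2_000_000, 50, 10)   # the data doubles
print(one_tb, two_tb)
```

Under these assumptions, doubling the data doubles the server count while the time for one pass through the data stays at the 10-second target.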
Breaking the data out across many search servers would still leave the response time unacceptable, however, if it were not for the second technique: the controlling service collects hundreds or thousands of queries and has them all evaluated in the same pass through the data, so many queries can be answered at once without slowing down the search.
Imagine you have 20 queries looking for home listings in New York and another 20 queries looking for homes in Texas. As soon as a record is tested and found to be about a house in New York, the search mechanism continues processing on the first 20 queries, but the queries for houses in Texas can be ignored for this record, and cause no additional overhead. The search through a large amount of data might take 5 to 10 seconds. However, the ability to return 1,000 result sets from a single pass through the data means you can maintain very high speed, even without an index.
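One way to picture the batching is to group the pending queries by a shared condition, so that each record is only tested against the queries that could possibly match it. The sketch below is an assumption about how such grouping might look, not a description of the patented mechanism; the query shape (a state plus a maximum price) and all names are hypothetical.

```python
from collections import defaultdict

def single_pass(records, queries):
    """Answer every query in one scan of the records.

    `queries` maps a query id to (state, max_price). Queries are grouped by
    state first, so a record is only tested against the queries for its state.
    """
    by_state = defaultdict(list)
    for qid, (state, max_price) in queries.items():
        by_state[state].append((qid, max_price))

    results = {qid: [] for qid in queries}
    for record in records:  # one sequential pass over the data
        for qid, max_price in by_state.get(record["state"], []):
            if record["price"] <= max_price:
                results[qid].append(record["id"])
    return results

records = [
    {"id": 1, "state": "NY", "price": 300},
    {"id": 2, "state": "TX", "price": 150},
    {"id": 3, "state": "NY", "price": 500},
    {"id": 4, "state": "CA", "price": 100},
]
queries = {"q1": ("NY", 400), "q2": ("NY", 600), "q3": ("TX", 200)}
results = single_pass(records, queries)
print(results)
```

A New York record never touches the Texas query, and a record for a state nobody asked about costs a single dictionary lookup.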
Here's how it works. When a record is found that matches a query condition, the entire record is already in memory, so it can immediately be added to the result set without additional I/O. This adds up to significant I/O savings. Since there is no index to bring in and out of memory, there are no huge partial result sets to manage, and there is no need to go back and retrieve the record contents separately. Instead, the disk is read sequentially, which is the fastest way to read a disk.
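The read pattern described here can be sketched in a few lines. The example assumes, purely for illustration, a file holding one XML record per line; the file name, the record contents, and the `scan_file` helper are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def scan_file(path, tag, value):
    """Read the file front to back once, keeping whole records that match."""
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # sequential read, no seeking
            line = line.strip()
            if not line:
                continue
            record = ET.fromstring(line)
            if any(e.text == value for e in record.iter(tag)):
                results.append(line)  # already in memory; no second read
    return results

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<home><state>NY</state><price>300</price></home>\n")
        f.write("<home><state>TX</state><price>150</price></home>\n")
        f.write("<home><state>NY</state><price>500</price></home>\n")
    found = scan_file(path, "state", "NY")

print(len(found))
```

Each matching record goes straight from the read buffer into the result list, so there is no separate lookup step to fetch record contents afterward.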
Additionally, XML records can be added and removed at any time. There is none of the overhead associated with updating an index, which in traditional index-based solutions can demand a significant amount of processing in certain situations. Therefore, when a record is added, it becomes immediately available for searching by the next query. Similarly, records can be removed quickly and easily.
The combined techniques for streamlining and speeding XML searches can be used in many situations. However, the greatest benefit comes when there are a large number of users requesting queries at the same time - making it ideally suited for a large Web site or a Web service. Furthermore, if the Web site is collecting records, it is quite convenient that the records only need to be structured into XML for storing in the database, since this is something a Web application or a Web service can do easily.
There also are significant performance advantages for Web applications that are mostly searching and displaying data, with some need for adding and removing records. Examples of this include:
- Large classified ad sites where people can add new advertisements while others search for ads that match certain criteria.
- A Web forum where many people are accessing, adding messages, reading other messages, and searching for messages on a particular topic.
- A large trading or bartering site.
The real benefit of this index-free approach is that it's easy to manage. There is no need to set up a schema beforehand. The record structure can be changed at any time, or it can be mixed with any other combination of records. There is no need to ever migrate from one schema to another. Additionally, it's easy to add more processing power to improve or maintain performance. Moreover, this approach does not require extensive IT resources to manage the data.
The fact that patented technology around these approaches is commercially available today means the power to quickly and effectively search huge stores of XML data is now available to small, medium, and large enterprises alike.
Published June 2, 2004
Copyright © 2004 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Keith Swenson
Keith Swenson is Vice President of Research and Development at Fujitsu America Incorporated and is the Chief Software Architect for the Interstage family of products. He is known for having been a pioneer in Web services, collaborative planning, workflow, and business process management, having participated in the formation of a number of standards in these fields, including receiving the 2004 "Mannheim Award for Outstanding Contribution in the Field of Workflow." He has led projects at Fujitsu, Netscape, MS2 and Ashton-Tate on group collaboration spaces. His most recent project is a cloud-based offering of Interstage BPM that will offer process modeling, forms development, rules development, Web service integration and execution of completed BPM applications in a 100% hosted cloud environment. See his blog on Collaborative Planning at http://kswenson.wordpress.com/.
Hans Kool 01/05/05 07:51:35 AM EST
I just read the article "Streamline Your XML Searches." However, you refer to existing patented products: "The fact that patented technology around these approaches is commercially available today means the power to quickly and effectively search huge stores of XML data is now available to small, medium, and large enterprises alike." It would be nice to place a few hyperlinks to such products, possibly also with open source links. That would make investigating this technology more practical (I searched and found all solutions to use indexing as opposite to what your article claims to be efficient). Thanks for the article. Hans Kool