Streamline Your XML Searches
An index-free approach to managing data
By: Keith Swenson
Jun. 2, 2004 12:00 AM
Imagine a customer has hired you to put together a solution for managing a huge quantity of XML information. The firm's team is using XML because it gives them flexibility in how the data is structured. They like the fact that they do not need to specify a given record structure up front, and they can change the XML structure of records whenever they need to. Still, the question remains: how do you manage and search for records?

One choice, of course, would be to put the data into a relational database. This approach is not always convenient because you need to decide ahead of time what the schema of the tables will be. Such a database will need to be managed, and any changes in requirements will demand changes in the schema as well as migration of the data. A second option is to put the data into the XML column of a relational database. This will work for moderate to large quantities of data, but what if you have truly huge amounts of data?

A New Approach

The concept of searching XML data without using an index may well sound counterintuitive. After all, we learned in Computer Science 101 that an index is a way of making searches go faster. The accepted practice is that if your queries are taking too long, you reexamine your schema and figure out how to construct the right index. In practice, judicious index construction can often make certain queries run much faster. However, several challenges arise when the volume of data extends into the terabyte range.

Consider first that, no matter how carefully a database is coded, there will be a superlinear component to the time needed to maintain the index. Whenever a record is added to the database, the proper place in the index must be found, and an entry must be added. Whenever a record is removed, the index must be similarly modified to remove the entry. Of course, the ability to search on any part of an XML record then requires that every part be indexed.
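To make the cost of indexing every part of a record concrete, here is a minimal Python sketch of a full index over XML records. It is an illustration under stated assumptions, not how any particular database works: the names `index_entries` and `FullIndex` and the `(path, value)` key format are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_entries(record_xml):
    """Yield one (path, value) key per attribute and per non-empty element text."""
    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        for name, value in elem.attrib.items():
            yield (f"{here}/@{name}", value)
        text = (elem.text or "").strip()
        if text:
            yield (here, text)
        for child in elem:
            yield from walk(child, here)

    yield from walk(ET.fromstring(record_xml), "")


class FullIndex:
    """Hypothetical index that makes every part of every record searchable."""

    def __init__(self):
        self.entries = defaultdict(set)  # (path, value) -> set of record ids
        self.records = {}                # record id -> original XML text

    def add(self, record_id, record_xml):
        # Adding one record touches one index entry per indexed part.
        self.records[record_id] = record_xml
        for key in index_entries(record_xml):
            self.entries[key].add(record_id)

    def remove(self, record_id):
        # Removing a record must revisit every entry that refers to it.
        record_xml = self.records.pop(record_id)
        for key in index_entries(record_xml):
            ids = self.entries[key]
            ids.discard(record_id)
            if not ids:
                del self.entries[key]
```

Even in this toy form, a single add or remove performs as many index updates as the record has attributes and text-bearing elements, which is the overhead the article is describing.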
The amount of extra processing involved to keep the indices up to date can be very significant if the index is large.

The second effect of an index is one of partial results. If the query involves several different fields, and if one of the values being searched for is commonly found in the data, the result set for that part of the query could be huge as well. The "and" operation in a search query can effectively cause the database to generate two huge partial result sets and then find the intersection of the two, all in order to end up with what might be a relatively small final result set. The partial result sets must be managed in and out of memory, which can demand a great deal of processing, and the index itself can take up a lot of disk space, requiring even more I/O operations.

It doesn't stop there. Once the result set is found from the indices, the database must go back to the disk and read those records into memory, causing yet more random I/O. This random-access pattern of retrieval slows response times. A disk can be read sequentially about 10 times faster than it can be read at random sectors, because a sequential read does not need to seek and reposition the head across the disk.

The last effect of the index is to make the data coupled. An index by its nature must be an ordered set of references to all the records. An index over part of the records is just not as useful for searching as a complete index.

The solution is remarkably simple: eliminate the challenges of using indices by starting with the idea that data can be searched without an index. One benefit of this is that nothing has to be set up in advance. The XML is written to files on a disk, so it can be searched immediately. At the same time, XML data of any record structure can be added and searched, which eliminates the need to convert or prepare the XML records.
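As a rough illustration of searching without an index, the following Python sketch reads records sequentially and tests each one as it goes by. The one-record-per-line layout, the in-memory `DATA` stream standing in for a file on disk, and the helper names `scan` and `field_equals` are assumptions made for the example; the article does not describe the actual storage format.

```python
import io
import xml.etree.ElementTree as ET


def scan(stream, predicate):
    """Read records in order (one XML document per line) and yield the matches."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = ET.fromstring(line)
        if predicate(record):
            # The whole record is already in memory, so no second read is needed.
            yield record


def field_equals(tag, value):
    """Build a predicate: does any element named `tag` contain exactly `value`?"""
    def predicate(record):
        return any((e.text or "").strip() == value for e in record.iter(tag))
    return predicate


# Records with different structures sit side by side; nothing is declared up front.
DATA = io.StringIO(
    "<listing><state>NY</state><price>450000</price></listing>\n"
    "<agent><name>Pat</name><state>TX</state></agent>\n"
    "<listing><state>NY</state><beds>3</beds></listing>\n"
)

matches = list(scan(DATA, field_equals("state", "NY")))
```

The scan never consults a schema, so a record type that did not exist yesterday is searchable as soon as it is appended to the file.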
If over time the application is modified to handle additional data, the new records can be added to the database without changing any of the existing records.

Performance

The first technique for achieving performance without an index is to spread the data across any number of low-cost Linux blade servers, with a controlling service and search services deployed across them. Since there is no index, there is no need for any of these servers to know anything about the other servers; each blade server is given its part of the data, and it searches that part alone. In this scenario, the controlling service collects the search queries and submits them in parallel to each of the search servers. It then collects all of the results from the search servers, merges them together, and returns them to the requester.

One of the great advantages of this approach is that new search servers can be added at any time. If the response time gets longer, deploy more servers and the entire search operation speeds up again. There are no dependencies between the search servers, so there is almost no limit to how far this can be scaled. More important, this property can be used to guarantee response time: as the quantity of data grows, the amount of processing power can grow with it, so the response time remains constant.

Even so, spreading the data across many search servers would still leave response times unacceptable were it not for a second technique. The controlling service collects hundreds or thousands of queries and has them evaluated in the same pass through the data, so many queries can be answered at once without slowing down the search. Imagine you have 20 queries looking for home listings in New York and another 20 queries looking for homes in Texas.
As soon as a record is tested and found to be about a house in New York, the search mechanism continues processing the first 20 queries, but the queries for houses in Texas can be ignored for this record and cause no additional overhead. The search through a large amount of data might take 5 to 10 seconds. However, the ability to return 1,000 result sets from a single pass through the data means you can maintain very high throughput, even without an index.

Here's how it works. When a record is found that matches a query condition, the entire record is already in memory, so it can immediately be added to the result set without additional I/O. This adds up to significant I/O savings. Since there is no index to bring in and out of memory, there are no huge partial result sets to manage, and there is no need to go back and retrieve the record contents separately. Instead, the disk is read sequentially, which is the fastest way to read a disk.

Additionally, XML records can be added and removed at any time. There is none of the overhead associated with updating an index, which in traditional index-based solutions can demand a significant amount of processing in certain situations. When a record is added, it is therefore immediately available to the next query. Similarly, records can be removed quickly and easily.

The combined techniques for streamlining and speeding XML searches can be used in many situations. However, the greatest benefit comes when a large number of users are submitting queries at the same time, which makes the approach ideally suited for a large Web site or a Web service. Furthermore, if the Web site is collecting records, it is quite convenient that the records only need to be structured into XML for storing in the database, since this is something a Web application or a Web service can do easily.
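The two techniques can be sketched together in a few lines of Python. This is a simplified model under several assumptions, not the commercial implementation: threads stand in for separate blade servers, each shard is an in-memory list of XML strings, a query is a hypothetical `(query_id, state, max_price)` tuple, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_state(queries):
    """Group the batched queries by the condition they share (here, the state)."""
    groups = defaultdict(list)
    for query in queries:
        groups[query[1]].append(query)
    return groups


def search_shard(shard, groups):
    """One sequential pass over a shard that answers every batched query."""
    results = defaultdict(list)
    for record_xml in shard:
        record = ET.fromstring(record_xml)
        state = record.findtext("state")
        price = int(record.findtext("price", default="0"))
        # Only queries for this record's state are evaluated; all others are skipped.
        for query_id, _, max_price in groups.get(state, ()):
            if price <= max_price:
                results[query_id].append(record_xml)
    return results


def controller(shards, queries):
    """Send the whole batch to every shard in parallel, then merge the answers."""
    groups = group_by_state(queries)
    merged = {query[0]: [] for query in queries}
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        for partial in pool.map(lambda shard: search_shard(shard, groups), shards):
            for query_id, hits in partial.items():
                merged[query_id].extend(hits)
    return merged


shards = [
    ["<listing><state>NY</state><price>400000</price></listing>",
     "<listing><state>TX</state><price>250000</price></listing>"],
    ["<listing><state>NY</state><price>900000</price></listing>"],
]
queries = [("q1", "NY", 500000), ("q2", "TX", 300000), ("q3", "NY", 1000000)]
results = controller(shards, queries)
```

Because no shard needs to know about any other, adding a server in this model means adding one more list to `shards`; the controller and the per-shard search stay unchanged.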
There are also significant performance advantages for Web applications that mostly search and display data, with some need to add and remove records.
The real benefit of this index-free approach is that it's easy to manage. There is no need to set up a schema beforehand. The record structure can be changed at any time, or it can be mixed with any other combination of records. There is no need to ever migrate from one schema to another. Additionally, it's easy to add more processing power to improve or maintain performance. Moreover, this approach does not require extensive IT resources to manage the data.

The fact that patented technology around these approaches is commercially available today means the power to quickly and effectively search huge stores of XML data is now available to small, medium, and large enterprises alike.