|
YOUR FEEDBACK
Did you read today's front page stories & breaking news?
SYS-CON.TV |
TODAY'S TOP SOA & WEBSERVICES LINKS Feature XML Computation Trees
XML Computation Trees
Jul. 31, 2002 12:00 AM
Every computer science undergraduate program in the world has two important foundation courses: data structures and algorithms. Open any book on these subjects and you'll see immediately that almost a third of it is devoted to graphs. Graphs are used to model a very large number of real-world problems: the traveling salesman problem, efficient routing of a package, network flows, and more - all are modeled as graphs and often solved by graph-based algorithms. A common use of a graph-based representation is that of a computation graph. Simply put, it's a graph that models a set of functions F1, F2, ..., Fk that take a set of inputs I1, I2, ..., In and computes a set of outputs O1, O2, ..., Om. As an example, Figure 1 shows a simple computation graph with a small number of nodes. This simple example represents the following equations:
O1 = F5(F1(I1, I2), F4(F1(I1, I2), F2(I2, I3))) Note that by managing the computation as a graph traversal you can eliminate computing F4(F1(I1, I2), F2(I2, I3)) twice. The nodes modeling both F5 and F6 access the value computed by the node representing F4. Computation graphs arise in many environments and tend to be the preferred way to model very complex computations and dependencies. In fact, most computationally intensive environments use graph-based computation structures. The most prominent example of such environments occurs in financial services, including computer-based portfolio management, computer trading, pricing systems, risk management systems, hedge systems, and insurance quoting systems. These structures also arise in contract management systems, constraint-based optimization, impact analysis, root cause determination, and simulation systems. The reason is that a graph is the perfect way to model the dependencies between computation elements as well as the perfect metaphor to program to. Most current systems managing computation graphs make use of proprietary environments to model and manage such structures. In this article I'll walk you through a different approach, one that makes use of XML to represent computation graphs. The approach is based on an XML representation that can be handed to almost context-free computation engines that cooperate to compute very large computation structures. While there would seem to be overhead and inefficiency in using XML instead of proprietary methods, this is offset by the fact that the full computation graph is partitioned so that many CPUs can work in tandem to compute the full result set. This approach has additional advantages in areas involving the maintainability of the system and the total cost of ownership. Since the full framework involves some very advanced, complex, and sometimes confidential graph-partitioning methods, this article focuses on the general framework and stays away from the mathematical aspects of the approach.
Graphs and Trees
XML also follows this pattern. It organizes data according to a containment hierarchy, and any XML document can be viewed as a containment tree. More important in the context of this article, any tree can be encoded as an XML document, and graphs can be encoded in XML by adding attributes that represent references. Formally, a graph G=G(V, E) includes a set of nodes (or vertices) {V1, V2, ..., Vn} and a set of edges {E1, E2, ..., Em} where for each i, Ei={Vk,Vl} where Vk and Vl are both nodes in the node set. A directed graph is one in which each edge has a source node and a target node; in this case an edge is sometimes written as Ei=<Vk,Vl>. Computation graphs are always represented by directed graphs in which the edges represent the computation dependencies (i.e., the value computed by the source node is an input to the computation of the target node). Graphs can have cycles. A cycle within a directed graph is a set of edges E1, E2, ..., Ek where for each i the target node of Ei is the same as the source node of Ei+1, and the target node of Ek is the same as the source node of E1. Cycles are clearly a bad thing for a program traversing the computation graph since the cycle implies an infinite loop. Therefore, most graphs managed within computer systems, and all computation graphs, tend to be directed acyclic graphs (DAGs), meaning that there's no directed cycle within the graph. A tree is a graph in which every node except the root has exactly one ancestor node. Trees are also heavily used in computation graphs; the reason is that the computation dependencies are more limited. Thus the propagation of values through the structure is simpler and faster. In many cases a computation DAG can be reduced to a computation tree by creating duplicate nodes, a common practice in efficient computation solutions.
Applications of Computation Graphs
The first involves constraint optimization. Imagine that you're building a system that calls for optimal assignment of resources (e.g., workers or crews) to tasks. Each task has various components, such as where it has to be performed and in what time frame, and the skills required. Given a set of tasks and resources, it's your job to create the best possible schedule for your workforce to perform all tasks within the defined constraints. There are many approaches for performing such optimizations, and the use of computation graphs may not be immediately obvious. Computation graphs come into play because optimization problems such as that described above need constant recalculation. This is often called continuous optimization. The problem is that as time goes by reality changes and constraints need to be reapplied. New tasks are created, tasks may be completed ahead of schedule, resources can become unavailable. All this creates a reality in which it's not good enough to compute a good schedule once a day. The schedule needs continuous updating in real time as new information comes in. To do this efficiently, a computation graph has to be maintained to enable continuous reoptimization. New inputs ripple through the graph, updating the schedule, thus avoiding recomputing from scratch. The key is real-time recomputation, the common thread in many of today's advanced uses of computation graphs. The second example is also heavily biased toward real-time recomputation. The domain is real-time pricing and real-time risk calculations for financial products and portfolios. Today's investment banks create and manage highly customized and very complex financial instruments. This is not about stocks, and not even about simple derivatives such as options and swaps. It sometimes involves hybrid derivatives, or synthetics. Investment banks take positions in one instrument or portfolio in order to make a commission or a margin, but don't want to carry the risk. In such a scenario they will hedge the portfolio with another instrument or portfolio that can be correlated with the original one. All of these instruments and portfolios can be represented as computation graphs managed for both pricing and risk management purposes. The inputs to these computation graphs include, for example, spot interest rates, exchange rates, and stock prices. There are two important characteristics of these graphs: (1) they need to be very flexible and dynamic to allow traders to create the best instruments for the bank, and (2) they're very sensitive to constantly changing conditions. As an example, risk is sensitive to many factors, one of which is time itself. Because time is continuously changing, risk calculations can potentially require continuous recomputations that can trigger adjustments to a portfolio. It's not unusual to see computation graphs with thousands and even tens of thousands of nodes that need to be recomputed in real time. Such computation graphs place a huge requirement on computation power. It's typical to see such computation engines run on very powerful (and expensive) Unix machines with a large number of CPUs. It's still never enough. These systems also tend to be huge memory hogs, requiring 2 and even 4GB of memory to hold the graph structure in memory and perform the computations. The nature of such environments is that they always require more CPUs and more memory than is readily available. Some sample causes follow.
One way to overcome the problem is to use a large set of computers on a network for real-time recomputation of such graphs. While this has been recognized by many, it hasn't been easy to do. The main reason is that most environments doing such calculations are highly proprietary and often include a home-grown object database and a proprietary communication and distribution layer. While such proprietary environments have proved themselves in operation, they've also proved to be very expensive to maintain, and the total cost of ownership (including the need to continuously maintain and improve the proprietary environment) tends to be so high that only investment banks have been able to afford it.
Computation Graphs in XML
To use XML easily (and not require HREF, XPath expression, or other types of references), the computation graph may be converted into a tree. This is done by duplicating nodes and maintaining multiple nodes that subscribe to an underlying value. While not the most elegant solution, it simplifies the algorithms for partitioning and managing the computations. Alternatively, the graph can be partitioned first; only then is each subgraph converted to a tree. The computation graphs are built using standard XML editors. This means that you can use off-the-shelf tools for building the graphs instead of writing proprietary editors, helping to reduce cost of maintenance. To make sure that the computation trees make sense in terms of the inputs and outputs, XML schemas must be employed to represent typing information. Assume for example that F is a function that takes as input a value of type Ta and a value of type Tb and computes a value of type Tc (i.e., Tc F(Ta, Tb)). Assume that G is a function that takes as input a value of type Tc and calculates a value of type Td (i.e., Td G(Tc)). Also assume that ta is an input value of type Ta, tb is an input value of type Tb, and tc is an input value of type Tc. The schema needs to be built so as to easily represent G(F(ta, tb)) as well as G(tc) in a way that can be type-checked by the XML editor. The "trick" in defining the schema is in the introduction of elements that can be either data or return values from the recursive activation of a function. Such elements are defined as compositors using choice elements that include the basic type as well as any function node that returns that value. As an example, the TcVal that becomes the type of the parameter of the function G can be defined using the following compositor:
<xsd:group name="TcVal">
Partitioning the Computation Tree
It's impossible here to survey all the methods used for partitioning. Partitioning can be proved to be NP-hard (meaning that there is probably no deterministic fast algorithm for doing general partitioning), but many approaches have been used in the past, some producing very good results. These range from simple greedy algorithms to complex spectral methods using eigenvectors of the Laplacian of the graph. Also, partitioning is often based on domain-specific nodes. The simplest example involves Monte Carlo simulation nodes since it's clear that multiple runs can be done in parallel. The partition is normally kept static, at least for a while. Although the dynamic nature of the computation tree is such that continuous partitioning may seem to be required, in practice this is overkill and takes much too long. Therefore, new partitions are created daily at most, and the subtrees are maintained throughout a relatively long period of time. Representing the computation graph in XML complicates the partitioning phase, and is the main area in which more research is required. If partitioning is done while looking at the entire graph, then the graph needs to be built in memory, which means that the partitioning process will be a memory superhog. In fact, building the tree using DOM is impossible for large graphs and shouldn't be attempted. Instead, current approaches for very large computation graphs use partitioning algorithms that can be implemented using a SAX parser with one or more passes over the graph. Multilevel bisectional algorithms (the general concept of these algorithms is illustrated in Figure 2) lend themselves to such an implementation. It's possible that JAXB, the new Java Architecture for XML Binding, may prove to be a good alternative to SAX processing for the partitioning phase. JAXB (formerly called Adelard) provides a framework for supporting a two-way mapping between XML documents based on schemas and Java objects. Schemas are compiled into Java classes that create a specialized parser based on the schema. The result is a parser that often operates as fast as a SAX parser but builds objects in memory while maintaining a small memory footprint. The JAXB documentation claims that early JAXB prototypes have shown that JAXB can be faster than SAX because the generated Java classes are precompiled and contain the schema logic, thereby avoiding the dynamic interpretation that a SAX parser performs. Using a JAXB-based approach, it may be possible to create better partitions, but whether such an approach is feasible from a memory footprint perspective has yet to be demonstrated.
Computing Subtrees
The simplest approach to the actual computation is one that parses the computation tree using SAX and maintains a stack of values that reflect a DFS traversal of the tree. If the graph isn't converted to a tree, the XML includes references and two passes may be required. In this case the partitioning method needs to reduce (and hopefully eliminate) dependencies on other subgraphs. A family of partitioning methods called bounded contractions uses this as the primary target.
Summary
There's one additional advantage that I believe is the most important one:
It's easy to debug, tune, and improve the process because you can see what's going on within this complex network of computation engines. Dependencies are implicitly described in a form that is readable by both humans and programs. These descriptions can be easily used to generate reports and perform queries using tools such as XSL and XQL - again saving a lot of complex code that would otherwise have to be written. Current approaches based on in-memory proprietary structures are very hard to debug and trace. If you're into computation graphs, this advantage alone should make you at least curious about whether XML computation graphs are appropriate for you. Finally, most of the mainstream database vendors seem to be moving toward native support for XML. As an example, the next major release of Microsoft's SQL Server supports XML as a "first-class citizen" (i.e., the database internals optimize XML handling and don't rely on a "bolt-on"). This means that mainstream databases can be used for these advanced environments instead of continuous use of proprietary or niche-vendor environments. XML JOURNAL LATEST STORIES . . .
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK BREAKING XML NEWS
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||