CSE 635

    Asynchronous Systems

    Spring 1997

    PARALLELIZING BDD OPERATIONS

    Abstract

    Binary Decision Diagrams (BDDs) provide a space-efficient symbolic representation of boolean functions and find widespread use in boolean function manipulation, hardware verification and finite state modelling of concurrent systems. Manipulation of real BDDs typically tests the processor as well as the memory system resources to the limit. Attempts at distributed implementations achieve limited success and are only helpful to the extent of reducing the overhead from page faults and avoiding shortage of virtual memory. True parallelism is not possible due to the inherent sequential nature of BDD manipulation algorithms and the speedup gained from these techniques is far from linear. This project presents an augmented representation of BDDs which can be used to parallelize BDD operations and achieve close to linear speedup with the storage requirement increasing only by a constant factor over a traditional BDD representation. Together with the augmented representation, various data structures have been used in order to optimize computation on the critical path to each processor. We present the results using benchmark BDDs from the ISCAS85 suite in order to highlight the applicability of our scheme for real BDDs. The project is motivated by the need to manipulate very large BDDs in model checking and hardware verification applications.

    Introduction

    If a system can be modeled as a finite state machine, we can very efficiently prove properties about that system expressed in temporal logic by the technique of model checking . Unfortunately, finite state models of concurrent systems (e.g. computer hardware) grow exponentially in size as the number of components of the system increases. This is widely known as the state explosion problem and limits finite state verification methods to small systems. A method of skirting this problem is symbolic model checking [McMillan] . This technique avoids constructing the finite state model by representing it symbolically as a boolean formula. This allows efficient methods of manipulating boolean algeabra to be applied to the model checking process. OBDDs (Ordered Binary Decision Diagrams) provide a canonical representation of boolean functions as directed acyclic graphs. A number of operations on boolean functions can be implemented as graph algorithms on OBDD data structures. However, the success of all BDD based synthesis and verification algorithms depend on the ability to efficiently manipulate very large BDDs. Work in this direction has primarily focussed on trying to partition the BDDs across several nodes in a network of workstations (NOW) and trying to use the collection of main memories on the newtork for masking the page fault ovehead through the low communication latency of the network [Sanghavi] . The algorithm however is still inherently sequential because the nodes at one level have to wait for their parents to complete in order to know which nodes to pair with at the same level in the other BDD. We propose a new BDD representation which allows the full parallelization of operations within each processor in the APPLY phase of the breadth-first BDD algorithm.

    The report is organized as follows : the first section gives an overview of BDDs and the algorithms used to manipulate them. The next section gives a survey of various attempts at distributed BDD implementations and their contributions to performance. This is followed by a brief overview of our earlier attempts at arriving at an alternate representation for BDDs which would enable fully parallel operation at each processor and the limitations of some representations because of the way real BDDs are structured. This is followed by the presentation of our representation which augments a BDD at processor boundaries by a reachability graph and a list of paths from the previous processor. We discuss the data structures we use to further optimize computation of node pairings at processor boundaries. Finally we compare our results to the simple distibuted BDD implementation. We close with a discussion of how adaptive representations could further improve the performance.

    Binary Decision Diagrams

    A BDD [Bryant] is a simple data structure which is essentially a DAG with out-degree of every node equal to 2. Each node represents a variable in a boolean formula. The outgoing edges are referred to as the THEN and ELSE branches. Each path from the root to the terminal nodes (0 and 1) represents a decision path traced out by taking the IF branch if the variable represented by a node was true and the ELSE branch if the variable was false. Thus the BDD encapsulates information about the whole truth table of the formula it represents. A BDD can be obtained from a complete binary tree representaing the formula by removing redundant nodes. A node is redundant if either its THEN and ELSE branches point to the same node or if its THEN and ELSE branches point to the same nodes to which the THEN and ELSE branches of another node point (in this case they can be collapsed into one node). Given a boolean formula F, if BDD(f) represents the BDD represented by f we have :

    BDD(f) = x.BDD(f|x=1) + not(x).BDD(f|x=0) ..... (1)

    where f|x=0 (or 'f restricted to 0') means f with variable x set to 0.

    The variable x would become the topmost node of this BDD. A variable would similarly be chosen in the other two BDDs recursively. This defines an ordering over the variables in f. The following formula represents doing an operation op on two BDDs F and G and is popularly known as the Shannon's expansion :

    F op G = x.(F|x=1 op G|x=1) + not(x).(F|x=0 op G|x=0) .... (2)

    and follows directly from (1). Note that this assumes that F and G have the same ordering (e.g. x is the topmost variable and so on).

    Approaches to performing BDD operations :

    Formula (2) gives a recursive algorithm for operating upon two given BDDs. Two approaches to applying this algorithm are common : In the depth-first approach, the operation rooted at the THEN subtree ( F|x=1 op G|x=1 ) would be completed before the operation rooted at the ELSE subtree ( F|x=1 op G|x=1 ) Given the way a BDD is stored (all nodes at one level stored together), this algorithm would give a disorderly memory access pattern if levels were partitioned horizontally across processors. Such an approach would not be able to apply any locality of refernce over BDD nodes and hence would suffer from excessive page fault overhead for large BDDs . Another approach which is more commonly used in a parallel implementation, is the breadth-first approach given by [Ochi] . This algorithm essentially works in two phases : an APPLY phase and a REDUCE phase. In the APPLY phase, when we decide that at a particular level, two BDD nodes n1 and n2 have to combine, Shannon's expansion dictates which nodes should combine at the next level (essentially the THEN nodes of n1 and n2 combine and the ELSE nodes of n1 and n2 combine). These combine requests are 'enqueued' at the level where they have to occur. When nodes at that level combine, they dequeue and perform these requests together with enqueueing further combinations that are dictated by their own structure. When control reaches the last level and the APPLY phase is over, the REDUCE phase starts bottom-up. In this phase, the reults of the APPLY phase at each level are passed up to the upper level which requested the combination.For example, F(x=1) op G(x=1) would be passed up to the level containing variable x. This algorithm avoids the disorderly pattern of memory access present in the depth-first algorithm by processing all nodes at one level together.

    An overview of distributed BDD implementations:

    BDD operations using any of the above approaches , are highly memory and computation intensive. Operating upon large BDDs (BDDs of 40 to 50 variables are common) using these approaches directly can take substantial run time , and in most cases, the system easily runs out of virtual memory (See Sanghavi's paper outlining the combinations that they could not do on one processor because of lack of memory). For tackling the virtual memory problem, distributed versions of these algorithms have been developed ; notably those of the breadth-first approach. Depth-first approaches exploiting parallelism among several depth-first paths in the BDD have also been attempted [USC]. This would avoid the problem of loss of locality by not storing the BDD nodes level by level. Instead a pseudo-random hash function is used to partition the BDD nodes 'vertically' across processors. However, this approach also doesn't scale linearly with the number of processors as desired from a truly parallel implementation. The breadth-first apprioach facilitates locality of reference when levels are partitioned horizontally across processors. However this approach to breadth-first BDD manipulation is distributed, not parallel , because the algortihm is still inherently sequential. Every level has to wait for information from its parent about which nodes at a particular level in one BDD have to combine with which nodes at that level in the other BDD. Our experiments with ISCAS benchmark BDDs indicate that of the total execution time, 90 % of the time is spent in the APPLY phase .

    Parallelizing BDD operations

    Figure 1 : The figure shows how the nodes of the BDD map onto the nodes in the complete binary tree represnting teh logical formula. Each level stands for one variable in the formula. Note how BDD nodes x and y can be represented as a set of ranges. Also note that if there is further skipping by node y, each of the nodes marked y in the complete binary tree has a range rooted at it, giving a nested set of ranges ; and that everytime two edges converge into a BDD node from two parents, the set of ranges represented by the parents gets added to the set represented by that BDD node.

    In order for performance of parallel BDD operations to scale linearly with the number of processors, it is necessary that there be some way of starting pairing computation at every processor in parallel. A naive application of Shannon's expansion necessitates waiting at lower levels for information from higher levels about which nodes to pair. Clearly a way is needed to reduce total dependence for information at a level from its parents. Our earlier approach to this problem was to make each level completely independent of the higher levels by coming up with a Set of Ranges representation at each level. This is motivated by the following observation about the relationship of a compact BDD representation with the space-expensive complete binary tree representation : Consider the full binary tree representing the logical formula. Each level in the tree represents one variable and the tree structure gives the results of all possible decision paths starting from the root. i.e. given a value for each variable, the value of the formula (true or false) can be derived by following the appropriate THEN and ELSE paths from the root. The last level thus gives the whole truth table. Now assume that we number the tree nodes at level i by indices from 0 to 2i. This means that each BDD node at level i corresponds to a set of these indices. Two BDD nodes pair at a level if and only if the set of indices represented by them intersect. Hence if we can get a minimal representation of a BDD node's set of indices, then we can determine their pairing independent of information from the parent nodes. This representation takes the form of a set of ranges if there is reasonable amount of skipping in the BDD (even without skipping, the representation is of the form of a set of ranges but it will be storage optimal only if there is skipping in the BDD). Skipping basically means that a given decision path doesn't care about the value of the variables it skips. Consider what happens when a node representing index k at level i takes different kinds of branches. If it takes a THEN branch to node n at level i+1, then node n represents index 2*k at level i+1. If it takes an ELSE branch to n, then it represents index 2*k+1 at level i+1. Now consider what happens when the branch from index k, level i goes to node n at level j where j>i+1. In this case, if the branch was a THEN, node n would now represnt a list of indices at level j which starts at index (2*k)(j-i-1), has gap 1 and has length 2(j-i-1). For an else branch, the starting index would be (2*k+1)(j-i-1). This information can be conveniently represented by a <start,gap,length> tuple. Now consider what happens when the node n (which now represents a listof indices), takes a THEN or an ELSE branch. The starting index of the list changes and the gap doubles at every level. Now if this path again skips, then each single index in the list would give rise to another list which would be nested inside the earlier list. Also, since a BDD node can have an in-degree higher than one, it may get such contributions along several paths. Hence, in general, a BDD node would be represented by a set of such nested lists (refer to figure 1).When we need to determine whether two BDD nodes at a level, represented in this form intersect with each other, note that the algorithm takes a definitive structure. Essentially, given a BDD node that has multiple incoming paths, we can find a parent which is common to all these paths. Consider that the representation of this common parent is in the form of a list. Each element of this list has a subtree rooted at it. The contributions from the THEN path form a pattern in the left subtree and the contributions from the ELSE path form a pattern in the right subtree. Whenever along a path there is another split, a new pattern is generated in each of the left and right subtrees rooted at the split node. This gives a nice recursive pattern to the range of indices represented by a BDD node and forms the basis of a resursive algorithm for computing the intersection. However, we noted that the amount of skipping in real BDDs is quite restricted. This means that the nested lists that form part of the set of ranges would generally be single-element lists. This means that the storage needed for the representation would be exponential. Consider a classic case of a BDD that computes the exclusive-or of n variables. This BDD has no skipping and hence leads to an exponential storage requirement if the set of ranges representation is used.

    Exploiting sevaral other properties of BDD structure was attempted. Note that a BDD node which has in-degree n would have n sets of indices contributed along each path. We attempted to represent this as a single list of indices by collapsing individual contributions to form a list with periodicity equal to the GCD of individual periodicities. This would meet with limited success if, for instance, the lists of indices that came together have small GCDs giving a list with lots of redundant nodes. But note that this problem can be avoided to some extent by allowing only those lists to merge which have been contributed along identical paths. For example, if we keep 22=4 sets of lists at a node and allow only those lists to combine which came from paths matching in the last two bits, then the minimum GCD of any list will be at least 4 (because lists that start at indices of the form 2*2i will combine only with lists of the form 2*2j, those that start at indices of the form 2*(2i+1) combine only with those that start at 2*(2j+1) and so on...giving a minimum GCD of 4). Hence, if we distinguish between path histories for h bits, then the minimum GCD of the collapsed lists would be 2h. This would make the collapsed lists sparse and redundant intersections would decrease. h would hence be a parameter to the representation. However, we were still seeking a representation that could give us exact pairings at a low storage cost and hence close to parallel operation.

    
    

    Figure 2 : Shows how the parallel implementation works. The topmost diagram shows the levelwise distribution of the two operand BDDs across 4 processors. The top time line shows the order in which events will finish in the parallel implementation. Note that as soon as processor i is finished with its intersection computation, processor i+1 can start its breadth-first computation. In the simple distributed approach (as in Sanghavi et al), each processor must wait for the previous processor to finish which is demonstrated in the lower time line. The dotted lines are shown to clarify that the lines which are mapped to each other using the dotted lines represent the time taken for the breadth-first computation at the same processor (the difference being that there is no overlap in the latter case whereas our approach gives sufficient overlap and hence parallelism between processors).

    Finally we came up with a representation which meets all of the above criteria. Note that our aim is to reduce the critical path in computation of the pairings at each level. In the sequential top-down algorithm, the critical path for level i is of length i-1 , i.e., level i has to wait for i-1 previous levels to complete before proceeding. If we wish to short-circuit this critical path to, say h levels, then it means that path information from the level h levels above should be available to level i. Once the pairings at level i-h have been performed, level i can proceed with computation. Assume that there are k processors an each one of them holds h levels. This representation essentially implies that for the last processor, the critical path will be reduced to k-1 instead of (k-1)*h. This forms the basis for our implementation and allows close to linear speedup (refer to figure 2).

    Augmenting BDD representation for parallelism

    A BDD of n variables is partitioned into n/h parts with h being the number of levels of path history that the boundary levels keep (see figure 2). Each BDD node at a boundary level i is augmented with information about which parents at level i-h have paths coming into it and which paths these are. In order to keep this information as small as possible, it is kept in the form of a 32-bit bitmap (implying h<=5) . The node n at level i has a list of these bitmaps : one from each parent (h levels back) that has paths coming into it. Note that the worst-case storage requirement for this representation is bounded by b2 where b is the upper bound on the number of BDD nodes at any level. The true requirement would typically be a lot less because this bound is reached only at 'fat' levels. Note that contrary to our earlier attempt at trying to represent a BDD node with a set of indices in the name space of the complete binary tree, this representation uses the name space of the BDD which gives it its compactness. The algorithm essentially proceeds as follows : at level 0 in processor 0, 0 and 0 clearly pair. This information is sent to processor 1 and processor 0 proceeds with its computation. Meanwhile, as soon as processor 1 receives this information, it starts computing the pairings that would occur at level h held by it. Note that this is possible because processor 1's level h has complete information about the reachability pattern from level 0 to level h. As soon as this is over, this information is sent to processor 2 and processor 1 continues. Note that for the success of this scheme, it is imperative that the pairing computation be done as fast as possible. The next section outlines how the reachability information is kept and how the pairings are computed.

    Data Structures used for optimizing intersection computation

    Each BDD node at levels that are mulitples of h keeps a list of bitmaps : one for each parent that has a path coming into it. Consider the case in which h=5. Then, the bitmap for parent p has 32 bits and bit i being 1 means that path i comes into this node from the parent p . Path i means the path written out in binary is i. e.g. a path THEN THEN ELSE THEN ELSE would be 00101 which is 5. So, if at level 10, there is a BDD node which has an incoming path formed by taking 4 THEN and 1 ELSE branches from node 2 at level 5, then this node would have bit 1 (which is the path written in binary : 00001) set to 1 in its bitmap for parent 2 . The second structure we keep is a reachability graph from level i to level i+h . Every node at level i keeps a link to a list of nodes at level i+h which it can reach. Now, in order to figure out that which nodes combine at level i+h given that node m and n combine at level i, all we need to do is follow the reachability link from nodes m and n to level i+h . This gives us the list of nodes at level i+h that may potentially combine given that m and n combine. Of these, those nodes will combine which have intersecting bitmaps. This intersection can be simply checked by a bitwise AND of the bitmaps which is a cheap operation. This needs to be done for each pairing at level i. Once the pairings at the first level held by a processor (which is i+h in this case) have been computed like this, the processor can start computation without waiting for the previous processor to complete, thus gaining over the breadth-first approach.

    Results :

    These results are from the BDDs generated from the c499.blif and c432.blif netlists from the ISCAS85 benchmark suite. Both are 40 variable BDDs.

    No. of variables/levels = 40

    h=no. of levels per processor=5

    sequential case (every level waits for previous levels to complete)

    processor no. vs time taken for processing all the levels held by it

    1 1.897
    2 18.568
    3 50.925
    4 247.255
    5 33.958

    parallel case (every processor can start as soon as info from h levels back is available)

    processor no. vs time taken for computing no. of intersections at the first level held by that processor (given info from the previous processor about intersections h levels before)

    2 3.215
    3 78.635
    4 57.008
    5 75.8
    6 86.0

    Discussion of Results :

    Due to jumps to terminal nodes from nodes at levels higher than 25, the number of intersections at levels >= 25 was zero and hence those times are not shown. The first table shows the time taken to process the h levels held by each processor in the normal breadth-first approach. The second table shows how much time each processor spends in doing the intersection computation at the first level held by it given that it has received information about which nodes intersect h levels back (this info comes from the previous processor). So, processor 1 gives this info to processor 2 and starts its own computation. Processor 2, which would lie idle in the breadth-first case, uses this time to perform intersection computation given this information. If it can finish its computation before processor 1 does, we gain since processor 2 can start before processor 1 is done. However, if it does not, then we do not lose anything since we need not complete the intersection computation and simply start the computation whenever processor 1 is done ; which is the same as what would happen in the breadth-first case. Note that in computing intersections, processor 5 takes time which is a factor of 3 less than the time that processor 4 takes to complete (which is primarily due to the 'fatness' of the bdd at levels 15-20; this is a characteristic of most BDDs in that they show growth in the number of nodes per level till about the middle and then shrink down till they finally reach the two terminal nodes). The overall time that would be taken in the parallel case is the sum of the times taken by the sequential case except in cases where the intersection computation takes lesser time than the breadth first traversal. So, using the parallel approach, we get a factor of 1.7 improvement in the overall running time in the APPLY phase. For BDDs with a higher number of nodes per level, one would expect to see performance improvement by larger factors.

    References :

  • [McMillan]Kenneth L. McMillan, "Symbolic Model Checking ", Kluwer Academic Publishers.
  • [Bryant] Randal E. Bryant , "Symbolic boolean Manipulation with Ordered Binary Decision Diagrams" in ACM Computing Surveys, Sept,1992 .
  • [Ochi] H.Ochi, K.Yasuoka, and S.Yajima, " Breadth-First Manipulation of Binary Decision Diagrams" in Proc. Intl. Conf. on CAD, Nov,1993 .
  • [Sanghavi] Jagesh V. Sanghavi, Rajeev K. Ranjan, Robert K. Brayton, and Alberto Sangiovanni-Vincentelli, " High Performance BDD Package by Exploiting Memory Hierarchy " in the 33rd Design Automation Conference, 1996 .