Benchmarks
This section will be filled with benchmarks from my PhD after it is published. Currently you can see the different topics for which benchmarks exists and a comparison between single-threaded JINC and the well established OBDD library and CUDD.
Memory
The benchmark setup uses two parameters, n for the number of variables and k for the number of nodes that are created per level. There is no dependency between levels. For level i (with variable xi) we create the functions
xi,2·xi,3·xi,...,n·xi. The used hardware for this benchmark consists of 4 processors with 2.5GHz and
8GB RAM.
At first we use 10 variable levels and create 4000000 nodes per level. This means that we allocate and delete 44000001 nodes (4000001 terminal nodes + 10 · 4000000 inner nodes) for the first measurement. Each level is created in a seperate task, so that there is no concurrenct access to the unique-tables. As a second benchmark we use 1000 variable levels and create 40000 nodes per level. This results in 40040001 total nodes. The third setup also uses 1000 variable levels and creates 40000 nodes per level, but changes the creation order. Task i creates the functions i · x1, . . . , i · xk (i = 1, . . . , n). That means that there are many concurrent accesses to unique- tables. Figure 1 illustrates the results for all three benchmarks. In the first two benchmarks the number of nodes that are allocated per second and processor is almost constant. The garbage collection algorithm does not benefit in the equal amount from parallel threads as in the case of node allocation. This is due to the fact that the garbage collection has several phases which are separated via wait points. This reduces the effect of multiple parallel threads. In the third benchmark the performance of the allocation decreases, i.e., the concurrency increases with an increasing number of processors. The performance of the garbage collection decreased (compared to the second benchmark) dramatically. This is due to the fact that the nodes of a level are allocated from different memory pools. Because of this memory fragmentation the processor data cache cannot work very efficiently. Instead of fetching the data from the fast cache memory the data must be fetched from the much slower main memory. The increased execution time of the garbage collection makes the influence of the synchronization points less relevant.

Figure1 : Results for different allocation and garbage collection benchmarks.
Reordering
Figure 1 shows the parallel reordering approach applied to different instances of the n-queens problem. The x-axis represents the time needed to run a complete genetic reordering algorithm. The y-axis represents the number of CPUs that are fully used. The used hardware consists of 16 processors with 3.5GHz and 4GB RAM. The thread-pool has been limited to 16 threads. The benchmark shows that the new reordering approach scales with increasing number of variables. This can be explained with the increasing number of possible parallel operations and that the rescheduling is fast compared to a single swap operation. For the smaller examples the rescheduling process is more time consuming compared to a single swap operation. The larger examples take advantage of this new approach. The reordering of the 10-queens problem is nearly 3 times faster compared to the single threaded approach. This new approach speeds up the 11-queens problem reordering by factor 7 (700% processor load in average) and the 12-queens problem reordering by factor 10.
Figure 2 shows the number of possible parallel threads for the different instances of the n-queens problem. It illustrates that even the smaller benchmarks have a large number of parallel swap operations. As stated above the rescheduling and thread-pool management slows the overall performance. Another reason is that the number of nodes on each level is low and thereby the swap operations are not very time consuming. For the larger examples the number of possible parallel threads is higher than the number of CPUs used. This benchmark shows that even a small number of variables results in a high number of parallel threads. The average number of parallel threads for this benchmark is approximately.

Figure 1: Parallel reordering approach applied to different n-queens problem instances
Figure 2: Number of parallel swap operations for different n-queens problem instances
Calculations

Table 1: Matrix power calculation with 1 and 16 processors.
Figure 1: Matrix power processor load.
Comparisons
To show the efficiency of JINC is implemented we want to compare the running time of JINC with the well- established BDD package CUDD. The comparison with CUDD has been chosen because other libraries like the ABCD package and BDDNOW were developed with an other intention like parallelization (on different workstations while JINC has been built to use multi-processor machines efficiently) or do not include ADDs. A comparison between CUDD and BuDDy shows that the performance of both packages is almost the same. The OBDD package wld also provides a generic platform for BDD-variants, but the main focus is on the Reed Muller decomposition and (variants of) binary moment diagrams (BMD, *BMDs, K*BMDs, etc.). The wld-package is appropriate for word-level verifica- tion of hardware systems, while we concentrate on the Shannon decomposition which appears to be more ad- equate for complex matrix operations, which, e.g., are required in the context of the quantitative analysis of probabilistic systems.
We used a BDD-based state space generator and solver for the Moebius performance evaluation tool to benchmark JINC with CUDD. As a benchmark we choose the production system Kanban in different sizes.

The graphic above shows how well JINC performs in algebraic computations in comparison to CUDD. The comparison between different packages is always difficult so that we tried to eliminate troublesome effects like garbage collection and dynamic reordering. We set the parameters for both packages so that no garbage collection took place and we turned off the dynamic reordering.
For another comparison between JINC and CUDD we are using the probabilistic model checker Prism (which uses CUDD for the symbolic calculation) and PROMOC (which uses JINC). The leader election protocol is used as a benchmark.
The leader election protocol is used to elect a leader out of N processors. In this case all processors are arranged in a synchronous ring.
This protocol uses the idea that every processor randomly chooses a number from 1 to K. All chosen numbers are passed around the ring. The processor with the highest chosen number will be the leader. If there are more than one processors with the same number, the election starts again.

The graphic above shows the times (in seconds) needed to verify that a leader is elected with probability one. Both tools use the same input language, the same variable ordering and have the same internal representation. In this benchmark it can be seen that JINC is again faster than CUDD in most cases. JINC has a better scalability than CUDD. CUDD initializes very fast and is therefore faster in smaller benchmarks.





