Decomposing Large Networks


The goal in decomposing a large network for parallel simulation is to divide it into loosely-coupled subnetworks that take an equal amount of time to simulate. That way, all processors are kept busy for a large fraction of the time and the communication overhead is reduced. The best way to partition a network is highly problem-dependent, but we can give some general guidelines:

  1. Determine which elements are responsible for the majority of the simulation time consumed, then find a partition that spreads these across the available processors.
  2. Try to find n-way partitions instead of fixed partitions. An n-way partition is scalable so that one can choose an n that results in the best speedup without incurring excessive communication overhead. One can also select n to match the size of the network model being simulated. For example, we might want to partition a 500 element network across 5 processors (i.e., n=5) and a 2000 element network across 20 processors (i.e., n=20).
  3. If possible, try to have the inter-processor messages be ones on which a delay has been placed using the rvolumedelay command. A delay allows communication to be overlapped with the computation, resulting in less apparent communication overhead.

Once you have determined how you are going to partition your network model, you are ready to write a parallel script to set up and perform the simulation. Usually, the script will have several phases, with all the nodes waiting at a barrier at the end of each phase until the others catch up. Typical phases in a script would be:

  1. Each node creates the elements that will reside on that node.
  2. Each node creates messages from its own elements to elements residing on other nodes (as well as to elements residing locally).
  3. For non-interactive simulations, each node steps through the desired number of simulation timesteps.
  4. For interactive simulations, one node acts as a master and sends commands to the other nodes (e.g. step@all), while the other nodes wait at a barrier for commands from the master.

To illustrate some of the above points, we have developed a couple of example network decompositions and associated scripts for the Orient_tut example discussed in the Book of Genesis.