________________________________________________________
 
 

Description for Implementation of MPI Programs
________________________________________________________
 

 
 
VECTOR-VECTOR MULTIPLICATION 

BACKGROUND  

Vectors can be partitioned in different ways and the partitions can be assigned to different process. We briefly explain some well known partitioning techniques to write parallel programs. In the striped partitioning of a vector, the vector is divided into groups of contiguous elements, and each process is assigned one such group. 

A serial algorithm for vector-vector multiplication requires n multiplications and (n-1 ) additions .The serial program is given below : 

float VECT_VECT (int n, float x[ ], float y[ ])     
{    
    int i;   
    float dot_product;   
    dot_product  = 0.0;    
    for(i = 0; i < n; i++)    
      dot_product = dot_product + (x[i] * y[i]);    
     return(dot_product);    
   

MATRIX-VECTOR MULTIPLICATION   
 
BACKGROUND  
Matrices can be classified  into two broad categories, dense matrices and sparse matrices. Dense or full matrices  have few or no zero elements. Sparse matrices  have a majority of zero elements . In order to process a matrix input in parallel, we must partition it so that the partitions can be assigned to different processes.  If we assume that an addition and multiplication pair takes unit time, then the sequential run time of algorithm is n2. For multiplying a dense matrix A of size m x n and a vectorof size n  atleast, three distinct parallel formulations of matrix-vector multiplication's are possible.  It depends on row-wise striping, column-wise striping, or checkerboard striping of  the matrix. The following examples discuss common ways to partition matrices among processes to perform matrix-vector multiplication.  Serial algorithm of matrix vector multiplication is explained below: 
       void MAT_VECT(int n, int m, float A[ ][ ], float x[ ], float y[ ])     
     {     
        int   i, j;     
           for (i = 0; i<m; i++)     
        {     
           y[i]= 0.0;     
           for (j =0;  j<n;  j++)      
               y[i]= y[i] + A[i][j] * x[j];     
        }     
     }    
 
        The following examples discuss issues in implementation of parallel matrix-matrix multiplication algorithms. 

MATRIX-MATRIX MULTIPLICATION 
 
 BACKGROUND 

We discuss parallel algorithms for multiplying two dense, square matrices A and B of size n to yield the product matrix. All parallel matrix multiplication algorithms involves the scalar algebraic operations performed on the blocks or submatrices of the original matrix .  Such algebraic operations on the submatrices are called block matrix operations . If we assume that an addition and multiplication pair takes unit time, then the sequential run time of conventional algorithm is n3. Serial and parallel algorithms of matrix-matrix  multiplication are explained below : 
       void BLOCK_MAT_MULT ( int n, float A[ ][ ], float B[ ][ ], float C[ ][ ])        
      {             
            int      i, j, k;      

            for (i = 0; i < n-1; i++)        
                for (j = 0; j < n-1; j++)       
                {     
                    /*  Initialize all elements of C [i,  j] to zero */     

                        C[i][j] = 0.0;  
                        for (k= 0; k < n-1; k++)      
                        C[i][j] = C[i][j] + A[i][k] *  B[k][j];       
                }       
      }     
      

 

Example 13 : Description for implementation of MPI program to compute dot product of two vectors
                               using block-striped partitioning with uniform data distribution

  • Objective
  • Write a MPI program, to compute the vector-vector multiplication on p processors of PARAM 10000 using  block-striped partitioning for uniform data distribution . Assume that the vectors are of size n and p is number of processors used and n  is a multiple of p.

  • Description 
  • The partitioning is called block-striped if each process is assigned contiguous elements. The process P0 gets the first n/p elements, P1 gets the next n/p elements and so on. The distribution of 16 elements of vector a on 4 processes is shown in Figure 7. 

     

     
    Figure 7  A Typical block-striped partitioning of a vector of size 16 on 4 processes

    Initially process with rank 0 distributes the input vectors using MPI_Scatter on p processes. Each process will perform local dot product of the vectors and accumulate the partial dot product. Now the process with rank 0 performs global reduction using MPI_Reduce to get the final dot product. 
  • Input  
  • Process with rank 0 reads a real vectors of size n. Assume that the number of elements n are greater than or equal to number of processes p.  

    You have to adhere strictly to the following format for the input files. 

    #Line 1 : Number of Elements (n
    #Line 2 : List of Elements (data). 

    A sample input file for the vector A is given below 

    1.0   2.0   3.0    4.0     5.0    6.0    7.0    8.0 

    A sample input file for the vector B is given below 

    1.0   2.0   3.0    4.0     5.0    6.0    7.0    8.0

  • Output 
  • Process with rank 0 prints the final dot product of two vectors. The dot product for the sample input vectors is given below. 

     204.0 


Example 14 : Description for implementation of MPI program to compute dot product of two vectors
                               using block-striped partitioning with non-uniform data distribution
  • Objective

  •  

    Write a MPI program, to compute the dot product of two vectors on p processors of PARAM 10000 using  block-striped partitioning for non-uniform distribution of data

  • Description 
  • If p divides n evenly, processes P0 gets the first n/p elements, P1 the next n/p elements and so on. If p does not divide n evenly, and  r be the remainder, then first r processes get (n/p)+1 elements and remaining p-r processes get n/p elements.  The program described in Example 13 uses MPI_Scatter and MPI_Reduce because the vector is equally distributed on p processes. Here, we use MPI_Scatterv call, which distributes the vectors non-uniformly on all processes. 

  • Input 
    •  
    Input is given in the same format as explained in the Example 13 
     
  • Output 
  • Process with rank 0 prints the final dot product of two vectors and the result are given as in Example 13 
     


Example 15 : Description for implementation of MPI program to compute dot product of two vectors
                               using block-striped partitioning with cyclic data distribution
  • Objective
  • Write a MPI program, to compute the dot product of two vectors on p processors of PARAM 10000 using cyclic data partitioning. Assume that the vectors are of size n and p is number of processors used and n  is a multiple of p. 

  • Description  
  • In the Cyclic data partitioning process P0 gets the first element, process P1 gets the next and so on. The process Pp-1 gets (p-1)th element. If the number of elements n is more than the number processes p, then  process P0 gets pth element, process P1 gets (p+1)th element and so on. The process is repeated till all the elements are assigned. If n is not a multiple of p, then first r (r = n/p) processes will get n/p +1 elements and remaining p-r processes will get n/p elements, in cyclic fashion. The Figure 8 illustrates the example for  p = 4 and n = 16. 
     

     
    Figure 8  A Typical Cyclic data partitioning of a vector of size 16 on 4 processes 
  • Input 
    Input is given in the same format as explained in the Example 13 
     
  • Output 
  • Process with rank 0 prints the final dot product of two vectors and the result are given as in Example 13 
     


Example 16 : Description for implementation of MPI program to compute infinity norm of a matrix
                               using block-striped partitioning with uniform data distribution
  • Objective 
  • Write a MPI program to calculates infinity norm of a matrix using row wise block-striped partitioning

  • Description 
  • The infinity norm of a matrix is defined to be the maximum of sums of absolute values of elements in a row, over all rows. Row-wise block partitioning, is used and the idea is that the matrix m x n is striped among p processors so that each processors stores m/p rows of the matrix. A typical column-wise and row-wise partitioning of the matrix is shown in the Figure 9. 
     

         
        Figure 9  Uniform column-wise and row-wise striped partitioning of 16 x 16 matrix  
                        on 4 processors
         
         
    If a m x n matrix is partitioned on p processes (labeled p0 , p1, p2,..., pp-1), then process pi contains rows with indices (m/p)i, (m/p)i+1, (m/p)i+2, (m/p)i+3,........, (m/p)(i+1)-1. 
  • Input 
  • Assume A is a real matrix of size m x n.  Also the number of rows m should be greater than or equal to number of processes p. Process with rank 0 reads input data.  Format for the input file is given below. 

    You have to adhere strictly the following format for the input files. 

     #Line 1 : Number of Rows (m)  Number of columns(n). 
     #Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first 
                     and so on.) 
     

  • Input file 

  •  
    A sample input file for the matrix A (8 x 8) is given below : 
    8  8 
    -1.0   2.0     3.0     4.0       5.0      6.0      7.0      8.0 
    2.0     3.0     4.0      5.0      6.0      7.0      8.0      9.0 
    3.0     4.0     5.0      6.0      7.0      8.0      9.0    10.0 
    4.0     5.0     6.0      7.0      8.0      9.0    10.0    11.0 
    5.0     6.0     7.0      8.0      9.0    10.0    11.0    12.0 
    6.0     7.0     8.0      9.0    10.0    11.0    12.0    13.0 
    7.0     8.0     8.0    10.0    11.0    12.0    13.0    14.0 
    8.0   -9.0   10.0    11.0    12.0    13.0    14.0    15.0 
     
  • Output
  • Process with rank 0 prints the infinity norm value as given below : 

    92.000 


Example 17 : Description for implementation of MPI program to compute the Matrix and Vector
                               Multiplication using self-scheduling algorithm
  •  Objective
  • Write a  MPI program,  for computing the matrix-vector multiplication on p processors of PARAM 10000 using Self  Scheduling  algorithm.

  • Description 
  • This example illustrates , one of the most common parallel algorithm prototype , the Self-Scheduling  or Master-Slave algorithm. This example is chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but it illustrates the basic MPI send and MPI receive operations in the context of fundamental type of parallel algorithm applicable in many situations.

     

  

Figure 10  Communication pattern among master and slaves in self scheduling paradigm.

    We assume that the matrix A of size n x n is available with master process of rank 0 and the vector x of size n is available on all the slave processes , whose rank start from 1 onwards .The idea is that one process, which we call the master process,  distributes the work load to slave processes. When the slave finishes its workload, it informs the master which assigns a new workload to the slave. This communication between the master and slave is shown in Figure 10. This is very simple paradigm where the co-ordination is done by master. Here,  the slave processes do not have to communicate with one another. 

    The master begins by broadcasting vector x to each slave. It then sends one row of the matrix A to each slave .  At this point, the master begins a loop, terminated when it has received all of the entries in the product. The body of the loop consists of receiving one entry in the product vector from any slave , and sending the next task (row of matrix A)  to that slave. In other words, completion of one task by a slave is considered to be a request for the next task. Once all the tasks have been handed out, termination messages are sent. 

    After receiving the broadcast value of vector x , each slave also enters in a loop to compute the product of a matrix row with vector. Each slave computation is terminated by the receipt of the termination message from master. The body of the loop consists of receiving a row of matrix A,  performing the dot product with vector x, and sending the answer back to the master
     

  • Input  
  • The input should be in following format. 

    Assume that the real matrix is  of size m x n  and the real vector is of size n .  Also the number of rows m should be greater than or equal to number of processes p. Process with rank 0 reads the input matrix A  and the vector x .  Format for the input files are given below. 

    Input file 1 

    The input file for the matrix should strictly adhere to the following format. 

    #Line 1 : Number of Rows (m), Number of columns (n).  
    #Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first and so on.) 

    A sample input file for the matrix (8 x 8) is given below 

    8   8 
    1.0    2.0    3.0     4.0     2.0     3.0     2.0     1.0 
    2.0    3.0    4.0     2.0     3.0     2.0     1.0     1.0 
    3.0    4.0    2.0     3.0     2.0     1.0     1.0     1.0 
    4.0    2.0    3.0     2.0     1.0     1.0     4.0     2.0 
    2.0    3.0    2.0     1.0     2.0     3.0     2.0     2.0 
    3.0    2.0    1.0     1.0     2.0     1.0     1.0     1.0 
    2.0    1.0    4.0     3.0     2.0     1.0     3.0     3.0 
    4.0    1.0    2.0     1.0     2.0     3.0     4.0     3.0 
     

     
    Input file 2 

    The input file for the vector should strictly adhere to the following format. 

    #Line 1 : Size of the vector (n
    #Line 2 : (data)  

    A sample input file for the vector (8 x 1) is given below 


    1.0 1.0 1.0 1.0 1.0  1.0  1.0  1.0 
     

  • Output 
  • Process with rank 0 prints the final matrix vector product result is given below. 
     

    18.0  18.0  17.0  19.0  17.0   12.0  19.0  20.0 
     


Example 18 : Description for implementation of MPI program to compute the Matrix and Vector
                               Multiplication using block-striped row-wise partitioning with uniform data distribution 

 
  • Objective 
  • Write  a MPI program, for computing the matrix -vector multiplication on p processor of PARAM 10000 using block striped partitioning of a matrix.

  • Description 
  • In the striped partitioning of a matrix, the matrix is divided into groups of contiguous complete rows or columns, and each processor is assigned one such group. The partitioning is called block-striped if each processor is assigned contiguous rows or columns. Striped partitioning can be block or cyclic. In a row-wise block striping of an nxn matrix on p processors (labeled P0 , P1, P2,..., Pp-1),  processor Pi contains rows with indices (n/p)i, (n/p)i+1, (n/p)i+2, (n/p)i+3,........, (n/p)(i+1)-1. A typical column-wise or row-wise partitioning of 16 x 16 matrix on 4 processors is shown in the Figure 11  

       

    Figure 11  Uniform block-striped partitioning of 16 x 16 matrix on 4 processors.

    The matrix A of size n x n is striped row-wise among p processes so that each process stores n/p rows of the matrix.  We assume that the vector x of size n x 1 is available on each process . Now, process  Pi computes the dot product of the corresponding rows of the matrix A[ * ] with the vector x[ * ]  and accumulate the partial result in the array y[ * ]. Finally, process P0 collects the dot product of different rows of the matrix with the vector from all the processes. 
     

  • Input 
    The input is given in the same format as explained in the Example 17 . Also the number of rows m should be an exact multiple of the number of processors p. This is assumed so that each process get equal number of rows
     
  • Output 
  • Process with rank 0 prints the final matrix vector product result, same as in Example 17 
     
     


Example 19 :  Description for implementation of MPI program to compute Matrix and Vector
                                Multiplication using block checkerboard partitioning
  • Objective 
  • Write a parallel MPI program, for computing the matrix -vector multiplication on p processor of PARAM 10000 using block-checkerboard partitioning . Assume that p is a perfect square number. 

    Special MPI routines can be used to arrange the processors in a square grid of q x q processors where, q2=p.  Also, assume that the size of the matrix is evenly divisible by q.

  • Description
  • In checkerboard partitioning, the matrix is divided into smaller square or rectangular blocks (submatrices) that are distributed among processes. A checkerboard partitioning splits both the rows and the columns of the matrix, so no process is assigned any complete row or column . Like striping partitioning, checkerboard partitioning can be block or cyclic. The Figure 12  explains how a 8 x 8 matrix is distributed on 16 processors using block checkerboard and cyclic checkerboard partitioning . 

     

           

     

    Figure 12  Checkerboard partitioning of 8 x 8 matrices on 16 processors. 

    We assume that processes are arranged in q x q  matrix and processes are stored in row major order. As shown in Figure 12 for square grid of 4 x 4 processors the matrix A of size m x is partitioned among processes using block checkerboard partitioning , such that each process stores m/q x n/q block of the matrix A. The vector x is distributed in portions of n/q elements to the first process in each columns of q processes . For example, the  processes P00, P10 , P20 and P30 get the n/q elements of the vector x. Each process in the first column of each row in square grid of q x q processes, possesses n/q elements of the vector. Processes P00, P10, P20, P30 broadcast elements of the vector to the other processes in the respective rows of the grid.  Each process now stores m/q x n/q  blocks of the matrix A and a  n/q elements of the vector x

    Each process then performs multiplication of its block matrix with local vector elements and stores the partial result in the vector y[ * ]. Now, each process has resultant vector y of size m/q and the first process in each row of square grid processes q x q will accumulate the partial sum in the same row involving other processes in the same row of square grid of processes. Finally, process Pgathers accumulated partial sum on each process to obtain the resultant vector y. 
     
     

  • Input 
    •  
      Assume that the input is given in the same format as explained in the Example 17 
       
  • Output 

  •  

    Process with rank 0 prints the final matrix vector product result and the result is same as in Example 17 

     

 


         


Example 20 : Description for implementation of MPI program to compute Matrix and Matrix
                              Multiplication using self-scheduling algorithm
  • Objective 
  •  Write a MPI program, for computing the matrix-matrix multiplication on  p processors of PARAM 10000 using Self-Scheduling  algorithm. 

  • Description 

  •  
    This example illustrates , one of the most common parallel algorithm prototype , the Self-Scheduling  or Master-Slave algorithm. This example is chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but it illustrates the basic MPI library calls like, MPI send and MPI receive  in the context of fundamental type of parallel algorithm applicable in many situations. 

    We assume that a square matrix A of size n is available on the master process with rank 0 and another square matrix B of size n is available on all the slave  processes ,whose rank starts from 1 onwards. The idea is that  the master process,  distributes the work load to slave  processes. When the slave finishes its workload, it informs the master, about the completion of the task assigned and then master assigns a new workload to the slave. This is very simple paradigm where  the coordination is done by master. Here,  the slave processes do not have to communicate with one another. This communication among the master and the slaves is shown in Figure 13 
     
     

           
                              Figure 13  Communication pattern among master and slave in self scheduling paradigm. 

    The master process with rank 0 stores the final output square matrix C of size n . The master process  receives whole row of the product matrix C from any slave, and sends the next task (row of matrix A) to that slave processes. In other words, completion of one task by a slave is considered to be a request for the next task. Once all the tasks have been handed out, termination messages are sent . 

    Each slave process, after receiving  matrix B,  enters a loop which is terminated after receiving the termination message from master. The body of the loop consists of receiving the row of the matrix A, forming the entries Ci, j of required row by doing computation with the received row of the matrix A with the available matrix B , and sending the  computed row of the product matrix. 

  • Input 
  • Assume A and B are real square matrices of size n and the number of rows n should be greater than or equal to number of processes p. Process 0  reads  the input matrices A and B.  Format for the input file is given below. 

    The input file for the matrices A and B should strictly adhere to the following format. 

      #Line 1 : Number of Rows (n)  Number of columns(n). 
      #Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first and so on.) 

     

  • Input file 1 

  •  
    A sample input file for the matrix A (8 x 8) is given below 

    8   8 
    1.0    2.0    3.0     4.0     2.0     3.0     2.0     1.0 
    2.0    3.0    4.0     2.0     3.0     2.0     1.0     1.0 
    3.0    4.0    2.0     3.0     2.0     1.0     1.0     1.0 
    4.0    2.0    3.0     2.0     1.0     1.0     4.0     2.0 
    2.0    3.0    2.0     1.0     2.0     3.0     2.0     2.0 
    3.0    2.0    1.0     1.0     2.0     1.0     1.0     1.0 
    2.0    1.0    4.0     3.0     2.0     1.0     3.0     3.0 
    4.0    1.0    2.0     1.0     2.0     3.0     4.0     3.0 
     

  • Input file 2

  •  
    A sample input file for the matrix B (8 x 8) is given below 

    8   8 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
    1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0 
     
     

  • Output 
  • Process 0  prints  final matrix-matrix product matrix C (8 x 8) .  The output of the matrix - matrix multiplication is given below 
     

    18.0   36.0   54.0   72.0   36.0   18.0    36.0   18.0 
    18.0   36.0   54.0   72.0   36.0   18.0    36.0   18.0 
    17.0   34.0   51.0   68.0   34.0   17.0    34.0   17.0 
    19.0   38.0   57.0   76.0   38.0   19.0    38.0   19.0 
    17.0   34.0   51.0   68.0   34.0   17.0    34.0   17.0 
    12.0   24.0   36.0   48.0   24.0   12.0    24.0   12.0 
    19.0   38.0   57.0   76.0   38.0   19.0    38.0   19.0 
    20.0   40.0   60.0   80.0   40.0   20.0    40.0   20.0 
     


Example 21 : Description for implementation of MPI program to compute Matrix and Matrix
                               Multiplication using block checkerboard partitioning and MPI Cartesian topology
  • Objective 
  • Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000 using block checkerboard partitioning of the matrices . Special MPI routines on cartesian topology can be used for checkerboard partitioning of matrices. 

    Assume that  p= qand the size of square matrices A and B is evenly divisible by q

  • Description 

  •  

     Assume that A and B are square matrices of size n and C be the output  matrix. These matrices are dived into blocks or submatrices to perform matrix-matrix operations in parallel .For example, an n x n matrix A can be regarded as q x q array of blocks Ai, j (0<=i <q, 0<= j < q) such that each block is an (n/q) x (n/q) submatrix. We use p processors to implement the block version of matrix multiplication in parallel by choosing q as a square root of  and compute a distinct block Ci, j on each processor. Block  and cyclic checkerboard partitioning of a 8 x 8 matrix on a square grid (4  x 4) of processors is shown in the Figure 14 
     
     

         

                            

    Figure 14  Checkerboard partitioning of 8 x 8 matrices on 16 processors

    The matrices A and B are partitioned into blocks, A i, j and B i, j (0<=i < q, 0<= j < q) of size  (n/q x n/q)   on each process. These blocks are mapped onto a q x logical mesh of processes. The processes are labeled from P0,0 to Pq-1,q-1. An example of of this situation is shown in Figure 14  Process Pi, j initially store block matrices Ai, j and Bi, j and computes block Ci, j of result matrix. To compute submatrix Ci, j , we need all submatrices , Ai, k and  Bk, j ( 0 <= k < q ) . To acquire all the required blocks, an all-to-all broadcast of matrix Ai, j 's  is performed in each row and similarly in each column of matrix Bi, j's. MPI collective communication is used to perform this operations. 

    After Pi, j acquires, A i,0 , A i,1 , A i,2 , A i, q-1 and B0, j , B1, j , B2, j , Bq-1, j , it performs the serial block matrix to matrix multiplication and accumulates the partial block  matrix Ci, j of matrix C . To obtain the resultant product matrix C, processes with rank 0 gathers all the block matrices by using MPI_Gather collective communication operation. 

     MPI provides a set of special routines to virtual  topologies. An important virtual  topology is the Cartesian topology . This is simply a decomposition in the natural co-ordinate ( e.g., x,y,z) directions. 

  • Input 
  • The input is given in the same format as explained in Example 20 . Assume that the number of processes is a perfect square. 
     

  • Output 
  • Process with rank 0  prints the  final matrix-matrix product and the result are given as in Example 20 


  Example 22 : Description for implementation of MPI program to compute Matrix and Matrix
                                Multiplication using block checkerboard partitioning and Cannon Algorithm
  • Objective 
  • Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000. Use a special MPI routines on cartesian topology for block checkerboard partitioning of the matrices, Cannon's Algorithm. 

    Assume that  p= q2 and the size of square matrices A and B is evenly divisible by q. 

  • Description 

  •  

    Cannon's algorithm is based on cartesian virtual topology. As discussed in Example 21, there are p processors arranged in q x q square grid of processors and the input matrices, A and B are distributed among the processes in checkerboard fashion. It results in constructing p block matrices of A and B. It uses only point-to-point communication for circularly shifting blocks of matrix A and matrix B among p processes. 

    The algorithm proceeds in q stages. The first step in this algorithm is to perform initial allignment of the block matrix A and block matrix B. The blocks of matrix A are circularly shifted to the i positions to left in the row of the square grid of processes, where i  is the row number of the process in the mesh. Similarly, blocks of matrix B are circularly shifted j positions upwards, where j is the column number of the process in the processes mesh.  This operation is performed by using MPI_Sendrecv_replace MPI_Send and MPI_Recv is not used for point-to-point communication, because if all the processes call MPI_Send or MPI_Recv in different order the deadlocked situation may arise . 

    The algorithm performs the following steps in each stage : 

    1. Multiply the block of matrix A and matrix B and add the resultant matrix to get the block matrix C
        which is initially set to zero. 
    2. Circularly shift the blocks of matrix A to left in the rows of the processes and the blocks of 
         matrix B upwards in the columns of the square grid of processes in a wrap around manner. 

    The communication steps for 4 x 4 square grid of processors mesh are explained in the Figure 15. 
     

      

Figure 15  The communication steps in Cannon's Algorithm on 16 processors

  • Input 
  • The input is given in the same format as explained in Example 20 
     

  • Output 
  • Process with rank 0  prints the final  product matrix and the results are given as in Example 20


Example 23 : Description for implementation of MPI program to compute Matrix and Matrix
                               Multiplication using block checkerboard partitioning and Fox Algorithm
  • Objective 
  • Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000 .Use a special MPI routines on cartesian topology for block checkerboard partitioning of the matrices, Fox's Algorithm. 

    Assume that  p= q2 and the size of square matrices A and B is evenly divisible by q. 

  • Description 
  • This algorithm is based on cartesian virtual topology. As discussed in Example 21, there are p processes arranged in q x q mesh and the input matrices are distributed among the processes in checkerboard fashion. It results in constructing p block matrices of A and B. The algorithm uses two types of communication, firstly it uses one-to-all broadcast for blocks of matrix A in processes rows. Similarly blocks of matrix B are circularly shifted j positions upwards, where j is the column of the process in the processes mesh  using MPI_Sendrecv_replace. In this case MPI_Send and MPI_Recv is not used for point-point communication because if all the processes call MPI_Send or MPI_Recv in different order the deadlocked situation may arise. 

    The algorithm proceeds in n = q stages and it performs the following step in each stages . 

    1.  Broadcast the block of matrix A on the diagonal process of a particular row in the square grid of 
         processes to other processes in the same row. 
    2.  Multiply the block of matrix A and matrix B and add the resultant matrix to get the block matrix C
         which is initially set to zero. 
    3.  Circularly shift the blocks of matrix B upwards in the processes columns and receives a fresh block 
         of matrix B from the process below it. This selection is done in wrap around manner. 
    4.  Select the  block of matrix A for the next row broadcast. If Ai, j was broadcast in the current step, 
         then select Ai,(j+1)mod q th block for the next broadcast. This selection is also done in 
         wrap around manner. 

    The communication steps for 4 x 4 square grid of processors are explained in the Figure 16 
     

   

 Figure 16  The communication steps in Fox's Algorithm on 16 processors

  • Input 
  • The input is given in the same format as explained in Example 20 
     
      

  • Output 
  • Process with rank 0  prints the  final matrix-matrix product and the result are given as in Example 20

 

 

Contents       Hands-on Session       MPI Library Calls       MPI On Web