Hands-on-session On PARAM 10000 : Day 2




________________________________________________________ Description for Implementation of MPI Programs ________________________________________________________

VECTOR-VECTOR MULTIPLICATION

BACKGROUND

Vectors can be partitioned in different ways and the partitions can be assigned to different process. We briefly explain some well known partitioning techniques to write parallel programs. In the striped partitioning of a vector, the vector is divided into groups of contiguous elements, and each process is assigned one such group.
A serial algorithm for vector-vector multiplication requires n multiplications and (n-1 ) additions .The serial program is given below :
float VECT_VECT (int n, float x[ ], float y[ ])
{
    int i;
    float dot_product;
    dot_product = 0.0;
    for(i = 0; i < n; i++)
      dot_product = dot_product + (x[i] * y[i]);
     return(dot_product);
}

MATRIX-VECTOR MULTIPLICATION

BACKGROUND

Matrices can be classified into two broad categories, dense matrices and sparse matrices. Dense or full matrices have few or no zero elements. Sparse matrices have a majority of zero elements . In order to process a matrix input in parallel, we must partition it so that the partitions can be assigned to different processes. If we assume that an addition and multiplication pair takes unit time, then the sequential run time of algorithm is n². For multiplying a dense matrix A of size m x n and a vector x of size n atleast, three distinct parallel formulations of matrix-vector multiplication's are possible. It depends on row-wise striping, column-wise striping, or checkerboard striping of the matrix. The following examples discuss common ways to partition matrices among processes to perform matrix-vector multiplication. Serial algorithm of matrix vector multiplication is explained below:

       void MAT_VECT(int n, int m, float A[ ][ ], float x[ ], float y[ ])
     {
        int   i, j;
           for (i = 0; i<m; i++)
        {
           y[i]= 0.0;
           for (j =0; j<n; j++)
               y[i]= y[i] + A[i][j] * x[j];
        }
     }

        The following examples discuss issues in implementation of parallel matrix-matrix multiplication algorithms.

MATRIX-MATRIX MULTIPLICATION

BACKGROUND

We discuss parallel algorithms for multiplying two dense, square matrices A and B of size n to yield the product matrix. All parallel matrix multiplication algorithms involves the scalar algebraic operations performed on the blocks or submatrices of the original matrix . Such algebraic operations on the submatrices are called block matrix operations . If we assume that an addition and multiplication pair takes unit time, then the sequential run time of conventional algorithm is n3. Serial and parallel algorithms of matrix-matrix multiplication are explained below :

void BLOCK_MAT_MULT ( int n, float A[ ][ ], float B[ ][ ], float C[ ][ ])

{

int i, j, k;

        for (i = 0; i < n-1; i++)
            for (j = 0; j < n-1; j++)
            {
                /* Initialize all elements of C [i, j] to zero */

                    C[i][j] = 0.0;
                    for (k= 0; k < n-1; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
}

Example 13 : Description for implementation of MPI program to compute dot product of two vectors
using block-striped partitioning with uniform data distribution

Objective

Write a MPI program, to compute the vector-vector multiplication on p processors of PARAM 10000 using block-striped partitioning for uniform data distribution . Assume that the vectors are of size n and p is number of processors used and n is a multiple of p.

Description

The partitioning is called block-striped if each process is assigned contiguous elements. The process P₀ gets the first n/p elements, P₁ gets the next n/p elements and so on. The distribution of 16 elements of vector a on 4 processes is shown in Figure 7.

Figure 7 A Typical block-striped partitioning of a vector of size 16 on 4 processes
Initially process with rank 0 distributes the input vectors using MPI_Scatter on p processes. Each process will perform local dot product of the vectors and accumulate the partial dot product. Now the process with rank 0 performs global reduction using MPI_Reduce to get the final dot product.

Input

Process with rank 0 reads a real vectors of size n. Assume that the number of elements n are greater than or equal to number of processes p.
You have to adhere strictly to the following format for the input files.
#Line 1 : Number of Elements (n)
#Line 2 : List of Elements (data).
A sample input file for the vector A is given below
8
1.0   2.0   3.0    4.0     5.0    6.0    7.0    8.0
A sample input file for the vector B is given below
8
1.0   2.0   3.0    4.0     5.0    6.0    7.0    8.0

Output

Process with rank 0 prints the final dot product of two vectors. The dot product for the sample input vectors is given below.
204.0

Example 14 : Description for implementation of MPI program to compute dot product of two vectors
using block-striped partitioning with non-uniform data distribution

Objective

Write a MPI program, to compute the dot product of two vectors on p processors of PARAM 10000 using block-striped partitioning for non-uniform distribution of data .

Description

If p divides n evenly, processes P₀ gets the first n/p elements, P₁ the next n/p elements and so on. If p does not divide n evenly, and r be the remainder, then first r processes get (n/p)+1 elements and remaining p-r processes get n/p elements. The program described in Example 13 uses MPI_Scatter and MPI_Reduce because the vector is equally distributed on p processes. Here, we use MPI_Scatterv call, which distributes the vectors non-uniformly on all processes.

Input

Input is given in the same format as explained in the Example 13

Output

Process with rank 0 prints the final dot product of two vectors and the result are given as in Example 13

Example 15 : Description for implementation of MPI program to compute dot product of two vectors
using block-striped partitioning with cyclic data distribution

Objective

Write a MPI program, to compute the dot product of two vectors on p processors of PARAM 10000 using cyclic data partitioning. Assume that the vectors are of size n and p is number of processors used and n is a multiple of p.

Description

In the Cyclic data partitioning process P₀ gets the first element, process P₁gets the next and so on. The process P_p-1 gets (p-1)^th element. If the number of elements n is more than the number processes p, then process P₀ gets p^th element, process P₁ gets (p+1)^th element and so on. The process is repeated till all the elements are assigned. If n is not a multiple of p, then first r (r = n/p) processes will get n/p +1 elements and remaining p-r processes will get n/p elements, in cyclic fashion. The Figure 8 illustrates the example for p = 4 and n = 16.

Figure 8 A Typical Cyclic data partitioning of a vector of size 16 on 4 processes

Input

Input is given in the same format as explained in the Example 13

Output

Process with rank 0 prints the final dot product of two vectors and the result are given as in Example 13

Example 16 : Description for implementation of MPI program to compute infinity norm of a matrix
using block-striped partitioning with uniform data distribution

Objective

Write a MPI program to calculates infinity norm of a matrix using row wise block-striped partitioning .

Description

The infinity norm of a matrix is defined to be the maximum of sums of absolute values of elements in a row, over all rows. Row-wise block partitioning, is used and the idea is that the matrix m x n is striped among p processors so that each processors stores m/p rows of the matrix. A typical column-wise and row-wise partitioning of the matrix is shown in the Figure 9.

Figure 9 Uniform column-wise and row-wise striped partitioning of 16 x 16 matrix
                on 4 processors.

If a m x n matrix is partitioned on p processes (labeled p₀ , p₁, p₂,..., p_p-₁), then process p_i contains rows with indices (m/p)i, (m/p)i+1, (m/p)i+2, (m/p)i+3,........, (m/p)(i+1)-1.

Input

Assume A is a real matrix of size m x n. Also the number of rows m should be greater than or equal to number of processes p. Process with rank 0 reads input data. Format for the input file is given below.
You have to adhere strictly the following format for the input files.
#Line 1 : Number of Rows (m) Number of columns(n).
#Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first
                 and so on.)

Input file

A sample input file for the matrix A (8 x 8) is given below :
8 8
-1.0   2.0     3.0     4.0       5.0      6.0      7.0      8.0
2.0     3.0     4.0      5.0      6.0      7.0      8.0      9.0
3.0     4.0     5.0      6.0      7.0      8.0      9.0    10.0
4.0     5.0     6.0      7.0      8.0      9.0    10.0    11.0
5.0     6.0     7.0      8.0      9.0    10.0    11.0    12.0
6.0     7.0     8.0      9.0    10.0    11.0    12.0    13.0
7.0     8.0     8.0    10.0    11.0    12.0    13.0    14.0
8.0   -9.0   10.0    11.0    12.0    13.0    14.0    15.0

Output

Process with rank 0 prints the infinity norm value as given below :
92.000

Example 17 : Description for implementation of MPI program to compute the Matrix and Vector
Multiplication using self-scheduling algorithm

Objective

Write a MPI program, for computing the matrix-vector multiplication on p processors of PARAM 10000 using Self Scheduling algorithm.

Description

This example illustrates , one of the most common parallel algorithm prototype , the Self-Scheduling or Master-Slave algorithm. This example is chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but it illustrates the basic MPI send and MPI receive operations in the context of fundamental type of parallel algorithm applicable in many situations.


Figure 10 Communication pattern among master and slaves in self scheduling paradigm.

We assume that the matrix A of size n x n is available with master process of rank 0 and the vector x of size n is available on all the slave processes , whose rank start from 1 onwards .The idea is that one process, which we call the master process, distributes the work load to slave processes. When the slave finishes its workload, it informs the master which assigns a new workload to the slave. This communication between the master and slave is shown in Figure 10. This is very simple paradigm where the co-ordination is done by master. Here, the slave processes do not have to communicate with one another.
The master begins by broadcasting vector x to each slave. It then sends one row of the matrix A to each slave . At this point, the master begins a loop, terminated when it has received all of the entries in the product. The body of the loop consists of receiving one entry in the product vector from any slave , and sending the next task (row of matrix A) to that slave. In other words, completion of one task by a slave is considered to be a request for the next task. Once all the tasks have been handed out, termination messages are sent.
After receiving the broadcast value of vector x , each slave also enters in a loop to compute the product of a matrix row with vector. Each slave computation is terminated by the receipt of the termination message from master. The body of the loop consists of receiving a row of matrix A, performing the dot product with vector x, and sending the answer back to the master.

Input

The input should be in following format.
Assume that the real matrix is of size m x n and the real vector is of size n . Also the number of rows m should be greater than or equal to number of processes p. Process with rank 0 reads the input matrix A and the vector x . Format for the input files are given below.
Input file 1
The input file for the matrix should strictly adhere to the following format.
#Line 1 : Number of Rows (m), Number of columns (n).
#Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first and so on.)
A sample input file for the matrix (8 x 8) is given below
8   8
1.0    2.0    3.0     4.0     2.0     3.0     2.0     1.0
2.0    3.0    4.0     2.0     3.0     2.0     1.0     1.0
3.0    4.0    2.0     3.0     2.0     1.0     1.0     1.0
4.0    2.0    3.0     2.0     1.0     1.0     4.0     2.0
2.0    3.0    2.0     1.0     2.0     3.0     2.0     2.0
3.0    2.0    1.0     1.0     2.0     1.0     1.0     1.0
2.0    1.0    4.0     3.0     2.0     1.0     3.0     3.0
4.0    1.0    2.0     1.0     2.0     3.0     4.0     3.0

Input file 2
The input file for the vector should strictly adhere to the following format.
#Line 1 : Size of the vector (n)
#Line 2 : (data)
A sample input file for the vector (8 x 1) is given below
8
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Output

Process with rank 0 prints the final matrix vector product result is given below.

18.0 18.0 17.0 19.0 17.0   12.0 19.0 20.0

Example 18 : Description for implementation of MPI program to compute the Matrix and Vector
Multiplication using block-striped row-wise partitioning with uniform data distribution

Objective

Write a MPI program, for computing the matrix -vector multiplication on p processor of PARAM 10000 using block striped partitioning of a matrix.

Description

In the striped partitioning of a matrix, the matrix is divided into groups of contiguous complete rows or columns, and each processor is assigned one such group. The partitioning is called block-striped if each processor is assigned contiguous rows or columns. Striped partitioning can be block or cyclic. In a row-wise block striping of an nxn matrix on p processors (labeled P₀ , P₁, P₂,..., P_p-1), processor P_i contains rows with indices (n/p)i, (n/p)i+1, (n/p)i+2, (n/p)i+3,........, (n/p)(i+1)-1. A typical column-wise or row-wise partitioning of 16 x 16 matrix on 4 processors is shown in the Figure 11

Figure 11 Uniform block-striped partitioning of 16 x 16 matrix on 4 processors.
The matrix A of size n x n is striped row-wise among p processes so that each process stores n/p rows of the matrix. We assume that the vector x of size n x 1 is available on each process . Now, process P_i computes the dot product of the corresponding rows of the matrix A[ * ] with the vector x[ * ] and accumulate the partial result in the array y[ * ]. Finally, process P₀ collects the dot product of different rows of the matrix with the vector from all the processes.

Input

The input is given in the same format as explained in the Example 17 . Also the number of rows m should be an exact multiple of the number of processors p. This is assumed so that each process get equal number of rows.

Output

Process with rank 0 prints the final matrix vector product result, same as in Example 17

Example 19 : Description for implementation of MPI program to compute Matrix and Vector
Multiplication using block checkerboard partitioning

Objective

Write a parallel MPI program, for computing the matrix -vector multiplication on p processor of PARAM 10000 using block-checkerboard partitioning . Assume that p is a perfect square number.
Special MPI routines can be used to arrange the processors in a square grid of q x q processors where, q²=p. Also, assume that the size of the matrix is evenly divisible by q.

Description

In checkerboard partitioning, the matrix is divided into smaller square or rectangular blocks (submatrices) that are distributed among processes. A checkerboard partitioning splits both the rows and the columns of the matrix, so no process is assigned any complete row or column . Like striping partitioning, checkerboard partitioning can be block or cyclic. The Figure 12 explains how a 8 x 8 matrix is distributed on 16 processors using block checkerboard and cyclic checkerboard partitioning .

Figure 12 Checkerboard partitioning of 8 x 8 matrices on 16 processors.

We assume that processes are arranged in q x q matrix and processes are stored in row major order. As shown in Figure 12 for square grid of 4 x 4 processors the matrix A of size m x n is partitioned among p processes using block checkerboard partitioning , such that each process stores m/q x n/q block of the matrix A. The vector x is distributed in portions of n/q elements to the first process in each columns of q processes . For example, the processes P₀₀, P₁₀ , P₂₀ and P₃₀ get the n/q elements of the vector x. Each process in the first column of each row in square grid of q x q processes, possesses n/q elements of the vector. Processes P₀₀, P₁₀, P₂₀, P₃₀ broadcast elements of the vector to the other processes in the respective rows of the grid. Each process now stores m/q x n/q blocks of the matrix A and a n/q elements of the vector x.
Each process then performs multiplication of its block matrix with local vector elements and stores the partial result in the vector y[ * ]. Now, each process has resultant vector y of size m/q and the first process in each row of square grid processes q x q will accumulate the partial sum in the same row involving other processes in the same row of square grid of processes. Finally, process P₀gathers accumulated partial sum on each process to obtain the resultant vector y.

Input

Assume that the input is given in the same format as explained in the Example 17

Output

Process with rank 0 prints the final matrix vector product result and the result is same as in Example 17

Example 20 : Description for implementation of MPI program to compute Matrix and Matrix
Multiplication using self-scheduling algorithm

Objective

Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000 using Self-Scheduling algorithm.

Description

This example illustrates , one of the most common parallel algorithm prototype , the Self-Scheduling or Master-Slave algorithm. This example is chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but it illustrates the basic MPI library calls like, MPI send and MPI receive in the context of fundamental type of parallel algorithm applicable in many situations.
We assume that a square matrix A of size n is available on the master process with rank 0 and another square matrix B of size n is available on all the slave processes ,whose rank starts from 1 onwards. The idea is that the master process, distributes the work load to slave processes. When the slave finishes its workload, it informs the master, about the completion of the task assigned and then master assigns a new workload to the slave. This is very simple paradigm where the coordination is done by master. Here, the slave processes do not have to communicate with one another. This communication among the master and the slaves is shown in Figure 13

                          Figure 13 Communication pattern among master and slave in self scheduling paradigm.
The master process with rank 0 stores the final output square matrix C of size n . The master process receives whole row of the product matrix C from any slave, and sends the next task (row of matrix A) to that slave processes. In other words, completion of one task by a slave is considered to be a request for the next task. Once all the tasks have been handed out, termination messages are sent .
Each slave process, after receiving matrix B, enters a loop which is terminated after receiving the termination message from master. The body of the loop consists of receiving the row of the matrix A, forming the entries C_{i, j} of required row by doing computation with the received row of the matrix A with the available matrix B , and sending the computed row of the product matrix.
.

Input

Assume A and B are real square matrices of size n and the number of rows n should be greater than or equal to number of processes p. Process 0 reads the input matrices A and B. Format for the input file is given below.
The input file for the matrices A and B should strictly adhere to the following format.
#Line 1 : Number of Rows (n) Number of columns(n).
#Line 2 : (data) (in row-major order. This means that the data of second row follows that of the first and so on.)

Input file 1

A sample input file for the matrix A (8 x 8) is given below
8   8
1.0    2.0    3.0     4.0     2.0     3.0     2.0     1.0
2.0    3.0    4.0     2.0     3.0     2.0     1.0     1.0
3.0    4.0    2.0     3.0     2.0     1.0     1.0     1.0
4.0    2.0    3.0     2.0     1.0     1.0     4.0     2.0
2.0    3.0    2.0     1.0     2.0     3.0     2.0     2.0
3.0    2.0    1.0     1.0     2.0     1.0     1.0     1.0
2.0    1.0    4.0     3.0     2.0     1.0     3.0     3.0
4.0    1.0    2.0     1.0     2.0     3.0     4.0     3.0

Input file 2

A sample input file for the matrix B (8 x 8) is given below
8   8
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0
1.0    2.0    3.0     4.0     2.0     1.0     2.0     1.0

Output

Process 0 prints final matrix-matrix product matrix C (8 x 8) . The output of the matrix - matrix multiplication is given below

18.0   36.0   54.0   72.0   36.0   18.0    36.0   18.0
18.0   36.0   54.0   72.0   36.0   18.0    36.0   18.0
17.0   34.0   51.0   68.0   34.0   17.0    34.0   17.0
19.0   38.0   57.0   76.0   38.0   19.0    38.0   19.0
17.0   34.0   51.0   68.0   34.0   17.0    34.0   17.0
12.0   24.0   36.0   48.0   24.0   12.0    24.0   12.0
19.0   38.0   57.0   76.0   38.0   19.0    38.0   19.0
20.0   40.0   60.0   80.0   40.0   20.0    40.0   20.0

Example 21 : Description for implementation of MPI program to compute Matrix and Matrix
Multiplication using block checkerboard partitioning and MPI Cartesian topology

Objective

Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000 using block checkerboard partitioning of the matrices . Special MPI routines on cartesian topology can be used for checkerboard partitioning of matrices.
Assume that p= q²and the size of square matrices A and B is evenly divisible by q.

Description

Assume that A and B are square matrices of size n and C be the output matrix. These matrices are dived into blocks or submatrices to perform matrix-matrix operations in parallel .For example, an n x n matrix A can be regarded as q x q array of blocks A_{i, j} (0<=i <q, 0<= j < q) such that each block is an (n/q) x (n/q) submatrix. We use p processors to implement the block version of matrix multiplication in parallel by choosing q as a square root of p and compute a distinct block C_i,
j on each processor. Block and cyclic checkerboard partitioning of a 8 x 8 matrix on a square grid (4 x 4) of processors is shown in the Figure 14

Figure 14 Checkerboard partitioning of 8 x 8 matrices on 16 processors.

The matrices A and B are partitioned into p blocks, A_{i, j}and B _{i, j}(0<=i < q, 0<= j < q) of size (n/q x n/q) on each process. These blocks are mapped onto a q x q logical mesh of processes. The processes are labeled from P_0,0 to P_q-1,q-1. An example of of this situation is shown in Figure 14 Process P_{i, j} initially store block matrices A_{i, j}and B_{i, j}and computes block C_{i, j}of result matrix. To compute submatrix C_{i, j} , we need all submatrices , A_i,
k and B_{k, j} ( 0 <= k < q ) . To acquire all the required blocks, an all-to-all broadcast of matrix A_{i, j}'s is performed in each row and similarly in each column of matrix B_{i, j}'s. MPI collective communication is used to perform this operations.
After P_{i, j} acquires, A _i,0, A_i,1, A _i,2, A _{i, q-1}and B_{0, j}, B_{1, j}, B_{2, j}, B_q_-1,
j, it performs the serial block matrix to matrix multiplication and accumulates the partial block matrix C_{i, j} of matrix C . To obtain the resultant product matrix C, processes with rank 0 gathers all the block matrices by using MPI_Gather collective communication operation.
MPI provides a set of special routines to virtual topologies. An important virtual topology is the Cartesian topology . This is simply a decomposition in the natural co-ordinate ( e.g., x,y,z) directions.

Input

The input is given in the same format as explained in Example 20 . Assume that the number of processes is a perfect square.

Output

Process with rank 0 prints the final matrix-matrix product and the result are given as in Example 20

Example 22 : Description for implementation of MPI program to compute Matrix and Matrix
Multiplication using block checkerboard partitioning and Cannon Algorithm

Objective

Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000. Use a special MPI routines on cartesian topology for block checkerboard partitioning of the matrices, Cannon's Algorithm.
Assume that p= q²and the size of square matrices A and B is evenly divisible by q.

Description

Cannon's algorithm is based on cartesian virtual topology. As discussed in Example 21, there are p processors arranged in q x q square grid of processors and the input matrices, A and B are distributed among the processes in checkerboard fashion. It results in constructing p block matrices of A and B. It uses only point-to-point communication for circularly shifting blocks of matrix A and matrix B among p processes.
The algorithm proceeds in q stages. The first step in this algorithm is to perform initial allignment of the block matrix A and block matrix B. The blocks of matrix A are circularly shifted to the i positions to left in the row of the square grid of processes, where i is the row number of the process in the mesh. Similarly, blocks of matrix B are circularly shifted j positions upwards, where j is the column number of the process in the processes mesh. This operation is performed by using MPI_Sendrecv_replace . MPI_Send and MPI_Recv is not used for point-to-point communication, because if all the processes call MPI_Send or MPI_Recv in different order the deadlocked situation may arise .
The algorithm performs the following steps in each stage :
1. Multiply the block of matrix A and matrix B and add the resultant matrix to get the block matrix C,
    which is initially set to zero.
2. Circularly shift the blocks of matrix A to left in the rows of the processes and the blocks of
     matrix B upwards in the columns of the square grid of processes in a wrap around manner.
The communication steps for 4 x 4 square grid of processors mesh are explained in the Figure 15.


Figure 15 The communication steps in Cannon's Algorithm on 16 processors.

Input

The input is given in the same format as explained in Example 20

Output

Process with rank 0 prints the final product matrix and the results are given as in Example 20

Example 23 : Description for implementation of MPI program to compute Matrix and Matrix
Multiplication using block checkerboard partitioning and Fox Algorithm

Objective

Write a MPI program, for computing the matrix-matrix multiplication on p processors of PARAM 10000 .Use a special MPI routines on cartesian topology for block checkerboard partitioning of the matrices, Fox's Algorithm.
Assume that p= q²and the size of square matrices A and B is evenly divisible by q.

Description

This algorithm is based on cartesian virtual topology. As discussed in Example 21, there are p processes arranged in q x q mesh and the input matrices are distributed among the processes in checkerboard fashion. It results in constructing p block matrices of A and B. The algorithm uses two types of communication, firstly it uses one-to-all broadcast for blocks of matrix A in processes rows. Similarly blocks of matrix B are circularly shifted j positions upwards, where j is the column of the process in the processes mesh using MPI_Sendrecv_replace. In this case MPI_Send and MPI_Recv is not used for point-point communication because if all the processes call MPI_Send or MPI_Recv in different order the deadlocked situation may arise.
The algorithm proceeds in n = q stages and it performs the following step in each stages .
1. Broadcast the block of matrix A on the diagonal process of a particular row in the square grid of
     processes to other processes in the same row.
2. Multiply the block of matrix A and matrix B and add the resultant matrix to get the block matrix C,
     which is initially set to zero.
3. Circularly shift the blocks of matrix B upwards in the processes columns and receives a fresh block
     of matrix B from the process below it. This selection is done in wrap around manner.
4. Select the block of matrix A for the next row broadcast. If A_{i, j} was broadcast in the current step,
     then select A_{i,(j+1)mod q} ^th block for the next broadcast. This selection is also done in
     wrap around manner.
The communication steps for 4 x 4 square grid of processors are explained in the Figure 16

Figure 16 The communication steps in Fox's Algorithm on 16 processors.

Input

The input is given in the same format as explained in Example 20


Output

Process with rank 0 prints the final matrix-matrix product and the result are given as in Example 20

Contents       Hands-on Session       MPI Library Calls       MPI On Web