VECTORVECTOR MULTIPLICATION
BACKGROUND
Vectors can be partitioned in different ways and the partitions
can be assigned to different process. We briefly explain some well known
partitioning techniques to write parallel programs. In the striped partitioning
of a vector, the vector is divided into groups of contiguous elements,
and each process is assigned one such group.
A serial algorithm for vectorvector multiplication requires n
multiplications and (n1 ) additions .The serial program is given
below :
float VECT_VECT
(int n, float x[ ], float y[ ])
{
int i;
float dot_product;
dot_product = 0.0;
for(i = 0; i < n; i++)
dot_product = dot_product + (x[i] * y[i]);
return(dot_product);
}
MATRIXVECTOR MULTIPLICATION
BACKGROUND
Matrices can be classified into two broad categories,
dense matrices and sparse matrices. Dense or full
matrices have few or no zero elements. Sparse matrices
have a majority of zero elements . In order to process a matrix input
in parallel, we must partition it so that the partitions can be assigned
to different processes. If we assume that an addition and multiplication
pair takes unit time, then the sequential run time of algorithm is n^{2}.
For multiplying a dense matrix A of size m x n and
a vector x of size n atleast, three distinct
parallel formulations of matrixvector multiplication's are possible.
It depends on rowwise striping, columnwise striping,
or checkerboard striping of the matrix. The following examples
discuss common ways to partition matrices among processes to perform matrixvector
multiplication. Serial algorithm of matrix vector multiplication
is explained below:
void
MAT_VECT(int n, int m, float A[ ][ ], float x[ ],
float y[ ])
{
int i, j;
for (i = 0; i<m; i++)
{
y[i]= 0.0;
for (j =0; j<n; j++)
y[i]= y[i] + A[i][j] * x[j];
}
}
The following examples discuss
issues in implementation of parallel matrixmatrix multiplication algorithms.
MATRIXMATRIX
MULTIPLICATION
BACKGROUND
We discuss parallel algorithms for multiplying two dense, square
matrices A and B of size n to yield the product matrix.
All parallel matrix multiplication algorithms involves the scalar algebraic
operations performed on the blocks or submatrices of the original matrix
. Such algebraic operations on the submatrices are called block
matrix operations . If we assume that an addition and multiplication
pair takes unit time, then the sequential run time of conventional algorithm
is n3. Serial and parallel algorithms of matrixmatrix multiplication
are explained below :
void
BLOCK_MAT_MULT ( int n, float A[ ][ ], float B[ ][ ], float C[ ][ ])
{
int i, j, k;
for (i = 0; i < n1; i++)
for (j = 0; j < n1; j++)
{
/* Initialize all elements of C [i, j]
to zero */
C[i][j] = 0.0;
for (k= 0; k < n1; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
Example
13 : Description for implementation of MPI program to compute dot product of two vectors using blockstriped partitioning with uniform data distribution

Objective
Write a MPI program, to compute the vectorvector multiplication on
p processors of PARAM 10000 using blockstriped partitioning
for uniform data distribution . Assume that the vectors
are of size n and p is number of processors used and n
is a multiple of p.

Description
The partitioning is called blockstriped if each process is assigned
contiguous elements. The process P_{0} gets the first n/p
elements, P_{1} gets the next n/p elements and so
on. The distribution of 16 elements of vector a on 4 processes
is shown in Figure 7.
Figure 7 A Typical blockstriped partitioning of a vector
of size 16 on 4 processes
Initially process with rank 0 distributes the input vectors using
MPI_Scatter on p processes. Each process will perform local
dot product of the vectors and accumulate the partial dot product. Now
the process with rank 0 performs global reduction using MPI_Reduce
to get the final dot product.

Input
Process with rank 0 reads a real vectors of size n. Assume
that the number of elements n are greater than or equal to number
of processes p.
You have to adhere strictly to the following format for the input files.
#Line 1 : Number of Elements (n)
#Line 2 : List of Elements (data).
A sample input file for the vector A is given below
8
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
A sample input file for the vector B is given below
8
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0

Output
Process with rank 0 prints the final dot product of two
vectors. The dot product for the sample input vectors is given below.
204.0
Example
14 : Description for implementation of MPI program to compute dot product of two vectors using blockstriped partitioning with nonuniform data distribution

Objective
Write a MPI program, to compute the dot product of two vectors on p
processors of PARAM 10000 using blockstriped partitioning
for nonuniform distribution of data .

Description
If p divides n evenly, processes P_{0} gets the
first n/p elements, P_{1} the next n/p
elements and so on. If p does not divide n evenly, and
r be the remainder, then first r processes get (n/p)+1
elements and remaining pr processes get n/p elements.
The program described in Example 13 uses MPI_Scatter and MPI_Reduce
because the vector is equally distributed on p processes. Here,
we use MPI_Scatterv call, which distributes the vectors nonuniformly
on all processes.

Input
Input is given in the same format as explained in the Example
13

Output
Process with rank 0 prints the final dot product of two vectors and
the result are given as in Example
13
Example
15 : Description for implementation of MPI program to compute dot product of two vectors using blockstriped partitioning with cyclic data distribution

Objective
Write a MPI program, to compute the dot product of two vectors on p
processors of PARAM 10000 using cyclic data partitioning. Assume
that the vectors are of size n and p is number of processors
used and n is a multiple of p.

Description
In the Cyclic data partitioning process P_{0} gets
the first element, process P_{1 }gets the next and so on. The
process P_{p1} gets (p1)^{th} element.
If the number of elements n is more than the number processes p,
then process P_{0} gets p^{th} element,
process P_{1} gets (p+1)^{th} element and so on.
The process is repeated till all the elements are assigned. If n
is not a multiple of p, then first r (r = n/p)
processes will get n/p +1 elements and remaining pr
processes will get n/p elements, in cyclic fashion.
The Figure 8 illustrates the example for p = 4 and n
= 16.
Figure 8 A Typical Cyclic data partitioning of a vector
of size 16 on 4 processes
Input is given in the same format as explained in the Example
13

Output
Process with rank 0 prints the final dot product of two vectors and
the result are given as in Example
13
Example
16 : Description for implementation of MPI program to compute infinity norm of a matrix using blockstriped partitioning with uniform data distribution

Objective
Write a MPI program to calculates infinity norm of a matrix
using row wise blockstriped partitioning .

Description
The infinity norm of a matrix is defined to be the maximum of
sums of absolute values of elements in a row, over all rows. Rowwise block
partitioning, is used and the idea is that the matrix m x n is
striped among p processors so that each processors stores m/p
rows of the matrix. A typical columnwise and rowwise
partitioning of the matrix is shown in the Figure 9.
Figure 9 Uniform columnwise and rowwise striped partitioning
of 16 x 16 matrix
on 4 processors.
If a m x n matrix is partitioned on p processes (labeled
p_{0} , p_{1}, p_{2},...,
p_{p}_{1}), then process p_{i} contains
rows with indices (m/p)i, (m/p)i+1, (m/p)i+2,
(m/p)i+3,........, (m/p)(i+1)1.

Input
Assume A is a real matrix of size m x n. Also the
number of rows m should be greater than or equal to number of processes
p. Process with rank 0 reads input data. Format for
the input file is given below.
You have to adhere strictly the following format for the input files.
#Line 1 : Number of Rows (m) Number of columns(n).
#Line 2 : (data) (in rowmajor order. This means that the data
of second row follows that of the first
and so on.)

Input file
A sample input file for the matrix A (8 x 8) is given below :
8 8
1.0 2.0 3.0
4.0 5.0
6.0 7.0 8.0
2.0 3.0 4.0
5.0 6.0 7.0
8.0 9.0
3.0 4.0 5.0
6.0 7.0 8.0
9.0 10.0
4.0 5.0 6.0
7.0 8.0 9.0
10.0 11.0
5.0 6.0 7.0
8.0 9.0 10.0
11.0 12.0
6.0 7.0 8.0
9.0 10.0 11.0 12.0
13.0
7.0 8.0 8.0
10.0 11.0 12.0 13.0
14.0
8.0 9.0 10.0 11.0
12.0 13.0 14.0 15.0

Output
Process with rank 0 prints the infinity norm value as given below
:
92.000
Example 17 : Description for implementation of MPI program to compute the Matrix and Vector Multiplication using selfscheduling algorithm
 Objective
Write a MPI program, for computing the matrixvector multiplication
on p processors of PARAM 10000 using Self Scheduling
algorithm.

Description
This example illustrates , one of the most common parallel algorithm
prototype , the SelfScheduling or MasterSlave algorithm.
This example is chosen not because it illustrates the best way to parallelize
this particular numerical computation (it doesn't), but it illustrates
the basic MPI send and MPI receive
operations in the context of fundamental type of parallel algorithm
applicable in many situations.
Figure 10 Communication pattern among master and
slaves in self scheduling paradigm.
We assume that the matrix A of size n x n
is available with master process of rank 0 and the vector x
of size n is available on all the slave processes , whose
rank start from 1 onwards .The idea is that one process, which we call
the master process, distributes the work load to slave
processes. When the slave finishes its workload, it informs the master
which assigns a new workload to the slave. This communication between
the master and slave is shown in Figure 10. This is very
simple paradigm where the coordination is done by master. Here,
the slave processes do not have to communicate with one another.
The master begins by broadcasting vector x to each slave.
It then sends one row of the matrix A to each slave .
At this point, the master begins a loop, terminated when it has received
all of the entries in the product. The body of the loop consists of receiving
one entry in the product vector from any slave , and sending the
next task (row of matrix A) to that slave. In
other words, completion of one task by a slave is considered to
be a request for the next task. Once all the tasks have been handed out,
termination messages are sent.
After receiving the broadcast value of vector x , each slave
also enters in a loop to compute the product of a matrix row with vector.
Each slave computation is terminated by the receipt of the termination
message from master. The body of the loop consists of receiving
a row of matrix A, performing the dot product with
vector x, and sending the answer back to the master.

Input
The input should be in following format.
Assume that the real matrix is of size m x n
and the real vector is of size n . Also the number of rows
m should be greater than or equal to number of processes p.
Process with rank 0 reads the input matrix A and the
vector x . Format for the input files are given below.
Input file 1
The input file for the matrix should strictly adhere to the following
format.
#Line 1 : Number of Rows (m), Number of columns (n).
#Line 2 : (data) (in rowmajor order. This means
that the data of second row follows that of the first and so on.)
A sample input file for the matrix (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 3.0
2.0 1.0
2.0 3.0 4.0
2.0 3.0 2.0
1.0 1.0
3.0 4.0 2.0
3.0 2.0 1.0
1.0 1.0
4.0 2.0 3.0
2.0 1.0 1.0
4.0 2.0
2.0 3.0 2.0
1.0 2.0 3.0
2.0 2.0
3.0 2.0 1.0
1.0 2.0 1.0
1.0 1.0
2.0 1.0 4.0
3.0 2.0 1.0
3.0 3.0
4.0 1.0 2.0
1.0 2.0 3.0
4.0 3.0
Input file 2
The input file for the vector should strictly adhere to the following
format.
#Line 1 : Size of the vector (n)
#Line 2 : (data)
A sample input file for the vector (8 x 1) is given below
8
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Output
Process with rank 0 prints the final matrix vector product result is
given below.
18.0 18.0 17.0 19.0 17.0 12.0
19.0 20.0
Example
18 : Description for implementation of MPI program to compute the Matrix and Vector Multiplication using blockstriped rowwise partitioning with uniform data distribution

Objective
Write a MPI program, for computing the matrix vector multiplication
on p processor of PARAM 10000 using block striped partitioning
of a matrix.

Description
In the striped partitioning of a matrix, the matrix is divided
into groups of contiguous complete rows or columns, and each
processor is assigned one such group. The partitioning is called blockstriped
if each processor is assigned contiguous rows or columns.
Striped partitioning can be block or cyclic. In a
rowwise block striping of an nxn matrix on
p processors (labeled P_{0} , P_{1}, P_{2},...,
P_{p1}), processor P_{i} contains
rows with indices (n/p)i, (n/p)i+1,
(n/p)i+2, (n/p)i+3,........,
(n/p)(i+1)1. A typical columnwise or
rowwise partitioning of 16 x 16 matrix on 4 processors is shown
in the Figure 11
Figure 11 Uniform blockstriped partitioning of 16 x
16 matrix on 4 processors.
The matrix A of size n x n is striped rowwise
among p processes so that each process stores n/p rows
of the matrix. We assume that the vector x of size n x
1 is available on each process . Now, process P_{i}
computes the dot product of the corresponding rows of the matrix
A[ * ] with the vector x[ * ] and accumulate
the partial result in the array y[ * ]. Finally, process P_{0}
collects the dot product of different rows of the matrix with
the vector from all the processes.
The input is given in the same format as explained in the Example
17 . Also the number of rows m should be an exact multiple
of the number of processors p. This is assumed so that each process
get equal number of rows.

Output
Process with rank 0 prints the final matrix vector product result, same
as in Example 17
Example
19 : Description for implementation of MPI program to compute Matrix and Vector Multiplication using block checkerboard partitioning

Objective
Write a parallel MPI program, for computing the matrix vector multiplication
on p processor of PARAM 10000 using blockcheckerboard
partitioning . Assume that p is a perfect square number.
Special MPI routines can be used to arrange the processors in a square
grid of q x q processors where, q^{2}=p. Also,
assume that the size of the matrix is evenly divisible by q.

Description
In checkerboard partitioning, the matrix is divided into smaller
square or rectangular blocks (submatrices) that are
distributed among processes. A checkerboard partitioning splits
both the rows and the columns of the matrix, so no process
is assigned any complete row or column . Like striping
partitioning, checkerboard partitioning can be block or cyclic.
The Figure 12 explains how a 8 x 8 matrix is distributed on 16 processors
using block checkerboard and cyclic checkerboard partitioning
.
Figure 12 Checkerboard partitioning of 8 x 8 matrices
on 16 processors.
We assume that processes are arranged in q x q
matrix and processes are stored in row major order. As shown in Figure
12 for square grid of 4 x 4 processors the matrix A of size m
x n is partitioned among p processes using
block checkerboard partitioning , such that each process stores
m/q x n/q block of the matrix A. The vector x
is distributed in portions of n/q elements to the first process
in each columns of q processes . For example, the processes
P_{00}, P_{10} , P_{20} and P_{30} get
the n/q elements of the vector x. Each process in the first
column of each row in square grid of q x q processes, possesses
n/q elements of the vector. Processes P_{00}, P_{10}, P_{20},
P_{30} broadcast elements of the vector to the other processes
in the respective rows of the grid. Each process now stores
m/q x n/q blocks of the matrix A and a
n/q elements of the vector x.
Each process then performs multiplication of its block matrix with
local vector elements and stores the partial result in the vector y[
* ]. Now, each process has resultant vector y of size m/q and
the first process in each row of square grid processes q x q will accumulate
the partial sum in the same row involving other processes in the
same row of square grid of processes. Finally, process P_{0
}gathers accumulated partial sum on each process to obtain the resultant
vector y.

Input
Assume that the input is given in the same format as explained in the
Example
17

Output
Process with rank 0 prints the final matrix vector product result and
the result is same as in Example
17
Example
20 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using selfscheduling algorithm

Objective
Write a MPI program, for computing the matrixmatrix multiplication
on p processors of PARAM 10000 using SelfScheduling
algorithm.

Description
This example illustrates , one of the most common parallel algorithm
prototype , the SelfScheduling or MasterSlave algorithm.
This example is chosen not because it illustrates the best way to parallelize
this particular numerical computation (it doesn't), but it illustrates
the basic MPI library calls like, MPI send and MPI
receive in the context of fundamental type of parallel
algorithm applicable in many situations.
We assume that a square matrix A of size n is available
on the master process with rank 0 and another square matrix B
of size n is available on all the slave processes ,whose
rank starts from 1 onwards. The idea is that the master process,
distributes the work load to slave processes. When the slave finishes
its workload, it informs the master, about the completion of the task assigned
and then master assigns a new workload to the slave. This is very simple
paradigm where the coordination is done by master. Here, the
slave processes do not have to communicate with one another. This communication
among the master and the slaves is shown in Figure 13
Figure 13 Communication pattern among master and slave in
self scheduling paradigm.
The master process with rank 0 stores the final output square matrix
C of size n . The master process receives whole row of the
product matrix C from any slave, and sends the next task
(row of matrix A) to that slave processes. In other
words, completion of one task by a slave is considered to be a request
for the next task. Once all the tasks have been handed out, termination
messages are sent .
Each slave process, after receiving matrix B,
enters a loop which is terminated after receiving the termination message
from master. The body of the loop consists of receiving the row
of the matrix A, forming the entries C_{i, j} of
required row by doing computation with the received row of the matrix A
with the available matrix B , and sending the computed
row of the product matrix.
.

Input
Assume A and B are real square matrices of size n and
the number of rows n should be greater than or equal to number
of processes p. Process 0 reads the input matrices
A and B. Format for the input file is given below.
The input file for the matrices A and B should strictly
adhere to the following format.
#Line 1 : Number of Rows (n) Number of columns(n).
#Line 2 : (data) (in rowmajor order. This means that the data
of second row follows that of the first and so on.)

Input file 1
A sample input file for the matrix A (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 3.0
2.0 1.0
2.0 3.0 4.0
2.0 3.0 2.0
1.0 1.0
3.0 4.0 2.0
3.0 2.0 1.0
1.0 1.0
4.0 2.0 3.0
2.0 1.0 1.0
4.0 2.0
2.0 3.0 2.0
1.0 2.0 3.0
2.0 2.0
3.0 2.0 1.0
1.0 2.0 1.0
1.0 1.0
2.0 1.0 4.0
3.0 2.0 1.0
3.0 3.0
4.0 1.0 2.0
1.0 2.0 3.0
4.0 3.0

Input file 2
A sample input file for the matrix B (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0

Output
Process 0 prints final matrixmatrix product matrix C
(8 x 8) . The output of the matrix  matrix multiplication is given
below
18.0 36.0 54.0 72.0
36.0 18.0 36.0 18.0
18.0 36.0 54.0 72.0
36.0 18.0 36.0 18.0
17.0 34.0 51.0 68.0
34.0 17.0 34.0 17.0
19.0 38.0 57.0 76.0
38.0 19.0 38.0 19.0
17.0 34.0 51.0 68.0
34.0 17.0 34.0 17.0
12.0 24.0 36.0 48.0
24.0 12.0 24.0 12.0
19.0 38.0 57.0 76.0
38.0 19.0 38.0 19.0
20.0 40.0 60.0 80.0
40.0 20.0 40.0 20.0
Example
21 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and MPI Cartesian topology

Objective
Write a MPI program, for computing the matrixmatrix multiplication
on p processors of PARAM 10000 using block checkerboard partitioning
of the matrices . Special MPI routines on cartesian topology can
be used for checkerboard partitioning of matrices.
Assume that p= q^{2 }and the size of square
matrices A and B is evenly divisible by q.

Description
Assume that A and B are square matrices of size
n and C be the output matrix. These matrices are dived
into blocks or submatrices to perform matrixmatrix operations in parallel
.For example, an n x n matrix A can be regarded as
q x q array of blocks A_{i, j} (0<=i
<q, 0<= j < q) such that each block is an (n/q)
x (n/q) submatrix. We use p processors to implement the
block version of matrix multiplication in parallel by choosing q
as a square root of p and compute a distinct block C_{i,
j} on each processor. Block and cyclic checkerboard
partitioning of a 8 x 8 matrix on a square grid (4 x 4) of processors
is shown in the Figure 14
Figure 14 Checkerboard partitioning of 8 x 8 matrices on 16 processors.
The matrices A and B are partitioned into p
blocks, A_{ }_{i, j }and
B _{i, j }(0<=i < q, 0<=
j < q) of size (n/q x n/q) on each
process. These blocks are mapped onto a q x q logical
mesh of processes. The processes are labeled from P_{0,0} to
P_{q1,q1}. An example of of this situation is shown in
Figure 14 Process P_{i, j} initially store block
matrices A_{i, j }and B_{i, j }and
computes block C_{i, j }of result matrix. To compute
submatrix C_{i, j} , we need all submatrices , A_{i,
k} and B_{k, j} ( 0 <= k <
q ) . To acquire all the required blocks, an alltoall broadcast of
matrix A_{i, j }'s is performed in each row
and similarly in each column of matrix B_{i, j}'s.
MPI collective communication is used to perform this operations.
After P_{i, j} acquires, A _{i,0 },
A_{ i,1 }, A _{i,2 },
A _{i, q1 }and B_{0, j },
B_{1, j }, B_{2, j }, B_{q}_{1,
j }, it performs the serial block matrix to matrix multiplication
and accumulates the partial block matrix C_{i, j}
of matrix C . To obtain the resultant product matrix C, processes
with rank 0 gathers all the block matrices by using MPI_Gather
collective communication operation.
MPI provides a set of special routines to virtual
topologies. An important virtual topology is the Cartesian
topology . This is simply a decomposition in the natural coordinate
( e.g., x,y,z) directions.

Input
The input is given in the same format as explained in Example
20 . Assume that the number of processes is a perfect
square.

Output
Process with rank 0 prints the final matrixmatrix
product and the result are given as in Example
20
Example 22 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and Cannon Algorithm

Objective
Write a MPI program, for computing the matrixmatrix multiplication
on p processors of PARAM 10000. Use a special MPI routines on cartesian
topology for block checkerboard partitioning of the matrices, Cannon's
Algorithm.
Assume that p= q^{2 }and the size of square matrices
A and B is evenly divisible by q.

Description
Cannon's algorithm is based on cartesian virtual topology. As
discussed in Example 21,
there are p processors arranged in q x q square grid
of processors and the input matrices, A and B are distributed
among the processes in checkerboard fashion. It results in constructing
p block matrices of A and B. It uses only pointtopoint
communication for circularly shifting blocks of matrix A and
matrix B among p processes.
The algorithm proceeds in q stages. The first step in this algorithm
is to perform initial allignment of the block matrix A and block
matrix B. The blocks of matrix A are circularly shifted to
the i positions to left in the row of the square grid of
processes, where i is the row number of the
process in the mesh. Similarly, blocks of matrix B are circularly
shifted j positions upwards, where j is the column
number of the process in the processes mesh. This operation
is performed by using MPI_Sendrecv_replace . MPI_Send
and MPI_Recv is not used for pointtopoint communication, because
if all the processes call MPI_Send or MPI_Recv in
different order the deadlocked situation may arise .
The algorithm performs the following steps in each stage :
1. Multiply the block of matrix A and matrix B and add
the resultant matrix to get the block matrix C,
which is initially set to zero.
2. Circularly shift the blocks of matrix A to left in the rows
of the processes and the blocks of
matrix B upwards in the columns of
the square grid of processes in a wrap around manner.
The communication steps for 4 x 4 square grid of processors mesh are
explained in the Figure 15.
Figure 15 The communication steps in Cannon's Algorithm
on 16 processors.

Input
The input is given in the same format as explained in Example
20

Output
Process with rank 0 prints the final product matrix
and the results are given as in Example
20
Example 23 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and Fox Algorithm

Objective
Write a MPI program, for computing the matrixmatrix multiplication
on p processors of PARAM 10000 .Use a special MPI routines on cartesian
topology for block checkerboard partitioning of the matrices, Fox's Algorithm.
Assume that p= q^{2 }and the size of square matrices
A and B is evenly divisible by q.

Description
This algorithm is based on cartesian virtual topology. As discussed
in Example 21, there
are p processes arranged in q x q mesh and the input
matrices are distributed among the processes in checkerboard fashion. It
results in constructing p block matrices of A and B.
The algorithm uses two types of communication, firstly it uses onetoall
broadcast for blocks of matrix A in processes rows. Similarly
blocks of matrix B are circularly shifted j positions upwards,
where j is the column of the process in the processes mesh
using MPI_Sendrecv_replace. In this case MPI_Send and MPI_Recv
is not used for pointpoint communication because if all the processes
call MPI_Send or MPI_Recv in different order the deadlocked
situation may arise.
The algorithm proceeds in n = q stages and it performs
the following step in each stages .
1. Broadcast the block of matrix A on the diagonal process
of a particular row in the square grid of
processes to other processes in the same
row.
2. Multiply the block of matrix A and matrix B
and add the resultant matrix to get the block matrix C,
which is initially set to zero.
3. Circularly shift the blocks of matrix B upwards in
the processes columns and receives a fresh block
of matrix B from the process below it. This
selection is done in wrap around manner.
4. Select the block of matrix A for the next row
broadcast. If A_{i, j} was broadcast in the current
step,
then select A_{i,(j+1)mod q}
^{th} block for the next broadcast. This selection is also done
in
wrap around manner.
The communication steps for 4 x 4 square grid of processors are explained
in the Figure 16
Figure 16 The communication steps in Fox's Algorithm
on 16 processors.

Input
The input is given in the same format as explained in Example
20

Output
Process with rank 0 prints the final matrixmatrix
product and the result are given as in Example
20
Contents
Handson Session
MPI Library Calls
MPI On Web
