VECTOR-VECTOR MULTIPLICATION
BACKGROUND
Vectors can be partitioned in different ways and the partitions
can be assigned to different process. We briefly explain some well known
partitioning techniques to write parallel programs. In the striped partitioning
of a vector, the vector is divided into groups of contiguous elements,
and each process is assigned one such group.
A serial algorithm for vector-vector multiplication requires n
multiplications and (n-1 ) additions .The serial program is given
below :
float VECT_VECT
(int n, float x[ ], float y[ ])
{
int i;
float dot_product;
dot_product = 0.0;
for(i = 0; i < n; i++)
dot_product = dot_product + (x[i] * y[i]);
return(dot_product);
}
MATRIX-VECTOR MULTIPLICATION
BACKGROUND
Matrices can be classified into two broad categories,
dense matrices and sparse matrices. Dense or full
matrices have few or no zero elements. Sparse matrices
have a majority of zero elements . In order to process a matrix input
in parallel, we must partition it so that the partitions can be assigned
to different processes. If we assume that an addition and multiplication
pair takes unit time, then the sequential run time of algorithm is n2.
For multiplying a dense matrix A of size m x n and
a vector x of size n atleast, three distinct
parallel formulations of matrix-vector multiplication's are possible.
It depends on row-wise striping, column-wise striping,
or checkerboard striping of the matrix. The following examples
discuss common ways to partition matrices among processes to perform matrix-vector
multiplication. Serial algorithm of matrix vector multiplication
is explained below:
void
MAT_VECT(int n, int m, float A[ ][ ], float x[ ],
float y[ ])
{
int i, j;
for (i = 0; i<m; i++)
{
y[i]= 0.0;
for (j =0; j<n; j++)
y[i]= y[i] + A[i][j] * x[j];
}
}
The following examples discuss
issues in implementation of parallel matrix-matrix multiplication algorithms.
MATRIX-MATRIX
MULTIPLICATION
BACKGROUND
We discuss parallel algorithms for multiplying two dense, square
matrices A and B of size n to yield the product matrix.
All parallel matrix multiplication algorithms involves the scalar algebraic
operations performed on the blocks or submatrices of the original matrix
. Such algebraic operations on the submatrices are called block
matrix operations . If we assume that an addition and multiplication
pair takes unit time, then the sequential run time of conventional algorithm
is n3. Serial and parallel algorithms of matrix-matrix multiplication
are explained below :
void
BLOCK_MAT_MULT ( int n, float A[ ][ ], float B[ ][ ], float C[ ][ ])
{
int i, j, k;
for (i = 0; i < n-1; i++)
for (j = 0; j < n-1; j++)
{
/* Initialize all elements of C [i, j]
to zero */
C[i][j] = 0.0;
for (k= 0; k < n-1; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
Example
13 : Description for implementation of MPI program to compute dot product of two vectors using block-striped partitioning with uniform data distribution
-
Objective
Write a MPI program, to compute the vector-vector multiplication on
p processors of PARAM 10000 using block-striped partitioning
for uniform data distribution . Assume that the vectors
are of size n and p is number of processors used and n
is a multiple of p.
-
Description
The partitioning is called block-striped if each process is assigned
contiguous elements. The process P0 gets the first n/p
elements, P1 gets the next n/p elements and so
on. The distribution of 16 elements of vector a on 4 processes
is shown in Figure 7.
Figure 7 A Typical block-striped partitioning of a vector
of size 16 on 4 processes
Initially process with rank 0 distributes the input vectors using
MPI_Scatter on p processes. Each process will perform local
dot product of the vectors and accumulate the partial dot product. Now
the process with rank 0 performs global reduction using MPI_Reduce
to get the final dot product.
-
Input
Process with rank 0 reads a real vectors of size n. Assume
that the number of elements n are greater than or equal to number
of processes p.
You have to adhere strictly to the following format for the input files.
#Line 1 : Number of Elements (n)
#Line 2 : List of Elements (data).
A sample input file for the vector A is given below
8
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
A sample input file for the vector B is given below
8
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
-
Output
Process with rank 0 prints the final dot product of two
vectors. The dot product for the sample input vectors is given below.
204.0
Example
14 : Description for implementation of MPI program to compute dot product of two vectors using block-striped partitioning with non-uniform data distribution
-
Objective
Write a MPI program, to compute the dot product of two vectors on p
processors of PARAM 10000 using block-striped partitioning
for non-uniform distribution of data .
-
Description
If p divides n evenly, processes P0 gets the
first n/p elements, P1 the next n/p
elements and so on. If p does not divide n evenly, and
r be the remainder, then first r processes get (n/p)+1
elements and remaining p-r processes get n/p elements.
The program described in Example 13 uses MPI_Scatter and MPI_Reduce
because the vector is equally distributed on p processes. Here,
we use MPI_Scatterv call, which distributes the vectors non-uniformly
on all processes.
-
Input
Input is given in the same format as explained in the Example
13
-
Output
Process with rank 0 prints the final dot product of two vectors and
the result are given as in Example
13
Example
15 : Description for implementation of MPI program to compute dot product of two vectors using block-striped partitioning with cyclic data distribution
-
Objective
Write a MPI program, to compute the dot product of two vectors on p
processors of PARAM 10000 using cyclic data partitioning. Assume
that the vectors are of size n and p is number of processors
used and n is a multiple of p.
-
Description
In the Cyclic data partitioning process P0 gets
the first element, process P1 gets the next and so on. The
process Pp-1 gets (p-1)th element.
If the number of elements n is more than the number processes p,
then process P0 gets pth element,
process P1 gets (p+1)th element and so on.
The process is repeated till all the elements are assigned. If n
is not a multiple of p, then first r (r = n/p)
processes will get n/p +1 elements and remaining p-r
processes will get n/p elements, in cyclic fashion.
The Figure 8 illustrates the example for p = 4 and n
= 16.
Figure 8 A Typical Cyclic data partitioning of a vector
of size 16 on 4 processes
Input is given in the same format as explained in the Example
13
-
Output
Process with rank 0 prints the final dot product of two vectors and
the result are given as in Example
13
Example
16 : Description for implementation of MPI program to compute infinity norm of a matrix using block-striped partitioning with uniform data distribution
-
Objective
Write a MPI program to calculates infinity norm of a matrix
using row wise block-striped partitioning .
-
Description
The infinity norm of a matrix is defined to be the maximum of
sums of absolute values of elements in a row, over all rows. Row-wise block
partitioning, is used and the idea is that the matrix m x n is
striped among p processors so that each processors stores m/p
rows of the matrix. A typical column-wise and row-wise
partitioning of the matrix is shown in the Figure 9.
Figure 9 Uniform column-wise and row-wise striped partitioning
of 16 x 16 matrix
on 4 processors.
If a m x n matrix is partitioned on p processes (labeled
p0 , p1, p2,...,
pp-1), then process pi contains
rows with indices (m/p)i, (m/p)i+1, (m/p)i+2,
(m/p)i+3,........, (m/p)(i+1)-1.
-
Input
Assume A is a real matrix of size m x n. Also the
number of rows m should be greater than or equal to number of processes
p. Process with rank 0 reads input data. Format for
the input file is given below.
You have to adhere strictly the following format for the input files.
#Line 1 : Number of Rows (m) Number of columns(n).
#Line 2 : (data) (in row-major order. This means that the data
of second row follows that of the first
and so on.)
-
Input file
A sample input file for the matrix A (8 x 8) is given below :
8 8
-1.0 2.0 3.0
4.0 5.0
6.0 7.0 8.0
2.0 3.0 4.0
5.0 6.0 7.0
8.0 9.0
3.0 4.0 5.0
6.0 7.0 8.0
9.0 10.0
4.0 5.0 6.0
7.0 8.0 9.0
10.0 11.0
5.0 6.0 7.0
8.0 9.0 10.0
11.0 12.0
6.0 7.0 8.0
9.0 10.0 11.0 12.0
13.0
7.0 8.0 8.0
10.0 11.0 12.0 13.0
14.0
8.0 -9.0 10.0 11.0
12.0 13.0 14.0 15.0
-
Output
Process with rank 0 prints the infinity norm value as given below
:
92.000
Example 17 : Description for implementation of MPI program to compute the Matrix and Vector Multiplication using self-scheduling algorithm
- Objective
Write a MPI program, for computing the matrix-vector multiplication
on p processors of PARAM 10000 using Self Scheduling
algorithm.
-
Description
This example illustrates , one of the most common parallel algorithm
prototype , the Self-Scheduling or Master-Slave algorithm.
This example is chosen not because it illustrates the best way to parallelize
this particular numerical computation (it doesn't), but it illustrates
the basic MPI send and MPI receive
operations in the context of fundamental type of parallel algorithm
applicable in many situations.
Figure 10 Communication pattern among master and
slaves in self scheduling paradigm.
We assume that the matrix A of size n x n
is available with master process of rank 0 and the vector x
of size n is available on all the slave processes , whose
rank start from 1 onwards .The idea is that one process, which we call
the master process, distributes the work load to slave
processes. When the slave finishes its workload, it informs the master
which assigns a new workload to the slave. This communication between
the master and slave is shown in Figure 10. This is very
simple paradigm where the co-ordination is done by master. Here,
the slave processes do not have to communicate with one another.
The master begins by broadcasting vector x to each slave.
It then sends one row of the matrix A to each slave .
At this point, the master begins a loop, terminated when it has received
all of the entries in the product. The body of the loop consists of receiving
one entry in the product vector from any slave , and sending the
next task (row of matrix A) to that slave. In
other words, completion of one task by a slave is considered to
be a request for the next task. Once all the tasks have been handed out,
termination messages are sent.
After receiving the broadcast value of vector x , each slave
also enters in a loop to compute the product of a matrix row with vector.
Each slave computation is terminated by the receipt of the termination
message from master. The body of the loop consists of receiving
a row of matrix A, performing the dot product with
vector x, and sending the answer back to the master.
-
Input
The input should be in following format.
Assume that the real matrix is of size m x n
and the real vector is of size n . Also the number of rows
m should be greater than or equal to number of processes p.
Process with rank 0 reads the input matrix A and the
vector x . Format for the input files are given below.
Input file 1
The input file for the matrix should strictly adhere to the following
format.
#Line 1 : Number of Rows (m), Number of columns (n).
#Line 2 : (data) (in row-major order. This means
that the data of second row follows that of the first and so on.)
A sample input file for the matrix (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 3.0
2.0 1.0
2.0 3.0 4.0
2.0 3.0 2.0
1.0 1.0
3.0 4.0 2.0
3.0 2.0 1.0
1.0 1.0
4.0 2.0 3.0
2.0 1.0 1.0
4.0 2.0
2.0 3.0 2.0
1.0 2.0 3.0
2.0 2.0
3.0 2.0 1.0
1.0 2.0 1.0
1.0 1.0
2.0 1.0 4.0
3.0 2.0 1.0
3.0 3.0
4.0 1.0 2.0
1.0 2.0 3.0
4.0 3.0
Input file 2
The input file for the vector should strictly adhere to the following
format.
#Line 1 : Size of the vector (n)
#Line 2 : (data)
A sample input file for the vector (8 x 1) is given below
8
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
-
Output
Process with rank 0 prints the final matrix vector product result is
given below.
18.0 18.0 17.0 19.0 17.0 12.0
19.0 20.0
Example
18 : Description for implementation of MPI program to compute the Matrix and Vector Multiplication using block-striped row-wise partitioning with uniform data distribution
-
Objective
Write a MPI program, for computing the matrix -vector multiplication
on p processor of PARAM 10000 using block striped partitioning
of a matrix.
-
Description
In the striped partitioning of a matrix, the matrix is divided
into groups of contiguous complete rows or columns, and each
processor is assigned one such group. The partitioning is called block-striped
if each processor is assigned contiguous rows or columns.
Striped partitioning can be block or cyclic. In a
row-wise block striping of an nxn matrix on
p processors (labeled P0 , P1, P2,...,
Pp-1), processor Pi contains
rows with indices (n/p)i, (n/p)i+1,
(n/p)i+2, (n/p)i+3,........,
(n/p)(i+1)-1. A typical column-wise or
row-wise partitioning of 16 x 16 matrix on 4 processors is shown
in the Figure 11
Figure 11 Uniform block-striped partitioning of 16 x
16 matrix on 4 processors.
The matrix A of size n x n is striped row-wise
among p processes so that each process stores n/p rows
of the matrix. We assume that the vector x of size n x
1 is available on each process . Now, process Pi
computes the dot product of the corresponding rows of the matrix
A[ * ] with the vector x[ * ] and accumulate
the partial result in the array y[ * ]. Finally, process P0
collects the dot product of different rows of the matrix with
the vector from all the processes.
The input is given in the same format as explained in the Example
17 . Also the number of rows m should be an exact multiple
of the number of processors p. This is assumed so that each process
get equal number of rows.
-
Output
Process with rank 0 prints the final matrix vector product result, same
as in Example 17
Example
19 : Description for implementation of MPI program to compute Matrix and Vector Multiplication using block checkerboard partitioning
-
Objective
Write a parallel MPI program, for computing the matrix -vector multiplication
on p processor of PARAM 10000 using block-checkerboard
partitioning . Assume that p is a perfect square number.
Special MPI routines can be used to arrange the processors in a square
grid of q x q processors where, q2=p. Also,
assume that the size of the matrix is evenly divisible by q.
-
Description
In checkerboard partitioning, the matrix is divided into smaller
square or rectangular blocks (submatrices) that are
distributed among processes. A checkerboard partitioning splits
both the rows and the columns of the matrix, so no process
is assigned any complete row or column . Like striping
partitioning, checkerboard partitioning can be block or cyclic.
The Figure 12 explains how a 8 x 8 matrix is distributed on 16 processors
using block checkerboard and cyclic checkerboard partitioning
.
Figure 12 Checkerboard partitioning of 8 x 8 matrices
on 16 processors.
We assume that processes are arranged in q x q
matrix and processes are stored in row major order. As shown in Figure
12 for square grid of 4 x 4 processors the matrix A of size m
x n is partitioned among p processes using
block checkerboard partitioning , such that each process stores
m/q x n/q block of the matrix A. The vector x
is distributed in portions of n/q elements to the first process
in each columns of q processes . For example, the processes
P00, P10 , P20 and P30 get
the n/q elements of the vector x. Each process in the first
column of each row in square grid of q x q processes, possesses
n/q elements of the vector. Processes P00, P10, P20,
P30 broadcast elements of the vector to the other processes
in the respective rows of the grid. Each process now stores
m/q x n/q blocks of the matrix A and a
n/q elements of the vector x.
Each process then performs multiplication of its block matrix with
local vector elements and stores the partial result in the vector y[
* ]. Now, each process has resultant vector y of size m/q and
the first process in each row of square grid processes q x q will accumulate
the partial sum in the same row involving other processes in the
same row of square grid of processes. Finally, process P0
gathers accumulated partial sum on each process to obtain the resultant
vector y.
-
Input
Assume that the input is given in the same format as explained in the
Example
17
-
Output
Process with rank 0 prints the final matrix vector product result and
the result is same as in Example
17
Example
20 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using self-scheduling algorithm
-
Objective
Write a MPI program, for computing the matrix-matrix multiplication
on p processors of PARAM 10000 using Self-Scheduling
algorithm.
-
Description
This example illustrates , one of the most common parallel algorithm
prototype , the Self-Scheduling or Master-Slave algorithm.
This example is chosen not because it illustrates the best way to parallelize
this particular numerical computation (it doesn't), but it illustrates
the basic MPI library calls like, MPI send and MPI
receive in the context of fundamental type of parallel
algorithm applicable in many situations.
We assume that a square matrix A of size n is available
on the master process with rank 0 and another square matrix B
of size n is available on all the slave processes ,whose
rank starts from 1 onwards. The idea is that the master process,
distributes the work load to slave processes. When the slave finishes
its workload, it informs the master, about the completion of the task assigned
and then master assigns a new workload to the slave. This is very simple
paradigm where the coordination is done by master. Here, the
slave processes do not have to communicate with one another. This communication
among the master and the slaves is shown in Figure 13
Figure 13 Communication pattern among master and slave in
self scheduling paradigm.
The master process with rank 0 stores the final output square matrix
C of size n . The master process receives whole row of the
product matrix C from any slave, and sends the next task
(row of matrix A) to that slave processes. In other
words, completion of one task by a slave is considered to be a request
for the next task. Once all the tasks have been handed out, termination
messages are sent .
Each slave process, after receiving matrix B,
enters a loop which is terminated after receiving the termination message
from master. The body of the loop consists of receiving the row
of the matrix A, forming the entries Ci, j of
required row by doing computation with the received row of the matrix A
with the available matrix B , and sending the computed
row of the product matrix.
.
-
Input
Assume A and B are real square matrices of size n and
the number of rows n should be greater than or equal to number
of processes p. Process 0 reads the input matrices
A and B. Format for the input file is given below.
The input file for the matrices A and B should strictly
adhere to the following format.
#Line 1 : Number of Rows (n) Number of columns(n).
#Line 2 : (data) (in row-major order. This means that the data
of second row follows that of the first and so on.)
-
Input file 1
A sample input file for the matrix A (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 3.0
2.0 1.0
2.0 3.0 4.0
2.0 3.0 2.0
1.0 1.0
3.0 4.0 2.0
3.0 2.0 1.0
1.0 1.0
4.0 2.0 3.0
2.0 1.0 1.0
4.0 2.0
2.0 3.0 2.0
1.0 2.0 3.0
2.0 2.0
3.0 2.0 1.0
1.0 2.0 1.0
1.0 1.0
2.0 1.0 4.0
3.0 2.0 1.0
3.0 3.0
4.0 1.0 2.0
1.0 2.0 3.0
4.0 3.0
-
Input file 2
A sample input file for the matrix B (8 x 8) is given below
8 8
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
1.0 2.0 3.0
4.0 2.0 1.0
2.0 1.0
-
Output
Process 0 prints final matrix-matrix product matrix C
(8 x 8) . The output of the matrix - matrix multiplication is given
below
18.0 36.0 54.0 72.0
36.0 18.0 36.0 18.0
18.0 36.0 54.0 72.0
36.0 18.0 36.0 18.0
17.0 34.0 51.0 68.0
34.0 17.0 34.0 17.0
19.0 38.0 57.0 76.0
38.0 19.0 38.0 19.0
17.0 34.0 51.0 68.0
34.0 17.0 34.0 17.0
12.0 24.0 36.0 48.0
24.0 12.0 24.0 12.0
19.0 38.0 57.0 76.0
38.0 19.0 38.0 19.0
20.0 40.0 60.0 80.0
40.0 20.0 40.0 20.0
Example
21 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and MPI Cartesian topology
-
Objective
Write a MPI program, for computing the matrix-matrix multiplication
on p processors of PARAM 10000 using block checkerboard partitioning
of the matrices . Special MPI routines on cartesian topology can
be used for checkerboard partitioning of matrices.
Assume that p= q2 and the size of square
matrices A and B is evenly divisible by q.
-
Description
Assume that A and B are square matrices of size
n and C be the output matrix. These matrices are dived
into blocks or submatrices to perform matrix-matrix operations in parallel
.For example, an n x n matrix A can be regarded as
q x q array of blocks Ai, j (0<=i
<q, 0<= j < q) such that each block is an (n/q)
x (n/q) submatrix. We use p processors to implement the
block version of matrix multiplication in parallel by choosing q
as a square root of p and compute a distinct block Ci,
j on each processor. Block and cyclic checkerboard
partitioning of a 8 x 8 matrix on a square grid (4 x 4) of processors
is shown in the Figure 14
Figure 14 Checkerboard partitioning of 8 x 8 matrices on 16 processors.
The matrices A and B are partitioned into p
blocks, A i, j and
B i, j (0<=i < q, 0<=
j < q) of size (n/q x n/q) on each
process. These blocks are mapped onto a q x q logical
mesh of processes. The processes are labeled from P0,0 to
Pq-1,q-1. An example of of this situation is shown in
Figure 14 Process Pi, j initially store block
matrices Ai, j and Bi, j and
computes block Ci, j of result matrix. To compute
submatrix Ci, j , we need all submatrices , Ai,
k and Bk, j ( 0 <= k <
q ) . To acquire all the required blocks, an all-to-all broadcast of
matrix Ai, j 's is performed in each row
and similarly in each column of matrix Bi, j's.
MPI collective communication is used to perform this operations.
After Pi, j acquires, A i,0 ,
A i,1 , A i,2 ,
A i, q-1 and B0, j ,
B1, j , B2, j , Bq-1,
j , it performs the serial block matrix to matrix multiplication
and accumulates the partial block matrix Ci, j
of matrix C . To obtain the resultant product matrix C, processes
with rank 0 gathers all the block matrices by using MPI_Gather
collective communication operation.
MPI provides a set of special routines to virtual
topologies. An important virtual topology is the Cartesian
topology . This is simply a decomposition in the natural co-ordinate
( e.g., x,y,z) directions.
-
Input
The input is given in the same format as explained in Example
20 . Assume that the number of processes is a perfect
square.
-
Output
Process with rank 0 prints the final matrix-matrix
product and the result are given as in Example
20
Example 22 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and Cannon Algorithm
-
Objective
Write a MPI program, for computing the matrix-matrix multiplication
on p processors of PARAM 10000. Use a special MPI routines on cartesian
topology for block checkerboard partitioning of the matrices, Cannon's
Algorithm.
Assume that p= q2 and the size of square matrices
A and B is evenly divisible by q.
-
Description
Cannon's algorithm is based on cartesian virtual topology. As
discussed in Example 21,
there are p processors arranged in q x q square grid
of processors and the input matrices, A and B are distributed
among the processes in checkerboard fashion. It results in constructing
p block matrices of A and B. It uses only point-to-point
communication for circularly shifting blocks of matrix A and
matrix B among p processes.
The algorithm proceeds in q stages. The first step in this algorithm
is to perform initial allignment of the block matrix A and block
matrix B. The blocks of matrix A are circularly shifted to
the i positions to left in the row of the square grid of
processes, where i is the row number of the
process in the mesh. Similarly, blocks of matrix B are circularly
shifted j positions upwards, where j is the column
number of the process in the processes mesh. This operation
is performed by using MPI_Sendrecv_replace . MPI_Send
and MPI_Recv is not used for point-to-point communication, because
if all the processes call MPI_Send or MPI_Recv in
different order the deadlocked situation may arise .
The algorithm performs the following steps in each stage :
1. Multiply the block of matrix A and matrix B and add
the resultant matrix to get the block matrix C,
which is initially set to zero.
2. Circularly shift the blocks of matrix A to left in the rows
of the processes and the blocks of
matrix B upwards in the columns of
the square grid of processes in a wrap around manner.
The communication steps for 4 x 4 square grid of processors mesh are
explained in the Figure 15.
Figure 15 The communication steps in Cannon's Algorithm
on 16 processors.
-
Input
The input is given in the same format as explained in Example
20
-
Output
Process with rank 0 prints the final product matrix
and the results are given as in Example
20
Example 23 : Description for implementation of MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and Fox Algorithm
-
Objective
Write a MPI program, for computing the matrix-matrix multiplication
on p processors of PARAM 10000 .Use a special MPI routines on cartesian
topology for block checkerboard partitioning of the matrices, Fox's Algorithm.
Assume that p= q2 and the size of square matrices
A and B is evenly divisible by q.
-
Description
This algorithm is based on cartesian virtual topology. As discussed
in Example 21, there
are p processes arranged in q x q mesh and the input
matrices are distributed among the processes in checkerboard fashion. It
results in constructing p block matrices of A and B.
The algorithm uses two types of communication, firstly it uses one-to-all
broadcast for blocks of matrix A in processes rows. Similarly
blocks of matrix B are circularly shifted j positions upwards,
where j is the column of the process in the processes mesh
using MPI_Sendrecv_replace. In this case MPI_Send and MPI_Recv
is not used for point-point communication because if all the processes
call MPI_Send or MPI_Recv in different order the deadlocked
situation may arise.
The algorithm proceeds in n = q stages and it performs
the following step in each stages .
1. Broadcast the block of matrix A on the diagonal process
of a particular row in the square grid of
processes to other processes in the same
row.
2. Multiply the block of matrix A and matrix B
and add the resultant matrix to get the block matrix C,
which is initially set to zero.
3. Circularly shift the blocks of matrix B upwards in
the processes columns and receives a fresh block
of matrix B from the process below it. This
selection is done in wrap around manner.
4. Select the block of matrix A for the next row
broadcast. If Ai, j was broadcast in the current
step,
then select Ai,(j+1)mod q
th block for the next broadcast. This selection is also done
in
wrap around manner.
The communication steps for 4 x 4 square grid of processors are explained
in the Figure 16
Figure 16 The communication steps in Fox's Algorithm
on 16 processors.
-
Input
The input is given in the same format as explained in Example
20
-
Output
Process with rank 0 prints the final matrix-matrix
product and the result are given as in Example
20
Contents
Hands-on Session
MPI Library Calls
MPI On Web
|