________________________________________________________
 
 
MPI Information
________________________________________________________ 
 

1.1  An overview of MPI 

1.1.1 Introduction 

The Message Passing Interface (MPI) is a proposed standard originally designed for writing applications and libraries for distributed-memory environments. The main advantages of establishing a message-passing interface for such environments are portability and ease of use, and a standard message-passing interface is a key component in building a concurrent computing environment in which applications, software libraries, and tools can be transparently ported between different machines. 

MPI is intended to be a standard message-passing interface for applications and libraries running on concurrent computers with logically distributed memory. MPI is not specifically designed for use by parallelizing compilers, and it provides no explicit support for multithreading. The design goals of the MPI standard do not include the mandate that an implementation be interoperable with other MPI implementations; however, MPI does provide message-passing routines for exchanging all the information needed to allow a single MPI implementation to operate in a heterogeneous environment. 

 
MPI (Message Passing Interface) is a standard specification for message-passing libraries. MPI makes it relatively easy to write portable parallel programs. mpich is a portable implementation of the MPI specification for a wide variety of parallel computing environments. MPI programs are built and run using the mpich implementation. 

We assume that the mpich implementation is installed in '/usr/local/mpi' and that application users have added '/usr/local/mpi/bin' to their path. If mpich is installed somewhere else, you should make the appropriate changes. mpich is built and installed on a parallel system for a particular architecture and device. The architecture indicates the kind of processor (for example, sun4), and the device indicates how mpich performs communication between processes (for example, ch_p4). The libraries and special commands for each architecture/device pair are provided in the following directory: 
 

/usr/local/mpi/lib/<architecture>/<device>          

                                                  

 
1.1.2 MPI Forum 

In the past, both commercial vendors and software makers provided different solutions to users of the message-passing paradigm. Important issues for the user community are portability, performance, and features. The user community, which quite definitely includes the software suppliers themselves, recently determined to address these issues. 

In April 1992, the Center for Research on Parallel Computation sponsored a one-day workshop on Standards for Message Passing in a Distributed-Memory Environment. The result of that workshop, which featured presentations of many systems, was a realization that people were eager to cooperate on the definition of a standard. At the Supercomputing '92 conference in November, a committee was formed to define a message-passing standard. At the time of its creation, few knew what the outcome might look like, but the effort was begun with the following goals: 

  • Define a portable standard for message passing. It would not be an official, ANSI-like standard, but it would attract both implementors and users. 
     
  • Operate in a completely open way. Anyone would be free to join the discussions, either by attending meetings in person or by monitoring e-mail discussions. 
     
  • Be finished in one year. 
The MPI effort has been a lively one, as a result of the tensions among these three goals. The MPI Forum decided to follow the format used by the High Performance Fortran Forum, which had been well received by its community. The MPI effort has been successful in attracting a wide class of vendors and users because the MPI Forum itself was so broadly based. The parallel computer vendors were represented by Convex, Cray, IBM, Intel, Meiko, nCUBE, NEC, and Thinking Machines. Members of the groups associated with portable software libraries were also present: PVM, p4, Zipcode, Chameleon, PARMACS, TCGMSG, and Express were all represented. In addition, a number of parallel application specialists were on hand. 

MPI achieves portability by providing a public-domain, platform-independent standard for a message-passing library. MPI specifies this library in a language-independent form and provides Fortran and C bindings. The specification does not contain any feature that is specific to a particular vendor, operating system, or hardware. For these reasons, MPI has gained wide acceptance in the parallel computing community. MPI has been implemented on IBM PCs under Windows, on all major Unix workstations, and on all major parallel computers. This means that a parallel program written in standard C or Fortran, using MPI for message passing, can run without change on a single PC, a workstation, a network of workstations, or an MPP, from any vendor, on any operating system.

MPI is not a stand-alone, self-contained software system. It serves as a message-passing communication layer on top of the native parallel programming environment, which takes care of necessities such as process management and I/O. Besides these proprietary environments, there are several public-domain MPI environments. Examples include the CHIMP implementation developed at Edinburgh University and LAM (Local Area Multicomputer), developed at the Ohio Supercomputer Center, which is an MPI programming environment for heterogeneous Unix clusters. The most popular public-domain implementation is MPICH, developed jointly by Argonne National Laboratory and Mississippi State University. MPICH is a portable implementation of MPI for a wide range of machines, from IBM PCs and networks of workstations to SMPs and MPPs. The portability of MPICH means that one can simply retrieve the same MPICH package and install it on almost any platform. MPICH also has good performance on many parallel machines, because it often runs in the more efficient native mode rather than over common TCP/IP sockets.

In addition to meetings every six weeks for more than a year, there were continuous discussions via electronic mail, in which many people from the worldwide parallel computing community participated. Equally important, an early commitment to producing a model implementation helped to demonstrate that an implementation of MPI was feasible. The MPI Standard was completed in May 1994. For more information on mpich versions, refer to http://www-unix.mcs.anl.gov/mpi/mpich 

 
1.1.3 MPI Basics  

Perhaps the best way to introduce the concepts in MPI that might initially appear unfamiliar is to show how they have arisen as necessary extensions of quite familiar concepts. Let us consider what is perhaps the most elementary operation in a message-passing library, the basic send operation. In most of the current message-passing systems, it looks very much like this: 
 

 send (address, length, destination, tag ) where 
  •  address is a memory location signifying the beginning of the buffer containing the data to be sent, 
  • length is the length in bytes of the message, 
  • destination is the process identifier of the process to which this message is sent (usually an integer), and 
  • tag is an arbitrary non-negative integer to restrict receipt of the message (sometimes also called type). 
This particular set of parameters is frequently chosen because it is a good compromise between what the programmer needs and what the hardware can do efficiently (transfer a contiguous area of memory from one processor to another). In particular, the system software is expected to supply queuing capabilities so that a receive operation 
recv (address, maxlen, source, tag, actlen ) 

will complete successfully only if a message is received with the correct tag. Other messages are queued until a matching receive is executed. In most current systems, source is an output argument indicating where the message came from, although in some systems it can also be used to restrict matching and to cause message queuing. On a receive, address and maxlen together describe the buffer into which the received data is to be put, and actlen is the number of bytes received. 

Message-passing systems with this sort of syntax and semantics have proven extremely useful, yet have imposed restrictions that are now recognized as undesirable by a large user community. The MPI Forum has sought to lift these restrictions by providing more flexible versions of each of these parameters, while retaining the familiar underlying meanings of the basic send and receive operations. 

Let us examine these parameters one by one, in each case discussing first the current restrictions and then the MPI version. The (address, length) specification of the message to be sent was a good match for early hardware but is no longer adequate for two different reasons: 
 

  • In many situations, the message to be sent is not contiguous. In the simplest case, it may be a row of a matrix that is stored columnwise. In general, it may consist of an irregularly dispersed collection of structures of different sizes. In the past, programmers (or libraries) have provided code to pack this data into contiguous buffers before sending it and to unpack it at the receiving end. However, as communications processors appear that can deal directly with striped or even more generally distributed data, it becomes more critical for performance that the packing be done "on the fly" by the communication processor in order to avoid the extra data movement. This cannot be done unless we describe the data in its original (distributed) form to the communication library. 
     
  • The past few years have seen a rise in the popularity of heterogeneous computing. The popularity comes from two sources. The first is the distribution of various parts of a complex calculation among different semi-specialized computers (e.g., SIMD, vector, graphics). The second is the use of workstation networks as parallel computers. Workstation networks, consisting of machines acquired over time, are frequently made up of a variety of machine types. In both of these situations, messages must be exchanged between machines of different architectures, where (address, length) is no longer an adequate specification of the semantic content of the message. For example, with a vector of floating-point numbers, not only may the floating-point format be different, but even the length may be different. This situation is true for integers as well. The communication library can do the necessary conversion if it is told precisely what is being transmitted. 
The MPI solution to both of these problems is to specify messages at a higher level and in a more flexible way, reflecting the fact that the contents of a message contain much more structure than just a string of bits. An MPI message buffer is defined by a triple (address, count, datatype), describing count occurrences of the data type datatype starting at address. The power of this mechanism comes from the flexibility in the values of datatype. To begin with, datatype can take on the values of elementary data types in the host language. Thus (A, 300, MPI_REAL) describes a vector A of 300 real numbers in Fortran, regardless of the length or format of a floating-point number. An MPI implementation for heterogeneous networks guarantees that the same 300 reals will be received, even if the receiving machine has a very different floating-point format. The real power of datatypes, however, comes from the fact that users can construct their own datatypes using MPI routines and that these datatypes can describe noncontiguous data. 
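As an illustration of such a user-constructed datatype, the following minimal sketch (not taken from the original text; the matrix A, its sizes NROWS and NCOLS, and the function name are only examples) uses MPI_Type_vector to describe one column of a row-major C matrix, so the noncontiguous data can be sent without manual packing.

/* Sketch: send one column of a row-major NROWS x NCOLS matrix. */
#include <mpi.h>

#define NROWS 4
#define NCOLS 5

void send_column(double A[NROWS][NCOLS], int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* NROWS blocks of 1 double each, separated by a stride of NCOLS doubles */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* the message is described by the (address, count, datatype) triple */
    MPI_Send(&A[0][0], 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}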

A nice feature of the MPI design is that it provides powerful functionality based on four orthogonal concepts. These four concepts in MPI are message datatypes, communicators, communication operations, and virtual topologies.

Separating Families of Messages  

Nearly all message-passing systems provide a tag argument for the send and receive operations. This argument allows the programmer to deal with the arrival of messages in an orderly way, by queuing messages that arrive "with the wrong tag" until the program (peer) is ready for them. Usually a facility exists for specifying wild-card tags that match any tag. This mechanism has proven necessary but insufficient, because the arbitrariness of the tag choices means that the entire program must use tags in a predefined, coherent way. Particular difficulties arise in the case of libraries, written far from the application programmer in time and space, whose messages must not be accidentally received by the application program. 

MPI's solution is to extend the notion of tag with a new concept: context. Contexts are allocated at run time by the system in response to user (and library) requests and are used for matching messages. They differ from tags in that they are allocated by the system instead of the user, and no wild-card matching is permitted. The usual notion of message tag, with wild-card matching, is retained in MPI. 

Naming Processes  

Processes belong to groups. If a group contains n processes, then its processes are identified within the group by ranks, which are integers from 0 to n-1. There is an initial group to which all processes in an MPI implementation belong. Within this group, processes are numbered similarly to the way in which they are numbered in many existing message-passing systems, from 0 up to one less than the total number of processes. 

Communicators 

The notions of context and group are combined in a single object called a communicator, which becomes an argument to most point-to-point and collective operations. Thus the destination or source specified in a send or receive operation always refers to the rank of the process in the group identified with the given communicator. That is, in MPI the basic (blocking) send operation has become 
 

MPI_Send (buf, count, datatype, dest, tag, comm) where 
  • buf, count, datatype together describe count occurrences of items of the form datatype starting at buf, 
  • dest is the rank of the destination in the group associated with the communicator comm, 
  • tag is as usual, and 
  • comm identifies a group of processes and a communication context. 
The receive has become 
 
MPI_Recv (buf, count, datatype, source, tag, comm, status) 

The source, tag, and count of the message actually received can be retrieved from status. Several other message-passing systems return the "status" parameters by separate calls that implicitly reference the most recent message received. MPI's method is one aspect of its effort to be reliable in situations where multiple threads are receiving messages on behalf of a process. The processes involved in the execution of a parallel program using MPI are identified by a sequence of non-negative integers. If there are p processes executing a program, they will have ranks 0, 1, 2, ..., p-1. 
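To make this concrete, the following minimal sketch (not taken from the original text) exchanges four integers between ranks 0 and 1; the actual count, source, and tag of the received message are recovered from status with the standard call MPI_Get_count. It assumes the program is started with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recvbuf[4], count;
        MPI_Recv(recvbuf, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);   /* actual count received */
        printf("received %d ints from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}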

MPI provides a set of routines that support point-to-point communication between pairs of processes. Blocking and nonblocking versions of the routines are provided, and they may be used in four different communication modes, which correspond to different communication protocols. Message selectivity in point-to-point communication is by source process and message tag, each of which may be wildcarded to indicate that any valid value is acceptable. 

MPI also provides the communicator abstraction, which supports the design of safe, modular parallel software libraries; general (derived) datatypes, which permit the specification of messages of noncontiguous data of different datatypes; application topologies, which specify the logical layout of processes (a common example is a Cartesian grid, often used in two- and three-dimensional problems); and a rich set of collective communication routines that perform coordinated communication among a set of processes. 

In MPI there is no mechanism for creating processes, and an MPI program is parallel ab initio, i.e., there is a fixed number of processes from the start to the end of an application program. All processes are members of at least one process group. Initially all processes are members of the same group, and a number of routines are provided that allow an application to create (and destroy) new subgroups. Within a group each process is assigned a unique rank in the range 0 to n-1, where n is the number of processes in the group. This rank is used to identify a process and, in particular, is used to specify the source and destination processes in a point-to-point communication operation and the root process in certain collective communication operations. 

MPI was designed as a message-passing interface rather than a complete parallel programming environment, and thus in its current form it intentionally omits many desirable features. For example, MPI lacks mechanisms for process creation and control; one-sided communication operations that would permit put and get messages and active messages; nonblocking collective communication operations; the ability for a collective communication operation to involve more than one group; and language bindings for Fortran 90 and C++. 

These issues, and other possible extensions to MPI, are currently being considered in the MPI-2 effort. Extensions to MPI for performing parallel I/O are also under consideration. 
 
The Communicator Abstraction  

Communicators provide support for the design of safe, modular software libraries. Safe here means that messages intended for receipt by a particular receive call in an application will not be incorrectly intercepted by a different receive call. Thus, communicators are a powerful mechanism for avoiding unintentional non-determinism in message passing. This may be a particular problem when using third-party software libraries that perform message passing. The point here is that the application developer has no way of knowing whether the tag, group, and rank completely disambiguate the message traffic of the different libraries and the rest of the application. Communicator arguments are passed to all MPI message-passing routines, and a message can be communicated only if the communicator arguments passed to the send and receive routines match. Thus, in effect, communicators provide an additional criterion for message selection, and hence permit the construction of independent tag spaces. 

If communicators are not used to disambiguate message traffic there are two ways in which a call to a library routine can lead to unintended behaviour. In the first case, the processes enter a library routine synchronously when a send has been initiated for which the matching receive is not posted until after the library call. In this case the message may be incorrectly received in the library routine. The second possibility arises when different processes enter a library routine asynchronously resulting in a nondeterministic behaviour. If the program behaves correctly, processes 0 and 1 each receive a message from process 2, using a wildcard selection criterion to indicate that they are prepared to receive a message from any process. The three processes then pass data around in a ring within the library routine. If separate communicators are not used for the communication inside and outside of the library routine this program may intermittently fail. Suppose we delay the sending of the second message sent by process 2, for example, by inserting some computation. In this case the wildcarded receive in process 0 is satisfied by a message sent from process 1, rather than from process 2, and deadlock results. By supplying a different communicator to the library routine we can ensure that the program is executed correctly, regardless of when the processes enter the library routine. 

Communicators are opaque objects, which means they can only be manipulated using MPI routines. The key point about communicators is that when a communicator is created by an MPI routine it is guaranteed to be unique. Thus it is possible to create a communicator and pass it to a software library provided that communicator is not used for any message passing outside of the library. 
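A common way for a library to obtain such a private communicator is to duplicate the one supplied by the caller. The sketch below (a hypothetical library, not from the original text; the names mylib_init and mylib_finalize are invented) shows the idea using MPI_Comm_dup.

#include <mpi.h>

/* Hypothetical library initialization: duplicate the caller's
 * communicator so that all internal library traffic uses a private
 * context and cannot match receives posted by user code. */
typedef struct {
    MPI_Comm private_comm;
} mylib_handle;

void mylib_init(MPI_Comm user_comm, mylib_handle *h)
{
    MPI_Comm_dup(user_comm, &h->private_comm);
}

void mylib_finalize(mylib_handle *h)
{
    MPI_Comm_free(&h->private_comm);
}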

Communicators have a number of attributes. The group attribute identifies the process group relative to which process ranks are interpreted, and/or which identifies the process group involved in a collective communication operation. Communicators also have a topology attribute which gives the topology of the process group. In addition, users may associate arbitrary attributes with communicators through a mechanism known as caching. 

Point-To-Point Communication  

MPI provides routines for sending and receiving blocking and nonblocking messages. A blocking send does not return until it is safe for the application to alter the message buffer on the sending process without corrupting or changing the message sent. A nonblocking send may return while the message buffer on the sending process is still volatile, and it should not be changed until it is guaranteed that this will not corrupt the message. This may be done by either calling a routine that blocks until the message buffer may be safely reused, or by calling a routine that performs a nonblocking check on the message status. A blocking receive suspends execution on the receiving process until the incoming message has been placed in the specified application buffer. A nonblocking receive may return before the message is actually received into the specified application buffer, and a subsequent call must be made to ensure that this occurs before the buffer is reused. 
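The following sketch (not from the original text; the buffers, the partner rank, and the surrounding computation are assumed to come from the application) shows the usual nonblocking pattern: post the receive and send, do unrelated work, then wait before touching either buffer.

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n,
              int partner, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[1]);

    /* ... computation that does not use sendbuf or recvbuf ... */

    MPI_Waitall(2, reqs, stats);   /* both buffers are now safe to reuse */
}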

Communication Modes  

In MPI, a message may be sent in one of four communication modes, which approximately correspond to the most common protocols used for point-to-point communication. In ready mode a message may be sent only if a corresponding receive has already been initiated. In standard mode a message may be sent regardless of whether a corresponding receive has been initiated. MPI also includes a synchronous mode, which is the same as the standard mode except that the send operation will not complete until a corresponding receive has been initiated on the destination process. Finally, there is a buffered mode. To use buffered mode the user must first supply a buffer and attach it to the process. When a subsequent buffered send is performed, MPI may use this buffer to hold the message. A buffered send may be performed regardless of whether a corresponding receive has been initiated. 
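A minimal sketch of the buffered mode follows (not from the original text; the function name and the data to be sent are only examples). A buffer is attached with MPI_Buffer_attach, MPI_Bsend copies the message into it and returns, and detaching the buffer completes any pending buffered sends.

#include <mpi.h>
#include <stdlib.h>

void buffered_send(double *data, int n, int dest, MPI_Comm comm)
{
    int size;
    void *buf;

    /* room for one message plus the required bookkeeping overhead */
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size);
    size += MPI_BSEND_OVERHEAD;
    buf = malloc(size);

    MPI_Buffer_attach(buf, size);
    MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, comm);
    MPI_Buffer_detach(&buf, &size);   /* blocks until the buffered data is sent */

    free(buf);
}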
 
MPI has both the blocking send and receive operations described above and nonblocking versions whose completion can be tested for and waited for explicitly. It is possible to test and wait on multiple operations simultaneously. MPI also has multiple communication modes. The standard mode corresponds to current common practice in message-passing systems. The synchronous mode requires sends to block until the corresponding receive has occurred (as opposed to the standard mode blocking send which blocks only until the buffer can be reused). The ready mode (for sends) is a way for the programmer to notify the system that the receive has been posted, so that the underlying system can use a faster protocol if it is available. 

There are, therefore, 8 types of send and 2 types of receive operation. In addition, routines are provided that perform a send and a receive simultaneously. Different calls are provided for when the send and receive buffers are distinct and when they are the same. The send/receive operation is blocking, so it does not return until the send buffer is ready for reuse and the incoming message has been received. The two send/receive routines bring the total number of point-to-point message-passing routines up to 12. 

Message-Passing Modes  

It is customary in message-passing systems to use the term communication to refer to all interaction operations, i.e., communication, synchronization, and aggregation. Communication usually occurs within processes of the same group. However, inter-group communication is also supported by some systems (e.g., MPI). There are three aspects of a communication mode that a user should understand:

    How many processes are involved?
    How are the processes synchronized?
    How are communication buffers managed?

Three communication modes are used in today's message-passing systems. We describe these communication modes below from the user's viewpoint, using a pair of send and receive operations in three different ways. We use the following code example to demonstrate the ideas.

     Send and receive buffers in message passing

     In the following code, process P sends a message contained in variable M to process Q, which receives the message into its variable S. 

          Process P:                            Process Q:

              M = 10;                               S = -100;
        L1:   send M to Q;                    L1:   receive S from P;
        L2:   M = 20;                         L2:   X = S + 1;
              goto L1;

The variable M is often called the send message buffer (or send buffer), and S is called the receive message buffer (or receive buffer).

Synchronous Message Passing: When process P executes a synchronous send M to Q, it has to wait until process Q executes a corresponding synchronous receive S from P. Neither process returns from the send or the receive until the message M is both sent and received. This means that in the above code the variable X should evaluate to 11.

        When the send and receive return, M can be immediately overwritten by P and S can be immediately read by Q in the subsequent statements L2. Note that no additional buffer needs to be provided in synchronous message passing: the receive message buffer S is available to hold the arriving message.

Blocking Send/Receive: A blocking send is executed when a process reaches it, without waiting for a corresponding receive. A blocking send does not return until the message is sent, meaning that the message variable M can then be safely overwritten. Note that when the send returns, a corresponding receive is not necessarily finished, or even started. All we know is that the message is out of M: it may have been received, it may be temporarily buffered in the sending node or somewhere in the network, or it may have arrived at a buffer in the receiving node.

        A blocking receive is executed when a process reaches it, without waiting for a corresponding send. However, it cannot return until the message is received. In the above code, X should evaluate to 11 with blocking send/receive. Note that the system may have to provide a temporary buffer for blocking-mode message passing. 

NonBlocking Send/Receive: A nonblocking send is executed when a process reaches it, without waiting for a corresponding receive. A nonblocking send can return immediately after it notifies the system that the message M is to be sent. The message data are not necessarily out of M. Therefore, it is unsafe to overwrite M. 

        A nonblocking receive is executed when a process reaches it, without waiting for a corresponding send. It can return immediately after it notifies the system that there is a message to be received. The message may have already arrived, may still be in transit, or may not even have been sent yet. With the nonblocking mode, X could evaluate to 11, 21, or -99 in the above code, depending on the relative speeds of the two processes. The system may have to provide a temporary buffer for nonblocking message passing. These three modes are compared in the table below. 

Comparison of Three Communication Modes

    Communication Event         Synchronous                      Blocking        Nonblocking
    -----------------------------------------------------------------------------------------
    Send start condition        Both send and receive reached    Send reached    Send reached
    Return of send indicates    Message received                 Message sent    Message send initiated
    Semantics                   Clean                            In-between      Error-prone
    Buffering message           Not needed                       Needed          Needed
    Status checking             Not needed                       Not needed      Needed
    Wait overhead               Highest                          In-between      Lowest
    Overlap of communication
    and computation             No                               Yes             Yes

        In real parallel systems, there are variants of this definition of the synchronous (or the other two) modes. For example, in some systems a synchronous send could return when the corresponding receive has started but not finished. A different term may be used to refer to the synchronous mode. The term asynchronous is used to refer to a mode that is not synchronous, such as the blocking and nonblocking modes. 

        Blocking and nonblocking modes are used in almost all existing message-passing systems. They both reduce the wait time of a synchronous send. However, sufficient temporary buffer space must be available for an asynchronous send, because the corresponding receive may not be even started; thus the memory space to put the received message may not be known. 

        The main motivation for using the nonblocking mode is to overlap communication and computation. However, nonblocking communication introduces its own inefficiencies, such as extra memory space for temporary buffers, allocation of the buffer, copying messages into and out of the buffer, and the execution of an extra wait-for function. These overheads may significantly offset any gains from overlapping communication with computation. 

Persistent Communication Requests  

MPI also provides a set of routines for creating communication request objects that completely describe a send or receive operation by binding together all the parameters of the operation. A handle to the communication object so formed is returned, and may be passed to a routine that actually initiates the communication. As with the nonblocking communication routines, a subsequent call should be made to ensure completion of the operation. 

Persistent communication objects may be used to optimize communication performance, particularly when the same communication pattern is repeated many times in an application. For example, if a send routine is called within a loop, performance may be improved by creating a communication request object that describes the parameters of the send prior to entering the loop, and then initiating the communication inside the loop to send the data on each pass through the loop. 

There are five routines for creating communication objects: four for send operations (one corresponding to each communication mode), and one for receive operations. A persistent communication object should be deallocated when no longer needed. 
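The loop scenario above can be sketched as follows (not from the original text; niter, partner, and the buffer are assumed to come from the application). A persistent send request is created once, started on each iteration, and freed afterwards.

#include <mpi.h>

void repeated_send(double *buf, int n, int partner, int niter,
                   MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  status;
    int i;

    MPI_Send_init(buf, n, MPI_DOUBLE, partner, 0, comm, &req);

    for (i = 0; i < niter; i++) {
        /* ... fill buf with this iteration's data ... */
        MPI_Start(&req);          /* initiate the prepared send          */
        MPI_Wait(&req, &status);  /* complete it before reusing the buffer */
    }

    MPI_Request_free(&req);       /* deallocate the persistent object    */
}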

Application Topologies  

In many applications the processes are arranged with a particular topology, such as a two- or three-dimensional grid. MPI provides support for general application topologies that are specified by a graph in which processes that communicate a significant amount are connected by an arc. If the application topology is an n-dimensional Cartesian grid, then this generality is not needed, so as a convenience MPI provides explicit support for such topologies. For a Cartesian grid, periodic or nonperiodic boundary conditions may apply in any specified grid dimension. In MPI, a group either has a Cartesian or graph topology, or no topology. In addition to providing routines for translating between process rank and location in the topology, MPI also allows knowledge of the application topology to be exploited in order to efficiently assign processes to physical processors, provides a routine for partitioning a Cartesian grid into hyperplane groups by removing a specified set of dimensions, and provides support for shifting data along a specified dimension of a Cartesian grid. By dividing a Cartesian grid into hyperplane groups, it is possible to perform collective communication operations within these groups. In particular, if all but one dimension is removed, a set of one-dimensional subgroups is formed, and it is possible, for example, to perform a multicast in the corresponding direction. 
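The following sketch (not from the original text; the function name and the choice of a 2-D periodic grid are only examples) builds a Cartesian topology and finds the calling process's neighbours along one dimension.

#include <mpi.h>

void make_grid(MPI_Comm comm)
{
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    int nprocs, myrank, coords[2], prev, next;
    MPI_Comm grid;

    MPI_Comm_size(comm, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);            /* choose a balanced 2-D grid      */
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);

    MPI_Comm_rank(grid, &myrank);
    MPI_Cart_coords(grid, myrank, 2, coords);    /* my coordinates in the grid      */
    MPI_Cart_shift(grid, 0, 1, &prev, &next);    /* neighbours along dimension 0    */

    /* prev and next can now be used as source/destination ranks */
    MPI_Comm_free(&grid);
}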

Collective Communication  

Collective communication routines provide for coordinated communication among a group of processes. The process group is given by the communicator object that is input to the routine. The MPI collective communication routines have been designed so that their syntax and semantics are consistent with those of the point-to-point routines. The collective communication routines may be (but do not have to be) implemented using the MPI point-to-point routines. Collective communication routines do not have message tag arguments, though an implementation in terms of the point-to-point routines may need to make use of tags. A collective communication routine must be called by all members of the group with consistent arguments. As soon as a process has completed its role in the collective communication, it may continue with other tasks. Thus, a collective communication is not necessarily a barrier synchronization for the group. MPI does not include nonblocking forms of the collective communication routines. MPI collective communication routines are divided into two broad classes: data movement routines and global computation routines. Collective operations are of two kinds: 
 

  • Data movement operations are used to rearrange data among the processes. The simplest of these is a broadcast, but many elaborate scattering and gathering operations can be defined (and are supported in MPI). 
     
  • Collective computation operations (minimum, maximum, sum, logical OR, etc., as well as user-defined operations). 
In both cases, a message-passing library can take advantage of its knowledge of the structure of the machine to optimize and increase the parallelism in these operations. MPI has a large set of collective communication operations, and a mechanism by which users can provide their own. In addition, MPI provides operations for creating and managing groups in a scalable way. Such groups can be used to control the scope of collective operations. MPI has an extremely flexible mechanism for describing data movement routines; these are particularly powerful when used in conjunction with derived datatypes. 
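A small sketch combining one operation of each kind follows (not from the original text; the value 100 and the use of the rank as the reduction operand are arbitrary examples): a broadcast moves data from the root to every process, and a reduction computes a global sum at the root.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 100;                 /* value known only at the root        */

    /* data movement: every process receives the root's value of n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* global computation: sum of all ranks, result placed at the root */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("n = %d, sum of ranks = %d\n", n, sum);

    MPI_Finalize();
    return 0;
}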

Virtual topologies 

One can conceptualize processes in an application-oriented topology, for convenience in programming. Both general graphs and grids of processes are supported. Topologies provide a high-level method for managing process groups without dealing with them directly. Since topologies are a standard part of MPI, we do not treat them as an exotic, advanced feature; we use them early on and freely from then on. 

Debugging and Profiling 

Rather than specify any particular interface, MPI requires the availability of "hooks" that allow users to intercept MPI calls and thus define their own debugging and profiling mechanisms. 

Support for libraries 

The structuring of all communication through communicators provides library writers, for the first time, with the capabilities they need to write parallel libraries that are completely independent of user code and interoperable with other libraries. Libraries can maintain arbitrary data, called attributes, associated with the communicators they allocate, and can specify their own error handlers. 

Support for heterogeneous network 

MPI programs can run on networks of machines that have different lengths and formats for various fundamental datatypes, since each communication operation specifies a (possibly very simple) structure and all the component datatypes, so that the implementation always has enough information to do data format conversions if they are necessary. MPI does not specify how this is done, however, thus allowing a variety of optimizations. 

 
 
 
 
1.1.4 Basic MPI Library Calls 

Brief Introduction to MPI Calls 

The most commonly used MPI library calls in C and FORTRAN are explained below. 

Syntax:     'C' Call 
                  'FORTRAN' Call  

  • MPI_Init (int *argc, char ***argv) 
  • MPI_Init ( ierror) 
     integer ierror 

    Initializes the  MPI execution environment 

    This call is required in every MPI program and must be the first MPI call. It establishes the MPI "environment". Only one invocation of MPI_Init can occur in each program execution. It takes the command line arguments as parameters. In a FORTRAN call to MPI_Init the only argument is the error code. Every Fortran MPI subroutine returns an error code in its last argument, which is either MPI_SUCCESS or an implementation-defined error code. It allows the system to do any special setup so that the MPI library can be used. 

     

  • MPI_Comm_rank (MPI_Comm comm, int *rank)
  •  MPI_Comm_rank (comm, rank, ierror) 
     integer comm, rank, ierror 
     
    Determines the rank of the calling process in the communicator 

    The first argument to the call is a communicator, and the rank of the process is returned in the second argument. Essentially, a communicator is a collection of processes that can send messages to each other. The only communicator needed for basic programs is MPI_COMM_WORLD, which is predefined in MPI and consists of the processes running when program execution begins. 
     

  • MPI_Comm_size (MPI_Comm comm, int *num_of_processes)
  • MPI_Comm_size (comm, size, ierror) 
    integer comm, size, ierror 

    Determines the size of the group associated with a communicator 

    This function determines the number of processes executing the program. Its first argument is the communicator, and it returns the number of processes in the communicator in its second argument. 
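    A minimal sketch combining the calls above with MPI_Finalize (described later) follows; it is not from the original text, and simply has every process report its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank in MPI_COMM_WORLD  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }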
     

  • MPI_Send (void *message, int count, MPI_Datatype datatype, int destination, int tag, MPI_Comm comm)
  • MPI_Send(buf, count, datatype, dest, tag, comm, ierror) 
    <type> buf (*) 
     integer count, datatype, dest, tag, comm, ierror 

    Basic send (It is a blocking send call) 

    The first three arguments describe the message as the address, count, and datatype. The contents of the message are stored in the block of memory referenced by the address. The count specifies the number of elements contained in the message, which are of MPI type datatype. The next argument is the destination, an integer specifying the rank of the destination process. The tag argument helps identify messages. 
     

  • MPI_Recv (void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) 
  • MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror) 
    <type> buf (*) 
    integer count, datatype, source, tag, comm, status, ierror 

    Basic receive ( It is a blocking receive call)  

    The first three arguments describe the message as the address, count, and datatype. The contents of the message are stored in the block of memory referenced by the address. The count specifies the number of elements contained in the message, which are of MPI type datatype. The next argument is the source, which specifies the rank of the sending process. MPI allows the source to be a "wild card": there is a predefined constant MPI_ANY_SOURCE that can be used if a process is ready to receive a message from any sending process rather than a particular sending process. The tag argument helps identify messages. The last argument returns information on the data that was actually received. It references a record with two fields - one for the source and the other for the tag. 
     

  • MPI_Bcast (void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
  • MPI_Bcast(buffer, count, datatype, root, comm, ierror) 
    <type> buffer (*) 
    integer count, datatype, root, comm, ierror 

    Broadcast a message from the process with rank "root" to all other processes of the group 

    It is a collective communication call in which a single process sends the same data to every process. It sends a copy of the data in message on the process with rank root to each process in the communicator comm. It should be called by all processes in the communicator with the same arguments for root and comm. 
     

  • MPI_Reduce (void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
  • MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer count, datatype, op, root, comm, ierror 

    Reduce values on all processes to a single value 

    MPI_Reduce combines the operands stored in *operand using operation op and stores the result in *result on the root. Both operand and result refer to count memory locations with type datatype. MPI_Reduce must be called by all the processes in the communicator comm, and count, datatype, and op must be the same on each process. 
     

  • MPI_Scatter (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
  • MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root , comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcount, sendtype, recvcount, recvtype, root , comm, ierror 

    Sends data from one processor to all other processes in a group 

    The process with rank root distributes the contents of send_buffer among the processes. The contents of send_buffer are split into p segments, each consisting of send_count elements. The first segment goes to process 0, the second to process 1, etc. The send arguments are significant only on process root. 
     

  • MPI_Gather (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
  • MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcount, sendtype, recvcount, recvtype, root, comm, ierror 

    Gathers together values from a group of tasks 

    Each process in comm sends the contents of send_buffer to the process with rank root. The process root concatenates the received data in the process rank order in recv_buffer.  The receive arguments are significant only on the process with rank root. The argument recv_count indicates the number of items received from each process - not the total number received. 
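    The following sketch (not from the original text; the data values and function name are arbitrary) distributes one integer to each process with MPI_Scatter, lets each process transform its piece, and collects the results with MPI_Gather.

    #include <mpi.h>
    #include <stdlib.h>

    void scatter_gather(MPI_Comm comm)
    {
        int rank, size, piece, *data = NULL, *result = NULL;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == 0) {                       /* buffers significant only at root */
            data   = malloc(size * sizeof(int));
            result = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++) data[i] = i;
        }

        MPI_Scatter(data, 1, MPI_INT, &piece, 1, MPI_INT, 0, comm);
        piece *= 2;                            /* each process works on its piece  */
        MPI_Gather(&piece, 1, MPI_INT, result, 1, MPI_INT, 0, comm);

        if (rank == 0) { free(data); free(result); }
    }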
     

  • MPI_Allgather (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, MPI_Comm comm)
  • MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierror) 
    <type> sendbuf(*), recvbuf(*) 
    integer sendcount, sendtype, recvcount, recvtype, comm, ierror 

    Gathers data from all processes and distribute it to all 

    MPI_Allgather gathers the contents of each send_buffer on each process. Its effect is the same as if there were a sequence of p calls to MPI_Gather, each of which has a different process acting as the root. 
     

  • MPI_Finalize () 
  • MPI_Finalize(ierror) 
    integer ierror 

    Terminates MPI execution environment 

    This call must be made by every process in an MPI computation. It terminates the MPI "environment"; no MPI calls may be made by a process after its call to MPI_Finalize. 
     

  • MPI_Comm_split ( MPI_Comm old_comm, int split_key, int rank_key, MPI_Comm* new_comm)
  • MPI_Comm_split (old_comm, split_key, rank_key, new_comm, ierror) 
    integer old_comm, split_key, rank_key, new_comm, ierror 

    Creates new communicator based on the colors and keys 

    A single call to MPI_Comm_split creates several new communicators, all of them having the same name, *new_comm. It creates a new communicator for each distinct value of split_key: processes with the same value of split_key form a new group. The rank within the new group is determined by the value of rank_key. If process A and process B call MPI_Comm_split with the same value of split_key, and the rank_key argument passed by process A is less than that passed by process B, then the rank of A in the group underlying new_comm will be less than the rank of process B. It is a collective call, and it must be called by all the processes in old_comm. 
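    As an illustration (not from the original text; the q x q virtual grid and the function name are only examples), the sketch below splits MPI_COMM_WORLD into "row" communicators: processes with the same row index share a new communicator, ordered by column index.

    #include <mpi.h>

    void make_row_comm(int q, MPI_Comm *row_comm)
    {
        int rank, my_row, my_col;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        my_row = rank / q;   /* split_key: same value -> same new group        */
        my_col = rank % q;   /* rank_key: orders the ranks inside the new group */

        MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, row_comm);
    }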
     

  • MPI_Comm_group ( MPI_Comm comm, MPI_Group *group)
  • MPI_Comm_group (comm, group, ierror) 
    integer comm, group, ierror 

    Accesses the group associated with the given communicator 
     

  • MPI_Group_incl ( MPI_Group old_group, int new_group_size, int* ranks_in_old_group, MPI_Group* new_group) 
  • MPI_Group_incl (old_group, new_group_size, ranks_in_old_group , new_group, ierror) 
    integer old_group, new_group_size, ranks_in_old_group (*), new_group, ierror 

    Produces a group by reordering an existing group and taking only listed members 
     

  •  MPI_Comm_create(MPI_Comm old_comm, MPI_Group new_group, MPI_Comm * new_comm) 
  • MPI_Comm_create(old_comm, new_group, new_comm, ierror) 
    integer old_comm, new_group, new_comm, ierror 

    Creates a new communicator 

    Groups and communicators are opaque objects. From a practical standpoint, this means that the details of their internal representation depend on the particular implementation of MPI, and, as a consequence, they cannot be directly accessed by the user. Rather, the user accesses a handle that references the opaque object, and the objects are manipulated by special MPI functions such as MPI_Comm_create, MPI_Group_incl, and MPI_Comm_group. Contexts are not explicitly used in any MPI functions. MPI_Comm_group simply returns the group underlying the communicator comm. MPI_Group_incl creates a new group from the list of processes in the existing group old_group. The number of processes in the new group is new_group_size, and the processes to be included are listed in ranks_in_old_group. MPI_Comm_create associates a context with the group new_group and creates the communicator new_comm. All of the processes in new_group must belong to the group underlying old_comm. MPI_Comm_create is a collective operation: all the processes in old_comm must call MPI_Comm_create with the same arguments. 
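    The sketch below (not from the original text; the choice of even ranks and the function name are arbitrary) combines the three routines to build a communicator containing only the even-ranked processes of old_comm.

    #include <mpi.h>
    #include <stdlib.h>

    void make_even_comm(MPI_Comm old_comm, MPI_Comm *even_comm)
    {
        int size, i, nevens, *ranks;
        MPI_Group old_group, even_group;

        MPI_Comm_size(old_comm, &size);
        nevens = (size + 1) / 2;
        ranks = malloc(nevens * sizeof(int));
        for (i = 0; i < nevens; i++) ranks[i] = 2 * i;   /* ranks to include */

        MPI_Comm_group(old_comm, &old_group);             /* group of old_comm   */
        MPI_Group_incl(old_group, nevens, ranks, &even_group);
        MPI_Comm_create(old_comm, even_group, even_comm); /* collective call     */
        /* even_comm is MPI_COMM_NULL on processes not in even_group */

        MPI_Group_free(&old_group);
        MPI_Group_free(&even_group);
        free(ranks);
    }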
     

  • double MPI_Wtime (void) 
  • double precision MPI_Wtime( ) 

    Returns an elapsed time on the calling processor 

    MPI provides a simple routine, MPI_Wtime( ), that can be used to time programs or sections of programs. MPI_Wtime( ) returns a double-precision floating-point number of seconds since some arbitrary point of time in the past. A time interval can be measured by calling this routine at the beginning and at the end of a program segment and subtracting the values returned. 
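    A minimal sketch of this usage follows (not from the original text; the function name is only an example).

    #include <mpi.h>
    #include <stdio.h>

    void timed_segment(void)
    {
        double t_start = MPI_Wtime();
        /* ... code segment to be timed ... */
        double t_elapsed = MPI_Wtime() - t_start;   /* elapsed wall-clock seconds */
        printf("segment took %f seconds\n", t_elapsed);
    }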
     

  • MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf , int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status) 
  • MPI_Sendrecv (sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcount, sendtype, dest, sendtag, recvcount, recvtype, source, recvtag, comm, status(*), ierror 

    Sends and receives a message 

    The function MPI_Sendrecv, as its name implies, performs both a send and a receive. The parameter list is basically just a concatenation of the parameter lists for MPI_Send and MPI_Recv; the only difference is that the communicator parameter is not repeated. The destination and the source parameters can be the same. The "send" in an MPI_Sendrecv can be matched by an ordinary MPI_Recv, and the "receive" can be matched by an ordinary MPI_Send. The basic difference between a call to this function and MPI_Send followed by MPI_Recv (or vice versa) is that MPI can try to arrange that no deadlock occurs, since it knows that the sends and receives will be paired. 
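    A typical use is a shift around a ring of processes, sketched below (not from the original text; the function name and the single-integer payload are arbitrary); the combined call avoids the deadlock that a naive send-then-receive ordering could produce.

    #include <mpi.h>

    void ring_shift(int value, int *received, MPI_Comm comm)
    {
        int rank, size, right, left;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        right = (rank + 1) % size;          /* neighbour to send to      */
        left  = (rank - 1 + size) % size;   /* neighbour to receive from */

        MPI_Sendrecv(&value, 1, MPI_INT, right, 0,
                     received, 1, MPI_INT, left, 0,
                     comm, &status);
    }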
     

  • MPI_Sendrecv_replace (void* buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status) 
  • MPI_Sendrecv_replace (buf, count, datatype, dest, sendtag, source, recvtag, comm, status, ierror) 
    <type> buf (*) 
    integer count, datatype, dest, sendtag, source, recvtag, comm, status (*), ierror 

    Sends and receives using a single buffer 

    MPI_Sendrecv_replace sends and receives using a single buffer. 
     

  • MPI_Bsend (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) 
  • MPI_Bsend (buf, count, datatype, dest, tag, comm, ierror) 
    <type> buf (*) 
    integer count, datatype, dest, tag, comm, ierror 

    Basic send with user specified buffering 

    MPI_Bsend copies the data into a user-supplied buffer and returns control to the user; the data is transferred out of the buffer later. 
     
     

  • MPI_Isend (void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) 
  • MPI_Isend (buf, count, datatype, dest, tag, comm, request, ierror) 
    <type> buf (*) 
    integer count, datatype, dest, tag, comm, request, ierror 

    Begins a nonblocking send 

    MPI_Isend is a nonblocking send. The basic functions in MPI for starting nonblocking communications are MPI_Isend and MPI_Irecv. The "I" stands for "immediate," i.e., they return (more or less) immediately. 
     
     

  • MPI_Irecv (void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) 
  • MPI_Irecv (buf, count, datatype, source, tag, comm, request, ierror) 
    <type> buf (*) 
    integer count, datatype, source, tag, comm, request, ierror 

    Begins a nonblocking receive 

    MPI_Irecv begins a nonblocking receive. The basic functions in MPI for starting nonblocking communications are MPI_Isend and MPI_Irecv. The "I" stands for "immediate," i.e., they return (more or less) immediately. 

     

  • MPI_Wait (MPI_Request *request, MPI_Status *status) 
  • MPI_Wait (request, status, ierror) 
    integer request, status (*), ierror 

    Waits for an MPI send or receive to complete 

    MPI_Wait waits for an MPI send or receive to complete. There are a variety of functions that MPI uses to complete nonblocking operations; the simplest of these is MPI_Wait. It can be used to complete any nonblocking operation. The request parameter corresponds to the request parameter returned by MPI_Isend or MPI_Irecv. 
     
     

  • MPI_Ssend (void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) 
  • MPI_Ssend (buf, count, datatype, dest, tag, comm, ierror) 
    <type> buf (*) 
    integer count, datatype, dest, tag, comm, ierror 

    Performs a blocking send in synchronous mode 

    MPI_Ssend is one of the synchronous mode send operations provided by MPI. 
     

  • MPI_Gatherv (void* sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm) 
  • MPI_Gatherv (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcount, sendtype, recvcounts (*), displs (*), recvtype, root, comm, ierror 

    Gathers into specified locations from all tasks in a group 

    A simple extension to MPI_Gather is MPI_Gatherv. MPI_Gatherv allows the size of the data being sent by each processor to vary. 
     

  • MPI_Scatterv (void* sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) 
  • MPI_Scatterv (sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcounts (*), displs (*), sendtype, recvcount, recvtype, root, comm, ierror 

    Scatters a buffer in parts to all tasks in a group 

    A simple extension of MPI_Scatter is MPI_Scatterv. MPI_Scatterv allows the amount of data sent to each process to vary. 
     

  • MPI_Allreduce (void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) 
  • MPI_Allreduce (sendbuf, recvbuf, count, datatype, op, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer count, datatype, op, comm, ierror 

    Combines values from all processes and distributes the result back to all processes 

    MPI_Allreduce combines values from all processes and distributes the result back to all processes. 
     

  • MPI_Alltoall (void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm) 
  • MPI_Alltoall (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierror) 
    <type> sendbuf (*), recvbuf (*) 
    integer sendcount, sendtype, recvcount, recvtype, comm, ierror 

    Sends data from all to all processes 

    MPI_Alltoall is a collective communication operation in which every process sends a distinct collection of data to every other process. This is an extension of the gather and scatter operations and is also called total exchange. 
     

  •  MPI_Cart_create (MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart) 
  • MPI_Cart_create (comm_old, ndims, dims, periods, reorder, comm_cart, ierror) 
    integer comm_old, ndims, dims(*), comm_cart, ierror logical periods (*) , reorder 

    Makes a new communicator to which topology information has been attached 

    MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndims argument. The user can specify the number of processes in any direction by giving a positive value to the corresponding element of dims. 
     

  • MPI_Cart_coords (MPI_Comm comm, int rank, int maxdims, int *coords) 
  • MPI_Cart_coords (comm, rank, maxdims, coords, ierror) 
    integer comm, rank, maxdims, coords (*), ierror 

    Determines process coordinates in a Cartesian topology given a rank in the group 

    MPI_Cart_coords takes as input a rank in a communicator and returns the coordinates of the process with that rank. MPI_Cart_coords is the inverse of MPI_Cart_rank: it returns the coordinates of the process with rank rank in the Cartesian communicator comm. Note that both of these functions are local. 
     

  • MPI_Cart_sub (MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm) 
  • MPI_Cart_sub (old_comm, remain_dims, new_comm, ierror) 
    integer old_comm, new_comm, ierror 
    logical remain_dims(*) 
      

    Partitions a communicator into subgroups that form lower-dimensional Cartesian subgrids 

    MPI_Cart_sub partitions the processes in cart_comm into a collection of disjoint communicators whose union is cart_comm. Both cart_comm and each new_comm have associated Cartesian topologies. 
     

  • MPI_Cart_rank (MPI_Comm comm, int *coords, int *rank) 
  • MPI_Cart_rank (comm, coords, rank, ierror) 
    integer comm, coords (*), rank, ierror 

    Determines process rank in communicator given Cartesian location 

    MPI_Cart_rank returns, in rank, the rank in the Cartesian communicator comm of the process with the given Cartesian coordinates. coords is an array whose size equals the number of dimensions in the Cartesian topology associated with comm. 
     

  • MPI_Cart_get (MPI_Comm comm, int maxdims, int *dims, int *periods, int *coords) 
  • MPI_Cart_get (comm, maxdims, dims, periods, coords, ierror) 
    integer comm, maxdims, dims (*), coords (*), ierror 
    logical periods (*) 

    Retrieve  Cartesian topology information associated with a communicator 

    MPI_Cart_get retrieves the Cartesian topology information (dimension sizes, periodicity, and the coordinates of the calling process) associated with the communicator. 
     

  • MPI_Cart_shift (MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest) 
  • MPI_Cart_shift (comm, direction, disp, rank_source, rank_dest, ierror) 
    integer comm, direction, disp, rank_source, rank_dest, ierror 

    Returns the shifted source and destination ranks given a shift direction and amount 

    MPI_Cart_shift returns the ranks of the source and destination processes for a shift along the specified dimension in the arguments rank_source and rank_dest, respectively. 
     

  • MPI_Barrier (MPI_Comm comm) 
  • MPI_Barrier (comm, ierror) 
    integer comm, ierror 

    Blocks until all processes have reached this routine 

    MPI_Barrier blocks the calling process until all processes in comm have entered the function. 
     

  • MPI_Dims_create (int nnodes, int ndims, int *dims) 
  • MPI_Dims_create (nnodes, ndims, dims, ierror) 
    integer nnodes, ndims, dims(*), ierror 

    Creates a division of processes in a Cartesian grid 

    MPI_Dims_create creates a division of processes in a Cartesian grid. It is useful to choose dimension sizes for a Cartesian coordinate system. 
     

  • MPI_Waitall (int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses) 
  • MPI_Waitall (count, array_of_requests, array_of_statuses, ierror) 

    integer count, array_of_requests (*), array_of_statuses (MPI_STATUS_SIZE, *), ierror 

    Waits for all given communications to complete 

    MPI_Waitall waits for all the given communications to complete. Related routines can be used to test, or wait for, any or some of a collection of nonblocking operations. 

 
 
1.1.5 Compilation, Linking and Execution of MPI programs on Parallel Computers  
  
Using  mpicc and mpif77   

The compilation and execution details of a parallel program that uses MPI may vary on different parallel computers. The essential steps common to all parallel systems are the same, provided one process is executed on each processor. The three important steps are described below: 

  • After compilation, linking with MPI library and creation of executable, a copy of the executable program is placed on each processor. 
     
  • Each processor begins execution of its copy of the executable. 
     
  • Different processes can execute different statements by branching within the program based on their process ranks. 
The application users commonly use two types of MPI programming models: spmd and non-spmd. In the spmd (Single Program Multiple Data) model, every process runs the same program; different behaviour on different processes is obtained by placing branching statements within that single program. The statements executed by the various processes may differ in various segments of the program, but one executable (the same program) file runs on all processes. 

In the non-spmd programming model, each process may execute a different program, depending on the rank of the process. More than one executable (program) is needed in the non-spmd model; the application user writes several distinct programs, and which program a process runs may depend on its rank. Most of the programs in the hands-on session use the spmd model unless specified otherwise. A minimal sketch of spmd-style branching on rank is given below. 
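
As a hedged sketch only (not one of the hands-on programs), the following fragment shows the usual spmd pattern of branching on the process rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                              /* same executable on every process, */
        printf("Rank 0: coordinating (Root) process\n");   /* different branch taken */
    else
        printf("Rank %d: worker process\n", rank);

    MPI_Finalize();
    return 0;
}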

mpich provides tools that simplify the creation of MPI executables.  As mpich programs may require special libraries and compile options, you should use the commands that mpich provides for compiling and linking programs.  The mpich implementation provides commands for compiling and linking C (mpicc) and Fortran (mpif77/mpif90) programs.  For compilation, the following commands are used, depending on whether the program is in C or Fortran (f77/f90). 
 

mpicc hello_world.c 
mpif77 hello_world.f 
mpif90 hello_world.f 

The link command may include additional libraries.  For example, to use some routines from the MPE library and the math library, one can use the following command 
 

mpicc -o hello_world hello_world.c -lmpe -lm 

These commands are set up for a specific architecture and mpich device and are located in the directory that contains the MPI libraries.  For example, if the architecture is sun4 and the device is ch_p4, these commands may be found in '/usr/local/mpi/bin' (assuming that mpich is installed in /usr/local/mpi). 

 
Using Makefile 

For more control over the process of compiling and linking programs for mpich, you should use a 'Makefile'. The mpicc/mpif77 commands can also be used inside the Makefile, which is particularly convenient for programs spread over a large number of files. In addition, a Makefile provides a simple place to hook in the MPI profiling libraries.  The MPI profiling interface provides a convenient way to add performance analysis tools to any MPI implementation.  The user has to specify the names of the program(s) and the appropriate paths to the MPI libraries in the Makefile. To compile and link an MPI program in C or Fortran, you can use the command 
 

make

(Sample Makefiles, spmd.make and non-spmd.make, are provided with the hands-on material for the spmd and non-spmd programs.) 

 
Executing a program: Using mpirun 

To run an MPI program, use the mpirun command, which is located in 
 

'/usr/local/mpi/bin' 

For almost all systems, you can use this command 
 

mpirun -np <number of processes> a.out 

The argument -np gives the number of processes that will be associated with the MPI_COMM_WORLD communicator, and a.out is the executable file that runs on all processors. 
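
For example, to start the hello_world executable built earlier with mpicc on four processes:

mpirun -np 4 hello_world 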

On most parallel computers, the Chameleon device references a 'hosts' file, an example of which comes with the model MPI implementation.  It selects an appropriate set of machines to run on, based on data in that file, the time of day, and memory requirements. This approach also allows individual machines and separate executables to be specified directly. 

Workstation clusters require that each process in a parallel job be started individually. Programs exist to help start these processes, but clusters require additional information to make use of them.  mpich should be installed with a list of participating workstations in the file 'machines.<arch>' in the directory 
 

'/usr/local/mpi/util/machines' 

This file is used by mpirun to choose the processors to run on.  You can check your machine list as follows: 

Use the script 'tstmachines' in 

'/usr/local/mpi/lib/<arch>/<device>' 

to ensure that you can use all of the machines that you have listed.  This script performs an rsh and a short directory listing; this tests both that you have access to the node and that a program in the current directory is visible on the remote node. 
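
For example, assuming the sun4 architecture and ch_p4 device used earlier (the exact location of the script and the arguments it accepts may differ between mpich versions, so treat this as an illustration):

/usr/local/mpi/lib/sun4/ch_p4/tstmachines sun4 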

 
Executing a program: Using P4 procgroup files 

For even more control over how jobs get started, we need to look at how mpirun starts a parallel program on a cluster of SMPs.  Each time mpirun runs, it constructs and uses a new file of machine names for just that run, using the machines file as input.  Writing this file (a procgroup file) yourself is necessary when you want closer control over the hosts you run on, or when mpirun cannot construct it automatically.  Such is the case when: 
 

  • You want to run on a different set of machines than those listed in the machines file. 
  • You want to run different executables on different hosts (your program is not spmd). 
  • You want to run on a heterogeneous network, which requires different executables. 
  • You want to run all the processes on the same workstation, simulating parallelism by time-sharing on one machine. 
  • You want to run on a network of shared-memory multiprocessors and need to specify the number of processes that will share memory on each machine.
     
The format of a ch_p4 procgroup file is a set of lines of the form 
 
             <Hostnode name>    <# procs>    <progname>    [<login>] 

An example of such a file (hello_world.pg), where the command is being issued from host node e01, one node of PARAM 10000, might be 
 

              e01   0    /home/other/pcopp/pcopp01/MPI/hello_world 
              e02   1    /home/other/pcopp/pcopp01/MPI/hello_world 
              e03   1    /home/other/pcopp/pcopp01/MPI/hello_world 
              e04   1    /home/other/pcopp/pcopp01/MPI/hello_world 
 
The above file specifies four processes, one on each of four nodes (e01, e02, e03, and e04) of a cluster on which the user's account (pcopp01) has access to the executable /home/other/pcopp/pcopp01/MPI/hello_world of the MPI program.  Note that the 0 in the first line indicates that no processes other than the one started by the user's own command are to be started on host node e01. Here the user is logged into host e01 and wants to start a job with one process on e01 and three other processes on e02, e03, and e04. 

You might want to run four processes on one node of the machine. You can do this by repeating its name in the file, as shown below: 

              local 0 
              e01   1    /home/other/pcopp/pcopp01/MPI/hello_world 
              e01   1    /home/other/pcopp/pcopp01/MPI/hello_world 
              e01   1    /home/other/pcopp/pcopp01/MPI/hello_world 
 

Note that this is for 4 processes, one of them started by the user directly and the other 3 specified in this file.  See the MPI manual for more information. This particular ch_p4 procgroup file can be used for the spmd model. In a similar way, we can write a ch_p4 procgroup file for the non-spmd model as below: 
 

              e01    0    /home/other/pcopp/pcopp01/MPI/master 
              e02    1    /home/other/pcopp/pcopp01/MPI/slave 
              e03    1    /home/other/pcopp/pcopp01/MPI/slave 
              e04    1    /home/other/pcopp/pcopp01/MPI/slave 

Here, the master program executes on e01 with rank 0 and the slave programs execute on e02, e03, and e04. 

First, the application user must compile and link the given MPI program in C or Fortran using the make command described above. Next, set up the procgroup file (hello_world.pg) explained above, and then type the following command to execute it: 
 

  hello_world 

Note: For all the spmd programs in the hands-on session, the executable file run is used in place of hello_world, and the executable file master is used for the non-spmd programs. Likewise, the run.pg file is used in place of hello_world.pg for the spmd programs and the master.pg file for the non-spmd programs. 

 
 
1.1.7 Compilation, Linking and Execution of MPI programs  

 

 

 
 
Using mpirun   

The mpich implementation provides commands for compiling and linking C (mpicc) and Fortran (mpif77/mpif90) programs.  You may also use mpicc/mpif77 to build your programs and the mpirun command discussed above to execute them. 

 
1.1.8 Example Program    

Simple MPI C program "hello_world.c" 

The first C parallel program is the "hello_world" program, which simply prints the message "Hello World". Each process sends a message consisting of characters to another process. If there are p processes executing a program, they will have ranks 0, 1, ..., p-1. In this example, the processes with ranks 1, 2, ..., p-1 send a message to the process with rank 0, which we call the Root. The Root process receives the messages from the processes with ranks 1, 2, ..., p-1 and prints them out. 

The simple MPI program in C that prints a hello_world message on the process with rank 0 is explained below. We describe the features of the code and then walk through the program in detail. The first few lines of the code contain variable definitions and constants. These declarations are followed by MPI library calls that initialize the MPI environment and set up MPI communication associated with a communicator; the communicator describes the communication context and an associated group of processes. The call MPI_Comm_size returns in Numprocs the number of processes that the user has started for this program. Each process finds out its rank in the group associated with the communicator by calling MPI_Comm_rank. The following segments of the code illustrate these features. 

The description of program is as follows: 

#include <stdio.h> 
#include <string.h>   /* for strlen, used when sizing the message */ 
#include "mpi.h"  

#define BUFLEN 512  

int main(argc,argv)  
int argc; char *argv[];  
{  
      int MyRank; /* rank of processes */  
      int Numprocs; /* number of processes */  
      int Destination; /* rank of receiver */ 
      int source; /* rank of sender */  
      int tag = 0; /* tag for messages */  
      int Root = 0; /* Root processes with rank 0 */  
      char Send_Buffer[BUFLEN],Recv_Buffer[BUFLEN]; /* Storage for message */  
      MPI_Status status; /* returns status for receive */  
      int iproc,Send_Buffer_Count,Recv_Buffer_Count;
 

      MyRank is the rank of the process and Numprocs is the number of processes in the communicator MPI_COMM_WORLD. 

     /*....MPI initialization.... */ 

       MPI_Init(&argc,&argv);  
      MPI_Comm_rank(MPI_COMM_WORLD,&MyRank);  
      MPI_Comm_size(MPI_COMM_WORLD,&Numprocs);
  

     Now, each process with MyRank not equal to Root sends a message to Root, i.e., the process with rank 0. The process with rank Root receives messages from all other processes and prints them. 

      if(MyRank != 0) {  
          sprintf(Send_Buffer, "Hello World from process with rank %d !!!\n", MyRank);  
          Destination = Root;  
          Send_Buffer_Count = (strlen(Send_Buffer)+1);
          MPI_Send(Send_Buffer, Send_Buffer_Count, MPI_CHAR, Destination, tag, MPI_COMM_WORLD);  
      }  
      else{  
          for(iproc = 1; iproc < Numprocs; iproc++) { 
              source = iproc;  
              Recv_Buffer_Count = BUFLEN;
              MPI_Recv(Recv_Buffer, Recv_Buffer_Count, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);  
              printf("\n %s from Processor %d *** \n", Recv_Buffer, MyRank);  
         }  
      }
 
 
After this, MPI_Finalize is called to terminate the program. Every process in an MPI computation must make this call; it terminates the MPI "environment". 

     /* ....Finalizing the MPI....*/ 

      MPI_Finalize();  

} 
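
Assuming the build and run steps from section 1.1.5 (for example, mpicc -o hello_world hello_world.c followed by mpirun -np 4 hello_world), the output produced by the Root process should look roughly as follows; the exact spacing depends on the printf formats above:

 Hello World from process with rank 1 !!!
 from Processor 0 ***

 Hello World from process with rank 2 !!!
 from Processor 0 ***

 Hello World from process with rank 3 !!!
 from Processor 0 ***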

Simple MPI f77 program "hello_world.f" 

program main

include "mpif.h"  

integer    MyRank, Numprocs  
integer    Destination, Source, iproc 
integer    Destination_tag, Source_tag 
integer    Root,Send_Buffer_Count,Recv_Buffer_Count  
integer    status(MPI_STATUS_SIZE) 
character*12    Send_Buffer,Recv_Buffer 

data Send_Buffer/'Hello World'/ 

C.........MPI initialization....  
call   MPI_INIT(ierror) 
call   MPI_COMM_SIZE(MPI_COMM_WORLD, Numprocs, ierror) 
call   MPI_COMM_RANK(MPI_COMM_WORLD, MyRank, ierror) 

Root = 0 
Send_Buffer_Count = 12 
Recv_Buffer_Count = 12 

if(MyRank .ne. Root) then
     Destination = Root
     Destination_tag = 0
     call MPI_SEND(Send_Buffer, Send_Buffer_Count, MPI_CHARACTER, Destination,
$  Destination_tag, MPI_COMM_WORLD, ierror)
else
   do  iproc = 1, Numprocs-1
     Source = iproc
     Source_tag = 0
     call MPI_RECV( Recv_Buffer, Recv_Buffer_Count, MPI_CHARACTER, Source,
$  Source_tag, MPI_COMM_WORLD, status, ierror)
     print *, Recv_Buffer,' from Process with Rank',iproc
   enddo
   endif
call  MPI_FINALIZE(ierror)
stop
end 

Simple MPI f90 program "hello_world.f90" 

program main

USE   mpi

integer    ::   MyRank, Numprocs
character*12    ::   Send_Buffer, Recv_Buffer
integer    ::   Destination, iproc
integer    ::   tag
integer   ::   Root, Send_Buffer_Count, Recv_Buffer_Count
integer   ::   status(MPI_Status_Size)

data   Send_Buffer/'Hello World'/

!....MPI   initialization....

  call   MPI_Init(ierror)
  call   MPI_Comm_rank(MPI_Comm_World, MyRank, ierror)
  call   MPI_Comm_size(MPI_Comm_World, Numprocs, ierror)

  Root   =   0
  tag   =   0
  Send_Buffer_Count   =   12
  Recv_Buffer_Count   =   12

if(MyRank   .NE.   Root)   then
    Destination   =   Root
    call   MPI_SEND(Send_Buffer, Send_Buffer_Count, MPI_CHARACTER, &
           Destination, tag, MPI_COMM_WORLD, ierror)
else
  do   iproc   =   1,   Numprocs-1
    call   MPI_RECV(Recv_Buffer, Recv_Buffer_Count, MPI_CHARACTER, &
           iproc, tag, MPI_COMM_WORLD, status, ierror)
    print*,   Recv_Buffer, iproc
    enddo
endif
    
call   MPI_FINALIZE( ierror )

stop
end

The basic MPI library calls used in the above codes can be used to write a vast number of efficient programs. The above program is an example of what is frequently called single program multiple data (spmd) programming. 

 

1.1.9 List of Extended Tools available in MPI 
  • MPE  -  MultiProcessing Environment 
    The MPE (Multi-Processing Environment) library is distributed with mpich and provides a number of useful facilities, including debugging, logging, graphics, and some common utility routines; a hedged logging example is sketched after this list. 
     
  • MPIX  -  Message Passing Interface extension library. 
    The MPIX Library has been developed at the Mississippi State University NSF Engineering Research Center and currently contains a set of extensions to MPI that allow many functions that previously only worked with intracommunicators to work with intercommunicators. Extensions include enhanced intercommunicator support for: 
                   1. Construction 
                   2. Collective operations 
                   3. Topologies 
 
  • UPSHOT  -  Visualization Tool. 
    Upshot (Pacheco, 1997; Gropp et al., 1994a-b, 1996a-b; MPI Forum, 1994) is a parallel program performance analysis tool that comes bundled with the public-domain mpich implementation of MPI. We discuss it here because it has many features in common with other parallel performance analysis tools, and it is readily available for use with MPI. Upshot provides information that is not easily obtained from data generated by serial tools such as prof, or by simply adding counters and/or timers using MPI's profiling interface. It attempts to provide a unified view of the profiling data generated by each process by adjusting the time stamps of events on different processes so that all the processes start and end at the same time. It also provides a convenient form for visualizing the profiling data in a Gantt chart. There are basically two methods of using Upshot. In the simpler approach, we link our source code with the appropriate libraries and obtain information on the time spent by our program in each MPI function. If we want information on more general segments or states of our program, we can have Upshot provide custom profiling data by adding appropriate function calls to our source code. 

  • JUMPSHOT  -  Java Based Visualization Tool. 
    The MPE (Multi-Processing Environment) library is distributed with the freely available MPICH implementation and provides a number of useful facilities, including debugging, logging, graphics, and some common utility routines. MPE was developed for use with MPICH but can be used with other MPI implementations. MPE provides several ways to generate logfiles, which can then be viewed with graphical tools also distributed with MPE. The easiest way to generate logfiles is to link with an MPE library that uses the MPI profiling interface. The user can also insert calls to the MPE logging routines into his or her code. MPE provides two different logfile formats, CLOG and SLOG. CLOG files consist of a simple list of timestamped events and are understood by the Jumpshot-2 viewer. SLOG stands for Scalable LOGfile format and is based on doubly timestamped states. SLOG files are understood by the Jumpshot-3 viewer.
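
As a hedged illustration only (the compile switch names and logfile extensions vary between mpich/MPE versions, so treat these commands as assumptions to check against your installation): many mpich installations let you produce an MPE logfile simply by recompiling with mpicc's -mpilog switch and then opening the resulting logfile in a Jumpshot viewer, for example:

mpicc -mpilog -o hello_world hello_world.c
mpirun -np 4 hello_world
jumpshot hello_world.clog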

1.1.10 MPI Information on the Web 

A large number of resources are available on the Internet. The following is just a pointer to a few of them. 

Implementation of MPI 

  • The mpich implementation developed at Argonne National Laboratory and Mississippi State University can be downloaded from 
  • ftp://info.mcs.anl.gov 

    Upshot is bundled with this implementation. It is also available as a separate package. 
     

  • The LAM implementation developed at the Ohio Supercomputer Center can be downloaded from
  • ftp://ftp.osc.edu/pub/lam
     
  • The CHIMP implementation developed at the Edinburgh Parallel Computing Centre can be downloaded from 
  • ftp://ftp.epcc.ed.ac.uk/pub/chimp/release
     
  • There are two implementations of MPI that run under Windows. WinMPI runs under Windows 3.1 and is available at 
  • ftp://csftp.unomaha.edu/pub/rewini/Win
MPI Web Pages 
 

MPI-2 and MPI-IO 
 

MPI Newsgroup 

There is a Usenet newsgroup devoted to MPI: 

comp.parallel.mpi 

The MPI FAQ 

The MPI FAQ (frequently asked questions) list is available at 

http://www.erc.msstate.edu/mpi/mpi-faq.html 

1.1.11 Reference Books on MPI and Parallel Computing 

[1]  William Gropp, Ewing Lusk, Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, The MIT Press, Cambridge, Massachusetts, London, England, (1994). 

[2]  Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings Publishing Company, Redwood City, CA, (1994). 

[3]  William Gropp, Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh, (1996). 

[4]  Ian T. Foster, Designing and Building Parallel Programs, Addison-Wesley Publishing Company, Reading, MA, (1994). 

[5]  Gene H. Golub, Charles F. Van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press, (1989). 

[6]  James M. Ortega, Introduction to Parallel and Vector Solution of Linear Equations, Frontiers of Computer Science series (editor: Arnold L. Rosenberg), Plenum Press, New York, (1988). 

[7]  Michael J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill International Editions, Computer Science Series, McGraw-Hill, Inc., New York, (1987). 

[8]  Kai Hwang, Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill, New York, (1997). 

[9]  Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc., San Francisco, California, (1997). 

[10] David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc., (1999). 
 

 