The Message Passing Interface (MPI) is a proposed standard originally designed for writing applications and libraries for distributed-memory environments. The main advantages of establishing a message-passing interface for such environments are portability and ease of use, and a standard message-passing interface is a key component in building a concurrent computing environment in which applications, software libraries, and tools can be transparently ported between different machines. MPI is intended to be a standard message-passing interface for applications and libraries running on concurrent computers with logically distributed memory. MPI is not specifically designed for use by parallelizing compilers. MPI provides no explicit support for multithreading, and the design goals of the MPI standard do not include the mandate that an implementation be interoperable with other MPI implementations. However, MPI does provide message-passing routines for exchanging all the information needed to allow a single MPI implementation to operate in a heterogeneous environment. We assume that the mpich implementation is installed in '/usr/local/mpi' and that the application user has added '/usr/local/mpi/bin' to their path. If mpich is installed somewhere else, you should make the appropriate changes.
mpich has been built and installed on the parallel systems with knowledge of the architecture and the device. The architecture indicates the kind of processor; an example is sun4. The device indicates how mpich performs communication between processes; an example is ch_p4. The libraries and special commands for each architecture/device pair are provided in the corresponding installation directory.
In the past, both commercial vendors and research software developers provided different solutions to users of the message-passing paradigm. Important issues for the user community are portability, performance, and features. The user community, which quite definitely includes the software suppliers themselves, recently determined to address these issues. In April 1992, the Center for Research on Parallel Computation sponsored a one-day workshop on Standards for Message Passing in a Distributed-Memory Environment. The result of that workshop, which featured presentations of many systems, was a realization that people were eager to cooperate on the definition of a standard. At the Supercomputing '92 conference in November, a committee was formed to define a message-passing standard. At the time of creation, few knew what the outcome might look like, but the effort was begun with the goals already mentioned: portability, performance, and features.
MPI achieves portability by providing a public-domain, platform-independent standard for a message-passing library. MPI specifies this library in a language-independent form and provides Fortran and C bindings. The specification does not contain any feature that is specific to a particular vendor, operating system, or hardware. For these reasons, MPI has gained wide acceptance in the parallel computing community. MPI has been implemented on IBM PCs running Windows, all major Unix workstations, and all major parallel computers. This means that a parallel program written in standard C or Fortran, using MPI for message passing, can run without change on a single PC, a workstation, a network of workstations, or an MPP, from any vendor, on any operating system.

MPI is not a stand-alone, self-contained software system. It serves as a message-passing communication layer on top of the native parallel programming environment, which takes care of necessities such as process management and I/O. Besides these proprietary environments, there are several public-domain MPI environments. Examples include the CHIMP implementation developed at Edinburgh University, and LAM (Local Area Multicomputer), developed at the Ohio Supercomputer Center, which is an MPI programming environment for heterogeneous Unix clusters.

MPICH
The most popular public-domain implementation is MPICH, developed jointly by Argonne National Laboratory and Mississippi State University. MPICH is a portable implementation of MPI for a wide range of machines, from IBM PCs and networks of workstations to SMPs and MPPs. The portability of MPICH means that one can simply retrieve the same MPICH package and install it on almost any platform. MPICH also has good performance on many parallel machines, because it often runs in the more efficient native mode rather than over common TCP/IP sockets. In addition to meetings every six weeks for more than a year, there were continuous discussions via electronic mail, in which many persons from the worldwide parallel computing community participated. Equally important, an early commitment to producing a model implementation helped to demonstrate that an implementation of MPI was feasible. The MPI Standard is just being completed (May 1994). For more information on mpich versions, refer to http://www-unix.mcs.anl.gov/mpi/mpich

1.1.3 MPI Basics
Perhaps the best way to introduce the concepts in MPI that might initially appear unfamiliar is to show how they have arisen as necessary extensions of quite familiar concepts. Let us consider what is perhaps the most elementary operation in a message-passing library, the basic send operation. In most current message-passing systems, it looks very much like this:
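    send (address, length, destination, tag)

Here address is the beginning of a contiguous buffer containing length bytes of data, destination is the identifier of the process to send to, and tag is an integer used to separate kinds of messages. The corresponding receive looks like:

    recv (address, maxlen, source, tag, actlen)

(The parameter names are those used in the discussion that follows; exact spellings vary from system to system.)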
Such a receive will complete successfully only if a message is received with the correct tag. Other messages are queued until a matching receive is executed. In most current systems, source is an output argument indicating where the message came from, although in some systems it can also be used to restrict matching and to cause message queuing. On a receive, address and maxlen together describe the buffer into which the received data is to be put, and actlen is the number of bytes received. Message-passing systems with this sort of syntax and semantics have proven extremely useful, yet have imposed restrictions that are now recognized as undesirable by a large user community. The MPI Forum has sought to lift these restrictions by providing more flexible versions of each of these parameters, while retaining the familiar underlying meanings of the basic send and receive operations. Let us examine these parameters one by one, in each case discussing first the current restrictions and then the MPI version. The (address, length) specification of the message to be sent was a good match for early hardware but is no longer adequate, for two different reasons: first, the data to be sent are often not contiguous in memory (consider a row or column of a matrix, or a structure containing components of several types); and second, in a heterogeneous environment the system needs to know the datatype of what is transmitted in order to perform any necessary data-format conversions. MPI therefore describes message buffers by a count and a datatype, where the datatype may be a user-constructed derived datatype describing noncontiguous data.
A nice feature of the MPI design is that it provides powerful functionality based on four orthogonal concepts. These four concepts in MPI are message datatypes, communicators, communication operations, and virtual topology.

Separating Families of Messages
Nearly all message-passing systems provide a tag argument for the send and receive operations. This argument allows the programmer to deal with the arrival of messages in an orderly way, even if the messages do not arrive in the order anticipated: the system queues messages that arrive "with the wrong tag" until the program is ready for them. Usually a facility exists for specifying wild-card tags that match any tag. This mechanism has proven necessary but insufficient, because the arbitrariness of the tag choices means that the entire program must use tags in a predefined, coherent way. Particular difficulties arise in the case of libraries, written far from the application programmer in time and space, whose messages must not be accidentally received by the application program. MPI's solution is to extend the notion of tag with a new concept: context. Contexts are allocated at run time by the system in response to user (and library) requests and are used for matching messages. They differ from tags in that they are allocated by the system instead of the user, and no wild-card matching is permitted. The usual notion of message tag, with wild-card matching, is retained in MPI.

Naming Processes
Processes belong to groups. If a group contains n processes, then its processes are identified within the group by ranks, which are integers from 0 to n-1. There is an initial group to which all processes in an MPI implementation belong. Within this group, processes are numbered similarly to the way in which they are numbered in many existing message-passing systems: from 0 up to 1 less than the total number of processes.

Communicators
The notions of context and group are combined in a single object called a communicator, which becomes an argument to most point-to-point and collective operations. Thus the destination or source specified in a send or receive operation always refers to the rank of the process in the group identified with the given communicator. That is, in MPI the basic (blocking) send operation has become:
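    MPI_Send (buf, count, datatype, dest, tag, comm)

and the corresponding basic receive has become:

    MPI_Recv (buf, count, datatype, source, tag, comm, status)

(These are the C argument lists; as described later, the Fortran bindings append a final ierror argument.)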
The source, tag, and count of the message actually received can be retrieved from status. Several other message-passing systems return the "status" parameters by separate calls that implicitly reference the most recent message received. MPI's method is one aspect of its effort to be reliable in the situation where multiple threads are receiving messages on behalf of a process. The processes involved in the execution of a parallel program using MPI are identified by a sequence of non-negative integers: if there are p processes executing a program, they will have ranks 0, 1, 2, ..., p-1. The MPI standard provides:
- A set of routines that support point-to-point communication between pairs of processes. Blocking and nonblocking versions of the routines are provided, which may be used in four different communication modes corresponding to different communication protocols.
- Message selectivity in point-to-point communication by source process and message tag, each of which may be wildcarded to indicate that any valid value is acceptable.
- The communicator abstraction, which provides support for the design of safe, modular parallel software libraries.
- General, or derived, datatypes, which permit the specification of messages of noncontiguous data of different datatypes.
- Application topologies that specify the logical layout of processes. A common example is a Cartesian grid, which is often used in two- and three-dimensional problems.
- A rich set of collective communication routines that perform coordinated communication among a set of processes.

In MPI there is no mechanism for creating processes, and an MPI program is parallel ab initio, i.e., there is a fixed number of processes from the start to the end of an application program. All processes are members of at least one process group. Initially all processes are members of the same group, and a number of routines are provided that allow an application to create (and destroy) new subgroups. Within a group each process is assigned a unique rank in the range 0 to n-1, where n is the number of processes in the group. This rank is used to identify a process and, in particular, is used to specify the source and destination processes in a point-to-point communication operation.

MPI was designed as a message-passing interface rather than a complete parallel programming environment, and thus in its current form it intentionally omits many desirable features. For example, MPI lacks mechanisms for process creation and control; one-sided communication operations that would permit put and get messages, as well as active messages; nonblocking collective communication operations; the ability for a collective communication operation to involve more than one group; and language bindings for Fortran 90 and C++. These issues, and other possible extensions to MPI, are currently
being considered in the MPI-2 effort. Extensions to MPI for performing
parallel I/O are also under consideration.

Communicators provide support for the design of safe, modular software libraries. Here, "safe" means that messages intended for receipt by a particular receive call in an application will not be incorrectly intercepted by a different receive call. Thus, communicators are a powerful mechanism for avoiding unintentional non-determinism in message passing. This may be a particular problem when using third-party software libraries that perform message passing. The point here is that the application developer has no way of knowing whether the tag, group, and rank completely disambiguate the message traffic of the different libraries and the rest of the application. Communicator arguments are passed to all MPI message-passing routines, and a message can be communicated only if the communicator arguments passed to the send and receive routines match. Thus, in effect, communicators provide an additional criterion for message selection, and hence permit the construction of independent tag spaces.

If communicators are not used to disambiguate message traffic, there are two ways in which a call to a library routine can lead to unintended behaviour. The first case arises when the processes enter a library routine synchronously, but a send has been initiated for which the matching receive is not posted until after the library call; in this case the message may be incorrectly received in the library routine. The second possibility arises when different processes enter a library routine asynchronously, resulting in nondeterministic behaviour. If the program behaves correctly, processes 0 and 1 each receive a message from process 2, using a wildcard selection criterion to indicate that they are prepared to receive a message from any process. The three processes then pass data around in a ring within the library routine. If separate communicators are not used for the communication inside and outside of the library routine, this program may fail intermittently. Suppose we delay the sending of the second message sent by process 2, for example by inserting some computation; in this case the wildcarded receive in process 0 is satisfied by a message sent from process 1, rather than from process 2, and deadlock results. By supplying a different communicator to the library routine, we can ensure that the program executes correctly, regardless of when the processes enter the library routine.

Communicators are opaque objects, which means they can be manipulated only by MPI routines. The key point about communicators is that when a communicator is created by an MPI routine it is guaranteed to be unique. Thus it is possible to create a communicator and pass it to a software library, provided that communicator is not used for any message passing outside of the library. Communicators have a number of attributes. The group attribute identifies the process group relative to which process ranks are interpreted, and/or which identifies the process group involved in a collective communication operation. Communicators also have a topology attribute, which gives the topology of the process group. In addition, users may associate arbitrary attributes with communicators through a mechanism known as caching.

Point-To-Point Communication
MPI provides routines for sending and receiving blocking and nonblocking messages. A blocking send does not return until it is safe for the application to alter the message buffer on the sending process without corrupting or changing the message sent. A nonblocking send may return while the message buffer on the sending process is still volatile, and the buffer should not be changed until it is guaranteed that this will not corrupt the message. This may be ensured either by calling a routine that blocks until the message buffer may be safely reused, or by calling a routine that performs a nonblocking check on the message status. A blocking receive suspends execution on the receiving process until the incoming message has been placed in the specified application buffer. A nonblocking receive may return before the message has actually been received into the specified application buffer, and a subsequent call must be made to ensure that this has occurred before the buffer is reused.

Communication Modes
In MPI, a message may be sent in one of four communication modes,
which approximately correspond to the most common protocols used for
point-to-point communication. In ready mode a message may be sent
only if a corresponding receive has been initiated. In standard
mode a message may be sent regardless of whether a corresponding receive
has been initiated. MPI includes a synchronous mode which is the
same as the standard mode, except that the send operation will not be
complete until a corresponding receive has been initiated on the
destination process. Finally, there is a buffered mode. To use
buffered mode the user must first supply a buffer and associate
it with a communicator. When a subsequent send is performed using that
communicator, MPI may use the associated buffer to send the message. A
buffered send may be performed regardless of whether a corresponding
receive has been initiated. There are, therefore, eight types of send and two types of receive operation. In addition, routines are provided that perform a send and a receive simultaneously. Different calls are provided for when the send and receive buffers are distinct and for when they are the same. The send/receive operation is blocking, so it does not return until the send buffer is ready for reuse and the incoming message has been received. The two send/receive routines bring the total number of point-to-point message-passing routines up to twelve.

Message-Passing Modes
It is customary in message-passing systems to use the term communication to refer to all interaction operations, i.e., communication, synchronization, and aggregation. Communications usually occur within processes of the same group; however, inter-group communications are also supported by some systems (e.g., MPI). There are three aspects of a communication mode that a user should understand: how many processes are involved, how the processes are synchronized, and how communication buffers are managed. Three communication modes are used in today's message-passing systems. We describe these communication modes below from the user's viewpoint, using a pair of send and receive operations in three different ways. We use the following code example to demonstrate the ideas.

Send and receive buffers in message passing
In the following code, process P sends a message contained in variable M to process Q, which receives the message into its variable S.

    Process P:              Process Q:
        M = 10;                 S = -100;
    L1: send M to Q;        L1: receive S from P;
    L2: M = 20;             L2: X = S + 1;

The variable M is often called the send message buffer (or send buffer), and S is called the receive message buffer (or receive buffer).

Synchronous Message Passing: When process P executes a synchronous send M to Q, it has to wait until process Q executes a corresponding synchronous receive S from P. Neither process returns from the send or the receive until the message M is both sent and received. This means in the above code that variable X should evaluate to 11. When the send and receive return, M can be immediately overwritten by P and S can be immediately read by Q in the subsequent statements L2. Note that no additional buffer needs to be provided in synchronous message passing: the receive message buffer S is available to hold the arriving message.

Blocking Send/Receive: A blocking send is executed when a process reaches it, without waiting for a corresponding receive. A blocking send does not return until the message is sent, meaning that the message variable M can then be safely rewritten. Note that when the send returns, a corresponding receive is not necessarily finished, or even started; all we know is that the message is out of M. It may have been received, but it may also be temporarily buffered in the sending node, somewhere in the network, or in a buffer in the receiving node. A blocking receive is executed when a process reaches it, without waiting for a corresponding send; however, it cannot return until the message is received. In the above code, X should evaluate to 11 with blocking send/receive. Note that the system may have to provide a temporary buffer for blocking-mode message passing.

Nonblocking Send/Receive: A nonblocking send is executed when a process reaches it, without waiting for a corresponding receive. A nonblocking send can return immediately after it notifies the system that the message M is to be sent. The message data are not necessarily out of M; therefore, it is unsafe to overwrite M. A nonblocking receive is executed when a process reaches it, without waiting for a corresponding send.
It can return immediately after it notifies the system that there is a message to be received. The message may have already arrived, may still be in transit, or may not even have been sent yet. With the nonblocking mode, X could evaluate to 11, 21, or -99 in the above code, depending on the relative speeds of the two processes. The system may have to provide a temporary buffer for nonblocking message passing. These three modes are compared in the table below.

Comparison of Three Communication Modes

    Mode           Send returns                     Safe to overwrite M?   X evaluates to    System buffer needed
    Synchronous    when the receive completes       yes                    11                no
    Blocking       when the message has left M      yes                    11                may be needed
    Nonblocking    immediately                      no                     11, 21, or -99    may be needed
In real parallel systems, there are variants on these definitions of the synchronous (and the other two) modes. For example, in some systems a synchronous send could return when the corresponding receive is started but not finished, and a different term may be used to refer to the synchronous mode. The term asynchronous is used to refer to a mode that is not synchronous, such as the blocking and nonblocking modes. Blocking and nonblocking modes are used in almost all existing message-passing systems. They both reduce the wait time of a synchronous send. However, sufficient temporary buffer space must be available for an asynchronous send, because the corresponding receive may not even have started, and thus the memory space to put the received message may not be known. The main motivation for using the nonblocking mode is to overlap communication and computation. However, nonblocking communication introduces its own inefficiencies, such as extra memory space for the temporary buffers, allocation of the buffer, copying of the message into and out of the buffer, and the execution of an extra wait-for function. These overheads may significantly offset any gains from overlapping communication with computation.

Persistent Communication Requests
MPI also provides a set of routines for creating communication request objects that completely describe a send or receive operation by binding together all the parameters of the operation. A handle to the communication object so formed is returned and may be passed to a routine that actually initiates the communication. As with the nonblocking communication routines, a subsequent call should be made to ensure completion of the operation. Persistent communication objects may be used to optimize communication performance, particularly when the same communication pattern is repeated many times in an application. For example, if a send routine is called within a loop, performance may be improved by creating a communication request object that describes the parameters of the send prior to entering the loop, and then initiating the communication inside the loop to send the data on each pass through the loop. There are five routines for creating communication objects: four for send operations (one corresponding to each communication mode), and one for receive operations. A persistent communication object should be deallocated when no longer needed.

Application Topologies
In many applications the processes are arranged with a particular topology, such as a two- or three-dimensional grid. MPI provides support for general application topologies that are specified by a graph in which processes that communicate a significant amount are connected by an arc. If the application topology is an n-dimensional Cartesian grid, then this generality is not needed, so as a convenience MPI provides explicit support for such topologies. For a Cartesian grid, periodic or nonperiodic boundary conditions may apply in any specified grid dimension. In MPI, a group either has a Cartesian or graph topology, or no topology. In addition to providing routines for translating between process rank and location in the topology, MPI also:
- allows knowledge of the application topology to be exploited in order to efficiently assign processes to physical processors,
- provides a routine for partitioning a Cartesian grid into hyperplane groups by removing a specified set of dimensions,
- provides support for shifting data along a specified dimension of a Cartesian grid.
By dividing a Cartesian grid into hyperplane groups, it is possible to perform collective communication operations within these groups. In particular, if all but one dimension is removed, a set of one-dimensional subgroups is formed, and it is possible, for example, to perform a multicast in the corresponding direction.

Collective Communication
Collective communication routines provide for coordinated
communication among a group of processes. The process group is given by
the communicator object that is input to the routine. The MPI collective
communication routines have been designed so that their syntax and
semantics are consistent with those of the point-to-point routines. The
collective communication routines may be (but do not have to be)
implemented using the MPI point-to-point routines. Collective
communication routines do not have message tag arguments, though an
implementation in terms of the point-to-point routines may need to make
use of tags. A collective communication routine must be called by all
members of the group with consistent arguments. As soon as a process has
completed its role in the collective communication it may continue with
other tasks. Thus, a collective communication is not necessarily a
barrier synchronization for the group. MPI does not include nonblocking
forms of the collective communication routines. MPI collective
communication routines are divided into two broad classes: data movement
routines, and global computation routines. Data movement routines rearrange data among the processes (broadcast, gather, scatter, and their variants), while global computation routines compute a result over data contributed by all members of the group (such as reductions).
Virtual Topologies
One can conceptualize processes in an application-oriented topology, for convenience in programming. Both general graphs and grids of processes are supported. Topologies provide a high-level method for managing process groups without dealing with them directly. Since topologies are a standard part of MPI, we do not treat them as an exotic, advanced feature; we use them early and freely from then on.

Debugging and Profiling
Rather than specify any particular interface, MPI requires the availability of "hooks" that allow users to intercept MPI calls and thus define their own debugging and profiling mechanisms.

Support for Libraries
The structuring of all communication through communicators provides library writers, for the first time, with the capabilities they need to write parallel libraries that are completely independent of user code and interoperable with other libraries. Libraries can maintain arbitrary data, called attributes, associated with the communicators they allocate, and can specify their own error handlers.

Support for Heterogeneous Networks
MPI programs can run on networks of machines that have different lengths and formats for the various fundamental datatypes, since each communication operation specifies a (possibly very simple) structure and all the component datatypes, so that the implementation always has enough information to do data-format conversions if they are necessary. MPI does not specify how this is done, however, thus allowing a variety of optimizations.

Brief Introduction to MPI Calls
The most commonly used MPI library calls in Fortran and C are explained below. The syntax shown for each call is the Fortran form; the C bindings take the same arguments, except that the final ierror argument is omitted and the error code is returned as the function value.

Syntax:
MPI_Init (ierror)
Initializes the MPI execution environment. This call is required in every MPI program and must be the first MPI call; it establishes the MPI "environment". Only one invocation of MPI_Init can occur in each program execution. In C it takes the command-line arguments as parameters, while in a Fortran call to MPI_Init the only argument is the error code. Every Fortran MPI subroutine returns an error code in its last argument, which is either MPI_SUCCESS or an implementation-defined error code. MPI_Init allows the system to do any special setup so that the MPI library can be used.
MPI_Comm_rank (comm, rank, ierror)
Determines the rank of the calling process in the communicator. The first argument to the call is a communicator, and the rank of the process is returned in the second argument. Essentially, a communicator is a collection of processes that can send messages to each other. The only communicator needed for basic programs is MPI_COMM_WORLD, which is predefined in MPI and consists of the processes running when program execution begins.
MPI_Comm_size (comm, size, ierror)
Determines the size of the group associated with a communicator. This function determines the number of processes executing the program. Its first argument is the communicator, and it returns the number of processes in the communicator in its second argument.
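Taken together, MPI_Init, MPI_Comm_rank, and MPI_Comm_size form the skeleton of nearly every MPI program. A minimal sketch in C (the printf is illustrative; MPI_Finalize is described later in this list):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int MyRank, Numprocs;

        MPI_Init(&argc, &argv);                    /* set up the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);    /* rank of this process       */
        MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);  /* number of processes        */

        printf("Process %d of %d\n", MyRank, Numprocs);

        MPI_Finalize();                            /* shut down MPI              */
        return 0;
    }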
MPI_Send(buf, count, datatype, dest, tag, comm, ierror)
Basic send (a blocking send call). The first three arguments describe the message as the address, count, and datatype. The contents of the message are stored in the block of memory referenced by the address. The count specifies the number of elements contained in the message, which are of the MPI type given by datatype. The next argument is the destination, an integer specifying the rank of the destination process. The tag argument helps identify messages.
MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror)
Basic receive (a blocking receive call). The first three arguments describe the message as the address, count, and datatype. The contents of the message are stored in the block of memory referenced by the address. The count specifies the number of elements contained in the message, which are of the MPI type given by datatype. The next argument is the source, which specifies the rank of the sending process. MPI allows the source to be a "wild card": there is a predefined constant MPI_ANY_SOURCE that can be used if a process is ready to receive a message from any sending process rather than from a particular one. The tag argument helps identify messages. The last argument returns information on the data that was actually received; it references a record with two fields, one for the source and the other for the tag.
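As a sketch of how the two calls fit together, suppose process 0 sends one integer to process 1 (the tag value 100 and the variable names are illustrative, and MyRank is assumed to have been set as shown earlier):

    int value;
    MPI_Status status;

    if (MyRank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 100, MPI_COMM_WORLD);
    } else if (MyRank == 1) {
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 100, MPI_COMM_WORLD, &status);
        /* the two fields of the status record */
        printf("got %d from rank %d, tag %d\n", value, status.MPI_SOURCE, status.MPI_TAG);
    }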
MPI_Bcast(buffer, count, datatype, root, comm, ierror)
Broadcasts a message from the process with rank "root" to all other processes of the group. It is a collective communication call in which a single process sends the same data to every process. It sends a copy of the data in buffer on process root to each process in the communicator comm. It should be called by all processes in the communicator with the same arguments for root and comm.
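For example, a root process that has read a problem size can share it with every process (the variable n and the value 100 are illustrative):

    int n = 0;
    if (MyRank == 0)
        n = 100;                                  /* only the root knows n initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* afterwards every process has n  */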
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm,
ierror)
Reduces values on all processes to a single value. MPI_Reduce combines the operands stored in sendbuf using operation op and stores the result in recvbuf on the root. Both sendbuf and recvbuf refer to count memory locations of type datatype. MPI_Reduce must be called by all processes in the communicator comm, and count, datatype, and op must be the same on each process.
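A small sketch, summing one integer from every process onto the root (the use of MPI_SUM and the variable names are illustrative):

    int operand = MyRank;   /* each process contributes its own rank */
    int result;
    MPI_Reduce(&operand, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    /* on rank 0, result now holds 0 + 1 + ... + (Numprocs-1) */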
MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount,
recvtype, root , comm, ierror)
Sends data from one process to all processes in a group. The process with rank root distributes the contents of sendbuf among the processes. The contents of sendbuf are split into p segments, each consisting of sendcount elements. The first segment goes to process 0, the second to process 1, and so on. The send arguments are significant only on process root.
MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount,
recvtype, root, comm, ierror)
Gathers together values from a group of processes. Each process in comm sends the contents of sendbuf to the process with rank root. The process root concatenates the received data in process-rank order in recvbuf. The receive arguments are significant only on the process with rank root. The argument recvcount indicates the number of items received from each process, not the total number received.
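The two calls are often used together: the root scatters one piece of work to each process and gathers the results back in rank order. A sketch (the array size is illustrative and assumed to be at least Numprocs):

    int data[128];       /* significant only on the root */
    int item;

    MPI_Scatter(data, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);
    item = item * item;  /* each process works on its own piece */
    MPI_Gather(&item, 1, MPI_INT, data, 1, MPI_INT, 0, MPI_COMM_WORLD);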
MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount,
recvtype, comm, ierror)
Gathers data from all processes and distributes them to all. MPI_Allgather gathers the contents of each sendbuf on every process. Its effect is the same as if there were a sequence of p calls to MPI_Gather, each with a different process acting as root.
MPI_Finalize(ierror)
Terminates the MPI execution environment. This call must be made by every process in an MPI computation. It terminates the MPI "environment"; no MPI calls may be made by a process after its call to MPI_Finalize.
MPI_Comm_split (comm, split_key, rank_key, new_comm, ierror)
Creates new communicators based on colors and keys. A single call to MPI_Comm_split creates q new communicators, all of them having the same name, *new_comm. It creates a new communicator for each value of split_key: processes with the same value of split_key form a new group, and the rank within the new group is determined by the value of rank_key. If process A and process B call MPI_Comm_split with the same value of split_key, and the rank_key argument passed by process A is less than that passed by process B, then the rank of A in the group underlying new_comm will be less than the rank of process B. It is a collective call, and it must be called by all the processes in comm.
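For instance, to split MPI_COMM_WORLD into one communicator holding the even-ranked processes and one holding the odd-ranked processes, each ordered by the old ranks (a sketch; the variable names are illustrative):

    MPI_Comm new_comm;
    int split_key = MyRank % 2;   /* which new group this process joins     */
    int rank_key  = MyRank;       /* ordering of ranks within the new group */
    MPI_Comm_split(MPI_COMM_WORLD, split_key, rank_key, &new_comm);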
MPI_Comm_group (comm, group, ierror)
Accesses the group associated with the given
communicator
MPI_Group_incl (old_group, new_group_size, ranks_in_old_group
, new_group, ierror)
Produces a group by reordering an existing
group and taking only listed members
MPI_Comm_create(old_comm, new_group, new_comm, ierror)
Creates a new communicator. Groups and communicators are opaque objects. From a practical standpoint, this means that the details of their internal representation depend on the particular implementation of MPI and, as a consequence, they cannot be directly accessed by the user. Rather, the user accesses a handle that references the opaque object, and the objects are manipulated by special MPI functions such as MPI_Comm_create, MPI_Group_incl, and MPI_Comm_group. Contexts are not explicitly used in any MPI functions. MPI_Comm_group simply returns the group underlying the communicator comm. MPI_Group_incl creates a new group from the list of processes in the existing group old_group; the number of processes in the new group is new_group_size, and the processes to be included are listed in ranks_in_old_group. MPI_Comm_create associates a context with the group new_group and creates the communicator new_comm. All of the processes in new_group must belong to the group underlying old_comm. MPI_Comm_create is a collective operation; all the processes in old_comm must call MPI_Comm_create with the same arguments.
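A sketch showing the three routines used together to build a communicator containing the first half of the processes (the fixed array size is an illustrative assumption, adequate while Numprocs <= 128):

    MPI_Group world_group, new_group;
    MPI_Comm  new_comm;
    int ranks[64], i, half = Numprocs / 2;

    for (i = 0; i < half; i++)
        ranks[i] = i;                                    /* ranks to include */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);        /* underlying group */
    MPI_Group_incl(world_group, half, ranks, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);  /* collective call */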
double precision MPI_Wtime ( )
Returns the elapsed time on the calling processor. MPI provides a simple routine, MPI_Wtime( ), that can be used to time programs or sections of programs. MPI_Wtime( ) returns a double-precision floating-point number of seconds since some arbitrary point of time in the past. A time interval can be measured by calling this routine at the beginning and at the end of a program segment and subtracting the values returned.
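For example:

    double start, finish;
    start = MPI_Wtime();
    /* ... program segment being timed ... */
    finish = MPI_Wtime();
    printf("elapsed time = %f seconds\n", finish - start);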
MPI_Sendrecv (sendbuf, sendcount, sendtype, dest, sendtag, recvbuf,
recvcount, recvtype, source, recvtag, comm, status, ierror)
Sends and receives a message. The function MPI_Sendrecv, as its name implies, performs both a send and a receive. The parameter list is basically a concatenation of the parameter lists for MPI_Send and MPI_Recv; the only difference is that the communicator parameter is not repeated. The destination and the source parameters can be the same. The "send" in an MPI_Sendrecv can be matched by an ordinary MPI_Recv, and the "receive" can be matched by an ordinary MPI_Send. The basic difference between a call to this function and MPI_Send followed by MPI_Recv (or vice versa) is that MPI can try to arrange that no deadlock occurs, since it knows that the sends and receives will be paired.
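A typical use is a shift around a ring of processes, in which every process sends to its right-hand neighbour while receiving from its left-hand neighbour; written with separate blocking sends and receives this pattern can deadlock, but not with MPI_Sendrecv (a sketch; the variable names are illustrative):

    int right = (MyRank + 1) % Numprocs;
    int left  = (MyRank - 1 + Numprocs) % Numprocs;
    int out = MyRank, in;
    MPI_Status status;

    MPI_Sendrecv(&out, 1, MPI_INT, right, 0,
                 &in,  1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);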
MPI_Sendrecv_replace (buf, count, datatype, dest, sendtag, source,
recvtag, comm, status, ierror)
Sends and receives using a single buffer. MPI_Sendrecv_replace performs a combined send and receive in which the received message overwrites the sent one in buf.
MPI_Bsend (buf, count, datatype, dest, tag, comm, ierror)
Basic send with user-specified buffering. MPI_Bsend copies the data into a user-supplied buffer and may then return; the message is delivered from that buffer later.
MPI_Isend (buf, count, datatype, dest, tag, comm, request, ierror)
Begins a nonblocking send. The basic functions in MPI for starting nonblocking communications are MPI_Isend and MPI_Irecv; the "I" stands for "immediate", i.e., they return (more or less) immediately.
MPI_Irecv (buf, count, datatype, source, tag, comm, request,
ierror)
Begins a nonblocking receive. MPI_Irecv is the receive counterpart of MPI_Isend; again the "I" stands for "immediate".
MPI_Wait (request, status, ierror)
Waits for an MPI send or receive to complete. There is a variety of functions that MPI provides to complete nonblocking operations; the simplest of these is MPI_Wait, which can be used to complete any nonblocking operation. The request parameter corresponds to the request parameter returned by MPI_Isend or MPI_Irecv.
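A sketch of the usual pattern: start the communication, perform computation that can overlap it, then wait for completion (the buffer and tag are illustrative):

    MPI_Request request;
    MPI_Status  status;
    int buffer = MyRank;

    if (MyRank == 0) {
        MPI_Isend(&buffer, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... computation that does not modify buffer may overlap here ... */
        MPI_Wait(&request, &status);   /* buffer may be reused after this */
    } else if (MyRank == 1) {
        MPI_Irecv(&buffer, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... computation that does not need the message may overlap here ... */
        MPI_Wait(&request, &status);   /* the message has now arrived */
    }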
MPI_Ssend (buf, count, datatype, dest, tag, comm, ierror)
Performs a synchronous-mode send. MPI_Ssend is one of the synchronous-mode send operations provided by MPI; it does not complete until the matching receive has been initiated.
MPI_Gatherv (sendbuf, sendcount, sendtype, recvbuf, recvcounts,
displs, recvtype, root, comm, ierror)
Gathers into specified locations from all processes in a group. A simple extension to MPI_Gather is MPI_Gatherv, which allows the size of the data being sent by each process to vary.
MPI_Scatterv (sendbuf, sendcounts, displs, sendtype, recvbuf,
recvcount, recvtype, root, comm, ierror)
Scatters a buffer in parts to all processes in a group. A simple extension to MPI_Scatter is MPI_Scatterv, which allows the size of the data being sent to each process to vary.
MPI_Allreduce (sendbuf, recvbuf, count, datatype, op, comm, ierror)
Combines values from all processes and distributes the result back to all processes. Its effect is that of MPI_Reduce followed by a broadcast of the result to every process in comm.
MPI_Alltoall (sendbuf, sendcount, sendtype, recvbuf, recvcount,
recvtype, comm, ierror)
Sends data from all processes to all processes. MPI_Alltoall is a collective communication operation in which every process sends a distinct collection of data to every other process. This is an extension of the gather and scatter operations, also called total exchange.
MPI_Cart_create (comm_old, ndims, dims, periods, reorder, comm_cart,
ierror)
Makes a new communicator to which topology information has been attached. MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndims argument. The user can specify the number of processes in any direction by giving a positive value to the corresponding element of dims.
MPI_Cart_coords (comm, rank, maxdims, coords, ierror)
Determines the coordinates of a process in a Cartesian topology, given its rank in the group. MPI_Cart_coords takes as input a rank in a communicator and returns the coordinates of the process with that rank. MPI_Cart_coords is the inverse of MPI_Cart_rank: it returns the coordinates of the process with rank rank in the Cartesian communicator comm. Note that both of these functions are local.
MPI_Cart_sub (old_comm, remain_dims, new_comm, ierror)
Partitions a communicator into subgroups that form lower-dimensional Cartesian subgrids. MPI_Cart_sub partitions the processes in cart_comm into a collection of disjoint communicators whose union is cart_comm. Both cart_comm and each new_comm have associated Cartesian topologies.
MPI_Cart_rank (comm, coords, rank, ierror)
Determines the rank of a process in the communicator, given its Cartesian location. MPI_Cart_rank returns the rank, in the Cartesian communicator comm, of the process with Cartesian coordinates coords; coords is therefore an array whose length equals the number of dimensions in the Cartesian topology associated with comm.
MPI_Cart_get (comm, maxdims, dims, periods, coords, ierror)
Retrieves the Cartesian topology information associated with a communicator. MPI_Cart_get returns the dimensions and periodicity of the grid as well as the coordinates of the calling process in the communicator.
MPI_Cart_shift (comm, direction, disp, rank_source, rank_dest,
ierror)
Returns the shifted source and destination ranks, given a shift direction and amount. MPI_Cart_shift returns the ranks of the source and destination processes in the arguments rank_source and rank_dest, respectively.
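The Cartesian routines are normally used together. A sketch that builds a periodic two-dimensional grid and finds each process's coordinates and its neighbours along dimension 0 (the grid shape is left to MPI_Dims_create, described below):

    MPI_Comm grid_comm;
    int dims[2] = {0, 0};      /* let MPI choose the grid shape */
    int periods[2] = {1, 1};   /* periodic in both dimensions   */
    int coords[2], grid_rank, source, dest;

    MPI_Dims_create(Numprocs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);  /* my (row, column)          */
    MPI_Cart_shift(grid_comm, 0, 1, &source, &dest);   /* neighbours in dimension 0 */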
MPI_Barrier (comm, ierror)
Blocks until all processes have reached this routine. MPI_Barrier blocks the calling process until all processes in comm have entered the function.
MPI_Dims_create (nnodes, ndims, dims, ierror)
Creates a division of processes in a Cartesian grid. MPI_Dims_create creates a balanced division of processes in a Cartesian grid; it is useful for choosing the dimension sizes of a Cartesian coordinate system.
MPI_Waitall (count, array_of_requests, array_of_statuses, ierror)
integer count, array_of_requests(*), array_of_statuses(MPI_STATUS_SIZE, *), ierror
Waits for all given communications to complete. MPI_Waitall waits for all of the given nonblocking operations to complete; related routines test or wait for all or any of a collection of nonblocking operations.

Using mpicc and mpif77
The compilation and execution details of a parallel program that uses MPI may vary on different parallel computers, but the essential steps are common to all parallel systems, provided we execute one process on each processor. The three important steps are described below.
In non-spmd programming models, each process may execute a different program, depending on the rank of the process; more than one executable (program) is needed in the non-spmd model. The application user writes several distinct programs, which may or may not depend on the rank of the processes. Most of the programs in the hands-on session use the spmd model unless specified otherwise. mpich provides tools that simplify the creation of MPI executables. As mpich programs may require special libraries and compile options, you should use the commands that mpich provides for compiling and linking programs. The mpich implementation provides two commands for compiling and linking C (mpicc) and Fortran (mpif77/mpif90) programs.
The following commands are used for compilation, depending on whether the program is written in C or Fortran (f77/f90). The linking command may include additional libraries; for example, routines from both the MPI library and the math library can be used by adding -lm to the link line. These commands are set up for a specific architecture and mpich device and are located in the directory that contains the MPI libraries. For example, if the architecture is sun4 and the device is ch_p4, these commands may be found in '/usr/local/mpi/bin' (assuming that mpich is installed in /usr/local/mpi).
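Typical invocations look like the following (the program names are illustrative):

    mpicc -o hello_world hello_world.c
    mpif77 -o hello_world hello_world.f
    mpif90 -o hello_world hello_world.f90
    mpicc -o prog prog.c -lm      (links the math library as well as the MPI library)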
Using Makefile
For more control over the process of compiling and linking programs for mpich, you should use a 'Makefile'. You may also use these commands in a Makefile, particularly for programs contained in
a large number of files. In addition, you can also provide a simple
interface to the profiling libraries of MPI in this
Makefile. The MPI profiling interface provides a convenient
way for you to add performance analysis tools to any MPI implementation. The user has to specify the names of the program(s) and the appropriate paths for linking the MPI libraries in the Makefile. To compile and link an MPI program in C or Fortran, you can use the supplied Makefiles (spmd.make for spmd programs and non-spmd.make for non-spmd programs).

Executing a program: Using mpirun
To run an MPI program, use the mpirun command, which is located in '/usr/local/mpi/bin' in our example installation. For almost all systems, you can use a command of the form

    mpirun -np 4 a.out

The argument -np gives the number of processes that will be associated with the MPI_COMM_WORLD communicator, and a.out is the executable file run on all processors. On most parallel computers, the Chameleon device references a 'hosts' file, an example of which comes with the MPI model implementation. It selects an appropriate set of machines to run on, based on data in that file, the time of day, and memory requirements. This approach also allows individual machines and separate executables to be specified directly. Workstation clusters require that each process in a parallel job be
started individually, though programs to help start these processes
exist, and clusters require additional information to make use of them. mpich should be installed with a list of participating workstations in the file 'machines.<arch>'. This file is used by mpirun to choose the processors to run on. You can check your machine list as follows: use the script 'tstmachines' to ensure that you can use all of the machines that you have listed. This script performs an rsh and a short directory listing; this tests both that you have access to the node and that a program in the current directory is visible on the remote node.

Executing a program: Using P4 procgroup files
For even more control over how jobs get started, we need to look at
how mpirun starts a parallel program on a cluster of SMPs. Each time mpirun runs, it constructs and uses a new file of machine names for just that run, using the machines file as input. Writing this procgroup file yourself is necessary when you want closer control over the hosts you run on, or when mpirun cannot construct it automatically.
An example of such a file (hello_world.pg), where the command is issued from host node e01, one node of PARAM 10000, is shown first below. That file specifies four processes, one on each of four nodes (e01, e02, e03, and e04) of a cluster, where the user's account name is pcopp01 and the executable /home/other/pcopp/pcopp01/MPI/hello_world is accessible on every node. Note that the 0 in the first line indicates that no processes other than the one started by the user's command are to be started on host node e01. Here the user is logged into host e01 and wants to start a job with one process on e01 and three other processes on e02, e03, and e04. You might instead want to run four processes on a single node of the machine; you can do this by repeating the node's name in the file, as in the second sketch below.
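Procgroup files consistent with this description might look like the following (sketches in the ch_p4 format of 'host  number-of-processes  executable  login', where the login field is optional):

    local 0
    e02 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01
    e03 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01
    e04 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01

and, to place all four processes on the single node e01:

    local 0
    e01 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01
    e01 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01
    e01 1 /home/other/pcopp/pcopp01/MPI/hello_world pcopp01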
Note that this is for four processes, one of them started by the user directly and the other three specified in this file. See the MPI manual for more information. These particular ch_p4 procgroup files can be used for the spmd model. In a similar way, we can write a ch_p4 procgroup file for the non-spmd model, as in the sketch below, where the master program executes on e01 with rank 0 and the slave programs execute on e02, e03, and e04.
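A sketch of such a file, assuming the master and slave executables are named master and slave in the user's MPI directory:

    local 0
    e02 1 /home/other/pcopp/pcopp01/MPI/slave pcopp01
    e03 1 /home/other/pcopp/pcopp01/MPI/slave pcopp01
    e04 1 /home/other/pcopp/pcopp01/MPI/slave pcopp01

Here the user starts the master executable on e01, and it becomes the process with rank 0.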
First, the application user must compile and link a given MPI program in C or Fortran using the make script file. Next, set up the procgroup file (hello_world.pg) explained above, and then type the command to execute; with the ch_p4 device this is typically of the form

    hello_world -p4pg hello_world.pg

Note: For all the spmd programs in the hands-on session, the executable file run is used in place of hello_world, and the executable file master is used for non-spmd programs. Similarly, the file run.pg is used in place of hello_world.pg for spmd programs, and master.pg is used for non-spmd programs.

1.1.7 Compilation, Linking and Execution of MPI programs Using mpirun
The mpich implementation provides two commands for compiling and linking C (mpicc) and Fortran (mpif77/mpif90) programs. You may also use mpicc/mpif77 together with your own mpirun, as discussed above.

1.1.8 Example Program
Simple MPI C program "hello_world.c"
The first parallel C program is the "hello_world" program, which simply prints the message "Hello World". Each process sends a message consisting of characters to another process. If there are p processes executing a program, they have ranks 0, 1, ..., p-1. In this example, the processes with ranks 1, 2, ..., p-1 send a message to the process with rank 0, which we call the Root. The Root process receives the messages from the processes with ranks 1, 2, ..., p-1 and prints them out.

The program is explained below. The first few lines of the code give variable definitions and constants. These declarations are followed by MPI library calls for initialization of the MPI environment and for the MPI communication associated with a communicator; the communicator describes the communication context and an associated group of processes. The call MPI_Comm_size returns in Numprocs the number of processes that the user has started for this program. Each process finds out its rank in the group associated with the communicator by calling MPI_Comm_rank. The program begins:

    #include <stdio.h>
    #include "mpi.h"
    #define BUFLEN 512

    int main(int argc, char *argv[])

MyRank is the rank of a process and Numprocs is the number of processes in the communicator MPI_COMM_WORLD.

    /*....MPI initialization.... */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
    MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);

Now, each process with MyRank not equal to Root sends a message to Root, i.e., to the process with rank 0, and the process with rank Root receives a message from every other process and prints it:

    if (MyRank != 0) {
        ...
    }

    /* ....Finalizing the MPI....*/
    MPI_Finalize();

Simple MPI f77 program "hello_world.f"

    program main
    include "mpif.h"
    integer MyRank, Numprocs
    data Send_Buffer/'Hello World'/
C.........MPI initialization....
    Root = 0
    if(MyRank .ne. Root) then

Simple MPI f90 program "hello_world.f90"

    program main
    integer :: MyRank, Numprocs
    data Send_Buffer/'Hello World'/
    !....MPI initialization....
    call MPI_Init(ierror)
    Root = 0
    if(MyRank .NE. Root) then
    stop

The basic MPI library calls used in the above codes suffice to write a vast number of efficient programs. The above program is an example of what is frequently called single program multiple data (spmd) programming.
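For reference, a complete C version consistent with the fragments and the description above might look like the following sketch (the tag value, buffer handling, and message format are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"
    #define BUFLEN 512

    int main(int argc, char *argv[])
    {
        int MyRank, Numprocs, Root = 0, iproc, tag = 0;
        char Send_Buffer[BUFLEN] = "Hello World";
        char Recv_Buffer[BUFLEN];
        MPI_Status status;

        /*....MPI initialization.... */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
        MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);

        if (MyRank != Root) {
            /* every non-root process sends its message to the Root */
            MPI_Send(Send_Buffer, (int)strlen(Send_Buffer) + 1, MPI_CHAR,
                     Root, tag, MPI_COMM_WORLD);
        } else {
            /* the Root receives from ranks 1 .. Numprocs-1 and prints */
            for (iproc = 1; iproc < Numprocs; iproc++) {
                MPI_Recv(Recv_Buffer, BUFLEN, MPI_CHAR, iproc, tag,
                         MPI_COMM_WORLD, &status);
                printf("%s from process %d\n", Recv_Buffer, status.MPI_SOURCE);
            }
        }

        /* ....Finalizing the MPI....*/
        MPI_Finalize();
        return 0;
    }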
1.1.9 List of Extended Tools available in MPI
2. Collective operations
3. Topologies
1.1.10 MPI Information on the Web
A large number of resources are available on the Internet; the following are pointers to a few of them.

Implementations of MPI
Upshot is bundled with this implementation. It is also available as
a separate package.
MPI-2 and MPI-IO
There is a Usenet newsgroup devoted to MPI.

The MPI FAQ
The MPI FAQ (frequently asked questions) list is also available on the Web.

1.1.11 Reference Books on MPI and Parallel Computing
[1] William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, The MIT Press, Cambridge, Massachusetts, 1994.
[2] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings Publishing Company, Redwood City, CA, 1994.
[3] William Gropp and Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh, 1996.
[4] Ian T. Foster, Designing and Building Parallel Programs, Addison-Wesley, Reading, MA, 1995.
[5] G. H. Golub and Charles F. Van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press.
[6] James M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Frontiers of Computer Science series, Plenum Press.
[7] Michael J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill.
[8] Kai Hwang and Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill.
[9] Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, San Francisco, 1997.