Table of Contents
-
Overview
-
What is Parallelism?
-
Sequential Programming
-
The Need for Faster Machines
-
Parallel Computing
-
Parallel Programming Overview
-
Architecture Taxonomy
-
SISD Model
-
SIMD Model
-
MIMD Model
-
Processor Communication
-
Shared Memory
-
Distributed Memory
-
Memory Hierarchies
-
Communications Network on the SP2
-
Parallel Programming Paradigms
-
Various Methods
-
Message Passing
-
Data Parallel
-
Implementations
- Message Passing
- Data Parallel
-
Paradigm Comparisons
-
Maturity
-
Ease of programming
-
Flexibility
-
Efficiency
-
Scalability
-
Portability
-
I/O
-
Cost
-
Steps for Creating a Parallel Program
-
Communication
-
Point to Point
-
One to All Broadcast
-
All to All Broadcast
-
One to All Personalized
-
All to all Personalized
-
Shifts
-
Collective Computation
-
Design and Performance Considerations
-
Amdahl's Law
-
Load Balancing
-
Granularity
-
Data Dependency
-
Communication Patterns and Bandwidth
-
I/O Patterns
-
Machine Configuration
-
Fault Tolerance and Restarting
-
Deadlock
-
Debugging
-
Performance Monitoring and Analysis
-
Parallel Examples
-
Essentials of Loop Parallelism
-
Calculating PI
-
Serial Problem Description
-
Parallel Solution
-
Calculating Array Elements
-
Serial Problem Description
-
Data Parallelism
-
Pool of Tasks
-
Load Balancing and Granularity
-
Simple Heat Equation
-
Serial Problem Description
-
Data Parallelism
-
Overlapping Communication and Computation
-
Application Case Study
-
References, Acknowledgements, WWW Resources
Overview
What is Parallelism?
A strategy for performing large, complex tasks faster.
A large task can either be performed serially, one step following another,
or can be decomposed into smaller tasks to be performed
simultaneously, i.e., in parallel.
Parallelism is done by:
- Breaking up the task into smaller tasks
- Assigning the smaller tasks to multiple workers to work on simultaneously
- Coordinating the workers
- Not breaking up the task so small that it takes longer to tell the
worker what to do than it does to do it
Parallel problem solving is common. Examples: building construction;
operating a large organization; automobile manufacturing plant
The automobile analogy.
Overview
Sequential Programming
Traditionally, programs have been written for serial computers:
- One instruction executed at a time
- Using one processor
- Processing speed dependent on how fast data can move through hardware
- Speed of Light = 30 cm/nanosecond
- Limits of Copper Wire = 9 cm/nanosecond
- Fastest machines execute approximately 1 instruction in 9-12 billionths
of a second
Overview
The Need for Faster Machines
You might think that one instruction executed in 9 billionths of a second would
be fast enough. You'd be wrong.
There are several classes of problems that require faster processing:
- Simulation and Modeling problems:
- Based on successive approximations
- More calculations, more precise
- Problems dependent on computations / manipulations of large amounts
of data
- Image and Signal Processing
- Entertainment (Image Rendering)
- Database and Data Mining
- Seismic
Grand Challenge Problems:
- Climate Modeling
- Fluid Turbulence
- Pollution Dispersion
- Human Genome
- Ocean Circulation
- Quantum Chromodynamics
- Semiconductor Modeling
- Superconductor Modeling
- Combustion Systems
- Vision & Cognition
Overview
Parallel Computing
Traditional Supercomputers
Technology
- Single processors were created to be as fast as possible.
- Peak performance was achieved with good memory bandwidth.
Benefits
- Supports sequential programming (Which many people understand)
- 30+ years of compiler and tool development
- I/O is relatively simple
Limitations
- Single high performance processors are extremely expensive
- Significant cooling requirements
- Single processor performance is reaching its asymptotic limit
Parallel Supercomputers
Technology
- Applying many smaller cost efficient processors to work on a part of
the same task
- Capitalizing on work done in the microprocessor and networking markets
Benefits
- Ability to achieve performance and work on problems impossible with
traditional computers.
- Exploit "off the shelf" processors, memory, disks and tape systems.
- Ability to scale to problem.
- Ability to quickly integrate new elements into systems thus capitalizing
on improvements made by other markets.
- Commonly much cheaper.
Limitations
- New technology. Programmers need to learn parallel programming approaches.
- Standard sequential codes will not "just run".
- Compilers and tools are often not mature.
- I/O is not as well understood yet.
Overview
Parallel Computing (cont)
Parallel computing requires:
- Multiple processors
(The workers)
- Network
(Link between workers)
- Environment to create and manage parallel processing
- Operating System
(Administrator of the system that knows how to
handle multiple workers)
- Parallel Programming Paradigm
- Message Passing
- Data Parallel
- Fortran 90 / High Performance Fortran
- A parallel algorithm and a parallel program
(The decomposition of the problem into pieces that
multiple workers can perform)
Overview
Parallel Programming
- Parallel programming involves:
- Decomposing an algorithm or data into parts
- Distributing the parts as tasks which are worked on by multiple
processors simultaneously
- Coordinating work and communications of those processors
- Parallel programming considerations:
- Type of parallel architecture being used
- Type of processor communications used
Architecture Taxonomy
- All parallel computers use multiple processors
- There are several different methods used to classify computers
- No single taxonomy fits all designs
- Flynn's taxonomy uses the relationship of program
instructions to program data. The four categories are:
Architecture Taxonomy
SISD Model
Single Instruction, Single Data Stream
- Not a parallel computer
- Conventional serial, scalar von Neumann computer
- One instruction stream
- A single instruction is issued each clock cycle
- Each instruction operates on a single (scalar) data element
- Limited by the number of instructions that can be issued in a given unit
of time
- Performance frequently measured in MIPS (million of instructions per
second)
- Most non-supercomputers
Automobile analogy
Single Instruction, Multiple Data Stream
- Also von Neumann architectures but more powerful instructions
- Each instruction may operate on more than one data element
- Usually intermediate host executes program logic and broadcasts
instructions to other processors
- Synchronous (lockstep)
- Rating how fast these machines can issue instructions
is not a good measure of their performance
- Performance is measured in MFLOPS (millions of floating point
operations per second)
- Two major types:
- Vector SIMD
- Parallel SIMD
Automobile analogy
Architecture Taxonomy
SIMD Model
Vector SIMD
Single instruction results in multiple operands being updated
- Scalar processing operates on single data elements. Vector processing
operates on whole vectors (groups) of data at a time.
- Examples:
- Cray 1
- NEC SX-2
- Fujitsu VP
- Hitachi S820
Single processor of:
- Cray C 90
- Cray2
- NEC SX-3
- Fujitsu VP 2000
- Convex C-2
Architecture Taxonomy
SIMD Model
Parallel SIMD
- Processor arrays - single instruction is issued and all processors
execute the same instruction, operating on different sets of data.
- Processors run in a synchronous, lockstep fashion
- Advantages
- Disadvantages
- Decisions within DO loops can result in poor execution by requiring
all processes to perform the operation controlled by decision
whether results are used or not
- Examples:
- Connection Machine CM-2
- Maspar MP-1, MP-2
Architecture Taxonomy
MIMD Model
Multiple Instructions, Multiple Data
Automobile analogy
- In order to coordinate tasks of multiple nodes working on the same
problem, some form of inter-processor communications is required to:
- Convey information and data between processors
- Synchronize node activities
- The way processors communicate is dependent upon memory architecture,
which, in turn, will affect how you write your parallel program
- The three primary memory architectures are:
- Shared Memory
- Distributed Memory
- Memory Hierarchies
Processor Communications Shared Memory
- Multiple processors operate independently but share the same memory
resources
- Only one processor can access the shared memory location at a time
- Synchronization achieved by controlling tasks' reading from and
writing to the shared memory
- Advantages
- Easy for user to use efficiently
- Data sharing among tasks is fast (speed of memory access)
- Disadvantages
- Memory is bandwidth limited. Increase of processors without
increase of bandwidth can cause severe bottlenecks
- User is responsible for specifying synchronization, e.g., locks
- Examples:
- Cray Y-MP
- Convex C-2
- Cray C-90
Processor Communications
Distributed Memory
- Multiple processors operate independently but
each has its own private memory
- Data is shared across a communications network using message passing
- User responsible for synchronization using message passing
- Advantages
- Memory scalable to number of processors. Increase number of
processors, size of memory and bandwidth increases.
- Each processor can rapidly access its own memory without interference
-
Disadvantages
- Difficult to map existing data structures to this memory organization
- User responsible for sending and receiving data among processors
- To minimize overhead and latency, data should be blocked up in
large chunks and shipped before receiving node needs it
- Examples:
- nCUBE Hypercube
- Intel Hypercube
- TMC CM-5
- IBM SP1, SP2
- Intel Paragon
Processor Communications
Memory Hierarchies
- Combination of shared memory within a group of processors and several
groups communicating through a larger memory.
- Similar to register <= cache memory <= main memory hierarchy.
- Likely to be design of the future with several processors and their
local memory surrounding a larger shared memory on a single board.
- Advantages
- Small fast memory can be used for supplying data to processors and
large slower memory can be used for a backfill to the smaller
memories.
- Disadvantages
- Possible inefficiency in data movement. Movement between memories
usually organized in pages. Inefficient if only small part of data in
page is used.
- Data arrays can be spread across several different memories slowing
access of non-adjacent elements.
- Performance may be poor unless user is concerned about memory
accessing patterns and restructures data storage to effectively
use architecture.
- Examples
- Cache systems
- Sequent
- IBM Power 4
- KSR 1
- Multi-level Memory Hierarchies
- Distributed memory
- SP2 Node Connectivity
- Nodes are connected to each other by a native high performance switch
- Nodes are connected to the network (and, hence, to each other)
via the ethernet
- Type of Communication
- ip - Internet Protocol (TCP/IP)
- runs over ethernet or
- runs over the switch (depending on environment setup)
- us - User Space
There are many methods of programming parallel computers. Two of the most
common are message passing and data parallel.
- Message Passing - the user makes calls to libraries to explicitly
share information between processors.
- Data Parallel - data partitioning determines parallelism
- Shared Memory - multiple processes sharing common memory space
- Remote Memory Operation - set of processes in which a process can
access the memory of another process without its participation
- Threads - a single process having multiple (concurrent)
execution paths
- Combined Models - composed of two or more of the above.
Note: these models are machine/architecture
independent, any of the models can be implemented on
any hardware given appropriate operating system
support.
An effective implementation is one which closely matches its
target hardware and provides the user ease in programming.
The message passing model is defined as:
- set of processes using only local memory
- processes communicate by sending and receiving
messages
- data transfer requires
cooperative operations to be performed by each
process (a send operation must have a matching
receive)
Programming with message passing is done by linking with and
making calls to libraries which manage the data exchange
between processors. Message passing libraries are available for most
modern programming languages.
The data parallel model is defined as:
- Each process works on a different part of the same data structure
- Global name space
- Commonly a Single Program Multiple Data (SPMD) approach
- Data is distributed across processors
- All message passing is done invisibly to the programmer
- Commonly built "on top of" one of the common message passing libraries
Programming with data parallel model is accomplished by
writing a program with data parallel constructs and compiling it with
a data parallel compiler.
The compiler converts the program into standard
code and calls to a message passing library to distribute the data to all
the processes.
- Message Passing
- MPI - Message Passing Interface
- PVM - Parallel Virtual Machine
- MPL - Message Passing Library
- Data Parallel
- Fortran 90 / High Performance Fortran
- Message Passing Interface often called MPI.
- A standard portable message-passing library definition developed in
1993 by a group of parallel computer vendors, software writers, and
application scientists.
- Available to both Fortran and C programs.
- Available on a wide variety of parallel machines.
- Target platform is a distributed memory system such as the SP.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for
parallelism the program and implementing the MPI constructs.
- Programming model is SPMD
- Parallel Virtual Machine, often called PVM,
enables a collection of different
computer systems to be viewed as a single parallel machine.
- Originally developed in 1989 as a research tool to explore heterogenous
network computing by Oak Ridge National Laboratory. Now available as
a public domain software package.
- A software tool used to create and execute concurrent or parallel
applications.
- Operates on a collection of heterogenous Unix computers connected by one
or more networks.
- All communication is accomplished by message passing.
- Comprised of two main components
- the PVM daemon process which runs on each processor.
- library interface routines provide processor control and
message passing.
- Message Passing Library often called MPL.
- Part of IBM's SP Parallel Environment.
- IBM's proprietary message passing routines.
- Designed to provide a simple and efficient set of well understood
operations for coordination and communication among processors in a
parallel application.
- Target platform is a distributed memory system such as
the SP. Has also been ported to RS/6000 clusters.
- Fortran 90 (F90) - (ISO / ANSI standard extensions to Fortran 77).
- High Performance Fortran (HPF) - extensions to F90 to support data parallel
programming.
- Compiler directives allow programmer specification of data distribution
and alignment.
- New compiler constructs and intrinsics allow the programmer to do
computations and manipulations on data with different distributions.
Before choosing a parallel programming paradigm and a particular
implementation there are many issues to be considered. Some of the
most important are:
- Maturity
- Ease of Programming
- Flexibility
- Efficiency
- Scalability
- Portability
- I/O
- Cost
The maturity of the compiler or message passing library is a major concern
when choosing a paradigm and a particular implementation. Developers
creating research codes may wish ease in development and a variety of
functionality. Developers creating production codes might be most concerned
with the level of support or lack of bugs in the tools.
- Message Passing - Message Passing Libraries are very mature.
- Research and Industry have been working with message passing for
many years.
- The functionality of these libraries is relatively simple and easy
to implement.
- These libraries are critical to all distributed memory parallel
computing. Vendors must provide a robust set of libraries to make
there products competitive.
- Data Parallel - Data Parallel compilers are relatively immature.
- Compilers commonly don't support all features of the language.
- Compilers and runtime systems are often buggy.
- It can be expected though that the maturity of these products should
increase quickly. Manufactureres and 3rd party vendors
are working to develop their technology.
- Compilers are very complex and difficult to develop.
- Only Fortran has a standard (F90 / HPF).
Ability to develop robust software quickly and efficiently
is a major concern to industry. Much of the software effort
is spent in maintenance. The paradigm must make it easy to implement
algorithms and maintain the resulting code.
- Message Passing
- The development of anything but simple programs is difficult. Often
compared to programming in assembler language.
- The programmer must explicitly implement a data distribution scheme
and all interprocess communication.
- It is the programmers responsibility to resolve data dependencies
and avoid deadlock and race conditions.
- Data parallel
- The main feature of the data parallel programming is its relative ease
when compared to message passing.
- Data distribution is simple, achieved by compiler directives.
- All interprocess communication and synchroniziation is
invisible to the developer.
Several approaches to making parallel code.
Developers choose an approach that:
- fits architecture
- easy to use and maintain
- provides needed performance
They choose from:
- Functional parallelism - different tasks done at the same time.
- Master-Slave parallelism - one process assigns subtask to other
processes.
- SPMD parallelism - Single Program, Multiple Data - same code
replicated to each process.
Not all paradigms support all approaches.
- Message Passing supports all programming approaches.
- Data Parallel only supports SPMD programming
As in all of computing the further you get from the systems level the less
efficient you get. Over the years serial compilers have gotten very efficient.
- Message Passing
- Can often come close to the actual performance levels of the
architecture.
- Since the user is required to explicitly implement a data
distribution scheme and all interprocess communication,
the performance depends on the ability of the developer.
- Data Parallel
- Data Parallel compilers and runtime systems handle all interprocess
communication and synchronization. Performance depends on how well
they can do this.
- The relative immaturity of these compilers usually means the may not
produce optimal code in many situations.
Both Data Parallel and Message passing are solutions for scalable parallel
programming.
- Message Passing - The user must be sure to develop the software so it
will scale easy.
- Data Parallel - Data parallel scales automatically. It provides features
like data padding to ensure a program will run on any size platform.
The trick is to match the program to the right size platform.
To ensure that the software you develop is portable to other architectures be
sure to choose a standard.
- Message Passing - MPI is quickly becoming the standard. PVM over the
years had become the defacto standard.
- Data Parallel - HPF is the only data parallel standard. It is based on
Fortran 90
One area of Parallel Programming that is still very immature is Input / Output
(I/O) support. I/O support is dependent not on a particular paradigm as
much as a particular implementation.
It is important to understand your I/O requirements and choose a message
passing library or data parallel compiler which supports your needs.
Cost is always an important factor when choosing any software tool.
- Message Passing Libraries are generally free.
- Data Parallel compilers generally are not. In fact the development of
these compilers is a very competitive business.
- If you are starting with an existing serial program, debug the serial code
completely
- Identify the parts of the program that can be executed concurrently:
- Requires a thorough understanding of the algorithm
- Exploit any inherent parallelism which may exist.
- May require restructuring of the program and/or algorithm.
May require an entirely new algorithm.
- Decompose the program:
- Functional Parallelism
- Data Parallelism
- Combination of both
- Code development
- Code may be influenced/determined by machine architecture
- Choose a programming paradigm
- Determine communication
- Add code to accomplish task control and communications
- Compile, Test, Debug
- Optimization
- Measure Performance
- Locate Problem Areas
- Improve them
There are three methods for decomposing a problem into smaller tasks to be
performed in parallel: Functional Decomposition, Domain Decomposition,
or a combination of both
- Functional Decomposition (Functional Parallelism)
- Decomposing the problem into different tasks which can be
distributed to multiple processors for simultaneous execution
- Good to use when there is not static structure or fixed determination
of number of calculations to be performed
- Domain Decomposition (Data Parallelism)
- Partitioning the problem's data domain and distributing portions to
multiple processors for simultaneous execution
- Good to use for problems where:
- data is static (factoring and solving large matrix or finite
difference calculations)
- dynamic data structure tied to single entity where entity can be
subsetted (large multi-body problems)
- domain is fixed but computation within various regions of the
domain is dynamic (fluid vortices models)
-
There are many ways to decompose data into partitions to be distributed:
- One Dimensional Data Distribution
- Block Distribution
- Cyclic Distribution
- Two Dimensional Data Distribution
- Block Block Distribution
- Block Cyclic Distribution
- Cyclic Block Distribution
Understanding the interprocessor communications of your program is essential.
- Message Passing communication is programed explicitly. The programmer
must understand and code the communication.
- Data Parallel compilers and run-time systems do all
communications behind the scenes. The programmer need not understand
the underlying communications. On the other hand to get good
performance from your code you should write your algorithm with the
best communication possible.
The types of communications for message passing and data parallel are exactly
the same. In fact most data parallel compilers simply use one of the
standard message passing libraries to achieve data movement.
Communications on distributed memory computers:
- Point to Point
- One to All Broadcast
- All to All Broadcast
- One to All Personalized
- All to All Personalized
- Shifts
- Collective Computation
The most basic method of communication between two processors is the
point to point message. The originating processor "sends" the message
to the destination processor. The destination processor then "receives"
the message.
The message commonly includes the information, the length of the message, the
destination address and possibly a tag.
Typical message passing libraries subdivide the basic sends
and receives into two types:
- blocking - processing waits until message is transmitted
- nonblocking - processing continues even if message hasn't been
transmitted yet
A node may have information which all the others require.
A broadcast
is a message sent to many other nodes.
A One to All broadcast occurs when one processor sends the same information
to many other nodes.
With an All to All
broadcast each processor sends its unique information to all the
other processors.
Personalized communication send a unique message to each processor.
In One to All personalized communication one processor sends a unique
message to every other processor.
In All to All Personalized communication each processor sends a unique message
to all other processors.
Shifts are permutations of information. Information is exchanged in one
logical direction or the other. Each processor exchanges the same amount of
information with its neighbor processor.
There are two types of shifts:
- Circular - Each processor exchanges information with its logical neighbor.
When there is no longer a neighbor due to an edge of data the shift
"wraps around" and takes the information from the opposite edge.
- End Off Shift - When an edge occurs, the processor is padded with zero or
a user defined value.
In collective computation (reductions), one member of the group
collects data from the other members. Commonly a mathematical
operation like a min, max, add, multiple etc. is performed.
Design and Performance Considerations
Amdahl's Law
- Amdahl's Law states that potential program
speedup is defined by the fraction of code (f) which can be parallelized:
1
speedup = --------
1 - f
- If none of the code can be parallelized, f = 0 and the speedup = 1 (no
speedup). If all of the code is parallelized, f = 1 and the speedup is
infinite (in theory).
- Introducing the number of processors performing the parallel fraction of
work, the relationship can be modeled by:
1
speedup = ------------
P + S
---
N
where P = parallel fraction, N = number of processors and S = serial
fraction.
- It soon becomes obvious that there are limits to the scalability of
parallelism. For example, at P = .50, .90 and .99 (50%, 90% and 99% of
the code is parallelizable):
speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02
- However, certain problems demonstrate increased performance by increasing
the problem size. For example:
2D Grid Calculations 85 seconds 85%
Serial fraction 15 seconds 15%
We can increase the problem size by halving both the grid points and
the time step, which is directly proportional to the grid spacing.
This results in four times the number of grid points (factor of two in
each direction) and twice the number of time steps. The timings then
look like:
2D Grid Calculations 680 seconds 97.84%
Serial fraction 15 seconds 2.16%
- Problems which increase the percentage of parallel time with their size
are more "scalable" than problems with a fixed percentage of parallel
time.
- In order to coordinate between different processors working on the same
problem, some form of communication between them is required
- The ratio between computation and communication is known as granularity.
- Fine-grain parallelism
- All tasks execute a small number of instructions between
communication cycles
- Facilitates load balancing
- Low computation to communication ratio
- Implies high communication overhead and less opportunity for
performance enhancement
- If granularity is too fine it is possible that the overhead
required for communications and synchronization between tasks
takes longer than the computation.
- Coarse-grain parallelism
- Typified by long computations consisting of large numbers of
instructions between communication synchronization points
- High computation to communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently
- The most efficient granularity is dependent on the algorithm and the
hardware environment in which it runs
- In most cases overhead associated with communications and
synchronization is high relative to execution speed
so it is advantageous to have coarse granularity.
- A data dependency exists when there is multiple use of the same storage
location
- Importance of dependencies: frequently inhibit parallel execution
- Example 1:
DO 500 J = MYSTART,MYEND
A(J) = A(J-1) * 2.0
500 CONTINUE
If Task 2 has A(J) and Task 1 has A(J-1), the value of A(J) is
dependent on:
- Example 2:
task 1 task 2
------ ------
X = 2 X = 4
. .
. .
Y = X**2 Y = X**3
The value of Y is dependent on:
- Distributed memory
If and/or when the value of X is communicated between the tasks.
- Shared memory
Which task last stores the value of X.
- Types of data dependencies
- Flow Dependent: Task 2 uses a variable computed by Task 1.
Task 1 must store/send the variable before Task 2 fetches
- Output Dependent: Task 1 and Task 2 both compute the same variable
and Task 2's value must be stored/sent after Task 1's
- Control Dependent: Task 2's execution depends upon a conditional
statement in Task 1. Task 1 must complete before a decision can
be made about executing Task 2.
- How to handle data dependencies?
- I/O operations are generally regarded as inhibitors to parallelism
- Parallel I/O systems are as yet, largely undefined and not available
- In an environment where all processors see the same filespace, write
operations will result in file overwriting
- Read operations will be affected by the fileserver's ability to handle
multiple read requests at the same time
- I/O which must be conducted over the network (non-local) can cause
severe bottlenecks
- Some options:
- Reduce overall I/O as much as possible
- Confine I/O to specific serial portions of the job
- For example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform
write operation after receiving required data from all other tasks.
- Create unique filenames for each tasks' input/output file(s)
- For distributed memory systems with shared filespace, perform I/O in
local, non-shared filespace
- For example, each processor may have /tmp filespace which can used. This is usually much more efficient than performing I/O over the
network to one's home directory.
- To approach optimum parallel performance, tailor your algorithms to the
architecture and configuration of your system
- To this end you'll need to know environmental factors such as:
- Optimum number of processor nodes
- Memory size
- Cache considerations
- System load
- In parallel programming, it is usually the programmer's responsibility to
handle events such as:
- machine failures
- task failures
- checkpointing
- restarting
- Deadlock describes a condition where two or more processes are waiting
for an event or communication from one of the other processes.
- The simplest example is demonstrated by two processes which are both
programmed to read/receive from the other before writing/sending.
- Example
TASK1 TASK2
------------------ ------------------
X = 4 Y = 8
SOURCE = TASK2 SOURCE = TASK1
RECEIVE (SOURCE,Y) RECEIVE (SOURCE,X)
DEST = TASK2 DEST = TASK1
SEND (DEST, X) SEND (DEST, Y)
Z = X + Y Z = X + Y
- Debugging parallel programs is significantly more of a challenge than
debugging serial programs
- Parallel debuggers are beginning to become available, but much work
remains to be done
- Use a modular approach to program development
- Pay as close attention to communication details as to computation details
- As with debugging, monitoring and analyzing parallel program execution
is significantly more of a challenge than for serial programs
- Parallel tools for execution monitoring and program analysis are
beginning to become available
- Some are quite useful
- Work remains to be done, particularly in the area of scalability.
Some concrete problems will help illustrate the methods of parallel
programming and the design and performance issues involved.
Each of the problems has a loop construct that forms the main
computational component of the code. Loops are a main target
for parallelizing and vectorizing code. A program often spends
much of its time in loops. When it can be done, parallelizing
these sections of code can have dramatic benefits.
A step-wise refinement procedure for developing the parallel algorithms
will be employed. An initial solution for each problem will be presented
and improved by considering performance issues.
Pseudo-code will be used to describe the solutions. The solutions will
address the following issues:
- identification of parallelism
- program decomposition
- load balancing (static vs. dynamic)
- task granularity in the case of dynamic load balancing
- communication patterns - overlapping communication and computation
Note the difference in approaches between message passing and data parallel
programming. Message passing explicitly parallelizes the loops where
data parallel replaces loops by working on entire arrays in parallel.
- Embarrassingly parallel.
- Computationally intensive.
- Minimal communication
- The value of PI can be calculated in a number of ways, many of which
are easily parallelized
- Consider the following method of approximating PI
- Inscribe a circle in a square
- Randomly generate points in the square
- Determine the number of points in the square that are also in the circle
- Let r be the number of points in the circle divided by the number of
points in the square
- PI ~ 4 r
- Note that the more points generated, the better the approximation
- Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1,npoints
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
- Note that most of the time in running this program would be
spent executing the loop
Message passing solution:
- Parallel strategy: break the loop into portions which can be
executed by the processors.
- For the task of approximating PI:
- each processor executes its portion of the loop a number of
times.
- each processor can do its work without requiring any information
from the other processors (there are no data dependencies). This
situation is known as Embarassingly Parallel.
- uses SPMD model. One process acts as master and collects
the results.
- Message passing pseudo code:
npoints = 10000
circle_count = 0
p = number of processors
num = npoints/p
find out if I am master or worker
do j = 1,num
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
if I am master
receive from workers their circle_counts
compute PI (use master and workers calculations)
else if I am worker
send to master circle_count
endif
Data parallel solution:
- This example shows calculations on array elements that require
very little communication.
- Elements of 2-dimensional array are calculated.
- The calculation of elements is independent of one another -
leads to embarassingly parallel situation.
- The problem should be computationally intensive.
- Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
- The serial program calculates one element at a time in the specified
order.
Message Passing
- Arrays are distributed so that each processor owns a portion of an array.
- Independent calculation of array elements insures no
communication amongst processors is needed.
- Distribution scheme is chosen by other criteria, e.g. unit stride
through arrays.
- Desirable to have unit stride through arrays, then the
choice of a distribution scheme depends on the programming language.
- Fortran: block cyclic distribution
- C: cyclic block distribution
- After the array is distributed, each processor
executes the portion of the loop corresponding to the data it owns.
- Notice only the loop variables are different from the serial
solution.
- For example, with Fortran and a block cyclic distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
- Message Passing Solution:
- With Fortran storage scheme, perform block cyclic
distribution of array.
- Implement as SPMD model.
- Master process initializes array, sends info to worker
processes and receives results.
- Worker process receives info, performs its share of
computation and sends results to master.
- Pseudo code solution:
find out if I am master or worker
if I am master
initialize the array
send each worker info on part of array it owns
send each worker its portion of initial array
receive from each worker results
else if I am worker
receive from master info on part of array I own
receive from master my portion of initial array
# calculate my portion of array
do j = my first column,my last column
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
send master results
endif
Data Parallel
- A trivial problem for a data parallel language.
- Data parallel languages often have compiler directives to do
data distribution.
- Loops are replaced by a "for all elements" construct which performs
the operation in parallel.
- Good example of ease in programming versus message passing.
- Pseudo code solution:
DISTRIBUTE a (block, cyclic)
for all elements (i,j)
a(i,j) = fcn (i,j)
- We've looked at problems that are static load balanced.
- each processor has fixed amount of work to do
- may be significant idle time for faster or more lightly loaded
processors.
- Usually is not a major concern with dedicated usage. i.e.
loadleveler.
- If you have a load balance problem, you can use a "pool of tasks"
scheme. This solution only available in message passing.
- Two processes are employed
- Master Process:
- holds pool of tasks for worker processes to do
- sends worker a task when requested
- collects results from workers
- Worker Process: repeatedly does the following
- gets task from master process
- performs computation
- sends results to master
- Worker processes do not know before runtime which portion of array
they will handle or how many tasks they will perform.
- The fastest process will get more tasks to do.
Dynamic load balancing occurs at run time.
- Solution:
- Calculate an array element
- Worker process gets task from master, performs work, sends
results to master, and gets next task
- Pseudo code solution:
find out if I am master or worker
if I am master
do until no more jobs
send to worker next job
receive results from worker
end do
tell workers no more jobs
else if I am worker
do until no more jobs
receive from master next job
calculate array element: a(i,j) = fcn(i,j)
send results to master
end do
endif
- Static load balancing can result in significant idle time for
faster processors.
- Pool of tasks offers a potential solution - the faster processors
do more work.
- In the pool of tasks solution, the workers calculated array elements,
resulting in
- optimal load balancing: all processors complete work at the
same time
- fine granularity: small unit of computation, master and worker
communicate after every element
- fine granularity may cause very high communications cost
- Alternate Parallel Solution:
- give processors more work - columns or rows rather than elements
- more computation and less communication results in larger
granularity
- reduced communication may improve performance
- Most problems in parallel computing require communication among
the processors.
- Common problem requires communication with "neighbor" processor.
- The heat equation describes the temperature change over time,
given initial temperature distribution and boundary conditions.
- A finite differencing scheme is employed to solve the
heat equation numerically on a square region.
- The initial temperature is zero on
the boundaries and high in the middle.
- The boundary temperature is held at zero.
- For the fully explicit problem, a time stepping algorithm is used.
The elements of a 2-dimensional array represent the temperature at
points on the square.
- The calculation of an element is dependent on neighbor element values.
- A serial program would contain code like:
do iy = 2, ny - 1
do ix = 2, nx - 1
u2(ix, iy) =
u1(ix, iy) +
cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
end do
end do
- Arrays are so that each
processor owns a portion of the arrays.
- Determine data dependencies
- interior elements belonging to a
processor are independent of other processors'
- border elements are dependent upon
a neighbor processor's data, communication is required.
Message Passing
- First Parallel Solution:
Data Parallel
Loops are not used. The entire array is processed in parallel.
The distribute statements layout the data in parallel.
A SHIFT is used to increment or decrement an array element.
DISTRIBUTE u1 (block,cyclic)
DISTRIBUTE u2 (block,cyclic)
u2 = u1 +
cx * (SHIFT (u1,1,dim 1) + SHIFT (u1,-1,dim 1) - 2.*u1) +
cy * (SHIFT (u1,1,dim 2) + SHIFT (u1,-1,dim 2) - 2.*u1)
References and Acknowledgements
- "IBM AIX Parallel Environment Application Development, Release 1.0", IBM
Corporation.
- Carriero, Nicholas and Gelernter, David, "How to Write Parallel Programs -
A First Course". MIT Press, Cambridge, Massachusetts.
- Dowd, Kevin, High Performance Computing", O'Reilly & Associated, Inc.,
Sebastopol, California.
- Hockney, R.W. and Jesshope, C.R., "Parallel Computers 2",Hilger, Bristol
and Philadelphia.
- Ragsdale, Susan, ed., "Parallel Programming", McGraw-Hill, Inc., New York.
- Chandy, K. Mani and Taylor, Stephen, "An Introduction to Parallel
Programming", Jones and Bartlett, Boston
- We gratefully acknowledge John Levesque of Applied Parallel Research for
the use of his "A Parallel Programming Workshop" presentation materials.
- We also gratefully acknowledge the Cornell Theory Center, Ithaca, New
York for the use of portions of their "Parallel Processing" and "Scalable
Processing" presentation materials.
© Copyright 1995-97
Maui High Performance Computing Center. All rights reserved.
Documents located on the Maui High Performance Computing Center's WWW server
are copyrighted by the MHPCC. Educational institutions are encouraged to
reproduce and distribute these materials for educational use as long as
credit and notification are provided. Please retain this copyright notice
and include this statement with any copies that you make. Also, the MHPCC
requests that you send notification of their use to help@mail.mhpcc.edu.
Commercial use of these materials is prohibited without prior written
permission.
Revised: 20 August 1997 blaise@mhpcc.edu
|