NAME : Ery Rivaldi
NPM : 53414655
CLASS : 4IA22
GROUP : 5
TOPIC : Parallel Computing
Parallel computing is a type of computing architecture
in which several processors execute or process an application or computation
simultaneously. Parallel computing helps in performing large computations by
dividing the workload between more than one processor, all of which work
through the computation at the same time. Most supercomputers employ parallel
computing principles to operate.
Parallel computing is also known as parallel
processing.
A. Parallelism Concept
1. von Neumann Architecture
· Named after the Hungarian mathematician/genius John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers.
· Also known as the "stored-program computer" - both program instructions and data are kept in electronic memory. Differs from earlier computers, which were programmed through "hard wiring".
· Since then, virtually all computers have followed this basic design.
· Comprised of four main components:
a. Memory
b. Control Unit
c. Arithmetic Logic Unit
d. Input/Output
· Read/write, random access memory is used to store both program instructions and data:
a. Program instructions are coded data which tell the computer to do something.
b. Data is simply information to be used by the program.
· The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
· The Arithmetic Logic Unit performs basic arithmetic operations.
· Input/Output is the interface to the human operator.
2. Flynn's Classical Taxonomy
· There are different ways to classify parallel computers.
· One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
· Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction Stream and Data Stream. Each of these dimensions can have only one of two possible states: Single or Multiple.
· The four possible classifications according to Flynn are:
a. SISD - Single Instruction stream, Single Data stream
b. SIMD - Single Instruction stream, Multiple Data streams
c. MISD - Multiple Instruction streams, Single Data stream
d. MIMD - Multiple Instruction streams, Multiple Data streams
3. Some General Parallel Terminology
· Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below.
· Most of these will be discussed in more detail later.
Supercomputing / High Performance Computing (HPC)
Using the world's fastest and largest computers to solve large problems.
Node
A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer.
CPU / Socket / Processor / Core
This varies, depending
upon who you talk to. In the past, a CPU (Central Processing Unit) was a
singular execution component for a computer. Then, multiple CPUs were incorporated into a node. Then,
individual CPUs were subdivided into multiple "cores", each being a unique execution unit.
CPUs with multiple cores are sometimes called "sockets" - vendor
dependent. The result is a node with multiple CPUs, each containing multiple cores. The nomenclature is confused
at times. Wonder why?
Task
A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.
Pipelining
Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.
Shared Memory
From a strictly hardware
point of view, describes a computer architecture where all processors have direct (usually bus based) access to common
physical memory. In a programming
sense, it describes a model where parallel tasks all have the same
"picture" of memory and can directly address and access the same
logical memory locations regardless of where the physical memory actually
exists.
Symmetric Multi-Processor (SMP)
Shared memory hardware
architecture where multiple processors share a single address space and have
equal access to all resources.
Distributed Memory
In hardware, refers to
network based memory access for physical memory that is not common. As a programming model, tasks can only logically
"see" local machine memory and must use communications to access
memory on other machines where other tasks are executing.
Communications
Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
Synchronization
The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point.
Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.
Granularity
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
Coarse: relatively large amounts of computational work are done between communication events
Fine: relatively small amounts of computational work are done between communication events
Observed Speedup
The observed speedup of a code which has been parallelized, defined as the ratio of serial wall-clock execution time to parallel wall-clock execution time. It is one of the simplest and most widely used indicators of a parallel program's performance.
Parallel Overhead
The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
a. Task start-up time
b. Synchronizations
c. Data communications
d. Software overhead imposed by parallel languages, libraries, operating system, etc.
e. Task termination time
Massively Parallel
Refers to the hardware
that comprises a given parallel system - having many processing elements. The
meaning of "many" keeps increasing, but currently, the largest
parallel computers are comprised of processing elements numbering in the
hundreds of thousands to millions.
Embarrassingly Parallel
Solving many similar, but
independent tasks simultaneously; little to no need for coordination between
the tasks.
Scalability
Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include:
a. Hardware - particularly memory-CPU bandwidths and network communication properties
b. Application algorithm
c. Parallel overhead related
d. Characteristics of your specific application
4. Limits and Costs of Parallel Programming
Amdahl's Law:
· Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized: speedup = 1 / (1 - P).
· If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).
· If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).
· If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.
· Introducing the number of processors (N) performing the parallel fraction of work, the relationship becomes: speedup = 1 / ((P / N) + S), where S = 1 - P is the serial fraction.
· It soon becomes obvious that there are limits to the scalability of parallelism. For example, if only 90% of the code can be parallelized (P = 0.90), the speedup can never exceed 10 no matter how many processors are used (see the sketch after this list).
· However, certain problems demonstrate increased performance by increasing the problem size.
· Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time.
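To make these limits concrete, here is a small, illustrative C program (not part of the original tutorial) that tabulates the Amdahl speedup 1 / ((1 - P) + P / N) for a few parallel fractions and processor counts; the specific values of P and N are arbitrary examples.

/* Tabulate Amdahl's Law speedups for a few parallel fractions (P)
 * and processor counts (N). Illustrative values only. */
#include <stdio.h>

static double amdahl_speedup(double p, int n)
{
    /* speedup = 1 / (serial fraction + parallel fraction / processors) */
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    const double fractions[]  = {0.50, 0.90, 0.99};     /* parallel fraction P */
    const int    processors[] = {10, 100, 1000, 10000}; /* processor count N  */

    printf("%8s %10s %10s %10s\n", "N", "P=0.50", "P=0.90", "P=0.99");
    for (size_t i = 0; i < sizeof processors / sizeof processors[0]; i++) {
        printf("%8d", processors[i]);
        for (size_t j = 0; j < sizeof fractions / sizeof fractions[0]; j++)
            printf(" %10.2f", amdahl_speedup(fractions[j], processors[i]));
        printf("\n");
    }
    return 0;
}

Running it shows, for instance, that even with 10,000 processors a code that is only 50% parallel can never run more than twice as fast, while a 99% parallel code tops out near 100x.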
Complexity:
· In general, parallel applications are much more complex than corresponding serial applications, perhaps an order of magnitude. Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them.
· The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
a. Design
b. Coding
c. Debugging
d. Tuning
e. Maintenance
· Adhering to "good" software development practices is essential when working with parallel applications - especially if somebody besides you will have to work with the software.
Portability:
· Thanks to standardization in several APIs, such as MPI, POSIX threads, and OpenMP, portability issues with parallel programs are not as serious as in years past. However...
· All of the usual portability issues associated with serial programs apply to parallel programs. For example, if you use vendor "enhancements" to Fortran, C or C++, portability will be a problem.
· Even though standards exist for several APIs, implementations will differ in a number of details, sometimes to the point of requiring code modifications in order to effect portability.
· Operating systems can play a key role in code portability issues.
· Hardware architectures are characteristically highly variable and can affect portability.
Resource Requirements:
· The primary intent of parallel programming is to decrease execution wall clock time; however, in order to accomplish this, more CPU time is required. For example, a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.
· The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems.
· For short-running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation. The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs.
Scalability:
· There are two types of scaling based on time to solution: strong scaling and weak scaling.
· Strong scaling:
a. The total problem size stays fixed as more processors are added.
b. The goal is to run the same problem size faster.
c. Perfect scaling means the problem is solved in 1/P time (compared to serial).
· Weak scaling:
a. The problem size per processor stays fixed as more processors are added.
b. The goal is to run a larger problem in the same amount of time.
c. Perfect scaling means a problem Px runs in the same time as a single processor run.
· The ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more processors is rarely the answer.
· The algorithm may have inherent limits to scalability. At some point, adding more resources causes performance to decrease. This is a common situation with many parallel applications.
· Hardware factors play a significant role in scalability. Examples:
a. Memory-CPU bus bandwidth on an SMP machine
b. Communications network bandwidth
c. Amount of memory available on any given machine or set of machines
d. Processor clock speed
· Parallel support libraries and subsystems software can limit scalability independent of your application.
B. Distributed Processing
Distributed processing is a setup in which multiple individual central processing units (CPUs) work on the same programs, functions or systems to provide more capability for a computer or other device.
C. Parallel Computing Architectures
1. Parallel Computer Memory Architectures
Distributed Memory
General Characteristics:
· Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect inter-processor memory.
· Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
· Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
· When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
· The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
Advantages:
· Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
· Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency.
· Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
· The programmer is responsible for many of the details associated with data communication between processors.
· It may be difficult to map existing data structures, based on global memory, to this memory organization.
· Non-uniform memory access times - data residing on a remote node takes longer to access than node-local data.
2. Hybrid Distributed-Shared Memory
General Characteristics:
· The largest and fastest computers in the world today employ both shared and distributed memory architectures (a minimal sketch of this hybrid model appears after the advantages and disadvantages below).
· The shared memory component can be a shared memory machine and/or graphics processing units (GPU).
· The distributed memory component is the networking of multiple shared memory/GPU machines, which know only about their own memory - not the memory on another machine. Therefore, network communications are required to move data from one machine to another.
· Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
Advantages and Disadvantages:
· Whatever is common to both shared and distributed memory architectures.
· Increased scalability is an important advantage.
· Increased programmer complexity is an important disadvantage.
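As a rough illustration of the hybrid model, the sketch below (not from the source tutorial) combines MPI for the distributed memory component with OpenMP for the shared memory component; the loop bounds, file name and compile command are assumptions made for the example.

/* Hybrid distributed-shared memory sketch: MPI between processes/nodes,
 * OpenMP threads inside each process. Compile with e.g.
 * "mpicc -fopenmp hybrid.c" (flags vary by platform). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes total?  */

    double local_sum = 0.0;
    /* Shared memory parallelism inside one process: OpenMP threads
     * split the loop and combine partial results via a reduction. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (i + 1.0);

    /* Distributed memory parallelism: combine the per-process results
     * over the network with an MPI collective operation. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}

Each MPI process could run on a different node and communicate over the network, while its OpenMP threads share that node's memory.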
D. Introduction to Thread Programming
A thread in computer programming is a unit of execution within a single program. A program that runs several threads can, for example, serve multiple users at the same time while keeping track of each user's place in the program. Threads belonging to the same process run concurrently and share that process's memory and other resources, whereas separate processes do not share memory with one another.
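As a minimal sketch of thread programming (not taken from the source above), the following C program uses POSIX threads (pthreads) to start several threads that run the same function concurrently inside one process and therefore share its memory; the thread count and function name are arbitrary choices for the example.

/* Minimal POSIX threads sketch: several threads run the same function
 * concurrently and share the process's memory.
 * Compile with e.g. "cc -pthread threads.c". */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;                  /* thread-private identifier */
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    /* Create the threads; each starts executing worker() immediately. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);

    /* Synchronize: wait until every thread has finished. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}

pthread_join is the simple synchronization step here: the main thread waits until every worker has finished before the program exits.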
E. Introduction to Message Passing and OpenMP
Message passing is a form of communication used in parallel computing, object-oriented programming and interprocess communication: processes exchange data by explicitly sending and receiving messages rather than by sharing memory. OpenMP, in contrast, is a shared-memory programming API in which compiler directives create teams of threads that work on data in a common address space.
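A minimal message-passing sketch in C with MPI (illustrative only; the value sent and the run command are assumptions): one process sends data to another as an explicit message instead of sharing memory.

/* Message-passing sketch with MPI: rank 0 sends an integer to rank 1,
 * which receives it; data moves only through explicit messages.
 * Run with e.g. "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                      /* data to share */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

For comparison, OpenMP takes the shared-memory approach: a compiler directive creates a team of threads that divide a loop among themselves. A minimal, illustrative sketch:

/* OpenMP sketch: threads share memory and split the loop iterations.
 * Compile with e.g. "cc -fopenmp openmp_sum.c". */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000; i++)
        sum += 1.0 / i;           /* each thread adds its share */

    printf("harmonic sum = %f\n", sum);
    return 0;
}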
F. Introduction to CUDA GPU Programming
A GPU (Graphics Processing Unit) is a specialized processor designed to accelerate image and graphics processing; it is usually located on a computer's or laptop's graphics card. CUDA (Compute Unified Device Architecture) is a parallel computing platform created by NVIDIA that allows its GPUs to be used not only for graphics processing but also for general-purpose computation. With CUDA, the many processing cores of an NVIDIA GPU can be used to carry out large amounts of computation in parallel.
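A minimal CUDA C sketch of this idea (illustrative; it assumes an NVIDIA GPU and the nvcc compiler, and the array size and block size are arbitrary): many GPU threads each add one pair of vector elements.

/* Minimal CUDA C sketch: the add() kernel runs on the GPU, with one
 * thread per array element. Compile with e.g. "nvcc vector_add.cu". */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    /* Unified (managed) memory is accessible from both CPU and GPU. */
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Launch enough thread blocks to cover all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                 /* wait for the GPU to finish */

    printf("c[0] = %f\n", c[0]);             /* expect 3.0 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

The <<<blocks, threads>>> launch syntax is what hands the computation to the many GPU cores, and cudaDeviceSynchronize makes the CPU wait until they are done.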
Sources:
https://computing.llnl.gov/tutorials/parallel_comp/#Concepts
http://laoderahmatputra.blogspot.com/2016/05/pengantar-thread-programming-message.html