Computer Science Question

I want an answer with no similarity at all; it must be conceptual and professional, as both requirements are important.


College of Computing and Informatics
Assignment 2
Deadline: Tuesday 28/11/2023 @ 23:59
[Total Mark for this Assignment is 8]
Student Details:
Name: ###
ID: ###
CRN: ###
Instructions:
• You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on
Blackboard via the allocated folder. These files must not be in compressed format.
• It is your responsibility to check and make sure that you have uploaded both the correct files.
• A zero mark will be given if you try to bypass SafeAssign (e.g., misspelling words, removing spaces between words, hiding characters, using different character sets, converting text into images or into languages other than English, or any other kind of manipulation).
• Email submission will not be accepted.
• You are advised to make your work clear and well-presented. This includes filling in your information on the cover page.
• You must use this template; failure to do so will result in a zero mark.
• You MUST show all your work, and text must not be converted into an image, unless specified otherwise by
the question.
• Late submission will result in ZERO mark.
• The work should be your own; copying from other students or other resources will result in a ZERO mark.
• Use Times New Roman font for all your answers.
Question One (2 Marks)
Learning Outcome(s): Recognize the fundamental principles of parallel and distributed processing, parallel system taxonomy, and parallel system performance metrics.
Sequential and Causal Consistency in Distributed Systems
1) Explain the fundamental difference between Sequential Consistency and
Causal Consistency in the context of parallel and distributed systems. Discuss
how these consistency models define the order in which operations appear to
be executed across multiple processes.
2) Consider a simple distributed system with three processes (P1, P2, P3) that
perform arithmetic operations. Assume each process has a local variable to
store a numerical value, and they exchange messages to update these values
(e.g., add 1, multiply by 2, subtract 5, etc.). Design a scenario where the
difference between Sequential Consistency and Causal Consistency
becomes apparent. Provide a step-by-step explanation of the operations and
message exchanges, illustrating how the two consistency models would lead
to different outcomes.
Question Two (3 Marks)
Learning Outcome(s): Recognize the fundamental principles of parallel and distributed processing, parallel system taxonomy, and parallel system performance metrics.
Analyzing Parallel Bubble Sort Performance
Consider the problem of parallel Bubble sort. The serial time for Bubble sort is
200 seconds, and the parallel time for an efficient parallelization using a parallel
Bubble sort algorithm is 60 seconds.
Calculate Speedup and Efficiency:
1) Calculate the speedup S achieved by the parallel implementation.
2) Assume that the number of processors is 6, calculate the efficiency E
of the parallel implementation.
Analytical Questions:
3) Discuss the implications of the calculated speedup. What does a
speedup of 3.33 indicate about the efficiency of the parallel algorithm?
4) Explain how the concept of speedup differs from efficiency. In what
situations might a high speedup not necessarily imply high efficiency?
Optimization Scenario:
5) Suppose an optimization is applied to the parallel algorithm, reducing
the parallel time to 45 seconds. Recalculate the speedup and
efficiency with this improved parallel time. Discuss the impact of the
optimization on the performance metrics.
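For parts 1, 2, and 5, the calculations can be checked numerically. The following minimal sketch assumes only the standard definitions S = T_serial / T_parallel and E = S / p; the times and processor count are the ones given in the question.

```python
# Minimal sketch, assuming the standard definitions S = T_serial / T_parallel
# and E = S / p. Times and processor count are taken from the question text.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

print(speedup(200, 60))        # ~3.33
print(efficiency(200, 60, 6))  # ~0.56
print(speedup(200, 45))        # ~4.44 (after the optimization)
print(efficiency(200, 45, 6))  # ~0.74
```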
Question Three (3 Marks)
Learning Outcome(s): Design algorithms using Dense Matrix and Search Algorithms for Discrete Optimization Problems.
Enhancing Distributed Systems with Caching and Replication
In the realm of parallel and distributed systems, the efficient management of data is critical for performance optimization. Caching and replication are two common strategies used to enhance data access and availability.
1) What are the essential alternatives for caching and replication in distributed systems?
2) Discuss the advantages and disadvantages of each alternative.
3) Explain how these alternatives contribute to improving system performance, fault tolerance, and overall scalability.
College of Computing and Informatics
CS476: Parallel and Distributed Systems
Module 1: Introduction to Parallel and Distributed Computing
Contents
1. Introduction: scope, issues, applications.
2. Challenges of Parallel and Distributed Computing.
3. Implicit Parallelism: Trends in Microprocessor Architectures.
4. Dichotomy of Parallel Computing Platforms.
5. Physical Organization and Co-processing.
Weekly Learning Outcomes
1. What is parallel or distributed computing, and why is it necessary?
2. Explain the scope, applications, and challenges of parallel and distributed computing.
3. Understand the communication model of parallel platforms.
Required Reading
Chapter 1, Introduction to Parallel and Distributed Computing: A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.
Recommended Reading:
Introduction to parallel and distributed processing: https://www.cs.rochester.edu/users/faculty/sandhya/csc258/
Implicit parallelism: https://www.youtube.com/watch?v=9_uX7L7-7ME
What is Parallel Computing:
Parallel computing refers to running an application or computation on multiple processors simultaneously. In general, it is a type of computing architecture in which a large problem is divided into separate, smaller, typically related parts that can be processed at the same time. Multiple CPUs work together to complete the computation, exchanging information through shared memory, and the partial results are then combined. Parallel computing facilitates the execution of complex computations by distributing the enormous problem among multiple processors.
What is Distributed Computing :
In distributed computing, a user sees a single system that is actually made up of several autonomous computers. There is no shared memory in distributed systems, and machines communicate with one another by exchanging messages. A single task is split up among several computers in distributed computing.
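To make the "divide a big problem into smaller parts that run at once" idea concrete, here is a minimal Python sketch using the standard multiprocessing module; the problem size, chunk layout, and worker count are arbitrary choices for the example.

```python
# Minimal sketch: splitting one large computation across several worker
# processes. Problem size and worker count are arbitrary example values.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 4
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)            # last chunk absorbs any remainder
    with Pool(processes=workers) as pool:
        partials = pool.map(partial_sum, chunks)   # each worker handles one chunk
    print(sum(partials))                        # combine the partial results
```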
Types of parallel computing
1. Bit-level parallelism: the form of parallel computing in which every task depends on the processor word size. When performing an operation on large data, it reduces the number of instructions the processor must execute, because otherwise the operation has to be split into a series of instructions. For example, with an 8-bit processor performing an operation on 16-bit numbers, the processor must first operate on the 8 lower-order bits and then on the 8 higher-order bits, so two instructions are needed. A 16-bit processor can perform the same operation with one instruction.
2. Instruction-level parallelism: within a single CPU clock cycle, the processor decides how many instructions are executed at the same time. A processor exploiting instruction-level parallelism can issue more than one instruction per clock cycle.
3. Task parallelism: the form of parallelism in which a task is decomposed into subtasks; each subtask is then allocated to a processor, and the subtasks execute concurrently (see the sketch below).
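A minimal task-parallelism sketch, assuming Python's standard concurrent.futures module; the two "subtasks" are placeholder functions invented purely for illustration.

```python
# Minimal sketch of task parallelism: two independent subtasks run concurrently.
# The subtasks below are hypothetical placeholders chosen only for illustration.
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    return sum(all(n % d for d in range(2, int(n ** 0.5) + 1)) for n in range(2, limit))

def sum_of_squares(limit):
    return sum(i * i for i in range(limit))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        f1 = pool.submit(count_primes, 50_000)       # subtask 1
        f2 = pool.submit(sum_of_squares, 1_000_000)  # subtask 2
        print(f1.result(), f2.result())              # gather results when both finish
```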
Parallel computing
Observation: high-performance distributed computing started with parallel computing (multiprocessor and multicore versus multicomputer).
Moore's law (attributed to Gordon Moore, co-founder of Intel): the number of transistors doubles every two years.
Leveraging Moore's law: more transistors means more opportunities for exploiting parallelism.
• Implicit parallelism: pipelining, superscalar execution.
• Explicit parallelism: streaming and multimedia processor extensions (e.g., MMX, AltiVec) and very long instruction words (VLIW).
Uniprocessor Limits
The power problem!
http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005
Parallel Computing vs. Distributed Computing
1. Parallel computing: many operations are performed simultaneously. Distributed computing: system components are located at different locations.
2. Parallel computing: a single computer is required. Distributed computing: multiple computers are used.
3. Parallel computing: multiple processors perform multiple operations. Distributed computing: multiple computers perform multiple operations.
4. Parallel computing: memory may be shared or distributed. Distributed computing: memory is distributed only.
5. Parallel computing: processors communicate with each other through a bus. Distributed computing: computers communicate with each other through message passing.
6. Parallel computing: improves system performance. Distributed computing: improves system scalability, fault tolerance, and resource-sharing capabilities.
Applications of Parallel Computing
There are various applications of Parallel Computing, which are as
follows:
1. One of the primary applications of parallel computing is Databases
and Data mining.
2. The real-time simulation of systems is another use of parallel
computing.
3. Technologies such as networked video and multimedia.
4. Science and Engineering.
5. Collaborative work environments.
6. The concept of parallel computing is used by augmented reality,
advanced graphics, and virtual reality.
Applications of Distributed computing:
Social networks, mobile systems, online
banking, and online gaming (e.g. multiplayer
systems) also use efficient distributed
systems. Additional areas of application for
distributed computing include e-learning
platforms, artificial intelligence, and e-commerce.
What are the challenges of parallel and distributed
computing?
Important concerns are workload sharing, which attempts to take advantage of
access to multiple computers to complete jobs faster; task migration, which
supports workload sharing by efficiently distributing jobs among machines; and
automatic task replication, which occurs at different sites for greater
reliability.
Heterogeneity: The Internet enables users to access services and run applications
over a heterogeneous collection of computers and networks. …
Transparency: …
Openness. …
Concurrency. …
Security. …
Scalability. …
Failure Handling.
Distributed Systems Issues
• The lack of global knowledge.
• Naming.
• Scalability: size, geographical, administrative.
• Compatibility.
• Process synchronization (requires global knowledge).
• Resource management (requires global knowledge).
• Reliability/fault tolerance: fail-stop and Byzantine (arbitrary) failure models.
Problems:
• Distributed consensus.
• Replication and caching consistency.
• Security and trust.
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and a datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses
each of these components in significant ways. Different applications
utilize different aspects of parallelism – e.g., data intensive
applications utilize high aggregate throughput, server applications
utilize high aggregate network bandwidth, and scientific applications
typically utilize high processing and memory system performance. It
is important to understand each of these performance bottlenecks.
Implicit Parallelism:
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors.
-The question of how best to utilize these resources is an important
one.
-Current processors use these resources in multiple functional units
and execute multiple instructions in the same cycle.
-The precise manner in which these instructions are selected and
executed provides impressive diversity in architectures.
Very Long Instruction Word (VLIW) Processors
The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
Dichotomy of Parallel Computing Platforms:
• An explicitly parallel program must specify concurrency and interaction between concurrent subtasks.
• The former is sometimes also referred to as the control structure and the latter as the communication model.
Control Structure of Parallel Programs
• Parallelism can be expressed at various levels of granularity, from the instruction level to processes.
• Between these extremes exist a range of models, along with corresponding architectural support.
• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (which work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD); see the sketch below.
• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
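To make the SIMD idea concrete, the following minimal Python/NumPy sketch contrasts a data-parallel (SIMD-style) element-wise operation with an explicit per-element loop; the array size is an arbitrary example value.

```python
# Minimal sketch: one operation applied to many data items at once (SIMD style)
# versus an explicit element-by-element loop. Array size is arbitrary.
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# Data-parallel form: the same operation is applied to all elements at once.
c_vec = a * 2.0 + b

# Equivalent sequential form: one element per loop iteration.
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] * 2.0 + b[i]

assert np.allclose(c_vec, c_loop)
```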
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD-MIMD Comparison
• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.
• SIMD relies on the regular structure of computations (such as those in image processing).
• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not.
• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
• A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors.
• It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
• Examples of such platforms include current-generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
Communication Model of Parallel Platforms
✓There are two primary forms of data exchange between parallel
tasks – accessing a shared data space and exchanging messages.
✓Platforms that provide a shared data space are called shared
address-space machines or multiprocessors.
✓Platforms that support messaging are also called message-passing platforms or multicomputers (see the message-passing sketch below).
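The following minimal sketch, using Python's standard multiprocessing module, illustrates the message-passing style of data exchange described above: two processes with no shared data communicate over an explicit channel. The Pipe and the payload are example choices only.

```python
# Minimal sketch of message passing: two processes exchange data over a pipe
# instead of reading shared memory. The payload is an arbitrary example.
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()               # receive a message from the parent
    conn.send(sum(data))             # send back a result message
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(list(range(10))) # explicit send ...
    print(parent_end.recv())         # ... and explicit receive: prints 45
    p.join()
```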
Shared Memory Parallel Systems
• Multiple processors can access the same memory simultaneously.
• Challenges:
– One processor's cache may contain a copy of data that was just modified by another; this requires hardware support for coherence.
– Two processes' operations may be interleaved in unintuitive ways; this requires hardware support and guarantees on atomicity and ordering.
Shared-Memory Parallel Computer
Shared Memory
• Programming through threading.
• Multiple processors share a pool of memory.
• Problems: cache coherence.
• UMA vs. NUMA architecture.
• Pros: easier to program (probably).
• Cons: performance may suffer if the memory is located on distant machines; limited scalability.
Distributed-Memory Parallel Computer
Distributed Memory
• Programming through processes.
• Explicit message passing.
• Networking.
• Pros: tighter control over message passing.
• Cons: harder to program.
Modern supercomputers are hybrids!
Multicore Resource Management
Multicore systems contain many shared resources, e.g., memory bandwidth and cache.
Problem: management of shared resources for efficiency and fairness.
Distributed Memory Parallel Systems
❖Parallel systems that do not share memory
❖Software system support for communication
(point-to-point, group)
❖Data must be explicitly partitioned and transferred
when needed
❖Dynamic workload management?
Shared-Address-Space Platforms
• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this
shared-address-space.
• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms:
▪ The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from the underlying algorithms for performance.
▪ Programming these platforms is easier since reads and writes are implicitly visible to other processors.
▪ However, read-write access to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
▪ Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
▪ A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
Shared-Address-Space vs. Shared Memory Machines
o It is important to note the difference between the terms
shared address space and shared memory.
o We refer to the former as a programming abstraction and to
the latter as a physical machine attribute.
o It is possible to provide a shared address space using a
physically distributed memory.
Physical Organization of Parallel Platforms:
We begin this discussion with an ideal parallel machine called Parallel Random Access
Machine, or PRAM.
Architecture of an Ideal Parallel Computer
o A natural extension of the Random Access Machine (RAM) serial
architecture is the Parallel Random Access Machine, or PRAM.
o PRAMs consist of p processors and a global memory of unbounded size
that is uniformly accessible to all processors.
o Processors share a common clock but may execute different instructions in
each cycle.
o Depending on how simultaneous memory accesses are handled, PRAMs
can be divided into four subclasses.
• Exclusive-read, exclusive-write (EREW) PRAM.
• Concurrent-read, exclusive-write (CREW) PRAM
• Exclusive-read, concurrent-write (ERCW) PRAM.
• Concurrent-read, concurrent-write (CRCW) PRAM.
Physical Complexity of an Ideal Parallel Computer
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is
not realizable.
College of Computing and Informatics
CS476: Parallel and Distributed Systems
Module 2: Introduction and Overview of Distributed Computing
Contents
1. Distributed versus Decentralized
2. Perspectives on distributed systems
3. What do we want to achieve? Overall design goals
4. Different Terminology: Failure, error, fault
5. Developing distributed systems: Pitfalls
Weekly Learning Outcomes
1. Describe the overall design goals of distributed computing, including support for sharing of resources.
2. Understand distribution transparency, openness, and scalability.
Required Reading
Chapter 1, Introduction: Distributed Systems, 4th Edition, Version 4.01. Maarten van Steen and Andrew S. Tanenbaum. CreateSpace; 3.01 edition (January 2023). ISBN: 978-90-815406-3-6 (printed version).
Recommended Reading:
Distributed versus Decentralized
What many people state (figure): distributed vs. decentralized vs. centralized systems.
When does a decentralized system become distributed?
Adding 1 link between two nodes in a decentralized system?
Adding 2 links between two other nodes?
In general: adding k > 0 links….?
Alternative approach
Two views on realizing distributed systems
Integrative view: connecting existing networked computer systems into a larger system.
Expansive view: an existing networked computer system is extended with additional computers.
Two definitions
A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.
Some common misconceptions
Centralized solutions do not scale
Make distinction between logically and physically centralized. The root of the Domain Name
System:
logically centralized
physically (massively) distributed
decentralized across several organizations
Centralized solutions have a single point of failure
Generally, not true (e.g., the root of DNS). A single point of failure is often:
easier to manage
easier to make more robust
Important
There are many, poorly founded, misconceptions regarding scalability, fault tolerance, security,
etc. We need to develop skills by which distributed systems can be readily understood so as to
judge such misconceptions.
Perspectives on distributed systems
Distributed systems are complex: take perspectives
Architecture: common organizations
Process: what kind of processes, and their relationships
Communication: facilities for exchanging data
Coordination: application-independent algorithms
Naming: how do you identify resources?
Consistency and replication: performance requires replication of data, and the copies need to be kept the same (consistent)
Fault tolerance: keep running in the presence of partial failures
Security: ensure authorized access to resources
What do we want to achieve?
Overall design goals
-Support sharing of resources
-Distribution transparency
-Openness
-Scalability
Sharing resources
Canonical examples
Cloud-based shared storage and files
Peer-to-peer assisted multimedia streaming
Shared mail services (think of outsourced mail systems)
Shared Web hosting (think of content distribution networks)
Observation
“The network is the computer”
(quote from John Gage, then at Sun Microsystems)
Distribution transparency
What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that its processes and resources are physically distributed across multiple computers, possibly separated by large distances.
Observation: distribution transparency is handled through many different techniques in a layer between applications and operating systems: a middleware layer.
Distribution transparency: Types
• Access: hide differences in data representation and how an object is accessed.
• Location: hide where an object is located.
• Relocation: hide that an object may be moved to another location while in use.
• Migration: hide that an object may move to another location.
• Replication: hide that an object is replicated.
• Concurrency: hide that an object may be shared by several independent users.
• Failure: hide the failure and recovery of an object.
Dependability
Basics
A component provides services to clients.
To provide services, the component may
require the services from other
components ⇒ a component may
depend on some other component.
Specifically
A component C depends on C∗ if the
correctness of C’s behavior depends on
the correctness of C∗’s behavior.
(Components are processes or channels.)
Requirements related to dependability
• Availability: readiness for usage.
• Reliability: continuity of service delivery.
• Safety: very low probability of catastrophes.
• Maintainability: how easily a failed system can be repaired.
Terminology: failure, error, fault
• Failure: a component is not living up to its specifications. Example: a crashed program.
• Error: the part of a component that can lead to a failure. Example: a programming bug.
• Fault: the cause of an error. Example: a sloppy programmer.
Terminology: handling faults
• Fault prevention: prevent the occurrence of a fault. Example: don't hire sloppy programmers.
• Fault tolerance: build a component so that it can mask the occurrence of a fault. Example: build each component by two independent programmers.
• Fault removal: reduce the presence, number, or seriousness of a fault. Example: get rid of sloppy programmers.
• Fault forecasting: estimate the current presence, future incidence, and consequences of faults. Example: estimate how a recruiter is doing when it comes to hiring sloppy programmers.
On security
Observation
A distributed system that is not secure is not dependable
What we need
Confidentiality: information is disclosed only to authorized parties
Integrity: Ensure that alterations to assets of a system can be made only in
an authorized way
Authorization, Authentication, Trust
Authentication: verifying the correctness of a claimed identity
Authorization: does an identified entity have proper access rights?
Trust: one entity can be assured that another will perform particular actions
according to a specific expectation
Security mechanisms
Keeping it simple
It’s all about encrypting and decrypting data using security keys.
Notation
K(data) denotes that we use key K to encrypt/decrypt data.
Symmetric cryptosystem
With encryption key EK and decryption key DK: if data = DK(EK(data)), then DK = EK. Note: the encryption and decryption keys are the same and should be kept secret.
Asymmetric cryptosystem
Distinguish a public key PK(data) and a private (secret) key SK(data).
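To illustrate the symmetric property data = DK(EK(data)) with DK = EK, here is a deliberately toy sketch (a simple XOR keystream, not a real cipher and not secure); it only demonstrates that the same key both encrypts and decrypts.

```python
# Toy illustration only (NOT real cryptography): the same shared key K is used
# both to encrypt and to decrypt, so data == D_K(E_K(data)).
def xor_with_key(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"shared-secret"
plaintext = b"hello, distributed world"
ciphertext = xor_with_key(plaintext, key)   # E_K(data)
recovered = xor_with_key(ciphertext, key)   # D_K(E_K(data)) with D_K = E_K
assert recovered == plaintext
```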
Security mechanisms
Secure hashing
In practice, we use secure hash functions: H(data) returns a fixed-length string.
• Any change from data to data* will lead to a completely different string H(data*).
• Given a hash value h, it is computationally infeasible to find data such that h = H(data).
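A minimal sketch of these two properties using Python's standard hashlib module; the input strings are arbitrary examples.

```python
# Minimal sketch: a secure hash gives a fixed-length digest, and a tiny change
# to the input produces a completely different digest.
import hashlib

h1 = hashlib.sha256(b"transfer 100 SAR to Alice").hexdigest()
h2 = hashlib.sha256(b"transfer 900 SAR to Alice").hexdigest()  # one character changed
print(h1)
print(h2)
assert len(h1) == len(h2) == 64   # fixed-length (256-bit) digests
assert h1 != h2                   # completely different strings
```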
Practical digital signatures (figure): signing a message for Bob by Alice.
Scale in distributed systems
Observation
Many developers of modern distributed systems easily use the
adjective “scalable” without making clear why their system actually
scales.
At least three components
Number of users or processes (size scalability)
Maximum distance between nodes (geographical scalability)
Number of administrative domains (administrative scalability)
Observation
Most systems account only, to a certain extent, for size scalability. Often
a solution: multiple powerful servers operating independently in parallel.
Today, the challenge still lies in geographical and administrative
scalability.
Formal analysis
A centralized service can be modeled as a simple queuing system.
Assumptions and notation:
• The queue has infinite capacity ⇒ the arrival rate of requests is not influenced by the current queue length or by what is being processed.
• Arrival rate of requests: λ
• Processing capacity of the service: µ requests per second
• Fraction of time the system has k requests: p_k = (1 − U) U^k, where U = λ/µ.
Formal analysis
• Utilization U of a service is the fraction of time that it is busy: U = λ/µ = 1 − p_0.
• Average number of requests in the system: N = Σ_k k·p_k = U/(1 − U).
• Average throughput: X = U·µ = λ.
Formal analysis
• Response time R: the total time taken to process a request after submission: R = S/(1 − U), with S = 1/µ being the service time, so R/S = 1/(1 − U).
Observations
• If U is small, the response-to-service-time ratio is close to 1: a request is processed immediately.
• If U goes up to 1, the system comes to a grinding halt. Solution: decrease S.
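A small sketch applying the formula above: it shows how the response-to-service-time ratio grows as utilization approaches 1. The service time and utilization values are arbitrary example points.

```python
# Sketch: response time R = S / (1 - U) for the simple queuing model above.
# Service time and utilization values are arbitrary example points.
service_time = 0.010  # S = 1/mu, e.g. 10 ms per request

for utilization in (0.1, 0.5, 0.9, 0.99):
    response = service_time / (1.0 - utilization)
    print(f"U = {utilization:4.2f}  ->  R/S = {response / service_time:6.1f}  "
          f"(R = {response * 1000:.1f} ms)")
```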
Problems with geographical scalability
•Cannot simply go from LAN to WAN: many distributed systems assume
synchronous client-server interactions: client sends request and waits for
an answer. Latency may easily prohibit this scheme.
•WAN links are often inherently unreliable: simply moving streaming video
from LAN to WAN is bound to fail.
•Lack of multipoint communication, so that a simple search broadcast
cannot be deployed. Solution is to develop separate naming and directory
services (having their own scalability problems).
Problems with administrative scalability
Essence
Conflicting policies concerning usage (and thus payment), management, and security
Examples
Computational grids: share expensive resources between different domains.
Shared equipment: how to control, manage, and use a shared radio telescope
constructed as large-scale shared sensor network?
Exception: several peer-to-peer networks
File-sharing systems (based, e.g., on BitTorrent)
Peer-to-peer telephony (early versions of Skype)
Peer-assisted audio streaming (Spotify)
Note: end users collaborate and not administrative entities.
Techniques for scaling
Facilitate a solution by moving computations to the client.
Partition data and computations across multiple machines:
• Move computations to clients (Java applets and scripts)
• Decentralized naming services (DNS)
• Decentralized information systems (WWW)
Replication and caching: make copies of data available at different machines:
• Replicated file servers and databases
• Mirrored Websites
• Web caches (in browsers and proxies)
• File caching (at server and client)
Scaling: The problem with replication
Applying replication is easy, except for one thing
Having multiple copies (cached or replicated), leads to inconsistencies:
modifying one copy makes that copy different from the rest.
Always keeping copies consistent and in a general way requires global
synchronization on each modification.
Global synchronization precludes large-scale solutions.
Observation
If we can tolerate inconsistencies, we may reduce the need for global
synchronization, but tolerating inconsistencies is application dependent.
Parallel computing
Observation
High-performance distributed computing started with parallel
computing
Multiprocessor and multicore versus multicomputer
Distributed shared memory systems
Observation
Multiprocessors are relatively easy to program in comparison to multicomputers, yet they have problems when increasing the number of processors (or cores). Solution: try to implement a shared-memory model on top of a multicomputer.
Example through virtual-memory techniques
Map all main-memory pages (from different processors) into one single virtual address
space. If a process at processor A addresses a page P located at processor B, the OS
at A traps and fetches P from B, just as it would if P had been located on local disk.
Problem
Performance of distributed shared memory could never compete with that of
multiprocessors and failed to meet the expectations of programmers. It has been widely
abandoned by now.
Cluster computing
Essentially a group of high-end systems connected through a LAN
Homogeneous: same OS, near-identical hardware
Single, or tightly coupled managing node(s)
Grid computing
The next step: plenty of nodes from everywhere
Heterogeneous
Dispersed across several organizations
Can easily span a wide-area network
Note
To allow for collaborations, grids generally use virtual organizations.
In essence, this is a grouping of users (or better: their IDs) that
allows for authorization on resource allocation.
Architecture for grid computing
The layers
Fabric: Provides interfaces to local
resources (for querying state and
capabilities, locking, etc.)
Connectivity: Communication/transaction
protocols, e.g., for moving data between
resources. Also various authentication
protocols.
Resource: Manages a single resource,
such as creating processes or reading
data.
Collective: Handles access to multiple
resources: discovery, scheduling,
replication.
Application: Contains actual grid
applications in a single organization.
Distributed pervasive systems
Observation
Emerging next-generation of distributed systems in which nodes are
small, mobile, and often embedded in a larger system, characterized by
the fact that the system naturally blends into the user’s environment.
Three (overlapping) subtypes
Ubiquitous computing systems: pervasive and continuously present, i.e.,
there is a continuous interaction between system and user.
Mobile computing systems: pervasive, but emphasis is on the fact that
devices are inherently mobile.
Sensor (and actuator) networks: pervasive, with emphasis on the actual
(collaborative) sensing and actuation of the environment.
Ubiquitous systems
Core elements
(Distribution) Devices are networked, distributed, and accessible
transparently
(Interaction) Interaction between users and devices is highly unobtrusive
(Context awareness) The system is aware of a user’s context to optimize
interaction
(Autonomy) Devices operate autonomously without human intervention, and
are thus highly self-managed
(Intelligence) The system as a whole can handle a wide range of dynamic
actions and interactions
Mobile computing
Distinctive features
A myriad of different mobile devices (smartphones, tablets, GPS devices,
remote controls, active badges).
Mobile implies that a device’s location is expected to change over time ⇒
change of local services, reachability, etc. Keyword: discovery.
Maintaining stable communication can introduce serious problems.
For a long time, research has focused on directly sharing resources
between mobile devices. It never became popular and is by now
considered to be a fruitless path for research.
Bottomline
Mobile devices set up connections to stationary servers, essentially bringing
mobile computing in the position of clients of cloud-based services.
Mobile computing
Mobile cloud computing
Mobile edge computing
Sensor networks
Characteristics
The nodes to which sensors are attached are:
Many (10s-1000s)
Simple (small memory/compute/communication
capacity)
Often battery-powered (or even battery-less)
Sensor networks as distributed databases
Two extremes
The cloud-edge continuum
Developing distributed systems: Pitfalls
Observation
Many distributed systems are needlessly complex, caused by mistakes that required patching
later on. Many false assumptions are often made.
False (and often hidden) assumptions
1. The network is reliable
2. The network is secure
3. The network is homogeneous
4. The topology does not change
5. Latency is zero
6. Bandwidth is infinite
7. Transport cost is zero
8. There is one administrator

College of Computing and Informatics
CS476: Parallel and Distributed Systems
Module 3: Principles of Parallel Algorithm Design
Contents
1. Introduction to Parallel Algorithms: Decomposition, Tasks, and Dependency Graphs
2. Decomposition Techniques
3. Characteristics of Tasks and Interactions; Dependency Graphs
4. Mapping Techniques for Load Balancing
Weekly Learning Outcomes
1. Explain parallel algorithms and decomposition techniques.
2. Discuss the characteristics of tasks, interactions, and dependency graphs.
3. Understand the basic mapping techniques for load balancing.
Required Reading
Chapter 3, Principles of Parallel Algorithm Design: A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.
Recommended Reading:
https://www.cs.purdue.edu/homes/ayg/book/Slides/chap3_slides.pdf
Preliminaries: Decomposition, Tasks, and Dependency Graphs
The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
A given problem may be decomposed into tasks in many different ways.
Tasks may be of the same, different, or even indeterminate sizes.
A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.
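As a concrete, hypothetical illustration of a task dependency graph, the sketch below encodes a tiny four-task decomposition as a dictionary of edges and shows which tasks may run concurrently once their prerequisites have completed; the task names and dependencies are invented for the example.

```python
# Sketch: a tiny task dependency graph. An edge u -> v means task v needs the
# result of task u. Tasks whose prerequisites are done may run concurrently.
deps = {                 # hypothetical four-task decomposition
    "load_a": [],
    "load_b": [],
    "multiply": ["load_a", "load_b"],
    "report": ["multiply"],
}

done = set()
while len(done) < len(deps):
    # All tasks whose prerequisites are satisfied could execute in parallel.
    ready = [t for t, pre in deps.items()
             if t not in done and all(p in done for p in pre)]
    print("run concurrently:", ready)
    done.update(ready)
```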
Steps in the Parallelization
• Decomposition into tasks: expose concurrency.
• Assignment to processes: balance load and maximize locality.
• Orchestration: name and access data; communicate (exchange) data; synchronize among processes.
• Mapping: assignment of processes to processors.
Decomposition into Tasks
• Many different decompositions are possible:
– Tasks may be independent or have dependencies requiring ordering.
– Tasks may execute identical or different code.
– Tasks may take the same or different amounts of time.
• Tasks and dependencies may be abstracted into a task dependency DAG, with nodes as tasks and edges as control dependences.
Granularity of Task Decompositions
Task size (granularity) versus the number of tasks.
Example: dense matrix-vector multiply.
• Fine grain: each task computes an individual element of y; large number of tasks.
• Coarse grain: each task computes multiple elements of y; small number of tasks.
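A minimal sketch of the coarse-grained decomposition just described: each task computes a contiguous block of rows of y = A·x. The matrix size and number of tasks are arbitrary example values.

```python
# Sketch: coarse-grained decomposition of y = A * x. Each task owns a block of
# rows of A and computes the corresponding block of y. Sizes are arbitrary.
import numpy as np

n, num_tasks = 12, 4
A = np.arange(n * n, dtype=float).reshape(n, n)
x = np.ones(n)

block = n // num_tasks
def task(t):
    rows = slice(t * block, (t + 1) * block)   # the rows this task owns
    return A[rows] @ x                         # its block of the result

y = np.concatenate([task(t) for t in range(num_tasks)])  # tasks could run in parallel
assert np.allclose(y, A @ x)
```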
Decomposition Techniques: Exploratory Decomposition
• In data and recursive decomposition, the decomposition is fixed/static from the design.
• Exploratory decomposition: exploration (search) of a state space of solutions.
– The problem decomposition reflects the shape of the execution.
– It goes hand-in-hand with the execution itself.
• Examples:
– discrete optimization, e.g., 0/1 integer programming
– theorem proving
– game playing
Example: solving a 15-puzzle
(Figure: a sequence of three moves from state (a) to the final state (d).)
Search: generate the successor states of the current state and explore each as an independent task.
Exploratory decomposition speedup (15-puzzle): the decomposition behaves according to the parallel formulation and may change the amount of work done.
Speculative Decomposition
• Dependencies between tasks are not known a priori, so it is impossible to identify independent tasks in advance.
• Two approaches:
– Conservative approaches, which identify independent tasks only when they are guaranteed not to have dependencies. These may yield little concurrency.
– Optimistic approaches, which schedule tasks even when they may potentially be inter-dependent, and roll back changes in case of an error.
Discrete event simulation
• Centralized time-ordered event list, e.g.: get up -> get ready -> drive to work -> work -> eat lunch -> work some more -> drive back -> eat dinner -> sleep.
• Simulation:
– extract the next event in time order
– process the event
– if required, insert new events into the event list
• Optimistic event scheduling:
– assume the outcomes of all prior events
– speculatively process the next event
– if an assumption is incorrect, roll back its effects and continue
Simulation of a network of nodes
• Simulate network behavior for various inputs and node delays. The inputs change dynamically, so the task dependencies are unknown.
Speculative vs. Exploratory
• Exploratory decomposition:
– The output of multiple tasks from a branch is unknown.
– The parallel program may perform more, less, or the same amount of work as the serial program.
• Speculative decomposition:
– The input at a branch leading to multiple parallel tasks is unknown.
– The parallel program may perform more or the same amount of work as the serial algorithm.
• Multiple decomposition techniques can be used together:
– One decomposition may not be optimal for concurrency, e.g., quicksort's recursive decomposition limits concurrency (why?).
– Example: combined recursive and data decomposition for MIN.
Characteristics of Tasks
• Theory (decomposition): to parallelize theoretically, i.e., the concurrency available in a problem.
• Practice (task creation, interactions, and mapping to PEs): realizing concurrency practically.
• The characteristics of tasks and task interactions impact the choice and performance of parallelism.
• Characteristics of tasks:
– task generation strategies
– task sizes (the amount of work, e.g., FLOPs)
– size of data associated with tasks
Task Generation
• Static task generation:
– Concurrent tasks and the task graph are known a priori (before execution).
– Typically arises from recursive or data decomposition.
– Examples: matrix operations, graph algorithms, image processing applications, and other regularly structured problems.
• Dynamic task generation:
– Computations formulate concurrent tasks and the task graph on the fly.
– Not explicit a priori, though high-level rules or guidelines are known.
– Typically arises from exploratory or speculative decompositions; also possible with recursive decomposition, e.g., quicksort.
– A classic example: game playing on the 15-puzzle board.
Task Sizes/Granularity
• The amount of work determines the amount of time to complete, e.g., FLOPs or the number of memory accesses.
• Uniform task sizes: often result from even data decomposition, i.e., regular problems.
• Non-uniform task sizes: e.g., quicksort, where sizes depend on the choice of pivot.
Size of Data Associated with Tasks
• May be small or large compared to the task sizes.
• It matters how this relates to the input and/or output data sizes. Examples:
– size(input) < size(computation), e.g., the 15-puzzle
– size(input) = size(computation) > size(output), e.g., finding a minimum
– size(input) = size(output) < size(computation), e.g., sorting
Considering the effort to reconstruct the same task context:
• Small data: small effort; the task can easily migrate to another process.
• Large data: large effort; the data ties the task to a process.
• Context reconstruction vs. communication: it depends.

Characteristics of Task Interactions
Aspects of interactions:
• What: shared data or synchronizations, and the sizes of the media.
• When: the timing.
• Who: with which task(s), and the overall topology/patterns.
• Do we know the details of the above three before execution?
• How: does the interaction involve one or both sides? This is an implementation concern, implicit or explicit.
Orthogonal classification:
• Static vs. dynamic
• Regular vs. irregular
• Read-only vs. read-write
• One-sided vs. two-sided
Static interactions:
• Partners and timing (and more) are known a priori.
• Relatively simpler to code into programs.
Dynamic interactions:
• The timing or the interacting tasks cannot be determined a priori.
• Harder to code, especially using explicit interaction.
Example of regular static interaction: image processing algorithms (dithering, edge detection); nearest-neighbor interactions on a 2D mesh.
Example of irregular static interaction: sparse matrix-vector multiplication.
Example task-interaction graph: sparse matrix-vector multiplication.
• Tasks: each task computes an entry of y[].
• Assign the i-th row of A to Task i. Also assign b[i] to Task i.
Read-only interactions: tasks only read data items associated with other tasks.
Read-write interactions: tasks read, as well as modify, data items associated with other tasks. These are harder to code and require additional synchronization primitives to avoid read-write and write-write ordering races.

Mapping Techniques for Load Balancing
Static and dynamic mapping:
• In parallel algorithm design, the program is decomposed and the characteristics of the tasks and interactions are identified.
• A large number of concurrent tasks is assigned to an equal or relatively small number of processes for execution (though often we do a 1:1 mapping).
Goal of mapping: minimize overheads.
• There is a cost to doing parallelism: interactions and idling (serialization).
• Contradicting objectives: interactions vs. idling.
– Idling increases with insufficient parallelism; interactions increase with excessive concurrency.
– E.g., assigning all work to one processor trivially minimizes interaction at the expense of significant idling.
Mapping techniques for minimum idling:
• Execution alternates stages of computation and interaction.
• Mapping must simultaneously minimize idling and balance the load.
• Idling means not doing useful work; load balance means doing the same amount of work.
• Merely balancing the load does not minimize idling.
Static or dynamic mapping:
• Static mapping: tasks are mapped to processes a priori. This needs a good estimate of task sizes; an optimal mapping may be NP-complete to find.
• Dynamic mapping: tasks are mapped to processes at runtime, because tasks are generated at runtime and their sizes are not known in advance.
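A small sketch contrasting the two mapping styles just described: a static block mapping fixed before execution versus a dynamic mapping in which the next task goes to whichever process becomes idle first. The task costs and worker count are made-up example values.

```python
# Sketch: with non-uniform task sizes, a static block mapping can leave some
# processes idle, while dynamic mapping gives the next task to whichever
# process finishes first. Task costs below are made-up example values.
import heapq

costs = [8, 1, 1, 1, 1, 1, 1, 1, 1]   # 9 tasks, non-uniform sizes
workers = 3

# Static block mapping: 3 contiguous tasks per process, fixed a priori.
static_loads = [sum(costs[w * 3:(w + 1) * 3]) for w in range(workers)]
print("static  per-process work:", static_loads, "-> finish at", max(static_loads))

# Dynamic mapping: the next task goes to the process that becomes idle first.
finish = [0.0] * workers
heap = [(0.0, w) for w in range(workers)]
for c in costs:
    t, w = heapq.heappop(heap)        # process w is the first to become idle
    finish[w] = t + c
    heapq.heappush(heap, (finish[w], w))
print("dynamic per-process work:", finish, "-> finish at", max(finish))
```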
Other factors determining the choice of mapping techniques:
• the size of data associated with a task
• the characteristics of inter-task interactions
• even the programming models and target architectures

Schemes for Static Mapping
• Mappings based on data decomposition (mostly 1-1 mapping)
• Mappings based on task graph partitioning
• Hybrid mappings

Mappings Based on Data Partitioning
Partition the computation using a combination of:
• data decomposition
• the "owner-computes" rule
Example: a 1-D block distribution of a 2-D dense matrix, with a 1-1 mapping of task/data and process.
(Figures: block array distribution schemes; block distribution and data sharing for dense matrix multiplication.)

Cyclic and Block-Cyclic Distributions
• Consider a block distribution for LU decomposition (Gaussian elimination): the amount of computation per data item varies, so a block decomposition would lead to significant load imbalance.
Block-Cyclic Distributions
• A variation of the block distribution scheme: partition an array into many more blocks (i.e., tasks) than the number of available processes.
• Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks (an N-1 mapping of tasks to processes).
• Used to alleviate the load-imbalance and idling problems.

Block Partitioning and Random Mapping
Sparse matrix computations:
• Load imbalance arises when using block-cyclic partitioning/mapping.
• More non-zero blocks go to the diagonal processes P0, P5, P10, and P15 than to the others; P12 gets nothing.

Graph Partitioning Based Data Decomposition
Array-based partitioning and static mapping:
• Regular domains (i.e., rectangular, mostly dense matrices) with structured and regular interaction patterns.
• Quite effective in balancing the computations and minimizing the interactions.
Irregular domains:
• Sparse-matrix-related problems; numerical simulations of physical phenomena (cars, water/blood flow, geographic data).
• Partition the irregular domain so as to assign an equal number of nodes to each process while minimizing the edge count of the partition.

Mappings Based on Task Partitioning
• Schemes for static mapping: mappings based on data partitioning (mostly 1-1 mapping), mappings based on task graph partitioning, and hybrid mappings.
• Data partitioning: data decomposition and then a 1-1 mapping of tasks to PEs.
• Task partitioning: partition a given task-dependency graph across processes.
– An optimal mapping for a general task-dependency graph is an NP-complete problem.
– Excellent heuristics exist for structured graphs.

Mapping a Binary Tree Dependency Graph
• Mapping the dependency graph of quicksort to processes in a hypercube.
• Hypercube: the n-dimensional analogue of a square and a cube; node numbers that differ in 1 bit are adjacent.

Hierarchical/Hybrid Mappings
• A single mapping may be inadequate, e.g., the task graph mapping of the binary tree (quicksort) cannot use a large number of processors.
• Hierarchical mapping: task graph mapping at the top level, data partitioning within each level.

Schemes for Dynamic Mapping
• Also referred to as dynamic load balancing; load balancing is the primary motivation for dynamic mapping.
• Dynamic mapping schemes can be centralized or distributed.

Centralized Dynamic Mapping
• Processes are designated as masters or workers ("slave" is politically incorrect).
• General strategy: the master has a pool of tasks and acts as a central dispatcher; when a worker runs out of work, it requests more work from the master.
• Challenge: as the number of processes increases, the master may become the bottleneck.
• Approach: chunk scheduling, where a process picks up multiple tasks at once.
– Chunk size: large chunk sizes may lead to significant load imbalances as well.
– Schemes exist to gradually decrease the chunk size as the computation progresses.

Distributed Dynamic Mapping
• All processes are created equal; each can send work to or receive work from the others.
• This alleviates the bottleneck in centralized schemes.
• Four critical design questions:
– How are sending and receiving processes paired together?
– Who initiates the work transfer?
– How much work is transferred?
– When is a transfer triggered?
• The answers are generally application specific.

College of Computing and Informatics
CS476: Parallel and Distributed Systems
Module 4: Naming

Contents
1. Naming, Identifiers & Addresses
2. Chord
3. Hierarchical Location Services (HLS)
4. Security in flat naming
5. Mounting in distributed systems
6. Name-space implementation
7. Iterative name resolution
8. Scalability issues
9. Modern & Secure DNS (Domain Name Service)
10. LDAP (Lightweight Directory Access Protocol)

Weekly Learning Outcomes
1. Understand the concepts of naming and mounting in distributed systems.
2. Learn about the modern and secure Domain Name Service and the Lightweight Directory Access Protocol.

Required Reading
Chapter 6, Naming: Distributed Systems, 4th Edition, Version 4.01. Maarten van Steen and Andrew S. Tanenbaum. CreateSpace; 3.01 edition (January 2023). ISBN: 978-90-815406-3-6 (printed version).
Recommended Reading
http://cs.boisestate.edu/~amit/teaching/455/handouts/chap-05v2.pdf
https://www.youtube.com/watch?v=Yma7rmHqWy8

Naming, Identifiers & Addresses
Essence: names are used to denote entities in a distributed system. To operate on an entity, we need to access it at an access point. Access points are entities that are named by means of an address.
Note: a location-independent name for an entity E is independent of the addresses of the access points offered by E.
Pure name: a name that has no meaning at all; it is just a random string. Pure names can be used for comparison only.
Identifier: a name having some specific properties:
1. An identifier refers to at most one entity.
2. Each entity is referred to by at most one identifier.
3. An identifier always refers to the same entity (i.e., it is never reused).
Observation: an identifier need not necessarily be a pure name, i.e., it may have content.
Addresses and identifiers are two important types of names that are each used for very different purposes. In many computer systems, addresses and identifiers are represented in machine-readable form only, that is, in the form of bit strings. For example, an Ethernet address is essentially a random string of 48 bits. Likewise, memory addresses are typically represented as 32-bit or 64-bit strings.

Types of Naming Systems
• Flat naming: the identifier is simply a random bit string. It does not contain any information whatsoever on how to locate an access point of its associated entity. Good for machines.
• Structured naming: composed of simple human-readable names. Examples are file-system naming and host naming on the Internet.
• Attribute-based naming: allows an entity to be described by (attribute, value) pairs. This allows a user to search more effectively by constraining some of the attributes.

Broadcasting
Broadcast the ID, requesting the entity to return its current address.
• Can never scale beyond local-area networks.
• Requires all processes to listen to incoming location requests.
Address Resolution Protocol (ARP): to find out which MAC address is associated with an IP address, broadcast the query "who has this IP address?"

Forwarding pointers
When an entity moves, it leaves behind a pointer to its next location.
• Dereferencing can be made entirely transparent to clients by simply following the chain of pointers.
• Update a client's reference when the present location is found.
• Geographical scalability problems (for which separate chain-reduction mechanisms are needed):
– long chains are not fault tolerant
– increased network latency at dereferencing
(Figure: the principle of mobile IP, a home-based approach.)

Illustrative: Chord
Consider the organization of many nodes into a logical ring.
• Each node is assigned a random m-bit identifier.
• Every entity is assigned a unique m-bit key.
• The entity with key k falls under the jurisdiction of the node with the smallest id ≥ k (called its successor, succ(k)).
Non-solution: let each node keep track of its neighbor and start a linear search along the ring.
Notation: we will speak of node p as the node having identifier p.
Chord lookup example: resolving key 26 from node 1 and key 12 from node 28.

Hierarchical Location Services (HLS)
Basic idea: build a large-scale search tree for which the underlying network is divided into hierarchical domains. Each domain is represented by a separate directory node.
HLS tree organization, invariants:
• The address of entity E is stored in a leaf or intermediate node.
• Intermediate nodes contain a pointer to a child if and only if the subtree rooted at the child stores an address of the entity.
• The root knows about all entities.
(Figure: storing information of an entity having two addresses in different leaf domains.)
HLS lookup operation, basic principles:
• Start the lookup at the local leaf node.
• If the node knows about E, follow the downward pointer; otherwise go up.
• An upward lookup always stops at the root.
HLS insert operation: (a) an insert request is forwarded to the first node that knows about entity E; (b) a chain of forwarding pointers to the leaf node is created.

Can an HLS scale?
Observation: a design flaw seems to be that the root node needs to keep track of all identifiers ⇒ make a distinction between a logical design and its physical implementation.
Notation:
• Assume there are a total of N physical hosts {H1, H2, ..., HN}. Each host is capable of running one or more location servers.
• Dk(A) denotes the domain at level k that contains address A; k = 0 denotes the root domain.
• LSk(E, A) denotes the unique location server in Dk(A) responsible for keeping track of entity E.
Basic idea for scaling:
• Choose different physical servers for the logical name servers on a per-entity basis (at the root level, but also at intermediate levels).
• Implement a mapping of entities to physical servers such that the load of storing records is distributed.
Solution:
• Dk = {Dk,1, Dk,2, ..., Dk,Nk} denotes the Nk domains at level k. Note: N0 = |D0| = 1.
• For each level k, the set of hosts is partitioned into Nk subsets, with each host running a location server representing exactly one of the domains Dk,i from Dk.
(Figure: the principle of distributing logical location servers.)

Security in flat naming
Basics: without special measures, we need to trust the name-resolution process to return what is associated with a flat name. Two approaches to follow:
• Secure the identifier-to-entity association.
• Secure the name-resolution process.
Self-certifying names: use a value derived from the associated entity and make it (part of) the flat name:
• id(entity) = hash(data associated with the entity) when dealing with read-only entities, otherwise
• id(entity) = public key(entity), in which case additional data is returned, such as a verifiable digital signature.
Securing the name-resolution process is much more involved: the discussion is deferred until we discuss secure DNS.

Name space
Naming graph: a graph in which a leaf node represents a (named) entity. A directory node is an entity that refers to other nodes.
(Figure: a general naming graph with a single root node.)
Note: a directory node contains a table of (node identifier, edge label) pairs.
We can easily store all kinds of attributes in a node:
• the type of the entity
• an identifier for that entity
• the address of the entity's location
• nicknames
• ...
Note: directory nodes can also have attributes, besides just storing a directory table with (identifier, label) pairs.

Name resolution
Problem: to resolve a name, we need a directory node. How do we actually find that (initial) node?
Closure mechanism: the mechanism to select the implicit context from which to start name resolution:
• www.distributed-systems.net: start at a DNS name server
• /home/maarten/mbox: start at the local NFS file server (possibly a recursive search)
• 0031 20 598 7784: dial a phone number
• 77.167.55.6: route the message to a specific IP address

Name linking
Hard link: what we have described so far as a path name, i.e., a name that is resolved by following a specific path in a naming graph from one node to another.
Soft link: allow a node N to contain a name of another node:
• First resolve N's name (leading to N).
• Read the content of N, yielding a name.
• Name resolution continues with that name.
Observations:
• The name resolution process determines that we read the content of a node, in particular, the name of the other node that we need to go to.
• One way or the other, we know where and how to start name resolution given a name.
(Figure: the concept of a symbolic link explained in a naming graph. Observation: node n5 has only one name.)

Mounting
Issue: name resolution can also be used to merge different name spaces transparently through mounting: associating a node identifier of another name space with a node in the current name space.
Terminology:
• Foreign name space: the name space that needs to be accessed.
• Mount point: the node in the current name space containing the node identifier of the foreign name space.
• Mounting point: the node in the foreign name space where name resolution continues.
Mounting across a network requires:
1. The name of an access protocol.
2. The name of the server.
3. The name of the mounting point in the foreign name space.
(Figure: mounting remote name spaces through a specific access protocol.)

Name-space implementation
Basic issue: distribute the name-resolution process, as well as name-space management, across multiple machines by distributing the nodes of the naming graph.
Distinguish three levels:
• Global level: consists of the high-level directory nodes.
Main aspect is that these directory nodes have to be jointly managed by different administrations • Administrational level: Contains mid-level directory nodes that can be grouped in such a way that each group can be assigned to a separate administration. • Managerial level: Consists of low-level directory nodes within a single administration. Main issue is effectively mapping directory nodes to local name servers. Name-space implementation An example partitioning of the DNS name space, including network files Name-space implementation A comparison between name servers for implementing nodes in a name space Item Global Administrational Managerial 1 Worldwide Organization Department 2 Few Many Vast numbers 3 Seconds Milliseconds Immediate 4 Lazy Immediate Immediate 5 Many None or few None 6 Yes Yes Sometimes 1: Geographical scale 2: # Nodes 3: Responsiveness 4: Update propagation 5: # Replicas 6: Client-side caching? Iterative name resolution Principle 1. resolve(dir, [name1,..., nameK ]) sent to Server0 responsible for dir 2. Server0 resolves resolve(dir, name1) → dir1, returning the identification (address) of Server1, which stores dir1. 3. Client sends resolve(dir1, [name2 , ..., nameK ]) to Server1, etc. Scalability issues Size scalability We need to ensure that servers can handle a large number of requests per time unit ⇒ high-level servers are in big trouble. Solution Assume (at least at global and administrational level) that content of nodes hardly ever changes. We can then apply extensive replication by mapping nodes to multiple servers, and start name resolution at the nearest server. Observation An important attribute of many nodes is the address where the represented entity can be contacted. Replicating nodes makes large-scale traditional name servers unsuitable for locating mobile entities. Scalability issues We need to ensure that the name resolution process scales across large geographical distances Problem By mapping nodes to servers that can be located anywhere, we introduce an implicit location dependency. DNS (Domain Name Service) Essence • Hierarchically organized name space with each node having exactly one incoming edge ⇒ edge label = node label. • domain: a subtree • domain name: a path name to a domain’s root node. Information in a node Type Refers to Description SOA A MX SRV NS CNAME PTR HINFO TXT Zone Host Domain Domain Zone Node Host Host Any kind Holds info on the represented zone IP addr. of host this node represents Mail server to handle mail for this node Server handling a specific service Name server for the represented zone Symbolic link Canonical name of a host Info on this host Any info considered useful Modern DNS (Domain Name Service) The traditional organization of the implementation of DNS The modern organization of DNS Secure DNS Basic approach Resource records of the same type are grouped into a signed set, per zone. Examples: • A set with all the IPv4 addresses of a zone • A set with all the IPv6 addresses of a zone • A set with the name servers of a zone The public key associated with the secret key used for signing a set of resource records is added to a zone, called a zone-signing key. Trusting the signatures • All zone-signing keys are grouped again into a separate set, which is signed using another secret key. The public key of the latter is the key-signing key. 
• The hash of the key-signing key is stored at, and signed by, the parent zone Secure DNS Building a trust chain • Consider a single set of resource records RR, hashed with HZk and signed with SKZk • SZKk has associated public key ZSKk • (Set of) ZSKk is hashed with HKk and signed with SKKk • SKKk has associated public key KSKk A client can verify signature SKZ2(HZ2(RR)) by checking ? ZSK2(SKZ2(HZ2(RR))) = HZ 2(RR) Mounting nested directories LDAP (Lightweight Directory Access Protocol ) Essence • Directory Information Base: collection of all directory entries in an LDAP service. • Each record is uniquely named as a sequence of naming attributes (called Relative Distinguished Name), so that it can be looked up. • Directory Information Tree: the naming graph of an LDAP directory service; each node represents a directory entry. Part of a directory information tree LDAP (Lightweight Directory Access Protocol ) Two directory entries having HostName as RDN Attribute Value Attribute Value Locality Amsterdam Locality Amsterdam Organization VU University Organization VU University OrganizationalUnit Computer Science OrganizationalUnit Computer Science CommonName Main server CommonName Main server HostName star HostName zephyr HostAddress 192.31.231.42 HostAddress 137 .37 .20.10 Result of search("(C=NL)(O=VU University)(OU=*)(CN=Main s e r v e r ) " ) Drawbacks of distributed index Quite a few • A query involving k attributes requires contacting k servers • Imagine looking up “lastName = Smith ∧firstName = Pheriby ”: the client may need to process many files as there are so many people named “Smith.” • No (easy) support for range queries, such as “price = [1000 − 2500].” Alternative: map all attributes to 1 dimension and then index Space-filling curves: principle 1. Map the N-dimensional space covered by the N attributes {a1,..., aN } into a single dimension 2. Hashing values in order to distribute the 1-dimensional space among index servers. Hilbert space-filling curve of (a) order 1, and (b) order 4 (a) (b) Thank You ‫ر‬ ‫الجامعة السعودية االلكتونية‬ ‫ر‬ ‫االلكتونية‬ ‫الجامعة السعودية‬ ‫‪26/12/2021‬‬ College of Computing and Informatics CS476 Parallel and Distributed Computing 2 CS476 Parallel and Distributed Computing Module 5 Analytical Modeling of Parallel Systems 3 Contents 1. Effect of Granularity on Performance 2. Scalability of Parallel Systems 3. Minimum Execution Time and Minimum Cost-Optimal Execution Time 4. Asymptotic Analysis of Parallel Programs 5. Other Scalability Metrics 4 Weekly Learning Outcomes 1. Learn the scalability of Parallel Systems . 2. Discuss about Minimum Execution Time and Minimum CostOptimal Execution Time. 5 Required Reading Chapter 5 Analytical Modeling of Parallel Systems: (Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.) Recommended Reading Granularity in Parallel Computing https://www.youtube.com/watch?v=AlzOErpaXE8 6 Effect of Granularity on Performance ❖Often, using fewer processors improves performance of parallel systems. ❖Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system. ❖A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled down processors. 
❖Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p. ❖The communication cost should not increase by this factor since some of the virtual processors assigned to a physical processors might talk to each other. This is the basic reason for the improvement from building granularity. Building Granularity: Example • Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2. • Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors. • Each of the p processors is now assigned n / p virtual processors. • The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements. • Subsequent log n - log p steps do not require any communication. Building Granularity: Example (continued) • The overall parallel execution time of this parallel system is Θ ( (n / p) log p). • The cost is Θ (n log p), which is asymptotically higher than the Θ (n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal. Building Granularity: Example (continued) Can we build granularity in the example in a cost-optimal fashion? • Each processing element locally adds its n / p numbers in time Θ (n / p). • The p partial sums on p processing elements can be added in time Θ(n /p). A cost-optimal way of computing the sum of 16 numbers using four processing elements. Building Granularity: Example (continued) • The parallel runtime of this algorithm is (3) • The cost is • This is cost-optimal, so long as ! Scalability of Parallel Systems How do we extrapolate performance from small problems and small systems to larger problems on larger configurations? Consider three parallel algorithms for computing an n-point Fast Fourier Transform (FFT) on 64 processing elements. A comparison of the speedups obtained by the binary-exchange, 2-D transpose and 3-D transpose algorithms on 64 processing elements with tc = 2, tw = 4, ts = 25, and th = 2. Clearly, it is difficult to infer scaling characteristics from observations on small datasets on small machines. Scaling Characteristics of Parallel Programs • The efficiency of a parallel program can be written as: or (4) • The total overhead function To is an increasing function of p . Scaling Characteristics of Parallel Programs ❖For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases. ❖The overall efficiency of the parallel program goes down. This is the case for all parallel programs. Scaling Characteristics of Parallel Programs: Example • Consider the problem of adding numbers on processing elements. • We have seen that: = (5) = (6) = (7) Scaling Characteristics of Parallel Programs: Example (continued) Plotting the speedup for various input sizes gives us: Speedup versus the number of processing elements for adding a list of numbers. Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law Scaling Characteristics of Parallel Programs ❖Total overhead function To is a function of both problem size Ts and the number of processing elements p. ❖ In many cases, To grows sublinearly with respect to Ts. ❖In such cases, the efficiency increases if the problem size is increased keeping the number of processing elements constant. 
❖For such systems, we can simultaneously increase the problem size and number of processors to keep efficiency constant. ❖We call such systems scalable parallel systems. Scaling Characteristics of Parallel Programs ❖Recall that cost-optimal parallel systems have an efficiency of Θ(1). ❖Scalability and cost-optimality are therefore related. ❖ A scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately. Isoefficiency Metric of Scalability ❖For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for all systems. ❖For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant. Isoefficiency Metric of Scalability Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems. Isoefficiency Metric of Scalability ❖What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed? ❖This rate determines the scalability of the system. The slower this rate, the better. ❖Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem. Isoefficiency Metric of Scalability • We can write parallel runtime as: (8) • The resulting expression for speedup is (9) • Finally, we write the expression for efficiency as Isoefficiency Metric of Scalability • For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To / W is maintained at a constant value. • For a desired value E of efficiency, (11) • If K = E / (1 – E) is a constant depending on the efficiency to be maintained, since To is a function of W and p, we have (12) Isoefficiency Metric of Scalability ❖The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant. ❖This function is called the isoefficiency function. ❖This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements Isoefficiency Metric: Example ❖ The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p . ❖ Substituting To by 2p log p , we get = (13) ❖ Thus, the asymptotic isoefficiency function for this parallel system is . ❖ If the number of processing elements is increased from p to p’, the problem size (in this case, n ) must be increased by a factor of (p’ log p’) / (p log p) to get the same efficiency as on p processing elements. Isoefficiency Metric: Example Consider a more complex example where • Using only the first term of To in Equation 12, we get = (14) • Using only the second term, Equation 12 yields the following relation between W and p: (15) • The larger of these two asymptotic rates determines the isoefficiency. 
This is given by Θ(p3) Cost-Optimality and the Isoefficiency Function • A parallel system is cost-optimal if and only if (16) • From this, we have: (17) (18) • If we have an isoefficiency function f(p), then it follows that the relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up. Lower Bound on the Isoefficiency Function • For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally. • The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function. Degree of Concurrency and the Isoefficiency Function ❖The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency. ❖If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively. Degree of Concurrency and the Isoefficiency Function: Example Consider solving a system of equations in variables by using Gaussian elimination (W = Θ(n3)) ❖The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n2) computations. ❖At most Θ(n2) processing elements can be kept busy at any time. ❖Since W = Θ(n3) for this problem, the degree of concurrency C(W) is Θ(W2/3) . ❖Given p processing elements, the problem size should be at least Ω(p3/2) to use them all. Minimum Execution Time and Minimum CostOptimal Execution Time Often, we are interested in the minimum time to solution. • We can determine the minimum parallel runtime TPmin for a given W by differentiating the expression for TP w.r.t. p and equating it to zero. =0 (19) • If p0 is the value of p as determined by this equation, TP(p0) is the minimum parallel time. Minimum Execution Time: Example Consider the minimum execution time for adding n numbers. (20) = Setting the derivative w.r.t. p to zero, we have p = n/ 2 . The corresponding runtime is = (21) (One may verify that this is indeed a min by verifying that the second derivative is positive). Note that at this point, the formulation is not costoptimal. Minimum Cost-Optimal Parallel Time ❖Let TPcost_opt be the minimum cost-optimal parallel time. ❖If the isoefficiency function of a parallel system is Θ(f(p)) , then a problem of size W can be solved cost-optimally if and only if ❖ W= Ω(f(p)) . ❖In other words, for cost optimality, p = O(f--1(W)) . ❖For cost-optimal systems, TP = Θ(W/p) , therefore, = (22) Asymptotic Analysis of Parallel Programs Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4 as follows: • Comparison of four different algorithms for sorting a given list of numbers. The table shows number of processing elements, parallel runtime, speedup, efficiency and the pTP product. • Asymptotic Analysis of Parallel Programs ❖If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of increasing TP). ❖In terms of efficiency, A2 and A4 are the best, followed by A3 and A1. ❖In terms of cost, algorithms A2 and A4 are cost optimal, A1 and A3 are not. ❖It is important to identify the objectives of analysis and to use appropriate metrics! Other Scalability Metrics ❖A number of other metrics have been proposed, dictated by specific needs of applications. 
❖For real-time applications, the objective is to scale up a system to accomplish a task in a specified time bound. ❖In memory constrained environments, metrics operate at the limit of memory and estimate performance under this problem growth rate. Other Scalability Metrics: Scaled Speedup ❖Speedup obtained when the problem size is increased linearly with the number of processing elements. ❖If scaled speedup is close to linear, the system is considered scalable. ❖If the isoefficiency is near linear, scaled speedup curve is close to linear as well. ❖If the aggregate memory grows linearly in p, scaled speedup increases problem size to fill memory. ❖Alternately, the size of the problem is increased subject to an upperbound on parallel execution time. Scaled Speedup: Example ❖The serial runtime of multiplying a matrix of dimension n x n with a vector is tcn2 . ❖For a given parallel algorithm, (24) ❖Total memory requirement of this algorithm is Θ(n2) . Scaled Speedup: Example (continued) Consider the case of memory-constrained scaling. • We have m= Θ(n2) = Θ(p). • Memory constrained scaled speedup is given by or • This is not a particularly scalable system Scaled Speedup: Example (continued) Consider the case of time-constrained scaling. ❖We have TP = O(n2) . ❖Since this is constrained to be constant, n2= O(p) . ❖Note that in this case, time-constrained speedup is identical to memory constrained speedup. ❖This is not surprising, since the memory and time complexity of the operation are identical. Scaled Speedup: Example • The serial runtime of multiplying two matrices of dimension n x n is tcn3. • The parallel runtime of a given algorithm is: (25) • The speedup S is given by: Scaled Speedup: Example (continued) Consider memory-constrained scaled speedup. • We have memory complexity m= Θ(n2) = Θ(p), or n2=c xp. • At this growth rate, scaled speedup S’ is given by: Note that this is scalable. Scaled Speedup: Example (continued) Consider time-constrained scaled speedup. ❖We have TP = O(1) = O(n3 / p) , or n3=c x p . ❖Time-constrained speedup S’’ is given by: Memory constrained scaling yields better performance. Serial Fraction f • If the serial runtime of a computation can be divided into a totally parallel and a totally serial component, we have: • From this, we have, (26) Serial Fraction f • The serial fraction f of a parallel program is defined as: • Therefore, we have: Serial Fraction ❖Since S = W / TP , we have ❖From this, we have: (27) ❖If f increases with the number of processors, this is an indicator of rising overhead, and thus an indicator of poor scalability. Thank You ‫ر‬ ‫الجامعة السعودية االلكتونية‬ ‫ر‬ ‫االلكتونية‬ ‫الجامعة السعودية‬ ‫‪26/12/2021‬‬ ‫‪1‬‬ College of Computing and Informatics CS476 Parallel and distributed systems 2 CS476 Parallel and distributed systems Module 6 Architectures Distributed Systems 3 Contents 1. Architectural styles 2. Middleware and distributed systems 3. Layered-system architectures 4. Symmetrically distributed system architectures 5. Hybrid system architectures 4 Weekly Learning Outcomes 1. Learn Concept of Architectural styles. 2. Learn Middleware and distributed systems. 3. Learn about Layered-system architectures. 5 Required Reading Chapter 2 Architectures: Distributed Systems, 4th Edition, Version 4.01 Author(s) Maarten van Steen, Andrew S. 
Tanenbaum Publisher: CreateSpace; 3.01 edition (January 2023) ISBN: 978-90-815406-3-6, (Printed version) Recommended Reading https://www.tutorialspoint.com/software_architecture_design/distributed_ architecture.htm Architectures of distributed systems https://www.youtube.com/watch?v=kZvOKio0Q-A 6 Architectural styles Basic idea A style is formulated in terms of • (replaceable) components with well-defined interfaces • the way that components are connected to each other • the data exchanged between components • how these components and connectors are jointly configured into a system. Connector A mechanism that mediates communication, coordination, or cooperation among components. Example: facilities for (remote) procedure call, messaging, or streaming. Layered architecture Different layered organizations (a) (b) (c) Example: communication protocols Protocol, service, interface Two-party communication Server from socket 1 2 3 4 import * s = socket(AF_INET, SOCK_STREAM) (conn, addr) = s . a c c e p t ( ) # returns new socket and addr. c l i e n t while True: # forever d a t a = conn.recv(1024) # r eceive data from c l i e n t i f not d a t a : break # s t o p i f c l i e n t stopped msg = data.decode()+" * " # process t h e incoming data i n t o a response conn.send(msg.encode()) # r etur n t h e response conn.close() # c l o s e t h e connection 5 6 7 8 9 10 Client 1 from socket 2 3 4 5 6 7 8 9 import * s = socket(AF_INET, SOCK_STREAM) s.connect((HOST, PORT)) # connect t o server (block u n t i l accepted) msg = "Hello World" # compose a message s.send(msg.encode()) # send t h e message d ata = s.recv(1024) # receive t h e response print(data.decode()) # p r i n t t h e r e s u l t s.clo se() # close t h e connection Application Layering Traditional three-layered view •Application-interface layer contains units for interfacing to users or external applications •Processing layer contains the functions of an application, i.e., without specific data •Data layer contains the data that a client wants to manipulate through the application components Observation This layering is found in many distributed information systems, using traditional database technology and accompanying applications. 11 Application Layering Example: a simple search engine Object-based style Essence Components are objects, connected to each other through procedure calls. Objects may be placed on different machines; calls can thus execute across a network. Encapsulation Objects are said to encapsulate data and offer methods on that data without revealing the internal implementation . 13 RESTful architectures Essence View a distributed system as a collection of resources, individually managed by components. Resources may be added, removed, retrieved, and modified by (remote) applications. 1.Resources are identified through a single naming scheme 2.All services offer the same interface 3.Messages sent to or from a service are fully self-described 4.After executing an operation at a service, that component forgets everything about the caller Basic operations Operation PUT GET DELETE POST Description Create a new resource Retrieve the state of a resource in some representation Delete a resource Modify a resource by transferring a new state 14 Example: Amazon’s Simple Storage Service Essence Objects (i.e., files) are placed into buckets (i.e., directories). Buckets cannot be placed into buckets. 
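The operations described next are plain HTTP requests on bucket/object URIs. A hedged sketch using Python's requests library is shown below; the bucket and object names are hypothetical, and real S3 calls additionally require signed (authenticated) requests, which are omitted here, so an unauthenticated run would simply return an error status.

import requests

BUCKET = "example-bucket"     # hypothetical bucket name
OBJECT = "notes.txt"          # hypothetical object name
url = "http://" + BUCKET + ".s3.amazonaws.com/" + OBJECT

requests.put(url, data=b"hello distributed systems")   # create/overwrite the object
resp = requests.get(url)                                # read the object back
print(resp.status_code)

listing = requests.get("http://" + BUCKET + ".s3.amazonaws.com/")  # list the bucket
print(listing.status_code)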
Operations on ObjectName in bucket BucketName require the following identifier: http://BucketName.s3.amazonaws.com/ObjectName Typical operations All operations are carried out by sending HTTP requests: •Create a bucket/object: PUT, along with the URI •Listing objects: GET on a bucket name •Reading an object: GET on a full URI 15 On interfaces Issue Many people like RESTful approaches because the interface to a service is so simple. The catch is that much needs to be done in the parameter space. Amazon S3 SOAP interface 16 Coordination Temporal and referential coupling Referentially coupled Referentially decoupled Temporally coupled Direct Temporally coupled Mailbox Event-based Shared data space Event-based and Shared data space 17 Example: Linda tuple space Three simple operations • i n( t ) : remove a tuple matching template t • rd(t): obtain copy of a tuple matching template t • out(t): add tuple t to the tuple space More details • Calling o u t ( t ) twice in a row, leads to storing two copies of tuple t ⇒ a tuple space is modeled as a multiset. • Both i n and rd are blocking operations: the caller will be blocked until a matching tuple is found, or has become available. Example: Linda tuple space 1 2 3 4 Bob: 5 6 7 8 9 10 11 1 Alice: 2 3 4 5 6 7 1 2 3 Chuck: 4 5 6 7 8 9 10 11 12 import lin d a linda.connect() blog = linda.TupleSpace() linda.universe._out(("MicroBlog",blog)) blog = linda.universe._rd(("MicroBlog",linda.TupleSpace))[1] blog._out(( "bob","distsys","I am studying chap 2 " ) ) blog._out(("bob","distsys","The l i n d a example’s p r e t t y simple")) blog._out(("bob","gtcn","Cool book!")) import lin d a linda.connect() blog = linda.universe._rd(("MicroBlog",linda.TupleSpace))[1] blog._out(("alice","gtcn","This graph theory s t u f f i s n o t easy")) b lo g . _ o u t( ( " alice" , "d is ts ys " , " I l i k e systems more t h an graphs")) import lin d a linda.connect() blog = linda.universe._rd(("MicroBlog",linda.TupleSpace))[1] t 1 = blog._rd(("bob","distsys",str)) t 2 = blog._rd(( "alice","gtcn",str)) t 3 = blog._rd(("bob","gtcn",str)) print t 1 print t 2 print t 3 Publish and subscribe Issue: how to match events? • Assume events are described by (attribute,value) pairs • topic-based subscription: specify a “attribute = value” series • content-based subscription: specify a “attribute ∈ range” series Observation Content-based subscriptions may easily have serious scalability problems (why?) Middleware: the OS of distributed systems What does it contain? Commonly used components and functions that need not be implemented by applications separately. 
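Returning to the publish-subscribe matching question raised above: a minimal sketch, assuming events are plain dictionaries of (attribute, value) pairs. The helper names and the example event are illustrative and do not belong to any particular middleware product.

def matches_topic(event, subscription):
    # Topic-based: every constraint is "attribute = value".
    return all(event.get(attr) == value for attr, value in subscription.items())

def matches_content(event, subscription):
    # Content-based: every constraint is "attribute in [low, high]".
    return all(attr in event and low <= event[attr] <= high
               for attr, (low, high) in subscription.items())

event = {"symbol": "ACME", "price": 1750}
print(matches_topic(event, {"symbol": "ACME"}))         # True
print(matches_content(event, {"price": (1000, 2500)}))  # True
print(matches_content(event, {"price": (0, 100)}))      # False

The content-based matcher has to evaluate a predicate against every published event, which hints at why content-based subscriptions tend to scale much worse than topic-based ones.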
Organizing wrappers Two solutions: 1-on-1 or through a broker Complexity with N applications • 1-on-1: requires N × ( N − 1) = O(N2) wrappers • broker: requires 2N = O ( N ) wrappers Intercept the usual flow of control Centralized system architectures Basic Client–Server Model Characteristics: • There are processes offering services (servers) • There are processes that use services (clients) • Clients and servers can be on different machines • Clients follow request/reply model regarding using services Multi-tiered centralized system architectures Some traditional organizations • Single-tiered: dumb terminal/mainframe configuration • Two-tiered: client/single server configuration • Three-tiered: each layer on separate machine Traditional two-tiered configurations (a) (b) (c) (d) (e) Being client and server at the same time Three-tiered architecture Example: The Network File System Foundations Each NFS server provides a standardized view of its local file system: each server supports the same model, regardless the implementation of the file system. The NFS remote access model Remote access Upload/download Note FTP is a typical upload/download model. The same can be said for systems like Dropbox. 27 NFS architecture Example: Simple Web servers Back in the old days... ...life was simple: • A website consisted as a collection of HTML files • HTML files could be referred to each other by a hyperlink • A Web server essentially needed only a hyperlink to fetch a file • A browser took care of properly rendering the content of a file Alternative organizations Vertical distribution Comes from dividing distributed applications into three logical layers, and running the components from each layer on a different server (machine). Horizontal distribution A client or server may be physically split up into logically equivalent parts, but each part is operating on its own share of the complete data set. Peer-to-peer architectures Processes are all equal: the functions that need to be carried out are represented by every process ⇒ each process will act as a client and a server at the same time (i.e., acting as a servant). 30 Structured P2P Essence Make use of a semantic-free index: each data item is uniquely associated with a key, in turn used as an index. Common practice: use a hash function key(data item) = hash(data item’s value). P2P system now responsible for storing (key,value) pairs. Simple example: hypercube Looking up d with key k ∈ {0, 1, 2,..., 24 − 1} means routing request to node with identifier k . 31 Unstructured P2P Essence Each node maintains an ad hoc list of neighbors. The resulting overlay resembles a random graph: an edge ⟨u, v ⟩ exists only with a certain probability P[⟨u, v ⟩]. Searching •Flooding: issuing node u passes request for d to all neighbors. Request is ignored when receiving node had seen it before. Otherwise, v searches locally for d (recursively). May be limited by a Time-To-Live: a maximum number of hops. •Random walk: issuing node u passes request for d to randomly chosen neighbor, v . If v does not have d , it forwards request to one of its randomly chosen neighbors, and so on. 32 Example: Chord Principle •Nodes are logically organized in a ring. Each node has an m-bit identifier. •Each data item is hashed to an m-bit key. •Data item with key k is stored at node with smallest identifier id ≥ k , called the successor of key k . •The ring is extended with various shortcut links to other nodes. 
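Before the lookup example that follows, the successor rule itself can be sketched directly. This is the naive linear-search "nonsolution" rather than Chord's finger-table routing, and the node identifiers below are illustrative rather than copied from the figure.

NODES = sorted([1, 4, 9, 11, 14, 18, 20, 21, 28])   # illustrative node ids
M = 5                                                # m-bit identifier space: 0 .. 31

def succ(key):
    # Successor of key k: the node with the smallest id >= k,
    # wrapping around to the smallest id on the ring if no such node exists.
    key = key % (2 ** M)
    for node in NODES:
        if node >= key:
            return node
    return NODES[0]

print(succ(3))    # 4
print(succ(26))   # 28
print(succ(30))   # 1  (wrap-around)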
33 Example: Chord lookup(3)@9 : 28 → 1 → 4 Super-peer networks Essence It is sometimes sensible to break the symmetry in pure peer-to-peer networks: •When searching in unstructured P2P systems, having index servers improves performance •Deciding where to store data can often be done more efficiently through brokers. 35 Collaboration: The BitTorrent case Principle: search for a file F •Lookup file at a global directory ⇒ returns a torrent file •Torrent file contains reference to tracker: a server keeping an accurate account of active nodes that have (chunks of) F . •P can join swarm, get a chunk for free, and then trade a copy of that chunk for another one with a peer Q also in the swarm. 36 Cloud computing Cloud computing Make a distinction between four layers •Hardware: Processors, routers, power and cooling systems. Customers normally never get to see these. •Infrastructure: Deploys virtualization techniques. Evolves around allocating and managing virtual storage devices and virtual servers. •Platform: Provides higher-level abstractions for storage and such. Example: Amazon S3 storage system offers an API for (locally created) files to be organized and stored in so-called buckets. •Application: Actual applications, such as office suites (text processors, spreadsheet applications, presentation applications). Comparable to the suite of apps shipped with OSes. 38 Edge-server architecture Essence Systems deployed on the Internet where servers are placed at the edge of the network: the boundary between enterprise networks and the actual Internet. 39 Reasons for having an edge infrastructure Commonly (and often misconceived) arguments •Latency and bandwidth: Especially important for certain real-time applications, such as augmented/virtual reality applications. Many people underestimate the latency and bandwidth to the cloud. •Reliability: The connection to the cloud is often assumed to be unreliable, which is often a false assumption. There may be critical situations in which extremely high connectivity guarantees are needed. •Security and privacy: The implicit assumption is often that when assets are nearby, they can be made better protected. Practice shows that this assumption is generally false. However, securely handling data operations in the cloud may be trickier than within your own organization. 40 Blockchains Principle working of a blockchain system Observations • Blocks are organized into an unforgeable append-only chain • Each block in the blockchain is immutable ⇒ massive replication • The real snag lies in who is allowed to append a block to a chain Appending a block: distributed consensus Centralized solution Observation A single entity decides on which validator can go ahead and append a block. Does not fit the design goals of blockchains. Appending a block: distributed consensus Distributed solution (permissioned) Observation • A selected, relatively small group of servers jointly reach consensus on which validator can go ahead. • None of these servers needs to be trusted, as long as roughly two-thirds behave according to their specifications. • In practice, only a few tens of servers can be accommodated. Appending a block: distributed consensus Decentralized solution (permisionless) Observation • Participants collectively engage in a leader election. Only the elected leader is allowed to append a block of validated transactions. • Large-scale, decentralized leader election that is fair, robust, secure, and so on, is far from trivial. 
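Why an appended block is effectively immutable is easiest to see in a toy hash chain. The sketch below is illustrative only and deliberately leaves out the hard part the slides stress, namely deciding who is allowed to append the next block.

import hashlib, json

def block_hash(block):
    # Hash a canonical JSON encoding of the block (illustrative, not a real chain format).
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    # Each new block commits to its predecessor through prev_hash.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify(chain):
    # Recompute the links; changing any earlier block breaks every later link.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, ["A pays B 5"])
append_block(chain, ["B pays C 2"])
print(verify(chain))                          # True
chain[0]["transactions"] = ["A pays B 500"]   # tamper with an old block
print(verify(chain))                          # False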
Thank You 45 ‫ر‬ ‫الجامعة السعودية االلكتونية‬ ‫ر‬ ‫االلكتونية‬ ‫الجامعة السعودية‬ ‫‪26/12/2021‬‬ College of Computing and Informatics CS476 Parallel and Distributed Computing 2 CS476 Parallel and Distributed Computing Module 7 Performance 3 Contents 1. Sources of Overhead in Parallel Programs 2. Performance Metrics for Parallel Systems 4 Weekly Learning Outcomes 1. Learn the concept of sources of overhead in parallel programs and performance metrics . 5 Required Reading Chapter 5 Analytical Modeling of Parallel Systems (Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.) Recommended Reading Chapter 5 Analytical Modeling of Parallel Systems : https://www.cs.purdue.edu/homes/ayg/book/Slides/chap5_slides.pdf Granularity in Parallel Computing: https://www.youtube.com/watch?v=AlzOErpaXE8 6 Analytical Modeling - Basics • A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size). • The asymptotic runtime of a sequential program is identical on any serial platform. • The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine. • An algorithm must therefore be analyzed in the context of the underlying platform. • A parallel system is a combination of a parallel algorithm and an underlying platform. Analytical Modeling - Basics • A number of performance measures are intuitive. • Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed of the program is ported to another machine altogether? • How much faster is the parallel version? This begs the obvious followup question whats the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look • Raw FLOP count - What good are FLOP counts when they dont solve a problem? Sources of Overhead in Parallel Programs • If I use two processors, shouldnt my program run twice as fast? • No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance. The execution profile of a hypothetical parallel program executing on eight processing elements. Profile indicates times spent performing computation (both essential and excess), communication, and idling. Sources of Overheads in Parallel Programs • Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other. • Idling: Processes may idle because of load imbalance, synchronization, or serial components. • Excess Computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication. Performance Metrics for Parallel Systems: Execution Time • Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. • The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution. • We denote the serial runtime by and the parallel runtime by TP . Performance Metrics for Parallel Systems: Total Parallel Overhead • Let Tall be the total time collectively spent by all the processing elements. 
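The relations To = p*TP - TS, S = TS/TP and E = S/p quoted above are easy to check numerically. In the sketch below the bubble-sort timings come from the slide, while the processor count is a hypothetical value added only to exercise the formulas.

def overhead(p, T_P, T_S):
    # Total overhead To = p*TP - TS: processor-time not spent on useful work.
    return p * T_P - T_S

def speedup(T_S, T_P):
    return T_S / T_P

def efficiency(T_S, T_P, p):
    return speedup(T_S, T_P) / p

# Bubble-sort example from the slide: serial bubble sort 150 s, parallel odd-even sort 40 s.
print(speedup(150, 40))   # 3.75, against serial bubble sort
print(speedup(30, 40))    # 0.75, against the faster serial quicksort baseline

p = 8                     # hypothetical processor count (not given on the slide)
print(efficiency(150, 40, p))   # ~0.47
print(overhead(p, 40, 150))     # 170 processor-seconds of non-useful work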
• TS is the serial time. • Observe that Tall - TS is then the total time spend by all processors combined in nonuseful work. This is called the total overhead. • The total time collectively spent by all the processing elements Tall = p TP (p is the number of processors). • The overhead function (To) is therefore given by To = p TP - TS (1) Performance Metrics for Parallel Systems: Speedup • What is the benefit from parallelism? • Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements. Performance Metrics: Example • Consider the problem of adding n numbers by using n processing elements. • If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors. Performance Metrics: Example Computing the globalsum of 16 partial sums using 16 processing elements . Σji denotes the sum of numbers with consecutive labels from i to j. Performance Metrics: Example (continued) • If an addition takes constant time, say, tc and communication of a single word takes time ts + tw, we have the parallel time TP = Θ (log n) • We know that TS = Θ (n) • Speedup S is given by S = Θ (n / log n) Performance Metrics: Speedup • For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees. • For the purpose of computing speedup, we always consider the best sequential program as the baseline. Performance Metrics: Speedup Example • Consider the problem of parallel bubble sort. • The serial time for bubblesort is 150 seconds. • The parallel time for odd-even sort (efficient parallelization of bubble sort) is 40 seconds. • The speedup would appear to be 150/40 = 3.75. • But is this really a fair assessment of the system? • What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system. Performance Metrics: Speedup Bounds • Speedup can be as low as 0 (the parallel program never terminates). • Speedup, in theory, should be upper bounded by p - after all, w...
