The 1st assignment is a discussion topic that should answer the following question: List some of the challenges of cloud computing. How can one overcome those challenges? It should be 150-200 words.
The 2nd assignment is also a discussion topic that should answer the following question: Virtualization of the processor combined with virtual memory management poses multiple challenges. Analyze the interaction of interrupt handling and paging and explain what you found to the class. It should be 150-200 words.
The 3rd assignment is a case study: Search the Web for reports of cloud system failures. Write a 3 to 4 page paper where you discuss the causes of each incident.
Chapter 4 – Parallel & Distributed Systems
Contents
1. Cloud computing and distributed systems.
2. Parallel processing and distributed computing.
3. Parallel computer architecture.
4. SIMD architectures.
5. Graphics processing units.
6. Speed-up and Amdahl's Law.
7. Multicore processor speed up.
8. Distributed systems.
9. Modularity.
10. Layering and hierarchy.
11. Virtualization; layering and virtualization.
12. Peer-to-peer systems.
1. Cloud computing and distributed systems
◼ Cloud computing is intimately tied to parallel and distributed processing.
◼ Parallel and distributed computing required major advances in several areas including algorithms, programming languages and environments, performance monitoring, computer architecture, interconnection networks, and, last but not least, solid-state technologies.
◼ The interconnection fabric is critical for the performance of parallel and distributed systems.
◼ Many cloud applications use a number of instances running concurrently.
◼ Transaction processing systems, including web-based services, represent a large class of applications hosted by computing clouds. Such applications run multiple instances of the service and require reliable, in-order delivery of messages.
The path to cloud computing
◼ Cloud computing is based on ideas and the experience accumulated in many years of research in parallel and distributed systems.
◼ Cloud applications are based on the client-server paradigm, with relatively simple software, a thin client, running on the user's machine, while the computations are carried out on the cloud.
◼ Concurrency is important; many cloud applications are data-intensive and use a number of instances which run concurrently.
◼ Checkpoint-restart procedures are used, as many cloud computations run for extended periods of time on multiple servers. Checkpoints are taken periodically in anticipation of the need to restart a process when one or more systems fail.
◼ Communication is at the heart of cloud computing. Communication protocols which support coordination of distributed processes travel through noisy and unreliable communication channels which may lose messages or deliver duplicate, distorted, or out-of-order messages.
2. Parallel processing and distributed computing
◼ Parallel processing and distributed computing allow us to solve large problems by splitting them into smaller ones and solving them concurrently.
◼ Parallel processing refers to concurrent execution on a system with a large number of processors.
◼ Distributed computing means concurrent execution on multiple systems, often located at different sites.
◼ Distributed computing is efficient only for coarse-grained parallel applications, where concurrent activities seldom communicate with one another.
◼ Metrics such as execution time, speedup, and processor utilization characterize how efficiently a parallel or distributed system can process a particular application.
Data parallelism versus task parallelism
◼ Data parallelism → the input data of an application is distributed to multiple processors/cores running concurrently (see the sketch below).
  SPMD – Same Program Multiple Data; example: converting a large number of images from one format to another. Given 10^9 images, batches of 10^6 images can be processed concurrently by 10^3 processors for a speedup of 1,000.
  Embarrassingly parallel applications.
  MapReduce.
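A minimal sketch of data parallelism under these assumptions: the convert() function is a hypothetical stand-in for the real image conversion, and Python's multiprocessing.Pool distributes the batches across worker processes.

    # Data parallelism sketch: the same program processes disjoint
    # batches of the input concurrently. convert() is a hypothetical
    # stand-in for the actual image-format conversion.
    from multiprocessing import Pool

    def convert(image):
        return image.lower()          # placeholder for real conversion work

    def convert_batch(batch):
        return [convert(img) for img in batch]

    if __name__ == "__main__":
        images = [f"IMG_{i}" for i in range(10_000)]   # stand-in for 10^9 images
        batches = [images[i:i + 100] for i in range(0, len(images), 100)]
        with Pool(processes=8) as pool:                # stand-in for 10^3 processors
            results = pool.map(convert_batch, batches) # batches run concurrently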
◼ Task parallelism → tasks are distributed to multiple processors; example – data from different sensors providing images, sounds, and other data can be processed concurrently by different programs, each one tasked to identify specific anomalies.
Coarse-grained and fine-grained parallelism
◼ Coarse-grained parallelism → large blocks of code are executed before concurrent threads communicate.
◼ Fine-grained parallelism → short bursts of computation alternate with relatively long periods when a thread waits for messages from other threads.
(Figure: message exchanges among threads i, j, and k.)
Control flow and data-flow architecture
◼ Control flow architecture:
  Pioneered by John von Neumann.
  The dominant processor architecture.
  The implementation of processor control flow is straightforward: a program counter determines the next instruction to be loaded into the instruction register and then executed.
  The execution is strictly sequential, until a branch is encountered.
◼ Data flow architecture – operations are carried out at the time when their input becomes available.
  Widely used by network routers, digital signal processors, and other special-purpose systems.
  The lack of locality, the inefficient use of cache, and ineffective pipelining are most likely some of the reasons why data flow general-purpose processors are not as popular as control flow processors.
Control flow and data-flow architecture (Cont’d)
◼ Data flow is emulated by von Neumann processors for dynamic instruction scheduling.
◼ Reservation stations hold instructions waiting for their input to become available, and register renaming supports out-of-order instruction execution.
◼ Some of the systems discussed in the class apply the data flow model for task scheduling on large clusters.
(Figure: data flow graphs for an "if then else" construct and a "while loop" – instructions I1, I2, ... and control nodes C1, C2, ... fire when their input data and control tokens become available.)
3. Parallel computer architecture
◼ Bit-level parallelism. The number of bits processed per clock cycle, often called the word size, has increased gradually from 4-bit, to 8-bit, 16-bit, 32-bit, and 64-bit. This has reduced the number of instructions required to process larger operands and allowed a significant performance improvement. During this evolutionary process the number of address bits has also increased, allowing instructions to reference a larger address space.
◼ Instruction-level parallelism. Today's computers use multi-stage processing pipelines to speed up execution.
◼ Data parallelism or loop parallelism. Program loops can be processed in parallel.
◼ Task parallelism. The problem can be decomposed into tasks that can be carried out concurrently. For example, SPMD. Note that data dependencies cause different flows of control in individual tasks.
Classification of computer architectures
◼ Michael Flynn's classification of computer architectures is based on the number of concurrent control/instruction and data streams:
  SISD (Single Instruction, Single Data) – scalar architecture with one processor/core.
  SIMD (Single Instruction, Multiple Data) – supports vector processing. When a SIMD instruction is issued, the operations on individual vector components are carried out concurrently.
  MIMD (Multiple Instructions, Multiple Data) – a system with several processors and/or cores that function asynchronously and independently; at any time, different processors/cores may be executing different instructions on different data. We distinguish several types of systems:
    ◼ Uniform Memory Access (UMA).
    ◼ Cache Only Memory Access (COMA).
    ◼ Non-Uniform Memory Access (NUMA).
Pipelining
◼ Pipelining – splitting an instruction into a sequence of steps that can be executed concurrently by different circuitry on the chip.
◼ A basic pipeline of a RISC (Reduced Instruction Set Computing) architecture consists of five stages.
◼ The number of pipeline stages in different RISC processors varies. For example, the ARM7 and earlier implementations of ARM processors have a three-stage pipeline: fetch, decode, and execute. Higher performance designs, such as the ARM9, have deeper pipelines: the Cortex-A8 pipeline has thirteen stages.
◼ A superscalar processor executes more than one instruction per clock cycle.
◼ A Complex Instruction Set Computer (CISC) architecture could have a much larger number of pipeline stages; e.g., an Intel Pentium 4 processor has a 35-stage pipeline.
Hazards – side effects of pipelining
◼ Instances when unchecked pipelining would produce incorrect results.
◼ Data hazards (a sketch of a RAW dependence appears after the bullets below):
  Read after Write (RAW) → occurs when an instruction operates with data in a register that is being modified by a previous instruction.
  Write after Read (WAR) → occurs when an instruction modifies data in a register being used by a previous instruction.
  Write after Write (WAW) → occurs when two instructions in a sequence attempt to modify the data in the same register and the sequential execution order is violated.
◼ Structural hazards – the circuits implementing different hardware functions are needed by two or more instructions at the same time. E.g., a single memory unit is accessed during the instruction fetch stage, where the instruction is retrieved from memory, and during the memory stage, where data is written and/or read from memory.
◼ Control hazards – due to conditional branches. The processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline, normally during the fetch stage.
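A minimal illustration of a RAW dependence, written as straight-line Python; the variables r1 through r5 stand in for hypothetical registers.

    # RAW hazard sketch: I2 reads r1, which I1 is still computing.
    # A pipeline that lets I2 read r1 before I1 writes it back would
    # use a stale value; stalls or forwarding prevent this.
    r2, r3, r5 = 10, 20, 5
    r1 = r2 + r3      # I1: writes r1
    r4 = r1 - r5      # I2: reads r1 (RAW dependence on I1)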
Pipeline requirements, stalls, and scheduling
◼ The architecture should:
  Preserve exception behavior: any change in instruction order must not change the order in which exceptions are raised, to ensure program correctness.
  Preserve instruction flow: the flow of data between instructions that produce results and consume them.
◼ Pipeline stall – a delay in the execution of an instruction in an instruction pipeline in order to resolve a hazard. Such stalls can drastically affect performance.
◼ Pipeline scheduling – separates a dependent instruction from the source instruction by the pipeline latency of the source instruction. Its effect is to reduce the number of stalls.
Dynamic instruction scheduling
◼ Reduces the number of pipeline stalls, but adds to circuit complexity.
◼ Register renaming is sometimes supported by reservation stations.
◼ A reservation station fetches and buffers an operand as soon as it becomes available. A pending instruction designates the reservation station it will send its output to.
◼ A reservation station stores the following information:
  the instruction;
  buffered operand values (when available); and
  the id of the reservation station providing the operand values.
4. SIMD architectures
◼ Flavors of SIMD architectures:
  Vector architectures.
  SIMD extensions for mobile systems and multimedia applications.
  Graphics Processing Units (GPUs).
◼ Advantages:
  Exploit a significant level of data-parallelism. Enterprise applications in data mining and multimedia, and applications in computational science and engineering using linear algebra, benefit the most.
  Allow mobile devices to exploit parallelism for media-oriented image and sound processing using SIMD extensions of the ISA.
  Are more energy efficient than MIMD architectures. Only one instruction is fetched for multiple data operations, rather than fetching one instruction per operation.
  Higher potential speedup than MIMD architectures. SIMD potential speedup could be twice as large as that of MIMD.
  Allow developers to continue thinking sequentially.
Vector architectures
◼ Vector registers hold as many as 64 or 128 vector elements.
◼ Vector functional units carry out arithmetic and logic operations using data from vector registers.
◼ Vector load-store units are pipelined, hide memory latency, and leverage memory bandwidth. The memory system spreads accesses to multiple memory banks which can be addressed independently.
◼ Vector length registers support handling of vectors whose length is not a multiple of the length of the physical vector registers.
◼ Vector mask registers disable/select vector elements and are used by conditional statements.
◼ Gather operations take an index vector and fetch vector elements at the addresses given by adding a base address to the offsets given by the index vector. A dense vector is loaded in a vector register (see the sketch below).
◼ Scatter operations are the reverse of gather.
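A minimal sketch of gather and scatter using NumPy fancy indexing; the array contents, base address, and offsets are illustrative only.

    import numpy as np

    memory = np.arange(100, 200)          # stand-in for a memory region
    base = 10
    index_vector = np.array([0, 7, 3, 42])

    # Gather: fetch the elements at base + offsets into a dense vector.
    dense = memory[base + index_vector]

    # Scatter: the reverse operation, writing a dense vector back
    # to the same scattered addresses.
    memory[base + index_vector] = dense * 2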
SIMD extensions for multimedia applications
◼ Augment an existing instruction set of a scalar processor with a set of vector instructions.
◼ Advantages over vector architectures:
  Low cost to add circuitry to an existing ALU.
  Little extra state is added; thus, the extensions have little impact on context-switching.
  Need little extra memory bandwidth.
  Do not pose additional complications to virtual memory management for cross-page access and page-fault handling.
◼ Multimedia applications often run on mobile devices and operate on narrower data types than the native word size. E.g., graphics applications use 3 x 8 bits for colors and one 8-bit value for transparency; audio applications use 8-, 16-, or 24-bit samples.
Extensions to Intel Architecture
◼ MMX, introduced by Intel in 1996, supports eight 8-bit or four 16-bit integer operations.
◼ SSE (1999 – 2004). The SSEs operate on eight 8-bit integers, or four 32-bit or two 64-bit integer or floating-point operations.
◼ AVX (Advanced Vector Extensions), introduced in 2010, operates on four 64-bit integer or floating-point operations.
◼ AVX family of Intel processors: Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, and its follower, Kaby Lake.
Roofline performance model
(Figure: the roofline performance model.)
5. Graphics processing units (GPUs)
◼ Real-time graphics with vectors of two, three, or four dimensions led to the development of GPUs. GPUs are also used in embedded systems, mobile phones, personal computers, workstations, and game consoles.
◼ GPU processing is based on a heterogeneous execution model, with the CPU acting as the host connected to a GPU, called the device.
◼ Steps of a typical execution:
  The CPU copies the input data from main memory to the GPU memory.
  The CPU instructs the GPU to start processing using the executable in the GPU memory.
  The GPU uses multiple cores to execute the parallel code.
  When done, the GPU copies the result back to main memory.
GPU organization and threads
◼ A GPU has multiple multithreaded SIMD processors. The current generation of GPUs, e.g., NVIDIA's Fermi, have 7 to 15 multithreaded SIMD processors.
◼ A multithreaded SIMD processor has several wide and shallow SIMD lanes.
◼ For example, an NVIDIA GPU has 32,768 registers divided among the 16 physical SIMD lanes; each lane has 2,048 registers.
◼ Single-Instruction-Multiple-Thread (SIMT) is the GPU programming model. All forms of GPU parallelism are unified as CUDA threads in the SIMT model.
◼ CUDA is a C-like programming language developed by NVIDIA.
◼ A thread is associated with each data element.
Grid, blocks, and threads
◼ Example (see the sketch below):
  A grid with 8,192 components of vector A.
  16 blocks with 512 vector components each.
  A block has 16 SIMD threads.
  A SIMD thread operates on 32 components of vector A.
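A minimal sketch of how a vector element maps to a (block, SIMD thread, lane) triple in this example; the constants are those of the example and the mapping mirrors the figure that follows.

    # 8192-element vector A: 16 blocks x 16 SIMD threads x 32 lanes.
    BLOCKS, THREADS_PER_BLOCK, LANES = 16, 16, 32

    def element_index(block, thread, lane):
        """Global index of the vector component handled by this lane."""
        return (block * THREADS_PER_BLOCK + thread) * LANES + lane

    assert element_index(0, 0, 0) == 0        # first element, Block 0
    assert element_index(0, 1, 1) == 33       # A[33], SIMD Thread 1
    assert element_index(15, 15, 31) == 8191  # last element, Block 15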
(Figure: the grid computes A[i] = B[i] x C[i] for the 8,192 vector components. Thread Block 0 contains SIMD Threads 0 through 15: SIMD Thread 0 handles A[0]..A[31], SIMD Thread 1 handles A[32]..A[63], and SIMD Thread 15 handles A[480]..A[511]. Thread Block 15 covers A[7680]..A[8191] in the same pattern.)
GPU scheduling
◼ Multi-level scheduling:
  Thread block scheduler → assigns thread blocks to multithreaded SIMD processors.
  Thread scheduler → runs on each multithreaded SIMD processor and assigns threads to the SIMD lanes.
6. Speed-up and Amdahl’s Law
◼ Parallel hardware and software systems allow us to:
  Solve problems demanding resources not available on a single system.
  Reduce the time required to obtain a solution.
◼ The speed-up S measures the effectiveness of parallelization:
  S(N) = T(1) / T(N)
  T(1) → the execution time of the sequential computation.
  T(N) → the execution time when N parallel computations are executed.
◼ Amdahl's Law: if α is the fraction of running time a sequential program spends on non-parallelizable segments of the computation, then the maximum speed-up is S = 1/α.
◼ Gustafson's Law: the scaled speed-up with N parallel processes is S(N) = N – α(N – 1).
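A minimal sketch evaluating both laws. The full Amdahl expression for N processors, S(N) = 1 / (α + (1 − α)/N), is assumed here; it reduces to 1/α as N grows.

    def amdahl(alpha, n):
        """Speed-up with N processors; tends to 1/alpha as N grows."""
        return 1.0 / (alpha + (1.0 - alpha) / n)

    def gustafson(alpha, n):
        """Scaled speed-up with N parallel processes."""
        return n - alpha * (n - 1)

    alpha = 0.025                      # 2.5% of the work is sequential
    for n in (16, 256, 1024):
        print(n, round(amdahl(alpha, n), 1), round(gustafson(alpha, n), 1))
    # Amdahl's speed-up saturates near 1/alpha = 40; Gustafson's keeps growing.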
7. Multicore Processor Speedup
◼ We now live in the age of multicore processors, brought about by the limitations imposed on solid-state devices by the laws of physics.
◼ There are alternative designs of multicore processors:
  The cores can be identical or different from one another; there could be a few powerful cores or a larger number of less powerful cores.
  More cores lead to higher speedup of highly parallel applications; a powerful core favors highly sequential applications.
◼ The design space of multicore processors should be driven by cost-performance considerations.
◼ The cost of a multicore processor depends on the number of cores and the complexity, ergo the power, of individual cores.
◼ Cost-effective design → the speedup achieved exceeds the cost up.
◼ Cost up → multicore processor cost divided by the single-core processor cost.
Quantifying multicore design alternatives
◼ Basic Core Equivalent (BCE) → quantifies the resources of individual cores; f denotes the fraction of an application that is parallelizable.
◼ A symmetric core processor with a budget of n BCEs may have n/r cores of r BCEs each. Alternatively, the n BCEs may be distributed unevenly among the cores of an asymmetric core processor.
◼ The speedup of asymmetric multicore processors is always larger, and could be significantly larger, than the speedup of symmetric core processors. For example, the largest speedup when f = 0.975 and n = 1024 is achieved for a configuration with one 345-BCE core and 679 1-BCE cores.
◼ Increasing the power of individual cores is beneficial even for symmetric core processors. For example, when f = 0.975 and n = 256 the maximum speedup occurs for 7-BCE cores.
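A minimal sketch of these calculations under the Hill-Marty model assumptions, which the slide does not spell out: a chip budget of n BCEs, and a core built from r BCEs delivering performance perf(r) = sqrt(r).

    import math

    def perf(r):
        return math.sqrt(r)            # Hill-Marty assumption: perf(r) = sqrt(r)

    def symmetric_speedup(f, n, r):
        # n/r cores of r BCEs each: the sequential part runs on one core,
        # the parallel part runs on all n/r cores.
        return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

    def asymmetric_speedup(f, n, r):
        # One r-BCE core plus (n - r) 1-BCE cores; the big core also
        # contributes to the parallel phase.
        return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

    f, n = 0.975, 1024
    best_r = max(range(1, n + 1), key=lambda r: asymmetric_speedup(f, n, r))
    print(best_r, round(asymmetric_speedup(f, n, best_r), 1))   # ~345-BCE big core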
16-BCE chip. Symmetric core processor with two different
configurations: (Left) sixteen 1-BCE cores; (Right) one 16-BCE core.
16-BCE chip. (Left) Symmetric core processor with four 4-BCE cores; (Right) Asymmetric core processor with one 4-BCE core and twelve 1-BCE cores.
8. Distributed systems
◼ A collection of autonomous computers connected through a network and operating under the control of distribution software.
◼ Middleware → software enabling individual systems to coordinate their activities and to share system resources.
◼ Main characteristics of distributed systems:
  The users perceive the system as a single, integrated computing facility.
  The components are autonomous.
  Scheduling and other resource management and security policies are implemented by each system.
  There are multiple points of control and multiple points of failure.
  The resources may not be accessible at all times.
  Can be scaled by adding additional resources.
  Can be designed to maintain availability even at low levels of hardware/software/network reliability.
Distributed systems – desirable properties
◼ Access transparency – local and remote information objects are accessed using identical operations.
◼ Location transparency – information objects are accessed without knowledge of their location.
◼ Concurrency transparency – several processes run concurrently using shared information objects without interference among them.
◼ Replication transparency – multiple instances of information objects increase reliability without the knowledge of users or applications.
◼ Failure transparency – the concealment of faults.
◼ Migration transparency – information objects in the system are moved without affecting the operation performed on them.
◼ Performance transparency – the system can be reconfigured based on the load and quality-of-service requirements.
◼ Scaling transparency – the system and the applications can scale without a change in the system structure and without affecting the applications.
9. Modularity
◼ Modularity, layering, and hierarchy are means to cope with the complexity of distributed application software.
◼ Software modularity, the separation of a function into independent, interchangeable modules, requires well-defined interfaces specifying the elements provided and required by a module.
◼ Requirement for modularity → clearly define the interfaces between modules and enable the modules to work together.
◼ The steps involved in the transfer of the flow of control between the caller and the callee:
  The caller saves its state, including the registers, the arguments, and the return address, on the stack.
  The callee loads the arguments from the stack, carries out the calculations, and then transfers control back to the caller.
  The caller adjusts the stack, restores its registers, and continues its processing.
Modular software design principles
◼ Information hiding → the user of a module does not need to know anything about the internal mechanism of the module to make effective use of it.
◼ Invariant behavior → the functional behavior of a module must be independent of the site or context from which it is invoked.
◼ Data generality → the interface to a module must be capable of passing any data object an application may require.
◼ Secure arguments → the interface to a module must not allow side-effects on arguments supplied to the interface.
◼ Recursive construction → a program constructed from modules must be usable as a component in building larger programs/modules.
◼ System resource management → resource management for program modules must be performed by the computer system and not by individual program modules.
Soft modularity
◼ Soft modularity → divide a program into modules which call each other and communicate using shared memory or follow the procedure-call convention.
  Hides module implementation details.
  Once the interfaces of the modules are defined, the modules can be developed independently.
  A module can be replaced with a more elaborate or a more efficient one, as long as its interfaces with the other modules are not changed.
  The modules can be written using different programming languages and can be tested independently.
◼ Challenges:
  Increases the difficulty of debugging; for example, a call to a module with an infinite loop will never return.
  There could be naming conflicts and wrong context specifications.
  The caller and the callee are in the same address space and may misuse the stack; e.g., the callee may use registers that the caller has not saved on the stack, and so on.
Enforced modularity; the client-server paradigm
◼ Modules are forced to interact only by sending and receiving messages.
◼ A more robust design:
  Clients and servers are independent modules and may fail separately.
  Does not allow errors to propagate.
◼ Servers are stateless; they do not have to maintain state information. A server may fail and then come back up without the clients being affected, or even noticing the failure of the server.
◼ Enforced modularity makes an attack less likely because it is difficult for an intruder to guess the format of the messages or the sequence numbers of segments when messages are transported by TCP.
◼ Often based on RPCs.
Remote procedure calls (RPCs)
◼ Introduced in the early 1970s by Bruce Nelson and used for the first time at PARC.
  Reduce fate sharing between the caller and the callee.
  RPCs take longer than local calls due to communication delays.
◼ RPC semantics (see the sketch after this list):
  At least once → a message is resent several times and an answer is expected. The server may end up executing a request more than once, but an answer may never be received. Suitable for operations free of side-effects.
  At most once → a message is acted upon at most once. The sender sets up a timeout for receiving the response; when the timeout expires, an error code is delivered to the caller. Requires the sender to keep a history of the time-stamps of all messages, as messages may arrive out of order. Suitable for operations which have side effects.
  Exactly once → implements at-most-once semantics and requests an acknowledgment from the server.
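A minimal sketch of at-least-once semantics: the hypothetical send_request() stands in for an unreliable transport that may lose messages, so the client simply retries, which is safe only for side-effect-free operations. Note the server may execute the request several times, and the client may still time out.

    import random

    def send_request(payload):
        """Hypothetical unreliable transport: loses about half the messages."""
        if random.random() < 0.5:
            return None                      # request or reply lost
        return f"reply to {payload}"

    def call_at_least_once(payload, retries=10):
        # At-least-once: resend until a reply arrives or we give up.
        for _ in range(retries):
            reply = send_request(payload)
            if reply is not None:
                return reply
        raise TimeoutError("no reply; the request may still have executed")

    print(call_at_least_once("read block 42"))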
Client-server communication for the World Wide Web.
◼ The three-way handshake involves the first three messages exchanged between the client browser and the server.
◼ Once the TCP connection is established, the HTTP server takes its time to construct the page to respond to the first request; to satisfy the second request, the HTTP server must retrieve an image from the disk.
◼ Response time components:
  1. RTT (round-trip time).
  2. Server residence time.
  3. Data transmission time.
10. Layering and hierarchy
◼ Layering demands modularity → each layer fulfills a well-defined function.
◼ Communication patterns are more restrictive: a layer is expected to communicate only with the adjacent layers. This restriction reduces the system complexity and makes it easier to understand its behavior.
◼ Strictly enforced layering can prevent optimizations. For example, cross-layer communication in networking was proposed to allow wireless applications to take advantage of information available at the Media Access Control (MAC) sub-layer of the data link layer.
◼ There are systems where it is difficult to envision a layered organization because of the complexity of the interaction between the individual modules.
◼ Could a layered cloud architecture be designed that has practical implications for the future development of computing clouds?
Communication protocol layering
◼ Internet protocol stack:
  Physical layer → accommodates diverse physical communication channels carrying electromagnetic, optical, or acoustic signals.
  Data link layer → addresses the problem of transporting bits, not signals, between two systems directly connected to one another by a communication channel.
  Network layer → packets carrying bits have to traverse a chain of intermediate nodes from a source to the destination; the concern is how to forward the packets from one intermediate node to the next.
  Transport layer → the source and the recipient of packets are outside the network; this layer guarantees delivery from source to destination.
  Application layer → data sent and received by the hosts at the network periphery has a meaning only in the context of an application.
11. Virtualization; layering and virtualization
◼ Virtualization abstracts the underlying physical resources of a system and simplifies its use, isolates users from one another, and supports replication which increases system elasticity and reliability.
◼ Virtualization simulates the interface to a physical object:
  Multiplexing → create multiple virtual objects from one instance of a physical object. E.g., a processor is multiplexed among a number of processes or threads.
  Aggregation → create one virtual object from multiple physical objects. E.g., a number of physical disks are aggregated into a RAID disk.
  Emulation → construct a virtual object from a different type of physical object. E.g., a physical disk emulates Random Access Memory.
  Multiplexing and emulation → E.g., virtual memory with paging multiplexes real memory and disk, and a virtual address emulates a real address; the TCP protocol emulates a reliable bit pipe and multiplexes a physical communication channel and a processor.
Virtualization and cloud computing
◼ Virtualization is a critical aspect of cloud computing, equally important for providers and consumers of cloud services, for several reasons:
  System security → it allows isolation of services running on the same hardware.
  Performance isolation → allows developers to optimize applications and cloud service providers to exploit multi-tenancy.
  Performance and reliability → it allows applications to migrate from one platform to another.
  Facilitates development and management of services offered by a provider.
◼ A hypervisor runs on the physical hardware and exports hardware-level abstractions to one or more guest operating systems.
◼ A guest OS interacts with the virtual hardware in the same manner it would interact with the physical hardware, but under the watchful eye of the hypervisor, which traps all privileged operations and mediates the interactions of the guest OS with the hardware.
12. Peer-to-peer systems (P2P)
◼ P2P represents a significant departure from the client-server model and has several desirable properties:
  Requires a minimally dedicated infrastructure, as resources are contributed by the participating systems.
  Highly decentralized.
  Scalable; individual nodes are not required to be aware of the global state.
  Resilient to faults and attacks, as few of their elements are critical for the delivery of service and the abundance of resources can support a high degree of replication.
  Individual nodes do not require excessive network bandwidth as servers used in the client-server model do.
  The systems are shielded from censorship due to the dynamic and often unstructured system architecture.
◼ Undesirable properties:
  Decentralization raises the question whether P2P systems can be managed effectively and provide the security required by various applications.
  Shielding from censorship makes them a fertile ground for illegal activities.
Resource sharing in P2P systems
◼ This distributed computing model promotes low-cost access to storage and CPU cycles provided by participant systems.
  Resources are located in different administrative domains.
  P2P systems are self-organizing and decentralized, while the servers in a cloud are in a single administrative domain and have central management.
◼ Napster, a music-sharing system developed in the late 1990s, gave participants access to storage distributed over the network.
◼ The first volunteer-based scientific computing project, SETI@home, used free cycles of participating systems to carry out compute-intensive tasks.
Organization of P2P systems
◼ Regardless of the architecture, P2P systems are built around an overlay network, a virtual network superimposed over the real network. Each node maintains a table of overlay links connecting it with other nodes of this virtual network; each node is identified by its IP address. Two types of overlay networks, unstructured and structured, are used. Random walks starting from a few bootstrap nodes are usually used by systems desiring to join an unstructured overlay.
◼ Each node of a structured overlay has a unique key which determines its position in the structure; the keys are selected to guarantee a uniform distribution in a very large name space.
◼ Structured overlay networks use key-based routing (KBR); given a starting node v0 and a key k, the function KBR(v0, k) returns the path in the graph from v0 to the vertex with key k (a minimal sketch follows).
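A minimal sketch of key-based routing on a Chord-like identifier ring, an assumption the slide does not make explicit: each hop moves to the next node clockwise until the node responsible for the key is reached. Real systems use finger tables to route in O(log n) hops rather than this naive linear walk.

    def successor(key, ring):
        """First node key clockwise from key on the identifier circle."""
        for node in ring:
            if node >= key:
                return node
        return ring[0]                            # wrap around the ring

    def kbr(v0, key, ring):
        """Return the path from node v0 to the node responsible for key.
        ring: sorted list of node keys on a Chord-like identifier circle."""
        path, current = [v0], v0
        while current != successor(key, ring):
            current = successor(current + 1, ring)  # naive next hop: ring walk
            path.append(current)
        return path

    ring = [5, 18, 27, 40, 58]                    # node keys, name space 0..63
    print(kbr(5, 33, ring))                       # [5, 18, 27, 40]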
◼ Epidemic algorithms are often used by unstructured overlays to disseminate network topology.
Examples of P2P systems
◼ Skype, a voice-over-IP telephony service, allows close to 700 million registered users from many countries around the globe to communicate using a proprietary voice-over-IP protocol.
◼ Data streaming applications such as CoolStreaming.
◼ BBC's online video service.
◼ Content distribution networks such as CoDeeN.
◼ Volunteer computing applications based on the BOINC (Berkeley Open Infrastructure for Network Computing) platform.
Chapter 5. Cloud Access and Cloud Interconnection Networks
Contents
1. Clouds and networks.
2. Packet-switched networks.
3. The Internet.
4. Relations between Internet networks.
5. Transformation of the Internet.
6. Web access and TCP congestion.
7. Named data networks.
8. Interconnection networks for computer clouds.
9. Clos networks, Myrinet, InfiniBand, fat trees.
10. Storage area networks.
11. Data center networks.
12. Network management algorithms.
13. Content delivery networks.
14. Overlay and scale-free networks.
1. Clouds and networks
◼ Unquestionably, communication is at the heart of cloud computing.
  Interconnectivity supported by a continually evolving Internet made cloud computing feasible.
  A cloud is built around a high-performance interconnect; the servers of a cloud infrastructure communicate through high-bandwidth, low-latency networks.
◼ Cloud workloads fall into four broad categories based on their dominant resource needs: CPU-intensive, memory-intensive, I/O-intensive, and storage-intensive. While the first two benefit from, but do not require, high-performance networking, the last two do. Networking performance directly impacts the performance of I/O- and storage-intensive workloads.
◼ The designers of a cloud computing infrastructure are acutely aware that the communication bandwidth goes down and the communication latency increases the farther from the CPU the data travels.
2. Packet-switched networks
◼ A packet-switched network transports data units called packets through a maze of switches, where packets are queued and routed towards their destination.
◼ A datagram is a transfer unit in a packet-switched network. In addition to its payload, a datagram has a header containing control information necessary for its transport through the network.
◼ A packet-switched network consists of:
  The network core, made up of routers and control systems interconnected by very high bandwidth communication channels.
  The network edge, where the end-user systems/hosts reside.
◼ A switch is a device connecting together other devices. A switch manages the flow of data across a network by transmitting a received network packet to the devices the packet is intended for.
◼ A networked device connected to a switch can be identified by its network address, allowing the switch to regulate the flow of traffic.
Basic concepts
◼ Packet → consists of a header, which contains control information necessary for its transport through the network, and a payload, or data.
◼ Packets are subject to variable delay, errors, and loss.
◼ A network architecture describes the protocol stack.
◼ Protocol → a discipline for communication; it specifies the actions taken by the sender and the receiver of a data unit.
◼ Host → a system located at the network edge capable of initiating and receiving communication, e.g., a computer, mobile device, or sensor.
3. The Internet
◼ A collection of separate and distinct networks.
◼ All networks operate under a common framework consisting of:
  globally unique IP addressing.
  IP routing.
  the global Border Gateway Protocol (BGP).
◼ IP only provides best-effort delivery – any router on the path from the source to the destination may drop a packet if it is overloaded.
◼ The Internet uses two transport protocols:
  UDP (User Datagram Protocol) – a connectionless datagram protocol. The UDP transport protocol assumes that error checking and error correction are either not necessary or performed by the application. Datagrams may arrive out of order, duplicated, or may not arrive at all.
  TCP (Transmission Control Protocol) – a connection-oriented protocol. TCP provides reliable, ordered delivery of a stream of bytes from an application on one system to its peer on the destination system.
The hourglass Internet architecture
(Figure: the hourglass Internet architecture – many applications at the top (teleconferencing, videoconferencing, RealAudio, Telnet, WWW, Email, FTP), TCP and UDP at the transport layer, IP as the narrow waist of the network layer, and many physical and data link technologies below (ATM, dial-up modems, LANs, wireless, direct cable broadcast, frame relay, satellite).)
The Internet protocol stack
(Figure: the Internet protocol stack on two hosts and intermediate routers – the application layer exchanges messages, the transport layer segments, the network layer packets, and the data link layer frames; the physical layer carries streams of bits encoded as electrical, optical, or electromagnetic signals. The routers implement only the network, data link, and physical layers.)
IPv4 versus IPv6
◼ Why is migration to IPv6 necessary?
  IPv4 has an addressing capability of 2^32, or approximately 4.3 billion addresses, a number that proved to be insufficient.
  IPv6 has an addressing capability of 2^128, or 3.4 x 10^38 addresses.
◼ Major differences between IPv4 and IPv6:
  IPv6 supports new multicast solutions but not traditional IP broadcast.
  IPv6 hosts can configure themselves automatically when connected to a routed IPv6 network using the Internet Control Message Protocol version 6.
  Mandatory support for network security. Internet Protocol Security (IPsec) is an integral part of the base protocol suite in IPv6.
◼ Migration from IPv4 to IPv6 is a very challenging and costly proposition.
IP and MAC addresses, ports and sockets
◼ IP address → logical address assigned dynamically by a DHCP server. A host may have multiple IP addresses, as it may be connected to more than one network.
◼ MAC address → unique physical address of each network interface.
◼ Network interface → hardware connecting a host with a network.
◼ Port → software abstraction for message delivery to an application.
◼ Socket → software abstraction allowing an application to send and receive messages at a given port; implemented as two queues, one for incoming and the other for outgoing messages. (A minimal sketch follows.)
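A minimal sketch of the port/socket abstraction using Python's standard socket API; the loopback address and port number are arbitrary choices for illustration.

    import socket

    # A UDP socket bound to port 50007: messages addressed to this
    # (IP address, port) pair are delivered to this application.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 50007))

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"hello", ("127.0.0.1", 50007))

    data, addr = receiver.recvfrom(1024)   # read from the incoming queue
    print(data, addr)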
Sockets and ports
(Figure: a process on a host attaches to a port through a network interface and communicates across routers and networks; the IP address consists of a NetworkId and a HostId.)
4. The relations between Internet networks
◼ Three types of relations:
  Peering – two networks exchange traffic between each other's customers freely.
  Transit – a network pays another one to access the Internet.
  Customer – a network is paid to allow Internet access.
◼ The networks are commonly classified as:
  Tier 1 – can reach every other network on the Internet without purchasing IP transit or paying settlements.
  Tier 2 – an Internet service provider who engages in the practice of peering with other networks, but who still purchases IP transit to reach some portion of the Internet; the common providers on the Internet.
  Tier 3 – purchases transit rights from other networks (typically Tier 2 networks) to reach the Internet.
(Figure: the Internet core of Tier 1 networks, with Tier 2 networks connected through POPs and an IXP, Tier 3 networks, and Internet users below.)
The relation of Internet networks based on transit and paying settlements. There are three classes of networks, Tier 1, 2, and 3; an IXP is a physical infrastructure allowing ISPs to exchange Internet traffic.
5. The transformation of the Internet
◼ Web applications, cloud computing, and content-delivery networks are reshaping the definition of a network.
◼ Data streaming consumes an increasingly larger fraction of the available bandwidth as high-definition TV sets become less expensive and content providers, such as Netflix and Hulu, offer customers services that require a significant increase of the network bandwidth.
◼ The "last mile" – the link connecting the home to the Internet Service Provider (ISP) network – is the bottleneck.
◼ Google has initiated the Google Fiber Project, which aims to provide 1 Gb/s access speed to individual households through FTTH.
(Figure (a): the textbook Internet prior to 2007 – the global core consists of Tier 1 national backbone operators (Sprint, MCI, UUnet, PSINet) interconnected through NAPs, with regional access providers, local access providers (ISP 1 through ISP n), and customer IP networks below.)
(Figure (b): the 2009 Internet, reflecting the commoditization of IP hosting and of content-delivery networks (CDNs) – a global Internet core of global transit/national backbones and "hyper giants" (large content, consumer, hosting, and CDN providers) interconnected through IXPs, with ISPs and customer IP networks below.)
The transformation of the Internet. The traffic carried by Tier 3 networks increased from 5.8% in 2007 to 9.4% in 2009; Google applications accounted for 5.2% of the traffic in 2009.
(Figure: the average download speed for broadband access advertised by several countries.)
6. Web access and TCP congestion
◼ HTTP, the application protocol for Web access, uses the TCP transport protocol.
◼ TCP supports mechanisms to avoid congestion and limit the amount of data transported over the Internet.
◼ Web access requires the transfer of large amounts of data, as we can see in measurements reported by Google.
TCP congestion control
◼ Algorithms to control congestion include Tahoe, an algorithm based on: (1) slow start, (2) congestion avoidance, and (3) fast retransmit.
◼ Slow start means that (a minimal sketch follows):
  (a) the sender starts with a window of two times the MSS (Maximum Segment Size).
  (b) for every packet acknowledged, the congestion window increases by 1 MSS, so that the congestion window effectively doubles for every RTT (Round-Trip Time).
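A minimal sketch of the slow-start window growth and the number of RTTs needed to transfer one page. The 384 KB page size is the figure quoted on the next slide; an MSS of 1,460 bytes is an assumption, not stated on the slide.

    MSS = 1460                      # assumed maximum segment size, in bytes
    page = 384 * 1024               # average page size quoted on the next slide

    cwnd, sent, rtts = 2 * MSS, 0, 0
    while sent < page:
        sent += cwnd                # one window of data per round trip
        cwnd *= 2                   # every segment ACKed grows cwnd by 1 MSS,
        rtts += 1                   # doubling the window each RTT
    print(rtts)                     # RTTs spent in slow start for one page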
◼ To overcome the limitations of slow start, strategies have been developed to reduce the time to download data over the Internet. For example:
  Firefox 3 and Google Chrome open up to six TCP connections per domain.
  Internet Explorer 8 opens 180 connections.
TCP congestion control (Cont'd)
◼ The strategies used by browsers circumvent the mechanisms for congestion control and incur a considerable overhead.
◼ The TCP latency is dominated by the number of RTTs during the slow start phase.
◼ Given that the average page size is 384 KB, a single TCP connection requires multiple RTTs to download a single page.
◼ It is argued that a better solution is to increase the initial congestion window of TCP. The effects of this solution:
  It ensures fairness between short-lived transactions, which are a majority of Internet transfers, and long-lived transactions, which transfer very large amounts of data, e.g., audio and video streaming.
  It allows faster recovery after losses through Fast Retransmission.
7. Named data networks (NDN)
◼ The Internet is a network of communication networks:
  Communication units, the packets, only name the communication end points.
  Data in a communication network is tied to a particular host, and this makes data replication and migration difficult.
◼ The Internet is now mostly used as a distribution network by applications in digital media, electronic commerce, Big Data analytics, and so on.
◼ NDN communication is content-centric:
  Named objects are requested by an agent and, once a site holding the object is located, the network transports it to the site that requested the object.
  The end user in a distribution network is oblivious to the location of the data and is only interested in the content.
(Figure, Left: the Internet hourglass architecture – regardless of application, transport protocol, and physical network, all packets are routed from the source to the destination using the IP protocol and the IP address of the destination. Right: the NDN hourglass architecture – it parallels the one of the Internet and separates the lower layers of the protocol stack from the upper ones; thus, the naming of the data can evolve independently from networking.)
NDN versus Internet
◼ NDN networking service semantics → fetch a data chunk identified by name.
◼ Internet semantics → deliver a packet to a given network address through an end-to-end channel identified by the source and the destination IP addresses.
◼ NDN packet delivery is driven by the data consumers.
◼ Interest packet fields: Name; Selectors (order preference, publisher filter, exclude filter, etc.); Nonce; Guiders (scope, interest lifetime, etc.).
◼ Data packet fields: Name; Meta information (content type, freshness period, etc.); Content; Signature (signature type, key locator, etc.).
◼ NDN communication is initiated by an agent generating an Interest packet containing the name of the data.
◼ Once the Interest packet reaches a network host which has a copy of the data item, a Data packet containing the name, the contents of the data, and the signature is generated.
◼ The signature consists of the producer's key.
◼ The Data packet follows the route traced by the Interest packet and is delivered to the data consumer agent.
NDN routing data structures
◼ Content Store → a local cache for Data packets that previously crossed the router. When an Interest packet arrives, a search of the Content Store determines if matching data exists and, if so, the data is forwarded on the same router interface the Interest packet was received on. If not, the router uses a data structure, the Forwarding Information Base, to forward the packet.
◼ Forwarding Information Base (FIB) → its entries are populated by a name-prefix based procedure. The Forwarding Strategy retrieves the longest-prefix-matched entry from the FIB for an Interest packet.
◼ Pending Interest Table (PIT) → stores Interest packets the router has forwarded but that have not been satisfied yet. A PIT entry records the data name together with its incoming and outgoing router interface(s). When a Data packet arrives, the router finds the matching PIT entry and forwards the data to all downstream interfaces listed in that PIT entry, then removes the PIT entry and caches the Data packet in the Content Store. (A minimal sketch follows.)
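A minimal sketch of this forwarding logic, with plain dictionaries standing in for the three data structures; the content names and interface ids are illustrative only.

    content_store = {}        # name -> data (cache)
    pit = {}                  # name -> set of downstream interfaces
    fib = {"/videos": 2}      # name prefix -> outgoing interface

    def on_interest(name, in_iface):
        if name in content_store:                 # cache hit: answer directly
            return ("data", name, content_store[name], in_iface)
        pit.setdefault(name, set()).add(in_iface) # remember who asked
        prefix = max((p for p in fib if name.startswith(p)), key=len, default=None)
        return ("forward", fib[prefix]) if prefix else ("drop",)

    def on_data(name, data):
        ifaces = pit.pop(name, set())             # satisfy pending Interests
        content_store[name] = data                # cache for future requests
        return [("data", name, data, i) for i in ifaces]

    print(on_interest("/videos/cats.mp4", in_iface=7))  # forwarded via the FIB
    print(on_data("/videos/cats.mp4", b"..."))          # sent downstream to 7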
8. Interconnection networks for computer clouds
◼ While processor and memory technology have followed Moore's Law, interconnection networks have evolved at a slower pace.
◼ From 1997 to 2010 the speed of the ubiquitous Ethernet network increased from 1 to 100 Gbps. This increase is slower than Moore's Law for traffic, which predicted 1 Tbps Ethernet by 2013.
◼ Interconnection networks are a major factor in determining the overall performance and cost of the system.
Basic concepts
◼ A network consists of nodes and links or communication channels.
◼ An interconnection network can be:
  Non-blocking, if it is possible to connect any permutation of sources and destinations at any time.
  Blocking, if this requirement is not satisfied.
◼ Switches and communication channels are the elements of the interconnection fabric.
  Switches → receive data packets, look inside each packet to identify the destination IP address, then use the routing tables to forward the packet to the next hop towards its final destination.
  An n-way switch → has n ports that can be connected to n communication links.
◼ The degree of a node is the number of links the node is connected to.
◼ Nodes → could be processors, memory units, or servers.
◼ Network interface of a node → hardware connecting it to the network.
Interconnection networks
◼ Interconnection networks are distinguished by:
  Topology – determined by the way nodes are interconnected.
  Routing – decides how a message gets from the source to the destination.
  Flow control – negotiates how the buffer space is allocated.
◼ The topology of an interconnection network determines:
  Network diameter – the average distance between all pairs of nodes.
  Bisection width – the minimum number of links cut to partition the network into two halves. When a network is partitioned into two networks of the same size, the bisection bandwidth measures the communication bandwidth between the two.
◼ There are two basic types of network topologies:
  Static networks, where there are direct connections between servers; for example: (a) bus; (b) hypercube; (c) 2D-mesh; (d) 2D-torus.
  Switched networks, where switches are used to interconnect the servers.
Static networks: (a) Bus; (b) Hypercube; (c) 2D-mesh; and (d) Torus.
Switched networks.
(Left) An 8 x 8 crossbar switch. 16 nodes are interconnected by 49 switches, represented by the dark circles.
(Right) An 8 x 8 Omega switch. 16 nodes are interconnected by 12 switches, represented by white rectangles.
Cloud interconnection networks
◼ While processor and memory technology have followed Moore's Law, interconnection networks have evolved at a slower pace and have become a major factor in determining the overall performance and cost of the system.
◼ The networking infrastructure is organized hierarchically: servers are packed into racks and interconnected by a top-of-the-rack router; the rack routers are connected to cluster routers, which in turn are interconnected by a local communication fabric.
◼ The networking infrastructure of a cloud must satisfy several requirements:
  Scalability.
  Low cost.
  Low latency.
  High bandwidth.
  Location-transparent communication between servers.
Location transparent communication
◼ Every server should be able to communicate with every other server with similar speed and latency.
◼ Applications need not be location aware.
◼ It also reduces the complexity of system management.
◼ In a hierarchical organization true location transparency is not feasible, and cost considerations ultimately decide the actual organization and performance of the communication fabric.
Store and forward and cut-through networks
◼ Store-and-forward networks → an entire packet is buffered and its checksum is verified in each node along the path from the source to the destination.
◼ Cut-through (wormhole) networks → a packet is forwarded to its next hop as soon as the header is received and decoded. This decreases the latency, but a packet can still experience blocking if the outgoing channel expected to carry it to the next node is in use. In this case the packet has to wait until the channel becomes free.
Routers and switches
◼ The cost of routers and the number of cables interconnecting the routers are major components of the cost of an interconnection network.
◼ Better performance and lower costs can only be achieved with innovative router architecture → wire density has scaled up at a slower rate than processor speed, and wire delay has remained constant.
◼ Router – a switch interconnecting several networks.
  Low-radix routers – have a small number of ports; divide the bandwidth into a smaller number of wide ports.
  High-radix routers – have a large number of ports; divide the bandwidth into a larger number of narrow ports.
◼ The number of intermediate routers in high-radix networks is reduced; lower latency and reduced power consumption.
◼ The pin bandwidth of the chips used for switching has increased by approximately an order of magnitude every 5 years during the past two decades.
Network characterization
◼ The diameter of a network is the average distance between all pairs of nodes; if a network is fully connected, its diameter is equal to one.
◼ When a network is partitioned into two networks of the same size, the bisection bandwidth measures the communication bandwidth between the two.
◼ The cost.
◼ The power consumption.
A side-by-side comparison of performance and cost figures of several
interconnection network topologies for 64 nodes.
9. Clos networks, InfiniBand, Myrinet, fat trees
◼ Clos network → a multistage nonblocking network with an odd number of stages. It consists of two butterfly networks; the last stage of the input butterfly is fused with the first stage of the output butterfly.
All packets overshoot their destination and then hop back to it. Most of the time the overshoot is not necessary and only increases the latency; a packet takes twice as many hops as it really needs.
◼ Folded Clos topology → the input and the output networks share switch modules. Such networks are called fat trees. Myrinet, InfiniBand, and Quadrics implement a fat-tree topology.
◼ Butterfly network → the name comes from the pattern of inverted triangles created by the interconnections, which look like butterfly wings. A butterfly network transfers the data using the most efficient route, but it is blocking: it cannot handle a conflict between two packets attempting to reach the same port at the same time.
(a) A 5-stage Clos network with radix-2 routers and unidirectional channels; the network is equivalent to two back-to-back butterfly networks.
(b) The corresponding folded-Clos network with bidirectional channels; the input and the output networks share switch modules.
(a) A 2-ary 4-fly butterfly network with unidirectional links, inputs in0-in15, outputs out0-out15, and switches S0-S3 in each stage.
(b) The corresponding 2-ary 4-flat flattened butterfly, obtained by combining the four switches S0, S1, S2, and S3 in each row of the traditional butterfly into a single switch S'0-S'7 and by adding connections between the combined switches.
Fat trees
◼ Optimal interconnects for large-scale clusters and for WSCs.
◼ Servers are placed at the leaves.
◼ Switches populate the root and the internal nodes of the tree.
◼ Additional links increase the bandwidth near the root of the tree.
◼ A fat-tree network can be built with cheap commodity parts, as all switching elements of a fat-tree are identical.
A 192-node fat-tree interconnection network in a computer cloud, with two 96-way and twelve 24-way switches. The two 96-way switches at the root are connected via 48 links. Each 24-way switch has 8 uplink connections to the root and 16 down connections to 16 servers.
A 192-node fat-tree interconnect with two 96-way and sixteen 24-way switches. Each 24-way switch has 2 x 6 uplink connections to the root, six to each 96-way switch, and 12 down connections to 12 servers.
InfiniBand
◼ Interconnection network used by supercomputers and computer clouds:
Has a switched-fabric topology designed to be scalable.
Supports several signaling rates.
The energy consumption depends on the throughput.
Links can be bonded together for additional throughput.
◼ Offers point-to-point bidirectional serial links intended for the connection of processors with high-speed peripherals, such as disks, as well as multicast operations. Advantages:
High throughput, low latency.
Support for quality-of-service guarantees and failover – the capability to switch to a redundant or standby system.
InfiniBand (Cont’d)
◼ InfiniBand supports:
Quality-of-service guarantees.
Failover – the capability to switch to a redundant or standby system.
◼ The data rates:
single data rate (SDR) – 2.5 Gbps in each direction per connection.
double data rate (DDR) – 5 Gbps.
quad data rate (QDR) – 10 Gbps.
fourteen data rate (FDR) – 14.0625 Gbps.
enhanced data rate (EDR) – 25.78125 Gbps.
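Since links can be bonded for additional throughput, the aggregate signaling rate is simply the per-connection rate times the number of bonded lanes. A minimal sketch, assuming the 1X/4X/12X link widths common in InfiniBand practice:

# Per-connection signaling rates in Gbps, from the list above.
RATES_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0,
              "FDR": 14.0625, "EDR": 25.78125}

def aggregate_gbps(rate, lanes=4):
    # signaling rate of a bonded link; 1X, 4X, and 12X are common widths
    return RATES_GBPS[rate] * lanes

print(aggregate_gbps("QDR", 4))    # 40.0 -> a 4X QDR link signals at 40 Gbps
print(aggregate_gbps("EDR", 12))   # 309.375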
The evolution of the speed of several high-speed interconnects: the data rates supported by InfiniBand (SDR, DDR, QDR, FDR, and EDR).
Myrinet
◼ Myrinet → interconnect for massively parallel systems developed at Caltech. Features:
Robustness ensured by communication channels with flow control, packet framing, and error control.
Self-initializing, low-latency, cut-through switches.
Host interfaces that can map the network, select routes, and translate from network addresses to routes, as well as handle packet traffic.
Streamlined host software that allows direct communication between user processes and the network.
Scalability – the aggregate capacity grows with the number of nodes.
◼ Supports high data rates.
◼ A Myrinet link consists of a full-duplex pair of 640 Mbps channels and has a regular topology with elementary routing circuits in each node. Simple algorithmic routing avoids deadlocks and allows multiple packets to flow concurrently through the network.
10. Storage area networks
◼ A specialized, high-speed network for data block transfers between computer systems and storage elements.
◼ Consists of a communication infrastructure and a management layer.
◼ Fiber Channel (FC) → the dominant architecture of SANs.
◼ FC → a layered protocol.
(Figure: clients reach the servers over a local area network; the servers access the storage devices through the SAN.)
A storage area network interconnects servers to servers, servers to storage devices, and storage devices to storage devices.
FC protocol layers
◼ Three lower-layer protocols: FC-0, the physical interface; FC-1, the transmission protocol responsible for encoding/decoding; and FC-2, the signaling protocol responsible for framing and flow control.
FC-0 uses laser diodes as the optical source and manages the point-to-point fiber connections.
FC-1 controls the serial transmission and integrates data with clock information.
FC-2 handles the topologies, the communication models, the classes of service, sequence and exchange identifiers, and segmentation and reassembly.
◼ Two upper-layer protocols:
FC-3 is the common services layer.
FC-4 is the protocol mapping layer.
FC (Fiber Channel) protocol layers:
FC-4 – protocol mapping (SCSI, IP, ATM)
FC-3 – common services
FC-2 – signaling protocol
FC-1 – transmission code
FC-0 – physical interface
FC classes of service
◼ Class 1 → rarely used blocking connection-oriented service.
◼ Class 2 → acknowledgments ensure that the frames are received; allows the fabric to multiplex several messages on a frame-by-frame basis; does not guarantee in-order delivery.
◼ Class 3 → datagram connection; no acknowledgments.
◼ Class 4 → connection-oriented service for multimedia applications; virtual circuits (VCs) established between ports, in-order delivery, acknowledgment of delivered frames; the fabric is responsible for multiplexing frames of different VCs. Guaranteed QoS, bandwidth, and latency.
◼ Class 5 → isochronous service for applications requiring immediate delivery, without buffering.
◼ Class 6 → supports dedicated connections for a reliable multicast.
◼ Class 7 → similar to Class 2, used for the control and management of the fabric; connectionless service with notification of non-delivery.
The format of an FC frame:
Word 0 (4 bytes) – SOF (Start Of Frame)
Word 1 (3 bytes) – destination port address
Word 2 (3 bytes) – source port address
Words 3-6 (18 bytes) – control information
Payload (0-2112 bytes)
CRC
EOF (End Of Frame)
FC networks
◼ An FC device has a unique ID called the WWN (World Wide Name), a 64-bit address, the equivalent of the MAC address.
◼ Each port in the switched fabric has its own unique 24-bit address consisting of: the domain (bits 23-16), the area (bits 15-08), and the port physical address (bits 07-00).
◼ A switch dynamically assigns and maintains the port addresses.
◼ When a device with a WWN logs into the switch on a port, the switch assigns the port address to that device and maintains the correlation between that port address and the WWN address of the device using a Name Server.
◼ The Name Server is a component of the fabric operating system, running on the switch.
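The 24-bit port address decomposes with simple shifts and masks; a minimal sketch (the function name and the example value are illustrative):

def parse_fc_port_address(addr):
    # split a 24-bit FC port address into its three 8-bit fields
    return {
        "domain": (addr >> 16) & 0xFF,   # bits 23-16
        "area":   (addr >> 8)  & 0xFF,   # bits 15-08
        "port":   addr         & 0xFF,   # bits 07-00
    }

print(parse_fc_port_address(0x010203))   # {'domain': 1, 'area': 2, 'port': 3}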
11. Data center networks (DCNs)
◼ Challenges:
Aggregate cluster bandwidth scales poorly with cluster size.
The bandwidth needed by many cloud applications comes at a high price; the cost increases dramatically with cluster size.
Communication bandwidth may become oversubscribed by a significant factor depending on the communication patterns.
◼ DCN architectural styles:
Three-tier → multiple-rooted tree topology with three layers: core, aggregation, and access. Servers are directly connected to switches at the leaves of the tree, at the access layer. The architecture is not suitable for computer clouds: it is not particularly scalable, the bisection bandwidth is not optimal, and the switches at the core layer are expensive and power-hungry.
Fat-tree → optimal for computer clouds; the bandwidth is not severely affected for messages crossing multiple switches, and the interconnection network can be built with commodity, rather than enterprise, switches.
Comparison of performance and cost of hierarchical and fat-tree DCNs
Fat-tree DCNs
◼ Design principles:
The network should scale to a very large number of nodes.
The fat-tree should have multiple core switches.
The network should support multi-path routing; the equal-cost multi-path (ECMP) routing algorithm, which performs static load splitting among flows, should be used.
◼ A WSC interconnect consists of k pods built from k-port switches (a sanity-check sketch of the counts follows this list):
Each pod has two layers with k/2 switches at each layer.
Each switch at the lower layer is connected directly to k/2 servers. The other k/2 ports are connected to k/2 of the k ports in the aggregation layer.
With the (k/2)² core switches, the total number of switches is 5k²/4 and the total number of servers connected to the system is k³/4. There are (k/2)² paths connecting every pair of servers.
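A quick sanity check of these counts (my own sketch; k=4 matches the figure that follows):

def fat_tree_params(k):
    # counts for a fat-tree built from k-port switches
    edge = agg = k * (k // 2)        # k pods, k/2 switches per layer
    core = (k // 2) ** 2
    return {
        "pods": k,
        "edge_switches": edge,
        "aggregation_switches": agg,
        "core_switches": core,
        "total_switches": edge + agg + core,    # 5k^2/4
        "servers": k ** 3 // 4,                 # k/2 servers per edge switch
        "paths_between_server_pairs": (k // 2) ** 2,
    }

print(fat_tree_params(4))    # 20 switches, 16 servers, 4 paths
print(fat_tree_params(48))   # commodity 48-port switches -> 27648 servers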
(Figure: a k=4 fat-tree with core switches 87.4.1.1, 87.4.1.2, 87.4.2.1, and 87.4.2.2, aggregation and edge switches such as 87.0.2.1, 87.2.2.1, 87.0.1.1, and 87.2.0.1, and servers such as 87.0.1.2, 87.2.0.2, and 87.2.0.3, organized in pods 0-3.)
Fat-tree interconnection network for k=4. The core, the aggregation, and the edge layers are populated with 4-port switches. Each core switch is connected with one of the switches at the aggregation layer of each pod.
The network has four pods; there are four switches at each pod, two at the aggregation layer and two at the edge. Four servers are connected to each pod.
Switches have IP addresses of the form 87.pod.switch.1; switches are numbered left to right and bottom to top, e.g., the four switches of pod 2 have IP addresses 87.2.0.1, 87.2.1.1, 87.2.2.1, and 87.2.3.1. Core switches have IP addresses of the form 87.k.j.i, with k=4, where j and i denote the coordinates of the switch in the (k/2)² core-switch grid, starting from the top-left.
Servers have IP addresses of the form 87.pod.switch.serverID, where serverID is the server's position in the subnet of its edge switch, counted from left to right; e.g., the IP addresses of the two servers connected to the switch with IP address 87.2.0.1 are 87.2.0.2 and 87.2.0.3.
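A small generator for this addressing scheme (a sketch; the 87 prefix and the left-to-right numbering follow the example above):

def pod_switch_ip(pod, switch):
    # pod switch addresses: 87.pod.switch.1 (edge switches first, then aggregation)
    return f"87.{pod}.{switch}.1"

def core_switch_ip(k, j, i):
    # core switch addresses: 87.k.j.i over the (k/2) x (k/2) core grid
    return f"87.{k}.{j}.{i}"

def server_ip(pod, switch, server_id):
    # servers: 87.pod.switch.serverID, with serverID starting at 2
    return f"87.{pod}.{switch}.{server_id}"

print([pod_switch_ip(2, s) for s in range(4)])     # the four switches of pod 2
print(core_switch_ip(4, 1, 1))                     # 87.4.1.1
print([server_ip(2, 0, sid) for sid in (2, 3)])    # servers under 87.2.0.1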
Two-level routing table of switch 87.2.2.1 (left part of the figure):
Prefix 87.2.0.0/24 → output port 0
Prefix 87.2.1.0/24 → output port 1
Prefix 0.0.0.0/0 → use the suffix table
Suffix 0.0.0.2/8 → output port 2
Suffix 0.0.0.3/8 → output port 3
TCAM/RAM implementation (right part of the figure):
TCAM entry 87.2.0.X → encoder address 00 → next hop 87.2.0.1, output port 0
TCAM entry 87.2.1.X → encoder address 01 → next hop 87.2.1.1, output port 1
TCAM entry X.X.X.2 → encoder address 10 → next hop 87.4.1.1, output port 2
TCAM entry X.X.X.3 → encoder address 11 → next hop 87.4.1.2, output port 3
(Left) The two-level routing tables for switch 87.2.2.1. Two incoming packets for IP addresses 87.2.1.2 and 87.3.0.3 are forwarded on ports 1 and 3, respectively.
(Right) The RAM implementation of a two-level TCAM routing table.
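The lookup logic behind the table is simple: match the destination prefix first, and fall back to the suffix (the last octet) for traffic leaving the pod. A sketch of that two-level lookup, illustrative and hard-wired to the tables above:

def two_level_lookup(dst):
    # two-level lookup for switch 87.2.2.1 (tables above)
    octets = dst.split(".")
    prefix = ".".join(octets[:3])        # the /24 prefix
    prefix_table = {"87.2.0": 0, "87.2.1": 1}
    suffix_table = {"2": 2, "3": 3}      # last octet -> uplink port
    if prefix in prefix_table:           # intra-pod traffic: route down
        return prefix_table[prefix]
    return suffix_table[octets[3]]       # inter-pod traffic: route up

print(two_level_lookup("87.2.1.2"))   # 1 (down toward edge switch 87.2.1.1)
print(two_level_lookup("87.3.0.3"))   # 3 (up toward core switch 87.4.1.2)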
12. Network resource management algorithms
◼ Aim → guarantee the communication bandwidth required by an application, as specified by an SLA.
◼ Scope → manage:
Communication links of limited bandwidth.
Switches of limited capacity.
◼ How → apply strategies to support the data streaming QoS requirements.
◼ A switch handles multiple flows, pairs of source-destination end-points of the traffic.
◼ A scheduling algorithm has to manage several quantities:
Bandwidth → the amount of data each flow is allowed to transport.
Timing → when the packets of individual flows are transmitted.
Buffer space → allocated to each flow.
◼ FCFS algorithms → simple management of bandwidth, timing, and buffer space, but no fairness guarantees: greedy flow sources can transmit at a higher rate and benefit from a larger share of the bandwidth.
Fair-queuing (FQ); Stochastic Fair Queuing (SFQ)
◼ FQ → ensures that a high-data-rate flow cannot use more than its fair share of the link capacity.
◼ FQ's objective → max-min fairness: first it maximizes the minimum data rate of any of the data flows, then it maximizes the second-lowest data rate.
◼ FQ throughput is low, but starvation of expensive flows is avoided.
◼ Packets are first classified into flows by the system and then assigned to a queue dedicated to the flow. Packet queues are serviced one packet at a time in round-robin (RR) order.
◼ SFQ → a simpler and less accurate implementation of the FQ algorithm that requires fewer calculations.
◼ SFQ → ensures that each flow has the opportunity to transmit an equal amount of data and takes into account data packet sizes.
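A minimal round-robin sketch of the per-flow queues described above (my own illustration; a real FQ scheduler also tracks virtual finish times so that fairness holds for variable packet sizes):

from collections import deque

def fair_queue(flows):
    # one queue per flow; service the queues round-robin, one packet
    # at a time, skipping empty queues
    while any(flows.values()):
        for name, q in flows.items():
            if q:
                yield name, q.popleft()

flows = {
    "flow1": deque(["a1", "a2", "a3"]),   # a greedy flow
    "flow2": deque(["b1"]),
    "flow3": deque(["c1", "c2"]),
}
for flow, pkt in fair_queue(flows):
    print(flow, pkt)   # a1 b1 c1 a2 c2 a3 -> no flow starves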
(Figure: eight flow queues, Flow 1 through Flow 8, feeding a single output port.)
Fair queuing – packets are first classified into flows and then assigned to a queue dedicated to the flow; queues are serviced one packet at a time in round-robin order and empty queues are skipped.
Class-Based Queuing (CBQ)
◼ Objective → support flexible link sharing for applications that require bandwidth guarantees, such as VoIP, video streaming, and audio streaming.
◼ Ensures a balance between short-lived network flows, e.g., web searches, and long-lived ones, e.g., video streaming or file transfers.
◼ Uses several functional units:
Classifier – uses the information in the packet header to assign arriving packets to classes.
Estimator – estimates the short-term bandwidth used by each class.
Selector/scheduler – identifies the highest-priority class to send next and, if multiple classes have the same priority, schedules them on a round-robin basis.
Delayer – computes the next time when a class that has exceeded its link allocation is allowed to send.
(Figure: CBQ link-sharing hierarchy; the link capacity is split 25% to group A and 75% to group B.)
CBQ link sharing for two groups: A, of short-lived traffic, and B, of long-lived traffic, allocated 25% and 75% of the link capacity, respectively. There are six classes of traffic with priorities 1, 2, and 3.
Priority 1 → RT (real-time) traffic and video streaming are allocated 3% and 60% of the link capacity.
Priority 2 → Web transactions and audio streaming are allocated 20% and 10% of the link capacity.
Priority 3 → Intr (interactive applications) and FTP (file transfer protocol) are allocated 2% and 5% of the link capacity.
CBQ classes
◼ CBQ classes are organized in a tree-like hierarchy.
◼ A class is:
Overlimit – if over a certain recent period it has used more than its bandwidth allocation (in bytes per second).
Underlimit – if it has used less than its bandwidth allocation.
Atlimit – if it has used exactly its bandwidth allocation.
◼ A leaf class is:
Unsatisfied – if it is underlimit and has a persistent backlog;
Satisfied – otherwise.
◼ A non-leaf class is unsatisfied if it is underlimit and has some descendant class with a persistent backlog. A class should be regulated if it is overlimit and some other class is unsatisfied; this regulation should continue until the class is no longer overlimit or until there are no unsatisfied classes.
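These rules translate directly into predicates; a sketch of the per-class state a CBQ scheduler tracks (the field names and data layout are my own, not from an actual implementation):

from dataclasses import dataclass, field

@dataclass
class CBQClass:
    allocation_bps: float                  # bandwidth allocation
    recent_rate_bps: float = 0.0           # usage over a recent period
    backlog: bool = False                  # persistent backlog?
    children: list = field(default_factory=list)

    def overlimit(self):
        return self.recent_rate_bps > self.allocation_bps

    def underlimit(self):
        return self.recent_rate_bps < self.allocation_bps

    def backlogged_descendant(self):
        return any(c.backlog or c.backlogged_descendant() for c in self.children)

    def unsatisfied(self):
        if not self.children:              # leaf class
            return self.underlimit() and self.backlog
        return self.underlimit() and self.backlogged_descendant()

def should_regulate(cls, all_classes):
    # regulate an overlimit class while some other class is unsatisfied
    return cls.overlimit() and any(
        c is not cls and c.unsatisfied() for c in all_classes)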
(Figure: two link-sharing trees, (a) and (b), each with groups A and B and traffic classes 1, 2, and 3.)
There are two groups, A and B, and three types of traffic, e.g., web, real-time, and interactive, denoted as 1, 2, and 3.
(a) Group A and class A.3 traffic are underlimit and unsatisfied; classes A.1, A.2, and B.1 are overlimit with persistent backlogs and have to be regulated; group B is overlimit.
(b) Group A is underlimit and unsatisfied; group B is overlimit and needs to be regulated; class A.1 traffic is underlimit; class A.2 is overlimit with a persistent backlog; class B.1 traffic is overlimit with a persistent backlog and needs to be regulated.
Hierarchical Token Buckets (HTB)
(Figure: an HTB hierarchy sharing a 2 Gbps link; group A has AR/CR 800/1200 Mbps and group B 1200/1600 Mbps; the leaf classes Web, Intr, ftp, and Audio have AR/CR 400/800, 400/800, 400/1200, and 800/1600 Mbps, respectively.)
The Linux kernel implements HTB, a link-sharing algorithm inspired by CBQ. In CBQ every class has an assured rate (AR); in addition to the AR, every class in HTB also has a ceil rate (CR).
The main advantage of HTB over CBQ is that it allows borrowing: if class C needs a rate above its AR, it tries to borrow from its parent; the parent then examines its children and, if some classes run at a rate lower than their AR, the parent can borrow the unused bandwidth from them and reallocate it to C.
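A toy model of this borrowing step (a sketch under my own simplifications; real HTB uses token buckets and quantum-based scheduling rather than a one-shot allocation):

from dataclasses import dataclass, field

@dataclass
class HTBClass:
    name: str
    ar_mbps: float                        # assured rate
    cr_mbps: float                        # ceil rate: hard upper bound
    demand_mbps: float = 0.0
    children: list = field(default_factory=list)

def allocate(parent):
    # give each child min(demand, AR), then let needy children borrow
    # their siblings' unused assured bandwidth, capped by their CR
    rates = {c.name: min(c.demand_mbps, c.ar_mbps) for c in parent.children}
    spare = sum(c.ar_mbps - rates[c.name] for c in parent.children)
    for c in parent.children:
        want = min(c.demand_mbps, c.cr_mbps) - rates[c.name]
        extra = min(max(want, 0.0), spare)
        rates[c.name] += extra
        spare -= extra
    return rates

a = HTBClass("A", 800, 1200, children=[
    HTBClass("Web", 400, 800, demand_mbps=700),
    HTBClass("Intr", 400, 800, demand_mbps=100),
])
print(allocate(a))   # Web borrows Intr's unused 300 Mbps: {'Web': 700, 'Intr': 100}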
13. Content delivery networks (CDNs)
◼ CDNs are designed to support scalability, to increase reliability and performance, and to provide better security. In 2013, Internet video alone was expected to generate over 18 exabytes of data per month.
◼ A CDN receives the content from an origin server, then replicates it to its edge cache servers; the content is delivered to an end-user from the "closest" edge server.
◼ A CDN can deliver static content and/or live or on-demand streaming media:
Static content – media that can be maintained using traditional caching technologies, as changes are infrequent. Examples: HTML pages, images, documents, software patches, audio and video files.
Live media – live events whose content is delivered in real time from the encoder to the media server.
◼ Protocols used by CDNs: Network Element Control Protocol (NECP), Web Cache Coordination Protocol (WCCP), SOCKS, Cache Array Routing Protocol (CARP), Internet Cache Protocol (ICP), Hypertext Caching Protocol (HTCP), and Cache Digest.
CDN design and performance
◼ Design and policy decisions for a CDN:
The placement of the edge servers.
The content selection and delivery.
The content management.
The request routing policies.
◼ Critical metrics of CDN performance:
Cache hit ratio – the ratio of the number of cached objects served to the total number of objects requested.
Reserved bandwidth for the origin server.
Latency – based on the response time perceived by the end users.
Edge server utilization.
Reliability – based on packet-loss measurements.
14. Overlay and scale-free networks
◼ An overlay network, or virtual network, is a network built on top of a physical network:
The nodes of an overlay network are connected by virtual links, each of which may traverse multiple physical links.
Overlay networks are widely used in many distributed systems such as peer-to-peer systems, content-delivery systems, and client-server systems; in all these cases the distributed system communicates through the Internet.
Scale-free networks
◼ The degree distribution of scale-free networks follows a power law.
◼ Many physical and social systems are interconnected by a scale-free network. Empirical data available for power grids, the web, the citation of scientific papers, and social networks confirm this trend.
◼ The majority of the vertices of a scale-free network:
Are directly connected to the vertices with the highest degree.
Have a low degree; only a few vertices are connected to a large number of edges.
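A quick way to see a power-law degree distribution emerge is preferential attachment; the sketch below (my own illustration, attaching one edge per new vertex) grows a graph in which new vertices attach to existing ones with probability proportional to their degree:

import random
from collections import Counter

def barabasi_albert_degrees(n, seed=42):
    # degrees of an n-vertex preferential-attachment graph;
    # sampling from the stub list is proportional to degree
    random.seed(seed)
    stubs = [0, 1]                        # start from a single edge 0-1
    degree = [1, 1] + [0] * (n - 2)
    for v in range(2, n):
        u = random.choice(stubs)          # preferential attachment
        degree[v] += 1
        degree[u] += 1
        stubs += [v, u]
    return Counter(degree)

hist = barabasi_albert_degrees(10_000)
for k in sorted(hist)[:8]:
    print(k, hist[k])    # frequency drops off roughly as a power of k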
A scale-free network is non-homogeneous: the majority of the vertices have a low degree, and only a few vertices are connected to a large number of edges; the majority of the vertices are directly connected to the highest-degree ones.
Epidemic algorithms
◼ Epidemic algorithms mimic the transmission of infectious diseases and are often used in distributed systems to accomplish tasks such as:
Disseminating information, e.g., topology information.
Computing aggregates, e.g., arranging the nodes in a gossip overlay into a list sorted by some attribute in logarithmic time.
Managing data replication in a distributed system.
◼ The Game of Life, invented by John Conway, is a popular epidemic algorithm.
◼ Several classes of epidemic algorithms exist. The concepts used to classify these algorithms – Susceptible (S), Infective (I), and Recovered (R) – refer to the states of a population subject to an infectious disease and, by extension, to the recipients of information in a distributed system.
SI, SIR, and SIS epidemic algorithms
◼ Susceptible-Infective (SI) algorithms ➔ apply when the entire population is susceptible to be infected; once an individual becomes infected, it remains in that state until the entire population is infected.
◼ Susceptible-Infective-Recovered (SIR) algorithms ➔ based on the model developed by Kermack and McKendrick, which assumes that the size of the population is fixed, S(t) + I(t) + R(t) = N, and that individuals move from one state to another following the transitions S → I → R.
◼ Susceptible-Infective-Susceptible (SIS) algorithms ➔ particular cases of SIR models in which individuals recover from the disease without immunity. If p is the ratio of the recovery rate to the infection rate, the number of newly infected individuals grows until a fraction (1-p)/2 of the population is infected, then decreases, and the infected fraction converges exponentially to (1-p).
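A quick numerical check of this behavior (a sketch with assumed rates; simple Euler integration of the mean-field SIS equation):

# Mean-field SIS dynamics: di/dt = beta*i*(1-i) - gamma*i, where i is the
# infected fraction. With p = gamma/beta, i converges to 1 - p.
def simulate_sis(beta=1.0, gamma=0.25, i0=0.01, dt=0.01, steps=4000):
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1 - i) - gamma * i)
    return i

p = 0.25 / 1.0
print(f"predicted equilibrium 1-p = {1 - p:.3f}")
print(f"simulated i(t_end)       = {simulate_sis():.3f}")   # ~0.750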