Assembly Language Worksheet

CSC2025 Homework #10
Spring 2023
Once you have thoroughly read Sections 4.5-4.10, please complete the following problems.
Your solutions should be typed. Show all work for calculations and justification, as appropriate.
Answers without appropriate justification, computation, and/or explanation will NOT receive
credit. Upload your completed work (answers to questions and seconds.asm) to the Dropbox under
Assignments in D2L.
1. (6 Pts) For each code sequence below, state whether it must stall, can avoid stalls using only
forwarding, or can execute without stalling or forwarding.
Sequence #1
lw $t0, 0($t0)
add $t1, $t0, $t0
Sequence #2
add $t1, $t0, $t0
addi $t2, $t0, #5
addi $t4, $t1, #5
Sequence #3
addi $t1, $t0, #1
addi $t2, $t0, #2
addi $t3, $t0, #2
addi $t3, $t0, #4
addi $t5, $t0, #5
2. (8 Pts) A group of students were debating the efficiency of a five-stage pipeline when one student
pointed out that not all instructions are active in every stage of the pipeline. After deciding to ignore
the effects of hazards, they made the following four statements. Which ones are correct? Explain
your decisions.
a. Allowing jumps, branches, and ALU instructions to take fewer stages than the five required
by the load instruction will increase pipeline performance under all circumstances.
b. Trying to allow some instructions to take fewer cycles does not help because the throughput is
determined by the clock cycle; the number of pipe stages per instructions affects latency not
throughput.
c. You cannot make ALU instructions take fewer cycles because of the write-back of the result,
but branches and jumps can take fewer cycles, so there is some opportunity for improvement.
d. Instead of trying to make instructions take fewer cycles, we should explore making the
pipeline longer, so that instructions take more cycles, but the cycles are shorter. This could
improve performance.
3. (6 Pts) Consider three branch prediction schemes: predict not taken, predict taken, and dynamic
prediction. Assume that they all have zero penalty when they predict correctly and a two-cycle
penalty when they are wrong. Assume that the average prediction accuracy of the dynamic predictor
is 90%. Which predictor is the best choice for each of the following branches? Justify your answers.
a. A branch that is taken with 5% frequency.
b. A branch that is taken with 95% frequency.
c. A branch that is taken with 70% frequency.
4. (4 Pts) Which exception should be recognized first in this sequence? Why?
add $1, $2, $1
#arithmetic overflow
XXX $1, $2, $1
#undefined instruction
sub $1, $2, $1
#hardware error
5. (9 Pts) State whether each of the following techniques or components are associated primarily with a
software- or hardware-based approach to exploiting Instruction-Level Parallelism. In some cases, the
answer might be both.
a. Branch prediction
b. Multiple Issue
c. VLIW
d. Superscalar
e. Dynamic Scheduling
f. Out-of-order execution
g. Speculation
h. Reorder Buffer
i. Register Renaming
6. (3 Pts) In what ways does a VLIW design differ from a superpipelined design?
7. (4 Pts) Explain the two types of Instruction-Level Parallelism and their differences.
8. (3 Pts Each) These problems correspond to the exercises at the end of Chapter 4.
o 4.8 (Don’t forget to read the initial information)
 4.8.1
 4.8.2
 4.8.3
 4.8.4
 4.8.5
o 4.9 (Don’t forget to read the initial information)
 4.9.1
 4.9.2
 4.9.3
 4.9.4
 4.9.5
 4.9.6
o 4.11 (Don’t forget to read the initial information)
 4.11.1
 4.11.2
o 4.12 (Don’t forget to read the initial information)
 4.12.1
 4.12.2
 4.12.3
 4.12.4
 4.12.5
o 4.14 (Don’t forget to read the initial information)
 4.14.1
 4.14.4
 4.14.5
Question 1
Sequence #1:
lw $t0, 0($t0) loads a word from memory into $t0, and the following add $t1, $t0, $t0 reads
$t0 immediately. This is a load-use data hazard: the loaded value is not available until the end of
the load's MEM stage, after the add has already reached its EX stage, so even with forwarding
the pipeline must stall for one cycle between the two instructions.
Answer: Must stall.
Sequence #2:
add $t1, $t0, $t0 writes $t1. The next instruction, addi $t2, $t0, #5, reads only $t0, so it
creates no hazard. The third instruction, addi $t4, $t1, #5, reads $t1, which was written two
instructions earlier; without forwarding it would read a stale value, because the add does not
write the register file until its WB stage. Forwarding the add's result (from the MEM/WB
pipeline register) to the addi's EX stage resolves the hazard without any stall.
Answer: Can avoid stalls using only forwarding.
Sequence #3:
All five instructions read only $t0, and no instruction in the sequence writes $t0, so there are
no read-after-write dependences at all. The two writes to $t3 (addi $t3, $t0, #2 followed by
addi $t3, $t0, #4) form a write-after-write on the same register, but in an in-order five-stage
pipeline writes complete in program order, so this causes no hazard either.
Answer: Can execute without stalling or forwarding.
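The three cases reduce to a simple rule for the classic five-stage pipeline: a dependence on a load by the immediately following instruction forces a stall, while a dependence on a nearby ALU result can be resolved by forwarding. A minimal sketch of that rule (the instruction encoding and the classify helper are illustrative, not part of the assignment):

```python
# Simplified model of the 5-stage MIPS hazard rules.
# Each instruction: (opcode, destination register, source registers).
def classify(seq):
    """Return 'stall' if a load-use hazard exists, 'forward' if some
    other nearby RAW dependence needs forwarding, else 'none'."""
    result = "none"
    for i, (_, _, srcs) in enumerate(seq):
        for dist in (1, 2):   # dependences 3+ apart read the register
            if i - dist < 0:  # file after the producer's write-back
                continue
            op, dest, _ = seq[i - dist]
            if dest not in srcs:
                continue
            if op == "lw" and dist == 1:
                return "stall"       # load-use: forwarding cannot help
            result = "forward"       # resolvable by forwarding
    return result

seq1 = [("lw",  "$t0", ["$t0"]),
        ("add", "$t1", ["$t0", "$t0"])]
seq2 = [("add",  "$t1", ["$t0", "$t0"]),
        ("addi", "$t2", ["$t0"]),
        ("addi", "$t4", ["$t1"])]
print(classify(seq1))  # stall
print(classify(seq2))  # forward
```

The distance-2 case matters because the producer writes the register file in WB one cycle too late for the consumer's register read, so the value must still be forwarded from the MEM/WB pipeline register.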
Question 2
a. Allowing jumps, branches, and ALU instructions to take fewer stages than the five required by
the load instruction will increase pipeline performance under all circumstances.
This statement is not correct. Ignoring hazards, a pipeline completes one instruction per clock
cycle regardless of how many stages an individual instruction occupies, so letting some
instructions skip stages does not raise throughput. Worse, instructions that finish early can
collide with older instructions in the write-back stage, creating structural hazards that require
extra register-file ports or stalls.
Answer: Incorrect.
b. Trying to allow some instructions to take fewer cycles does not help because the throughput is
determined by the clock cycle; the number of pipe stages per instruction affects latency, not
throughput.
This statement is correct. In an ideal pipeline (no hazards), one instruction completes every
clock cycle, so throughput is fixed by the clock rate. Shortening the path of some instructions
changes only how long each of those instructions is in flight (its latency), not how many
instructions finish per cycle.
Answer: Correct.
c. You cannot make ALU instructions take fewer cycles because of the write-back of the result,
but branches and jumps can take fewer cycles, so there is some opportunity for improvement.
This statement is not correct. It is true that ALU instructions need the write-back stage, but
letting branches and jumps leave the pipeline early does not help: as argued in (b), throughput
remains one instruction per clock cycle, so early completion alone offers no opportunity for
improvement.
Answer: Incorrect.
d. Instead of trying to make instructions take fewer cycles, we should explore making the
pipeline longer so that instructions take more cycles, but the cycles are shorter. This could
improve performance.
This statement is correct. Splitting the work into more, shorter stages (superpipelining) allows a
higher clock rate, and since one instruction still completes per (now shorter) cycle, performance
can improve. The gain is limited in practice by pipeline-register overhead and, once hazards are
considered again, by larger branch and load-use penalties, but the statement claims only that
performance "could" improve, which is true.
Answer: Correct.
In summary, statements (b) and (d) are correct, while statements (a) and (c) are incorrect.
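Statements (b) and (d) both come down to the same arithmetic: an ideal k-stage pipeline with cycle time t runs N instructions in about (k + N - 1) x t, so for large N throughput approaches one instruction per cycle no matter how many stages each instruction occupies. A sketch with purely illustrative stage counts and cycle times (assumed, not from the problem):

```python
def pipeline_time(n_instr, stages, cycle_ns):
    """Ideal pipeline: (stages - 1) cycles to fill, then one
    instruction completes per cycle."""
    return (stages + n_instr - 1) * cycle_ns

n = 1_000_000
t5 = pipeline_time(n, 5, 1.0)    # 5-stage pipeline, 1.0 ns clock
t10 = pipeline_time(n, 10, 0.6)  # 10 shorter stages, 0.6 ns clock (assumed)
print(t10 < t5)  # True: deeper but faster can finish sooner
```

For large N the fill time (stages - 1 cycles) is negligible, so total time is essentially N x cycle time: this is why the cycle time, not the per-instruction stage count, governs throughput.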
Question 3
a. For a branch that is taken with only 5% frequency, the best choice is predict not taken. Its
expected penalty is 0.05 × 2 = 0.1 cycles per execution, versus 0.95 × 2 = 1.9 cycles for predict
taken and (1 − 0.90) × 2 = 0.2 cycles for the dynamic predictor.
b. For a branch that is taken with 95% frequency, the best choice is predict taken. Its expected
penalty is 0.05 × 2 = 0.1 cycles per execution, versus 1.9 cycles for predict not taken and 0.2
cycles for the dynamic predictor.
c. For a branch that is taken with 70% frequency, the best choice is dynamic prediction. The
better static predictor (predict taken) still mispredicts 30% of the time, for an expected penalty
of 0.30 × 2 = 0.6 cycles, while the dynamic predictor mispredicts only 10% of the time, for an
expected penalty of 0.10 × 2 = 0.2 cycles.
In summary, the best predictor depends on how strongly biased the branch is. A static predictor
wins only when its accuracy exceeds the dynamic predictor's 90%: predict not taken when the
branch is taken less than 10% of the time, predict taken when it is taken more than 90% of the
time, and dynamic prediction everywhere in between.
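The comparisons above can be checked with a short calculation of the expected misprediction penalty per branch execution (the function name and structure are just an illustration):

```python
def expected_penalty(taken_freq, predictor, accuracy=0.90, penalty=2):
    """Average penalty cycles per execution of the branch."""
    if predictor == "not_taken":
        mispredict = taken_freq          # wrong whenever taken
    elif predictor == "taken":
        mispredict = 1 - taken_freq      # wrong whenever not taken
    else:                                # dynamic predictor
        mispredict = 1 - accuracy
    return mispredict * penalty

for freq in (0.05, 0.95, 0.70):
    costs = {p: expected_penalty(freq, p)
             for p in ("not_taken", "taken", "dynamic")}
    best = min(costs, key=costs.get)
    print(f"taken {freq:.0%}: best = {best}")
```

Running this reproduces the three answers: not taken for the 5% branch, taken for the 95% branch, and dynamic for the 70% branch.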
Question 4
In this sequence, the arithmetic overflow exception raised by the “add” instruction should be
recognized first, because the add is the earliest instruction in program order.
In a pipeline, exceptions from different instructions can be detected out of order: the undefined
instruction is caught in the decode stage, while the add’s overflow is not detected until the
execute stage, so the two may even be discovered in the same cycle. To support precise
exceptions, the hardware must nevertheless recognize them in program order, so that when the
handler runs, every instruction before the faulting one has completed and no instruction after it
has modified state. Recognizing the undefined-instruction or hardware-error exception first
would leave the earlier, faulting add partially executed and make a clean restart of the program
impossible.
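This ordering can be modeled by tagging each pending exception with the program order of the instruction that raised it and servicing the earliest one, even when a later instruction's exception was detected in an earlier pipeline stage. A minimal sketch (the tuple encoding is hypothetical):

```python
# (program_order, exception) pairs as the pipeline might detect them;
# the undefined instruction is caught in ID and the overflow later in
# EX, but program order decides which is recognized first.
pending = [
    (2, "undefined instruction"),  # second instruction, detected early
    (1, "arithmetic overflow"),    # first instruction, detected later
    (3, "hardware error"),
]
first = min(pending)[1]            # earliest program order wins
print(first)  # arithmetic overflow
```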
Question 5
a. Branch prediction: Both software and hardware approaches can be used to implement branch
prediction. Software-based branch prediction is usually implemented in the compiler or operating
system, while hardware-based branch prediction is implemented in the processor itself.
b. Multiple Issue: Both. Static multiple issue (e.g., VLIW) relies on the compiler to package
instructions for issue, while dynamic multiple issue (superscalar) relies on hardware.
c. VLIW: Software-based approach; the compiler finds independent operations and packs them
into each long instruction word.
d. Superscalar: Hardware-based approach.
e. Dynamic Scheduling: Hardware-based approach.
f. Out-of-order execution: Hardware-based approach.
g. Speculation: Both software and hardware approaches can be used to implement speculation.
Software-based speculation is usually implemented in the compiler, while hardware-based
speculation is implemented in the processor itself.
h. Reorder Buffer: Hardware-based approach.
i. Register Renaming: Both. Compilers can rename registers statically to remove name
dependences, and hardware renames registers dynamically (e.g., with rename registers or a
reorder buffer).
Question 6
A Very Long Instruction Word (VLIW) design and a superpipelined design are both techniques
for improving processor performance by exploiting Instruction-Level Parallelism (ILP), but they
differ in several ways:
Instruction format: In a VLIW design, a single wide instruction word contains multiple
independent operations that issue in parallel. A superpipelined design keeps conventional
instructions but divides the pipeline into more, shorter stages so the clock can run faster.
Compiler dependence: VLIW designs are heavily dependent on the compiler to identify and
schedule independent instructions for parallel execution. In contrast, superpipelined designs are
less dependent on the compiler and can often execute instructions in parallel without explicit
compiler support.
Hardware complexity: VLIW designs tend to have simpler hardware because they rely on the
compiler to schedule instructions for parallel execution. In contrast, superpipelined designs
require more complex hardware to execute instructions in overlapping stages.
Latency tolerance: VLIW designs are often more tolerant of instruction latency because they can
execute multiple instructions in parallel, even if some instructions take longer to complete. In
contrast, superpipelined designs are less tolerant of instruction latency because a single slow
instruction can delay the entire pipeline.
Performance predictability: VLIW designs are often more predictable because the compiler can
schedule instructions for parallel execution based on static analysis of the code. In contrast,
superpipelined designs can be more unpredictable because the overlapping execution of
instructions can depend on dynamic factors such as data dependencies and cache behavior.
Question 7
The two approaches to exploiting Instruction-Level Parallelism (ILP) are:
Static (software-based, compile-time) ILP
Dynamic (hardware-based, run-time) ILP
Static ILP relies on the compiler to find independent instructions and schedule them before the
program runs. Techniques include static multiple issue (VLIW), loop unrolling, and compiler
instruction scheduling. For example, a VLIW compiler packs several independent operations
into one long instruction word that the hardware issues together.
Dynamic ILP relies on hardware to find and exploit parallelism while the program runs.
Techniques include superscalar issue, dynamic scheduling, out-of-order execution, register
renaming, and hardware speculation. For example, a superscalar processor examines the
instruction stream at run time and issues several independent instructions per cycle.
The critical difference is who discovers the parallelism and when. Static ILP keeps the
hardware simpler, but it depends on the compiler, requires rescheduling or recompilation for
each machine, and cannot adapt to run-time behavior such as cache misses. Dynamic ILP
requires more complex hardware but preserves binary compatibility and can react to run-time
events. Modern processors typically combine both: the compiler schedules code helpfully, and
the hardware extracts additional parallelism dynamically.
Question 8: see the attached images; some parts are handwritten and others are from the book.