MASSACHUSETTS INSTITUTE OF TECHNOLOGY DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
6.191 Computation Structures

Updated Fall 2022

| 1 | $/ 20$ |
| :--- | :--- |
| 2 | $/ 17$ |
| 3 | $/ 16$ |
|  |  |
|  |  |
|  |  |

## Quiz \#2

| Name |  | Athena login name | Score |
| :--- | :--- | :--- | :--- |
|  |  |  |  |
| Recitation section |  |  |  |
| $\square$ WF 10, 34-301 (Grace) | $\square$ WF 2, 13-5101 (Frances) | $\square$ WF 12, 36-155 (Shiqi) |  |
| $\square$ WF 11, 34-301 (Grace) | $\square$ WF 3, 13-5101 (Frances) | $\square$ WF 1, 36-155 (Shiqi) |  |
| $\square$ WF 12, 35-310 (Alexandra) | $\square$ WF 10, 13-4101 (Georgia) | $\square$ WF 1,34-303 (Amelia) |  |
| $\square$ WF 1, 35-308 (Alexandra) | $\square$ WF 11, 13-4101 (Georgia) | $\square$ WF 2, 34-303 (Amelia) |  |
|  |  | $\square$ opt-out |  |

Please enter your name, Athena login name, and recitation section above. Enter your answers in the spaces provided below. Show your work for potential partial credit. You can use the extra white space and the backs of the pages for scratch work.

## Problem 1. Sequential Circuits in Minispec (20 points)

You are frustrated with the 77 Mass. Ave crosswalk and decide to design a better traffic signal in Minispec. To start, you want to make sure the traffic light will function well in the daytime when there's lots of traffic. After carefully analyzing traffic patterns, you define the following specification:

- The traffic light should be red for 4 cycles, then green for 10 cycles, then yellow for 1 cycle, and repeat this pattern indefinitely.
- The light starts red (and should stay red for 4 cycles before turning green).
- Pedestrians can only cross when the light is red.
(A) (6 points) Fill in the Minispec module on the next page to track the traffic light state as a sequential circuit.
- The pedestriansCanCross method should return True if and only if the light is in a state where pedestrians are allowed to cross.
- The currentLight method should return the current state of the traffic light.
- We have provided a counter register - use this to count down to the next state transition.

```
typedef enum { Green, Yellow, Red } LightState;
module TrafficLight;
    Reg#(LightState) light(___Red___);
    Reg#(Bit#(____4____)) counter(_____3____);
    method Bool pedestriansCanCross = _light == Red__;
    method LightState currentLight = __light
```

$\qquad$

``` ;
    rule tick;
        if (light == Green) begin
            if (counter == 0) begin
                    light <= __Yellow____;
            end else begin
                counter <= __counter - 1__;
                    end
            end else if (light == Yellow) begin
                    light <=
```

$\qquad$

``` Red
``` \(\qquad\)
```

                    counter <=
    ```
\(\qquad\)
``` 3
``` \(\qquad\)
```

        end else if (light == Red) begin
            if (counter == 0) begin
            light <= ___Green____;
            counter <= ___ 9_____;
            end else begin
            counter <= __counter - 1___;
            end
        end
    endrule
    ```
endmodule
(B) (6 points) To ensure that your module behaves as expected, fill in the timing chart below with the register values and outputs for the first 6 cycles.
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline Cycle & \(\mathbf{0}\) & \(\mathbf{1}\) & \(\mathbf{2}\) & \(\mathbf{3}\) & \(\mathbf{4}\) & \(\mathbf{5}\) \\
\hline counter & 3 & 2 & 1 & 0 & 9 & 8 \\
\hline light & Red & Red & Red & Red & Green & Green \\
\hline currentLight & Red & Red & Red & Red & Green & Green \\
\hline pedestriansCanCross & True & True & True & True & False & False \\
\hline
\end{tabular}
(C) (8 points) Now you want to add a new feature to your traffic light. During the daytime, you want it to work as in part (A). But during the nighttime, the traffic light should work differently:
- By default, the light should be green.
- When a pedestrian requests to cross the street and the light is green, it should remain green for \(\mathbf{3}\) more cycles, turn yellow for 1 cycle, then red for 4 cycles. Then it should go back to being green indefinitely.
- If a pedestrian requests to cross the street when the light is yellow or red, this request should be ignored and have no effect.
- If a pedestrian requests to cross the street while the light is green, and a pedestrian requests to cross the street in a following cycle when the light is still green, this request should also have no effect.

You also want to add a feature for emergency pedestrian requests. In an emergency, if a pedestrian requests to cross, the light should immediately turn yellow on the next cycle. The pedestrian request is provided as a Maybe\#(PedestrianRequest) type - on each cycle it will either be:
- Invalid (no pedestrian request)
- Standard (a standard pedestrian request was made)
- Emergency (an emergency pedestrian request was made)

Note that your implementation should still work when the input transitions from daytime to nighttime, even though in daytime the Green light is 10 cycles and in nighttime it is only 3 cycles following a pedestrian request. Thus, if it is nighttime and our counter variable is too large (because we were counting down from a larger value during the daytime), we should "clamp" it to be no larger than it can be in nighttime. We have provided a currentCounter variable to use for this purpose - i.e. it will be clamped to the maximum value the counter can be during nighttime.

Fill in the Minispec module below to add this functionality. We have provided two inputs one for whether it is currently nighttime or daytime, and one for whether a pedestrian has requested to cross the street in this cycle.
```

typedef enum { Green, Yellow, Red } LightState;
typedef enum { Daytime, Nighttime } TimeOfDay;
typedef enum { Standard, Emergency } PedestrianRequest;
module TrafficLight;
Reg\#(LightState) light(_<answer from Part A>_);
Reg\#(Bit\#(_<answer from Part A>_)) counter(_<answer from part A>_);
input TimeOfDay timeOfDay default = Nighttime;
input Maybe\#(PedestrianRequest) pedestrianRequest default = Invalid;
method Bool pedestriansCanCross = <answer from part A>;
method LightState currentLight = <answer from part A>;

```

The code continues on the next page.
```

    rule tick;
        if (timeOfDay == Daytime) begin
                        <Your answer from Part A>
            end else begin
                if (light == Green) begin
                    Bit#(<answer from part A>) currentCounter;
                    // Clamp currentCounter to the maximum value counter
                    // can be for a Green light at night
                    currentCounter = counter > ___3_ ? ___3_ : counter;
                    if (currentCounter == 0) begin
                    light <= ___Yellow___;
                    // Check if received pedestrian request this cycle
                    end else if (_____isValid(pedestrianRequest)
    ```
\(\qquad\)
```

                            begin
                        // Handle emergency request
                if (fromMaybe(?, pedestrianRequest) ==
                    Emergency) begin
    ```
\(\qquad\)
``` end else begin
                                    counter <= ___currentCounter - 1____;
                    end
                    end else if (currentCounter < _______) begin
                counter <= ___currentCounter - 1_____;
                    end else begin
                    counter <= ___currentCounter_(or 3)____;
                    end
                end else if (light == Yellow) begin
                    <Your answer from Part A>
                end else if (light == Red) begin
                    <Your answer from Part A>
                end
    end
    endrule
endmodule
```


## Problem 2. Arithmetic Pipelines (17 points)

You are given a module, named "F." This module has two inputs, $\mathbf{X}$ and $\mathbf{Y}$, and two outputs, $\mathbf{A}$ and $\mathbf{B}$. You are told that the circuit functions, but its throughput is too low. You decide to take a look and try to pipeline the circuit.


For each of the questions below, please create a valid K-stage pipeline of the given circuit. Each component in the circuit is annotated with its propagation delay in nanoseconds. Show your pipelining contours and place large black circles ( $\bullet$ ) on the signal arrows to indicate the placement of pipeline registers. Give the latency and throughput of each design, assuming ideal registers ( $\operatorname{tpd}=0, \mathrm{t}_{\text {SETup }}=0$ ). Remember that our convention is to place a pipeline register on each output.
(A) (1 point) What is the propagation delay of the whole circuit shown below as-is without pipelining?
$t_{\text {PD }}(n s):$ $\qquad$ 26
(B) (4 points) Show the maximum-throughput 3-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit? Pay close attention to the direction of each arrow. In case you need them, extra copies of the circuit are available on the last page of the exam.


Latency (ns): $\qquad$ 33 $\qquad$
Throughput ( $\mathrm{ns}^{-1}$ ): $\qquad$ 1/11 $\qquad$
(C) (4 points) Show the maximum-throughput pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit? In case you need them, extra copies of the circuit are available on the last page of the exam.


Latency (ns): $\qquad$ 32 $\qquad$
Throughput ( $\mathrm{ns}^{-1}$ ): $\qquad$ 1/8 $\qquad$
(D) Now, you are given two new modules: module " $G$ " takes in input $\mathbf{U}$ and produces output $\mathbf{V}$, and module "H" takes in input $\mathbf{S}$ and produces output $\mathbf{T}$. You are given module " G " implemented with a 2 -stage pipeline, with registers denoted by the large black circles ( $(\bullet)$, and module " H " implemented with a single stage pipeline as shown below.


Implementation of Module G


Implementation of Module H
(i) (2 points) Given the implementation of the modules above, what are the throughputs of the modules?
$\qquad$ 1/15
$\qquad$ 1/8 $\qquad$
You want to connect these two modules with the module "F", from the previous parts of this problem, so the output A of Module " F " is connected to the input $\mathbf{U}$ and output $\mathbf{B}$ is connected to the input $\mathbf{S}$ as shown below.

(ii) (2 points) Do any changes need to be made to modules G or H to ensure that the combined circuit above behaves as a proper pipeline? If so, draw any updated modules below. If no changes are required, say "No changes required".

Add a pipeline stage to H so that both G and H have 2 pipeline stages. (Preferred) OR
Remove the first pipeline stage on $G$ (between 6 and 3/7)
(iii) (4 points) When connecting module F to module G and module H , you have two options for module F: your 3-stage pipeline, or your maximum-throughput pipeline. If you want to maximize throughput, while minimizing latency and the number of registers used, which implementation of module F would you use? What would the latency and throughput of the combined device be? Assume that any required changes to modules G or H have been made.
Module F implementation (circle one): 3-stage pipeline Maximum-throughput pipeline
5 Stages at $15 \mathrm{~ns} /$ stage (if gave $1^{\text {st }}$ answer to dii) or 4 Stages at $26 \mathrm{~ns} /$ stage (if gave $2^{\text {nd }}$ answer to dii) Latency (ns): __75 or 104

Throughput ( $\mathrm{ns}^{-1}$ ): $\qquad$ $1 / 15$ or $1 / 26$

## Problem 3. A RISCier processor (16 points)

Consider the single-cycle processor implementation we saw in lecture:


The timing characteristics of all components are listed below:

| Component | Propagation delay (t $\mathbf{P D}$ ) |
| ---: | :--- |
| Register | 1ns |
| Decoder | 2 ns |
| RegFile read | 3 ns |
| MUX | 1ns |
| ALU | 4 ns |
| Adder | 3 ns |
| Memory read | 4 ns |
| (instruction or data) |  |


| Setup/hold times for clocked inputs |
| :---: |
| (registers and writes to |
| RegFile and data memory) |


| Setup time ( $\mathbf{t}_{\text {setup }}$ | 2 ns |
| ---: | :--- |
| Hold time ( $\left.\mathbf{t}_{\text {HoLd }}\right)$ | 0 ns |

Assume that any components not listed have a delay of 0 ns .
(A) (3 points) What is the minimum clock cycle time of this processor? (For partial credit, draw the critical path in the diagram above.)
Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> DMem (read)
$->$ WDSEL mux $->\mathrm{RF}$ (write) $\Rightarrow \mathrm{t}_{\text {cLK }}>=1+4+2+3+1+4+4+1+2$ (RF setup)
$\qquad$ ns

Ben Bitdiddle is unhappy with the performance of this processor. After a Ouija session with legendary CPU designer Seymour Cray, Ben implements this alternative datapath (the control logic stays the same), where the data memory's Adr input comes from a different place:

(B) (3 points) What is the minimum clock cycle time of Ben's new processor? (For partial credit, draw the critical path in the diagram above.)
Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> WDSEL mux $->\mathrm{RF}($ write $)=>\mathrm{t}_{\text {CLK }}>=1+4+2+3+1+4+1+2$ (RF setup)

Minimum tcle: $\qquad$ 18 $\qquad$ ns
(C) (2 points) Ben's processor executes some instructions incorrectly according to the RISC-V ISA. Give an example of one such instruction, and write the equivalent RISC-V instruction that is actually executed by the processor.

Example of incorrect instruction: $\qquad$ Iw a0, 8(a1)
(any lw or sw instruction with a non-zero offset)
RISC-V instruction that produces the same behavior as executing the above incorrect instruction: $\qquad$ Iw a0, 0(a1)
(the same lw or sw instruction with the offset set to 0 )
(D) (5 points) The program below takes 90 instructions to execute in the original processor. However, it produces incorrect results in Ben's new processor. Modify the program so that it runs correctly on the new processor. For full credit, the number of executed instructions should not increase compared to the original code. How many instructions does your assembly code execute?

```
C code
int x[16];
for (int i = 0; i < 15; i++)
    x[i+1] = x[i] + x[i+1];
```

Assembly code

```
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop: lw a1, 0(a0)
    lw a2, 4(a0)
    add a3, a2, a1
    sw a3, 4(a0)
    addi a0, a0, 4
    blt a0, a7, loop
```

```
Modified assembly that executes
correctly on new processor:
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
    lw a1, 0(a0)
    addi a0, a0, 4
loop: lw a2, 0(a0)
    add a1, a1, a2
    sw a1, 0(a0)
    addi a0, a0, 4
    ble a0, a7, loop
```

Number of executed instructions of new program: $\qquad$ $15 * 5+2=77$ $\qquad$ (other solutions are possible; any solution with fewer instructions than the original one earns full credit, and any correct solution earns at least 2 points of partial credit)
(E) (2 points) What is the execution time of the above program in the original and new processors? (Use the appropriate variant of the program for each processor.)

Execution time on original processor: $\qquad$ $77 * 22=1694$ $\qquad$ ns

Execution time on Ben's new processor: ___ $77 * 18=1386$ $\qquad$ ns

## Problem 4. Caches ( 20 points)

Cache Ketchum wants to design a cache to help keep track of his Pokedex entries. He's enlisted your help as a talented 6.191 student! :)
(A) (3 points) Ketchum wants to build a direct-mapped cache with a block size of eight words. He also wants the cache to hold a total of $\mathbf{2}^{\boldsymbol{9}}=\mathbf{5 1 2}$ data words. Which address bits should be used for the block offset, cache index, and tag? Assume that data words and addresses are 32 bits wide.

Address bits used for block offset: A[ _ 4 _ : __ 2 _ ] Address bits used for cache index: $\mathrm{A}[\ldots 10 \ldots$ : _ 5 _ ]

Address bits used for tag: A[ __31__: _11
$\qquad$
$\qquad$
(B) (2 points) Ketchum ponders over the design and decides that he wants to double the number of cache lines in his direct-mapped cache. However, he wants to keep the total number of words in the cache the same. How will the number of bits used to represent the block offset change as a result?

Change in \# bits to represent block offset (select one of the choices below):
UNCHANGED $\ldots+1 \ldots-1.2 \mathrm{x} \ldots 0.5 \mathrm{x} \ldots$ CAN'T TELL

Ketchum decides he doesn't want a direct-mapped cache at all! He wants a two-way setassociative cache.

The remainder of the problem will be considering this 2-way set-associative cache with a capacity of 32 words. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each set. Assume an LRU replacement policy and that Way 0 is currently holds the LRU cache line for all sets.

| Way 0 |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| V | D | Tag | Word 0 | Word 1 | Word 2 | Word 3 |
| 1 | 0 | 0x28 | 0xA65 | 0x521 | 0xA2C | 0x947 |
| 1 | 1 | 0x1D | 0xB54 | 0xE95 | 0x9AA | 0xC7A |
| 1 | 0 | 0x4D | 0xE71 | 0x2FE | 0xC58 | 0x4C4 |
| 1 | 0 | 0x085 | 0xB6B | 0xD55 | 0x27D | 0xE1E |

Way 1

| V | D | Tag | Word 0 | Word 1 | Word 2 | Word 3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 0x093 | 0x2EA | 0x4CE | 0x42D | 0x462 |
| 1 | 1 | 0x093 | 0x3C2 | 0x152 | 0xB9C | 0xC23 |
| 1 | 0 | 0xAF | 0xC05 | 0xE81 | 0xCEA | 0x60B |
| 1 | 0 | 0xA5 | 0x57B | 0xC5F | 0xA1F | 0xAF5 |

(C) (5 points) Identify whether each of the following memory accesses is a hit or a miss. Consider each memory access independently. If it is a hit, specify what value is returned; if it is a miss, write $\mathrm{N} / \mathrm{A}$. In addition, if it is a miss, determine if any values need to be written back to main memory, and if so, to which location(s) in main memory? List all updated main memory word addresses. If no writes to main memory are needed, write N/A.

Load from address 0x2974
$0 \times 2974$ = 0010_1001_0111_0100
tag $=0 \times 15$, index $=3$, block offset $=1$
Circle One: Hit/ Miss
Returned value if hit or N/A if miss: __C5F $\qquad$
All updated main memory word addresses or N/A: $\qquad$ N/A $\qquad$

## Load from address 0x11D8

0x11D8 = 0001_0001_1101_1000
$\boldsymbol{t a g}=0 \times 47$, index $=1$, block offset $=2$
miss $\rightarrow$ replaces way 0 line 1 which is dirty, must first write cache line back to memory if tag $=0 \times 1 \mathrm{D}$ and index $=1$, then memory addresses are 111_0101_XX00 $=0 \times 750$, $0 x 754,0 \times 758,0 x 75 C$.

Circle One: Hit
Returned value if hit or N/A if miss: $\qquad$ N/A $\qquad$
All updated main memory word addresses or N/A: $\qquad$ 0x750, 0x754, 0x758, 0x75C

After testing, Ketchum decides to use the cache with the following RISC-V assembly program that increments every element in an array and stores the changed elements in another array.

```
// Assume the following registers are initialized:
// x1 = 0xC0 (base address of input array)
// x2 = 0x80 (base address of output array)
// x3 = 4 (number of elements in input and output arrays)
    . = 0x100 // The following code starts at address 0x100
    slli x6, x3, 2
    add x6, x1, x6 // address of end of input array
loop:
lw x4, 0(x1) // get array element
addi x4, x4, 1 // increment element
sw x4, 0(x2) // store element into output array
addi x1, x1, 4 // compute next address for input array
addi x2, x2, 4 // compute next address for output array
blt x1, x6, loop // continue looping
```

Answer the following questions about the behavior of the cache during execution of the above code. Note the cache has 2 ways and uses an LRU replacement policy. Assume that the cache is initially empty.
(D) (1 point) How many instruction fetches and data accesses occur per iteration of the loop?

Number of instruction fetches per loop iteration: $\qquad$ 6 $\qquad$
Number of data accesses per loop iteration: $\qquad$ 2 $\qquad$
(E) (1 point) How many instruction fetches and data accesses occur during execution of the entire program, including the instructions outside of the loop and the four iterations of the loop?

Total number of instruction fetches: $\qquad$ 26 $\qquad$
Total number of data accesses: $\qquad$ 8 $\qquad$

To help you think through the behavior of the cache on this program, we provide you with a diagram of the empty cache. You may use if you find it helpful, but you do not need to fill it out. Extra copies are available at the end of the exam.

Way 0

| $\mathbf{V}$ | $\mathbf{D}$ | Tag | Word 0 | Word 1 | Word 2 | Word 3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 0 | $0 \times 4$ | slli | add | lw | addi |
| 1 | 0 | $0 \times 4$ | sw | addi | addi | blt |
|  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |

Way 1

| $\mathbf{V}$ | $\mathbf{D}$ | Tag | Word 0 | Word 1 | Word 2 | Word 3 |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | $0 \times 3$, | $\mathrm{M}[0 \mathrm{xC0}]$, | $\mathrm{M}[0 \mathrm{xC4}]$, | $\mathrm{M}[0 \mathrm{xC} 8]$, | $\mathrm{M}[0 \mathrm{xCC}]$, |
|  |  | $0 \times 2$ | $\mathrm{M}[0 \mathrm{x} 80]$ |  |  |  |
| $\mathrm{M}[0 \mathrm{x} 84]$ |  |  |  |  |  |  |
| $\mathrm{M}[0 \mathrm{x} 88]$ |  |  |  |  |  |  |
| $\mathrm{M}[0 \mathrm{x} 8 \mathrm{C}]$ |  |  |  |  |  |  |
|  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |

Instructions miss: $1 / 4$ for fetching first word in block, so total of 2
Data miss: 0 xC 0 and 0 x 80 both map to index 0 so they keep swapping each other out so you miss every time, so total of 8
So 10 misses out of 34 or 24 hits out of 34 .
(F) (4 points) How many instruction fetch misses and data access misses occur during execution of the entire program?

Number of instruction fetch misses: $\qquad$ 2 $\qquad$
Number of data access misses: $\qquad$ 8 $\qquad$
(G) (1 point) What is the hit ratio for the execution of this program? You may leave your answer as a fraction.

Hit ratio: $\qquad$ $24 / 34=12 / 17$ $\qquad$
(H) (3 points) Ketchum wants to get the best performance out of his cache. He is considering modifying his current cache to double the number of cache lines while leaving all other parameters of the cache the same (2-way set associative and a block size of 4), thus doubling the total capacity of the cache. However, this new cache is a lot more expensive! Ketchum wants to choose the cheapest cache that maximizes the hit ratio. Which one should he choose? Explain your answer.

## Circle One: Current Cache

## Why?

Currently, the memory accesses overwrite each other every time since the indices overlap. If we have three bits to represent the index instead of two, the indices will no longer conflict, reducing the misses to $2 / 8$.

## Problem 5. Pipelined Processors (18 points)

Consider the loop below, which sums the values in a linked list. We run this code on a standard 5stage RISC-V processor with full bypassing. Assume that branches are always predicted not taken and that branch decisions are made in the EXE stage. Assume that the loop repeats many times and it's currently in the middle of its execution.

In case you need them, extra pipeline diagrams are available at the end of the quiz.

```
loop: lw a1, 0(a0)
    lw a0, 4(a0)
    add a2, a2, a1
    bnez a0, loop
    mv a0, a2
    addi sp, sp, 8
    ret
```

(A) (7 points) Fill in the pipeline diagram for cycles $100-109$, assuming that at cycle 100 the 1 w a1, 0(a0) instruction is fetched. Draw arrows indicating each use of bypassing. Ignore any cells shaded in gray.

|  | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | lw | lw | add | bnez | bnez | mv | addi | lw | lw | add |
| DEC |  | lw | lw | add | add | bnez | mv | NOP | lw | lw |
| EXE |  |  | $1 w$ | $1 w$ | NOP | add | bnez | NOP | NOP | lw |
| MEM |  |  |  | $1 w$ | $1 w$ | NOP | add | bnez | NOP | NOP |
| WB |  |  |  |  |  | $1 w$ | $1 w$ | NOP | add | bnez |

How many cycles does each iteration of the loop take? For each loop iteration, how many cycles are wasted due to stalls? How many are wasted due to annulments?

Number of cycles per loop iteration: $\qquad$ 7 $\qquad$

Number of cycles per loop iteration wasted due to stalls: $\qquad$ 1 $\qquad$
Number of cycles per loop iteration wasted due to annulments: $\qquad$ 2

Ben Bitdiddle is looking to improve the performance of his processor. He observes that it is possible to save some cycles with extra bypass paths. Despite its name, full bypassing is not using as many bypasses as possible: it is bypassing to DEC from all later stages, but it is only bypassing to DEC. Sometimes, instructions do not need their inputs in the EXE stage, but at a later stage, and stalling on DEC wastes cycles.

To solve this, Ben proposes a new bypassing scheme called extra bypassing. In extra bypassing:

- There is a bypass path from each stage to DEC and all other stages between DEC and the source of the bypass. For example, the 5 -stage pipeline has paths EXE->DEC, MEM->DEC, WB->DEC, MEM->EXE, WB->EXE, and WB->MEM. (Note full bypassing has only the first 3 of these paths.)
- As usual, each bypass path bypasses to the end of the stage (e.g., MEM->DEC bypasses to the end of DEC, right before the DEC-EXE pipeline register and after register reads, and MEM->EXE bypasses to the end of EXE, right before the EXE-MEM pipeline register and after the ALU).
- If an instruction depends on a value produced by an earlier instruction, and the value is still not available (on either the register file or a bypass path), the instruction does not stall until the stage before where the value is needed. For example, if a value is needed in EXE and has not been produced yet, the instruction will stall in DEC (as with full bypassing). But if the value is needed in MEM, the instruction will proceed from DEC to EXE, and stall in EXE until the value becomes available through a bypass path.
- Instructions capture missing inputs from bypass paths on the first opportunity they get, even if they are stalled for other reasons. (This ensures that, if an instruction advances past DEC, it always gets to bypass the correct input value before it needs it.)

The diagram below shows the bypass paths in full vs. extra bypassing.

(B) (4 points) Give an example 2-instruction sequence that incurs two cycles worth of stalls with full bypassing, but only one cycle with extra bypassing. Specify which bypass paths are exercised in each case.

## Example 2-instruction sequence:

lw a1, 0(a0)
sw a1, 0(a2)

Bypass paths exercised with full bypassing: _WB->DEC $\qquad$
Bypass paths exercised with extra bypassing: __WB->EXE $\qquad$

To better leverage extra bypassing, Ben implements a new 5 -stage pipeline, shown below, where the ALU is placed in the fourth stage:

- In this pipeline, the first two stages, IF and DEC, work as in the standard 5-stage pipeline; the third stage, AC, performs address calculation for loads and stores; the fourth stage, EXM, performs ALU operations and data accesses; and the fifth stage, WB, works as in the standard pipeline.
- Branches are predicted not-taken and resolved in the EXM stage.
- The pipeline has extra bypassing.

Consider again the code from part (A):


$$
\begin{aligned}
\text { loop: } & l w a 1, ~ 0(a \theta) \\
& l w a \theta, 4(a \theta) \\
& \text { add a2, a2, a1 } \\
& \text { bnez a0, loop } \\
& \text { mv a0, a2 } \\
& \text { addi } s p, s p, 8 \\
& \text { ret }
\end{aligned}
$$

(C) (7 points) Fill in the pipeline diagram for cycles 100-109, assuming that at cycle 100 the 1 w a1, $0(a 0)$ instruction is fetched. Draw arrows indicating each use of bypassing. Ignore any cells shaded in gray. Recall that the Execute module is in the EXM pipeline stage.

|  | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | lw | lw | add | bnez | mv | addi | ret | lw | lw | add |
| DEC |  | lw | lw | add | bnez | $m v$ | addi | NOP | lw | lw |
| EXE |  |  | lw | lw | add | bnez | mv | NOP | NOP | lw |
| MEM |  |  |  | $1 w$ | $1 w$ | add | bnez | NOP | NOP | NOP |
| WB |  |  |  |  | lw | $1 w$ | add | bnez | NOP | NOP |

How many cycles does each iteration of the loop take? For each loop iteration, how many cycles are wasted due to stalls? How many are wasted due to annulments?
$\qquad$
Number of cycles per loop iteration wasted due to stalls: _ 0 $\qquad$
Number of cycles per loop iteration wasted due to annulments: $\qquad$ 3 $\qquad$

## Problem 6. Processor Pipeline Performance (16 points)

You are designing a 5- stage pipelined (IF, DEC, EXE, MEM, WB) RISC-V processor with the same functionality in each stage that we have seen in lecture:

- IF: Initiate instruction fetch
- DEC: Decode instruction and gather source operands (stall if not available)
- EXE: Perform ALU operations and resolve branches
- MEM: Initiate data memory accesses
- WB: Write results back to register file

The processor has the following features:

- The instruction memory responds to every request in one cycle.
- The data memory responds to cache hits in one cycle. The cache miss penalty of the data memory is $\mathbf{1}$ additional cycle.
- You have a cache with 4-word cache lines for the data memory.
- The processor predicts that all branches are TAKEN and can start fetching the instruction at the target address on the cycle immediately following the branch.

This processor will spend most of its time executing the following loop:

```
loop:
0x100 lw x1, 0(x2)
0x104 lw x3, 0(x1) # address depends on value loaded above
0x108 add x4, x3, x1
0x10c sw x4, 0(x2)
0x110 addi x2, x2, 4
0x114 addi x6, x6, -1 # assume x6 initially is 128
0x118 bne x6, x0, loop
```

Assume that:

- $\quad x 2$ has an initial value of $0 x 1000$, so on the first iteration, the lw $x 1,0(x 2)$ instruction accesses a memory address with cache block offset 0 .
- The memory location accessed by the $1 w x 3,0(x 1)$ instruction does not overlap with the locations accessed by $1 w \times 1,0(x 2)$ and $s w x 4,0(x 2)$, and each access maps to a different cache line.

The code is repeated here for your reference:
loop:
$0 x 100$ lw x1, $0(x 2)$
$0 x 104$ lw x3, $0(x 1)$ \# address depends on value loaded above
$0 \times 108$ add $x 4, x 3, x 1$
0x10c sw x4, 0(x2)
$0 \times 110$ addi $x 2, \times 2,4$
$0 x 114$ addi x6, x6, -1 \# assume x6 initially is 128
$0 \times 118$ bne x6, x0, loop
(A) (8 points) We are concerned with average performance and the loop runs for many iterations.

Fill out the pipeline diagram below for the $5^{\text {th }}$ iteration of the loop. Note that the processor predicts the branch to be taken. You may leave boxes blank to indicate NOP operations.
Draw arrows to indicate where a bypass path was used. Then specify the number of cycles for the $5^{\text {th }}$ iteration of the loop.

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | lw | lw | add | add | add | add | sw | sw | sw | sw | addi | addi | bne | lw | lw | add |
| DEC |  | lw | lw | lw | lw | lw | add | add | add | add | sw | addi | addi | bne | lw | lw |
| EXE |  |  | lw |  |  |  | lw |  |  |  | add | sw | addi | addj | bne | lw |
| MEM |  |  |  | lw |  |  |  | lw |  |  |  | add | sw | addi | addi | bne |
| WB |  |  |  |  | lw | lw |  |  | lw | lw |  |  | add | sw | addi | addi |

Number of cycles for $5^{\text {th }}$ iteration of the loop: $\qquad$ 13 $\qquad$
(B) (3 points) How many cycles does the $6^{\text {th }}$ iteration of the loop take? The processor again predicts the branch to be taken.

Number of cycles for $6^{\text {th }}$ iteration of the loop: $\qquad$ 12 $\qquad$
(C) (3 points) On average, how many cycles does a loop iteration take?
$(3 * 12+13) / 4=49 / 4=12.25$
Cycles per iteration: $\qquad$ 12.25 $\qquad$
(D) (2 points) We now slightly modify the loop to use a multiply instruction. All other instructions are the same as before. The execute stage of the multiplication takes 4 cycles.
loop:
0x100 lw x1, 0(x2)
0x104 lw x3, 0(x1) \# address depends on value loaded above
$0 x 108$ mul $x 4, ~ x 3, ~ x 1$
0x10c sw x4, 0(x2)
$0 \times 110$ addi $\times 2, \times 2,4$
0x114 addi x6, x6, -1 \# assume x6 initially is 128
$0 x 118$ bne x6, x0, loop

On average, how many additional cycles does an iteration of this new loop take versus your answer in part C ?

Additional cycles per iteration: $\qquad$ 3 $\qquad$

END OF QUIZ 2!

