Please enter your name, Athena login name, and recitation section above. Enter your answers in the spaces provided below. Show your work for potential partial credit. You can use the extra white space and the back of each page for scratch work.
Problem 1. Sequential Circuits in Minispec (16 points)

You are given the following Minispec sequential module:

```minispec
typedef struct {Bit#(16) a; Bit#(16) b;} Args;

module Comp;
    Reg#(Bit#(16)) x(1);
    Reg#(Bit#(16)) y(0);

    input Maybe#(Args) in;

    rule step;
        if (isValid(in)) begin
            let args = fromMaybe(?, in);
            x <= args.a;
            y <= args.b;
        end else if (x != 0) begin
            if (x >= y) begin
                x <= x - y;
            end else begin
                x <= y;
                y <= x;
            end
        end
    endrule

    method Maybe#(Bit#(16)) result = (x == 0)? Valid(y) : Invalid;
endmodule
```

(A) (6 points) Fill in the table below with the value of each variable at each cycle.

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td>in</td>
<td>Invalid</td>
<td>Valid(A</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
</tr>
<tr>
<td></td>
<td>rs{a:6 ,b:4})</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>x</td>
<td>1</td>
<td>1</td>
<td>6</td>
<td>2</td>
<td>4</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>y</td>
<td>0</td>
<td>0</td>
<td>4</td>
<td>4</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>result</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Invalid</td>
<td>Valid(2)</td>
<td>Valid(2)</td>
</tr>
</tbody>
</table>

6.191 Spring 2024 - 2 of 20 - Quiz #2 Solutions
Now that you have warmed up, you are asked to design a parameterized matrix multiplication module for matrix product AxB where A and B are square matrices, such that the number of columns and rows in each matrix is $n$. Matrix multiplication is defined as demonstrated in the following diagram:

$$c_{i,j} = \sum_{k=0}^{n-1} a_{i,k} \times b_{k,j}$$

Your module can accept 2 inputs at a time, row and column. row and column are Maybe Vectors of $n$ 32-bit elements. When both row and column inputs are valid, then your module should compute the dot product of these two vectors and assign the result to the correct location of temp_matrix (which is indexed by row_counter and col_counter) in the code below. After your module computes all the matrix product elements, your module should output a valid result, the matrix product C. Assume that all addition (+) and multiplication (*) operations in your matrix multiply use 32-bit inputs and produce 32-bit outputs.

It is important to note that you cannot assume that both row and column inputs will be valid every cycle.

Your module will receive the pairs of inputs in the following order and you can assume when a valid input pair is received, it follows the following pattern:

<table>
<thead>
<tr>
<th>Input Pair</th>
<th>Row</th>
<th>Column</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>n-1</td>
<td>0</td>
<td>n-1</td>
</tr>
<tr>
<td>n</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>n+1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
B) (8 points) Complete the Minispec skeleton below to implement matrix multiplication for $n$ by $n$ matrix inputs.

```
module MatrixMultiply#(Integer n);

    input Maybe#(Vector#(n, Bit#(32))) row;
    input Maybe#(Vector#(n, Bit#(32))) column;

    Reg#(Bit#(log2(n))) row_counter(0);
    Reg#(Bit#(log2(n))) col_counter(0);
    Reg#(Bit#(1)) done(0);

    Vector#(Vector#(n, Reg#(Bit#(32))) temp_matrix = unpack(0);

    rule tick;
        if (row_counter == (n-1) && col_counter == (n-1) && isValid(row) && isValid(column)) done <= 1;
        else done <= 0;

        if( isValid(row) && isValid(column) ) begin
            Vector#(n, Bit#(32)) row_in = ___ fromMaybe(? , row) _____;
            Vector#(n, Bit#(32)) col_in = ___ fromMaybe(? , column) ____;
            Bit#(32) total_sum = 0;
            for(Integer k = 0; k < n; k = k + 1) begin
                ___ total_sum = total_sum + row_in[k]*col_in[k] ____;
            end
            temp_matrix[row_counter][col_counter] <= ___ total_sum ____;
        end

        if (col_counter == n - 1) begin
            col_counter <= 0 _______
            if (row_counter == n - 1) begin
                row_counter <= 0 _______
            end
            end __ row_counter <= row_counter + 1 ____;
        end else __ col_counter <= col_counter + 1 ____;
    end
endrule
```
(C) (2 points) Complete the `result` method which returns a valid matrix `C` once it has been fully computed and returns `Invalid` otherwise.

```verilog
method Maybe#(Vector#(n, Vector#(n, Bit#(32)))) result;
  Vector#(n, Vector#(n, Bit#(32))) ret = unpack(0);
  for (Integer i = 0; i < n; i = i + 1) begin
    for (Integer j = 0; j < n; j = j + 1) begin
      ret[i][j] = __temp_matrix[i][j]__;
    end
  end
  if (__done == 1__) return __Valid(ret)__;
  else return __Invalid____;
endmethod
endmodule
```
Problem 2. Arithmetic Pipelines (16 points)

Octavian the octopus accidentally fried his sister's Kelp-o-Meter, but luckily, he was able to buy a working replacement. However, the new device's throughput is much lower than the original's, so Octavian asks you to help him pipeline it. A Kelp-o-Meter (KELP for short) has two inputs, X and Y, and two outputs, K and L.

For each of the questions below, please create a valid k-stage pipeline of the given circuit. Each component in the circuit is annotated with its propagation delay in nanoseconds. Show your pipelining contours and place large black circles (●) on the signal arrows to indicate the placement of pipeline registers. Give the latency and throughput of each design, assuming ideal registers (t_{PD}=0, t_{SETUP}=0). Remember that our convention is to place a pipeline register on each output.

(A) (1 point) Based on the circuit shown in part (B), what is the propagation delay of the whole KELP circuit as-is, without pipelining?

\[ t_{PD} \text{ (ns)}: 29 \]

(B) (3 points) Show the maximum-throughput 2-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit? Pay close attention to the direction of each arrow. Show your pipeline contours and each pipeline register. In case you need them, extra copies of the circuit are available at the end of the quiz.

\[ \text{Latency (ns)}: 30 \]

\[ \text{Throughput (ns}^{-1}) \text{)}: 1/15 \]
(C) (4 points) Show the **maximum-throughput 3-stage pipeline** using a minimal number of registers. What are the latency and throughput of the resulting circuit? Show your pipeline contours and each pipeline register. In case you need them, extra copies of the circuit are available at the end of the quiz.

Latency (ns): ___33_____
Throughput (ns⁻¹): ___1/11_____

(D) (4 points) Show the **maximum-throughput pipeline** using a minimal number of registers. What are the latency and throughput of the resulting circuit? Show your pipeline contours and each pipeline register. In case you need them, extra copies of the circuit are available at the end of the quiz.

Latency (ns): ___32_____
Throughput (ns⁻¹): ___1/8_____

6.191 Spring 2024 - 7 of 20 - Quiz #2 Solutions
(E) Octavian came up with a brilliant birthday present for his sister: an add-on module that connects to the outputs of her new Kelp-o-Meter. He is considering three different models, CRAB, STAR, and FISH, which have the same functionality, but differ in throughput and number of pipeline stages (given in the table below).

<table>
<thead>
<tr>
<th>Add-on Module</th>
<th>Throughput (ns⁻¹)</th>
<th>Pipeline Stages</th>
</tr>
</thead>
<tbody>
<tr>
<td>CRAB</td>
<td>1/6</td>
<td>4</td>
</tr>
<tr>
<td>STAR</td>
<td>1/8</td>
<td>2</td>
</tr>
<tr>
<td>FISH</td>
<td>1/12</td>
<td>1</td>
</tr>
</tbody>
</table>

(i) (2 points) Octavian thinks maximizing throughput is most important. Which versions of the pipelined KELP module and add-on module should Octavian choose, and what are the resulting latency and throughput? If two combinations have identical throughput, choose the one with better latency.

Module KELP (circle one):

2-stage pipeline 3-stage pipeline Maximum-throughput pipeline

Add-on Module (circle one):

CRAB (T = 1/6) STAR (T = 1/8) FISH (T = 1/12)

Latency (ns): 48
Throughput (ns⁻¹): 1/8

(ii) (2 points) After thinking for a bit, Octavian starts worrying about the latency of the combined modules. To minimize latency, which versions of the pipelined KELP module and add-on module should Octavian choose, and what are the resulting latency and throughput? If two combinations have identical latency, choose the one with better throughput.

Module KELP (circle one):

2-stage pipeline 3-stage pipeline Maximum-throughput pipeline

Add-on Module (circle one):

CRAB (T = 1/6) STAR (T = 1/8) FISH (T = 1/12)

Latency (ns): 45
Throughput (ns⁻¹): 1/15
Problem 3. Processor Implementation (14 points)

Reggie Ster has written a program in RISC-V assembly that repeatedly calculates an address and jumps to that address. Reggie’s program has this code pattern repeated many times:

\[
\text{slli } x3, x3, 2 \\
\text{add } x2, x4, x3 \\
\text{jalr } x1, 0(x2)
\]

Reggie notices that the offset for the jalr instruction is always 0 for his program. To make his code more efficient, Reggie decides to combine these 3 instructions into a single calculate and jump instruction that can be executed in one cycle. This is the new instruction Reggie wants to add to the RISC-V ISA:

**calcj** rd, rs1, rs2

The **calcj** instruction computes an address based on the values in registers rs1 and rs2, then jumps to that address. The instruction also stores pc+4 to the rd register. The behavior of this new instruction is summarized in the code below:

\[
\text{reg}[rd] \Leftarrow pc + 4 \\
\text{JT} = \text{reg}[rs1] + (\text{reg}[rs2] \ll 2) \\
\text{pc} \Leftarrow \{\text{JT}[31:1], 1'b0\}
\]

Reggie has chosen to encode the **calcj** instruction in the following way:

<table>
<thead>
<tr>
<th>31...25</th>
<th>24...20</th>
<th>19...15</th>
<th>14...12</th>
<th>11...7</th>
<th>6...0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000001</td>
<td>rs2</td>
<td>rs1</td>
<td>101</td>
<td>rd</td>
<td>0111011</td>
</tr>
</tbody>
</table>

(A) (1 point) Encode the following instruction as a 32-bit binary word. Provide your encoding in hexadecimal notation.

\[
\text{calcj } x1, x4, x3
\]

\[
0000|001 0|0011 |0010|0 101 |0000|1 011|1011 \\
0x023250BB
\]

Encoding in hexadecimal 0x: **023250BB**
Reggie introduces a new \( \ll 2 \) module (highlighted below) to the processor diagram from lecture. This module left-shifts the second register value by two, with the shifted output labeled as \( rVal2shift \).

Please also note that the **logic for computing branches and jumps** is provided in bold in the diagram below. Help Reggie modify the processor implementation below to support the new calculate and jump instruction.

(B) (2 points) For each of the following signals, determine whether the mux being controlled by that signal needs an extra input to accommodate the new instruction. If so, indicate the name of the signal that needs to be added as an input to the mux. If not, indicate which existing value of the mux control signal is required to make the instruction work properly.

**BSEL:** Needs new input (circle one)? **YES**  **NO**  
New input/Existing control signal: __rVal2shift__

**WDSEL:** Needs new input (circle one)? **YES**  **NO**  
New input/Existing control signal: __PC+4 (0)___
(C) (3 points) To support the `calcj` instruction, Reggie decides he wants to add an input 4 to the PCSEL mux and connect `aluResult` to this new input signal (i.e., Reggie replaces the question mark, at input 4 of the PCSEL mux, with `aluResult`). The decoder will set PCSEL = 4 for the `calcj` instruction.

Is there a problem with Reggie’s approach? **If yes,** explain the issue and suggest what Reggie should do instead to update the PC register correctly. **If no,** explain why Reggie’s approach works.

Is Reggie’s approach problematic (circle one)? [YES] [NO]

**Explanation:**
If the result of the addition is odd, then `aluResult` will give a PC that is not half-word aligned. Instead of creating a new input and connecting `aluResult`, Reggie should use the existing input 2 `{JT[31:1], 1'b0}` (which is essentially `aluResult` with the last bit cleared)

(D) (5 points) Decide for each of the following control signals what their values should be when executing the `calcj` instruction. If the value of the signal doesn’t matter, then put N/A. The possible values for each signal are provided below.

**AluFunc:** Add, Sub, And, Or, Xor, Slt, Sltu, Sll, Srl, Sra

**BrFunc:** Eq, Neq, Lt, Ltu, Ge, Geu

**MemFunc:** Lw, Lh, Lhu, Lb, Lbu, Sw, Sh, Sb

**MemEnable:** True, False

**WERF:** True, False

<table>
<thead>
<tr>
<th><strong>AluFunc:</strong></th>
<th>Add</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>BrFunc:</strong></td>
<td>N/A</td>
</tr>
<tr>
<td><strong>MemFunc:</strong></td>
<td>N/A</td>
</tr>
<tr>
<td><strong>MemEnable:</strong></td>
<td>False</td>
</tr>
<tr>
<td><strong>WERF:</strong></td>
<td>True</td>
</tr>
</tbody>
</table>
(E) (3 points) Reggie modifies his program slightly so that the repeated code block in his program becomes:

\[
\begin{align*}
\text{sl} \text{li } x3, x3, 2 \\
\text{add } x2, x4, x3 \\
\text{jalr } x1, 4(x2)
\end{align*}
\]

Reggie notices that the offset for the \textit{jalr} instruction is now \textbf{always 4} for his program. He wants to make another \textit{calculate and jump four} instruction, which would have the following code implementation:

\[
\begin{align*}
\text{reg}[rd] & \leftarrow pc + 4 \\
\text{JT} & = (\text{reg}[rs1] + (\text{reg}[rs2]<<(2)) + 4 \\
\text{pc} & \leftarrow \{\text{JT}[31:1], 1'b0\}
\end{align*}
\]

Could we modify \textbf{only the control signals} of the processor above (with the additional shifter module) to support this instruction? \textbf{Explain your answer}.

Could we implement a calculate and jump four instruction (circle one)? \hspace{1cm} \textbf{YES} \hspace{1cm} \textbf{NO}

\textbf{Explanation:}

The processor with an additional shifter module allows us to support instructions that require a 2-bit left shift and one additional ALU operation. The calculate and jump four instruction would require a 2-bit left shift and 2 additions. This is not possible to implement in our processor without an additional adder module.
Problem 4. Caches (18 points)

Assume that addresses and data words are 32 bits. Consider a 4-way set associative cache with 64 sets and a block size of 4.

(A) (2 points) Which address bits should the cache use for the cache index, tag field, and block offset. Write [ X : X ] if no bits are used.

- Address bits for byte offset: $A[1:0]$
- Address bits for tag field: $A[31:10]$
- Address bits for block offset: $A[3:2]$

Now consider a 2-way set associative cache with 4 sets and a block size of 4. You will use this architecture for parts (B) – (E).

(B) (3 points) How will the following cache parameters change in this new cache relative to the cache in part (A)? Please circle the best answer. If ‘Other’, write the change.

- # of cache index bits: UNCHANGED … +1 … -1 … +2 … -2 … CAN’T TELL…Other: ___-4___
- # of tag field bits: UNCHANGED … +1 … -1 … +2 … -2 … CAN’T TELL…Other ___4___
- # of block offset bits: UNCHANGED … +1 … -1 … +2 … -2 … CAN’T TELL…Other: __________

(C) (8 points) Now analyze the performance of this cache (2-way set associative, 4 sets, block size of 4 words) using the following assembly program, which iterates through array A (base address 0x40) and stores the result of $4*A[i]$ into array B (base address 0x80).

```
// x1 = 0 (loop index i)
// x2 = 4 (number of elements in array A)
. = 0x0 // The following code starts at address 0x0
loop:
    slli x3, x1, 2 // convert to byte offset
    lw x4, 0x40(x3) // load value from A[i]
    slli x4, x4, 2 // x4 = 4 * A[i]
    sw x4, 0x80(x3) // store 4 * A[i] into B[i]
    addi x1, x1, 1 // increment index
    blt x1, x2, loop // loop 4 times
    unimp // halt, all done
```

6.191 Spring 2024 - 13 of 20 - Quiz #2 Solutions
Assume the cache is empty before execution of this code (i.e., all valid bits are 0). Assume that the cache uses a least-recently used (LRU) replacement policy, and that all cache lines in Way 0 are currently the least-recently used. Fill in, or update, all the known values of the LRU bit, the dirty bit (D), the valid bit (V), the tag, and the data words after one loop iteration (after executing the blt instruction for the first time). For word fields, fill them in with the opcode if they are instructions (e.g., blt) or fill them in with the array element if they are data (e.g., \( A[0] \)).

You may assume that if \( V = 0 \) then \( D = 0 \).

```c
// x1 = 0 (loop index i)
// x2 = 4 (number of elements in array A)
.x = 0x0    // The following code starts at address 0x0
loop:
    slli x3, x1, 2   // convert to byte offset
    lw x4, 0x40(x3)  // load value from A[i]
    slli x4, x4, 2   // x4 = 4 * A[i]
    sw x4, 0x80(x3)  // store 4 * A[i] into B[i]
    addi x1, x1, 1   // increment index
    blt x1, x2, loop // loop 4 times

    unimp    // halt, all done
```

### Way 0 (After one loop iteration)

<table>
<thead>
<tr>
<th>Index</th>
<th>V</th>
<th>D</th>
<th>Tag</th>
<th>Word 3</th>
<th>Word 2</th>
<th>Word 1</th>
<th>Word 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0x0</td>
<td>sw</td>
<td>slli</td>
<td>lw</td>
<td>slli</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0x0</td>
<td>unimp</td>
<td>blt</td>
<td>addi</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Way 1 (After one loop iteration)

<table>
<thead>
<tr>
<th>Index</th>
<th>V</th>
<th>D</th>
<th>Tag</th>
<th>Word 3</th>
<th>Word 2</th>
<th>Word 1</th>
<th>Word 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
(D) (4 points) Fill out the table below with the number of instruction hits, instruction misses, data hits, and data misses for each of the four iterations of the loop.

<table>
<thead>
<tr>
<th></th>
<th>Instructions</th>
<th></th>
<th>Data</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Hits</td>
<td>Misses</td>
<td>Hits</td>
<td>Misses</td>
</tr>
<tr>
<td>First loop iteration</td>
<td>4</td>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>Second loop iteration</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>Third loop iteration</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>Fourth loop iteration</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
</tbody>
</table>

(E) (1 point) What is the hit ratio for the execution of the four loop iterations (Note: do not include execution of the unimp instruction)? You may leave your answer as a fraction.

Hit Ratio: \( \frac{22}{32} \)
Problem 5. Pipelined Processors (18 points)

Ben Bitdiddle writes the following loop in RISC-V assembly to multiply elements of an array by 3. The array is of size \( n \) and is stored in memory beginning at address \( 0x500 \).

```assembly
// a0 = n = 100
// a1 = 0 = loop index i
loop:
    slli a2, a1, 2      // multiply index by 4
    lw a3, 0x500(a2)
    slli a4, a3, 1
    add a4, a4, a3
    sw a4, 0x500(a2)    // a[i] = 3 * a[i]
    addi a1, a1, 1      // increment loop index i
    blt a1, a0, loop
    ori a0, a2, 2       // some instructions following the loop
    xori a2, a0, 4
    and a3, a2, a1
    and a3, a2, a1
```

Ben runs this on a 4-stage pipeline (IF, DEC, EXE/MEM, WB). In this pipeline:
- The EXE and MEM stages have been merged into one pipeline stage.
- The result of a \( \text{lw} \) operation is available at the beginning of the WB stage.
- Branches are predicted not taken.
- Branches and jumps are resolved in the EXE/MEM stage.
- Full bypassing is implemented.

(A) (7 points) Fill in the pipeline diagram below for cycles 100-112, assuming that at cycle 100 the \( \text{slli} \ a2, \ a1, \ 2 \) instruction is fetched. Assume the loop runs for many iterations. **Draw arrows indicating each use of bypassing.** Ignore cells shaded in gray.

<table>
<thead>
<tr>
<th></th>
<th>100</th>
<th>101</th>
<th>102</th>
<th>103</th>
<th>104</th>
<th>105</th>
<th>106</th>
<th>107</th>
<th>108</th>
<th>109</th>
<th>110</th>
<th>111</th>
<th>112</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>slli</td>
<td>lw</td>
<td>slli</td>
<td>add</td>
<td>add</td>
<td>sw</td>
<td>addi</td>
<td>blt</td>
<td>ori</td>
<td>xori</td>
<td>slli</td>
<td>lw</td>
<td>slli</td>
</tr>
<tr>
<td>DEC</td>
<td></td>
<td>slli</td>
<td>lw</td>
<td>slli</td>
<td>slli</td>
<td>add</td>
<td>sw</td>
<td>addi</td>
<td>blt</td>
<td>ori</td>
<td>NOP</td>
<td>slli</td>
<td>lw</td>
</tr>
<tr>
<td>EXE/MEM</td>
<td>slli</td>
<td>lw</td>
<td>NOP</td>
<td>slli</td>
<td>add</td>
<td>sw</td>
<td>addi</td>
<td>blt</td>
<td>NOP</td>
<td>NOP</td>
<td>slli</td>
<td>lw</td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td></td>
<td>slli</td>
<td>lw</td>
<td>NOP</td>
<td>slli</td>
<td>add</td>
<td>sw</td>
<td>addi</td>
<td>blt</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td></td>
</tr>
</tbody>
</table>

(B) (2 points) How many cycles does each iteration of the loop take? For each loop iteration, how many cycles are wasted due to stalls? How many are wasted due to annulments?

- **Number of cycles per loop iteration:** \( 10 \)  
- **Number of cycles per loop iteration wasted due to stalls:** \( 1 \)  
- **Number of cycles per loop iteration wasted due to annulments:** \( 2 \)
(C) (4 points) Suppose Ben now modifies his processor so that branches are always predicted to be taken. Assume that everything else about the processor remains unchanged. Fill in the pipeline diagrams below assuming that at cycle 200 the `addi a1, a1, 1` is fetched, and the branch is taken. **Draw arrows indicating each use of bypassing.** Ignore cells shaded in gray.

<table>
<thead>
<tr>
<th>cycles</th>
<th>200</th>
<th>201</th>
<th>202</th>
<th>203</th>
<th>204</th>
<th>205</th>
<th>206</th>
<th>207</th>
<th>208</th>
<th>209</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>addi</td>
<td>blt</td>
<td>slli</td>
<td>lw</td>
<td>slli</td>
<td>add</td>
<td>add</td>
<td>sw</td>
<td>addi</td>
<td>blt</td>
</tr>
<tr>
<td>DEC</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>EXE/MEM</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(D) (2 points) How many cycles does the execution of the loop take when branches are predicted to be taken? For each loop iteration, how many cycles are wasted due to stalls? How many are wasted due to annulments?

- Number of cycles per loop iteration: 8
- Number of cycles per loop iteration wasted due to stalls: 1
- Number of cycles per loop iteration wasted due to annulments: 0

(E) (3 points) Ben decides to continue using the version of his processor that predicts that branches are always taken. However, for cost saving reasons, he needs to remove one of the bypass paths from his processor. Should he choose to remove the EXE/MEM to DEC bypass or the WB to DEC bypass? With the selected bypass path removed, how many cycles does each iteration of the loop take? For each loop iteration, how many cycles are wasted due to stalls? How many are wasted due to annulments?

**Bypass path to remove (circle one):** EXE/MEM to DEC

- Num cycles per loop iteration after removing bypass path: 10
- Num cycles per loop iteration wasted due to stalls after removing bypass path: 3
- Num cycles per loop iteration wasted due to annulments after removing bypass path: 0
Problem 6. Pipelined Processor Performance (18 points)

The following loop sums up the elements of an array:

```
loop:   lw a1, 0(a2)
        add a0, a1, a0
        addi a2, a2, 4
        blt a2, a3, loop
        xor a1, a4, a5 // some code after the loop
        sub sp, sp, a6
        ret
```

(A) (3 points) Assume a standard 5-stage RISC-V pipelined processor with full bypassing. In steady state, how many cycles does each iteration of the loop take? Note that all branches are predicted not taken.

```
Instructions per loop iteration: ______4______
Cycles per loop iteration lost to stalls: ______2______
Cycles per loop iteration lost to annulments: ______2______
Cycles per loop iteration: ______8______
```

(B) (2 points) Reorder the instructions in the loop to improve performance. How many cycles per iteration does your code achieve?

```
loop:  ___ lw a1, 0(a2)______________
       ___ addi a2, a2, 4 ____________
       ___ add a0, a1, a0 ____________
        blt a2, a3, loop
```

(There are also 6-cycle solutions if you reorder across loop iterations, but the code for those is more complex.)

```
Cycles per loop iteration with reordered code: ______7______
```
Ben Bitdiddle notices that it’s common for code to have load instructions for values that are used by a single ALU instruction, like the `lw` and `add` instruction pair in the previous loop. He proposes to change the RISC-V ISA to support ALU instructions where the first operand comes from memory instead of a register. These instructions have the form:

\[
\text{op rd, (rs1), rs2: } \text{Reg[rd]} \leftarrow \text{Mem[Reg[rs1]] op Reg[rs2]}
\]

With this ISA change, the previous loop can be rewritten as follows, saving one instruction:

```
loop:   add a0, (a2), a0
        addi a2, a2, 4
        blt a2, a3, loop
        xor a1, a4, a5
        sub sp, sp, a6
        ret
```

To support this new instruction type, Ben implements the 5-stage pipeline shown here. The key difference with the standard 5-stage pipeline is that the MEM stage comes before the EXE stage, not after. This allows these new ALU instructions to load one of their operands from memory.

As in the classic 5-stage pipeline, branches and jumps are resolved in the EXE stage (but note that this is now the fourth stage in the pipeline, not the third one).

(C) (6 points) Assume that this pipeline implements the same bypass paths as the classic 5-stage pipeline: values can be bypassed from the MEM, EXE, and WB stages to the DEC stage. Analyze the performance of the loop above. Fill out the pipeline diagram below for the first 10 cycles and calculate the number of cycles this processor takes to execute one iteration of the above loop. Fill in any stalled/annulled stages with NOPs and clearly show all uses of the bypass paths using arrows. Ignore cells shaded in gray.

<table>
<thead>
<tr>
<th>Cycle</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>xor</td>
<td>xor</td>
<td>sub</td>
<td>ret</td>
<td>add</td>
<td>addi</td>
<td>blt</td>
</tr>
<tr>
<td>DEC</td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>blt</td>
<td>blt</td>
<td>xor</td>
<td>sub</td>
<td>NOP</td>
<td>add</td>
<td>addi</td>
</tr>
<tr>
<td>MEM</td>
<td>add</td>
<td>addi</td>
<td>NOP</td>
<td>blt</td>
<td>xor</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td>add</td>
<td></td>
</tr>
<tr>
<td>EXE</td>
<td>add</td>
<td>addi</td>
<td>NOP</td>
<td>blt</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td></td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td>add</td>
<td>addi</td>
<td>NOP</td>
<td>blt</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Cycles per loop iteration: ______ 7 ______
(D) (4 points) Let’s consider adding even more bypass paths: assume that values can be bypassed from the MEM, EXE, and WB stages to the DEC stage (like before), and from the EXE stage to the MEM stage. Bypassing from EXE to MEM lets us have back-to-back dependences among ALU instructions without stalls, like in the classic pipeline, even though this pipeline has an extra stage (MEM) between DEC and EXE. The bypass paths to MEM work as follows: if the instruction in DEC reads a value that will be produced by an ALU instruction that is currently in MEM, the instruction in DEC does not stall and instead relies on the EXE→MEM bypass path to provide this value on the next cycle.

Analyze the performance of the loop above. Fill out the pipeline diagram below for the first 10 cycles and calculate the number of cycles this processor takes to execute one iteration of the loop. Fill in any stalled/annulled stages with NOPs and clearly show all uses of the bypass paths using arrows. Ignore cells shaded in gray.

<table>
<thead>
<tr>
<th>Cycle</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>xor</td>
</tr>
<tr>
<td>DEC</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>xor</td>
</tr>
<tr>
<td>MEM</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>xor</td>
</tr>
<tr>
<td>EXE</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>NOP</td>
</tr>
<tr>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>add</td>
<td>addi</td>
<td>blt</td>
<td>NOP</td>
</tr>
</tbody>
</table>

Cycles per loop iteration: ______ 6 ______

(E) (3 points) In this pipeline, branches and jumps are resolved in a later stage than memory accesses. Is it safe to use speculation to resolve control hazards? If so, explain why. Otherwise, give a code sequence that produces incorrect behavior on mis-speculation and explain why the code sequence misbehaves. Include any assumptions you are making in your explanation.

This pipeline still produces correct execution, but by a hair. The key constraint is that instructions cannot modify architectural state until they are non-speculative. Because memory is placed before branch resolution, the worst-case sequence is a branch or jump followed by a store instruction. In this case, the store instruction can be annulled while it is in MEM and before it writes to the data cache. If EXE happened one cycle later, this would not be possible.

Also accepting solutions that say it doesn’t work if you assume that the sw can alter state in the MEM stage before being annulled.