6.191 Worksheet Questions
L13 – Caches

Keep the most often-used data in a small, fast SRAM (often local to CPU chip). The reason this strategy works: LOCALITY.

- Temporal locality: If a location has been accessed recently, it is likely to be accessed (reused) soon
- Spatial locality: If a location has been accessed recently, it is likely that nearby locations will be accessed soon

AMAT(Average Memory Access Time) = HitTime + MissRatio * MissPenalty

Direct-Mapped Caches

- Each word in memory maps into a single cache line
- Access (for cache with \(2^w\) lines):
  - Index into cache with \(W\) address bits (the index bits)
  - Read out valid bit, tag, and data
  - If valid bit == 1 and tag matches upper address bits, HIT

Example 8-line direct-mapped cache:
Fully-Associative Cache

Opposite extreme: Any address can be in any location

- No cache index!
- Flexible (no conflict misses)
- Expensive: Must compare tags of all entries in parallel to find matching one

32-bit BYTE address

N-way Set-Associative Cache

- Use multiple direct-mapped caches in parallel to reduce conflict misses
- Nomenclature:
  - # Rows = # Sets
  - # Columns = # Ways
  - Set size = #ways = “set associativity” (e.g., 4-way → 4 lines/set)
- Each address maps to only one set, but can be in any way within the set
- Tags from all ways are checked in parallel

- Fully-associative cache: Extreme case with a single set and as many ways as cache lines
**Cache summary**

How to break up address:
- Bottom 2 bits of memory are the byte offset.
- Number of block offset bits = \( \log_2 \) (words per block)
- Number of index bits = \( \log_2 \) (number of sets (rows))

Cache structure:
- In a direct mapped cache, number of sets = number of rows = number of cache lines
- In a set-associative cache, number of sets = number of rows = number of cache lines / # of ways (e.g., 2-way set associative cache has 2 cache lines in a set. A cache line = a block so it may contain multiple words).
- In a fully-associative cache, number of sets = 1 so there are no index bits.
Problem 1 (parts A and B are from worksheet 12)

The RISC-V Engineering Team is working on the design of a cache. They’ve decided that the cache will have a total of $2^{10} = 1024$ data words, but are still thinking about the other aspects of the cache architecture.

First assume the team chooses to build a direct-mapped cache with a block size of 4 words.

(A) Please answer the following questions:

Number of lines in the cache: 256

In a direct-mapped cache, each line has one block. If each block has 4 words and we want a total of 1024 words, we must have $(1024 \text{ words}) / (4 \text{ words/block}) / (1 \text{ block/line}) = 256$ cache lines.

Number of bits in the tag field for each cache entry: 20

With 256 cache lines, we need 8 bits to index into them (index bits). This gives us one block. Within each block, we have 4 words. Each word is 4 bytes. Thus, within a block, we need 2 bits to index into the block and retrieve the word (block offset) and then 2 bits to index into the word to retrieve an individual byte (byte offset). Since memory addresses are 32 bits wide, we have $32 - 8 - 2 - 2 = 20$ bits remaining for the tag.

(B) This cache takes 2 clock cycles to determine if a memory access is a hit or a miss and, if it’s a hit, return data to the processor. If the access is a miss, the cache takes 20 additional clock cycles to fill the cache line and return the requested word to the processor. If the hit rate is 90%, what is the processor’s average memory access time in clock cycles?

Average memory access time assuming 90% hit rate (clock cycles): 4

$2 + (1-90\%) \times 20 = 4$

Now assume the team chooses to build a 2-way set-associative write-back cache with a block size of 4 words. The total number of data words in the entire cache is still 1024. The cache uses a LRU replacement strategy.

(C) Please answer the following questions:

1024/4 words/2 way = 128 lines

Address bits used as block offset: $A[3:2]$

Address bits used as cache line index: $A[10:4]$

Address bits used for tag comparison: $A[31:11]$

(D) To implement the LRU replacement strategy this cache requires some additional state for each set. How many state bits are required for each set?

1 bit to select which of the two ways was most recently used for each set

Number of state bits needed for each set for LRU: 1
To test this set-associative cache, the team runs the benchmark code shown on the right. The code sums the elements of a 16-element array. The first instruction of the code is at location 0x0 and the first element of the array is at location 0x10000. Assume that the cache is empty when execution starts and remember the cache has a block size of 4 words.

(E) How many instruction misses will occur when running the benchmark?

Number of instruction misses when running the benchmark: 3

Note: this explanation (which covers both 2E and 2F) is similar to that for problem 1E.

As we see in part (C), for each memory access, we look at bits [10:4] of the address to determine which index in the cache that particular memory address maps to. Remember that both instructions and the array are stored in memory. Thus, each instruction fetch is a memory access and is also cached along with all “normal” memory accesses in the form of lw instructions.

In addition, note that each block in this cache contains four (4) words. Therefore, when we fetch a new line into the cache, we will be adding four instructions/array values into our cache. Because our instructions are mostly executed linearly and because our array is accessed in a strictly linear fashion, this pattern allows us to take huge advantage of spatial locality to reduce the number of instruction/data fetches that we must execute.

We start from the top with the mv instruction at address 0x0. Bits [10:4] = 0b0. Therefore, we check index 0 of the cache for the mv instruction. However, since our cache starts empty, there is nothing there (the valid bit == 0). This is a cache miss, so we load a cache line into that slot in way 1 of our cache. Since each line contains four words, we load in the instructions at addresses [0x0, 0x4, 0x8, 0xC] – i.e. the next three instructions in addition to the current instruction. Thus, the next three instructions are all cache hits. The fifth instruction, lw, is at address 0x10. Bits[10:4] = 0b1, so this instruction is cached at index 1. The pattern should be evident: every fourth instruction is cached at the next cache index.

We fetch and cache the lw instruction next at index 0b0100. The lw instruction accesses address 0x10000 (the first element of the array). Bits [10:4] of 0x10000 are 0b0. This is the first data read miss. Note that we have already loaded in something in our cache at that index – the first set of four instructions. However, since our cache has two (2) ways, we can use the other way to load in another line of 4 words of array values. Thus, we only have a data fetch miss every fourth iteration of the loop L.

Since there are 16 array values and nine (9) instructions, we have that there are 16 // 4 = 4 data misses and 9 // 4 = 3 instruction misses.
Here is the final cache picture:

<table>
<thead>
<tr>
<th>Index</th>
<th>Way 0</th>
<th>Way 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000 0</td>
<td>add</td>
<td>Mem[0x100 0C]</td>
</tr>
<tr>
<td>0000 01</td>
<td>lui</td>
<td>Mem[0x100 08]</td>
</tr>
<tr>
<td>0000 10</td>
<td>mv</td>
<td>Mem[0x100 00]</td>
</tr>
<tr>
<td>0000 11</td>
<td>Mem[0x100 00]</td>
<td>Mem[0x100 00]</td>
</tr>
</tbody>
</table>

(F) How many data misses (i.e., misses caused by the memory access from the LD instruction) will occur when running the benchmark?

Number of data misses when running the benchmark: 4

(G) What’s the exact hit rate when the complete benchmark is executed?

Benchmark hit rate: 109/116

# instruction fetches = 3 instructions before + 6 instructions/iteration of loop*16 iterations + 1 instruction after = 100
# data fetches = 16
Misses = 7
Hits = 116 - 7 = 109
Problem 2 ★

Bob has a direct mapped cache containing 32 total words with 8 words per line. Assume all words in data memory start off as 0x000000000. Assume there is a separate data and instruction cache. Bob executes the following code to test his cache.

```
. = 0x0
li x1, 0xABCD1234
li x4, 0x77337733
sw x1, 0x500(x0)
li x2, 0x502(x0)
lb x5, 0x660(x0)
sh x1, 0x660(x0)
sb x4, 0x395(x0)
sb x1, 0x518(x0)
li x3, 0x61916191
sb x3, 0x503(x0)
```

(A) What are the values of the following registers after execution of the code?

- **x2**: ________0xFFFFABCD_________
- **x5**: ________0x00000000_________

x2 contains the value of loading the half word at address 0x502. This is word address 0x500 with a byte offset of 0x2, so the upper half-word of 0x500. The address 0x500 currently contains 0xABCD1234 (written during the previous instruction), so the half word in question is 0xABCD. Sign extending this, we get 0xFFFFABCD.

x5 contains the value of loading the byte at address 0x660. This is word address 0x660 with a byte offset of 0x0. This address has not yet been written to, so it is all zeros.

(B) What are the values at the following locations in memory after execution of the code?

- **0x500**: ________0x91CD1234_________
- **0x660**: ________0x00001234_________
- **0x394**: ________0x00003300_________
- **0x518**: ________0x34000000_________

The word at 0x500 is written to twice; once with a sw to address 0x500 and once with a sb to address 0x503. The address 0x503 is word address 0x500 plus a byte offset of 0x3. First, the value 0xABCD1234 is stored at this address. Then, the byte in x3 is stored at a byte offset of 3 into 0x500. This is the upper byte of this word. x3 contains 0x61916191, and the least significant 8 bits of this are 0x91, so these get stored into the upper byte of 0x500. This results in a value of 0x91CD1234.
The word at 0x660 is written to once, with a sh to address 0x660. This is word address 0x660 with byte offset 0x0, so the lower half-word is affected. The register x1 contains 0xABC1234, and the least-significant 16 bits are 0x1234. So, 0x1234 is stored in the lower half word of 0x660, resulting in 0x0001234.

The word 0x394 is written to once, with a sb to 0x395. This is word address 0x394 with byte offset 0x1, so the byte at offset 1 is affected. The register x4 contains 0x77337733, and the least-significant 8 bits are 0x33. So, 0x33 is stored into the byte at offset 1, resulting in 0x00033300.

The word 0x518 is written to once, with a sb to 0x51B. This is word address 0x518 with byte offset 0x3, so the byte at offset 3 (most significant byte) is affected. The register x1 contains 0xABC1234, and the least-significant 8 bits are 0x34. So 0x34 is stored into the byte at offset 3, resulting in 0x34000000.

(C) How many hits and misses did the data cache receive?

Hits: ________ 3 _________
Misses: ________ 4 _________

With 32 total words and 8 words per line, the cache has 2 index bits and 3 offset bits.

The cache begins empty.

The first access to 0x500 is a miss, since nothing is in the cache. Then all 8 words from 0x500 – 0x51C are placed in the cache. They are placed at index 0.

The second access to 0x502 is a hit, since the word 0x500 is in the cache and this is a half-word access to part of that word.

The access to 0x660 is a miss. The line containing 0x660 is brought into the cache at index 3.

The next access to 0x660 is a hit, since it is now in the cache.

The access to 0x395 is a miss. It is part of the word 0x394. It maps to index 0, so must kick out the line containing address 0x500-0x51C.

The access to 0x51B is a miss. It is in the word 0x518. The line containing 0x500-0x51C is brought back into the cache again.

The access to 0x503 is a hit. It is in the word 0x500, which is now in the cache.
Problem 3 ★

Bob is tired of supporting half-word and byte instructions in his cache. He complains that users can achieve the same thing with RISC-V instructions! Translate the following RISC-V half-word or byte load/store instructions to sequences of instructions that do not use half-word or byte load/store instructions.

(A) \texttt{lh x2, offset(x1)}

\begin{verbatim}
addi x4, x1, offset  // Calculate address
andi x5, x4, 2      // Get half-word offset
andi x3, x4, ~2     // Calculate base word address
lw x2, 0x0(x3)      // Load entire word into x2
beqz x5, else       // Check if equal to 0
srai x2, 16         // If offset \neq 0, shift right by 16
j endif
else:
  slli x2, 16        // If offset = 0, shift left by 16
                     // (to get rid of upper bits)
  srai x2, 16        // Then shift back right by 16
                     // (sign extends appropriately)
endif:
\end{verbatim}

(B) \texttt{sb x2, offset(x1)}

\begin{verbatim}
addi x4, x1, offset  // Calculate address
andi x6, x4, 3      // Get byte offset
andi x7, x4, ~3     // Calculate base word address
lw x3, 0x0(x7)      // Load current word
slli x6, x6, 3      // Multiply byte offset by 8
li x5, 0xFF         // Load byte of all 1s
and x2, x2, x5      // And x2 with byte mask
sll x2, x2, x6      // Shift x2 to byte offset position
sll x5, x5, x6      // Shift mask to byte offset position
xor x5, x5, 0       // Reverse byte mask
and x3, x3, x5      // And x3 with reversed byte mask
add x2, x2, x3      // Add x2 (just byte we want) and
                     // x3 (minus byte we want)
sw x2, 0x0(x7)      // Store new x2 into address
\end{verbatim}
(C) For a sequence of instructions that do not contain any half-word or byte-level stores or loads, is Bob’s scheme faster or slower?

(Slower) (No Change) (Faster)

It takes some extra circuitry to compute half-word and byte-level stores and loads. If the cache determines the clock in our system, and this computation is on the critical path, it would make our clock period longer, and thus Bob’s scheme is faster. If this is not on the critical path, then there would be no change.

(D) For a sequence of instructions with many half-word and byte-level stores and loads, is Bob’s scheme faster or slower?

(Slower) (No Change) (Faster)

Each half-word or byte-level store or load gets translated into several instructions. This is especially true if the byte and half-word offsets are not known in advance (as is the case in (A) and (B)). Even if each cycle is slightly shorter by omitting the circuitry to read and write at the half word and byte level, so many more cycles are consumed by the extra instructions that the instruction sequence will run slower.
Problem 4

Assume, the program shown on the right is being run on a RISC-V processor with a cache with the following parameters:

- 2-way set-associative
- block size of 2, i.e., 2 data words are stored in each cache line
- total number of data words in the cache is 32
- LRU replacement strategy

(A) The cache will divide the 32-bit address supplied by the processor into four fields: 2 bits of byte offset, B bits of block offset, L bits of cache line index, and T bits of tag field. Based on the cache parameters given above, what are the appropriate values for B, L, and T?

2 words per line; \( \log_2(2) = 1 \)

value for B: 1  // allocate space to hold array

8 lines per ways

value for L: 3  // instruction address = base + instruction_index * 4 = 0x240 + 3*4 = 0x24C

value for T: 32-6=26

(B) If the SLLI instruction is resident in a cache line, what will be its cache line index? the value of the tag field for the cache?

Instruction address = base + instruction_index * 4 = 0x240 + 3*4 = 0x24C
0x24C=0b00000000000000000000001001_001_100

Cache line index for SLLI when resident in cache: 1

Tag field for SLLI when resident in cache: 0x9

(C) Given that the code begins at address 0x240 and the array begins at address 0x420, and that there are 16 elements in the array as shown in the code above, list all the values \( j \) (\( 0 \leq j < 16 \)) where the location holding the value \( A[j] \) will map to the same cache line index as the SLLI instruction in the program.

The array addresses that overlap are:
0x448=0b000000000000000000000010001_001_000
and
0x44C=0b000000000000000000000010001_001_100

These correspond to the 10th and 11th elements of the array.

List all \( j \) where \( A[j] \) have the same cache line index as SLLI: 10, 11

(D) If the outer loop is run many times, give the steady-state hit ratio for the cache, i.e., assume that the number of compulsory misses as the cache is first filled are insignificant compared to the number of hits and misses during execution.
Since this is a 2-way cache, the instructions and data will never conflict (they will be stored in two different ways even if they have the same line index). Thus, at steady-state, there is a 100% hit ratio.

Steady-state hit ratio (%): 100%
**Problem 5 ★**

Consider a 2-way set-associative cache where each way has 4 cache lines with a **block size of 2 words**. Each cache line includes a valid bit (V) and a dirty bit (D), which is used to implement a write-back strategy. The replacement policy is least-recently-used (LRU). The cache is used for both instruction fetch and data (LD,ST) accesses. Please use this cache when answering questions (A) through (D).

(A) Using this cache, a particular benchmark program experiences an average memory access time (AMAT) of 1.3 cycles. The access time on a cache hit is 1 cycle; the miss penalty (i.e., additional access time) is 10 cycles. What is the hit ratio when running the benchmark program? You can express your answer as a formula if you wish:

**Hit ratio for benchmark program:** 0.97

\[
\text{AMAT} = \text{hit\_time} + (1 - \text{hit\_ratio}) \times \text{miss\_penalty}
\]

1.3 = 1 + (1-HR)*10

(B) The circuitry for this cache uses various address bits as the block offset, cache line index and tag field. Please indicate which address bits A[31:0] are used for each purpose by placing a “B” in each address bit used for the block offset, “L” in each address bit used for the cache line index, and “T” in each address bit used for the tag field.

**Fill in each box with “B”, “L”, or “T”**

<table>
<thead>
<tr>
<th>31</th>
<th>30</th>
<th>29</th>
<th>28</th>
<th>27</th>
<th>26</th>
<th>25</th>
<th>24</th>
<th>23</th>
<th>22</th>
<th>21</th>
<th>20</th>
<th>19</th>
<th>18</th>
<th>17</th>
<th>16</th>
<th>15</th>
<th>14</th>
<th>13</th>
<th>12</th>
<th>11</th>
<th>10</th>
<th>9</th>
<th>8</th>
<th>7</th>
<th>6</th>
<th>5</th>
<th>4</th>
<th>3</th>
<th>2</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>T</td>
<td>L</td>
<td>L</td>
<td>B</td>
<td>0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Block Size = 2Words = 8Byte: Block Offset 1bit, Byte Offset 2bit (2bits are 2'b00)
4 lines/each way = 4 sets cache: Index 2bits
Remaining bits are all tag bits: 27 bits

(C) This cache needs room to store new data and based on the LRU replacement policy has chosen the cache line whose information is shown to the right for replacement. Since the current contents of that line are marked as dirty (D = 1), the cache must write some information back to main memory. What is the address of each memory location to be written? Please give each address in hex.

**Addresses of each location to be written (in hex): 0x2478, 0x247C**

Block offset: 1-bit value: 0b1 or 0b0; Byte offset: 2-bit value 0b00
Index: 2-bit 3 = 0b11
Tag: 27-bit 0x123 = 0b000 0000 0000 0000 0001 0010 0011
Address = {27'b tag, 2'b index, 1'b block offset, 2'b byte offset}
1) Block Offset (0b0): 0b000...00_0010_0100_0111_1000 = 0x2478
2) Block Offset (0b1): 0b000...00_0010_0100_0111_1100 = 0x247C

---

Way: 0
Cache line index: 3
Valid bit (V): 1
Dirty bit (D): 1
Tag field: 0x123
This cache is used to run the following benchmark program. The code starts at memory address 0; the array referenced by the code has its first element at memory address 0x200. First determine the number of memory accesses (both instruction and data) made during each iteration through the loop. Then estimate the steady-state average hit ratio for the program, i.e., the average hit ratio after many iterations through the loop.

```
. = 0
  mv x3, x0        // byte index into array
  mv x1, x0        // initialize checksum accumulator
loop:
  lw x2, 0x200(x3) // load next element of array
  slli x1, x1, 1   // shift checksum
  addi x1, x1, 1   // increment checksum
  add x1, x1, x2   // include data value in checksum
  addi x3, x3, 4   // byte index of next array element
  slti x2, x3, 1000 // process 250 entries
  bnez x2, loop    // halt
  unimp
```

In steady state:
- Each loop fetches 7 instructions and 1 data (lw).
- Loop starts at instruction address 0x8=0b1000
- If you calculate the index bits for every two instructions (since each cache line contains two words) in the loop, you will see that none of them overlap. In addition, they can all be inserted into Way 0 of the cache. Thus, there are no instruction cache misses!

The other way of our cache can be used to load data from our array starting at address 0x200. Every time we encounter a new element of the array not in the cache, our cache load request will actually fetch two words (i.e. to contiguous elements in our array); thus, in the next iteration of our loop, we will not need to fetch again, as the next element is already in our cache.

Thus, the data fetch instruction only (lw) misses every other iteration of the while loop. Since there are 8 total memory accesses per while loop iteration, we see that one out of every 2*8=16 memory accesses is a miss.

Steady-state hit ratio = (8-0.5)/8 = 15/16

**Number of memory accesses made during each iteration of the loop:** 8

**Estimated steady-state average hit ratio:** 15/16
Problem 6

Consider the diagram to the right for a 2-way set associative cache to be used with our RISC-V processor. Each cache line holds a single 32-bit word of data along with its associated tag and valid bit (0 when the cache line is invalid, 1 when the cache line is valid).

(A) The RISC-V produces 32-bit byte addresses, A[31:0]. To ensure the best cache performance, which address bits should be used for the cache index? For the tag field?

address bits used for cache index: A[4:2]
address bits used for tag field: A[31:5]

As always, 2 bits for byte offset (both 0)
1 word per block implies block offset = 0 bits (no need to select)
8 lines/way implies index = 3 bits
Remaining 27 bits are tag bits.

(B) Suppose the processor does a read of location 0x5678. Identify which cache location(s) would be checked to see if that location is in the cache. For each location specify the cache section (A or B) and row number (0 through 7). E.g., 3A for row 3, section A. If there is a cache hit on this access what would be the contents of the tag data for the cache line that holds the data for this location?

cache location(s) checked on access to 0x5678: 6A and 6B

cache tag data on hit for location 0x5678 (hex): 0x2B3

0x5678 in binary: 0b 0101 0110 0111 1000
Line: 3'b110 = 6
Tag: 0b0010_1011_0011 = 0x2B3

Remember, all ways are checked in parallel when searching a set-associative cache.

(C) Assume that checking the cache on each read takes 1 cycle and that refilling the cache on a miss takes an additional 8 cycles. If we wanted the average access time over many reads to be 1.1 cycles, what is the minimum hit ratio the cache must achieve during that period of time? You needn’t simplify your answer.

1.1 = 1 + (1-HR)*8

minimum hit ratio for 1.1 cycle average access time: 79/80
(D) Estimate the approximate cache hit ratio for the following program. Assume the cache is empty before execution begins (all the valid bits are 0) and that an LRU replacement strategy is used. Remember the cache is used for both instruction and data (LD) accesses.

```assembly
addi x4, x0, 0x100
mv x1, x0
lui x2, 1       // x2 = 0x1000
loop: lw x3, 0(x4)
      addi x4, x4, 4
      add x1, x1, x3
      addi x2, x2, -1
      bnez x2, loop
      sw x1, 0x100(x0)
      unimp       // halt

source:
   . = 0
   . = . + 0x4000 // Set source to 0x100, reserve 0x1000 words
```

**approximate hit ratio: 5/6**

The first instruction in the loop (lw) is at 0xC=0b1000. Extracting the index bits from this address, we see that the instructions within the loop body (lw - bnez) occupy lines 3-7 of one of the ways in the cache.

Thus, one way is always free for data. Each iteration causes a data fetch that is a cache miss (since there is only a single word per cache line -> no spatial locality). Thus, each iteration there are 5 instruction fetches (hits) and 1 data fetch (always a miss).

(E) After the program of part (D) has finished execution what information is stored in row 4 of the cache? Give the addresses for the two locations that are cached (one in each of the sections) or briefly explain why that information can’t be determined.

**Addresses whose data is cached in “Row 4”: 0x10 and 0x40F0**

(@ 0x10) addi x4,x4,4: Mapped to Row 4

Data access ends at 0x4100 - 4 (Starts from 0x100 and does 0x1000 times: next element at +4)

= 0x40FC @ Row 7

... 0x40F0 @ Row 4
    0x40F4 @ Row 5
    0x40F8 @ Row 6
    0x40FC @ Row 7
    ...

...
**Problem 7 ★**

A standard unpipelined RISC-V is connected to a 2-way set-associative cache containing 8 sets, with a block size of 4 32-bit words. The cache uses a LRU replacement strategy. At a particular point during execution, a snapshot is taken of the cache contents, which are shown below. All values are in hex; assume that any hex digits not shown are 0.

### Way #1

<table>
<thead>
<tr>
<th>Line#</th>
<th>V</th>
<th>D</th>
<th>TAG</th>
<th>A5</th>
<th>A4</th>
<th>A3</th>
<th>A2</th>
<th>B1</th>
<th>B0</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>1</td>
<td>1</td>
<td>f29</td>
<td>a6218800</td>
<td>77f70002</td>
<td>c01f0014</td>
<td>8359c800</td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
<td>027</td>
<td>c01f0037</td>
<td>73ffe0fa7</td>
<td>77ee0002</td>
<td>d33ac000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>0</td>
<td>000</td>
<td>00000000</td>
<td>00000000</td>
<td>00000000</td>
<td>00000000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0</td>
<td>33a</td>
<td>8358c000</td>
<td>73fa0002</td>
<td>c17f00f0</td>
<td>73ffe0f0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>803</td>
<td>64f06068</td>
<td>60ff0600</td>
<td>c01f0043</td>
<td>80200000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>002</td>
<td>90072800</td>
<td>73ffe0b8</td>
<td>80428000</td>
<td>d0052ca0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>5c8</td>
<td>80400000</td>
<td>8358c000</td>
<td>73f00003</td>
<td>5ab6000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>a31</td>
<td>c01f0045</td>
<td>d3c7e0f0</td>
<td>7f1f0091a</td>
<td>77e00002</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Way #2

<table>
<thead>
<tr>
<th>Line#</th>
<th>V</th>
<th>D</th>
<th>TAG</th>
<th>B3</th>
<th>B2</th>
<th>B1</th>
<th>B0</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>1</td>
<td>0</td>
<td>77e</td>
<td>d01e001e</td>
<td>61040008</td>
<td>c01f0044</td>
<td>80a32800</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
<td>1f0</td>
<td>c3f0394</td>
<td>77e00002</td>
<td>80400000</td>
<td>c01f0046</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0</td>
<td>d1c</td>
<td>f79c0001</td>
<td>e809ff</td>
<td>55555555</td>
<td>55555555</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0</td>
<td>00a</td>
<td>d37afeff</td>
<td>d0000000</td>
<td>73ffe93</td>
<td>80600000</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>01b</td>
<td>c01f0035</td>
<td>d01e04b0</td>
<td>80a32800</td>
<td>73ffe0ab</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>02b</td>
<td>c6510001</td>
<td>77f00002</td>
<td>90082800</td>
<td>c01f0041</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>f99</td>
<td>77ee0002</td>
<td>f9b001f</td>
<td>64000000</td>
<td>fc000000</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>7af</td>
<td>a9ab6000</td>
<td>c01f003b</td>
<td>77e00002</td>
<td>fb5cc011</td>
</tr>
</tbody>
</table>

(A) The cache uses bits from the 32-bit byte address produced by the processor to select the appropriate set (L), as input to the tag comparisons (T) and to select the appropriate word from the data block (B). For correct and optimal performance what are the appropriate portions of the address to use for L, T and B? Express your answer in the form “A[N:M]” for N and M in the range 0 to 31, or write “CAN’T TELL”.

4 bytes/word -> 2-bit byte offset (as always)
4 word (16 Byte) cache block -> 2-bit block offset
8 rows per way (8-set cache) -> 3-bit index
Address -> \{25'b tag, 3'b index, 2'b block offset, 2'b byte offset\}

**Address bits to use for L:** A[6:4]
**Address bits to use for T:** A[31:7]
**Address bits to use for B:** A[3:2]

(B) For the following addresses, if the contents of the specified location appear in the cache, give the location’s 32-bit contents in hex (determined by using the appropriate value from the cache). If the contents of the specified location are NOT in the cache, write “MISS”.

0xA1100: Tag 0x1422; Line 0; Word 0 (B0 or A0)

Contents of location 0xA1100 (in hex) or “MISS”: **MISS**

0x548: Tag 0x00A; Line 4; Word 2 (B2 or A2)

Contents of location 0x548 (in hex) or “MISS”: **0xDC000000**
(C) Ignoring the current contents of the cache, is it possible for the contents of locations 0x0 and 0x1000 to both be present in the cache simultaneously?

2-ways, so yes, even though their index bits match

Locations 0x0 and 0x1000 present simultaneously (circle one):  **YES**  ...  **NO**

(D) Give a one-sentence explanation of how the D bit got set to 1 for Line #7 of Way #1. At what point should the D bit be reset to 0?

**One sentence explanation**

ST updated a value of a word in that line in cache, but hasn’t yet been written back to memory (Dirty). It will be reset when that line is evicted from the cache and written to memory.

(E) The following code snippet sums the elements of the 32-element integer array X. Assume this code is executing on a RISC-V processor with a cache architecture as described above and that, initially, the cache is empty, i.e., all the V bits have been set to 0. Compute the hit ratio as this program runs until it executes the `unimp` instruction, a total of \(2 + (6 \times 32) + 1 = 195\) instruction fetches and 32 data accesses.

```
Hit ratio: 216/227
```

```c

mv  x4, x0       // loop counter
mv  x1, x0       // accumulated sum

loop:
    slli  x2, x4, 2 // convert loop counter to byte offset
    lw   x3, 0x100(x2) // load next value from array
    add  x1, x1, x3 // add value to sum
    addi x4, x4, 1 // increment loop counter
    slti x2, x4, 32 // finished with all 32 elements?
    bnez x2, loop  // nope, keep going

unimp       // all done, sum in x1

. = 0x100
X: .word 1     // the 32-element integer array X
   .word 2
   ...
   .word 32
```

Cache stores 4 words per line: First four instructions (mv, mv, slli, lw) map to cache index “0” and the other 4 instructions (add, addi, slti, bnez) map to cache index “1”. Unimp is mapped to index “3”

For 32 words in the array, 4 are grouped into a cache block and stored in another way of the cache (total 8 lines)

3 instruction misses and 8 data misses (Compulsory – initially cache empty)
No more misses after that.
Total instruction fetches: \(2 + 6 \times (32) + 1 = 195\), total data fetches: 32

\((195+32-3-8)/(195+32)\)
Problem 8 ★

After his geek hit single *I Hit the Line*, renegade singer Johnny Cache has decided he’d better actually learn how a cache works. He bought three RISC-V processors, identical except for their cache architectures:

- *Proc1* has a 64-line direct-mapped cache
- *Proc2* has a 2-way set associative cache, LRU, with a total of 64 lines
- *Proc3* has a 4-way set associative cache, LRU, with a total of 64 lines

Note that each cache has the same total capacity: 64 lines, each holding a single 32-bit word of data or instruction. All three machines use the same cache for data and instructions fetched from main memory.

Johnny has written a simple test program:

```asm
// Try a little cache benchmark
// Assume x7 = 0x2000 (data region A)
// Assume x8 = 0x3000 (data region B)
// Assume x9 = 16 (size of data regions in BYTES!)

.data
   = 0x1000  // start program here
P:    addi x6, x0, 1000  // outer loop count
Q:    mv x3, x9          // Loop index i (array offset)
R:    addi x3, x3, -4    // i = i-1
     addi x9, x3, x7     // x9 = address of A[i]
     addi x10, x3, x8    // x10 = address of B[i]
     lw x1, 0(x9)        // read A[i]
     lw x2, 0(x10)       // read B[i]
     bnez x3, R
     addi x6, x6, x9     // repeat many times
     bnez x6, Q
     bnez x6, Q
     unimp              // halt
```

Johnny runs his program on each processor, and finds that one processor model outperforms the other two.

(A) Which processor model gets the highest hit ratio on the above benchmark?

Circle one:  Proc1  Proc2  Proc3

(B) Johnny changes the value of `B` in his program to 0x2000 (same as `A`), and finds a substantial improvement in the hit rate attained by one of the processor models (approaching 100%).

Which model shows this marked improvement?

Circle one:  Proc1  Proc2  Proc3

(C) Finally, Johnny moves the code region to 0x0 and the two data regions `A` and `B` each to 0x0, and sets `x9` to 64. What is the TOTAL number of cache misses that will occur executing this version of the program on each of the processor models?

One region (code data) needs to fetch 64/4 = 16 elements once in all caches (“compulsory misses”)

TOTAL cache misses running on Proc1: 16; Proc2: 16; Proc3: 16
Problem 9 ★

(A) We would like to design a cache with an AMAT (average memory access time) of 1.5 cycles. Accessing our cache takes 1 cycle, and on a miss, it takes an additional 10 cycles to retrieve the data from main memory and update the cache. What does our hit ratio need to be in order to achieve the target AMAT?

\[
\text{AMAT} = \text{HitTime} + \text{MissPenalty} \times \text{MissRatio}
\]

\[
1.5 \text{ cycles} = 1 \text{ cycle} + (10 \text{ cycles}) \times (1 - \text{HitRatio})
\]

\[
0.05 = 1 - \text{HitRatio}
\]

\[
\text{Hit ratio} = \frac{0.95}{1}
\]

We choose to implement a 2-way set-associative cache with a block size of 4 (i.e. 4 words per line). The number of sets in the cache is 4. Assume that addresses and data words are 32 bits.

(B) To ensure the best cache performance, which address bits should be used for the block offset, the cache index, and the tag field?

- Address bits used for byte offset: \( A[1 : 0] \)
- Address bits used for block offset: \( A[3 : 2] \)
- Address bits used for cache index: \( A[5 : 4] \)
- Address bits used for tag field: \( A[31 : 6] \)

(C) Assuming the cache uses a writeback policy, what is the total number of bits per cache line? Please show your work for partial credit.

\[
\text{Bits per line} = 1 \text{ Valid bit} + 1 \text{ Dirty bit} + 26 \text{ Tag bits} + 4 \times 32 \text{ Data bits}
\]

\[
= 28 + 128 \text{ bits}
\]

\[
= 156 \text{ bits}
\]

\[
\text{156 bits}
\]
We want to analyze the performance of this cache on the following assembly program, which
iterates through a 1000-word array A and sets each element to \(A[i] = -A[i]\). The base
address of array A is 0x3000.

. = 0x100 // The following code starts at address 0x100

// Assume the following registers are initialized:
// x1=0 (loop index)
// x2=1000 (number of array elements)
// x3=0x3000 (base address of array A)

loop:
  slli x5, x1, 2   // x5 = byte offset of the ith element
  add x6, x5, x3   // x6 = address of A[i]
  lw x7, 0(x6)    // x7 = A[i]
  sub x7, x0, x7   // x7 = -A[i]
  sw x7, 0(x6)    // store A[i] = -A[i]
  addi x1, x1, 1   // increment i
  blt x1, x2, loop // continue looping

(D) Below is the cache state the first time the program is about to enter the loop at loop.
Assume that the cache uses a least-recently used (LRU) replacement policy, and that all
cache lines in Way 1 are currently the least-recently used. Mark up the cache below to
indicate the state of the cache immediately after one loop iteration (after executing the
blt instruction for the first time). You do not need to specify the value of data words, but
do specify the values of D (dirty bit), V (valid bit), and Tag.

<table>
<thead>
<tr>
<th>Way 0</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>Word 3</th>
<th>Word 2</th>
<th>Word 1</th>
<th>Word 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>D</td>
<td>V</td>
<td>Tag</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0 1</td>
<td>1</td>
<td>0x0 0xC0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0 1</td>
<td>1</td>
<td>0x0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0 1</td>
<td>1</td>
<td>0x0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>- 0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Way 1</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>Word 3</th>
<th>Word 2</th>
<th>Word 1</th>
<th>Word 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>D</td>
<td>V</td>
<td>Tag</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>- 0</td>
<td>0 1</td>
<td>- 0x4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>- 0</td>
<td>0 1</td>
<td>- 0x4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>- 0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>- 0</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
(F) How many instruction fetches and data accesses occur per iteration of the loop?

Number of instruction fetches: ____7_____
Number of data accesses: ____2_____

(G) After the program has been running for many loop iterations, what is the steady-state hit ratio for instruction fetches and data accesses?

All instructions fetches fit in Way 1. They are never replaced by data since they are always accessed more recently than the previous set of data. So in steady state, there is an instruction hit rate of 100%.

The data array is large enough that in steady state every time you access a new block of data you get 1 miss. However, when you get the miss, you bring in 4 consecutive data elements, each of which is accessed twice through a lw followed by a sw. So for every 8 data accesses, you get 1 miss, or a hit rate of 7/8.

Steady-state hit ratio for instruction fetches: ____1_____
Steady-state hit ratio for data accesses: ____7/8_____


Problem 10

Anne and Ben just learned about caches in 6.004 and decided to design their own.

(A) They would like to design a cache with an AMAT (average memory access time) of 3 cycles. Accessing the cache should take 1 cycle, and on a miss, it should take an additional 16 cycles to retrieve the data from main memory, update the cache, and return the requested word to the processor. What should their hit ratio be in order to achieve the target AMAT?

Hit ratio: 7/8

Ben suggests implementing a 2-way set-associative cache with a block size of 4 (i.e. 4 words per line). The number of sets in the cache is 8. Assume that addresses and data words are 32 bits wide.

(B) To ensure the best cache performance, which address bits should be used for the block offset, the cache index, and the tag field?

Address bits used for byte offset: A[1 : 0]
Address bits used for block offset: A[3 : 2]
Address bits used for cache index: A[6 : 4]
Address bits used for tag field: A[31 : 7]

(C) Anne agrees with the appropriateness of a 2-way set-associative implementation, but she suspects that a larger block size might result in a higher hit rate. Suppose the block size of the cache is doubled to 8. If the total number of data words in the cache remains unchanged, how would the number of cache lines change?

Change in # of cache lines (select one of the choices below):

UNCHANGED  ...  +1  ...  -1  ...  2x  ...  0.5x  ...  CAN’T TELL
Ultimately, they choose to implement Ben’s 2-way, 4-block cache. Below is a snapshot of the cache during the execution of some unknown code. The column labeled \(\text{Word } x\) corresponds to the \(x^{th}\) word of the block. The \(V\) bit specifies whether or not the line is valid, and the \(D\) bit specifies whether or not the line is dirty.

<table>
<thead>
<tr>
<th>V</th>
<th>D</th>
<th>Tag</th>
<th>Word 3</th>
<th>Word 2</th>
<th>Word 1</th>
<th>Word 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>0x32</td>
<td>0x3A</td>
<td>0x2A</td>
<td>0x1A</td>
<td>0x0A</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0x32</td>
<td>0x7B</td>
<td>0x6B</td>
<td>0x5B</td>
<td>0x4B</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0x32</td>
<td>0x0C</td>
<td>0x1C</td>
<td>0x2C</td>
<td>0x3C</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0x32</td>
<td>0x4D</td>
<td>0x5D</td>
<td>0x6D</td>
<td>0x7D</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0x50</td>
<td>0x03</td>
<td>0x13</td>
<td>0x23</td>
<td>0x33</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0x50</td>
<td>0x14</td>
<td>0x24</td>
<td>0x34</td>
<td>0x44</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0x43</td>
<td>0x85</td>
<td>0x75</td>
<td>0x65</td>
<td>0x55</td>
</tr>
<tr>
<td>7</td>
<td>1</td>
<td>0x66</td>
<td>0x96</td>
<td>0x86</td>
<td>0x76</td>
<td>0x66</td>
</tr>
</tbody>
</table>

\(\text{V} \quad \text{D} \quad \text{Tag} \quad \text{Word 3} \quad \text{Word 2} \quad \text{Word 1} \quad \text{Word 0}\)

\(1 \quad 0 \quad 0x33 \quad 0x83 \quad 0x82 \quad 0x81 \quad 0x80\)
\(1 \quad 1 \quad 0x33 \quad 0xB7 \quad 0xB6 \quad 0xB5 \quad 0xB4\)
\(1 \quad 0 \quad 0x95 \quad 0xC0 \quad 0xC1 \quad 0xC2 \quad 0xC3\)
\(1 \quad 0 \quad 0x95 \quad 0xD6 \quad 0xD5 \quad 0xD4 \quad 0xD3\)
\(1 \quad 0 \quad 0x22 \quad 0x86 \quad 0x87 \quad 0x88 \quad 0x89\)
\(1 \quad 0 \quad 0xA0 \quad 0x95 \quad 0x94 \quad 0x93 \quad 0x92\)
\(1 \quad 1 \quad 0x37 \quad 0xF8 \quad 0xF7 \quad 0xF6 \quad 0xF5\)
\(1 \quad 1 \quad 0x18 \quad 0xA9 \quad 0x8A \quad 0xA7\)

(D) Would a load request to address 0x193C result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, write N/A.

Hit / Miss : \_
\_

Returned value if hit or N/A if miss: \_
\_

0x4D
Anne and Ben want to analyze the performance of this cache on the following assembly program, which calculates the first 256 terms in the Fibonacci sequence and stores them in an array \( A \). The base address of array \( A \) is \( 0x3000 \).

```assembly
// Assume the following registers are initialized:
// x1 = 0 (initial loop index)
// x2 = 256 - 2 = 254 (number of Fibonacci elements to calculate)
// x3 = 0x3000 (base address of array A)

. = 0x100 // The following code starts at address 0x100

fibonacci:
    li x4, 1         // x4 = 1 (second element in sequence)
    sw x0, 0(x3)     // \( A[0] = 0 \)
    sw x4, 4(x3)     // \( A[1] = 1 \)

loop:
    slli x4, x1, 2   // x4 = byte offset of the ith element
    add x5, x4, x3   // x5 = address of \( A[i] \)
    lw x6, 0(x5)     // x6 = \( A[i] \)
    lw x7, 4(x5)     // x7 = \( A[i+1] \)
    add x6, x6, x7   // x6 = \( A[i] + A[i+1] \)
    sw x6, 8(x5)     // \( A[i+2] = x6 \)
    addi x1, x1, 1   // increment i
    blt x1, x2, loop // continue looping
```

Answer the following questions about the behavior of the cache during execution of the above code. Assume that the cache uses a least recently used (LRU) replacement policy, that the cache is initially empty, and that all cache lines in Way 0 are currently the least-recently used.

(E) How many instruction fetches and data accesses occur per iteration of the loop?

- Number of instruction fetches: ___8___
- Number of data accesses: ___3___

(F) After the program has been running for many loop iterations, what is the steady-state hit ratio for instruction fetches and data accesses? Hint: Note that in steady state each array element is accessed in multiple loop iterations.

- Steady-state hit ratio for instruction fetches: ___1___
- Steady-state hit ratio for data accesses: ___11/12___
Problem 11 (From Past Quizzes)

Dilvina and Saniel are analyzing 4.006 grade statistics and are performing some hefty calculations, so they suspect that a cache could improve their system's performance.

(A) They are considering using a 2-way set-associative cache with a block size of 4 (i.e. 4 words per line). The cache can store a total of 64 words. Assume that addresses and data words are 32 bits wide. To properly make use of locality, which address bits should be used for the block offset, the cache index, and the tag field?

Address bits used for byte offset: A[1 : 0]

Address bits used for tag field: A[31 : 7]

Address bits used for block offset: A[3 : 2]

Address bits used for cache index: A[6 : 4]

Block size of 4 → 2 bits for block offset
64 words / 4 words per block = 16 blocks
16 blocks / 2 ways = 8 sets → 3 bits for cache index
32 – 3 – 2 – 2 = 25 bits of tag

(B) If Dilvina and Saniel instead used a direct-mapped cache with the same total capacity (64 words) and same block size (4 words), how would the following parameters in their system change?

Change in # of cache lines (select one of the choices below):

UNCHANGED ... +1 ... -1 ... 2x ... 0.5x ... CAN'T TELL
Accepted Unchanged or 2x as solutions because 2-way cache also has ½ the number of sets as the direct mapped cache, but the number of cache lines is actually the same.

Change in # of bits in tag field (select one of the choices below):

UNCHANGED ... +1 ... -1 ... 2x ... 0.5x ... CAN'T TELL

Block size of 4 → 2 bits for block offset
64 words / 4 words per block = 16 blocks → 4 bits for cache line index
32 – 4 – 2 – 2 = 24 bits of tag
Ultimately, they decided that the 2-way set associative cache would probably have better performance for their application, so the remainder of the problem will be considering a 2-way set associative cache. Below is a snapshot of this cache during the execution of some unknown code.

V is the valid bit and D is the dirty bit of each set.

<table>
<thead>
<tr>
<th>Way 0</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>V</td>
<td>D</td>
<td>Tag</td>
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0x32</td>
<td>0x3A</td>
<td>0x2A</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0x32</td>
<td>0x7B</td>
<td>0x6B</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0x32</td>
<td>0x0C</td>
<td>0x1C</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>0x32</td>
<td>0x4D</td>
<td>0x5D</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>0x50</td>
<td>0x03</td>
<td>0x13</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>1</td>
<td>0x50</td>
<td>0x14</td>
<td>0x24</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0</td>
<td>0x43</td>
<td>0x85</td>
<td>0x75</td>
</tr>
<tr>
<td>7</td>
<td>1</td>
<td>0</td>
<td>0x66</td>
<td>0x96</td>
<td>0x86</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Way 1</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>V</td>
<td>D</td>
<td>Tag</td>
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0x33</td>
<td>0x83</td>
<td>0x82</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0x33</td>
<td>0xB7</td>
<td>0xB6</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0x95</td>
<td>0xC0</td>
<td>0xC1</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>0x95</td>
<td>0xD6</td>
<td>0xD5</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>0x22</td>
<td>0x86</td>
<td>0x87</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>1</td>
<td>0xA0</td>
<td>0x95</td>
<td>0x94</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0</td>
<td>0x37</td>
<td>0xF8</td>
<td>0xF7</td>
</tr>
<tr>
<td>7</td>
<td>1</td>
<td>0</td>
<td>0x18</td>
<td>0xAA</td>
<td>0xA9</td>
</tr>
</tbody>
</table>

(C) Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.

32-Bit Byte Address: 0x4AB4

Line index: ___3____
Tag: 0x___95____
Block offset: ___1___
Returned value if hit / Explanation if miss: ___________0xD4_________

32-Bit Byte Address: 0x21E0

Line index: ___6___
Tag: 0x___43___
Block offset: ___0___
Returned value if hit / Explanation if miss: __miss, valid bit is 0_______
Dilvina and Saniel want to analyze the performance of this same cache on the following assembly program, which averages the quiz 2 scores for the 225 students in 4.006. The students' scores are stored in an array A whose base address is at 0x3000.

```assembly
// Assume the following registers are initialized:
// x1 = 0 (loop index)
// x2 = 225 (number of 4.006 students)
// x3 = 0x3000 (base address of array A)
// x6 = 0 (used for summing)

. = 0x0 // The following code starts at address 0x0
sum_loop:
    slli x4, x1, 2  // x4 = 4*i
    add x5, x4, x3  // x5 = address of A[i]
    lw x5, 0(x5)    // x5 = A[i]
    add x6, x6, x5  // x6 = sum(A[i:0])
    addi x1, x1, 1  // increment i
    blt x1, x2, sum_loop // continue looping
divide_by_n:
    // divide by 225 here
```

Answer the following questions about the behavior of the cache during execution of the above code. Assume that the cache uses a least recently used (LRU) replacement policy, that the cache is initially empty, and that all cache lines in Way 0 are currently the least-recently used.

(D) How many instruction fetches and data accesses occur per iteration of the loop?

   **Number of instruction fetches per loop iteration:** __6__

   **Number of data accesses per loop iteration:** __1__

(E) What is the hit ratio for all memory accesses (both instruction fetches and data accesses) the first time through the loop?

   **First loop iteration hit ratio:** __4/7__

(F) After the program has been running for many loop iterations, what is the steady-state hit ratio for all memory accesses (both instruction fetches and data accesses)?

   **Steady-state hit ratio:** __27/28__

(G) Dilvina and Saniel want to use the cache above in their memory hierarchy between the CPU and main memory. The cache takes 5 cycles to determine if a memory access is a miss or a hit, and the main memory in their system takes an additional 140 cycles for each memory access that reaches it. What is the steady-state AMAT of their memory system during the test that they ran above?
AMAT = hit\_time + (1-\text{HR})(\text{miss penalty})
= 5 + (1/28)(140) =

AMAT: \underline{10}