cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:32 AM Color profile: Disabled

Composite Default screen

۲

P&H = Patterson & Hennessy; to indicate stuff from/based on material from 2-4th editions of the famous text, used by author in undergrad course on the subject taught 2000-2011.

generate, which may not reference real/physical memory locations at all (actually don't).

Memory bits are moved in generic blocks to hide details of how processors and instrs access memory (and other ISA and implementation stuff).

Processor word, typically 2's complement operand, in the original MIPS 32 bits stored in GPR that also hold addresses.

Ouiz How many bytes of memory in example?

© 2024 Dr. Muhammad Al-Hashimi

### **A Basic Cache**

Memory block

#### addresses that programs Processor may use any memory address but is only physically connected to locations in cache

#### Architecture independent unit

#### A running example

Solution States Sta

Source Section Sec

Same as processor & instr words (32 bits)



© 2024 Dr. Muhammad Al-Hashimi

0

### **Direct Mapped Block Placement**

Scache [block] line





© 2024 Dr. Muhammad Al-Hashimi

### Direct Mapped Cache Operation

Compulsory miss





#### 0

The initial 3 misses must occur (**compulsory**), whereas the last one (marked) is only due to <u>competition</u> on the same cache block.

#### Exercise

Construct the block address by inspecting the cache contents in figures.

| 0 | 1 | 2 | 3  | 4 | 5 | 6 | 7 |
|---|---|---|----|---|---|---|---|
|   |   |   | 10 |   |   |   |   |
| 0 | 0 | 0 | 1  | 0 | 0 | 0 | 0 |

© 2024 Dr. Muhammad Al-Hashimi

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:34 AM Color profile: Disabled Composite Default screen

### A Basic Cache Review: Main Concept

Why wrong question to ask if block is in cache <u>or</u> main memory? (Answer on this slide).

Size of index bits depends on the number of cache lines (blocks), e.g., 6 bits index 64 lines.

#### Quiz

Associate cache ops with memory access instructions in MIPS. (**Hint**: i.e., loads and stores; tricky).

#### Block mapping: a foundation, relates seats/spots in the cache

Block location (find in cache)
 Block (re)placement

| Block Addr |      |             |  |  |  |  |  |  |  |
|------------|------|-------------|--|--|--|--|--|--|--|
|            | tag  | cache index |  |  |  |  |  |  |  |
|            | bits | bits        |  |  |  |  |  |  |  |

Clearly, Iw: mem reads, and sw: mem writes. Address requests in prev slide could have come from any mix of Iw/sw. However cache ops are more primitive. Both reads in cache, and on miss both in cache, and on miss both (re)place blocks.

If in cache then must also be in memory, by definition of a hierarchy (if not in main memory it wouldn't be cached in the first place). The question is can it be referenced quickly?

KAU • CS-7046

⇔ Miss/hit ratio

→ Miss penalty (later)

© 2024 Dr. Muhammad Al-Hashimi

### A Basic Cache Design Review

-> Spatial locality

### Direct mapped, 1-word block

Simple (block addr same as word)

block transfer ving only one vord (instruction **Fast** (find quickly + relatively min miss penalty)

Solution Temporal locality only (no nearby addrs)

Simplified writes (create versions of a block)

Miss ratio? Clearly a desirable performance metric

KAU • CS-7047

A minimal block transfer time involving only one processor word (instruction or operand perhaps) that may be accessed repeatedly in a short time.

Hit and miss similar, involving a direct write to cache with the same overhead.

#### 0

The 2nd miss (16) could perhaps have been avoided if some neighboring words to 19 were included in the block (i.e., some **spatial locality**).

© 2024 Dr. Muhammad Al-Hashimi



### **Improving Performance**



© 2024 Dr. Muhammad Al-Hashimi

### Improving Performance Multi-word Mapping





A design parameter may affect more than one performance metric in different ways.

Quiz

Specify consequences of using chip space for one big instr + data cache?

© 2024 Dr. Muhammad Al-Hashimi

→ Block size

### Cache Performance 101 Understanding

#### Conflict miss

Quiz

Identify compulsory misses for the request scenario in the multi-word cache. Compare to the 1word block cache.

#### Quiz

Which performance metric is affected in each case? State how.

|   | <br>17<br>17             | <br> |  |
|---|--------------------------|------|--|
|   | <br>25<br>25             | <br> |  |
| ٨ | <br><del>-17</del><br>17 | <br> |  |

Most words unused before a block is replaced, underuse more likely as blocks become wider (ironically, less use of better spatial locality) + rise in less productive bus traffic.

© 2024 Dr. Muhammad Al-Hashimi

#### ➡ Lower miss rate

initially less misses

Better use of spatial locality

#### 🖙 Side-effects 🖗

More <u>competition</u> between blocks over limited cache spots

More time to fetch a missed block from memory
A tradeoffs are

tradeoffs are fundamental to cache design

### A Basic Cache Design Review 2

#### 0

Less tags to save vs 1-word blocks (why? *Ans. next slide*). What about total tag bits?

Write misses may incur additional overheads (later).

Misses <u>can</u> be reduced but those due to conflict may become more prominent.

#### Direct mapped, multi-word block

- Solution Temporal and spatial locality
- Still fast finds (but careful with miss penalty)
- Setter miss rate (with conflict miss issues)
- Size prevents some misses due to capacity

#### **Pursuing lower miss ratios**

increased penalty

#### 8-

More conflicts lead to larger blocks replaced too soon at a <u>higher cost</u> before reasonably utilizing their locality, a byproduct of (direct) mapping of blocks.

© 2024 Dr. Muhammad Al-Hashimi

Pressures to keep misses in check
 Rethink block mapping

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:38 AM Color profile: Disabled Composite Default screen

Seat assignment in a bus or a plane is a good example of flexibilities in block mapping strategies.

Tough luck if more than one of the passengers assigned an overbooked seat showed up (overbooking a 2-seat row could accommodate up to two at a time).

Г Overbooking all seats as a set (the trip) causes issues free seating only if there is no more room.

۱...۲

Г

© 2024 Dr. Muhammad Al-Hashimi

assigned row of seats

уомелек' інскедзе. cache. Bits in a tag, than 32 for the 1-word 8 рүоскя-газя кагиек word blocks resulted in -+ 'әլдшрхә 8иіппп one needs a tag. In the fewer tass since each fewer blocks, therefore, ot shast leads to

KAU • CS-70413



associate with



assigned

seat

**Improving Performance Mapping Strategies** 

### **Associative Cache**

#### Multi-block mapping

set-of-one? Blocks individually indexed



#### ⇒ Direct-mapped Map to <u>one</u> block in cache

Map a mem block to <u>set</u> of cache blocks, not just one (fig: 2-way); blocks *fade*, mapping sees sets.

#### n-way set associative



Divide cache into sets of n>1 blocks
 Map to set of blocks in cache



## Fully associative Map to all cache blocks (place anywhere)

© 2024 Dr. Muhammad Al-Hashimi

### Associative Cache Block Organization

Same 16 blocks re-organized to different degrees of associativity.

#### A 16-block Cache

Note the **mappable unit** (numbered) and resulting cache size <u>from mapping</u> <u>formula viewpoint</u>.



2-way set associative





© 2024 Dr. Muhammad Al-Hashimi

#### cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:39 AM Color profile: Disabled Composite Default screen

#### Any block in a set is usable (replacement is more involved in associative cache than in a direct-mapped, why?).

#### Quiz

Place the following memory requests after address 8: 20, 0, 28. Discuss replacement logic, if needed (look it up).

#### Exercise

Assuming 4-word blocks, use the <u>formula</u> to determine the **cache set** for **byte-address** 102 for 1 to 16-way associativity.

Verify the **index field** indicating set number (block# for 1-way) in the subfigure.

© 2024 Dr. Muhammad Al-Hashimi

#### Associative Cache Block Placement

#### Cache set/line



### **Associative Cache Block Location**



© 2024 Dr. Muhammad Al-Hashimi

to be checked.

Exercise

۲

#### Exercise

Use formulas to verify: block numbers (formula Slide 9), and the cache index in each case.

Essentially, two fresh requests competing for the same mappable unit (block/set) must generate the same compulsory misses in all caches (marked **>**).

Words from up to 2 competing memory blocks may be accommodated by both associative caches.

Another word from a 3rd competing block, after a compulsory miss on all (including replacement on 1/2-way), will hit on a subsequent reference only in the fully associative.

Exercise

Suggest a word from a 3rd competing block. Write the miss ratio for each case. *Hint: write words in mem blocks 0-9.* 

© 2024 Dr. Muhammad Al-Hashimi





#### Associative Cache Performance



#### Miss rate improvement

P&H: FastMATH-like 64 KB (16-w block) Benchmark: SPEC2000, data cache D-mapped: 10.3% 2-way set-assoc: 8.6% (16.5% better)

#### ➡ Hit time: the access latency



In contrast, direct mapped locates blocks via a formula hence will not involve a similar effect.

© 2024 Dr. Muhammad Al-Hashimi



KAU • CS-704 19

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:42 AM Color profile: Disabled Composite Default screen

### **Cache Writes**

Writes cause copies in cache and memory of the same blocks to differ (become *inconsistent*).

#### Quiz

Why is there no difference between write hits and misses in a 1-word block cache?

#### Write hit in general Block in cache, write word to cache

### ⇔ Write miss (multi-word)

Fetch correct block then write (later)



© 2024 Dr. Muhammad Al-Hashimi

# Block versions: stale info A block in cache will have other copies elsewhere in a hierarchy

KAU • CS-704 20

### Cache Writes Block Consistency

Simpler, may be a practical choice when chip resources are limited.

#### Write-through policy

Write both cache and main memory (carbon-copy) on every write

#### Write-back policy

Write cache; update memory on block replacement

### stale Memory

© 2024 Dr. Muhammad Al-Hashimi

#### Cache coherence

Same blocks in processor-level (private) caches

### Cache Writes Multi-word Miss

A request to write word 27 (from mem block 6) generates a miss since mapped cache spot is occupied by block 4, hence a miss.

Can't just write the word and update tag (see bottom fig), must first bring the right block (one to which new word belongs).

Therefore, 1) fetch new block, 2) update main mem after new word is written <u>if</u> <u>write-through</u>.



© 2024 Dr. Muhammad Al-Hashimi

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:43 AM Color profile: Disabled Composite Default screen

### Write Buffers

Write-through a fast buffer instead of directly to slow memory, i.e., complete a <u>consistent</u> write quickly + free processor early.

Deeper write buffers (4 in fig) allow the CPU to work with little or no stalls despite higher write misses (up to 4 updates in-progress may be accommodated).

Generally, a **buffer** (in the context) is any intermediate storage used to isolate access (e.g., fast from slow or read from write or user from system).

**Quiz** What if writes occur in large bursts?

© 2024 Dr. Muhammad Al-Hashimi



cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:43 AM Color profile: Disabled Composite Default screen

#### Cache Writes Performance

A write buffer reduces CPU stalls due to memory update (fixes writethrough somewhat).

#### ⇔ Write-through via write buffer Essentially hides mem updates from CPU

#### ⇒ Write buffer works if

Solution Writes occur at less rate than mem update completion

Writes don't occur in <u>bursts</u> (too many in short time)

Write bursts overwhelm buffers

© 2024 Dr. Muhammad Al-Hashimi

KAU • CS-704 24

#### Quiz

Identify the organization and the amount of data bytes the cache can hold, assuming 32bit address and data words. Where would byte-address 108 go? Answers last slide.

To calculate storage for 256 bytes of data (figure), determine #blocks (16, regardless of org), block size (128 bits), and tag size (26 bits, based on org), resulting in 2480 bits or 310 bytes.

#### **Exercise**

Compare storage cost for 1 to 16-way. Repeat for 16kB data. Hint: block address length does not change (why?).

**Exercise** Write a formula for storage bits if #blocks is  $2^n$ , #bytes/block is 2<sup>m</sup>.

© 2024 Dr. Muhammad Al-Hashimi



## **Observations**

32-bit byte-addr 108 ···· 0 0110 1100 26 -

Extra tag and valid bits

Tag depends on block organization

Tag size increases with associativity

May need all bits in block addr to tag (when?)

KAU • CS-704 25

### **Storage Cost**



### Improving Performance Design Review 3

A larger or highly associative cache will eliminate more avoidable misses but may become less responsive than design requirements.

#### Exercise

Is it a good idea to restrict a cache to reads to keep design simple? Research credible data about writes to support or refute bypassing cache for memory writes.

© 2024 Dr. Muhammad Al-Hashimi

#### Associative (multi-block), multi-word

Better use of temporal & spatial locality
 More storage & hit cost (careful with associativity)
 Best miss rate (within capacity and responsiveness tolerances)

#### **Optimizing performance**

Other cache design factors?
 Component or system concerns?



© 2024 Dr. Muhammad Al-Hashimi

KAU • CS-704 27

### Cache Performance 101 A Simplified Model

Focus on cycles added by memory (hits excluded since hidden in CPU cycles).

Assume mem stall cycles are due to cache miss only (isolate cache effects).

A write miss is complex since writes must manage changes in copies of the same block.

Assume write-through policy with <u>deep</u> write buffers (make reads and writes suffer <u>similar</u> <u>miss penalties</u>, mostly block fetch time).

#### 0

2nd assumption makes the cost of write miss (in cycles) similar to read so that they may be <u>counted together</u> (combined) to simplify.

© 2024 Dr. Muhammad Al-Hashimi



cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:44 AM Color profile: Disabled Composite Default screen

### Cache Performance 101 Improvement

Dividing the formula by an IC gives a memory-based CPI to add to a processor-focused CPI based on a perfect cache (that always hits) to obtain a slightly more realistic CPI/IPC.

memory stall cycles = memory requests × miss rate × miss penalty (cycles)

change characteristics

re-arrange subsystem

The number of memory requests generated by the program is not controlled by the memory system.

Writes increase the need to access memory while a larger block increases the cost of each access.

© 2024 Dr. Muhammad Al-Hashimi

**Suggest 2 strategies** 

Seduce miss rate (e.g., associativity)

Reduce miss penalty

KAU • CS-704 29

### **Improving Performance Reducing Miss Penalty**

#### 

Check 2 places to reduce the miss penalty: a) where a miss is satisfied, and b) the interfaces used to satisfy misses.

A miss from main memory, costly.

Miss addresses are typically passed to a dedicated cache controller that talks to a DRAM controller (some request setup time may also apply).

Two main interfaces are involved: a) the DRAM chips interface and b) the cache-tomemory subsystem.

© 2024 Dr. Muhammad Al-Hashimi



### A miss from main memory, typically DRAM, is relatively Main parts of miss penalty

Request (send addr) time DRAM first access delay (latency) Block transfer time

KALL • CS-70430

### Reducing Miss Penalty Multilevel Cache

To reduce miss penalty change: a) *where a miss is satisfied, ...* 

An L2 cache reduces expensive misses from DRAM (splits

original miss rate) + frees L1

to <u>focus</u> on keeping up with the processor (reducing access

Calculate the cost in cycles

when a miss can not be satisfied from L2 (along red path).

latency).

Exercise

#### **Design flexibility** Optimized cache performance

Intel Core i5-661 (ignore L2) L1 L1 L2 L3 DRAM 4c 10c 39c 76.4 ns cheaper miss 3.33 GHz **4c** (Ex: 4c) L2 (Ex: 76.4 ns = ? c)Main Memory (DRAM)

© 2024 Dr. Muhammad Al-Hashimi

KAU • CS-704 31

#### cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:47 AM Color profile: Disabled Composite Default screen

## Reducing Miss Penalty Cache-Memory Interfaces

Physical cell array, usually in a chip; cells

accessed via a common

row-column scheme.

### To reduce miss penalty change: ... b) *interfaces* used to satisfy misses.

Physical characteristics and organization of internal circuitry/devices determine the latency.

Bit width and clock speed contribute to **bandwidth**.

A fast subsystem can transfer a full block at once (perhaps at the same clock as the processor) or **interleave** addressing to overlap access penalty.

A fast **DRAM interface** may use more data channels or a higher clock to deliver bits off the chip package or increase data rate.

© 2024 Dr. Muhammad Al-Hashimi

**Cost mostly** 

## DRAM <u>latency</u>: cell access initiation

Block transfer: interface <u>bandwidth</u>



cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:47 AM Color profile: Disabled Composite Default screen

### Reducing Miss Penalty Architectural Concerns

Main memory in CPU-like general-purpose processors and graphics memory in SIMD-like GPU.

Details of internal org and access increasingly transparent to system, DRAM interface growing in complexity.

Exercise Outline the conflicting demands that low response time and high bandwidth place on communication channels. Give examples for solutions and tradeoffs.

© 2024 Dr. Muhammad Al-Hashimi

#### **DRAM: modern primary storage**

## Shifting complexity Mem system vs DRAM chips package

DRAM: respond quickly, transfer in bulk
 Memory system: fast transport lanes

#### A balancing act Response time vs throughput

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:48 AM Color profile: Disabled

Color promo. ....

#### Exercise

Lookup terms (note architectural/system implications of move to **DDR5 SDRAM**).

Interleaving may be implemented directly by a memory system or transparently by the DRAM package via internal separately accessible memory banks (see SDRAM).

#### Exercise

Identify earliest Intel or AMD mainstream  $\mu$ -architectures to get a DRAM controller on die.

Speed contiguous address blocks (DDR transfer data on rising and falling edges of a clock; multiple channels interleave access to data in or across modules).

Burst data size/channel doubled in DDR5 SDRAM from 8 to 16 bytes (16 × 2 channels/DIMM × 2 transfers/clock = 64 bytes, typical cache block size, from 1 DIMM in 1 op).

© 2024 Dr. Muhammad Al-Hashimi

## Reducing Miss Penalty **Speeding Memory/DRAM**

Interleaved memory (beat latency)
 Multiple (cell-array) banks
 Tightly-integrated controller

## Multi-channel memory (boost bandwidth) Burst mode https://en.wikipedia.org/wiki/DRAM DRAM okin mediane biotering live

Double data rate

DRAM chip packages historically offered a variety of ways to interface the memory cells, now mainly **SDRAM** (synchronous, i.e., transfer over a common clock) in DIMM (dual in-line memory module) chip package.



#### Reducing Miss Penalty High Bandwidth Cache

Planar designs, which lay caches over the same 2D plane as processor(s), incur increasing cache latency as they scale up in size and complexity due to interconnection distances.

Cache embedded in vertically stacked dies allow closer hence faster connections.

Figures depict single and multiprocessor (with shared cache) scenarios.



**Quiz** Which aspect of misses (rate

or penalty) should benefit?

Higher density more cache.. less chip space
 Less latency more (instr/data) processing bandwidth
 Better thermals more power efficiency

© 2024 Dr. Muhammad Al-Hashimi

cs704fig\_cache.cdr Thursday, April 18, 2024 11:47:49 AM Color profile: Disabled Composite Default screen

#### Exercise

Review textbook carefully to fill this table (or as you go through reading material).

#### Notes

|  | • |  |  | • |  |  | • |  |  |  |  |  | • |  | • | • |  | • |  |  |
|--|---|--|--|---|--|--|---|--|--|--|--|--|---|--|---|---|--|---|--|--|
|  |   |  |  |   |  |  |   |  |  |  |  |  | • |  |   |   |  |   |  |  |
|  |   |  |  |   |  |  |   |  |  |  |  |  | • |  |   |   |  |   |  |  |
|  |   |  |  |   |  |  |   |  |  |  |  |  |   |  |   |   |  |   |  |  |
|  |   |  |  |   |  |  |   |  |  |  |  |  |   |  |   |   |  |   |  |  |
|  |   |  |  |   |  |  |   |  |  |  |  |  |   |  |   |   |  |   |  |  |

Storage Cost Answers

• 16-block (tags mark blocks) • 4-way set associative (groups of 4 blocks assigned a common index number)

- Data size: 256 bytes = 16 blocks × 4 words/block × 4 bytes/word • 108 may be placed anywhere in
- set 2 (indicated by 2-bit cache index field in address)
- Block addr length depends only on #bytes in block (16), hence always 28 bits (max tag size) for 32-bit address
- Remember to include the valid bit in storage calculations

© 2024 Dr. Muhammad Al-Hashimi

| Perf Metric<br>actor | Miss/hit<br>rate            | Hit<br>time | Miss<br>penalty | Cache<br>bandwidth |
|----------------------|-----------------------------|-------------|-----------------|--------------------|
| ativity              | Reduce block<br>competition | How?        | /               |                    |

| Design Factor                                                            | Tale                                                       | ume  | penalty | Danuwium |  |  |  |  |  |  |
|--------------------------------------------------------------------------|------------------------------------------------------------|------|---------|----------|--|--|--|--|--|--|
| Associativity                                                            | Reduce block<br>competition<br>Flexible block<br>placement | How? |         |          |  |  |  |  |  |  |
| Block size                                                               |                                                            |      |         |          |  |  |  |  |  |  |
| Mem bandwidth                                                            |                                                            |      |         |          |  |  |  |  |  |  |
| Cache size                                                               |                                                            |      |         |          |  |  |  |  |  |  |
| Multilevel                                                               |                                                            |      |         |          |  |  |  |  |  |  |
| Split/combine                                                            |                                                            |      |         |          |  |  |  |  |  |  |
| Performance: Affect positively (increase) ( Affect negatively (decrease) |                                                            |      |         |          |  |  |  |  |  |  |

Performance: S Affect positively (increase)

KAU • CS-70436

### Cache Design