# Memory Systems Address Space



Programmers visualize memory logically as numbered boxes (each stores a byte of info).

Where info comes from depends on an underlying hierarchy of physical memories.

© 2022 Dr. Muhammad Al-Hashimi

KAU • CS-704 1

# Address Space Example

→ Object code

# **MIPS 32 program memory**

Total of 0x80000000 bytes (2 GB) is allocated for a MIPS program (rest of the 32-bit space reserved for the OS).

A convention (standard way) of organizing how the 2 GB logical addr space is used makes programs easier to read and debug.

- Reserved = OS exceptions code
- Text = program instructions
- Static data = persistent and compiler data (const, literals, global and static vars)
- Dynamic data = runtime heap (malloc/objects)
- Stack = procedure arguments, return values, and local vars



### Program/object code addr space (bytes): 0 – 2147483647 (0x0000 0000 – 0x7fff ffff)
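The sizes above can be checked with a little arithmetic, and the layout convention can be sketched as a lookup. This is a minimal sketch: the segment start addresses follow the usual textbook MIPS layout and are an assumption (the slides do not state them), and `segment_of`, `heap_start`, and `stack_pointer` are hypothetical names introduced for illustration.

```python
# Sizes quoted in the slides.
PROGRAM_SPACE_BYTES = 0x8000_0000      # bytes allocated to a MIPS program
TOP_ADDRESS = PROGRAM_SPACE_BYTES - 1  # highest program address

# ASSUMPTION: conventional segment starts from the usual textbook MIPS layout;
# these values are not stated in the slides.
TEXT_START = 0x0040_0000
STATIC_DATA_START = 0x1000_0000


def segment_of(addr, heap_start, stack_pointer):
    """Name the segment an address falls in.

    heap_start and stack_pointer are hypothetical, program-dependent values:
    the end of static data and the current top of the downward-growing stack.
    """
    if not 0 <= addr <= TOP_ADDRESS:
        return "outside program space"
    if addr < TEXT_START:
        return "reserved"
    if addr < STATIC_DATA_START:
        return "text"
    if addr < heap_start:
        return "static data"
    if addr >= stack_pointer:
        return "stack"
    # Everything between static data and the stack is treated as the
    # dynamic (heap) region here, including any part not yet allocated.
    return "dynamic data"


print(PROGRAM_SPACE_BYTES // 2**30, "GiB")
print(TOP_ADDRESS, hex(TOP_ADDRESS))
print(segment_of(0x0040_0000, heap_start=0x1001_0000, stack_pointer=0x7FFF_F000))
```

The heap and stack boundaries are passed in as arguments because they move at run time, unlike the fixed starts of the text and static data segments.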


# **Virtual Memory**

**Virtual** = not really, but in effect (practically).

**Real memory** (popularly known as RAM), also called primary storage or **main memory**, refers to physical memory locations used for <u>active</u> code, currently implemented in DRAM technology.

VM features close cooperation between hardware (usually integrated with the processor) and software (usually part of the OS) invisibly to user programs.

VM systems provide 2 important services: **protection** (private address space) and **relocation** (independent mapping to physical memory).
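The two services can be illustrated with a toy model. This is a minimal sketch, not how any particular OS implements them: the 4 KiB page size, the process names, the frame numbers, and the `AccessViolation` exception are all hypothetical choices made for illustration.

```python
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages


class AccessViolation(Exception):
    """Raised when a process touches a page it has no mapping for."""


# Hypothetical per-process page tables: virtual page number -> physical frame.
page_tables = {
    "A": {0x400: 0x10, 0x401: 0x11},
    "B": {0x400: 0x2A},
}


def physical_address(process, vaddr):
    """Map a virtual address of one process to a physical address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    table = page_tables[process]
    if vpn not in table:
        # Protection: a process only reaches pages in its own table.
        raise AccessViolation(f"process {process}: page {vpn:#x} not mapped")
    # Relocation: the frame is chosen per process, independent of vaddr.
    return table[vpn] * PAGE_SIZE + offset


# The same virtual address lands in different physical locations.
print(hex(physical_address("A", 0x400010)))
print(hex(physical_address("B", 0x400010)))
```

Because each process has its own table, neither program needs to know where the other was placed in physical memory.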


Virtual address space corresponds to memory *apparently* available to programs from a processor viewpoint (like the logical memory available to programmers).

# Virtual vs. real memory

Role: transparently

- Exceed real memory size limitation
- Manage shared memory efficiently

# Virtual addresses




Each program can access a virtual address space.

The addresses, always allocated in overflow locations kept in magnetic/flash storage (a swap space), may also map to real memory.

Different era, different terms, but essentially the same things.

VM is an older technology where physical memory acts as a cache for memory kept in slow storage.

Design is dominated by the huge difference in access time/latency between DRAM and magnetic storage.

VM historically was about providing more memory; now it serves mainly to abstract the program address space and to support robust, secure, and cost-effective multitasking.



CPU generates virtual addresses

# VM terminology vs cache

- VM block = page
- VM miss = page fault
- Mapping = address translation


# **Virtual Memory** Concepts


# Virtual Memory Scheme




# Virtual Memory Address Translation

### Page table



A **page table** in physical memory stores the translations.

Frequently used ones are cached in a **translation look-aside buffer** (**TLB**) inside the processor.

Two scenarios depict the best case for a read. Left (solid line): real addresses index the cache, as pictured in the top figure.

Right (dashed line): cache index may come from a virtual address (virtually addressed cache) so that cache can be accessed while waiting for a physical address.
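The lookup order described above (TLB first, then the page table in physical memory) can be sketched as follows. This is a minimal sketch under stated assumptions: a 4 KiB page size, Python dictionaries standing in for the TLB and the page table, and hypothetical names (`translate`, `PageFault`) and mappings that do not come from the slides.

```python
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages, so the low 12 bits are the offset


class PageFault(Exception):
    """The page has no translation: it is not in physical memory."""


def translate(vaddr, tlb, page_table):
    """Return (physical address, where the translation was found)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number + page offset
    if vpn in tlb:
        frame, source = tlb[vpn], "TLB hit"
    elif vpn in page_table:
        frame, source = page_table[vpn], "TLB miss"
        tlb[vpn] = frame  # cache the translation for later accesses
    else:
        raise PageFault(f"virtual page {vpn:#x}")
    # The offset passes through unchanged; only the page number is mapped.
    return frame * PAGE_SIZE + offset, source


# Hypothetical mappings for illustration.
page_table = {0x00400: 0x12, 0x10000: 0x07}
tlb = {}

paddr, source = translate(0x00400ABC, tlb, page_table)
print(hex(paddr), source)
paddr, source = translate(0x00400ABC, tlb, page_table)
print(hex(paddr), source)
```

The first access finds the translation only in the page table and fills the TLB; the repeat access to the same page is then served by the TLB without touching the page table.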

#### Quiz

What is the advantage of virtually indexing the cache? (Technically, virtually addressed if virtually tagged also). **Hint**: check critical path to hit.


# Memory Systems Info Hierarchy

Segmented VM



\* Per GB (AVE). Consumer/retail newegg.com@2021/12/14 (best selling) • DRAM DDR3-1333 4/8G DIMM • SSD/NAND 256-1TB • HDD/internal 10-20 TB


#### Exercise

Compare with preceding gen i5-750 to find the compromise.

Released January 2010, dual core with 2 levels of private cache per core running at 3.33 GHz; <u>off-core</u> die shared L3 cache + other parts of the CPU package, <u>including</u> <u>DRAM controller</u>, at 2.4 GHz.

#### Quiz

What is the processor clock cycle time in ns? Determine the DRAM latency in processor cycles.

Clearly, L3 cache is designed to minimize expensive miss penalty from DRAM.

The separate-die memory controller results in a higher access cost than expected from the SDRAM alone (48.75–52.5 ns\*). \* JEDEC, DDR3 SDRAM Specifications, 2010.


# Memory Systems Real World Example

Architectural compromise

# Intel Core i5-661, Clarkdale

Clock 3.33/2.4 GHz, associative cache, 64-byte block, DDR3-1066 SDRAM

L1 split 32 KB/4-way, 32 KB/8-way (data)

L2 unified 256 KB 8-way

L3 4 MB 16-way @ 2.4 GHz

Latencies:

| L1 | L2  | L3  | DRAM    |
|----|-----|-----|---------|
| 4c | 10c | 39c | 76.4 ns |

Compare DRAM latency 1999–2015: 46–60 ns (*tRAS+tRP*), Kevin K. Chang et al. (pp. 323–336, SIGMETRICS '16).
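Converting between cycles and nanoseconds puts the latencies in the table on a common scale. The sketch below uses only the figures quoted for the i5-661 (3.33 GHz core clock, 76.4 ns DRAM latency); treating the cache latencies as counts of 3.33 GHz core cycles is an assumption, since the slides do not say which clock the "c" values refer to.

```python
CORE_CLOCK_GHZ = 3.33    # processor core clock from the slide
DRAM_LATENCY_NS = 76.4   # DRAM latency from the table

# Cycle time in ns is the reciprocal of the clock rate in GHz.
cycle_ns = 1 / CORE_CLOCK_GHZ

# Latency in cycles = latency in ns * cycles per ns.
dram_cycles = DRAM_LATENCY_NS * CORE_CLOCK_GHZ

print(f"cycle time  : {cycle_ns:.3f} ns")
print(f"DRAM latency: {dram_cycles:.0f} core cycles")

# ASSUMPTION: the cache latencies are expressed in core clock cycles.
cache_latency_cycles = {"L1": 4, "L2": 10, "L3": 39}
for level, cycles in cache_latency_cycles.items():
    print(f"{level}: {cycles * cycle_ns:.1f} ns")
```

On these numbers a DRAM access costs roughly 254 core cycles, several times the L3 latency of 39 cycles.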

# **Program Performance**



- Memory models hide a physical hierarchy
- Machine code interacts with the hierarchy
- Program performance will vary accordingly

Some performance problems can only be resolved by examining the hierarchy and adjusting the code or the algorithm.


# Program Performance Example





Similarly, a program may interact negatively with the VM system by using too many pages at the same time, or generating too many TLB misses.
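A small simulation can make the TLB effect concrete. This is a minimal sketch under stated assumptions, none of which come from the slides: 4 KiB pages, 8-byte elements, a 64-entry fully associative TLB with LRU replacement, and a page-aligned 512 × 512 array stored row by row. The function names are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096   # ASSUMPTION: 4 KiB pages
ELEM_SIZE = 8      # ASSUMPTION: 8-byte array elements
TLB_ENTRIES = 64   # ASSUMPTION: fully associative TLB with LRU replacement


def tlb_misses(addresses, entries=TLB_ENTRIES):
    """Count TLB misses for a stream of byte addresses."""
    tlb = OrderedDict()  # virtual page numbers, least recently used first
    misses = 0
    for addr in addresses:
        vpn = addr // PAGE_SIZE
        if vpn in tlb:
            tlb.move_to_end(vpn)         # mark as most recently used
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


def row_order(n):
    """Addresses of an n x n row-major array visited row by row."""
    return ((i * n + j) * ELEM_SIZE for i in range(n) for j in range(n))


def column_order(n):
    """Addresses of the same array visited column by column."""
    return ((i * n + j) * ELEM_SIZE for j in range(n) for i in range(n))


N = 512  # each row is 512 * 8 = 4096 bytes, exactly one page
print("row order   :", tlb_misses(row_order(N)), "TLB misses")
print("column order:", tlb_misses(column_order(N)), "TLB misses")
```

Row order touches each page once before moving on, while column order cycles through 512 pages with only 64 TLB entries, so under these assumptions every access misses. The work done is identical; only the access order differs.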


# Program Performance 3 Miss Types

Each of these misses can add anywhere from tens to millions of cycles of latency to an individual instruction.

Execution times of different programs increase depending on how much and which type of misses they generate.

In the *Core i5-661*, the slower off-core memory controller adds stall cycles to memory-intensive programs every time the page table is accessed or a cache miss is satisfied from DRAM.

# Cache miss

Request not in close reach of CPU

# TLB miss

A fast translation not available

# Page table miss (page fault)

Address not in physical memory

# Program Performance Constituents

Different parts of a computing system interact to affect program performance.

Disk encryption programs run faster on the *i5-661* than on previous-generation processors due to 6 new instructions that support AES encryption/decryption.

A carefully designed ISA can lead to cost-effective pipelined microarchitecture that economically delivers power-efficient high instruction execution rates.


## **☑** Memory hierarchy

### Impact of ISA

- Instruction selection
- Instruction design

### **Microarchitecture**

- Datapaths/caches + control streams
- Processor and instruction-level parallelism

### Program

- Algorithm selection
- Programming environment (lang/compiler)

□ Impact of I/O

# Program Performance Improvement


# Algorithmic advantage: maximize computation efficiency

In addition to time, power savings may be achieved.

Maximize parallelizable tasks.

Sometimes it pays to restrict problem instances (focus on those of interest).

- Fewer/cheaper operations (reduce work)
- Overlap operations (parallelize)
- Limit/break scope of work (localize)
- Clever shortcuts (heuristics)

# Language/Compiler advantage: maximize code efficiency

- Better use of machine features


# **Program Performance** Improvement

# Architectural advantage: maximize machine utilization

- Better ILP
- Add parallel processing resources
- More addr block usage before replacement (focus expensive misses)
- Reduce reliance on bandwidth-limited interconnect channels

Although the advantages can be exploited independently, they do affect each other.

For example: to accommodate or utilize a specific memory arrangement, gaining the architectural advantage may require reworking the algorithm to alter memory locality.

**To Think About** Which performance constituent would be easier to control?