Berger, D., Goodman, J.R., Sohi, G.S. “Memory Systems”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
88
Memory Systems
88.1 Introduction
88.2 Memory Hierarchies
88.3 Cache Memories
88.4 Parallel and Interleaved Memories
88.5 Virtual Memory
88.6 Research Issues
88.1 Introduction
A memory system serves as a repository of information (data) in a computer system. The processor [also called
the central processing unit (CPU)] accesses (reads or loads) data from the memory system, performs compu-
tations on them, and stores (writes) them back to memory. The memory system is a collection of storage
locations. Each storage location, or memory word, has a numerical address. A collection of storage locations
forms an address space. Figure 88.1 shows the essentials of how a processor is connected to a memory system
via address, data, and control lines.
When a processor attempts to load the contents of a memory location, the request is very urgent. In virtually
all computers, the work soon comes to a halt (in other words, the processor stalls) if the memory request does
not return quickly. Modern computers are generally able to continue briefly by overlapping memory requests,
but even the most sophisticated computers will frequently exhaust their ability to process data and stall
momentarily in the face of long memory delays. Thus, a key performance parameter in the design of any
computer, fast or slow, is the effective speed of its memory.
Ideally, the memory system would be both infinitely large so that it can contain an arbitrarily large amount
of information and infinitely fast so that it does not limit the processing unit. Practically, however, this is not
possible. There are three properties of memory that are inherently in conflict: speed, capacity, and cost. In
general, technology tradeoffs can be employed to optimize any two of the three factors at the expense of the
third. Thus it is possible to have memories that are (1) large and cheap, but not fast; (2) cheap and fast, but
small; or (3) large and fast, but expensive. The last of the three is further limited by physical constraints. A
large-capacity memory that is very fast is also physically large, and speed-of-light delays place a limit on the
speed of such a memory system.
The latency (L) of the memory is the delay from when the processor first requests a word from memory
until that word arrives and is available for use by the processor. The latency of a memory system is one attribute
of performance. The other is bandwidth (BW), which is the rate at which information can be transferred from
the memory system. The bandwidth and the latency are related. If R is the number of requests that the memory
can service simultaneously, then:
BW = R / L    (88.1)
Doug Burger
University of Wisconsin-Madison
James R. Goodman
University of Wisconsin-Madison
Gurindar S. Sohi
University of Wisconsin-Madison
© 2000 by CRC Press LLC
From Eq. (88.1) we see that a decrease in the latency will result in an increase in bandwidth, and vice versa, if
R is unchanged. We can also see that the bandwidth can be increased by increasing R, if L does not increase
proportionately. For example, we can build a memory system that takes 20 ns to service the access of a single
32-bit word. Its latency is 20 ns per 32-bit word, and its bandwidth is 32 bits/(20 × 10⁻⁹ s), or 200 Mbytes/s. If the memory system is modified to accept a new (still 20 ns) request for a 32-bit word every 5 ns by overlapping requests, then its bandwidth is 32 bits/(5 × 10⁻⁹ s), or 800 Mbytes/s. This memory system must be able to handle four requests at a given time.
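As a concrete check of Eq. (88.1), the short Python sketch below reproduces the two bandwidth figures of this example; the function name and constants are illustrative, not part of the original text.

```python
# A minimal sketch of Eq. (88.1), BW = R/L, using the numbers from the example
# above. The function name and constants are illustrative assumptions.

def bandwidth_bytes_per_sec(requests_in_flight, latency_sec, bytes_per_request):
    """Peak bandwidth of a memory that can overlap `requests_in_flight` accesses."""
    return requests_in_flight * bytes_per_request / latency_sec

LATENCY = 20e-9      # 20 ns per access
WORD_BYTES = 4       # one 32-bit word

# One request at a time: 4 bytes / 20 ns = 200 Mbytes/s
print(bandwidth_bytes_per_sec(1, LATENCY, WORD_BYTES) / 1e6)   # 200.0

# A new request every 5 ns means four requests overlap: 800 Mbytes/s
print(bandwidth_bytes_per_sec(4, LATENCY, WORD_BYTES) / 1e6)   # 800.0
```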
Building an ideal memory system (infinite capacity, zero latency and infinite bandwidth, with affordable
cost) is not feasible. The challenge is, given a set of cost and technology constraints, to engineer a memory
system whose abilities match the abilities that the processor demands of it. That is, engineering a memory
system that performs as close to an ideal memory system (for the given processing unit) as is possible. For a
processor that stalls when it makes a memory request (some current microprocessors are in this category), it
is important to engineer a memory system with the lowest possible latency. For those processors that can handle
multiple outstanding memory requests (vector processors and high-end CPUs), it is important not only to
reduce latency, but also to increase bandwidth (over what is possible by latency reduction alone) by designing
a memory system that is capable of servicing multiple requests simultaneously.
Memory hierarchies provide decreased average latency and reduced bandwidth requirements, whereas
parallel or interleaved memories provide higher bandwidth.
88.2 Memory Hierarchies
Technology does not permit memories that are cheap, large, and fast. By recognizing the nonrandom nature
of memory requests, and emphasizing the average rather than worst case latency, it is possible to implement a
hierarchical memory system that performs well. A small amount of very fast memory, placed in front of a large,
slow memory, can be designed to satisfy most requests at the speed of the small memory. This, in fact, is the
primary motivation for the use of registers in the CPU: in this case, the programmer or compiler makes sure
that the most commonly accessed variables are allocated to registers.
A variety of techniques, employing either hardware, software, or a combination of the two, can be employed
to assure that most memory references are satisfied by the faster memory. The foremost of these techniques is
the exploitation of the locality of reference principle. This principle captures the fact that some memory locations
are referenced much more frequently than others. Spatial locality is the property that an access to a given
memory location greatly increases the probability that neighboring locations will soon be accessed. This is
largely, but not exclusively, a result of the tendency to access memory locations sequentially. Temporal locality
is the property that an access to a given memory location greatly increases the probability that the same location
FIGURE 88.1 The memory interface.
will be accessed again soon. This is largely, but not exclusively, a result of the frequency of looping behavior of
programs. Particularly for temporal locality, a good predictor of the future is the past: the longer a variable has
gone unreferenced, the less likely it is to be accessed soon.
Figure 88.2 depicts a common construction of a memory hierarchy. At the top of the hierarchy are the CPU
registers, which are small and extremely fast. The next level down in the hierarchy is a special, high-speed semi-
conductor memory known as a cache memory. The cache can actually be divided into multiple distinct levels;
most current systems have between one and three levels of cache. Some of the levels of cache may be on the CPU chip itself, some may be on the same module as the CPU, and others may be entirely separate chips. Below the
cache is the conventional memory, referred to as main memory, or backing storage. Like a cache, main memory
is semiconductor memory, but it is slower, cheaper, and denser than a cache. Below the main memory is the
virtual memory, which is generally stored on magnetic or optical disk. Accessing the virtual memory can be
tens of thousands of times slower than accessing the main memory because it involves moving mechanical parts.
As requests go deeper into the memory hierarchy, they encounter levels that are larger (in terms of capacity)
and slower than the higher levels (moving left to right in Fig. 88.2). In addition to size and speed, the bandwidth
between adjacent levels in the memory hierarchy is smaller for the lower levels. The bandwidth between the registers and the top cache level, for example, is higher than that between cache and main memory or between main
memory and virtual memory. Since each level presumably intercepts a fraction of the requests, the bandwidth
to the level below need not be as great as that to the intercepting level.
A useful performance parameter is the effective latency. If the needed word is found in a level of the hierarchy, it is a hit; if a request must be sent to the next lower level, the request is said to miss. If the latency L_HIT is known in the case of a hit and the latency in the case of a miss is L_MISS, the effective latency for that level in the hierarchy can be determined from the hit ratio (H), the fraction of memory accesses that are hits:

L_average = L_HIT · H + L_MISS · (1 – H)    (88.2)
The portion of memory accesses that miss is called the miss ratio (M = 1 – H). The hit ratio is strongly
influenced by the program being executed, but is largely independent of the ratio of cache size to memory size.
It is not uncommon for a cache with a capacity of a few thousand bytes to exhibit a hit ratio greater than 90%.
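Equation (88.2) is straightforward to evaluate directly. The sketch below uses assumed latencies of 10 ns for a hit and 100 ns for a miss, purely for illustration.

```python
# Effective latency per Eq. (88.2); the latency and hit-ratio values are
# assumptions chosen only to make the arithmetic concrete.

def effective_latency(l_hit, l_miss, hit_ratio):
    return l_hit * hit_ratio + l_miss * (1.0 - hit_ratio)

# A 10-ns cache backed by 100-ns main memory with a 95% hit ratio:
# 10*0.95 + 100*0.05 = 14.5 ns average latency.
print(effective_latency(10.0, 100.0, 0.95))   # 14.5
```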
88.3 Cache Memories
The basic unit of construction of a semiconductor memory system is a module or bank. A memory bank,
constructed from several memory chips, can service a single request at a time. The time that a bank is busy
servicing a request is called the bank busy time. The bank busy time limits the bandwidth of a memory bank.
FIGURE 88.2 A memory hierarchy.
Both caches and main memories are constructed in this fashion, although caches have significantly shorter
bank busy times than do main memory banks.
The hardware can dynamically allocate parts of the cache memory for addresses deemed most likely to be
accessed soon. The cache contains only redundant copies of the address space. The cache memory is associative,
or content-addressable. In an associative memory, the address of a memory location is stored, along with its
content. Rather than reading data directly from a memory location, the cache is given an address and responds
by providing data which may or may not be the data requested. When a cache miss occurs, the memory access
is then performed with respect to the backing storage, and the cache is updated to include the new data.
The cache is intended to hold the most active portions of the memory, and the hardware dynamically selects
portions of main memory to store in the cache. When the cache is full, bringing in new data must be matched
by deleting old data. Thus a strategy for cache management is necessary. Cache management strategies exploit
the principle of locality. Spatial locality is exploited by the choice of what is brought into the cache. Temporal
locality is exploited in the choice of which block is removed. When a cache miss occurs, hardware copies a
large, contiguous block of memory into the cache, which includes the word requested. This fixed-size region
of memory, known as a cache line or block, may be as small as a single word, or up to several hundred bytes.
A block is a set of contiguous memory locations, the number of which is usually a power of two. A block is
said to be aligned if the lowest address in the block is exactly divisible by the block size. That is to say, for a
block of size B beginning at location A, the block is aligned if
A modulo B = 0 (88.3)
Conventional caches require that all blocks be aligned.
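The alignment test of Eq. (88.3) amounts to a single modulo operation, or a bit mask when the block size is a power of two, as the following sketch shows with illustrative values.

```python
# Eq. (88.3): a block of size B starting at address A is aligned if A mod B == 0.
# Block sizes are normally powers of two, so the test is also (A & (B - 1)) == 0.

def is_aligned(address, block_size):
    return address % block_size == 0

print(is_aligned(0x1000, 64))   # True:  0x1000 mod 64 == 0
print(is_aligned(0x1010, 64))   # False: 0x1010 mod 64 == 16
```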
When a block is brought into the cache, it is likely that another block must be evicted. The selection of the
evicted block is based on some attempt to capture temporal locality. Since prescience is so difficult to achieve,
other methods are generally used to predict future memory accesses. A least-recently-used (LRU) policy is often
the basis for the choice. Other replacement policies are sometimes used, particularly because true LRU replace-
ment requires extensive logic and bookkeeping.
The cache often comprises two conventional memories: the data memory and the tag memory, shown in
Fig. 88.3. The address of each cache line contained in the data memory is stored in the tag memory, as well as
other information (state information), particularly the fact that a valid cache line is present. The state also
keeps track of which cache lines the processor has modified. Each line contained in the data memory is allocated
a corresponding entry in the tag memory to indicate the full address of the cache line.
FIGURE 88.3 Components of a cache memory. (Source: Adapted from M. D. Hill, “A case for direct-mapped caches,” IEEE
Computer, 21(12), 27, 1988.)
The requirement that the cache memory be associative (content-addressable) complicates the design.
Addressing data by content is inherently more complicated than by its address. All the tags must be compared
concurrently, of course, because the whole point of the cache is to achieve low latency. The cache can be made
simpler, however, by introducing a mapping of memory locations to cache cells. This mapping limits the number
of possible cells in which a particular line may reside. The extreme case is known as direct mapping, in which
each memory location is mapped to a single location in the cache. Direct mapping makes many aspects of the
design simpler, since there is no choice of where the line might reside, and no choice as to which line must be
replaced. However, direct mapping can result in poor utilization of the cache when two memory locations are
alternately accessed and must share a single cache cell.
A hashing algorithm is used to determine the cache address from the memory address. The conventional
mapping algorithm consists of a function of the form
A_cache = (A_memory mod cache size) / (cache line size)    (88.4)
where A_cache is the address within the cache for main memory location A_memory, cache size is the capacity of the cache in addressable units (usually bytes), and cache line size is the size of the cache line in addressable units. Since the hashing function is simple bit selection, the tag memory need only contain the part of the address not implied by the hashing function. That is,

A_tag = A_memory div size of cache    (88.5)

where A_tag is stored in the tag memory and div is the integer divide operation. In testing for a match, the
complete address of a line stored in the cache can be inferred from the tag and its storage location within the
cache.
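For the usual power-of-two sizes, Eqs. (88.4) and (88.5) reduce to selecting bit fields of the memory address. The sketch below assumes an 8-Kbyte direct-mapped cache with 32-byte lines; the sizes are illustrative only.

```python
# Direct-mapped index (Eq. 88.4) and tag (Eq. 88.5) computed by simple
# arithmetic; with power-of-two sizes these are just bit selections.
# The cache and line sizes here are assumptions for illustration.

CACHE_SIZE = 8 * 1024      # 8-Kbyte cache
LINE_SIZE = 32             # 32-byte cache lines

def cache_index(a_memory):
    """Line index within the cache (Eq. 88.4)."""
    return (a_memory % CACHE_SIZE) // LINE_SIZE

def cache_tag(a_memory):
    """Tag stored with the line (Eq. 88.5)."""
    return a_memory // CACHE_SIZE

addr = 0x12345
print(cache_tag(addr), cache_index(addr), addr % LINE_SIZE)   # 9 26 5

# The tag, the line's position in the cache, and the byte offset together
# reconstruct the full memory address, as the text notes.
assert (cache_tag(addr) * CACHE_SIZE
        + cache_index(addr) * LINE_SIZE
        + addr % LINE_SIZE) == addr
```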
A two-way set-associative cache maps each memory location into either of two locations in the cache and
can be constructed essentially as two identical direct-mapped caches. However, both caches must be searched
at each memory access and the appropriate data selected and multiplexed on a tag match (hit). On a miss, a
choice must be made between the two possible cache lines as to which is to be replaced. A single LRU bit can
be saved for each such pair of lines to remember which line has been accessed more recently. This bit must be
updated each time either of the cache lines is accessed.
In the same way, an M-way associative cache maps each memory location into any of M memory locations
in the cache and can be constructed from M identical direct-mapped caches. The problem of maintaining the
LRU ordering of M cache lines quickly becomes hard, however, since there are M! possible orderings, and
therefore it takes at least
⌈log₂(M!)⌉    (88.6)
bits to store the ordering. In practice, this requirement limits true LRU replacement to 3- or 4-way set
associativity.
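Evaluating Eq. (88.6) for a few associativities shows why true LRU bookkeeping becomes impractical beyond small set sizes.

```python
# Bits per set needed to hold a full LRU ordering, per Eq. (88.6).
import math

for m in (2, 3, 4, 8, 16):
    bits = math.ceil(math.log2(math.factorial(m)))
    print(f"{m}-way set: {bits} bits for true LRU")
# 2-way: 1, 3-way: 3, 4-way: 5, 8-way: 16, 16-way: 45
```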
Figure 88.4 shows how a cache is organized into sets, blocks, and words. The cache shown is a 2-Kbyte,
4-way set-associative cache, with 16 sets. Each set consists of four blocks. The cache block size in this example
is 32 bytes, so each block contains eight 4-byte words. Also depicted at the bottom of Fig. 88.4 is a 4-way
interleaved main memory system (see the next section for details). Each successive word in the cache block
maps into a different main memory bank. Because of the cache’s mapping restrictions, each cache block obtained
from main memory will be loaded into its corresponding set, but may appear anywhere within that set.
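The mapping in Fig. 88.4 can be expressed as a decomposition of the address into tag, set index, and byte offset; the sketch below follows the 2-Kbyte, 4-way, 32-byte-block parameters given above.

```python
# Address decomposition for the cache of Fig. 88.4: 32-byte blocks (5 offset
# bits) and 16 sets (4 index bits); the remaining high-order bits form the tag.

BLOCK_SIZE = 32
NUM_SETS = 16

def decompose(addr):
    offset = addr % BLOCK_SIZE
    set_index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, set_index, offset

# Addresses 512 bytes apart (16 sets x 32 bytes) map to the same set and can
# coexist only because each set holds four blocks.
print(decompose(0x0400))   # (2, 0, 0)
print(decompose(0x0600))   # (3, 0, 0)
```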
Write operations require special handling in the cache. If the main memory copy is updated with each write
operation—a technique known as write-through or store-through—the writes may force operations to stall while
the write operations are completing. This can happen after a series of write operations even if the processor is
allowed to proceed before the write to the memory has completed. If the main memory copy is not updated
with each write operation—a technique known as write-back or copy-back or deferred writes—the main memory
locations become stale, that is, memory no longer contains the correct values and must not be relied upon to
provide data. This is generally permissible, but care must be exercised to make sure that main memory is updated
before the line is purged from the cache and that the cache is never bypassed. Such a bypass could occur with
DMA (direct memory access), in which the I/O system writes directly into main memory without the involvement
of the processor.
Even for a system that implements write-through, care must be exercised if memory requests bypass the
cache. While the main memory is never stale, a write that bypasses the cache, such as from I/O, could have the
effect of making the cached copy stale. A later access by the CPU could then provide an incorrect value. This
can only be avoided by making sure that cached entries are invalidated even if the cache is bypassed. The
problem is relatively easy to solve for a single processor with I/O, but becomes very difficult to solve for multiple
processors, particularly so if multiple caches are involved as well. This is known in general as the cache coherence
or consistency problem.
The cache exploits spatial locality by loading an entire cache line after a miss. This tends to result in bursty
traffic to the main memory, since most accesses are filtered out by the cache. After a miss, however, the memory
system must provide an entire line at once. Cache memory nicely complements an interleaved, high-bandwidth
main memory (described in the next section), since a cache line can be interleaved across many banks in a
regular manner, thus avoiding memory conflicts, and thus can be loaded rapidly into the cache. The example
main memory shown in Fig. 88.4 can provide an entire cache line with two parallel memory accesses.
Conventional caches traditionally could not accept requests while they were servicing a miss request. In other
words, they locked up or blocked when servicing a miss. The growing penalty for cache misses has made it
necessary for high-end commodity memory systems to continue to accept (and service) requests from the
processor while a miss is being serviced. Some systems are able to service multiple miss requests simultaneously.
To allow this mode of operation, the cache design must be lockup-free or non-blocking [Kroft, 1981]. Lockup-free
caches have one structure for each simultaneous outstanding miss that they can service. This structure holds
the information necessary to correctly return the loaded data to the processor, even if the misses come back in
a different order than that in which they were sent.
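The per-miss structures in a lockup-free cache are commonly called miss status holding registers (MSHRs). The following sketch is a greatly simplified, illustrative model of such a table; the class and field names are assumptions, not a description of Kroft's design.

```python
# A toy model of the per-miss bookkeeping in a lockup-free cache. Each entry
# records which block is outstanding and which processor requests are waiting
# for it, so data can be returned correctly even if fills arrive out of order.

from dataclasses import dataclass, field

@dataclass
class MissEntry:
    block_address: int
    waiting_requests: list = field(default_factory=list)   # (word offset, destination)

class LockupFreeMissTable:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}                  # block_address -> MissEntry

    def record_miss(self, block_address, word_offset, dest_reg):
        """Return True if the miss is accepted; False means the cache must stall."""
        entry = self.entries.get(block_address)
        if entry is None:
            if len(self.entries) == self.num_entries:
                return False               # no free entry: block further misses
            entry = self.entries[block_address] = MissEntry(block_address)
        entry.waiting_requests.append((word_offset, dest_reg))
        return True

    def fill(self, block_address):
        """Called when the block returns from memory, possibly out of order."""
        return self.entries.pop(block_address).waiting_requests

mshrs = LockupFreeMissTable(num_entries=4)
mshrs.record_miss(0x40, 8, "r3")
mshrs.record_miss(0x40, 12, "r4")          # second miss to the same block is merged
print(mshrs.fill(0x40))                    # [(8, 'r3'), (12, 'r4')]
```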
Two factors drive the existence of multiple levels of cache memory in the memory hierarchy: access times
and a limited number of transistors on the CPU chip. Larger banks with greater capacity are slower than smaller
banks. If the time needed to access the cache limits the clock frequency of the CPU, then the first-level cache
size may need to be constrained. Much of the benefit of a large cache may be obtained by placing a small first-
level cache above a larger second-level cache; the first is accessed quickly and the second holds more data close
to the processor. Since many modern CPUs have caches on the CPU chip itself, the size of the cache is limited
by the CPU silicon real-estate. Some CPU designers have assumed that system designers will add large off-chip
caches to the one or two levels of caches on the processor chip. The complexity of this part of the memory
hierarchy may continue to grow as main memory access penalties continue to increase.
FIGURE 88.4 Organization of a cache.
Caches that appear on the CPU chip are manufactured by the CPU vendor. Off-chip caches, however, are a
commodity part sold in large volume. An incomplete list of major cache manufacturers is Hitachi, IBM Micro,
Micron, Motorola, NEC, Samsung, SGS-Thomson, Sony, and Toshiba. Although most personal computers and
all major workstations now contain caches, very high-end machines (such as multi-million dollar supercom-
puters) do not usually have caches. These ultra-expensive computers can afford to implement their main
memory in a comparatively fast semiconductor technology such as static RAM (SRAM), and can afford so
many banks that cacheless bandwidth out of the main memory system is sufficient. Massively parallel processors
(MPPs), however, are often constructed out of workstation-like nodes to reduce cost. MPPs therefore contain
cache hierarchies similar to those found in the workstations on which the nodes of the MPPs are based.
Cache sizes have been steadily increasing on personal computers and workstations. Intel Pentium-based
personal computers come with 8 Kbyte each of instruction and data caches. Two of the Pentium chip sets,
manufactured by Intel and OPTi, allow level-two caches ranging from 256 to 512 Kbyte and 64 Kbyte to 2 Mbyte,
respectively. The newer Pentium Pro systems also have 8 Kbyte, first-level instruction and data caches, but they
also have either a 256 Kbyte or a 512 Kbyte second-level cache on the same module as the processor chip.
Higher-end workstations—such as DEC Alpha 21164-based systems—are configured with substantially more
cache. The 21164 also has 8 Kbyte, first-level instruction and data caches. Its second-level cache is entirely on-
chip, and is 96 Kbyte. The third-level cache is off-chip, and can have a size ranging from 1 to 64 Mbyte.
For all desktop machines, cache sizes are likely to continue to grow—although the rate of growth compared
to processor speed increases and main memory size increases is unclear.
88.4 Parallel and Interleaved Memories
Main memories are composed of a number of semiconductor memory chips. As in a cache, several of these chips
form a bank. Multiple memory banks can be connected together to form an interleaved (or parallel) memory
system. Since each bank can service a request, an interleaved memory system with K banks can service K requests
simultaneously, increasing the peak bandwidth of the memory system to K times the bandwidth of a single
bank. In most interleaved memory systems, the number of banks is a power of two, that is, K = 2^k. An n-bit
memory word address is broken into two parts: a k-bit bank number and an m-bit address of a word within a
bank. Though the k bits used to select a bank number could be any k bits of the n-bit word address, typical
interleaved memory systems use the low-order k address bits to select the bank number; the higher order m =
n – k bits of the word address are used to access a word in the selected bank. The reason for using the low-
order k bits will be discussed shortly. An interleaved memory system which uses the low-order k bits to select
the bank is referred to as a low-order or a standard interleaved memory.
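With low-order (standard) interleaving, the bank number and the within-bank address are simple functions of the word address, as the sketch below illustrates for an assumed K = 8 banks.

```python
# Low-order interleaving: with K = 2**k banks, the low-order k bits of the
# word address select the bank and the remaining high-order bits address a
# word within that bank. K = 8 is an illustrative choice.

K = 8                                    # number of banks (a power of two)

def split_address(word_address):
    bank = word_address % K              # low-order k bits
    within_bank = word_address // K      # high-order m = n - k bits
    return bank, within_bank

# Consecutive word addresses fall in consecutive banks, so a block of
# contiguous words can be read from all banks in parallel.
for a in range(10):
    print(a, split_address(a))
```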
There are two ways of connecting multiple memory banks: simple interleaving and complex interleaving.
Sometimes simple interleaving is also referred to as interleaving, and complex interleaving as banking.
Figure 88.5 shows the structure of a simple interleaved memory system. m address bits are simultaneously
supplied to every memory bank. All banks are also connected to the same read/write control line (not shown
in Fig. 88.5). For a read operation, the banks start the read operation and deposit the data in their latches. Data
can then be read from the latches, one by one, by setting the switch appropriately. Meanwhile, the banks could
be accessed again, to carry out another read or write operation. For a write operation, the latches are loaded,
one by one. When all the latches have been written, their contents can be written into the memory banks by
supplying m bits of address (they will be written into the same word in each of the different banks). In a simple
interleaved memory, all banks are cycled at the same time; each bank starts and completes its individual
operations at the same time as every other bank; a new memory cycle can start (for all banks) once the previous
cycle is complete. Timing details of the accesses can be found in The Architecture of Pipelined Computers, [Kogge,
1981].
One use of a simple interleaved memory system is to back up a cache memory. To do so, the memory must
be able to read blocks of contiguous words (a cache block) and supply them to the cache. If the low-order k
bits of the address are used to select the bank number, then consecutive words of the block reside in different
banks, and they can all be read in parallel, and supplied to the cache one by one. If some other address bits
are used for bank selection, then multiple words from the block might fall in the same memory bank, requiring
multiple accesses to the same bank to fetch the block.
Figure 88.6 shows the structure of a complex interleaved memory system. In such a system, each bank is set
up to operate on its own, independent of the operation of the other banks. For example, bank 1 could carry
out a read operation on a particular memory address, and bank 2 carries out a write operation on a completely
unrelated memory address. (Contrast this with the operation in a simple interleaved memory where all banks
are carrying out the same operation, read or write, and the locations accessed within each bank represent a
contiguous block of memory.) Complex interleaving is accomplished by providing an address latch and a
read/write command line for each bank. The memory controller handles the overall operation of the interleaved
memory. The processing unit submits the memory request to the memory controller, which determines which
bank needs to be accessed. The controller then determines if the bank is busy (by monitoring a busy line for
FIGURE 88.5 A simple interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined
Computers, 1st ed., New York: McGraw-Hill, 1981, p. 41.)
FIGURE 88.6 A complex interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined
Computers, 1st ed., New York: McGraw-Hill, 1981, p. 42.)
each bank). The controller holds the request if the bank is busy, submitting it later when the bank is available
to accept the request. When the bank responds to a read request, the switch is set by the controller to accept
the request from the bank and forward it to the processing unit. Details of the timing of accesses can be found
in The Architecture of Pipelined Computers [Kogge, 1981].
A typical use of a complex interleaved memory system is in a vector processor. In a vector processor, the
processing units operate on a vector, for example a portion of a row or a column of a matrix. If consecutive
elements of a vector are present in different memory banks, then the memory system can sustain a bandwidth
of one element per clock cycle. By arranging the data suitably in memory and using standard interleaving (for
example, storing the matrix in row-major order will place consecutive elements in consecutive memory banks),
the vector can be accessed at the rate of one element per clock cycle as long as the number of banks is greater
than the bank busy time (measured in clock cycles).
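A rough simulation makes the bank-conflict argument concrete. The sketch below issues one vector-element request per cycle and holds each bank busy for a fixed number of cycles; the parameter values are assumptions for illustration.

```python
# Sustained vector access rate with low-order interleaving. One request is
# issued per cycle, but a request to a busy bank must wait for it to free up.
# Bank count, busy time, and strides below are illustrative assumptions.

def cycles_to_read(num_elements, num_banks, bank_busy_time, stride=1):
    bank_free_at = [0] * num_banks       # cycle at which each bank is next free
    cycle = 0
    for i in range(num_elements):
        bank = (i * stride) % num_banks
        cycle = max(cycle + 1, bank_free_at[bank] + 1)   # stall if the bank is busy
        bank_free_at[bank] = cycle + bank_busy_time - 1
    return cycle

# 8 banks, 6-cycle bank busy time:
print(cycles_to_read(1024, 8, 6, stride=1))   # 1024 cycles: one element per cycle
print(cycles_to_read(1024, 8, 6, stride=8))   # 6139 cycles: every access hits one bank
```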
Memory systems that are built for current machines vary widely, the price and purpose of the machine being
the main determinant of the memory system design. The actual memory chips, which are the components of
the memory systems, are generally commodity parts built by a number of manufacturers. The major commodity
DRAM manufacturers include (but certainly are not limited to) Hitachi, Fujitsu, LG Semicon, NEC, Oki,
Samsung, Texas Instruments, and Toshiba.
The low-end of the price/performance spectrum is the personal computer, presently typified by Intel Pentium
systems. Three of the manufacturers of Pentium-compatible chip sets (which include the memory controllers)
are Intel, OPTi, and VLSI Technologies. Their controllers provide for memory systems that are simply inter-
leaved, all with minimum bank depths of 256 Kbyte, and maximum system sizes of 192 Mbyte, 128 Mbyte,
and 1 Gbyte, respectively.
Both higher-end personal computers and workstations tend to have more main memory than the lower-end
systems, although they usually have similar upper limits. Two examples of such systems are workstations built
with the DEC Alpha 21164, and servers built with the Intel Pentium Pro. The Alpha systems, using the 21171
chip set, are limited to 128 Mbyte of main memory using 16 Mbit DRAMs, although they will be expandable
to 512 Mbyte when 64-Mbit DRAMs are available. Their memory systems are 8-way simply interleaved,
providing 128 bits per DRAM access. The Pentium Pro systems support slightly different features. The 82450KX
and 82450GX chip sets include memory controllers that allow reads to bypass writes (performing writes when
the memory banks are idle). These controllers can also buffer eight outstanding requests simultaneously. The
82450KX controller permits 1- or 2-way interleaving, and up to 256 Mbyte of memory when 16-Mbit DRAMs
are used. The 82450GX chip set is more aggressive, allowing up to four separate (complex-interleaved) memory
controllers, each of which can be up to 4-way interleaved and have up to 1 Gbyte of memory (again with
16 Mbit DRAMs).
Interleaved memory systems found in high-end vector supercomputers are slight variants on the basic complex
interleaved memory system of Fig. 88.6. Such memory systems may have hundreds of banks, with multiple
memory controllers that allow multiple independent memory requests to be made every clock cycle. Two
examples of modern vector supercomputers are the Cray T-90 series and the NEC SX series. The Cray T-90
models come with varying numbers of processors—up to 32 in the largest configuration. Each of these
processors is coupled with 256 Mbyte of memory, split into 16 banks of 16 Mbyte each. The T-90 has complex
interleaving among banks. The largest configuration (the T-932) has 32 processors, for a total of 512 banks and
8 Gbyte of main memory. The T-932 can provide a peak of 800 Gbyte/s bandwidth out of its memory system.
NEC’s SX-4 product line, their most recent vector supercomputer series, has numerous models. Their largest
single-node model (with one processor per node) contains 32 processors, with a maximum of 8 Gbyte of
memory, and a peak bandwidth of 512 Gbyte/s out of main memory. Although the sizes of the memory systems
are vastly different between workstations and vector machines, the techniques that both use to increase total
bandwidth and minimize bank conflicts are similar.
88.5 Virtual Memory
Cache memory contains portions of the main memory in dynamically allocated cache lines. Since the data
portion of the cache memory is itself a conventional memory, each line present in the cache has two addresses
associated with it: its main memory address and its cache address. Thus, the main memory address of a word
can be divorced from a particular storage location and abstractly thought of as an element in the address space.
The use of a two-level hierarchy—consisting of main memory and a slower, larger disk storage device—evolved
by making a clear distinction between the address space and the locations in memory. An address generated
during the execution of a program is known as a virtual address, which must be translated to a physical address
before it can be accessed in main memory. The total address space is simply an abstraction.
A virtual memory address is mapped to a physical address, which indicates the location in main memory
where the data actually reside [Denning, 1970]. The mapping is maintained through a structure called the page
table, which is maintained in software by the operating system. Like the tag memory of a cache memory, the
page table is accessed through a virtual address to determine the physical (main memory) address of the entry.
Unlike the tag memory, however, the table is usually sorted by virtual addresses, making the translation process
a simple matter of an extra memory access to determine the real physical address of the desired item. A system
maintaining the page table in a way analogous to a cache tag memory is said to have inverted page tables. In
addition to the real address mapped to a virtual page, and an indication of whether the page is present at all,
a page table entry often contains other information. For example, the page table may contain the location on
the disk where each block of data is stored when not present in main memory.
The virtual memory can be thought of as a collection of blocks. These blocks are often aligned and of fixed
size, in which case they are known as pages. Pages are the unit of transfer between the disk and main memory,
and are generally larger than a cache line—usually thousands of bytes. A typical page size for machines in 1997
is 4 Kbyte. A page’s virtual address can be broken into two parts, a virtual page number and an offset. The
page number specifies the page to be accessed, and the page offset indicates the distance from the beginning
of the page to the indicated address.
A physical address can also be broken into two parts, a physical page number (also called a page frame
number) and an offset. This mapping is done at the level of pages, so the page table can be indexed by means
of the virtual page number. The page frame number is contained in the page table and is read out during the
translation, along with other information about the page. In most implementations the page offset is the same
for a virtual address and the physical address to which it is mapped.
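A minimal sketch of the translation step follows, with the page table modeled as a simple lookup keyed by virtual page number (a real page table also carries presence, protection, and disk-location information); the mappings shown are invented values.

```python
# Virtual-to-physical translation at page granularity. The page table is a
# plain dictionary here, and the entries are invented for illustration.

PAGE_SIZE = 4 * 1024                  # 4-Kbyte pages, as in the text

page_table = {                        # virtual page number -> page frame number
    0x12: 0x400,
    0x13: 0x9A1,
}

def translate(virtual_address):
    vpn = virtual_address // PAGE_SIZE
    offset = virtual_address % PAGE_SIZE      # unchanged by translation
    if vpn not in page_table:
        raise LookupError("page fault: the operating system must bring the page in")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x12345)))        # 0x400345: new page frame, same offset
```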
The virtual memory hierarchy is different from the cache/main memory hierarchy in a number of respects,
resulting primarily from the fact that there is a much greater difference in latency between accesses to the disk
and the main memory. While a typical latency ratio for cache and main memory is one order of magnitude
(main memory has a latency ten times larger than the cache), the latency ratio between disk and main memory
is often four orders of magnitude or more. This large ratio exists because the disk is a mechanical device—with
a latency partially determined by velocity and inertia—whereas main memory is limited only by electronic and
energy constraints. Because of the much larger penalty for a page miss, many design decisions are affected by
the need to minimize the frequency of misses. When a miss does occur, the processor could be idle for a period
during which it could execute tens of thousands of instructions. Rather than stall during this time, as may
occur upon a cache miss, the processor invokes the operating system and may switch to a different task. Because
the operating system is being invoked anyway, it is convenient to rely on the operating system to set up and
maintain the page table, unlike cache memory, where it is done entirely in hardware. The fact that this accounting
occurs in the operating system enables the system to use virtual memory to enforce protection on the memory.
This ensures that no program can corrupt the data in memory that belong to any other program.
Hardware support provided for a virtual memory system generally includes the ability to translate the virtual
addresses provided by the processor into the physical addresses needed to access main memory. Thus, only on
a virtual address miss is the operating system invoked. An important aspect of a computer implementing virtual
memory, however, is the necessity of freezing the processor at the point where a miss occurs, servicing the page
table fault, and later returning to continue the execution as if no page fault had occurred. This requirement
means either that it must be possible to halt execution at any point—including possibly in the middle of a
complex instruction—or it must be possible to guarantee that all memory accesses will be to pages resident in
main memory.
As described above, virtual memory requires two memory accesses to fetch a single entry from memory, one
into the page table to map the virtual address into the physical address, and the second to fetch the actual data.
This process can be sped up in a variety of ways. First, a special-purpose cache memory to store the active
portion of the page table can be used to speed up the first access. This special-purpose cache is usually called
a translation lookaside buffer (TLB). Second, if the system also employs a cache memory, it may be possible to
overlap the access of the cache memory with the access to the TLB, ideally allowing the requested item to be
accessed in a single cache access time. The two accesses can be fully overlapped if the virtual address supplies
sufficient information to fetch the data from the cache before the virtual-to-physical address translation has
been accomplished. This is true for an M-way set-associative cache of capacity C if the following relationship
holds:
page size ≥ C / M    (88.7)
For such a cache, the index into the cache can be determined strictly from the page offset. Since the virtual
page offset is identical to the physical page offset, no translation is necessary, and the cache can be accessed
concurrently with the TLB. The physical address must be obtained before the tag can be compared.
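The condition of Eq. (88.7) is a one-line check: the cache index fits entirely within the page offset when the capacity divided by the associativity does not exceed the page size. The sizes in the sketch are examples only.

```python
# Eq. (88.7): cache and TLB lookups can be fully overlapped when
# capacity / associativity <= page size (assumed 4 Kbytes here).

PAGE_SIZE = 4 * 1024

def index_fits_in_page_offset(capacity, associativity):
    return capacity / associativity <= PAGE_SIZE

print(index_fits_in_page_offset(8 * 1024, 2))    # True:  8K / 2  = 4K <= 4K
print(index_fits_in_page_offset(32 * 1024, 4))   # False: 32K / 4 = 8K >  4K
```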
An alternative method applicable to a system containing both virtual memory and a cache is to store the
virtual address in the tag memory instead of the physical address. This technique introduces consistency
problems in virtual memory systems that permit more than a single address space or allow a single physical
page to be mapped to more than a single virtual page. This problem is known as the aliasing problem.
88.6 Research Issues
Research is occurring on all levels of the memory hierarchy. At the register level, researchers are exploring
techniques to provide more registers than are architecturally visible to the compiler. A large volume of work
exists (and is occurring) for cache optimizations and alternate cache organizations. For instance, modern
processors now commonly split the top level of the cache into separate physical caches, one for instructions
(code) and one for program data. Due to the increasing cost of cache misses (in terms of processor cycles),
some research trades off increased cache complexity for a reduced miss rate. Two examples of cache research from opposite ends of the hardware/software spectrum are blocking [Lam, 1991] and skewed-associative
caches [Seznec, 1993]. Blocking is a software technique in which the programmer or compiler reorganizes
algorithms to work on subsets of data that are smaller than the cache, instead of streaming entire large data
structures repeatedly through the cache. This reorganization greatly improves temporal locality. The skewed-
associative cache is one example of a host of hardware techniques that map blocks into the cache differently,
with the goal of reducing misses from conflicts within a set. In skewed-associative caches, either one of two
FIGURE 88.7 Virtual-to-real address translation.
hashing functions may determine where a block should be placed in the cache, as opposed to just the one
hashing function (low-order index bits) that traditional caches use. An important cache-related research topic
is prefetching [Mowry, 1992], in which the processor issues requests for data well before the data are actually
needed. Speculative prefetching is also a current research topic. In speculative prefetching, prefetches are issued
based on guesses as to which data will be needed soon. Other cache-related research examines placing special
structures in parallel with the cache, trying to optimize for workloads that do not lend themselves well to
caches. Stream buffers [Jouppi, 1990] are one such example. A stream buffer automatically detects when a
linear access through a data structure is occurring. The stream buffer issues multiple sequential prefetches upon
detection of a linear array access.
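The detection logic of a stream buffer can be caricatured in a few lines: watch for consecutive block addresses and, once a short run is seen, issue prefetches for the next several blocks. The thresholds and block size below are assumptions, not Jouppi's parameters.

```python
# A toy unit-stride stream detector in the spirit of a stream buffer: after a
# short run of consecutive block accesses, prefetch the next few blocks.
# Block size, depth, and detection threshold are illustrative assumptions.

class StreamBuffer:
    def __init__(self, block_size=32, depth=4, detect_after=2):
        self.block_size = block_size
        self.depth = depth                 # blocks to prefetch ahead
        self.detect_after = detect_after   # consecutive blocks before prefetching
        self.last_block = None
        self.run_length = 0

    def access(self, address):
        block = address // self.block_size
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 0
        self.last_block = block
        if self.run_length >= self.detect_after:
            return [(block + i) * self.block_size for i in range(1, self.depth + 1)]
        return []                          # no stream detected yet

sb = StreamBuffer()
for addr in range(0, 256, 32):             # a linear walk through memory
    prefetches = sb.access(addr)
    if prefetches:
        print(hex(addr), "->", [hex(p) for p in prefetches])
```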
Much of the ongoing research on main memory involves improving the bandwidth from the memory system
without greatly increasing the number of banks. Multiple banks are expensive, particularly with the large and
growing capacity of modern DRAM chips. Rambus [Rambus Inc., 1992] and Ramlink [IEEE Computer Society,
1993] are two such examples.
Research issues associated with improving the performance of the virtual memory system fall under the
domain of operating system research. One proposed strategy for reducing page faults allows each running
program to specify its own page replacement algorithm, enabling each program to optimize the choice of page
replacements based on its reference pattern [Engler et al., 1995]. Other recent research focuses on improving
the performance of the TLB. Two techniques for doing this are the use of a two-level TLB (the motivation is
similar to that for a two-level cache), and the use of superpages [Talluri, 1994]. With superpages, each TLB
entry may represent a mapping for more than one consecutive page, thus increasing the total address range
that a fixed number of TLB entries may cover.
Summary
A computer’s memory system is the repository for all the information that the CPU uses and produces. A
perfect memory system is one that can immediately supply any datum that the CPU requests. This ideal memory
is not implementable, however, as the three factors of memory capacity, speed, and cost are directly in opposition.
By staging smaller, faster memories in front of larger, slower, and cheaper memories, the performance of the
memory system may approach that of a perfect memory system—at a reasonable cost. The memory hierarchies
of modern general-purpose computers generally contain registers at the top, followed by one or more levels of
cache memory, main memory, and virtual memory on a magnetic or optical disk.
Performance of a memory system is measured in terms of latency and bandwidth. The latency of a memory
request is how long it takes the memory system to produce the result of the request. The bandwidth of a memory
system is the rate at which the memory system can accept requests and produce results. The memory hierarchy
improves average latency by quickly returning results that are found in the higher levels of the hierarchy. The
memory hierarchy generally reduces bandwidth requirements by intercepting a fraction of the memory requests
at higher levels of the hierarchy. Some machines—such as high-performance vector machines—may have fewer
levels in the hierarchy, increasing memory cost for better predictability and performance. Some of these
machines contain no caches at all, relying on large arrays of main memory banks to supply very high bandwidth,
with pipelined accesses to operands that mitigate the adverse performance impact of long latencies.
Cache memories are a general solution for improving the performance of a memory system. Although caches
are smaller than typical main memory sizes, they ideally contain the most frequently accessed portions of main
memory. By keeping the most heavily used data near the CPU, caches can service a large fraction of the requests
without accessing main memory (the fraction serviced is called the hit rate). Caches assume locality of reference
to work well transparently—they assume that accessed memory words will be accessed again quickly (temporal
locality), and that memory words adjacent to an accessed word will be accessed soon after the access in question
(spatial locality). When the CPU issues a request for a datum not in the cache (a cache miss), the cache loads
that datum and some number of adjacent data (a cache block) into itself from main memory.
To reduce cache misses, some caches are associative—a cache may place a given block in one of several places,
collectively called a set. This set is content-addressable: an address tag is coupled with each block, and a block is located by comparing the requested address against the tags in the set. When a new block is brought into a set and the set is full, the
cache’s replacement policy dictates which of the old blocks should be removed from the cache to make room
for the new block. Most caches use an approximation of least-recently-used (LRU) replacement, in which the
block whose most recent access is farthest in the past is the one that the cache replaces.
Main memory, or backing store, consists of banks of dense semiconductor memory. Since each memory chip
has a small off-chip bandwidth, rows of these chips are placed together to form a bank, and multiple banks are
used to increase the total bandwidth from main memory. When a bank is accessed, it remains busy for a period
of time, during which the processor may make no other accesses to that bank. By increasing the number of
interleaved (parallel) banks, the chance that the processor issues two conflicting requests to the same bank is
reduced.
Systems generally require a greater number of memory locations than are available in the main memory
(i.e., a larger address space). The entire address space that the CPU uses is kept on large magnetic or optical
disks; this is called the virtual address space, or virtual memory. The most frequently used sections of the virtual
memory are kept in main memory (physical memory), and are moved back and forth in units called pages.
The place at which a virtual address lies in main memory is called its physical address. Since a much larger
address space (virtual memory) is mapped onto a much smaller one (physical memory), the CPU must translate
the memory addresses issued by a program (virtual addresses) into their corresponding locations in physical
memory (physical addresses). This mapping is maintained in a memory structure called the page table. When
the CPU attempts to access a virtual address that does not have a corresponding entry in physical memory, a
page fault occurs. Since a page fault requires an access to a slow mechanical storage device (such as a disk), the
CPU usually switches to a different task while the needed page is read from the disk.
Every memory request issued by the CPU requires an address translation, which in turn requires an access
to the page table stored in memory. A translation lookaside buffer (TLB) is used to reduce the number of page
table lookups. The most frequent virtual-to-physical mappings are kept in the TLB, which is a small associative
memory tightly coupled with the CPU. If the needed mapping is found in the TLB, the translation is performed
quickly and no access to the page table needs to be made. Virtual memory allows systems to run larger or more
programs than are able to fit in main memory, enhancing the capabilities of the system.
Defining Terms
Bandwidth: The rate at which the memory system can service requests.
Cache memory: A small, fast, redundant memory used to store the most frequently accessed parts of the
main memory.
Interleaving: Technique for connecting multiple memory modules together in order to improve the band-
width of the memory system.
Latency: The time between the initiation of a memory request and its completion.
Memory hierarchy: Successive levels of different types of memory, which attempt to approximate a single
large, fast, and cheap memory structure.
Virtual memory: A memory space implemented by storing the most frequently accessed parts in main
memory and less frequently accessed parts on disk.
Related Topics
80.1 Integrated Circuits (RAM, ROM) • 80.2 Basic Disk System Architectures
References
P. J. Denning, “Virtual memory,” Computing Surveys, vol. 2, no. 3, pp. 153–170, Sept. 1970.
D. R. Engler, M. F. Kaashoek, J. O’Toole, Jr., “Exokernel: An operating system architecture for application-level
resource management,” Proc. 15th Symposium on Operating Systems Principles, pp. 251–266, 1995.
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, San Mateo, Calif.: Morgan
Kaufmann Publishers, 1990.
M. D. Hill, “A case for direct-mapped caches,” IEEE Computer, 21(12), 1988.
IEEE Computer Society, IEEE Standard for High-Bandwidth Memory Interface Based on SCI Signaling Technology
(RamLink), Draft 1.00 IEEE P1596.4-199X, 1993.
N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and
prefetch buffers,” Proc. 17th Annual International Symposium on Computer Architecture, pp. 364–373, 1990.
P. M. Kogge, The Architecture of Pipelined Computers, New York: McGraw-Hill, 1981.
D. Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” Proc. 8th Annual International Sympo-
sium on Computer Architecture, pp. 81–87, 1981.
M. S. Lam, E. E. Rothberg, and M. E. Wolf, “The cache performance and optimizations of blocked algorithms,”
Proc. 4th Annual Symposium on Architectural Support for Programming Languages and Operating Systems,
pp. 63–74, 1991.
T. C. Mowry, M. S. Lam, and A. Gupta, “Design and evaluation of a compiler algorithm for prefetching,” Proc.
5th Annual Symposium on Architectural Support for Programming Languages and Operating Systems,
pp. 62–73, 1992.
Rambus, Inc., Rambus Architectural Overview, Mountain View, Calif.: Rambus, Inc., 1992.
A. Seznec, “A case for two-way skewed-associative caches,” Proc. 20th International Symposium on Computer
Architecture, pp. 169–178, 1993.
A. J. Smith, “Bibliography and readings on CPU cache memories and related topics,” ACM SIGARCH Computer
Architecture News, 14(1), 22–42, 1986.
A. J. Smith, “Second bibliography on cache memories,” in ACM SIGARCH Computer Architecture News, 19(4),
154–182, June 1991.
M. Talluri and M. D. Hill, “Surpassing the TLB performance of superpages with less operating system support,”
Proc. Sixth International Symposium on Architectural Support for Programming Languages and Operating
Systems, pp. 171–182, 1994.
Further Information
Some general information on the design of memory systems is available in High-Speed Memory Systems by
A. V. Pohm and O. P. Agrawal.
Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson contains a detailed
discussion on the interaction between memory systems and computer architecture.
For information on memory system research, the recent proceedings of the International Symposium on
Computer Architecture contain annual research papers in computer architecture, many of which focus on the
memory system. To obtain copies, contact the IEEE Computer Society Press, 10662 Los Vaqueros Circle, P.O. Box
3014, Los Alamitos, CA 90720-1264.