A processor that runs with a 100MHz clock has a cycle time of 10ns. If we assume that an addition can be executed in a single processor cycle, then an ADD instruction that takes one of its operands from main memory might spend 100ns waiting for that operand and only 10ns doing the addition. The overall time required to complete a program would then be determined almost entirely by the memory access time; increasing the processor speed would have very little effect.
Caches are introduced into a system to buffer the mismatch between main memory and processor speeds. A cache is a relatively small, fast memory placed between the processor and the main memory. The cache is designed so that its access time matches the processor cycle time. Thus, if the processor is running with a 100MHz clock the cache should be able to respond to a memory request in approximately 10ns. In the high-performance single-chip processors being built today, the cache memory is actually built on the processor chip and separated into distinct instruction and data caches. The typical size of these caches is 8kb, for a total of 16kb of cache on the processor chip. Many system designs also include an off-chip cache, which is called the second-level cache or the L2 cache:
The L2 cache can be anywhere from 128kb to 4Mb in size. The on-chip cache is called the first-level or primary cache. While the first-level cache must match the processor speed, the second-level cache can be somewhat slower (but not as slow as the main memory!)
When the processor makes a memory request, the request first passes to the primary cache. If the data item is found in this cache, we have a cache hit. If the data item is not found in the primary cache, we have a cache miss and the memory request is forwarded to the L2 cache. If the data item is found in this cache, we have an L2 cache hit and the data is passed back to the primary cache. If the data is not found in the L2 cache, the request is finally forwarded to the main memory. When the main memory responds to the memory request, the data item is passed back to the L2 cache and then the primary cache.
Caches work well because the memory request can usually be serviced by the primary cache. In fact, measurements show that 90% of the time the instruction cache will contain the requested instruction and 85% of the time the data cache will be able to respond to the data request. Thus, the L2 and main memory are rarely accessed.
The reason that so many memory requests can be handled by the primary cache has to do with two aspects of program behavior:
Now consider the data references that might be made in these program loops. There will probably be some references to local variables or loop control variables. As with the instructions of a loop, these variables are repeatedly accessed as the loop is being executed. The data items that vary with each iteration of a loop are often elements of arrays that are processed serially. While each element may not be referenced many times, a reference to one element guarantees that adjacent elements will be referenced.
It makes sense to bring a data item into the cache because temporal locality implies that it will be referenced again soon. On the first reference, a delay will be experienced because the data item must be gotten from the main memory. Once in the cache, subsequent references will result in a faster response. Because of spatial locality, when a data item must be brought into the cache, it makes sense to not only get that item from memory, but to also get data from nearby memory locations.
Spatial locality is embodied in a cache design by grouping sequential bytes of memory into a single cache line:
Information transfer between the cache and the memory is in terms of complete cache lines, rather than individual bytes. Thus if the program needs a particular byte, the entire cache line containing that byte is obtained from the memory. For example, suppose that the cache of Figure 2 was being used and the program fetches the word (two bytes) at location 0004736. If none of the cache lines contain the 16 bytes stored in addresses 0004730 through 000473F, then these 16 bytes are transferred from the memory to one of the cache lines. Because of the spatial locality of the program, we expect that other values in the cache line thus loaded will be referenced in the near future.
Figure 2 shows that the memory is considered to consist of 16-byte blocks, each beginning at an address divisible by 16. Any one of these blocks could be loaded into any one of the 512 lines of the cache. Thus, in order to handle memory requests correctly, the system needs to know which block is loaded into each cache line at each instant of time. Moreover, this information needs to be made available in hardware, because the cache is controlled entirely by hardware.
A 32-bit physical address has been assumed in Figure 2. All of the addresses of bytes in a 16-byte block of memory that begins at an address divisible by 16 have the same bit pattern in the most significant 28 bits, with each byte having a distinct bit pattern in the least significant 4 bit positions. Thus the most significant 28 bits of a memory address can be considered its ``block address''. One possible scheme for specifying the block loaded into each cache line is a fully-associative cache that provides one 28-bit tag register for each line. The block address of the data stored in a line is then put into the tag register associated with that line:
When a memory access is required, the hardware can compare the most significant 28 bits of the physical address against all 512 tag registers. These comparisons occur simultaneously, using 512 comparators. If there is a match from any of these comparators, we have a cache hit. The line that caused the hit is then read out of the cache and the least significant 4 bits of the physical address are used to select the proper byte.
Unfortunately, a fully-associative cache is difficult to build because many comparators are needed and the number of bits in the tag field is large. One approach to simplifying the problem is to limit the set of memory blocks that can be stored in each cache line. This could be done by considering that the memory is made up of 8kb blocks, each beginning at an address divisible by 8192. The first 16-byte block of any one of these 8kb blocks could be loaded into the first line of the cache, the second 16-byte block of any one could be loaded into the second line of the cache, and so on. In that case, we only need to store which 8kb block of the memory the bytes in a particular cache line correspond to; the number of the 16-byte block within the 8kb block is identical to the number of the cache line. A cache using this scheme for keeping track of the memory block corresponding to a cache line is called a direct-mapped cache.
With a direct-mapped cache, the physical address is divided into three fields -- a tag, an index, and a byte offset:
The byte offset is used to select a byte within a cache line. The index is used to select a cache line. Thus the contents of each 16-byte block of main memory can reside in only one line of the cache. When a memory access is required, the hardware selects the appropriate cache line on the basis of the index bits of the physical address, and compares the 19 tag bits to the tag associated with that cache line. If there is a match, we have a cache hit. The indexed cache line is then read out of the cache and the least significant 4 bits of the physical address are used to select the proper byte.
A direct-mapped cache is easy to build because only one comparator is needed and the number of bits in the tag is smaller than that required for the fully-associative cache. Because a particular 16-byte block of main memory can reside in only one cache line, however, there are instances when repeated misses may cause a program to run more slowly than expected. For example, suppose that a program were adding two arrays and storing the result into a third array. If the arrays being added were exactly 512x16 bytes long, then the memory locations for both of the operands and those for the result might have to reside in the same cache line! This situation would lead to a miss on every array access.
A compromise organization, called a set-associative cache, approaches the performance of a fully-associative cache but is much easier to build. The compromise consists of shifting the boundary between the tag and the cache line index in Figure 4 to the right, reducing the number of bits used for a cache line index, without changing the number of lines in the cache.
Suppose that we move the line to the right by one bit position, yielding an 8-bit cache index. The number of cache lines remains the same, so this 8-bit value is not long enough to provide a unique address for every cache line. We could argue, however, that the 8-bit value specifies two cache lines: one in the first set of 256 and the other in the second set. For example, the 8-bit value 00 specifies either the first cache line or the 256th; the 8-bit value FF specifies either the 255th cache line or the last.
With an 8-bit cache index, an access proceeds as follows: The hardware selects two cache lines on the basis of the index bits of the physical address, and simultaneously compares the 20 tag bits of the physical address to the tags associated with each of those cache lines. If there is a match, we have a cache hit, and one of the two lines has been selected. The selected cache line is then read out of the cache and the least significant 4 bits of the physical address are used to select the proper byte.
A cache with this behavior is called a two-way set-associative cache because two cache lines are selected by the cache line index. Moving the line in Figure 4 to the right by two bit positions yields a four-way set-associative cache because four cache lines are selected by the (7-bit) cache line index. If the line is moved far enough to the right to eliminate the cache line index completely, the result is the fully-associative cache shown in Figure 3.
Measurements show that the performance of a four-way set-associative cache closely approaches that of a fully-associative cache of the same size and shape. This is reasonable: In the array computation discussed above, a four-way set-associative cache would allow all three operands plus the instructions to lie in cache lines with the same index bits, and still avoid misses. In this case the simultaneous comparison of the tags requires only four comparators of 21 bits each -- a far cry from the 512 comparators of 28 bits each needed by a fully-associative cache of the same size and shape.
| Instructor | Revision 1.6 (2002/04/18 20:01:52) |