Nanometer-scale semiconductor manufacturing has opened a golden period for CPU hardware development, and essential components such as the CPU cache have advanced along with it. Smaller transistors leave more room on the chip for cache. Since processing begins with accessing data, improvements to the CPU cache play an important role in the overall performance of a microprocessor. Most of us have come across the term CPU cache memory while buying a computer or discussing CPUs. The invention of the CPU cache was a significant event in the history of computers, because it solved a serious problem with CPU performance. Early processors fetched data directly from the system memory. As CPU technology advanced, however, the main memory could no longer keep up, and processors needed a faster way to access data. CPU caches bridge the performance gap between these two components. Let us take you on a journey to describe how the story started. Stay tuned. We will start with the concept of caching.
What is Cache?
In computing, the word ‘cache’ refers to a temporary data store that keeps data ready to serve future requests for it. Caches are implemented in both hardware and software. But why do we need them? They respond faster than the storage they sit in front of. Cached data is usually either the result of an earlier computation or a copy of data stored elsewhere.
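The idea of keeping the result of an earlier computation ready for future requests can be sketched with a software cache. This is only a minimal illustration using Python's standard `functools.lru_cache`; the function `slow_square` and the counter `call_count` are hypothetical names invented for the example.

```python
from functools import lru_cache

call_count = 0  # counts how often the slow path actually runs


@lru_cache(maxsize=128)
def slow_square(n: int) -> int:
    """Stand-in for an expensive computation (hypothetical example)."""
    global call_count
    call_count += 1
    return n * n


slow_square(12)  # first request: computed, then stored in the cache
slow_square(12)  # second request: answered from the cache

print(call_count)                # the function body ran only once
print(slow_square.cache_info())  # hit and miss counters kept by lru_cache
```

The second call never reaches the function body, which is the same benefit a hardware cache gives the CPU on a much smaller time scale.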
What is CPU Cache or CPU Cache Memory?
Cache memory should not be confused with the general term cache. It is a hardware implementation: memory blocks that the CPU uses to hold copies of data from the main memory. These copies are created as programs run, so that the next time the CPU needs the same information it can access it at a speed much closer to the CPU clock. Chip-based cache memory is also called CPU cache memory, since it is mostly integrated into the CPU chip or directly connected to it through a dedicated bus. Simply put, the CPU cache is the fastest memory unit the CPU can access apart from its own registers, and it improves the speed at which the CPU gets its data.
CPU Cache Vs Other Memory Units
As we already know, there are at least three types of physical memory sections in a computer memory hierarchy.
CPU Cache vs The Secondary Storage:
- The storage drive (HDD or SSD), formally called secondary storage, is a long-term storage system rather than a working memory. It stores bulk data, including the OS and other programs. The storage technology is non-volatile, meaning it doesn't lose its content without a power supply, unlike the other memory units. It’s the largest, slowest, and least expensive memory section of a computer.
CPU Cache vs The System memory:
- The system memory or the main memory refers to the RAM or Random Access Memory. It is built on Dynamic RAM (DRAM) technology and is installed on the motherboard. This memory unit is volatile, and it is cheaper per byte than the CPU cache. RAM is much larger than the CPU cache but also considerably slower, although it is far faster than an HDD or SSD.
CPU Cache vs The Virtual Memory:
There is also a memory type named virtual memory. It is not a physical memory block but a technique that extends the apparent RAM capacity, so that a computer can run larger programs or handle more tasks at once. In terms of access speed, however, this technique is also slower than the CPU cache memory.
Figure: Computer memory hierarchy based on size and data access speed
Lastly, we come to the CPU cache. It is also a RAM, but the hardware used here is Static RAM or SRAM. It is built in CMOS technology with, typically, six transistors for each bit, whereas DRAM stores each bit with one capacitor and one transistor. Due to charge leakage, DRAM has to be refreshed continuously to retain its data, which costs power and slows memory access. SRAM needs no refreshing and keeps its data for as long as it has power. Although SRAM is more complex and expensive, it lets the CPU cache respond to a CPU request within a few nanoseconds.
What is the CPU Cache’s Significance in a Processor?
In the early days of computing, when CPUs were comparatively slow, the main memory could keep pace with them. After 1980, as processor technology advanced, the speed gap between CPU and RAM started to widen significantly.
Figure: CPU-DRAM speed gap
As microprocessor clock speeds grew, RAM access times could not improve at a similar rate, and the RAM began to hold CPU performance back. The demand for a faster memory gave birth to the CPU cache system. These days, mainstream consumer CPUs run at around 4GHz, whereas standard DDR4 memory runs at I/O clock rates of roughly 1066 to 1600MHz (DDR4-2133 to DDR4-3200), and each access takes many CPU cycles to complete. The main memory is too slow to work directly with the processor. The CPU cache memory bridges this speed gap so that the CPU can perform efficiently. It works as an intermediate buffer between the two, storing small blocks of frequently used data together with tags that record the memory addresses they came from.
Figure: Comparing different memory performance
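To put the speed gap into perspective, a latency in nanoseconds can be converted into the number of CPU clock cycles that pass while the processor waits. The short sketch below assumes the 4GHz clock mentioned above; the 1ns and 100ns latencies are illustrative assumptions, not measurements of any specific chip.

```python
def cycles_per_access(cpu_ghz: float, latency_ns: float) -> float:
    """Number of CPU clock cycles that elapse during one memory access."""
    # A clock of f GHz completes f cycles every nanosecond.
    return cpu_ghz * latency_ns


CPU_GHZ = 4.0  # clock speed taken from the text

# Latencies below are assumed values for illustration only.
for name, latency_ns in [("L1 cache, assumed 1 ns", 1.0),
                         ("main memory, assumed 100 ns", 100.0)]:
    cycles = cycles_per_access(CPU_GHZ, latency_ns)
    print(f"{name}: {cycles:.0f} cycles")
```

Under these assumptions, a single trip to the main memory costs the CPU hundreds of cycles in which it could otherwise have been doing useful work.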
The significance of the CPU cache memory lies in its efficiency at data retrieval. Fast access to instructions and data raises the overall speed of a program. Time is of the essence in the modern world, and in some situations even a small amount of added latency can lead to large costs.
How Does CPU Cache Memory Work?
Let’s take the example of a program. It consists of a set of instructions to be executed by the CPU. The program is stored on the storage drive (HDD or SSD). When the program is launched, its instructions make their way up the memory hierarchy towards the CPU: they are first loaded from the storage drive into the RAM and then travel to the CPU. Modern CPUs can process a very large amount of data per second, but to use all that power the CPU needs access to high-speed memory. This is where the cache comes into play. The memory controller fetches the required data from RAM and loads it into the cache. Depending on the CPU design, this controller sits either inside the CPU itself or on the northbridge chipset on the motherboard.
Figure: Data flow across the CPU, cache memory, and main memory
Finally, the data is distributed across the levels of the cache hierarchy, which move it back and forth inside the CPU. CPU caches act like small memory pools that hold the information the CPU is most likely to need next. Sophisticated algorithms predict which code and data will be needed next, so that they can be loaded into the first level of CPU cache memory and into the other levels as well. The very purpose of the cache memory is to supply the CPU immediately whenever it makes a request, so that the CPU can keep working without stalling.
Types of CPU Cache
Generally, CPU cache memory is divided into three main levels. The levels are ranked by decreasing speed and closeness to the CPU, and therefore by increasing size:
L1 Cache:
Level 1 is the primary cache, the fastest memory block present in a computer. This smallest level is embedded in the microprocessor chip. The L1 cache holds the data the CPU is most likely to need next to finish its current task. These days, the total L1 cache size generally ranges from 256KB to 512KB for flagship processors, at around 64KB for each CPU core. Some CPUs carry a total close to 1MB or more, and server processors like Intel’s Xeon CPUs come with 1-2MB of L1 cache in total.
The L1 cache is further split in two: the L1 Instruction Cache and the L1 Data Cache. The instruction cache holds the instructions the CPU has to execute, and the data cache holds the data those instructions read and write, including results waiting to be written back to the main memory. The instruction cache also holds pre-decode information and branching data. In that sense, the data cache often acts like an output cache, whereas the instruction cache plays the role of an input cache. This circular flow comes in handy when a program involves loops.
In earlier processors, which had no L3 cache, the system RAM communicated directly with the L2 cache: RAM → L2 Cache → L1 Instruction Cache → Fetch Unit → Decode Unit → Execution Unit → L1 Data Cache → RAM.
Figure: An Intel Skylake CPU cache design showing the cache memory arrangement
L2 Cache:
The secondary cache is larger but slower than the L1 cache. It may be embedded in the CPU, or it can sit on a separate chip or a coprocessor. In the latter case, a dedicated high-speed bus connects the cache with the CPU so that it doesn't lose efficiency to the traffic on the main memory bus. The total L2 cache ranges from 4MB to 8MB on flagship processors (around 512KB per core), and some high-end processors carry even more. In modern multicore CPUs, each core generally has its own dedicated L1 and L2 caches, whereas the L3 cache is shared across all cores.
L3 Cache:
The third level is designed to back up L1 and L2 and improve their performance. Although L1 and L2 are significantly faster than L3, L3 is typically still about twice as fast as DRAM. The size of the L3 cache can vary from 10MB to 64MB, and server chips can go up to 256MB. Recently, AMD’s Ryzen CPUs have shipped with much larger caches than their Intel counterparts.
Earlier, the trend was to build the L1, L2, and L3 caches from a combination of processor and motherboard components. Currently, the trend is to consolidate all three levels of cache memory on the CPU itself. Contrary to popular belief, installing extra DRAM won't increase cache memory.
The two expressions ‘memory caching’ and ‘cache memory’ are often used synonymously, but they are not the same. Memory caching is a process that uses DRAM or flash memory to buffer disk reads and so improve storage I/O performance, whereas cache memory is a physical memory unit that provides read buffering for the processor.
How L1 and L2 CPU Caches Work
Data flows from the RAM to the L3 cache, then to L2, and finally to L1. When the processor needs data to perform an operation, it first looks in its core’s L1 cache. If the data is found there, the situation is called a cache hit. Otherwise, the CPU goes on to search for the data in the L2 and then the L3 cache. Failing to find the data at a given level is called a cache miss. If every level misses, the data has to be retrieved from the main memory, or even from storage, and written into the cache. This process consumes time and negatively affects performance.
Cache performance is frequently measured in terms of a quantity called the hit ratio:
Hit ratio = no. of hits / (no. of hits + no. of misses) = no. of hits / total accesses
Generally, the cache hit rate improves with increased cache size. The effect is most visible in latency-sensitive workloads such as gaming. Latency is the time required to access data from a memory unit. The fastest cache, L1, has the lowest latency, and latency rises with each step down the cache hierarchy. When every cache level misses, latency increases a lot, since the CPU has to retrieve the information from the main memory.
Let’s take an example. Suppose a processor has to retrieve data from the L1 cache 100 times in a row. The L1 cache has a 1ns (nanosecond) access latency and a 100 percent hit rate, so the CPU takes 100ns to perform this operation. Now imagine that the hit rate is 99 percent and the data for the 100th access is actually sitting in L2, with a 10ns access latency. The CPU then takes 99ns for the first 99 reads and 10ns for the 100th, or 109ns in total. In other words, a 1 percent drop in hit rate has slowed the CPU down by about 9 percent.
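The hit ratio formula and the 100-access example can be worked through in a few lines of Python. This is a simplified sketch: the function names are made up for illustration, and it assumes, as the example does, that a hit costs only the L1 latency and a miss costs only the L2 latency.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio = hits / (hits + misses)."""
    return hits / (hits + misses)


def total_time_ns(hits: int, misses: int, hit_ns: float, miss_ns: float) -> float:
    """Total time when each hit costs hit_ns and each miss costs miss_ns."""
    return hits * hit_ns + misses * miss_ns


# 100 accesses with a 1 ns L1 and a 10 ns L2, as in the example above.
all_hits = total_time_ns(100, 0, hit_ns=1.0, miss_ns=10.0)  # 100.0 ns
one_miss = total_time_ns(99, 1, hit_ns=1.0, miss_ns=10.0)   # 109.0 ns
slowdown = (one_miss - all_hits) / all_hits

print(f"hit ratio with one miss: {hit_ratio(99, 1):.2f}")
print(f"total time: {all_hits:.0f} ns vs {one_miss:.0f} ns")
print(f"slowdown: {slowdown:.0%}")
```

A single miss in a hundred accesses adds 9ns to a 100ns job, which shows how sensitive overall speed is to the hit ratio.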
In real practice, an L1 cache shows a hit rate between 95 and 97 percent, and the performance difference between these two values can be around 14 percent if the remaining data sits in the L2 cache. If the missing data is in the main memory instead, with an 80-120ns access latency, the difference grows to around 50 percent.
As computers evolve, latency is decreasing as well. Low-latency DDR4 RAM and high-speed SSDs have significantly reduced the overall latency in a computer system. In earlier designs, the L2 and L3 caches were kept outside the CPU chip, and the distance between the cache units and the processor added latency. Current fabrication processes fit billions of transistors into a relatively small space, which leaves more room for cache on the chip itself. As a result, the cache sits ever closer to the CPU and latency falls. That’s why the market focus has shifted from buying a computer with a large cache to buying one with enough cache levels integrated into the CPU chip.
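The gap between a 95 percent and a 97 percent hit rate can be checked with an average access time calculation. The sketch below is a simplified model that assumes a 1ns L1 latency, a 10ns L2 latency, and 100ns as a midpoint of the 80-120ns main memory range; the function names are invented for this example.

```python
def average_access_ns(l1_hit_rate: float, l1_ns: float, fallback_ns: float) -> float:
    """Average time per access when L1 misses are served at fallback_ns."""
    return l1_hit_rate * l1_ns + (1.0 - l1_hit_rate) * fallback_ns


def relative_slowdown(fallback_ns: float, low: float = 0.95,
                      high: float = 0.97, l1_ns: float = 1.0) -> float:
    """Extra time at the lower hit rate, relative to the higher one."""
    slow = average_access_ns(low, l1_ns, fallback_ns)
    fast = average_access_ns(high, l1_ns, fallback_ns)
    return slow / fast - 1.0


print(f"misses served by L2 (10 ns):   {relative_slowdown(10.0):.0%}")
print(f"misses served by RAM (100 ns): {relative_slowdown(100.0):.0%}")
```

With these assumed latencies, the two-point drop in hit rate costs roughly 14 percent when L2 catches the misses and roughly 50 percent when they go all the way to the main memory.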
CPU Cache Memory Mapping
This is the process that determines how the CPU cache communicates with the main memory. The cache memory is divided into fixed-size units called cache lines (or cache blocks), typically 64 bytes each. The RAM is likewise divided into blocks of the same size, each of which maps onto a cache line or a set of lines. A block of memory can’t be placed just anywhere in the cache: it is restricted to one cache line or one set of cache lines. The cache placement policy determines where a specific memory block can be put in the cache. Although caching configurations are evolving continuously, there are traditionally three placement policies.
Direct Mapping:
In this process, the cache is split into multiple sets with one cache line per set. Each memory block is mapped to exactly one cache line according to its address. The cache can be represented as an (n*1) column matrix. The set is selected by the index bits of the memory block address, and a tag containing all or part of the address is stored in the tag field. If the allocated place is already occupied, the new data overwrites the old data in the cache.
The performance of this simplest policy depends directly on the hit ratio. It is power-efficient, since there is no need to search through all the cache lines, and its simplicity keeps the hardware cheap. However, the hit rate is low: each block of data has only one possible location, so blocks that map to the same line keep replacing each other, which increases cache misses.
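Direct mapping can be sketched as a small simulation. The code below is a toy model rather than a description of any real processor: it assumes 64-byte lines and a hypothetical cache of only 8 lines, and it tracks stored tags without holding any actual data.

```python
LINE_SIZE = 64  # bytes per cache line
NUM_LINES = 8   # hypothetical, deliberately tiny cache


def split_address(addr: int):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % LINE_SIZE    # position of the byte inside its line
    block = addr // LINE_SIZE    # memory block number
    index = block % NUM_LINES    # the one line this block may occupy
    tag = block // NUM_LINES     # identifies which block is in the line
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES  # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, overwrite the line's old tag."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache()
# Addresses 0 and 512 both map to index 0 (512 = 8 lines * 64 bytes),
# so each access evicts the block the other one needs.
for addr in [0, 512, 0, 512]:
    cache.access(addr)
print(cache.hits, cache.misses)
```

The two addresses keep evicting each other even though seven lines stay empty, which is the weakness of having exactly one place for each block.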
Associative or Fully-associative Mapping:
In this process, the cache is arranged into a single set with multiple lines. A memory block can inhabit any of the cache lines. The arrangement can be represented as a (1*m) row matrix.
This policy offers better flexibility and a better hit rate. Various replacement algorithms can also be applied when a cache miss occurs. But a lookup is slow and power-hungry, since every search has to compare against the entire cache. Moreover, it requires the most expensive associative-comparison hardware of all three policies.
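A fully associative cache can be sketched in the same toy style. The example below assumes 64-byte lines and uses least-recently-used (LRU) replacement, which is only one of the possible replacement algorithms; the class name is invented for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line


class FullyAssociativeCache:
    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # block number -> True, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, evict the LRU block if full."""
        block = addr // LINE_SIZE
        if block in self.lines:
            self.lines.move_to_end(block)   # mark as most recently used
            self.hits += 1
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # drop the least recently used
        self.lines[block] = True
        self.misses += 1
        return False


cache = FullyAssociativeCache(num_lines=8)
# Blocks at addresses 0 and 512 can now live in the cache side by side.
for addr in [0, 512, 0, 512]:
    cache.access(addr)
print(cache.hits, cache.misses)
```

Because a block may occupy any line, the two addresses no longer compete for the same slot, and only the first access to each one misses. In hardware, the price is a tag comparison against every line on each lookup.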
Set-associative Mapping:
This process can be seen as an improved version of direct mapping. More precisely, it’s a trade-off between the other two. A set-associative cache can be framed as an (n*m) matrix: the cache is split into n sets and every set contains m lines. A memory block is first mapped to a set and can then occupy any cache line of that set.
A direct-mapped cache can be seen as one-way set-associative, and a fully associative cache with m cache lines can be seen as m-way set-associative with a single set. Most contemporary processors use set-associative caches, commonly with anywhere from two to sixteen ways depending on the cache level.
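The relationship between the three policies can be illustrated with one small model in which the number of sets and the number of ways are parameters. This is a simplified sketch that assumes 64-byte lines and LRU replacement inside each set; `SetAssociativeCache` and `run` are names made up for the example.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line


class SetAssociativeCache:
    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        # One LRU-ordered tag store per set, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        block = addr // LINE_SIZE
        index = block % self.num_sets   # which set the block belongs to
        tag = block // self.num_sets    # identifies the block within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the LRU line of this set
        lines[tag] = True
        self.misses += 1
        return False


def run(num_sets: int, ways: int, addresses):
    """Replay an address trace and return (hits, misses)."""
    cache = SetAssociativeCache(num_sets, ways)
    for addr in addresses:
        cache.access(addr)
    return cache.hits, cache.misses


trace = [0, 512, 0, 512]
print(run(8, 1, trace))  # 8 sets x 1 way: behaves as direct-mapped
print(run(4, 2, trace))  # 4 sets x 2 ways: two-way set-associative
print(run(1, 8, trace))  # 1 set x 8 ways: behaves as fully associative
```

All three configurations hold 8 lines in total. On this trace, the one-way version misses every time, while two ways are already enough to keep both conflicting blocks in the cache.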
Data Writing Policies
Although various techniques are available, cache memory is written under two main writing policies.
Write-through:
- Data is written to both the cache and the backing store (another cache or main memory) at the same time.
Write-back (or Write-behind):
- Initially, data is written only to the cache. Writing to the backing store occurs only when the present data is about to be replaced by another block of data.
The data writing policy has a direct impact on data consistency and access efficiency. Write-through has to perform more writes to the backing store, which adds latency to every write. Write-back can be more efficient, but the data in the cache and in the main memory may be temporarily inconsistent.
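The difference between the two policies can be made concrete by counting how many writes reach the backing store. The sketch below is a toy model with invented names: it tracks only a dirty flag per cached block, uses LRU replacement, and assumes that leftover dirty blocks are flushed at the end.

```python
from collections import OrderedDict


class WritePolicyCache:
    def __init__(self, capacity: int, write_back: bool):
        self.capacity = capacity
        self.write_back = write_back
        self.lines = OrderedDict()  # block -> dirty flag, LRU first
        self.store_writes = 0       # writes that reached the backing store

    def write(self, block: int) -> None:
        if block in self.lines:
            self.lines.move_to_end(block)
        elif len(self.lines) >= self.capacity:
            self._evict_oldest()
        if self.write_back:
            self.lines[block] = True   # mark dirty, defer the store write
        else:
            self.lines[block] = False  # cache line stays clean
            self.store_writes += 1     # write-through: update the store now

    def _evict_oldest(self) -> None:
        _, dirty = self.lines.popitem(last=False)
        if dirty:
            self.store_writes += 1     # write-back: flush on replacement

    def flush(self) -> None:
        """Write every remaining dirty block to the backing store."""
        for block, dirty in self.lines.items():
            if dirty:
                self.store_writes += 1
                self.lines[block] = False


def count_store_writes(write_back: bool) -> int:
    cache = WritePolicyCache(capacity=1, write_back=write_back)
    for block in [7] * 10 + [9]:  # ten writes to block 7, then one to block 9
        cache.write(block)
    cache.flush()
    return cache.store_writes


print("write-through:", count_store_writes(write_back=False))
print("write-back:   ", count_store_writes(write_back=True))
```

In this model write-through sends all eleven writes to the backing store, while write-back sends only two: one when block 7 is replaced and one at the final flush. Until those flushes happen, the backing store holds stale data, which is the consistency cost described above.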
Locality of Reference
Locality also has a great influence on a computer's overall performance. It describes the patterns in memory access that make a program's behavior predictable. The CPU cache relies on these patterns when deciding which data to keep and which to fetch. Of the several types of locality, the two basic ones are described here:
Temporal locality:
- The same data or resources are used repeatedly within a short period of time.
Spatial locality:
- Data or resources that are stored near each other are accessed close together in time.
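The effect of both kinds of locality on the hit ratio can be seen in a small simulation. This sketch assumes a hypothetical fully associative LRU cache of 16 lines of 64 bytes each; the three access patterns are invented purely for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line
NUM_LINES = 16  # hypothetical tiny cache


def simulate_hit_ratio(addresses) -> float:
    """Replay byte addresses through a small LRU cache and return the hit ratio."""
    lines = OrderedDict()  # block number -> True, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // LINE_SIZE
        if block in lines:
            lines.move_to_end(block)
            hits += 1
        else:
            if len(lines) >= NUM_LINES:
                lines.popitem(last=False)
            lines[block] = True
    return hits / len(addresses)


sequential = [i * 8 for i in range(1024)]          # neighbouring 8-byte items
strided = [i * 4096 for i in range(1024)]          # items 4 KB apart
repeated = [(i % 4) * 4096 for i in range(1024)]   # the same 4 items, reused

print("spatial locality (sequential):", simulate_hit_ratio(sequential))
print("no locality (large stride):   ", simulate_hit_ratio(strided))
print("temporal locality (reuse):    ", simulate_hit_ratio(repeated))
```

Sequential access benefits from spatial locality, because one fetched line serves the next seven 8-byte reads. The reused items benefit from temporal locality, and the large-stride pattern touches a new line on every access, so the cache never helps.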
The Future of the CPU Cache
Research on CPU cache memory is advancing faster than ever. Just as experiments focus on cutting-edge CPU models, cache designs are also striving to get the best performance out of smaller and cheaper structures. Manufacturers like Intel and AMD are not only competing on larger cache designs, including an additional L4 level; some modern processors also include a small L0 cache. Although an L0 cache is only a few KB in size, it gives the CPU access to a tiny data pool with even lower latency than the L1 cache. Undoubtedly, a lot is going on to remove the bottlenecks in modern computers. One line of work aims at reducing latency, another at fitting larger caches onto the CPU chip, and others experiment with hybrid cache designs. Whatever comes, the future of the CPU cache looks set to offer surprising performance from mainstream computers. In today's discussion, we tried to give you some basic information on how CPU cache memory works. If you still have questions, please contact us anytime. We are here for you, always.