Nanometer-scale semiconductor fabrication has opened a golden period for CPU hardware development, and essential components like the CPU cache have taken full advantage of it. Smaller transistors leave more room on the die for cache, so the cache has advanced alongside the cores. Since information processing starts with accessing data, the CPU cache plays an important role in the overall performance of a microprocessor. Most of us have come across the term CPU cache memory while buying a computer or discussing CPUs. The invention of the CPU cache was a significant event in the history of computers because it solved a serious problem with CPU performance. Earlier processors had to fetch data directly from the system memory. As CPU technology advanced, however, the main memory could no longer keep up with the processor, and a faster way to access data was needed. CPU caches bridge the performance gap between these two components. Let us take you on a journey through how the story started. We will begin with the concept of caching.
In computing, the word ‘cache’ refers to a temporary data store that holds data so that future requests for it can be served faster. Caches are implemented in both hardware and software. But why do we need them? Because they respond faster than the main storage behind them. Cached data is usually the result of an earlier computation or a copy of data stored elsewhere.
Cache memory should not be confused with the general term cache. It is the hardware implementation: blocks of fast memory that the CPU uses to hold copies of data from the main memory. As programs run, recently used data and instructions are copied into these blocks so that the CPU can access them again at a speed much closer to its own clock speed. Chip-based cache memory is also called CPU cache memory, since it is usually integrated into the CPU chip or connected to it directly through a dedicated bus. Simply put, apart from the processor's own registers, the CPU cache is the most quickly accessible memory unit in a computer, and it reduces the time the CPU spends waiting for data.
As we already know, there are at least three types of physical memory sections in a computer memory hierarchy.
Primary storage here refers to the HDD or SSD (elsewhere this tier is often called secondary storage, with "primary" reserved for RAM). It is not only a memory unit but also a long-term storage system, holding bulk data including the OS and other programs. The technology is non-volatile, meaning that, unlike the other memory units, it keeps its content without a power supply. It is the largest, slowest, and least expensive memory section of a computer.
The system memory or main memory refers to the RAM, or Random Access Memory. It is built on Dynamic RAM (DRAM) technology and is installed on the motherboard. This memory is volatile, smaller than the primary storage, and cheaper per byte than the CPU cache. It is much larger than the CPU cache but considerably slower, although it is still far faster than the primary storage.
There is also a memory type called virtual memory. It is not a memory block but a technique that uses part of the storage drive to extend the apparent RAM capacity, so that a computer can run larger programs or more programs at once. In terms of access speed, this technique is also far slower than the CPU cache memory.
Computer memory hierarchy based on size and data access speed
Lastly, we come to the CPU cache. It is also RAM, but the hardware used here is Static RAM, or SRAM. An SRAM cell is built in CMOS technology and typically uses six transistors per bit, whereas a DRAM cell stores each bit with one transistor and one capacitor. Because the capacitor leaks charge, DRAM must be refreshed periodically to retain its data, which costs power and contributes to slower memory access. SRAM needs no refreshing and holds its data as long as power is supplied. Although SRAM is more complex and expensive per bit, it allows the CPU cache to respond to a CPU request within a few nanoseconds.
In the early days of computing, CPUs were slow enough that the main memory could keep pace with them. After 1980, as processor technology advanced, the speed gap between the CPU and RAM started to widen significantly.
CPU-DRAM speed gap
As microprocessor clock speeds grew, RAM access times did not improve at a similar rate, and the RAM began to hold CPU performance back. The demand for faster memory gave birth to the CPU cache system. These days, mainstream consumer CPUs run at around 4GHz, whereas most DDR4 memory modules reach up to 1800MHz or so. The main memory is too slow to work directly with the processor. This is where CPU cache memory plays its most significant role: it bridges the speed gap between the RAM and the processor so that the CPU can perform efficiently. The CPU cache works as an intermediate buffer between the two. It stores small blocks of repeatedly used data or, sometimes, only their memory addresses.
Comparing different memory performance
The significance of CPU cache memory lies in its efficiency in data retrieval. Fast access to instructions and data speeds up the whole program. When time is of the essence, even a few milliseconds of extra latency can potentially lead to huge costs, depending on the situation.
Let’s take a program as an example. It consists of a set of instructions to be executed by the CPU. Initially, the program sits in the primary storage. When it is launched, the instructions make their way through the memory hierarchy toward the CPU: they are first loaded from the primary storage into the RAM and then reach the CPU. Modern CPUs can process a huge amount of data per second, but to use all that power, the CPU needs access to a high-speed memory. This is where the cache comes into play. The memory controller takes the required data from the RAM and loads it into the cache. Depending on the CPU design, this controller can sit inside the CPU itself or in the northbridge chipset on the motherboard.
Data flow across the CPU, cache memory, and main memory
Lastly, the data is distributed across the levels of the cache hierarchy, which carry it back and forth inside the CPU. CPU caches act like small memory pools that store the information the CPU is most likely to need next. Sophisticated algorithms predict which code and data will be needed next so that they can be loaded into the first level of the CPU cache and into the other levels as well. The very purpose of cache memory is to feed the CPU as soon as it makes a request, so that the CPU can keep working with as little waiting as possible.
Generally, CPU cache memory is divided into three main levels: L1, L2, and L3. The levels are ranked by decreasing speed and closeness to the CPU, and therefore by increasing size. This ordering reflects the trade-off between size and speed: L1 is the smallest and fastest, while L2 and L3 are progressively larger and slower.
Level 1 (L1) is the primary cache, the fastest memory block present in a computer. This smallest level of the cache generally comes embedded in the microprocessor chip. According to access priority, the L1 cache holds the data the CPU is most likely to need next to finish a certain task. These days, the total L1 cache size ranges from 256-512KB for flagship processors, with about 64KB for each CPU core. However, some power-efficient CPUs now carry close to 1MB or more. Server processors like Intel’s Xeon CPUs come equipped with 1-2MB of L1 cache in total.

The L1 cache is further split into two parts: the L1 instruction cache and the L1 data cache. The instruction cache holds the instructions the CPU has to execute, and the data cache holds the data those instructions read and write, including results that will later be written back to the main memory. The instruction cache also holds pre-decode information and branching data. In that sense, the instruction cache plays the role of an input cache, whereas the data cache often acts like an output cache. This circular process comes in handy when a program involves loops.

In earlier processors, which had no L3 cache, the system RAM communicated directly with the L2 cache: RAM → L2 Cache → L1 Instruction Cache → Fetch Unit → Decode Unit → Execution Unit → L1 Data Cache → RAM.

An Intel Skylake CPU cache design showing cache memory arrangement
The L2 or secondary cache is larger but slower than the L1 cache. It may come embedded in the CPU, or it can sit on a separate chip or coprocessor. In the latter case, a dedicated high-speed bus connects the cache with the CPU so that it doesn't lose efficiency to traffic on the main memory bus. The total L2 cache ranges from 4-8MB on flagship processors (about 512KB per core), although you can expect even more in high-end machines. In modern multicore CPUs, each core generally has its own dedicated L1 and L2 caches, whereas the L3 cache is shared across all cores.
The third level, L3, backs up L1 and L2 and improves their performance. Although L1 and L2 are significantly faster than L3, L3 can still be roughly twice as fast as DRAM. The size of the L3 cache can vary from 10-64MB, and server chips can go up to 256MB. Recently, AMD’s Ryzen CPUs have been shipping with much larger caches than their Intel counterparts. Earlier, L1, L2, and L3 caches were built from a combination of processor and motherboard components. Currently, the trend is to consolidate all three levels on the CPU itself. Contrary to popular belief, installing extra DRAM won't increase cache memory.

The expressions ‘memory caching’ and ‘cache memory’ are often used synonymously, but they are not the same. Memory caching is a process in which DRAM or flash memory is used to buffer disk reads to improve storage I/O performance, whereas cache memory is a physical memory unit that buffers reads for the processor.
When data is loaded, it flows from the RAM to the L3 cache, then to L2, and finally to L1. When the processor looks for data to operate on, it first checks the associated core’s L1 cache. If the data is found there, the situation is called a cache hit. Otherwise it is a cache miss, and the CPU goes on to search the L2 and then the L3 cache. If the data is in none of them, it has to be fetched from the main memory (or, in the worst case, from storage) and written into the cache. This process takes time and hurts performance.

Cache performance is frequently measured in terms of a quantity called the hit ratio:

Hit ratio = no. of hits / (no. of hits + no. of misses) = no. of hits / total accesses

Generally, the hit rate improves with increased cache size. The effect is most visible in latency-sensitive workloads such as gaming. Latency is the time required to access data from a memory unit. The fastest L1 cache therefore has the lowest latency, and latency increases down the cache hierarchy. On a cache miss, the latency increases a lot, since the CPU has to retrieve the information from a slower level or from the main memory.

Let’s take an example. Suppose a processor has to read data from the L1 cache 100 times in a row. The L1 cache has a 1ns (nanosecond) access latency and a 100 percent hit rate, so the CPU takes 100ns to perform this operation. Now imagine that the hit rate is 99 percent and the data for the 100th access is sitting in L2, with a 10ns access latency. The CPU then takes 99ns for the first 99 reads and 10ns for the 100th, or 109ns in total. This means that a 1 percent drop in hit rate slowed the CPU down by about 9 percent.
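The hit ratio formula and the 100-read example above can be checked with a few lines of Python. This is only a minimal sketch: the function names `hit_ratio` and `total_time_ns` are made up for illustration, and the model assumes a hit always costs the L1 latency and a miss always costs the L2 latency, exactly as in the example.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio = hits / (hits + misses)."""
    total = hits + misses
    if total == 0:
        raise ValueError("no accesses recorded")
    return hits / total


def total_time_ns(hits: int, misses: int, hit_ns: float, miss_ns: float) -> float:
    """Total time when every hit costs hit_ns and every miss costs miss_ns."""
    return hits * hit_ns + misses * miss_ns


# 100 reads, all served by L1 at 1 ns each
all_hits = total_time_ns(100, 0, hit_ns=1.0, miss_ns=10.0)

# 99 reads served by L1, the 100th served by L2 at 10 ns
one_miss = total_time_ns(99, 1, hit_ns=1.0, miss_ns=10.0)

# Relative extra time caused by the single miss
slowdown = (one_miss - all_hits) / all_hits

print(hit_ratio(99, 1))    # 0.99
print(all_hits, one_miss)  # 100.0 109.0
print(round(slowdown, 2))  # 0.09
```

A single miss out of 100 accesses adds 9ns to a 100ns run, which is why even a small drop in hit rate is noticeable.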
In practice, an L1 cache shows a hit rate of around 95 to 97 percent, and the performance at these two values can differ by around 14 percent if the remaining data sits in the L2 cache. If the miss goes all the way to the main memory, with an 80-120ns access latency, the penalty is far larger: under the same simple model, dropping from 97 to 95 percent adds roughly 50 percent to the average access time. As computers evolve, latency is decreasing as well. Low-latency DDR4 RAM and high-speed SSDs are significantly reducing the overall latency of a computer system. In earlier designs, the L2 and L3 caches were kept outside the CPU chip, and the distance between the cache and the processor added latency. Current fabrication processes, however, pack billions of transistors into a relatively small space, leaving more room for cache on the die. As a result, caches sit closer to the CPU and latency is lower. That’s why the market focus has shifted from buying a computer with a large cache to buying one whose cache levels are sufficiently sized and integrated into the CPU chip.
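To see how sensitive performance is to the hit rate, the comparison between 95 and 97 percent can be modeled as an average access time. The sketch below is an assumption-laden simplification: `average_access_ns` is a hypothetical helper, the L1 latency is taken as 1ns, the L2 latency as 10ns, and the main memory latency as 100ns (the middle of the 80-120ns range), and a miss is charged only the latency of the level that serves it.

```python
def average_access_ns(l1_hit_rate: float, l1_ns: float, fallback_ns: float) -> float:
    """Average time per access when misses are served by one slower level."""
    return l1_hit_rate * l1_ns + (1.0 - l1_hit_rate) * fallback_ns


# Assumed latencies: L1 = 1 ns, L2 = 10 ns, main memory = 100 ns
for label, fallback_ns in (("L2", 10.0), ("main memory", 100.0)):
    t95 = average_access_ns(0.95, 1.0, fallback_ns)
    t97 = average_access_ns(0.97, 1.0, fallback_ns)
    print(f"misses served by {label}: "
          f"95% -> {t95:.2f} ns, 97% -> {t97:.2f} ns, ratio {t95 / t97:.2f}")
```

Under these assumptions the ratio comes out near 1.14 when misses are served by L2 and near 1.5 when they go to main memory. Real processors overlap and prefetch memory accesses, so measured numbers will differ.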
This section explains how the CPU cache communicates with the main memory. The cache memory is split into blocks, each of which can be divided into n 64-byte lines. The RAM is also divided into blocks, which map to cache lines or sets of lines. A block of memory can’t be placed just anywhere in the cache; it is restricted to one cache line or one set of cache lines. The cache placement policy determines where a specific memory block can be put. Although caching configurations keep evolving, there are traditionally three policies for placing a memory block into the cache: direct mapping, fully associative mapping, and set-associative mapping.
In a direct-mapped cache, the cache is split into multiple sets with one cache line per set. According to its address, each memory block is mapped to exactly one cache line. Here, the cache can be represented as an (n*1) column matrix. The set is identified by the index bits of the memory block address, and a tag containing all or part of the rest of the address is stored in the tag field. If the allocated line is already occupied, the new data overwrites the old.
Direct mapping is the simplest policy, and, as with any cache, its performance depends directly on the hit ratio. It is power-efficient, since there is no need to search through all the cache lines, and its simplicity keeps the hardware cheap. However, its hit rate is comparatively low: each block of data has only one possible location, so blocks that map to the same line keep replacing each other, which increases cache misses.
In a fully associative cache, the cache is arranged as a single set with multiple lines. A memory block can occupy any of the cache lines. The arrangement can be represented as a (1*m) row matrix.
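The one-line-per-set placement described above (direct mapping) can be sketched in Python. The cache size of 8 lines, the helper `split_address`, and the class `DirectMappedCache` are hypothetical choices for illustration; only the 64-byte line size comes from the text. A real cache does this with address bits in hardware rather than with division.

```python
LINE_SIZE = 64  # bytes per cache line, as in the text
NUM_LINES = 8   # assumed: a tiny cache, for illustration only


def split_address(addr: int) -> tuple[int, int, int]:
    """Split a byte address into (tag, index, offset)."""
    offset = addr % LINE_SIZE   # position inside the line
    block = addr // LINE_SIZE   # memory block number
    index = block % NUM_LINES   # the single line this block may use
    tag = block // NUM_LINES    # identifies which block occupies the line
    return tag, index, offset


class DirectMappedCache:
    def __init__(self) -> None:
        self.tags = [None] * NUM_LINES  # tag stored in each line
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, overwrite the line."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # new data replaces whatever was there
        self.misses += 1
        return False


cache = DirectMappedCache()
# Addresses 0 and 512 share index 0 but have different tags,
# so they keep evicting each other.
for addr in (0, 512, 0, 512):
    cache.access(addr)
print(cache.hits, cache.misses)  # 0 4
```

The four accesses touch only two distinct blocks, yet every one of them misses. This is the low hit rate caused by having a single candidate line per block.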
Fully Associative Cache
This policy offers better flexibility and a higher hit rate. Various replacement algorithms can also be applied when a cache miss occurs. But the lookup is slow, since each search has to check the entire cache, and power-hungry for the same reason. Moreover, it requires the most expensive associative-comparison hardware of all three policies.
A set-associative cache can be seen as an improved version of direct mapping. More precisely, it’s a trade-off between the other two policies. It can be framed as an (n*m) matrix: the cache is split into n sets and every set contains m lines. A memory block is first mapped to a set and can then occupy any cache line within that set.
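A single-set cache in which a block may occupy any line can be sketched as follows. The text does not name a specific replacement algorithm, so this example assumes least-recently-used (LRU) replacement; the class name `FullyAssociativeCache` and the 8-line capacity are also illustrative assumptions. Note that a Python dictionary lookup does not model the cost of comparing every tag, which is what makes the real hardware slow and power-hungry.

```python
from collections import OrderedDict


class FullyAssociativeCache:
    def __init__(self, num_lines: int, line_size: int = 64) -> None:
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # block number -> True, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, evict the LRU line if full."""
        block = addr // self.line_size
        if block in self.lines:
            self.lines.move_to_end(block)   # mark as most recently used
            self.hits += 1
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[block] = True
        self.misses += 1
        return False


cache = FullyAssociativeCache(num_lines=8)
for addr in (0, 512, 0, 512):
    cache.access(addr)
print(cache.hits, cache.misses)  # 2 2
```

With the same 64-byte lines and 8-line capacity, addresses 0 and 512 no longer compete for one slot, so only the first access to each block misses.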
A direct-mapped cache can be thought of as one-way set-associative, and a fully associative cache with n cache lines as n-way set-associative. Most contemporary processors use a set-associative configuration, from two-way or four-way up to higher associativities such as eight-way, with direct-mapped caches found mainly in simpler or older designs.
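The relationship between the three placement policies can be sketched with one parameterized model: n sets of m lines each. As before, this is only an illustration under assumptions: the class `SetAssociativeCache`, the helper `run`, the LRU replacement inside each set, and the total capacity of 8 lines are all choices made for this example.

```python
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, num_sets: int, ways: int, line_size: int = 64) -> None:
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict of tags per set, least recently used first
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        block = addr // self.line_size
        index = block % self.num_sets   # which set the block maps to
        tag = block // self.num_sets    # identifies the block within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict LRU line of this set only
        lines[tag] = True
        self.misses += 1
        return False


def run(cache: SetAssociativeCache, addresses: list[int]) -> tuple[int, int]:
    for addr in addresses:
        cache.access(addr)
    return cache.hits, cache.misses


# Three blocks that all map to set 0 in every configuration below
trace = [0, 512, 1024, 0, 512, 1024]
print(run(SetAssociativeCache(8, 1), trace))  # direct-mapped: (0, 6)
print(run(SetAssociativeCache(4, 2), trace))  # two-way: (0, 6)
print(run(SetAssociativeCache(2, 4), trace))  # four-way: (3, 3)
print(run(SetAssociativeCache(1, 8), trace))  # fully associative: (3, 3)
```

All four configurations hold 8 lines in total. The conflicting blocks only stop evicting each other once a set has enough ways to hold all three of them.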
Although various techniques exist, there are two main policies that govern how writes to cache memory are handled: write-through and write-back.
Write-through: data is written to both the cache and a backing store (another cache or main memory) at the same time.
Write-back: initially, data is written only to the cache. Writing to the backing store occurs only when the cached data is about to be replaced by another block of data.
The write policy has a direct impact on data consistency and access efficiency. With write-through, every write also goes to the backing store, which adds latency upfront. With write-back, efficiency may improve, but the cache blocks and the main memory may be temporarily inconsistent.
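The difference between the two write policies shows up clearly when counting how many writes reach the backing store. The sketch below is a toy model, not a description of any real cache controller: `TinyCache`, its two-entry capacity, the dirty flag, and the LRU eviction order are all assumptions made for illustration.

```python
from collections import OrderedDict


class TinyCache:
    def __init__(self, capacity: int, write_back: bool) -> None:
        self.capacity = capacity
        self.write_back = write_back
        self.lines = OrderedDict()  # address -> (value, dirty flag)
        self.backing_store = {}     # stands in for main memory
        self.store_writes = 0       # writes that reached the backing store

    def _write_store(self, addr: int, value: int) -> None:
        self.backing_store[addr] = value
        self.store_writes += 1

    def write(self, addr: int, value: int) -> None:
        # Make room first; a dirty line must be saved before it is dropped
        if addr not in self.lines and len(self.lines) >= self.capacity:
            old_addr, (old_value, dirty) = self.lines.popitem(last=False)
            if dirty:
                self._write_store(old_addr, old_value)
        if self.write_back:
            self.lines[addr] = (value, True)   # cache only, marked dirty
        else:
            self.lines[addr] = (value, False)  # cache and store together
            self._write_store(addr, value)
        self.lines.move_to_end(addr)


def count_store_writes(write_back: bool) -> int:
    cache = TinyCache(capacity=2, write_back=write_back)
    for i in range(10):
        cache.write(0x100, i)  # same address rewritten 10 times
    cache.write(0x200, 1)
    cache.write(0x300, 2)      # forces eviction of 0x100
    return cache.store_writes


print(count_store_writes(write_back=False))  # write-through: 12
print(count_store_writes(write_back=True))   # write-back: 1
```

In the write-back run, the values for 0x200 and 0x300 exist only in the cache when the program ends, which is the consistency risk mentioned above.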
Locality of reference also has a great influence on a computer's overall performance. It describes the patterns in memory access that make a program's behavior predictable. The CPU cache relies on these patterns when deciding which data to keep and which to fetch. Of the several types, the two basic ones are described here:
Temporal locality: the same data or resources are used repeatedly within a short period.
Spatial locality: data or resources located near each other in memory are accessed close together in time.
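The effect of these two access patterns on a cache can be simulated. This sketch assumes a small cache of 64 lines of 64 bytes with LRU replacement and 4-byte data elements; the function `hit_rate` and the three address traces are hypothetical and chosen only to make the patterns visible.

```python
from collections import OrderedDict


def hit_rate(addresses: list[int], num_lines: int = 64, line_size: int = 64) -> float:
    """Fraction of accesses served from a small LRU cache of whole lines."""
    lines = OrderedDict()
    hits = 0
    for addr in addresses:
        block = addr // line_size
        if block in lines:
            lines.move_to_end(block)
            hits += 1
        else:
            if len(lines) >= num_lines:
                lines.popitem(last=False)
            lines[block] = True
    return hits / len(addresses)


N = 4096
sequential = [i * 4 for i in range(N)]       # neighbors: 16 elements per line
strided = [i * 4096 for i in range(N)]       # every access lands on a new line
repeated = [(i % 16) * 4 for i in range(N)]  # the same 16 elements, reused

print(hit_rate(sequential))  # 0.9375
print(hit_rate(strided))     # 0.0
print(hit_rate(repeated))    # 0.999755859375
```

The sequential trace benefits from nearby data arriving together in one line, the repeated trace benefits from reuse within a short period, and the strided trace has neither, so it never hits.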
Research on CPU cache memory is advancing faster than ever. Alongside work on cutting-edge CPU models, cache designers are trying to squeeze the best performance out of smaller and cheaper structures. Manufacturers like Intel and AMD are not only competing on larger cache designs, including an additional L4 level, but some modern processors also include a small L0 cache. Although the latter is only a few KB in size, it gives the CPU access to a tiny data pool with even lower latency than the L1 cache. A lot is going on to remove the bottlenecks in modern computers. Some efforts aim to reduce latency, while others are dedicated to fitting larger caches on the CPU chip or to experimenting with hybrid cache designs. Either way, it looks like the future of the CPU cache is going to deliver surprising performance in mainstream computers. In today's discussion, we tried to give you some basic information on how CPU cache memory works. If you still have questions, please contact us anytime. We are here for you, always.
To sum up: since retrieving data from the L1 and L2 caches is the first step in information processing, the CPU cache is crucial to a microprocessor's overall performance, which is frequently measured in terms of the hit ratio (hits divided by total accesses). If you have been wondering how much cache memory is good for a laptop, or what L1 and L2 cache memory actually do, this blog is for you. Cache memory is the hardware implementation of the memory blocks the CPU uses to hold copies of data from the main memory. Compared with the CPU cache, the system memory or main memory (RAM, Random Access Memory) is larger but slower. Virtual memory is not a memory block at all; rather, it is a method for extending RAM so that a computer can run larger programs or perform many tasks at once.
The CPU cache bridges the speed gap between the processor and the RAM so that the CPU can operate effectively. The instruction cache stores the instructions the CPU must execute, whereas the data cache stores the data they work on, including results to be written back to the main memory. On a cache miss, latency increases dramatically because the CPU must fetch the data from the main memory. As caches move closer to the CPU, that latency keeps decreasing.