Today, AMD detailed the Zen 2 core, backing up the claim the company made at Computex last week of a 15% clock-for-clock performance increase over the previous generation.
AMD's Zen 2 portfolio
AMD has so far announced two families of products built on the Zen 2 core: the third-generation consumer Ryzen CPUs, the Ryzen 3000 series, and its next-generation EPYC server processor, codenamed Rome. To date, AMD has released details of six consumer Ryzen 3000 processors, including core counts, frequencies, memory support, and power. Details of the server processors, beyond a few headline numbers, are expected to be announced in the coming months.
Compared to the first generation of Zen, Zen 2 represents a significant shift in design paradigm. The new platform and core implementation are designed around a small 8-core chiplet built on TSMC's 7nm process, measuring approximately 74-80 square millimeters. Each chiplet carries two four-core "core complexes" (CCXes), each containing four cores and a slice of L3 cache - Zen 2's L3 cache is double that of Zen 1.
Every complete CPU, regardless of how many chiplets it has, pairs them with a central IO die through Infinity Fabric links. The IO die acts as the central hub for all off-chip communications, as it houses all of the processor's PCIe lanes, memory channels, and Infinity Fabric links to other chiplets and other CPUs. The IO die of the EPYC Rome processors is built on GlobalFoundries' 14nm process, while the consumer processors use a smaller IO die with fewer features, built on GlobalFoundries' 12nm process.
The consumer processors, codenamed "Matisse" (or 3rd Gen Ryzen, the Ryzen 3000 series), have up to two chiplets for a total of sixteen cores. AMD will launch six versions of Matisse on July 7th, ranging from six cores to sixteen. The six-core and eight-core parts have one chiplet, while the parts above eight cores have two, but in all cases the IO die is the same. This means every Zen 2 based Ryzen 3000 processor has access to 24 PCIe 4.0 lanes and dual-channel memory. According to today's announcement, pricing will run from $199 for the Ryzen 5 3600 to more than $700 for sixteen cores (we are awaiting final confirmation of this price).
The EPYC Rome processors, also built on Zen 2 chiplets, have up to eight chiplets, enabling a platform with up to 64 cores. As with the consumer processors, the chiplets cannot communicate directly with one another - each chiplet connects only to the central IO die. That IO die houses links for eight memory channels and up to 128 lanes of PCIe 4.0 connectivity.
AMD's roadmap
Before discussing the new product line, it is worth reviewing where we currently stand in AMD's roadmap.
AMD's previous roadmaps showed the progression from Zen to Zen 2 and Zen 3, and AMD has explained that this structure has been in place for many years: Zen launched in 2017, Zen 2 arrives in 2019, and Zen 3 is expected around 2020. The cadence is not strictly annual, as it depends on AMD's design and manufacturing capabilities, its agreements with foundry partners, and prevailing market forces.
AMD has said the plan was always to introduce Zen 2 on a 7nm process, ultimately using TSMC's 7nm (GlobalFoundries failed to get its own 7nm process ready in time, and eventually abandoned it). The next-generation Zen 3 is expected to align with an updated 7nm process, and AMD has not commented on a potential "Zen 2+" design, though at this point we do not expect to see one.
Beyond Zen 3, AMD has stated that Zen 4 and Zen 5 are currently in different stages of their respective designs, but it is not committing to specific timeframes or process node technologies. AMD has said in the past that these platforms and processor design paradigms are laid out three to five years in advance, and that the company has to place big bets on each generation to ensure it remains competitive.
Digging a little deeper into Zen 4: at Computex, Forrest Norrod, senior vice president of AMD's enterprise, embedded and semi-custom group, revealed to AnandTech the codename of AMD's Zen 4 EPYC processor: Genoa.
Forrest explained that the codename for Zen 5 follows a similar pattern, but he declined to comment on the timeframe for Zen 4 products. Given that Zen 3 designs are expected to launch in mid-2020, if AMD follows this cadence, Zen 4 would arrive in late 2021 or early 2022. How it enters AMD's consumer roadmap is unclear, and will depend on how AMD evolves its chiplet paradigm and adjusts its packaging technology to achieve further performance improvements.
Zen 2 performance claims
At Computex, AMD announced that, comparing Zen 2 against Zen+ at the same frequency, Zen 2 provides 15% higher raw performance. AMD also claims that at the same power, Zen 2 delivers more than 1.25x the performance, or needs only half the power at the same performance. Taken together, AMD claims a performance-per-watt advantage of 75% over its predecessor and 45% over its competition.
These figures cannot currently be verified, as we do not yet have the products in hand. When the embargo lifts on July 7, we will have our own benchmark results. AMD did, however, spend a lot of time going through the changes in the Zen 2 microarchitecture and at the platform level to show how the product improves on the previous generation.
It should also be noted that throughout AMD's recent Tech Day, the company repeatedly stated that it has no intention of trading incremental updates with its main competitor in an attempt to one-up each other, which can lead to technical stagnation. AMD's executives say that, regardless of the competition, AMD will push the performance limits of every generation as hard as it can. CEO Lisa Su and CTO Mark Papermaster both said they had expected the Zen 2 portfolio's launch window to coincide with a highly competitive Intel 10nm product line; although that has not materialized, AMD's executives say they are still pushing their roadmap as planned.
When AMD demonstrated the performance of its upcoming Matisse processors, its benchmark of choice was Cinebench. Cinebench is a floating-point benchmark, an area in which the company has historically done very well; it tends to probe CPU FP and cache performance, though it usually does not stress the memory subsystem much.
Back in January at CES 2019, AMD showed an unnamed 8-core Zen 2 processor roughly matching Intel's high-end 8-core i9-9900K in Cinebench R15, but with AMD's system-wide power consumption about a third lower than Intel's. At Computex in May, AMD released a number of details on the 8-core and 12-core parts, along with single-threaded and multi-threaded Cinebench R20 comparisons for these chips.
AMD says its new processors offer better single-threaded and multi-threaded performance, lower power consumption, and lower prices in CPU benchmarks when compared at matching core counts.
When it comes to gaming, AMD is quite bullish. At 1080p, comparing the Ryzen 7 2700X to the Ryzen 7 3800X, AMD expects a generational frame-rate uplift of 11% to 34%.
When comparing AMD and Intel processors, AMD stuck to 1080p testing of popular games, again matching core counts and similarly priced processors. In almost all comparisons, AMD's products trade blows with Intel's - some higher, some lower, some even. Below is an example of a comparison at the $250 price point:
At this point, the gaming numbers are meant to show improvements in frequency and IPC, rather than any benefit from PCIe 4.0. On frequency, AMD said that despite the smaller 7nm transistors and higher channel resistivity, it can extract higher frequencies from TSMC's 7nm process than from GlobalFoundries' 14nm and 12nm.
AMD also commented on the new L3 cache design, which moves from 2MB per core to 4MB per core. According to AMD, doubling the L3 cache yields an additional 11% to 21% in 1080p gaming performance with a discrete GPU.
We will be able to put these numbers to the test when the hardware arrives.
Windows optimization
A perennial headache for non-Intel processors is optimization in Windows, and the operating system's scheduler. We have seen in the past how unfriendly Windows can be to non-Intel microarchitecture layouts, such as AMD's earlier module design in Bulldozer, Qualcomm's hybrid CPU strategy with Windows on Snapdragon, and most recently the multi-die arrangement in Threadripper that introduced different memory-latency domains into consumer computing.
AMD clearly has a close relationship with Microsoft when it comes to identifying a processor's non-regular core topology, and the two companies are working to ensure that thread and memory allocation, absent any program-driven direction, makes the most of the system. With the Windows 10 May 2019 Update, additional features are in place to take advantage of the upcoming Zen 2 microarchitecture and the Ryzen 3000 chiplet layout.
There are two aspects to optimization, both of which are easy to explain.
Thread grouping
The first is thread allocation. When a processor has different "groups" of CPU cores, there are different ways to allocate threads, each with advantages and disadvantages. The two extremes boil down to thread grouping and thread expansion.
Thread grouping means that as new threads are spawned, they are allocated to cores directly next to ones that already have threads. This keeps threads close together for thread-to-thread communication, but it can create regions of high power density, especially when many cores sit on the processor and only a few are active.
Thread expansion means threads are placed as far apart from each other as possible: a second thread would spawn on a different chiplet or a different core complex (CCX). This lets the CPU avoid regions of high power density and maintain high frequencies, typically providing the best turbo performance across multiple threads.
The danger of thread expansion comes when a program spawns two threads that end up in different parts of the CPU. On Threadripper, this could even mean the second thread landing on a part of the CPU with longer memory latency, causing a potential performance imbalance between the two threads, even though the cores they sit on would be at higher turbo frequencies.
Because modern software, particularly video games, spawns multiple threads rather than relying on one, and those threads need to communicate with each other, AMD is moving from a hybrid thread-expansion technique to a thread-grouping technique. This means one CCX will fill up with threads before another CCX is used. AMD believes that despite the potential for high power density in one chiplet while the other sits idle, this is still worth it for overall performance. A minimal sketch of the idea appears below.
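To make the policy concrete, here is a minimal sketch (our illustration, not AMD's or Microsoft's scheduler code) of "thread grouping" done manually on Linux: two communicating threads are pinned to logical cores 0 and 1, which on a single-chiplet Ryzen 3000 part would typically share a CCX and its L3. The core numbering is an assumption; real code should query the topology (e.g. via hwloc) first.

```cpp
#define _GNU_SOURCE   // for pthread_setaffinity_np (often predefined by g++)
#include <pthread.h>
#include <sched.h>
#include <cstdint>
#include <cstdio>

// Pin the calling thread to a single logical CPU.
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* worker(void* arg) {
    pin_to_cpu((int)(intptr_t)arg);
    // ... thread-to-thread communication here benefits from the shared L3 ...
    return nullptr;
}

int main() {
    pthread_t a, b;
    pthread_create(&a, nullptr, worker, (void*)(intptr_t)0);  // core 0
    pthread_create(&b, nullptr, worker, (void*)(intptr_t)1);  // core 1: same CCX (assumed)
    pthread_join(a, nullptr);
    pthread_join(b, nullptr);
    std::puts("threads grouped onto adjacent cores");
    return 0;
}
```

Compile with `g++ -O2 -pthread`; the same idea applies on Windows via SetThreadAffinityMask.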
For Matisse, this should give a nice boost in limited-thread scenarios. It will be interesting to see how much of an impact this has on the upcoming EPYC Rome CPUs or future Threadripper designs. The one benchmark AMD cited in its explanation was Rocket League at 1080p Low, which it says saw a 15% increase in frame rate.
Clock boost
Those familiar with the Skylake microarchitecture may remember that Intel introduced a feature called Speed Shift, which lets the processor move between P-states more freely and ramp from idle to load very quickly - from 100 milliseconds down to 40 milliseconds in the first version in Skylake, then to 15 milliseconds with Kaby Lake. It does this by handing P-state control back from the operating system to the processor, which reacts to instruction throughput and requests. With Zen 2, AMD now implements similar functionality.
Compared to Intel, AMD already has finer granularity in its frequency adjustments, allowing 25MHz steps rather than 100MHz, but a faster ramp-to-load frequency jump will help AMD in very bursty workloads, such as WebXPRT (a demo Intel is fond of). According to AMD, enabling this feature on Zen 2 requires a BIOS update plus the Windows 10 May 2019 Update, but it cuts the frequency ramp time from Zen's roughly 30 milliseconds to 1-2 milliseconds on Zen 2. Notably, that is much faster than the figures Intel has given.
The technical name for AMD's implementation is CPPC2, or Collaborative Power and Performance Control 2, and AMD's metrics suggest it helps burst workloads and application loading. AMD cited a 6% improvement in application launch times in PCMark10's app launch subtest.
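Ramp behavior like this can be observed, if crudely, from user space. The sketch below (our methodology, not AMD's) times identical bursts of ALU work immediately after an idle period: on a slow-ramping CPU, the first bursts run measurably longer than later ones.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

static volatile uint64_t sink;  // prevents the work loop being optimized away

// A fixed chunk of integer work; wall time shrinks as the core clocks up.
static void burst() {
    uint64_t x = 1;
    for (int i = 0; i < 200000; ++i)
        x = x * 6364136223846793005ULL + 1;
    sink = x;
}

int main() {
    // Let the core drop to an idle P-state first.
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    for (int i = 0; i < 20; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        burst();
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("burst %2d: %lld us\n", i, (long long)us);
    }
    return 0;
}
```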
Enhanced Zen 2 security
Another aspect of Zen 2 is how AMD addresses the heightened security requirements of modern processors. As has been reported, a number of the recent side-channel attacks do not affect AMD processors, largely because of how AMD manages its TLB buffers, which have always required additional security checks before any of this became an issue. Even so, for the vulnerabilities to which it is susceptible, AMD has implemented a fully hardware-based security platform.
The change here relates to Speculative Store Bypass, known as Spectre v4, for which AMD now has additional hardware that works with the operating system or virtual-machine managers (such as hypervisors) for control. AMD does not expect these updates to introduce any performance change. Newer issues such as Foreshadow and ZombieLoad do not affect AMD processors.
New instructions: cache and memory bandwidth QoS controls
As with most new x86 microarchitectures, there is a drive to improve performance through new instructions, but also an attempt at parity between vendors on which instructions are supported. For Zen 2, while AMD does not cater to some of the more esoteric instruction sets the way Intel does, it adds new instructions in three areas.
The first is CLWB, which has been seen on Intel processors before in the context of non-volatile memory. The instruction lets a program push data back to non-volatile memory, in case the system receives a halting command that could cause data loss. Although AMD has not said as much, there are other instructions related to securing data to non-volatile memory systems; this could be a sign that AMD is looking to better support non-volatile memory hardware and structures in future designs, particularly in EPYC processors.
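For illustration, here is a minimal sketch of how CLWB is typically used from C++ via the real `_mm_clwb` intrinsic, assuming a CLWB-capable CPU and a pointer that maps persistent memory (on ordinary DRAM the code runs but persists nothing):

```cpp
#include <immintrin.h>  // _mm_clwb, _mm_sfence; compile with -mclwb
#include <cstdint>

// Store a value and push its cache line back toward (assumed) persistent
// memory without evicting the line, then fence to order later stores.
void persist_value(uint64_t* p, uint64_t v) {
    *p = v;          // regular store
    _mm_clwb(p);     // write the dirty line back, keep it cached
    _mm_sfence();    // ensure the write-back is ordered before what follows
}
```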
The second cache instruction, WBNOINVD, is an AMD-only command, but it builds on similar commands such as WBINVD. It is used when software can predict which parts of the cache might be needed in the future, writing them back ahead of time to speed up future calculations. If a needed cache line is not ready, a flush command would otherwise have to be processed before the desired operation, adding latency; running the cache-line write-back ahead of time, while latency-critical instructions are still moving down the pipeline, helps speed up their eventual execution.
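WBNOINVD itself is a privileged instruction, so ordinary applications cannot issue it directly; user code can at least detect support. A small sketch, assuming the documented CPUID location (extended leaf 0x80000008, EBX bit 9):

```cpp
#include <cpuid.h>   // __get_cpuid (GCC/Clang)
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Extended leaf 0x80000008: EBX bit 9 indicates WBNOINVD support.
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 9)))
        std::puts("WBNOINVD supported");
    else
        std::puts("WBNOINVD not supported");
    return 0;
}
```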
The final set of instructions, filed under QoS, relates to how cache and memory priorities are assigned.
When a cloud CPU is split among different containers or virtual machines (VMs) for different customers, performance levels are not always consistent, because performance can be limited by what another VM on the system is doing. This is the "noisy neighbor" problem: if someone else is consuming all of the core-to-memory bandwidth or the L3 cache, another VM on the system will struggle to access what it needs. Thanks to that noisy neighbor, the latency with which other VMs handle their workloads becomes highly variable. Alternatively, if a mission-critical VM shares a system with a VM that constantly requests resources, the mission-critical one might miss its targets because it cannot get the resources it needs.
Short of granting a single user exclusive access to the hardware, noisy neighbors are hard to deal with. Most cloud providers will not even tell you whether you have neighbors, and with live VM migration those neighbors can change very frequently, so there is no guarantee of sustained performance. This is where a dedicated set of QoS (Quality of Service) instructions comes in.
As with Intel's implementation, when a series of VMs is allocated on a system on top of a hypervisor, the hypervisor can control how much memory bandwidth and cache each VM gets. If a mission-critical 8-core VM needs access to 64 MB of L3 and at least 30 GB/s of memory bandwidth, the hypervisor can ensure that VM always has access to those amounts, either removing them entirely from the pool available to other VMs, or intelligently restricting other VMs' demands as the mission-critical VM bursts into full access.
Intel has only enabled this feature on its Xeon Scalable processors, but AMD will enable it for both consumer and enterprise users across the Zen 2 processor family.
The most immediate issue I could think of for this feature is at the consumer level: imagine if a video game could demand access to all the cache and all the memory bandwidth, locking out any streaming software - that could wreak havoc on a system. AMD explained that, while technically individual programs can request a certain level of QoS, it is up to the OS or hypervisor to decide whether those requests are valid and accommodated. They see the feature more as an enterprise tool used by hypervisors than something for bare-metal consumer installations. A sketch of how a hypervisor-level agent might drive this on Linux follows.
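On Linux, both Intel's RDT and AMD's QoS extensions are exposed through the `resctrl` filesystem. The sketch below is illustrative only: it assumes a kernel with AMD QoS support, a mounted /sys/fs/resctrl, root privileges, and a hypothetical PID; the mask and bandwidth values are made up for the example.

```cpp
#include <fstream>
#include <sys/stat.h>

int main() {
    // Create a resource group for the mission-critical VM's processes.
    mkdir("/sys/fs/resctrl/critical_vm", 0755);

    // Reserve part of the L3 and cap memory bandwidth for this group.
    // "00ff" = 8 of 16 cache ways (assumed mask width); "MB:0=50" = ~50%
    // of memory bandwidth on domain 0 (units are kernel-dependent).
    std::ofstream("/sys/fs/resctrl/critical_vm/schemata")
        << "L3:0=00ff\n"
        << "MB:0=50\n";

    // Attach the (hypothetical) VM process to the group.
    std::ofstream("/sys/fs/resctrl/critical_vm/tasks") << 12345;
    return 0;
}
```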
CCX size
Moving to a smaller node brings plenty of challenges, both inside and outside the core. Even leaving power and frequency aside, it is an exercise simply to lay out the structures in the silicon, integrate the silicon into the package, and deliver power to the right parts of the chip through the right connections. AMD gave us some insight into how 7nm changed some of its designs, and the packaging challenges involved.
One key metric AMD shared relates to the core complex (CCX): four cores, their associated core structures, and the L2 and L3 caches. On 12nm with the Zen+ core, a single core complex measured 60 square millimeters: 44 mm² for the cores and 16 mm² for the 8 MB of L3. Two of these 60 mm² complexes, plus two memory controllers, PCIe lanes, four IF links, and other IO, brought the Zen+ Zeppelin die to 213 mm² in total.
For Zen 2, a single chiplet is 74 mm², of which 31.3 mm² is a core complex with 16 MB of L3. AMD did not break the 31.3 mm² down into cores versus L3, but one can imagine the L3 could be close to 50% of that figure. The chiplet can be this small because it needs no memory controllers, carries only a single IF link, and has no other IO - all the platform requirements sit on the IO die. This lets AMD make the chiplets very compact. However, if AMD intends to keep growing the L3 cache, the L3 may come to occupy most of the chiplet.
Overall though, AMD says the size of the CCX (cores plus L3) has shrunk by 47%. That is impressive scaling, especially once the +15% raw instruction throughput and increased frequencies come into play. Performance per square millimeter is going to be a very exciting metric.
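As a quick sanity check on that figure: going from the 60 mm² Zen+ CCX to the 31.3 mm² Zen 2 CCX works out to 1 − 31.3/60 ≈ 0.48, a reduction of just under 48%, which lines up with AMD's quoted 47% once rounding in the area figures is taken into account.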
Package
With Matisse staying on the AM4 socket and Rome staying on the EPYC socket, AMD says it had to make some bets in its packaging technology to maintain compatibility. Some of those bets are the trade-offs that come with continued support, but AMD believes the extra effort is worth it.
One of the key packaging issues AMD discussed is how each die connects to the package. To enable a pin-grid-array desktop processor, the die has to be affixed to the package in a BGA fashion. AMD said that with the 7nm process, the bump pitch (the distance between the solder balls on the die and the package) shrank from 150 microns on 12nm to 130 microns on 7nm. That may not sound like much, but AMD says only two vendors in the world have the technology to do it. The only alternative would be to use a larger die to support a larger bump pitch, which ultimately leads to a lot of empty silicon (or a different design paradigm).
One way to enable a tighter bump pitch is to adjust how the bumps are processed on the underside of the die. Normally, a solder bump on a package is a blob or ball of lead-free solder, relying on the physics of surface tension and reflow to ensure consistency and regularity. In order to achieve the tighter bump pitch, however, AMD had to move to a copper-pillar solder-bump topology.
To achieve this, copper is grown epitaxially within a mask to form a "stand" on which the reflow solder sits. Because of the diameter of the pillar, less solder mask is required, giving a smaller solder radius. Matisse's two-die design created another issue for AMD: if the IO die uses standard solder-bump masks while the chiplet uses copper pillars, the integrated heat spreader needs a consistent height across both. For the smaller copper pillars, this means managing how tall the pillars are grown.
AMD explained that it is easier to manage this connection than to build a heat spreader with different heights, since the stamping process used for heat spreaders cannot hit such small tolerances. AMD expects all of its 7nm designs going forward to use copper pillars.
Routing
Beyond placing the dies on the organic substrate, the substrate also has to manage the connections between the dies and to the outside world. To handle the extra routing, AMD had to increase the number of substrate layers in the package to 12 for Matisse (it did not reveal how many layers Rome needs - perhaps 14). Things also get a bit more complicated with single-chiplet versus dual-chiplet processors, especially when dies are tested before being placed in the package.
From the image we can clearly see the IF links running from the two chiplets to the IO die, which also handles the memory controllers and power-plane duties. There are no in-package links between the chiplets: chiplets cannot communicate directly, and all chiplet-to-chiplet communication goes through the IO die.
AMD said that with this layout, it also had to be mindful of how the processor sits in the system, as well as cooling and memory layout. In addition, with faster memory support and the tighter tolerances of PCIe 4.0, all of this had to be considered to provide the best signal paths without interference from other routing.
AMD Zen 2 microarchitecture overview: quick analysis
At AMD's Tech Day, Fellow and chief architect Mike Clark took us through the changes. Mike is a great engineer to talk to, though it always amuses me that engineers discussing the latest product to come to market are already working one, two, or three generations ahead (at any company, not just AMD). Mike admitted he had to spend some time casting his mind back - through several generations of product changes - to recall the specific differences between Zen+ and Zen 2.
An interesting element of Zen 2 is its intent. Originally, Zen 2 was meant to be simply a die shrink of Zen+, from 12nm down to 7nm, similar to the tick-tock model Intel made familiar. But based on internal analysis and the timeframe for 7nm, AMD decided to use Zen 2 as a platform for better performance, exploiting 7nm in multiple ways rather than just redesigning the same layout on a new process node. As a result of those adjustments, AMD is promoting a 15% IPC gain for Zen 2 over Zen+.
When it comes to the exact changes in the microarchitecture, what we fundamentally see is still a layout very similar to Zen. Zen 2 is a member of the Zen family, not a complete redesign or a different paradigm for processing x86 - as with other architectures' family updates, Zen 2 offers a more efficient and wider core that allows better instruction throughput.
At a high level, the core looks very similar. Highlights of the Zen 2 design include a different L2 branch predictor known as a TAGE predictor, a doubled micro-op cache, a doubled L3 cache, increased integer resources, increased load/store resources, and support for single-operation AVX-256 (AVX2). AMD says there is no frequency penalty for AVX2, thanks to its energy-aware frequency platform.
AMD has also made adjustments to the cache system, most notably the L1 instruction cache, which has been halved to 32 KB but with its associativity doubled. This change was made for important reasons we will discuss on the next page. The L1 data cache and L2 caches are unchanged, though the translation lookaside buffers (TLBs) have gained additional support. AMD also says it has added deeper virtualization support with respect to security, helping guard against the pipeline's subsequent exploits. And, as mentioned earlier in this article, there are security enhancements.
For quick analysis, it is easy to argue that doubling the micro-op cache will, in many cases, bring a sizeable IPC improvement, and combining that with the extra load/store resources helps push more instructions through. The doubled L3 cache helps in specific workloads, as does the single-op AVX2 support, and the improved branch predictor will also show up as raw performance gains. All in all, on paper, AMD's claimed 15% IPC improvement seems a very reasonable number.
In the next few pages, we'll delve into the changes in microarchitecture.
Fetch/prefetch
We start at the front end of the processor, with the prefetchers.
The headline improvement AMD is promoting here is its use of a TAGE predictor, although it is only used for non-L1 fetches. That may not sound like much: AMD still uses a hashed-perceptron prefetch engine for L1 fetches, fetching as much as it can, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This matters more for L2 prefetches and beyond, while the L1 prefetcher prefers to stay short for power reasons.
Also at the front end, larger BTBs help track instruction branches and cache requests. The L1 BTB has doubled in size from 256 entries to 512, and the L2 BTB has grown from 4K to 7K entries. The L0 BTB stays at 16 entries, but the indirect target array now supports up to 1K entries. Overall, AMD says these changes reduce the mispredict rate by 30%, which saves power.
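The value of good branch prediction is easy to demonstrate in miniature. This classic sketch (not AMD-specific) runs the same data-dependent branch over random and then sorted data; once the branch becomes predictable, the loop runs dramatically faster. Compile with a low optimization level (e.g. -O1), as aggressive optimizers may replace the branch with a conditional move.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum elements >= 128; the if() is the branch under test.
static long long timed_sum(const std::vector<int>& v, long long* sum_out) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int x : v)
        if (x >= 128) sum += x;
    auto t1 = std::chrono::steady_clock::now();
    *sum_out = sum;
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    std::vector<int> v(1 << 22);
    std::mt19937 rng(42);
    for (int& x : v) x = rng() % 256;   // random: branch is unpredictable

    long long sum = 0;
    std::printf("random: %lld us\n", timed_sum(v, &sum));
    std::sort(v.begin(), v.end());      // sorted: branch becomes predictable
    std::printf("sorted: %lld us (sum=%lld)\n", timed_sum(v, &sum), sum);
    return 0;
}
```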
The other major change is the L1 instruction cache. We note that it has actually gotten smaller for Zen 2: 32 KB rather than 64 KB, but the associativity has doubled from 4-way to 8-way. Given how caches work, these two effects do not simply cancel out, but the 32 KB L1-I cache should be more power efficient and see higher utilization. The L1-I cache was not reduced in isolation - one benefit of the smaller I-cache is that it allowed AMD to double the size of the micro-op cache. These two structures sit next to each other inside the core, so even at 7nm we have an example of space constraints forcing trade-offs between core structures. AMD says this configuration, a smaller L1 with a larger micro-op cache, performed better in more of the test scenarios it ran.
Decode
For the decode stage, the main improvement is the micro-op cache. By doubling it from 2K entries to 4K, it holds many more decoded operations than before, which means it should see a lot of reuse. To capitalize on that, AMD raised the dispatch rate from the micro-op cache into the buffers to 8 fused instructions. Assuming AMD can bypass its decoders often, this should be a very efficient block.
That 4K entry count is even more impressive when compared with the competition. In Intel's Skylake family, the micro-op cache is only 1.5K entries; Intel increased it by 50% to 2.25K for Ice Lake, a core coming to mobile platforms later this year and perhaps servers next year. AMD's Zen 2 core, by contrast, will cover everything from consumer to enterprise. We can also compare it against the 1.5K-entry micro-op cache of Arm's A77 CPU core - though that is the first micro-op cache Arm has designed for one of its cores.
The decoders in Zen 2 stay the same: we still get four complex decoders (versus Intel's one complex decoder plus four simple decoders), with decoded instructions cached in the micro-op cache as well as dispatched into the micro-op queue.
AMD also says it has improved its micro-op fusion algorithm, though it did not go into detail on how this affects performance. Current micro-op fusion conversion is already pretty good, so it will be interesting to see what AMD has done here. Compared with Zen and Zen+, and thanks to the AVX2 support, the decoders no longer need to crack an AVX2 instruction into two micro-ops: AVX2 is now a single micro-op through the pipeline.
Beyond the decoders, the micro-op queue and dispatch can feed six micro-ops per cycle into the schedulers. This is slightly imbalanced, however, as AMD has independent integer and floating-point schedulers: the integer scheduler can accept six micro-ops per cycle, while the floating-point scheduler can only accept four. Dispatch can send micro-ops to both at the same time.
Floating point
The key highlight on the floating-point side is full AVX2 support. AMD has increased the width of its execution units from 128-bit to 256-bit, allowing single-op AVX2 calculations rather than cracking each computation into two micro-ops over two cycles. This is backed by 256-bit loads and stores, so the FMA units can be fed continuously. AMD notes that thanks to its energy-aware scheduling, there is no predefined frequency drop when using AVX2 instructions (frequency may still come down depending on temperature and voltage requirements, but that happens automatically regardless of which instructions are in use).
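As a small illustration of what "single-op AVX2" executes, here is a minimal FMA example using standard intrinsics; on Zen/Zen+ each 256-bit operation was cracked into two 128-bit micro-ops, while Zen 2 handles it as one. Compile with -mavx2 -mfma (FMA is formally a separate extension, but present on all of these parts).

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    // One 256-bit fused multiply-add: r[i] = a[i] * b[i] + c[i] for 8 floats.
    __m256 r = _mm256_fmadd_ps(a, b, c);

    float out[8];
    _mm256_storeu_ps(out, r);
    std::printf("%f\n", out[0]);  // prints 7.000000
    return 0;
}
```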
In the floating-point unit, the queues accept up to four micro-ops per cycle from dispatch, feeding a 160-entry physical register file. These route into four execution units, which the load and store mechanism can supply with 256 bits of data.
Beyond the doubled width, other adjustments were made to the FMA units. AMD said it has improved memory allocation, repeated physics calculations, and raw performance in some audio-processing workloads.
Another key update is a reduction in FP multiply latency from 4 cycles to 3, which is a fairly significant improvement. AMD said it is keeping many of the details confidential, as it wants to present them at Hot Chips in August. We will run a full instruction analysis on July 7.
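Multiply latency is measurable from user space with a dependency chain, since each multiply must wait for the previous result. A rough sketch of the idea (our methodology, not AMD's; convert ns/op to cycles using the core's actual clock):

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const long N = 100000000;      // 1e8 dependent multiplies
    double x = 1.0000001;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        x *= 0.9999999;            // each iteration depends on the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // cycles per multiply ~= (ns / N) * frequency_in_GHz
    std::printf("x=%g  %.2f ns per dependent multiply\n", x, ns / N);
    return 0;
}
```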
Integer unit, load and store
The integer unit schedulers can accept up to six micro-ops per cycle, feeding a 224-entry reorder buffer (up from 192). The integer unit technically has seven execution ports, composed of four ALUs (arithmetic logic units) and three AGUs (address generation units).
The schedulers comprise four 16-entry ALU queues and one 28-entry AGU queue, though the AGU queue can feed three micro-ops per cycle into the register file. The AGU queue's size was increased based on AMD's simulations of instruction distributions in common software. These queues feed a 180-entry general-purpose register file (up from 168), but also track specific ALU operations to prevent potential stalls.
The three AGUs feed the load/store unit, which can support two 256-bit loads and one 256-bit store per cycle. As the diagram shows, the three AGUs are not identical: AGU2 can only manage stores, while AGU0 and AGU1 can handle both loads and stores.
The store queue has grown from 44 to 48 entries, and the TLBs for the data cache have also grown. The key metric here, though, is the load/store bandwidth, as the core now supports 32 bytes per clock, up from 16.
Cache and Infinity Fabric
The biggest change to the cache system is the L1 instruction cache, reduced from 64 KB to 32 KB but with associativity increased from 4-way to 8-way. The change allowed AMD to double the micro-op cache from 2K entries to 4K, which AMD believes is a better balance for the direction modern workloads are taking.
The L1-D cache remains a 32 KB 8-way design, and the L2 remains a 512 KB 8-way design. The L3 cache is a non-inclusive cache (the L2 is inclusive), and its size has now doubled to 16 MB per core complex, up from 8 MB. AMD manages the L3 as a 16 MB block shared within each CCX, rather than enabling access to the whole of the L3 from any core.
With the larger L3 comes a slight increase in latency. L1 is still 4 cycles and L2 is still 12 cycles, but L3 has risen from 35 cycles to 40 (a characteristic of larger caches, which tend to carry slightly longer latencies; an interesting trade-off). AMD also stated that it has increased the size of the queues handling L1 and L2 misses, though it has not specified their new sizes.
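Cache latencies like these can be approximated with a pointer-chase microbenchmark, where every load depends on the previous one. The sketch below (our own, sized at ~16 MB to probe out toward the new L3 capacity) uses Sattolo's algorithm so the chase visits every element in a single cycle:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = (16u << 20) / sizeof(size_t);  // ~16 MB working set
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm: a random permutation with one full cycle,
    // so i = next[i] eventually visits every slot.
    std::mt19937_64 rng(7);
    for (size_t k = n - 1; k > 0; --k)
        std::swap(next[k], next[rng() % k]);

    const long hops = 20000000;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long h = 0; h < hops; ++h)
        i = next[i];                                // serialized, dependent loads
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("end=%zu  %.2f ns per load\n", i, ns / hops);
    return 0;
}
```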
Infinity Fabric
With Zen 2 we also move to the second generation of Infinity Fabric. One of the major updates in IF2 is support for PCIe 4.0, and with it the bus width grows from 256 bits to 512 bits.
According to AMD, the overall efficiency of IF2 is 27% better, giving lower power per bit transferred. This becomes very important as the number of IF links grows in EPYC, where data moves from the chiplets to the IO die.
One feature of IF2 is that its clock has been decoupled from the main DRAM clock. In Zen and Zen+, the IF frequency was coupled to the DRAM frequency, which led to some interesting scenarios where the memory could run much faster but the limits of the IF meant both were held back by the lock-step clocks. For Zen 2, AMD has introduced ratios for IF2, supporting a normal 1:1 ratio or a 2:1 ratio that halves the IF2 clock.
The ratio switch should happen automatically around DDR4-3600 or DDR4-3800, but it does mean the IF2 clock is cut in half, which has a knock-on effect on bandwidth. It should be noted that with a high DRAM frequency but a slow IF frequency, the raw performance gains from faster memory may be limited. AMD recommends staying at the 1:1 ratio around DDR4-3600 and optimizing sub-timings at that speed instead.
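The arithmetic behind the ratios is simple enough to sketch. DDR4 transfers data twice per memory clock, so the module rating divided by two gives MEMCLK, and the fabric clock follows at 1:1, or at half that in 2:1 mode (our worked numbers, not AMD's):

```cpp
#include <cstdio>

int main() {
    const int ddr_rate = 3600;        // DDR4-3600, in MT/s
    const int memclk = ddr_rate / 2;  // double data rate -> 1800 MHz MEMCLK

    std::printf("DDR4-%d: MEMCLK = %d MHz\n", ddr_rate, memclk);
    std::printf("  1:1 ratio -> FCLK = %d MHz\n", memclk);      // 1800 MHz
    std::printf("  2:1 ratio -> FCLK = %d MHz\n", memclk / 2);  // 900 MHz
    return 0;
}
```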
Conclusion: Platform, SoC, Core
Building a product like Zen 2 takes more than just building a core. The interplay between the core, the SoC design, and the platform requires different internal teams to come together and create a level of synergy that any single effort would lack. AMD's work on its chiplet design and Zen 2 shows great promise, not only in exploiting smaller process nodes but also in opening a path for the future of computing.
The headline advantage of moving to a more advanced process node is lower power, which can be spent in two ways: doing the same work for less power, or using the same power budget to do more. We have seen this play out in core designs over time: as power budget is freed up and different units in the core become more efficient, the extra power is spread more widely to drive the core, hopefully raising the raw instruction rate. This is not an easy problem to solve, because there are many trade-offs - one example in the Zen 2 core is the reduction of the L1 I-cache, which allowed AMD to double the micro-op cache, a trade AMD expects to improve both performance and power. Finding solutions that work, at least at a high level, is the game these engineers play.
Still, Zen 2 looks a lot like Zen. It belongs to the same family, which means it looks very similar. Everything AMD has done at the platform level - enabling PCIe 4.0, getting the server processors out of a NUMA-like environment - will help AMD in the long run. How bright AMD's prospects are will depend partly on the frequencies it can drive its server parts to, but Zen 2 and Rome look set to address many of the concerns Zen's customers raised.
In short, AMD claims a 15% core performance improvement going from Zen+ to Zen 2, and given the core changes, that is certainly plausible at a high level. Performance enthusiasts will like the new 16-core Ryzen 9 3950X, and the processor looks efficient at 105W, so it will be interesting to see what happens at lower power. We also look forward to Rome bringing some very strong products to market in the coming months, especially with features like doubled FP performance and QoS. The raw multithreaded performance of 64 cores should be an interesting disruptor for the market, especially if the price is right. We will have the hardware soon, and will present our findings on July 7th when the processors are released.