
AMD Zen 3 architecture in-depth analysis: uncovering the secret behind the 39% performance surge

via: 驱动之家     time: 2020/11/5 23:14:52

AMD's Ryzen 5000 series, based on the new Zen 3 architecture, has finally come off embargo and gone on sale. Are you satisfied with the performance of the Ryzen 9 5950X and Ryzen 9 5900X? Did you shout "YES" while splurging on one?

Next, Fast Technology will publish reviews of the Ryzen 7 5800X and Ryzen 5 5600X. Stay tuned.

This time, single-core and gaming performance, Ryzen's only remaining weakness, is no longer a shortcoming: Ryzen has overtaken Intel in one stroke, and it has done so on an unchanged 7nm manufacturing process. The newly designed Zen 3 architecture plays the key role here, and it is the biggest overhaul since Zen was born.


Today, let's talk about what is new in the Zen 3 architecture.

Of course, processor architecture design is a deep subject, and we cannot cover it with real depth or expertise. Instead, let's go through some basic, easy-to-understand points to see how such a remarkable performance leap came about.


First of all, every design needs goals, and a processor architecture is no exception. Zen 3 has three:

The first is to improve single-threaded performance, or IPC (instructions per clock) in professional terms. Previous generations chased core counts, and it was time to raise single-core performance to a sufficient level. Otherwise Ryzen would keep limping along on one leg and lack long-term competitiveness.

The second is to keep the 8-core CCD module while unifying the cores and cache, which improves communication efficiency and reduces latency.

The third is to keep improving energy efficiency: power consumption must not get out of control as performance rises.


To that end, every module of the Zen 3 architecture has been reworked. The front end, prefetch, decode, execution, integer, floating point, load, store, and cache are all new.

First, Zen 3 has a state-of-the-art branch predictor. After it, instructions are queued and dispatched through two paths: one is the 8-way associative 32KB L1 instruction cache feeding the x86 decoder, and the other is an op cache holding 4K instructions.

The x86 decoder can process at most four instructions per clock cycle, but instructions that have been seen before can be placed in the op cache, which can deliver eight per cycle. Combining the two greatly improves instruction dispatch efficiency, a clear step up from Zen 2.
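To get a feel for what the two paths mean for throughput, here is a deliberately simplified sketch. It uses only the two widths quoted above (four per cycle from the decoder, eight per cycle from the op cache). The hit rate is a hypothetical input, and the model assumes every cycle is served entirely by one path or the other, ignoring switch penalties and queue effects in real hardware.

```python
# Toy front-end model. The widths come from the article; everything else
# (the hit-rate input, the "one path per cycle" assumption) is illustrative.
DECODER_WIDTH = 4    # instructions per cycle via the x86 decoder
OP_CACHE_WIDTH = 8   # instructions per cycle via the op cache


def avg_delivery_rate(op_cache_hit_rate: float) -> float:
    """Average instructions delivered per cycle.

    op_cache_hit_rate is the fraction of cycles served by the op cache
    (a hypothetical number, not a measured Zen 3 figure).
    """
    if not 0.0 <= op_cache_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return (op_cache_hit_rate * OP_CACHE_WIDTH
            + (1.0 - op_cache_hit_rate) * DECODER_WIDTH)


for rate in (0.0, 0.5, 1.0):
    print(f"op cache serves {rate:.0%} of cycles: "
          f"{avg_delivery_rate(rate):.1f} instructions/cycle")
```

The more often hot code is found in the op cache, the closer the front end gets to the eight-wide figure.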

After dispatch comes the execution engine, which is split into integer and floating-point parts. Each can be dispatched six instructions per clock cycle.

On the integer side there are still four integer units, but the work is spread more widely: separate branch and store-data units have been added to improve throughput, and three addresses can be generated per clock cycle.

The floating-point side is split into six pipelines to further improve throughput and efficiency.

On the memory side, each clock cycle can execute three loads, or one load plus two stores, which again improves throughput and handles different workloads more flexibly.
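The per-cycle limits on memory operations can be written down as a small rule. This sketch takes the two combinations named above (three loads, or one load plus two stores) and assumes they follow from three limits: at most three memory operations in total, at most three loads, and at most two stores. That combined reading is an assumption for illustration, not a statement from AMD.

```python
# Assumed per-cycle limits, inferred from the two combinations in the article.
MAX_MEM_OPS = 3   # total memory operations per cycle (assumption)
MAX_LOADS = 3
MAX_STORES = 2


def fits_in_one_cycle(loads: int, stores: int) -> bool:
    """Return True if this mix of loads and stores can issue in one cycle."""
    if loads < 0 or stores < 0:
        raise ValueError("counts must be non-negative")
    return (loads <= MAX_LOADS
            and stores <= MAX_STORES
            and loads + stores <= MAX_MEM_OPS)


# Enumerate every mix the assumed limits would allow.
valid_mixes = [(ld, st) for ld in range(4) for st in range(4)
               if fits_in_one_cycle(ld, st)]
print(valid_mixes)
```

Under these assumed limits a mix such as two loads plus one store would also fit, which is one way to read "more flexibly".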


Talking about Zen 3 in isolation may not mean much, so let's compare it with Zen 2. There are too many changes to list, so we will pick out the core ones.

Front end: mainly a doubled L1 BTB capacity, higher branch predictor bandwidth, faster recovery from mispredictions, faster op cache fetch, and finer-grained op cache pipeline switching.

Execution engine: mainly independent branch and store-data units, a larger integer window, lower latency for certain integer and floating-point instructions, 6-wide pick and dispatch, wider floating-point dispatch, and a faster floating-point FMAC (multiply-accumulate unit).

Load/store: mainly higher load bandwidth (from 2 to 3), higher store bandwidth (from 1 to 2), more flexible load/store operations, and better memory dependency detection.


The above shows how some key core and cache figures change across the Zen, Zen 2, and Zen 3 architectures. At first glance Zen 3 seems to change less than Zen 2 did, but for one thing these figures cannot fully reflect the deeper changes, and for another Zen 3 makes bigger breakthroughs on key figures. For example, the issue width jumps from 10/11 to 16, a far from trivial gain in execution efficiency.


Thanks to these improvements, the IPC of the Zen 3 architecture rises by as much as 19%, with contributions from the front end, load/store, execution engine, cache prefetching, micro-op cache, branch prediction, and more.

So you may be wondering: where does the 19% figure come from?


The method is simple: Zen 3 and Zen 2 are both fixed at 8 cores and 4GHz, performance is compared across different applications, and the results are then combined.
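Why fix the core count and frequency? Performance is roughly IPC times frequency times cores, so with the last two held equal, any score difference reflects IPC. The sketch below shows one way such a comparison could be combined into a single number. The geometric mean is an assumption here (the article does not say how the results are averaged), and the application names and scores are made up for illustration.

```python
from statistics import geometric_mean


def ipc_uplift(old_scores: dict, new_scores: dict) -> float:
    """Geometric-mean speedup of new over old, as a fraction (0.19 means +19%).

    Both dicts map application name -> score measured at the same core
    count and the same clock speed, so the ratio isolates IPC.
    """
    if old_scores.keys() != new_scores.keys():
        raise ValueError("both runs must cover the same applications")
    ratios = [new_scores[name] / old_scores[name] for name in old_scores]
    return geometric_mean(ratios) - 1.0


# Hypothetical scores, NOT real measurements.
zen2 = {"app_a": 100.0, "app_b": 200.0, "app_c": 50.0}
zen3 = {"app_a": 110.0, "app_b": 250.0, "app_c": 60.0}

print(f"combined uplift: {ipc_uplift(zen2, zen3):+.1%}")
```

A geometric mean is a common choice for averaging ratios because no single application with a large score dominates the result.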

Of course, different workloads gain by different amounts. The biggest change comes in online games, previously a weak spot for Ryzen: PUBG, LoL, and CS:GO gain 35-39%. Add the higher clock speeds, and the Ryzen 5000 series finally shows an earth-shaking change in online games.

In fact, most of the workloads that gain more than the 19% average are games. This is why the Ryzen 5000 series has taken Intel's last stronghold, gaming performance, and can claim to be the best gaming processor.

Some benchmarks, and some games that are hard to optimize deeply, improve by less, especially in single-threaded performance: POV-Ray 9%, CPU-Z 12%, Cinebench R20 13%, and Cinebench R15 18%. Even so, these are very clear improvements in real performance, far more than the change of about 5% at most seen in certain generations of Core processors.

If the overview above is not enough and you want more detail, let's break the architecture down into modules and look at what changed in each.


In the front end, Zen 3 has a faster branch predictor that can process more instructions per clock cycle and switches between the op cache and the instruction cache more quickly, so it copes with different workloads more flexibly and efficiently.

Of course, branch prediction can never be 100% accurate. It is a matter of probability, and sometimes it is wrong. What matters then is how quickly the core recovers. Zen 3 greatly reduces this latency and gets back on track quickly. Prediction accuracy is also improved.
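The two levers mentioned here, better accuracy and faster recovery, both feed into the same textbook estimate of cycles per instruction (CPI). The sketch below uses that standard formula. All of the numbers are hypothetical and chosen only to show the shape of the effect; none of them are Zen 2 or Zen 3 measurements.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """Textbook estimate: base CPI plus the average misprediction cost.

    branch_fraction  - share of instructions that are branches
    mispredict_rate  - share of branches predicted wrongly
    penalty_cycles   - cycles lost recovering from one misprediction
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical baseline and two hypothetical improvements.
baseline = effective_cpi(0.5, 0.2, 0.05, 20)
better_accuracy = effective_cpi(0.5, 0.2, 0.04, 20)   # fewer mispredictions
faster_recovery = effective_cpi(0.5, 0.2, 0.05, 15)   # cheaper mispredictions

print(f"baseline CPI:        {baseline:.3f}")
print(f"better accuracy CPI: {better_accuracy:.3f}")
print(f"faster recovery CPI: {faster_recovery:.3f}")
```

Either improvement lowers the CPI, and lower CPI is the same thing as higher IPC.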


In the fetch and decode part, we can see more details of the branch predictor, especially where the accuracy gains come from: a redesigned branch target buffer (BTB), a doubled L1 BTB capacity, a reorganized L2 BTB, a larger indirect target array (ITA), a shorter pipeline, lower misprediction latency, and so on.

At the same time, the 32KB 8-way associative L1 instruction cache is optimized for better prefetching and utilization.

The op cache is finer-grained, queue fetch is more efficient, and switching between the op cache and instruction cache pipelines is more flexible.


In the execution engine, floating-point and integer issue widths are increased, FMAC latency is reduced, and the execution window is enlarged.



In integer execution, the integer scheduler grows from 92 to 96 entries (an increase of 4).

Issue per clock cycle also increases from 7 to 10: 4 ALUs (arithmetic logic units), 3 AGUs (address generation units), 1 branch unit, and 2 store-data units.
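The unit counts add up as a quick check shows. The second half of the sketch adds the six floating-point pipelines mentioned earlier to reach the overall figure of 16. Reading that 16 as "integer plus floating point" is an inference from the numbers in this article, not something the article states outright.

```python
# Integer-side units per clock cycle, as listed above.
ZEN3_INT_ISSUE = {"ALU": 4, "AGU": 3, "branch": 1, "store_data": 2}

# Six floating-point pipelines, per the article.
ZEN3_FP_PIPES = 6

int_issue = sum(ZEN3_INT_ISSUE.values())
total_issue = int_issue + ZEN3_FP_PIPES  # assumed meaning of the "16" figure

print(f"integer issue per cycle: {int_issue}")
print(f"integer + floating point: {total_issue}")
```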

In addition, the number of x86 instructions held in the reorder buffer (ROB) increases from 224 to 256.

Zen 3 still has four integer units, but the ALUs and AGUs now share schedulers, which balances better across different workloads.


In floating-point execution, widening the floating-point unit to six pipelines means six micro-ops can be dispatched at once. At the same time, the store and floating-point-to-integer register file transfers that the MUL (multiply) and ADD units used to handle have moved to independent pipelines, so MUL and ADD are free to process real multiply and add instructions when needed.

In addition, there is a faster 4-cycle FMAC, separate F2I and store units, and a larger scheduler.



In load/store, the store queue grows from 48 to 64 entries. The 32KB L1 data cache now has the bandwidth for 3 loads per clock cycle, or 2 floating-point loads and 1 store. In addition, the prefetch algorithms are improved to make better use of the level 3 cache, whose directly accessible capacity has doubled.

And then we come back to the cache.


This CCD core and cache layout should look very familiar. Each CCD in both Zen 2 and Zen 3 has 8 physical cores and 32MB of level 3 cache. In the former, however, the CCD is split into two isolated halves, with each group of four cores sharing one 16MB half of the L3 cache. In the latter it is a single unit: all eight cores share all 32MB of L3 cache, which doubles the amount of L3 cache directly available to each core.

On Zen 2, if the instructions or data a core needs sit in the other half of the L3 cache, which it does not share directly, the request has to take a detour and latency rises sharply. Now access is direct: even when the data the first core needs is held by the eighth core, it can be fetched quickly within the same CCX.
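The difference between the two layouts can be captured in a few lines. This is a sketch of the topology only: the class and its fields are hypothetical names, and it assumes cores are numbered 0 to 7 with consecutive cores grouped into the same CCX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CcdLayout:
    """One CCD: total cores, total L3, and how many CCXs it is split into."""
    name: str
    cores: int
    l3_mb: int
    ccx_count: int

    @property
    def cores_per_ccx(self) -> int:
        return self.cores // self.ccx_count

    @property
    def l3_per_ccx_mb(self) -> int:
        """L3 capacity a core can reach directly, without leaving its CCX."""
        return self.l3_mb // self.ccx_count

    def same_ccx(self, core_a: int, core_b: int) -> bool:
        """True if both cores share one L3 (assumes contiguous numbering)."""
        return core_a // self.cores_per_ccx == core_b // self.cores_per_ccx


zen2_ccd = CcdLayout("Zen 2", cores=8, l3_mb=32, ccx_count=2)
zen3_ccd = CcdLayout("Zen 3", cores=8, l3_mb=32, ccx_count=1)

for ccd in (zen2_ccd, zen3_ccd):
    print(f"{ccd.name}: {ccd.l3_per_ccx_mb}MB L3 directly reachable per core, "
          f"first and last core share L3: {ccd.same_ccx(0, 7)}")
```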


Now for the cache details. For example, the 32KB L1 instruction cache supports 32-byte fetch, the 32KB L1 data cache supports up to 3 loads and 2 stores, and the 512KB L2 cache is also faster.

With the L3 cache larger and its access unified, data evicted from the L2 cache can be kept there in full. This is the victim cache, which acts as a kind of backup: evicted lines have a high probability of being accessed again, so whichever core needs them next can get them directly from the L3 cache.
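The victim cache idea is easy to show with a toy simulation. The sketch below models a single core with a tiny L2 and an L3 that is filled only by lines evicted from the L2. The sizes, the LRU replacement policy, and the strictly exclusive behaviour are simplifying assumptions for illustration, not a description of the real Zen 3 policies.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L2 plus victim L3, both LRU, tracking cache lines by address."""

    def __init__(self, l2_lines: int, l3_lines: int):
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines
        self.l2 = OrderedDict()  # oldest entry first
        self.l3 = OrderedDict()

    def access(self, addr: int) -> str:
        """Access one line and report where it was found."""
        if addr in self.l2:
            self.l2.move_to_end(addr)      # refresh LRU position
            return "L2 hit"
        if addr in self.l3:
            del self.l3[addr]              # line moves back up to the L2
            source = "L3 hit"
        else:
            source = "memory"
        self._fill_l2(addr)
        return source

    def _fill_l2(self, addr: int) -> None:
        self.l2[addr] = True
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)   # evict oldest L2 line
            self.l3[victim] = True                    # ...into the victim L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)           # L3 full: drop oldest


cache = VictimHierarchy(l2_lines=2, l3_lines=4)
trace = [cache.access(addr) for addr in (1, 2, 3, 1, 3)]
print(trace)
```

In the trace, line 1 is pushed out of the L2 by line 3, and the next access to line 1 is served from the L3 instead of going all the way to memory.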

In addition, each core supports up to 64 outstanding misses from the L2 cache to the L3 cache, and up to 192 outstanding misses are supported from the L3 cache to memory.
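Treating the 64 and 192 figures above as limits on requests in flight, Little's law gives a rough ceiling on bandwidth: requests in flight times bytes per request, divided by latency. The sketch below applies it with 64-byte cache lines. The latencies are hypothetical placeholders, not measured Zen 3 values, and the result is only a concurrency-based upper bound.

```python
CACHE_LINE_BYTES = 64  # cache line size on x86 processors such as Zen


def max_bandwidth_gb_s(requests_in_flight: int, latency_ns: float) -> float:
    """Little's law ceiling on bandwidth, in GB/s.

    Bytes per nanosecond equals gigabytes per second (1 GB = 1e9 bytes).
    """
    if latency_ns <= 0:
        raise ValueError("latency must be positive")
    return requests_in_flight * CACHE_LINE_BYTES / latency_ns


# Hypothetical latencies, for illustration only.
print(f"L2 -> L3, 64 in flight at 10 ns: "
      f"{max_bandwidth_gb_s(64, 10.0):.1f} GB/s ceiling")
print(f"L3 -> memory, 192 in flight at 100 ns: "
      f"{max_bandwidth_gb_s(192, 100.0):.1f} GB/s ceiling")
```

The point of allowing more requests in flight is visible in the formula: at a fixed latency, the ceiling scales directly with the number of requests the hardware can track.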


The Ryzen 5000 series keeps the chiplet design, with one or two CCD dies plus an IOD (which houses the memory controller and I/O). But because each CCD now contains a single CCX instead of two independent ones, the connection and communication between CCD, IOD, and memory is more consistent and efficient.


When two CCDs are paired with an IOD, the bandwidth and the coherency system stay the same.


This again shows the benefit of the chiplet design: it can easily reach 16 cores, and the upgrade to the Zen 3 architecture requires no change to the layout or platform. Everything happens inside the package.


On security, Zen 3 adds support for Control-flow Enforcement Technology (CET), which Intel already supports. It introduces a shadow stack that contains only return addresses, is stored in system memory, and is protected by the processor's memory management unit. If malicious code exploits a vulnerability to tamper with the stack, the tampering can be detected and blocked before it does damage.
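The shadow stack mechanism can be illustrated with a small simulation. This is a conceptual sketch only: the class and exception names are hypothetical, and real hardware keeps the shadow stack in protected memory and raises a processor fault rather than a Python exception.

```python
class ControlFlowViolation(Exception):
    """Raised when a return address does not match the shadow stack."""


class ShadowStackMachine:
    """Toy machine with a normal stack and a protected shadow stack."""

    def __init__(self):
        self.stack = []    # ordinary stack: writable by (buggy) program code
        self.shadow = []   # shadow stack: return addresses only

    def call(self, return_addr: int) -> None:
        # A call pushes the return address onto both stacks.
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self) -> int:
        # A return pops both copies and compares them before jumping.
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise ControlFlowViolation(
                f"return to {addr:#x} blocked, expected {expected:#x}")
        return addr


machine = ShadowStackMachine()

machine.call(0x401000)
print(f"normal return to {machine.ret():#x}")

machine.call(0x401000)
machine.stack[-1] = 0xDEADBEEF   # simulated overflow overwrites the address
try:
    machine.ret()
except ControlFlowViolation as err:
    print(f"detected: {err}")
```

The attacker in this model can write to the ordinary stack but not to the shadow copy, so the mismatch is caught at the moment of return.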


In terms of instruction sets, Zen 3 adds MPK (memory protection keys), and the VAES and VPCLMULQDQ instructions now support AVX2.
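On Linux, one way to check whether a processor reports these features is to look at the flags line of /proc/cpuinfo. The sketch below parses that format from a string so it runs anywhere. The flag names `pku`, `vaes`, and `vpclmulqdq` are the names Linux is assumed to use for these features, and the sample text is made up for illustration.

```python
FEATURES_OF_INTEREST = ("pku", "vaes", "vpclmulqdq")  # assumed Linux flag names


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def feature_report(flags: set) -> dict:
    """Map each feature of interest to whether it appears in the flags."""
    return {name: name in flags for name in FEATURES_OF_INTEREST}


# Made-up sample in the /proc/cpuinfo format, not output from a real machine.
SAMPLE_CPUINFO = "processor\t: 0\nflags\t\t: fpu vme avx2 pku vaes vpclmulqdq\n"

print(feature_report(cpu_flags(SAMPLE_CPUINFO)))
```

On a real Linux system, the same functions can be fed the contents of /proc/cpuinfo instead of the sample string.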


Finally, energy efficiency. According to official figures, the Zen 3-based Ryzen 9 5950X and Ryzen 9 5900X deliver 2.8 times and 2.4 times the energy efficiency of the i9-10900K, and they also improve on the Zen 2-based Ryzen 9 3950X and Ryzen 9 3900XT by 12% and 26% respectively. In other words, performance is higher but power consumption has not gone up.


In short, Zen 3 has achieved its goals: a large IPC increase (19% on average), much lower latency (a unified 8-core complex with 32MB of L3 cache), much faster memory access (twice the directly accessible L3 cache), a large frequency increase (boost up to 4.9GHz), much better energy efficiency (up to 2.8 times), and a much higher game frame rate (about 26% on average at 1080p).


The next stop for AMD Zen is Zen 4, which will be paired with a more advanced 5nm process. It is currently in design, everything is proceeding on schedule, and it is expected to launch in the first half of 2022.


If reprinting, please credit the source: Fast Technology

