Before 2000, the R100 core represented the first-generation architecture of AMD (formerly ATI) graphics cards. It was built around fixed-function units, and its hardware 3D geometry transform and lighting now look very primitive.
The R200 to R500 cores of 2001-2007 make up the second-generation architecture. Simple VS (vertex shader) and PS (pixel shader) units were designed separately, and the ratio between them varied from core to core. The whole rendering pipeline was like a single-lane one-way street.
The third-generation TeraScale architecture of 2008-2011 (representative core: R600) was a leap forward. VS and PS were merged into unified shaders, which are what we usually call stream processors, with support for VLIW (very long instruction word). It was followed by the GCN architecture of 2011-2019 (representative core: Southern Islands), which kept unified shaders and added independent scalar and vector units in a 1:4 ratio.
Today we have the brand-new RDNA (Radeon DNA) architecture. It still uses unified shaders, but the scalar and vector units have been brought together, and it supports SIMT (single instruction, multiple threads), ILP (instruction-level parallelism), and SIMD (single instruction, multiple data) in a way similar to a CPU. Single-thread performance and instruction execution efficiency are greatly improved.
It should be emphasized that RDNA is a redesigned architecture from the ground up, not another upgrade of GCN and not a GCN/RDNA hybrid. However, GCN instructions have been carried over to maintain backward compatibility, so existing technologies are still supported on RDNA.
The RDNA architecture will be the cornerstone of AMD GPUs for many years to come. Next up is the second-generation RDNA 2, built on a 7nm process. According to the roadmap, it is expected to arrive early next year.
In addition to the new RDNA architecture, the Navi core brings many new features, including a 7nm process, GDDR6 memory, the PCIe 4.0 bus, the Radeon Media Engine, and the Radeon Display Engine.
The Navi 10 core integrates 10.3 billion transistors, about 18% fewer than Vega 64's 12.5 billion, while its die area is only 251 square millimetres, roughly half of Vega 64's 495 square millimetres. As a result, performance per unit area has improved by a factor of about 1.3, which puts it at roughly 2.3 times that of Vega 64.
Despite the lower transistor count and smaller die, the Navi 10 core delivers 14% more performance than Vega 64 while consuming 23% less power, for a 50% improvement in energy efficiency.
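The ratios quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures given in the text. The variable names are illustrative, and defining performance per area and performance per watt as simple ratios of the relative figures is an assumption about how the numbers were derived.

```python
# Figures quoted in the text for Navi 10 and Vega 64.
navi_transistors_bn = 10.3
vega_transistors_bn = 12.5
navi_area_mm2 = 251.0
vega_area_mm2 = 495.0
perf_ratio = 1.14   # Navi 10 performance relative to Vega 64 (+14%)
power_ratio = 0.77  # Navi 10 power relative to Vega 64 (-23%)

# Share of transistors saved relative to Vega 64.
transistor_reduction = 1 - navi_transistors_bn / vega_transistors_bn

# Navi 10 die area as a fraction of Vega 64's.
area_ratio = navi_area_mm2 / vega_area_mm2

# Assumed definitions: relative performance divided by relative cost.
perf_per_area = perf_ratio / area_ratio
perf_per_watt = perf_ratio / power_ratio

print(f"transistor reduction: {transistor_reduction:.1%}")
print(f"die area ratio:       {area_ratio:.2f}")
print(f"perf per area:        {perf_per_area:.2f}x Vega 64")
print(f"perf per watt:        {perf_per_watt:.2f}x Vega 64")
```

Under these assumptions, the transistor saving comes out just under 18%, performance per area at a little over twice that of Vega 64, and performance per watt at close to 1.5 times, which is in line with the rounded figures in the text.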
In terms of pure architectural performance, Navi improves on Vega by up to 50% at the same power consumption and configuration. The architecture accounts for about 60% of the improvement seen in the actual product, about 25% comes from the new 7nm process, and about 15% comes from improvements in frequency and power consumption.
According to AMD, the RDNA architecture has four main design goals. In performance, it must meet the demands of modern game workloads. In energy efficiency, it optimizes power consumption and bandwidth utilization. In features, it strengthens the ecosystem. In scalability, it covers everything from mobile to desktop to cloud.
To achieve these goals, the RDNA architecture has been reworked in three main areas: the compute unit (CU), the cache hierarchy, and the graphics pipeline. We will go through them one by one, although GPU architecture is highly technical, so this is only a rough overview. We finish with AMD's plans for ray tracing.
The new compute unit design comprises 40 CUs, each with 2 scalar processors, 64 stream processors, and 4 64-bit bilinear filtering units, for totals of 80, 2,560, and 160 units respectively. With lower execution latency, stronger single-thread performance, and higher cache efficiency, overall compute energy efficiency is greatly improved over the GCN architecture, and the design can adapt to all kinds of workloads from games to compute.
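The totals above follow directly from multiplying the per-CU resources by the CU count. A minimal sketch of that bookkeeping is shown here. The `ComputeUnit` class and `chip_totals` function are hypothetical names introduced for illustration, not part of any AMD tool or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeUnit:
    """Per-CU resources as quoted in the text (class name is illustrative)."""
    scalar_processors: int = 2
    stream_processors: int = 64
    bilinear_filter_units: int = 4


def chip_totals(num_cus: int, cu: ComputeUnit = ComputeUnit()) -> dict:
    """Multiply each per-CU resource by the number of CUs on the chip."""
    return {
        "scalar_processors": num_cus * cu.scalar_processors,
        "stream_processors": num_cus * cu.stream_processors,
        "bilinear_filter_units": num_cus * cu.bilinear_filter_units,
    }


totals = chip_totals(40)  # 40 CUs, as stated in the text
for name, count in totals.items():
    print(f"{name}: {count}")
```

The same function can be called with a smaller CU count to estimate a cut-down part, assuming the per-CU layout stays the same.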
The coherent multi-level cache hierarchy brings lower latency, higher bandwidth, and lower power consumption. It comprises zero-level (L0) caches throughout the chip, 512KB of first-level (L1) cache, and 4MB of second-level (L2) cache.
The whole graphics engine has been reworked to be smoother and more efficient, including the geometry engine, 64 pixel units, and 4 asynchronous compute engines (ACE). Load distribution is more balanced, and higher frequencies and better energy efficiency can be achieved at lower power consumption.
As for the compute unit, each CU still has 64 stream processors, the same number as before. According to AMD, this number is the most balanced combination of processing resources it arrived at after repeated design iterations. At the same time, the structure of the compute unit has been thoroughly rebuilt and is completely different from the GCN era.
Under the RDNA architecture, each CU now has two scalar decode and issue units, two vector decode and issue units, and two schedulers, twice as many as before, so the instruction processing rate has doubled.
At the same time, the four SIMD16 vector units and four SIMD4 special function units have been reorganized into two SIMD32 and two SIMD8 units. For example, 64 threads can be split into two Wave32 wavefronts, which are then executed by the two SIMD32 units, so an instruction can be issued in a single clock cycle (four cycles were needed before). Utilization of the SIMD ALUs has also risen from 25% to 100%. Both Wave32 and Wave64 execution modes are supported to meet different workload requirements.
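The "four cycles before, one cycle now" claim can be illustrated with a very simplified issue model. The sketch below assumes that a wavefront needs one cycle per SIMD-width slice of its threads, and it interprets the 25% and 100% figures as the fraction of cycles in which a single wavefront can start a new instruction. Both are assumptions made for illustration, and the function names are hypothetical. Real hardware also hides latency by interleaving wavefronts, which this model ignores.

```python
import math


def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles needed to push one instruction of a wavefront through one SIMD."""
    return math.ceil(wave_size / simd_width)


def issue_rate(wave_size: int, simd_width: int) -> float:
    """Fraction of cycles in which one wavefront can start a new instruction."""
    return 1 / issue_cycles(wave_size, simd_width)


configs = {
    "GCN:  Wave64 on SIMD16": (64, 16),
    "RDNA: Wave32 on SIMD32": (32, 32),
    "RDNA: Wave64 on SIMD32": (64, 32),
}

for label, (wave, width) in configs.items():
    cycles = issue_cycles(wave, width)
    rate = issue_rate(wave, width)
    print(f"{label}: {cycles} cycle(s) per instruction, issue rate {rate:.0%}")
```

In this model, a Wave64 on a single SIMD32 takes two cycles, which is why splitting 64 threads into two Wave32 wavefronts across both SIMD32 units is needed to reach single-cycle issue.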
In addition, to improve resource scheduling and utilization, the RDNA architecture tightly couples two CUs into a Work Group Processor, doubling the number of available ALUs and registers and quadrupling the cache bandwidth.
For caching, the RDNA architecture uses a coherent multi-level hierarchy. Each dual-CU group has its own zero-level (L0) cache, which doubles the load bandwidth to the ALUs. Four new first-level (L1) caches have been added (each 16-way, 128KB), which relieves congestion at the second-level (L2) cache (16-way, 4MB) and greatly reduces overall latency and power consumption.
According to AMD, L0 cache latency is reduced by 21%, L1 and L2 cache latency by 24%, and memory latency by 7%.
In addition, delta color compression (DCC, the yellow arrows in the figure) is supported throughout the coherent multi-level cache hierarchy, which improves transfer rates. The color compression algorithm has also been improved: the display engine can read compressed color data directly, and the shaders can read and write it as well.
The graphics engine pipeline has been radically restructured. It includes four enhanced asynchronous compute engines (ACE), a more centralized, integrated processor (containing four primitive units), and 64 pixel units.
Asynchronous compute has always been a signature strength of AMD cards and the key to their better performance under the DX12 and Vulkan APIs. It has now been enhanced further and can control other modules more precisely in real time.
Interestingly, the RDNA GPU architecture also borrows some advanced ideas from the Zen CPU architecture, especially in clock gating, which is highly efficient. It also reduces the number of logic levels, which is required to reach higher frequencies.
The Radeon Display Engine has also taken a big step forward. It supports FreeSync 2 HDR and HDR over HDMI 2.0 and DisplayPort 1.4, is optimized for high-resolution displays, and can output 4K/240Hz or 8K/60Hz over a single cable. VR headset output is optimized as well.
The Radeon Multimedia Engine greatly improves video encoding and decoding. A new H.265 HDR/WCG encoder has been added. For H.264, it supports 1080p600, 4K150, and 8K30 decoding and 1080p360 and 4K90 encoding. For H.265, it supports 1080p360, 4K90, and 8K24 decoding and 1080p360 and 4K60 encoding. For VP9, it supports 4K90 and 8K24 decoding. Overall encoding speed is 40% faster.
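One way to compare the modes listed above is to convert each into pixels per second. The sketch below does this for a few of them. It assumes the usual frame sizes of 1920x1080, 3840x2160, and 7680x4320 for 1080p, 4K, and 8K, which the text does not state, and the `pixel_rate` helper is a hypothetical name.

```python
# Assumed frame sizes; the article only gives the shorthand names.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}


def pixel_rate(resolution: str, fps: int) -> int:
    """Pixels per second for a given resolution name and frame rate."""
    width, height = RESOLUTIONS[resolution]
    return width * height * fps


# A subset of the modes quoted in the text: (codec, operation, resolution, fps).
modes = [
    ("H.264", "decode", "1080p", 600),
    ("H.264", "decode", "4K", 150),
    ("H.264", "decode", "8K", 30),
    ("H.265", "decode", "1080p", 360),
    ("H.265", "decode", "4K", 90),
    ("H.265", "decode", "8K", 24),
]

for codec, operation, resolution, fps in modes:
    rate = pixel_rate(resolution, fps)
    print(f"{codec} {operation} {resolution}{fps}: {rate / 1e6:.0f} Mpixel/s")
```

Under these assumed frame sizes, 1080p360 and 4K90 work out to the same pixel rate, as do 1080p600 and 4K150, which suggests that the listed modes are different ways of spending a similar throughput budget.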
As for ray tracing, neither the GCN nor the RDNA architecture has dedicated hardware units to accelerate it. In fact, though, AMD ProRender and Radeon Rays have long supported ray tracing, aimed at content rendering and game development respectively.
In the next-generation RDNA architecture, AMD will use hardware units to support real-time rendering of specific ray tracing effects in games. Even in the more distant future, AMD will not hand all ray tracing to local hardware, because that would be very inefficient. Instead, it will use cloud computing to ray trace the whole scene, preserving image quality without putting too much pressure on local hardware.