AnandTech reports that Qualcomm's Cloud AI 100 inference accelerator platform, announced last year, has entered production and is sampling to customers, with commercial delivery expected in the first half of 2021.
With the chip now sampling, the Qualcomm Cloud AI 100 has finally moved from the lab to reality, and the company has disclosed many details about its architecture, performance, and power targets.
Qualcomm will offer three package form factors for commercial deployment: a mature PCIe 4.0 x8 card (400 TOPS at a 75 W TDP), plus DM.2 and DM.2e modules (25 W and 15 W TDP, respectively).
The DM.2 form factor resembles two adjacent M.2 connectors and is popular in the enterprise market; DM.2e is smaller still, with a lower package power envelope.
Architecturally, the design draws on Qualcomm's extensive experience with the neural processing units (NPUs) deployed in its Snapdragon mobile SoCs, but it remains a distinct architecture optimized entirely for enterprise workloads.
The biggest advantage of a dedicated AI design over today's general-purpose compute hardware (CPU/GPU/FPGA) is the efficiency it can achieve.
Indeed, Qualcomm claims the Cloud AI 100 delivers a significant leap in performance per watt over its competitors, and presents a relatively fair comparison in a separate chart.
Interestingly, in its 75 W PCIe form factor it can even beat NVIDIA's 250 W A100 accelerator, while delivering twice the performance of Intel's (Habana) Goya accelerator at 25% lower power consumption.
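The claim against Goya translates directly into a performance-per-watt ratio; a quick back-of-the-envelope check using only the relative figures from the claim (no absolute throughput numbers are assumed):

```python
# Relative comparison implied by the claim vs. the Goya accelerator:
# 2x the throughput at 0.75x the power consumption.
relative_performance = 2.0
relative_power = 0.75

# The perf-per-watt advantage is simply the ratio of the two.
perf_per_watt_gain = relative_performance / relative_power
print(f"{perf_per_watt_gain:.2f}x perf/W")  # prints "2.67x perf/W"
```

In other words, the stated figures imply roughly a 2.7x efficiency lead over Goya, which is why the numbers drew skepticism.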
These performance figures strike many as hard to believe, but the Cloud AI 100's specifications suggest things are not so simple: the chip packs 16 AI cores and delivers 400 TOPS of INT8 inference throughput.
This is backed by four 64-bit LPDDR4X-4200 (2100 MHz) memory controllers; each controller manages four 16-bit channels, for a total system bandwidth of 134 GB/s.
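The quoted bandwidth figure follows directly from the channel arithmetic; a quick sanity check:

```python
# Aggregate bandwidth of the Cloud AI 100 memory subsystem:
# 4 controllers x 4 channels each = 16 channels, each 16 bits (2 bytes) wide,
# with LPDDR4X-4200 transferring 4200 MT/s per pin.
controllers = 4
channels_per_controller = 4
bytes_per_channel = 16 // 8       # 16-bit channel width
transfers_per_second = 4200e6     # 4200 MT/s

bandwidth = (controllers * channels_per_controller
             * bytes_per_channel * transfers_per_second)
print(f"{bandwidth / 1e9:.1f} GB/s")  # prints "134.4 GB/s"
```

The result matches the quoted 134 GB/s system bandwidth.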
Anyone familiar with current AI accelerator designs will recognize that this is far behind Intel's Goya and NVIDIA's A100; the latter, with its high-bandwidth memory (HBM2), reaches 1 to 1.6 TB/s.
To compensate, Qualcomm equipped the Cloud AI 100 with 144 MB of on-chip SRAM cache, keeping as much memory traffic on-die as possible.
Qualcomm acknowledges that performance will suffer when a workload's memory footprint exceeds the on-chip SRAM, but for its target customers this balance is a deliberate design choice.
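Whether a model's weights fit within that 144 MB SRAM is what determines which regime a workload falls into. A minimal illustrative check, assuming INT8 quantization (one byte per parameter); the model sizes used here are well-known public figures, not Qualcomm-published benchmarks:

```python
# Rough check of whether a model's weights fit in the 144 MB on-chip SRAM
# when quantized to INT8 (1 byte per parameter). Activations and buffers
# are ignored for simplicity, so this is an optimistic bound.
SRAM_BYTES = 144 * 1024 * 1024

def fits_on_chip(num_params: int, bytes_per_param: int = 1) -> bool:
    """True if the quantized weights fit entirely in on-chip SRAM."""
    return num_params * bytes_per_param <= SRAM_BYTES

# A ResNet-50-class model (~25.6M parameters) fits comfortably at INT8...
print(fits_on_chip(25_600_000))    # prints "True"
# ...while a 340M-parameter model would spill into LPDDR4X.
print(fits_on_chip(340_000_000))   # prints "False"
```

This illustrates why the design is "intentional": typical edge-inference vision models stay on-die, while very large models fall back to the slower 134 GB/s external memory.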
Looking ahead, the company also anticipates larger designs and scale-out across multiple Cloud AI 100 accelerators. Asked how it achieves the 15 W to 75 W dynamic power range, Qualcomm said it tunes the frequency/voltage curve and modulates the number of active AI cores.
The full 400 TOPS, 75 W configuration presumably runs the silicon at peak frequency, while the 15 W TDP version likely runs at lower frequencies, and the 7 nm process node helps push power consumption down further.
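The two levers Qualcomm names, the voltage/frequency curve and the active core count, both appear in the standard dynamic-power relation P ≈ N·k·V²·f. A toy model of that relation; the voltage, frequency, and scaling constant below are illustrative assumptions, not Qualcomm specifications:

```python
# Toy dynamic-power model: P ~ cores * k * V^2 * f, where k lumps together
# switched capacitance and activity factor. All numeric values here are
# illustrative assumptions chosen to land near the quoted TDP range;
# they are not published Qualcomm figures.
def dynamic_power(cores: int, voltage: float, freq_ghz: float,
                  k: float = 3.6) -> float:
    """Approximate dynamic power in watts."""
    return cores * k * voltage ** 2 * freq_ghz

# 75 W-class config: all 16 AI cores at a high voltage/frequency point.
full = dynamic_power(cores=16, voltage=0.9, freq_ghz=1.6)
# Low-power config: fewer active cores at a lower point on the V/f curve.
low = dynamic_power(cores=8, voltage=0.7, freq_ghz=0.8)
print(f"{full:.1f} W vs {low:.1f} W")  # prints "74.6 W vs 11.3 W"
```

Because voltage enters quadratically, walking down the V/f curve while gating cores shrinks power far faster than throughput, which is how one die can plausibly span a 5x TDP range.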
On the precision front, the Cloud AI 100 architecture supports INT8, INT16, FP16, and FP32, providing ample flexibility. Qualcomm also supplies a set of SDKs with support for industry-standard exchange formats and frameworks.
Qualcomm is currently sampling the Cloud AI 100 inference accelerator to customers, targeting edge inference workloads in the industrial and commercial domains.
To drive the ecosystem and support software development, the company also launched a new Cloud AI 100 edge development kit: a small compute device integrating the accelerator, a Snapdragon 865 SoC, and a Snapdragon X55 5G modem for cellular connectivity.