
Ali open-sources its self-developed speech recognition model DFSMN, with accuracy as high as 96.04%

via: 博客园     time: 2018/6/11 21:14:17     reads: 421


Recently, the Machine Intelligence Laboratory of Alibaba's DAMO Academy open-sourced its new-generation speech recognition model, DFSMN, which raises speech recognition accuracy to 96.04% (tested on LibriSpeech, the world's largest free speech recognition database).

Compared with LSTM, the most widely used model in industry, the DFSMN model trains faster and recognizes more accurately. Smart speakers or smart home devices built on the new DFSMN model can be trained 3 times faster than with the previous generation of deep learning models, and speech recognition runs 2 times faster.

Open source address: https://github.com/tramphero/kaldi

The author of this article: Zhang Shiliang

Ali open-sources speech recognition model DFSMN


Figure: Ali open-sourced its self-developed DFSMN speech recognition model on the GitHub platform.

Acoustic model of speech recognition

Speech recognition technology has always been an important part of human-machine interaction. With speech recognition, a machine can listen and understand like a human being, and then think, comprehend and give feedback.

In recent years, with the adoption of deep learning, the performance of speech recognition systems based on deep neural networks has improved greatly, and they have begun to enter practical use. Speech recognition, voice input, voice conversion, voice retrieval and speech translation technologies have been widely applied.

At present, mainstream speech recognition systems generally adopt acoustic models based on deep neural networks and hidden Markov models (Deep Neural Network-Hidden Markov Model, DNN-HMM); the model structure is shown in Figure 1. The input of the acoustic model consists of spectral features such as PLP, MFCC and FBK, extracted from the speech waveform after framing and windowing. The output of the model generally uses acoustic modeling units of different granularity, such as monophones (mono-phone), monophone states, and tied triphone states (tri-phone states). Different neural network structures can be used between input and output to map the input acoustic features to the posterior probabilities of the output modeling units; the final recognition result is then obtained by decoding with the HMM.
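The framing-and-windowing front end described above can be sketched in a few lines of NumPy. This is an illustrative toy pipeline only (the 25 ms frame, 10 ms shift and log power spectrum are common defaults, not necessarily the exact FBK/MFCC configuration used by the DFSMN system):

```python
import numpy as np

def frame_signal(wav, sr=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping, Hamming-windowed frames."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(wav) - frame_len) // shift)
    frames = np.stack([wav[i * shift: i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # window each frame

def log_power_spectrum(frames, n_fft=512):
    """Per-frame log power spectrum, the raw material for FBK/MFCC features."""
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(spec + 1e-10)             # log compression

wav = np.random.randn(16000)                # 1 second of fake 16 kHz audio
feats = log_power_spectrum(frame_signal(wav))
print(feats.shape)                          # (98, 257)
```

At a 10 ms shift, one second of audio yields roughly 100 frames, which is why the output has 98 rows here.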

The earliest network structure used was the feedforward fully-connected neural network (Feedforward Fully-connected Neural Network, FNN). An FNN implements a one-to-one mapping from a fixed input to a fixed output; its defect is that it cannot effectively exploit the inherent long-term correlations in speech signals. One improvement is the recurrent neural network (Recurrent Neural Network, RNN) based on long short-term memory units (Long Short-Term Memory, LSTM). Through recurrent feedback connections in the hidden layer, an LSTM-RNN can store historical information in the hidden-layer nodes, so that the long-term correlations of speech signals can be used effectively.


Figure 1. Block diagram of a speech recognition system based on DNN-HMM

Furthermore, by using bidirectional recurrent neural networks (Bidirectional RNN), both the history and the future of the speech signal can be exploited, which is even more beneficial for acoustic modeling of speech. Compared with feedforward fully-connected networks, speech acoustic models based on recurrent neural networks achieve significant performance improvements. However, recurrent networks are more complex than feedforward fully-connected networks and usually contain more parameters, so training and evaluating the model requires more computing resources.

In addition, acoustic models based on bidirectional recurrent networks face a large latency problem and are therefore unsuitable for real-time speech recognition tasks. Existing improved latency-controlled models include Latency-Controlled BLSTM (LCBLSTM) [1-2] and the feedforward sequential memory network (Feedforward Sequential Memory Network, FSMN) [3-5]. Last year we were the first in industry to launch a speech recognition acoustic model based on LCBLSTM. Relying on Ali's large-scale computing platform and big data, and using training and optimization methods such as multi-machine multi-GPU training and 16-bit quantization, the acoustic model achieved a relative error-rate reduction of about 17-24% over the FNN model.

The past and present of the FSMN model

1. FSMN model

FSMN is a recently proposed network structure. By adding learnable memory modules to the hidden layers of an FNN, it can effectively model the long-term correlations of speech. Compared with LCBLSTM, FSMN not only controls latency more conveniently, but also achieves better performance with fewer computing resources. However, a standard FSMN is difficult to train with very deep structures: the vanishing-gradient problem leads to poor training results, yet in many fields deeper models have proven to have stronger modeling capability. We therefore propose an improved FSMN model, called the deep FSMN (DeepFSMN, DFSMN). Combining it with low frame rate (LFR) technology, we further build an efficient real-time speech recognition acoustic model. Compared with the LCBLSTM acoustic model we put into production last year, it delivers a relative performance improvement of more than 20% and a 2-3x speedup in training and decoding, significantly reducing the computing resources our systems need in actual applications.


Figure 2. FSMN model structure and comparison with RNN

2. From FSMN to cFSMN

The first proposed FSMN model [3], whose structure is shown in Figure 2(a), is essentially a feedforward fully-connected neural network. By adding memory modules (memory blocks) to the hidden layers to model the surrounding context, it can model the long-term correlations of time-series signals. The memory module uses the tapped-delay structure shown in Figure 2(b): the hidden-layer outputs of the current moment and the previous N moments are encoded by a set of coefficients into a fixed representation. The idea behind FSMN is inspired by filter design theory in digital signal processing: any infinite impulse response (Infinite Impulse Response, IIR) filter can be approximated by a high-order finite impulse response (Finite Impulse Response, FIR) filter. From the filter point of view, the recurrent layer of the RNN model shown in Figure 2(c) can be regarded as the first-order IIR filter in Figure 2(d), while the memory module shown in Figure 2(b) can be regarded as a high-order FIR filter. FSMN can therefore model long-term correlations of the signal as effectively as an RNN; and since FIR filters are more stable than IIR filters, FSMN is simpler and more stable to train than an RNN.
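The FIR-approximates-IIR intuition can be checked numerically. The sketch below (illustrative only; the 0.7 decay and 20 taps are arbitrary choices of ours) compares a first-order IIR recurrence, which plays the role of the RNN loop, against its truncated FIR approximation, which plays the role of the FSMN memory module:

```python
import numpy as np

def iir(x, a=0.7):
    """First-order IIR filter: y[t] = a*y[t-1] + x[t], like an RNN recurrence."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        y[t] = a * y[t - 1] + x[t] if t > 0 else x[t]
    return y

def fir(x, a=0.7, N=20):
    """High-order FIR approximation: the IIR impulse response truncated at N+1
    taps, y[t] ~ sum_i a**i * x[t-i] -- the tapped-delay memory-module view."""
    taps = a ** np.arange(N + 1)
    return np.convolve(x, taps)[:len(x)]

x = np.random.default_rng(0).standard_normal(1000)
err = np.max(np.abs(iir(x) - fir(x)))
print(err < 1e-2)   # True: truncation error is bounded by ~a**(N+1)/(1-a)
```

The truncation error decays like a**(N+1), which is why a modest number of FIR taps (i.e. a modest memory order) suffices to mimic the recurrence without the instability of a feedback loop.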

Depending on the choice of the memory module's encoding coefficients, FSMN can be divided into: 1) scalar FSMN (sFSMN); 2) vector FSMN (vFSMN). As the names imply, sFSMN and vFSMN use scalars and vectors, respectively, as the encoding coefficients of the memory module. The outputs of the sFSMN and vFSMN memory modules are, respectively:

sFSMN:  h̃_t^ℓ = Σ_{i=0..N} a_i^ℓ · h_{t−i}^ℓ

vFSMN:  h̃_t^ℓ = Σ_{i=0..N} a_i^ℓ ⊙ h_{t−i}^ℓ

where h_t^ℓ is the output of hidden layer ℓ at time t and ⊙ denotes element-wise multiplication.

The FSMN above considers only the influence of historical information on the current moment; we call it a unidirectional FSMN. When we consider the influence of both historical and future information on the current moment, we can extend the unidirectional FSMN to a bidirectional FSMN. The encoding formulas of the bidirectional sFSMN and vFSMN memory modules are:

sFSMN:  h̃_t^ℓ = Σ_{i=0..N1} a_i^ℓ · h_{t−i}^ℓ + Σ_{j=1..N2} c_j^ℓ · h_{t+j}^ℓ

vFSMN:  h̃_t^ℓ = Σ_{i=0..N1} a_i^ℓ ⊙ h_{t−i}^ℓ + Σ_{j=1..N2} c_j^ℓ ⊙ h_{t+j}^ℓ

where N1 and N2 denote the look-back order and the look-ahead order, respectively. We can enhance FSMN's ability to model long-term correlations by increasing the orders, or by adding memory modules to multiple hidden layers.
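As a concrete illustration, the bidirectional vFSMN memory module described above can be translated directly into NumPy. This is a sketch (the function name, shapes and random inputs are ours), not the production Kaldi implementation:

```python
import numpy as np

def vfsmn_memory(h, A, C):
    """Bidirectional vFSMN memory module.

    h: (T, D) hidden-layer outputs; A: (N1+1, D) look-back coefficients
    (A[0] weights the current frame); C: (N2, D) look-ahead coefficients.
    Returns h_tilde with h_tilde[t] = sum_i A[i]*h[t-i] + sum_j C[j-1]*h[t+j].
    """
    T, D = h.shape
    out = np.zeros((T, D))
    for t in range(T):
        for i in range(A.shape[0]):            # history, incl. current frame
            if t - i >= 0:
                out[t] += A[i] * h[t - i]
        for j in range(1, C.shape[0] + 1):     # future (look-ahead)
            if t + j < T:
                out[t] += C[j - 1] * h[t + j]
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal((50, 8))
A = rng.standard_normal((3, 8))   # look-back order N1 = 2
C = rng.standard_normal((2, 8))   # look-ahead order N2 = 2
print(vfsmn_memory(h, A, C).shape)   # (50, 8)
```

An sFSMN is the same computation with each coefficient row collapsed to a single scalar.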


Figure 3. cFSMN block diagram

Compared with an FNN, FSMN needs to feed the memory module's output to the next hidden layer as an additional input, which introduces extra model parameters; the more nodes a hidden layer contains, the more parameters are introduced. Drawing on the idea of low-rank matrix factorization (Low-rank matrix factorization), [4] proposed an improved FSMN structure called the compact FSMN (CompactFSMN, cFSMN). Figure 3 is a block diagram of a cFSMN whose ℓ-th hidden layer contains a memory module.

In a cFSMN, a low-dimensional linear projection layer is added after each hidden layer of the network, and the memory modules are attached to these linear projection layers. Furthermore, cFSMN changes the encoding formula of the memory module: by explicitly adding the output of the current moment into the memory module's expression, only the memory module's output needs to be fed as input to the next layer. This effectively reduces the model's parameters and speeds up network training. The unidirectional and bidirectional cFSMN memory modules are expressed, respectively, as:

unidirectional:  p̃_t^ℓ = p_t^ℓ + Σ_{i=0..N} a_i^ℓ ⊙ p_{t−i}^ℓ

bidirectional:   p̃_t^ℓ = p_t^ℓ + Σ_{i=0..N1} a_i^ℓ ⊙ p_{t−i}^ℓ + Σ_{j=1..N2} c_j^ℓ ⊙ p_{t+j}^ℓ

where p_t^ℓ denotes the output of the ℓ-th linear projection layer at time t.
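A single cFSMN layer — hidden layer, low-rank linear projection, then a memory module whose output carries an explicit current-frame term — can be sketched as follows. This is an illustrative toy (the ReLU, layer sizes and helper name are our choices), not the open-sourced code:

```python
import numpy as np

def cfsmn_layer(x, W, V, A):
    """One unidirectional cFSMN layer.

    x: (T, D_in) input; W: (D_in, D_hid) hidden-layer weights (ReLU);
    V: (D_hid, D_proj) low-rank linear projection; A: (N+1, D_proj) memory
    coefficients. Returns p_tilde with p_tilde[t] = p[t] + sum_i A[i]*p[t-i],
    which is all that gets passed to the next layer.
    """
    h = np.maximum(x @ W, 0.0)      # hidden layer with ReLU
    p = h @ V                       # low-dimensional linear projection
    T, _ = p.shape
    p_tilde = p.copy()              # explicit current-frame term
    for t in range(T):
        for i in range(A.shape[0]):
            if t - i >= 0:
                p_tilde[t] += A[i] * p[t - i]
    return p_tilde

T, D_in, D_hid, D_proj = 30, 40, 64, 16
rng = np.random.default_rng(0)
out = cfsmn_layer(rng.standard_normal((T, D_in)),
                  rng.standard_normal((D_in, D_hid)) * 0.1,
                  rng.standard_normal((D_hid, D_proj)) * 0.1,
                  rng.standard_normal((5, D_proj)) * 0.1)
print(out.shape)   # (30, 16)
```

The low-rank factorization is where the parameter savings come from: the projection to D_proj replaces one large weight matrix with two much smaller ones, and only the D_proj-dimensional memory output is fed forward.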




Figure 4. Deep-FSMN (DFSMN) model block diagram

LFR-DFSMN acoustic model

1. Deep-FSMN (DFSMN) network structure

Figure 4 is the block diagram of the Deep-FSMN (DFSMN) network structure that we further propose, in which the first box on the left represents the input layer and the last box on the right represents the output layer. We add skip connections (skip connection) between the memory modules of the cFSMN (shown as red boxes), so that the output of a lower memory module is added directly into a higher memory module. During training, the gradients of the higher memory module are thus passed directly to the lower memory module, which overcomes the gradient vanishing caused by network depth and lets deep networks train stably. We also modify the expression of the memory module: borrowing the idea of dilated convolution (dilation convolution) [6], we introduce stride factors into the memory module. The concrete formula is:

p̃_t^ℓ = H(p̃_t^{ℓ−1}) + p_t^ℓ + Σ_{i=0..N1} a_i^ℓ ⊙ p_{t−s1·i}^ℓ + Σ_{j=1..N2} c_j^ℓ ⊙ p_{t+s2·j}^ℓ

where p̃_t^ℓ denotes the output of the ℓ-th memory module at time t and H(·) is the skip connection. s1 and s2 denote the encoding stride factors for historical and future moments; for example, s1 = 2 means that when encoding historical information, one value is taken every other time step. In this way, with the same order we can look further back into history, and thus model long-term correlations more effectively.

For real-time speech recognition systems, we can flexibly control the model's latency by setting the look-ahead (future) order. In the extreme case, when the future order of every memory module is set to 0, we obtain an acoustic model with no latency at all. For tasks that can tolerate some latency, we can set a small future order.
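Putting the skip connection and the stride factors together, a DFSMN memory module can be sketched as below (illustrative only; names and shapes are ours). Note how setting the look-ahead order to 0 yields the zero-latency configuration just described:

```python
import numpy as np

def dfsmn_memory(p, prev_mem, A, C, s1=2, s2=2):
    """DFSMN memory module with an identity skip connection.

    p: (T, D) projection-layer outputs; prev_mem: (T, D) previous memory
    module's output (added via the skip connection); A: (N1+1, D) look-back
    coefficients; C: (N2, D) look-ahead coefficients; s1, s2: stride factors
    for history and future.
    """
    T, D = p.shape
    out = prev_mem + p                       # skip connection + current frame
    for t in range(T):
        for i in range(A.shape[0]):          # strided history: t - s1*i
            if t - s1 * i >= 0:
                out[t] += A[i] * p[t - s1 * i]
        for j in range(1, C.shape[0] + 1):   # strided look-ahead: t + s2*j
            if t + s2 * j < T:
                out[t] += C[j - 1] * p[t + s2 * j]
    return out

T, D = 40, 16
rng = np.random.default_rng(1)
p = rng.standard_normal((T, D))
mem0 = np.zeros((T, D))                      # first layer: no lower module
A = rng.standard_normal((4, D)) * 0.1        # look-back order 3
C_zero = np.zeros((0, D))                    # look-ahead order 0: zero latency
out = dfsmn_memory(p, mem0, A, C_zero)
print(out.shape)   # (40, 16)
```

With s1 = 2 and look-back order 3, the module sees 6 frames into the past at the cost of only 4 coefficient vectors, which is the dilated-convolution idea borrowed from [6].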

Compared with the previous cFSMN, the advantage of DFSMN is that skip connections let us train very deep networks. In the original cFSMN, the low-rank matrix factorization splits each hidden layer into a two-layer structure, so a network with 4 cFSMN layers and 2 DNN layers already contains 13 layers in total; stacking more cFSMN layers would make the network even deeper and, because of the vanishing-gradient problem, unstable to train. The skip connections in our proposed DFSMN avoid gradient vanishing in deep networks and make training stable. Note that the skip connections can be added not only between adjacent layers but also between non-adjacent layers, and the skip connection itself can be a linear or a nonlinear transformation. In experiments we can train DFSMN networks with dozens of layers, achieving significant performance improvements over cFSMN.

Moving from the initial FSMN to cFSMN not only effectively reduces the model's parameters but also achieves better performance [4]. Going further, the DFSMN we propose on top of cFSMN significantly improves performance again. The following table compares acoustic models based on BLSTM, cFSMN and DFSMN on a 2000-hour English task.

(Table: word error rate comparison of BLSTM, cFSMN and DFSMN acoustic models on the 2000-hour English task.)

As the table shows, on the 2000-hour task the DFSMN model lowers the error rate by 14% relative to the BLSTM acoustic model, a significant improvement in acoustic-model performance.

2. Speech recognition acoustic model based on LFR-DFSMN


Figure 5. Structural block diagram of the LFR-DFSMN acoustic model


(Table: error rates of the LFR-LCBLSTM and LFR-DFSMN models on Product line A and Product line B; one recovered cell reads 8.04, a 21.25% relative improvement.)

Combined with LFR technology, we obtain a further 3x speedup in recognition. As the table above shows, in real industrial-scale applications the LFR-DFSMN model achieves a 20% lower error rate than the LFR-LCBLSTM model, showing better modeling ability on large-scale data.

Training acoustic models on massive data with multiple machines and GPUs

A real speech recognition service usually faces very complex speech data. The acoustic model must cover as many scenarios as possible, including all kinds of dialogue, channels, noises and even accents, which requires massive amounts of data. How to train the acoustic model on massive data and bring the service online quickly directly determines how fast the business can respond.

We use Ali's MaxCompute platform and multi-machine multi-GPU parallel training tools. With 8 machines (16 GPU cards) and 5000 hours of training data, the training speeds of the LFR-DFSMN and LFR-LCBLSTM acoustic models are as follows:

Model          Time per epoch
LFR-LCBLSTM    10.8 hours
LFR-DFSMN      3.4 hours

Compared with the baseline LCBLSTM model, DFSMN achieves a 3x training speedup per epoch. When training LFR-DFSMN on 20,000 hours of data, the model usually converges in only 3-4 epochs, so with 16 GPU cards we can train an LFR-DFSMN acoustic model on 20,000 hours of data in about 2 days.

Decoding latency, recognition speed and model size

To build a practical speech recognition system, we must not only improve recognition performance as much as possible but also consider the system's real-time behavior, so as to give users a good experience. In addition, in practical applications we must consider the cost of the service, which places requirements on the power consumption of the recognition system. A traditional FNN system needs frame-splicing, with a decoding latency of typically 5-10 frames, about 50-100 ms. The LCBLSTM system we launched last year solved the whole-sentence latency problem of BLSTM and brought the latency down to about 20 frames, roughly 200 ms. For online tasks with stricter latency requirements, it can further limit the latency to 100 ms at a small cost in recognition performance (0.2%-0.3% absolute), fully meeting the needs of various tasks. Compared with the best FNN, LCBLSTM achieves a relative performance improvement of more than 20%, but on the same CPU its recognition speed is slower (i.e. power consumption is higher), mainly because of the model's complexity.

Our latest LFR-DFSMN speeds up recognition by more than 3x using LFR technology, and the DFSMN itself further reduces model complexity by about 3x compared with LCBLSTM. The following table lists the time different models need to recognize the same test set; the shorter the time, the lower the computational power required.


Model          Time to recognize the entire test set
LCBLSTM        956 seconds
LFR-LCBLSTM    377 seconds
DFSMN          339 seconds
LFR-DFSMN      142 seconds

Regarding the decoding latency of LFR-DFSMN, we can reduce it through the look-ahead (future) order of the memory-module filters. In experiments we verified different configurations: when we limit the LFR-DFSMN latency to 5-10 frames, the relative performance loss is only about 3%.

In addition, compared with the complex LFR-LCBLSTM model, the LFR-DFSMN model is also more compact: although it contains 10 DFSMN layers, its total size is only half that of the LFR-LCBLSTM model, a 50% reduction in model size.


References:

1. Zhang Y, Chen G, Yu D, Yao K, et al. Highway long short-term memory RNNs for distant speech recognition[C]//ICASSP, 2016.

2. Xue S, Yan Z. Improving latency-controlled BLSTM acoustic models for online speech recognition[C]//ICASSP, 2017.

3. Zhang S, Liu C, Jiang H, et al. Feedforward sequential memory networks: A new structure to learn long-term dependency[J]. arXiv preprint, 2015.

4. Zhang S, Jiang H, Xiong S, et al. Compact feedforward sequential memory networks for large vocabulary continuous speech recognition[C]//INTERSPEECH, 2016.

5. Zhang S, Liu C, Jiang H, et al. Non-recurrent neural structure for long-term dependence[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.

6. Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint, 2016.

7. Pundak G, Sainath T N. Lower frame rate neural network acoustic models[C]//INTERSPEECH, 2016.
