Seminar of the SKKU Convergence Institute for Intelligence and Information
- 2017-11-30
1. Date: 2017. 12. 04 (Mon) 16:00 ~ 17:00
2. Venue: SKKU Natural Science Campus, Research and Business Center 7th Floor, 85777
3. Speaker: Dr. Chanwoo Kim (Senior Software Engineer @ Google Speech Recognition Team)
4. Topic: Recent advances in speech recognition techniques using neural networks and very large training sets
5. Inquiries: Prof. Taeseob Moon (031-299-4326, tsmoon@skku.edu)
[ Abstract ]
In this talk, we will discuss recent advances in speech recognition techniques that use neural networks and very large training sets. Since 2010, the development of various types of neural networks, the availability of very large training sets, and powerful data centers with large numbers of CPUs/GPUs have greatly improved speech recognition. These improvements have made it possible to use speech recognition in home appliances, robots, and voice assistant systems such as Google Home and Amazon Alexa. This talk consists of three parts.

Part 1 provides an overview of recent acoustic model training techniques, including Cross Entropy (CE) training, Connectionist Temporal Classification (CTC), and discriminative sequence training criteria such as state-level Minimum Bayes Risk (sMBR). We also describe how to model the acoustic feature distribution using Feed-Forward Deep Neural Networks (FF-DNNs), Long Short-Term Memories (LSTMs), Gated Recurrent Units (GRUs), and grid-LSTMs.

In Part 2, we will discuss simulated data generation and semi-supervised training. We usually do not have enough data for new speech recognition domains; to address this, we create large-scale acoustically simulated databases from existing data. For very large training sets, labeling has always been a time-consuming and difficult problem, and we discuss semi-supervised training techniques for generating labels in such cases.

In Part 3, we will look into end-to-end neural recognizers that combine the Acoustic Model (AM) and the Language Model (LM). Conventional speech recognition systems consist of several isolated components such as the AM, the LM, and the pronunciation dictionary, so building and training such systems has been difficult and has required a great deal of parameter tuning. We describe attention-based approaches, CTC-based approaches, and the Recurrent Neural Network (RNN) transducer, and compare their performance.
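As a rough illustration of the simulated-data idea mentioned in Part 2 (not taken from the talk itself), the minimal Python sketch below convolves clean speech with a room impulse response and adds noise at a chosen signal-to-noise ratio; the function and parameter names are illustrative only.

```python
# A minimal sketch of acoustic data simulation: add reverberation and noise
# to a clean utterance to create a new, acoustically varied training example.
import numpy as np

def simulate_utterance(clean, rir, noise, snr_db):
    """Create one simulated training utterance (illustrative example).

    clean  -- clean speech waveform (1-D float array)
    rir    -- room impulse response (1-D float array)
    noise  -- noise waveform at least as long as the clean speech
    snr_db -- target signal-to-noise ratio in dB
    """
    # Add reverberation by convolving with the room impulse response.
    reverberant = np.convolve(clean, rir)[:len(clean)]

    # Scale the noise so the mixture reaches the requested SNR.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[:len(reverberant)] ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))

    return reverberant + gain * noise[:len(reverberant)]
```

In practice, many such combinations of room impulse responses, noise types, and SNRs can be generated from a single clean corpus to enlarge the training set.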