All dates and times in the Technical Program are given in Hong Kong Standard Time (GMT+8).

Tutorial 1 | AI for Sound: Large-scale Robust Audio Tagging with Audio and Visual Multimodality | Qiuqiang Kong, ByteDance; Juncheng Li, Carnegie Mellon University | January 24, 2021 | 13:30 - 15:00
Tutorial 2 | Pushing the Frontier of Neural Text to Speech | Xu Tan, Microsoft Research Asia (MSRA) | January 24, 2021 | 13:30 - 15:00
Tutorial 3 | Audio-Visual Speech Source Separation | Qingju Liu, Cambridge Huawei Research Centre | January 24, 2021 | 15:30 - 17:00
Tutorial 4 | Neural Mechanisms Underlying Speech Perception in Noise | Nai Ding, Zhejiang University | January 24, 2021 | 15:30 - 17:00

Tutorial 1: AI for Sound: Large-scale Robust Audio Tagging with Audio and Visual Multimodality
Qiuqiang Kong, ByteDance; Juncheng Li, Carnegie Mellon University
Time: 13:30 – 15:00, January 24, 2021

Audio classification and detection are important topics in audio signal processing, covering audio tagging, audio scene classification, music classification, voice endpoint detection, abnormal event detection, etc. In recent years, neural network-based methods have been successfully applied to audio signal processing and have surpassed traditional methods in performance. In the first part of this talk, we will introduce recent developments in neural network-based audio signal processing and present our recent progress on this topic. Most existing work is based on smaller datasets, which limits the performance of the classification system. We recently proposed a general audio classification model, which was trained on the large-scale audio database AudioSet and can detect 527 types of natural sounds in real time. The system achieves a mean average precision (mAP) of 0.439, the highest reported so far. To explore the generalization performance of the system, the pre-trained model was applied to eight audio classification and detection datasets and achieved better results than models without pre-training. This work summarizes the trend and points out future research directions for audio signal processing.
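For readers unfamiliar with the mAP figure quoted above, the following is a minimal sketch of how mean average precision is commonly computed for multi-label audio tagging. It is not the speakers' evaluation code: the function names, the toy labels and scores, and the choice to score a class with no positive clips as 0.0 are all assumptions made for illustration.

```python
def average_precision(labels, scores):
    """Average precision for one sound class.

    labels: list of 0/1 ground-truth tags, one per clip.
    scores: list of predicted scores, one per clip.
    """
    # Rank clips from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision measured at the rank of each positive clip.
            precisions.append(hits / rank)
    # Assumption: a class with no positive clips contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """Mean of per-class average precision.

    Both arguments are lists of per-clip rows with one column per class.
    """
    num_classes = len(label_matrix[0])
    per_class = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / num_classes


if __name__ == "__main__":
    # Toy data: three clips, two hypothetical classes.
    toy_labels = [[1, 0], [0, 1], [1, 1]]
    toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6]]
    print(mean_average_precision(toy_labels, toy_scores))
```

On AudioSet the same averaging would run over 527 classes, so a single number such as 0.439 summarizes ranking quality across both frequent and rare sound types.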

As audio/visual classification models are widely deployed for sensitive tasks such as ubiquitous personal assistants and content filtering at scale, it is critical to understand their robustness as well as to improve their accuracy. The second part of this talk will introduce our recent study of some key questions related to audio/visual learning through the lens of adversarial noise: 1) Are audio/visual models susceptible to adversarial perturbation? 2) How big a threat is it in the physical space? 3) How does the choice between early, middle, and late fusion affect the trade-off between robustness and accuracy? 4) How do different frequency/time-domain features contribute to robustness? 5) How do different neural modules contribute to resisting adversarial noise? In our experiments, we constructed adversarial examples to attack state-of-the-art neural models and demonstrated that we could jam the Alexa wake-word model with inconspicuous background music, deactivating the voice assistant function while the audio adversary is present. We also analyzed how much attack potency, measured as the size of the adversarial perturbation under different Lp norms, is needed to “deactivate” the victim model. By using adversarial noise to ablate multimodal models, we can provide insights into the fusion strategy that best balances model parameters, accuracy, and robustness, and distinguish the robust features from the non-robust features that various neural network models tend to learn.
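To make "perturbation size under different Lp norms" concrete, here is a small sketch that measures a perturbation vector under the L1, L2, and L-infinity norms and clips it to an L-infinity budget. It is an illustration only and not the attack used in the study: the function names, the sample values, and the budget `epsilon` are hypothetical.

```python
def lp_norm(delta, p):
    """Size of a perturbation vector under the Lp norm.

    delta: list of per-sample differences (adversarial minus clean).
    p: a positive number, or float("inf") for the max-magnitude norm.
    """
    if p == float("inf"):
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)


def project_linf(delta, epsilon):
    """Clip every entry to [-epsilon, epsilon], a common way to bound an attack."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]


if __name__ == "__main__":
    # Hypothetical perturbation added to four audio samples.
    delta = [0.5, -0.2, 0.05, 0.0]
    for p in (1, 2, float("inf")):
        print(p, lp_norm(delta, p))
    # Bound the perturbation by an assumed budget of 0.1 per sample.
    print(project_linf(delta, 0.1))
```

Different norms penalize different attack shapes: L-infinity limits the largest change to any single sample, while L1 and L2 limit the total energy spread across samples.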

Qiuqiang Kong received his Ph.D. degree from the University of Surrey, Guildford, UK, in 2020. Following his Ph.D., he joined ByteDance AI Lab as a research scientist. His research topics include the classification, detection, and separation of general sounds and music. He is known for developing attention neural networks for audio tagging and for winning the audio tagging task in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge in 2017. He has authored papers in journals and conferences including IEEE/ACM Transactions on Audio, Speech, and Language Processing.



Juncheng (Billy) Li is a fourth-year Ph.D. student at the Language Technologies Institute in Carnegie Mellon University’s School of Computer Science, working with Prof. Florian Metze. He worked as a research scientist at the Bosch Center for Artificial Intelligence from 2015 to 2019, where he worked with Prof. Zico Kolter. He has a background in deep learning for acoustic signals and multimodal data, and he is currently exploring the adversarial robustness of multimodal machine learning systems. He also gained extensive experience in applying AI to industrial problems at Bosch, where he worked on projects including fault detection, machine simulation, and sensor fusion. He has published at IEEE ICASSP, Interspeech, ICML, and NeurIPS, and won the best paper award at ICMR 2018.


Tutorial 2: Pushing the Frontier of Neural Text to Speech
Xu Tan, Microsoft Research Asia (MSRA)
Time: 13:30 – 15:00, January 24, 2021

Text to speech (TTS), which aims to synthesize natural and intelligible speech from text, has been a hot research topic in the community and has become an important product service in the industry. Although neural network-based end-to-end TTS has significantly improved the quality of synthesized speech, great challenges remain in pushing the frontier of neural TTS and making it practical for product deployment. These challenges include 1) slow inference speed: neural TTS usually has high computational cost and slow inference speed in online serving; 2) robustness: the synthesized voice often suffers from word skipping and repeating; 3) controllability: the synthesized voice usually lacks controllability in terms of speed, pitch, prosody, etc.; 4) over-smoothed prediction: the TTS model tends to predict the average of the training data, which leads to poor voice quality (e.g., a dull or metallic voice); 5) high data cost: neural TTS requires large amounts of training data for a high-quality voice, which makes data collection costly when supporting low-resource languages; 6) diverse product scenarios: TTS systems need to cover multiple speakers, custom voice, noisy speech, singing voice synthesis, talking face synthesis, etc. In this tutorial, we review and introduce a series of TTS research works that address each of the above challenges, including non-autoregressive TTS, robust and controllable TTS, TTS with advanced optimizations, low-resource TTS, and TTS systems for different product scenarios. We further point out some open research problems that are critical to advancing the state of the art in neural text to speech and improving the TTS product experience.

Xu Tan is a Senior Researcher in the Machine Learning Group, Microsoft Research Asia (MSRA). His research interests mainly lie in machine learning, deep learning, and their applications to natural language and speech processing, including neural machine translation, pre-training, text to speech, automatic speech recognition, music generation, etc. Together with his team, he achieved human parity on Chinese-English machine translation in 2018 and won several first places in the WMT machine translation competition in 2019. He has designed several popular language and speech models, such as MASS and FastSpeech, and has transferred many research works to language and speech products at Microsoft. Before joining MSRA, he worked at JD.com on search ranking. He graduated from Zhejiang University.




Tutorial 3: Audio-Visual Speech Source Separation
Qingju Liu, Cambridge Huawei Research Centre
Time: 15:30 – 17:00, January 24, 2021

The video associated with a target speaker contains rich information complementary to the concurrent speech signal, which can improve (e.g., lip reading) or mislead (e.g., the McGurk effect) the listener’s perception of the speech. This additional information is robust to acoustic noise and can be exploited to improve speech separation in adverse environments, e.g., multiple speakers with a high level of reverberation and background noise.

We have proposed several audio-visual (AV) blind source separation (BSS) methods, summarized as follows. 1) Maximising the audio-visual coherence to address the permutation problem in traditional ICA-based BSS methods. 2) Incorporating visual voice activity detection (VAD) to further suppress the residue from interfering speakers. 3) Using audio-visual dictionary learning (AVDL) to improve time-frequency masking. These methods have been applied to microphone array recordings and have shown advantages over the corresponding audio-only methods. Recent work on deep neural networks (DNNs) incorporating both audio and visual streams is also reviewed, offering new challenges and opportunities.
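As background on the time-frequency masking mentioned in item 3, the sketch below applies an oracle ratio mask to a toy magnitude spectrogram. This is a generic textbook illustration and not the AVDL-based method from the tutorial: it assumes the target and interferer magnitudes are known, which is only true in simulation, and all names and numbers are hypothetical.

```python
def ratio_mask(target_mag, interferer_mag, eps=1e-12):
    """Soft mask in [0, 1] for each time-frequency bin.

    Both inputs are lists of rows (frames) of non-negative magnitudes.
    eps avoids division by zero in silent bins.
    """
    return [
        [t / (t + n + eps) for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target_mag, interferer_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Scale each mixture bin by its mask value to estimate the target."""
    return [
        [m * w for m, w in zip(m_row, w_row)]
        for m_row, w_row in zip(mixture_mag, mask)
    ]


if __name__ == "__main__":
    # Toy magnitudes: two frames, two frequency bins.
    target = [[1.0, 0.0], [3.0, 2.0]]
    interferer = [[1.0, 2.0], [1.0, 2.0]]
    # Simplifying assumption: mixture magnitude is the sum of the two sources.
    mixture = [
        [t + n for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target, interferer)
    ]
    mask = ratio_mask(target, interferer)
    print(apply_mask(mixture, mask))
```

In a real separation system the mask has to be estimated from the mixture, and visual cues such as lip movement are one source of information for that estimate.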

Qingju Liu is a speech research engineer at Cambridge Huawei Research Centre and a visiting research fellow at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK. She received the B.Sc. degree in electronic information engineering from Shandong University, China, in 2008, and the Ph.D. degree in signal processing from the University of Surrey, UK, in 2013. She worked at CVSSP, University of Surrey, from October 2013 to October 2020 as a research fellow. She has worked in broad areas of signal processing, particularly audio-visual and time series analysis, including blind source separation, person tracking, speech denoising, and spatial audio production. Her current research focuses on developing machine learning solutions for keyword recognition. She has published in top signal processing journals such as IEEE SP, ASLP, and MM.

Tutorial 4: Neural Mechanisms Underlying Speech Perception in Noise
Nai Ding, Zhejiang University
Time: 15:30 – 17:00, January 24, 2021

The human brain can more reliably recognize speech than computers in noisy listening environments. How the human brain encodes speech features in noisy environments has been extensively investigated in the last 10 years using magnetoencephalography (MEG) and electroencephalography (EEG). In this talk, I will review the progress in this field. When listening to speech, it is found that cortical activity tracks the slow amplitude envelope of speech. When background noise is introduced, cortical tracking of the speech envelope remains robust until the background noise undermines the audibility of speech. For individual listeners, how precisely cortical activity tracks the speech envelope can predict how well a listener can recognize speech in noise. In an environment consisting of two or more speakers, attention can dramatically modulate cortical encoding of speech and selectively enhance the neural response to the attended speaker. Neural enhancement of the attended speaker can rely on either acoustic cues or prior knowledge about the attended speech. At the end of the talk, I will also discuss recent progress on how language knowledge modulates speech processing in the human brain.

Nai Ding received the Ph.D. degree in electronic engineering from the University of Maryland in 2012. From 2012 to 2015, he was a postdoctoral associate at New York University. He is now an assistant professor in the College of Biomedical Engineering and Instrument Science at Zhejiang University. His research focuses on the neural mechanisms underlying speech perception and language comprehension. He has published over 30 papers in journals such as Nature Neuroscience, Nature Communications, and PNAS.