IEEE SLT 2021 Online Conference

Jan 22, Fri

You can join a session by clicking the session title. Clicking any paper title shows its abstract and the link to the paper and video.
Only users who registered for SLT 2021 can access the papers, videos, and Zoom sessions; to register, click here.
Session: Assistive technologies (Join Zoom)

Chairs: Samuel Thomas, Jun Du

1102: IMPROVING SPEECH RECOGNITION ACCURACY OF LOCAL POI USING GEOGRAPHICAL MODELS

Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma

Paper and Video (new tab)

Nowadays, voice search for points of interest (POI) is becoming increasingly popular. However, speech recognition for local POI names remains a challenge due to the multi-dialect nature and long-tailed distribution of POI names. This paper improves speech recognition accuracy for local POI from two aspects. First, a geographic acoustic model (Geo-AM) is proposed. The Geo-AM handles the multi-dialect problem using dialect-specific input features and dialect-specific top layers. Second, a group of geo-specific language models (Geo-LMs) is integrated into the speech recognition system to improve recognition accuracy for long-tailed and homophone POI names. During decoding, a specific Geo-LM is selected on demand according to the user's geographic location. Experiments show that the proposed Geo-AM achieves a 6.5%-10.1% relative character error rate (CER) reduction on an accent test set, and that the Geo-AM and Geo-LMs together achieve over 18.7% relative CER reduction on a voice search task for Tencent Map.
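
To make the on-demand Geo-LM selection concrete, here is a minimal, hypothetical sketch (not the authors' implementation): the grid-based location lookup, the toy unigram LMs, and the interpolation weight are all illustrative assumptions.

    # Hypothetical sketch of on-demand Geo-LM selection during decoding.
    # Grid size, toy unigram LMs and interpolation weight are illustrative assumptions.
    import math

    GRID_DEG = 0.1  # assumed grid-cell size in degrees

    def grid_id(lat, lon):
        """Map a GPS coordinate to a coarse geographic cell."""
        return (round(lat / GRID_DEG), round(lon / GRID_DEG))

    # Toy unigram "LMs": log-probabilities per token, one per geographic cell.
    GENERAL_LM = {"coffee": -2.0, "shop": -2.2, "<unk>": -8.0}
    GEO_LMS = {
        (313, 1215): {"tencent": -3.0, "binhai": -3.5, "<unk>": -7.0},  # toy cell
    }

    def score(tokens, lat, lon, weight=0.3):
        """Interpolate the general LM with the Geo-LM for the user's cell."""
        geo_lm = GEO_LMS.get(grid_id(lat, lon), {})
        total = 0.0
        for tok in tokens:
            g = GENERAL_LM.get(tok, GENERAL_LM["<unk>"])
            l = geo_lm.get(tok, geo_lm.get("<unk>", g))
            # linear interpolation in probability space
            total += math.log((1 - weight) * math.exp(g) + weight * math.exp(l))
        return total

    print(score(["tencent", "binhai"], lat=31.3, lon=121.47))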

1027: Towards Automatic Route Description Unification In Spoken Dialog Systems

Yulan Feng, Alan Black, Maxine Eskenazi

Paper and Video (new tab)

In telephone-based navigation dialog systems, scheduling and direction information is typically collected from routing APIs as text and then delivered to users via speech. Systematic directions can be augmented with human directions, gathered by interviewing multiple experts, to provide more accurate and personalized routes and cover broader user needs. However, manually collecting, transcribing, correcting, and rewriting human descriptions is time-consuming, and their inconsistency with systematic directions can confuse users when delivered orally. This paper describes the construction of a pipeline that automates the route description unification process and renders the resulting direction delivery more concise and consistent.

1202: END-TO-END WHISPERED SPEECH RECOGNITION WITH FREQUENCY-WEIGHTED APPROACHES AND PSEUDO WHISPER PRE-TRAINING

Heng-Jui Chang, Alexander H. Liu, Hung-yi Lee, Lin-shan Lee

Paper and Video (new tab)

Whispering is an important mode of human speech, but no end-to-end recognition results for it have been reported yet, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech that take into account its special characteristics and the scarcity of data. These include a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structure of whispered speech, and a layer-wise transfer learning approach that pre-trains a model on normal or normal-to-whispered converted speech and then fine-tunes it on whispered speech to bridge the gap between whispered and normal speech. We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus. The results indicate that, as long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
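
The frequency-weighted SpecAugment idea can be illustrated with a small sketch; the linearly increasing sampling weights and mask sizes below are assumptions for illustration, not the paper's exact policy, and time warping is omitted.

    # Illustrative sketch of a frequency-weighted masking policy: masks are drawn
    # more often from higher mel bins, where whispered speech carries more structure.
    # The triangular sampling weights and mask sizes are assumptions.
    import numpy as np

    def freq_weighted_mask(spec, num_masks=2, max_width=8, rng=np.random.default_rng(0)):
        """spec: (n_mels, n_frames) log-mel spectrogram; returns a masked copy."""
        n_mels, _ = spec.shape
        out = spec.copy()
        # Sampling weights increase linearly with mel-bin index (favor high frequencies).
        weights = np.arange(1, n_mels + 1, dtype=float)
        weights /= weights.sum()
        for _ in range(num_masks):
            center = rng.choice(n_mels, p=weights)
            width = rng.integers(1, max_width + 1)
            lo, hi = max(0, center - width // 2), min(n_mels, center + width // 2 + 1)
            out[lo:hi, :] = out.mean()  # fill with the spectrogram mean, as in SpecAugment variants
        return out

    masked = freq_weighted_mask(np.random.randn(80, 200))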

1378: ARTICULATORY COMPARISON OF L1 AND L2 SPEECH FOR MISPRONUNCIATION DIAGNOSIS

Subash Khanal, Michael T. Johnson, Narjes Bozorg

Paper and Video (new tab)

This paper compares the difference in articulation patterns between native (L1) and non-native (L2) Mandarin speakers of English, for the purpose of providing an understanding of mispronunciation behaviors of L2 learners. Consensus transcriptions from the Electromagnetic Articulography Mandarin Accented English (EMA-MAE) corpus are used to identify commonly occurring substitution errors for consonants and vowels. Phoneme level alignments of the utterances produced by speech recognition models are used to extract articulatory feature vectors representing correct and substituted sounds from L1 and L2 speaker groups respectively. The articulatory features that are significantly different between the two groups are identified along with the direction of error for the L2 speaker group. Experimental results provide information about which types of substitutions are most common and which specific articulators are the most significant contributors to those errors.
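
A hedged sketch of the kind of group comparison the abstract describes: testing whether an articulatory feature differs significantly between L1 and L2 productions and reporting the direction of the difference. The toy feature values and the choice of a t-test are illustrative assumptions.

    # Hypothetical sketch of comparing one articulatory feature between L1 and L2
    # productions of a sound; the feature values are toy numbers and the t-test is
    # one reasonable choice of significance test, not necessarily the paper's.
    import numpy as np
    from scipy.stats import ttest_ind

    l1_tongue_tip_height = np.random.normal(0.0, 1.0, size=40)   # native productions
    l2_tongue_tip_height = np.random.normal(0.6, 1.0, size=40)   # non-native productions

    stat, p_value = ttest_ind(l1_tongue_tip_height, l2_tongue_tip_height)
    direction = "higher" if l2_tongue_tip_height.mean() > l1_tongue_tip_height.mean() else "lower"
    print(f"p = {p_value:.4f}; L2 articulator position is {direction} on average")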

1272: Data Augmentation for End-to-end Code-switching Speech Recognition

Chenpeng Du, Hao Li, Yizhou Lu, Lan Wang, Yanmin Qian

Paper and Video (new tab)

Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited. In this paper, three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or word insertion. Our experiments on a 200-hour Mandarin-English code-switching dataset show that each of the three proposed approaches individually yields significant improvements on code-switching ASR. Moreover, all of the proposed approaches can be combined with the recently popular SpecAugment, providing an additional gain. The WER is reduced by a relative 24.0% compared to the system without any data augmentation, and by a relative 13.0% compared to the system with only SpecAugment.
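
A minimal sketch of the word-translation variant of the text-generation step (the TTS side is not shown); the toy Mandarin-English lexicon and the replacement probability are assumptions for illustration.

    # Hypothetical sketch of generating new code-switching text by word translation:
    # randomly swap dictionary words in a Mandarin sentence for English translations,
    # then synthesize the result with TTS (not shown). Lexicon and probability are toy choices.
    import random

    MANDARIN_TO_ENGLISH = {"会议": "meeting", "项目": "project", "邮件": "email"}  # toy lexicon

    def make_code_switching(sentence, p=0.5, rng=random.Random(0)):
        """Randomly replace lexicon words in a tokenized Mandarin sentence."""
        out = []
        for word in sentence:
            if word in MANDARIN_TO_ENGLISH and rng.random() < p:
                out.append(MANDARIN_TO_ENGLISH[word])
            else:
                out.append(word)
        return out

    print(make_code_switching(["明天", "的", "会议", "改", "到", "下午"]))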

1314: OPTIMIZED PREDICTION OF FLUENCY OF L2 ENGLISH BASED ON INTERPRETABLE NETWORK USING QUANTITY OF PHONATION AND QUALITY OF PRONUNCIATION

Yang Shen, Ayano Yasukagawa, Daisuke Saito, Nobuaki Minematsu, Kazuya Saito

Paper and Video (new tab)

This paper presents the results of a joint project between an engineering team at one university and an educational team at another to develop an online fluency assessment system for Japanese learners of English. A picture description corpus of English spoken by 90 learners and 10 native speakers was used, in which each speaker's fluency was rated manually by 10 other native raters. The assessment system was built to predict the averaged manual scores. System development had two aims: an analytical one, so that teachers can see and discuss which speech features contribute most to fluency prediction, and a technical one, so that teachers' knowledge can be incorporated into training the system, which can then be further optimized using an interpretable network. Experiments showed that quality-of-pronunciation features are much more helpful than quantity-of-phonation features, and the optimized system reached an extremely high correlation of 0.956 with the averaged manual scores, higher than the maximum inter-rater correlation (0.910).

1300: Incorporating Discriminative DPGMM Posteriorgrams for Low-Resource ASR

Bin Wu, Sakriani Sakti, Satoshi Nakamura

Paper and Video (new tab)

The first step in building an ASR system is to extract proper speech features. Ideal speech features for ASR must have high discriminability between linguistic units and be robust to non-linguistic factors such as gender, age, emotion, or noise. The discriminability of various features has been compared in several Zerospeech challenges, which aim to discover linguistic units without any transcriptions; there, posteriorgrams from DPGMM clustering show strong discriminability and achieve several of the top ABX discrimination scores between phonemes. This paper appends DPGMM posteriorgrams to acoustic features to increase their discriminability and enhance ASR systems. To the best of our knowledge, DPGMM features, which are usually applied to tasks such as spoken term detection and zero-resource tasks, have not previously been applied to large vocabulary continuous speech recognition (LVCSR). DPGMM clustering can dynamically change the number of Gaussians until each one fits one segmental pattern of the whole speech corpus with the highest probability, so that linguistic units with different segmental patterns are clearly discriminated. Our experimental results on the WSJ corpora show that our proposal consistently improves ASR systems and provides even larger improvements for smaller datasets with fewer resources.
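
Appending posteriorgrams to standard acoustic features is a simple frame-level concatenation; the sketch below uses random placeholders for the filterbank features and DPGMM posteriors, and the dimensions are arbitrary assumptions.

    # Minimal sketch of appending frame-level DPGMM posteriorgrams to standard
    # acoustic features before acoustic-model training; dimensions are illustrative.
    import numpy as np

    n_frames = 500
    fbank = np.random.randn(n_frames, 40)          # e.g. 40-dim filterbank features
    posteriorgram = np.random.dirichlet(            # stand-in for DPGMM cluster posteriors
        alpha=np.ones(120), size=n_frames)          # 120 clusters chosen arbitrarily

    augmented = np.concatenate([fbank, np.log(posteriorgram + 1e-8)], axis=1)
    print(augmented.shape)  # (500, 160): original features plus log-posterior dimensions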

1134: AUTOMATED SCORING OF SPONTANEOUS SPEECH FROM YOUNG LEARNERS OF ENGLISH USING TRANSFORMERS

Xinhao Wang, Keelan Evanini, Yao Qian, Matthew Mulholland

Paper and Video (new tab)

This study explores the use of Transformer-based models for the automated assessment of children's non-native spontaneous speech. Traditional approaches for this task have relied heavily on delivery features (e.g., fluency), whereas the goal of the current study is to build automated scoring models based solely on transcriptions in order to see how well they capture additional aspects of speaking proficiency (e.g., content appropriateness, vocabulary, and grammar) despite the high word error rate (WER) of automatic speech recognition (ASR) on children's non-native spontaneous speech. Transformer-based models are built using both manual transcriptions and ASR hypotheses, and versions of the models that incorporated the prompt text were investigated in order to more directly measure content appropriateness. Two baseline systems were used for comparison, including an attention-based Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) and a Support Vector Regressor (SVR) with manually engineered content-related features. Experimental results demonstrate the effectiveness of the Transformer-based models: the automated prompt-aware model using ASR hypotheses achieves a Pearson correlation coefficient (r) with holistic proficiency scores provided by human experts of 0.835, outperforming both the attention-based RNN-LSTM baseline (r = 0.791) and the SVR baseline (r = 0.767).
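
A minimal prompt-aware scoring sketch, assuming a BERT-style encoder from the transformers library; the specific checkpoint, the text-pair encoding, and the single linear regression head are illustrative choices rather than the authors' exact model.

    # Sketch of transcription-only, prompt-aware scoring with a Transformer encoder.
    # The checkpoint and regression head are assumptions, not the paper's model.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # holistic proficiency score

    prompt = "Describe what is happening in the picture."
    response = "the boy is play with a dog in park"  # e.g. an ASR hypothesis

    # Encoding prompt and response as a text pair lets the model attend to both.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pooled = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        print(score_head(pooled).item())  # untrained head, illustrative output only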

1343: FRAME-LEVEL SPECAUGMENT FOR DEEP CONVOLUTIONAL NEURAL NETWORKS IN HYBRID ASR SYSTEMS

Xinwei Li, Yuanyuan Zhang, Xiaodan Zhuang, Daben Liu

Paper and Video (new tab)

Inspired by SpecAugment, a data augmentation method for end-to-end ASR systems, we propose a frame-level SpecAugment method (f-SpecAugment) to improve the performance of deep convolutional neural networks (CNNs) for hybrid HMM-based ASR systems. Like utterance-level SpecAugment, f-SpecAugment performs three transformations: time warping, frequency masking, and time masking. Instead of applying the transformations at the utterance level, f-SpecAugment applies them to each convolution window independently during training. We demonstrate that f-SpecAugment is more effective than utterance-level SpecAugment for deep CNN-based hybrid models. We evaluate the proposed f-SpecAugment on 50-layer Self-Normalizing Deep CNN (SNDCNN) acoustic models trained with up to 25,000 hours of training data. We observe that f-SpecAugment reduces WER by 0.5-4.5% relative across different ASR tasks in four languages. As the benefits of augmentation techniques tend to diminish as training data size increases, the large-scale training reported here is important for understanding the effectiveness of f-SpecAugment. Our experiments demonstrate that even with 25k hours of training data, f-SpecAugment is still effective. We also demonstrate that f-SpecAugment has benefits approximately equivalent to doubling the amount of training data for deep CNNs.
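
The per-window masking idea can be sketched as follows; the window length and mask widths are assumptions, and the time-warping transform is omitted for brevity.

    # Illustrative sketch of the per-window idea behind f-SpecAugment: each chunk of
    # frames gets its own independently sampled frequency and time masks, instead of
    # one mask per utterance. Window size and mask widths are assumptions.
    import numpy as np

    def f_specaugment(spec, window=32, max_f=8, max_t=4, rng=np.random.default_rng(0)):
        """spec: (n_mels, n_frames); masks each window of frames independently."""
        n_mels, n_frames = spec.shape
        out = spec.copy()
        for start in range(0, n_frames, window):
            chunk = out[:, start:start + window]           # view into the output array
            f0 = rng.integers(0, n_mels - max_f)
            chunk[f0:f0 + rng.integers(0, max_f + 1), :] = 0.0        # frequency mask
            t0 = rng.integers(0, max(1, chunk.shape[1] - max_t))
            chunk[:, t0:t0 + rng.integers(0, max_t + 1)] = 0.0        # time mask
        return out

    augmented = f_specaugment(np.random.randn(80, 300))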

1115: IMPROVING L2 ENGLISH RHYTHM EVALUATION WITH AUTOMATIC SENTENCE STRESS DETECTION

Binghuai Lin, Liyuan Wang, Hongwei Ding, Xiaoli Feng

Paper and Video (new tab)

English is a stress-timed language, in which sentence stress (prosodic stress) plays an important role. It is therefore difficult for Chinese learners, who are used to a syllable-timed rhythm, to learn the rhythm of English [1]. In this paper, we investigate how to improve rhythm evaluation based on sentence stress for Chinese learners of English as a second language (ESL). In particular, we explore rhythm measures that quantify rhythmic differences among second language (L2) learners based on sentence stress. To reduce the dependency on labeled sentence stress data, we predict sentence stress automatically using a hierarchical network with bidirectional Long Short-Term Memory (BLSTM) layers [2]. We evaluate the proposed method on a corpus of 3,500 sentences recorded by 100 Chinese speakers aged 10 to 20 years, annotated with sentence stress labels and scored by three experts. Experimental results show that the proposed sentence stress measure correlates well with the labeled prosody scores, with a correlation coefficient of -0.73, and that the automatic labeling method achieves results comparable to those obtained with gold labels.
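
As one hedged illustration of a stress-based rhythm measure, the sketch below computes the variability of intervals between stressed syllables (stress timing implies roughly even intervals); this particular measure is an assumption, not necessarily one of the paper's measures.

    # Hypothetical stress-based rhythm measure: variability of inter-stress intervals.
    # Word timings and stress labels are toy inputs for illustration.
    import numpy as np

    def stress_interval_variability(word_times, stress_labels):
        """word_times: list of (start, end) in seconds; stress_labels: 1 for stressed words."""
        onsets = [start for (start, _), s in zip(word_times, stress_labels) if s == 1]
        intervals = np.diff(onsets)
        return float(np.std(intervals) / np.mean(intervals))  # lower = more stress-timed

    times = [(0.0, 0.3), (0.3, 0.5), (0.5, 1.0), (1.0, 1.2), (1.2, 1.8)]
    print(stress_interval_variability(times, [1, 0, 1, 0, 1]))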

1361: DATA AUGMENTING CONTRASTIVE LEARNING OF SPEECH REPRESENTATIONS IN THE TIME DOMAIN

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

Paper and Video (new tab)

Contrastive Predictive Coding (CPC), which is based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning from the speech signal. However, it still underperforms other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library which we adapt and optimize for the specificities of CPC (raw waveform input, contrastive loss, past-versus-future structure). We find that applying augmentation only to the segments from which the CPC prediction is performed yields better results than also applying it to the future segments from which the samples (both positive and negative) of the contrastive loss are drawn. After selecting the best combination of pitch modification, additive noise, and reverberation on unsupervised metrics on LibriSpeech (with a gain of 18-22% relative on the ABX score), we apply this combination without any change to three new datasets from the Zero Resource Speech Benchmark 2017 and beat the state of the art using out-of-domain training data. Finally, we show that the data-augmented pretrained features improve a downstream phone recognition task in the Libri-light semi-supervised setting (10 min, 1 h, or 10 h of labelled data), reducing the PER by 15% relative.
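
The key recipe, augmenting only the past (context) window while keeping the future target segments clean, can be sketched as below; additive noise stands in for the selected pitch/noise/reverberation combination, and the window sizes are arbitrary assumptions.

    # Sketch of past-only augmentation for CPC: perturb only the window the model
    # encodes to make predictions, leaving the future windows (positive/negative
    # samples for the contrastive loss) clean. Noise is a simplified stand-in.
    import numpy as np

    def augment_past_only(wave, past_len, noise_scale=0.01, rng=np.random.default_rng(0)):
        """wave: 1-D waveform; past_len: number of samples in the past/context window."""
        past = wave[:past_len] + noise_scale * rng.standard_normal(past_len)
        future = wave[past_len:]                 # kept clean for the contrastive targets
        return np.concatenate([past, future])

    wave = np.random.randn(16000)                # 1 s at 16 kHz, toy signal
    augmented = augment_past_only(wave, past_len=12800)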

1119: ENHANCING THE INTELLIGIBILITY OF CLEFT LIP AND PALATE SPEECH USING CYCLE-CONSISTENT ADVERSARIAL NETWORKS

Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna

Paper and Video (new tab)

Cleft lip and palate (CLP) refers to a congenital craniofacial condition that causes various speech-related disorders. As a result of the structural and functional deformities, affected subjects' speech intelligibility is significantly degraded, limiting the accessibility and usability of speech-controlled devices. It is therefore desirable to improve CLP speech intelligibility, which would also be useful during speech therapy. In this study, the cycle-consistent adversarial network (CycleGAN) method is exploited to improve CLP speech intelligibility. The model is trained on speech data from native Kannada-speaking children. The effectiveness of the proposed approach is also measured using automatic speech recognition performance. Further, a subjective evaluation is performed, and those results also confirm the intelligibility improvement of the enhanced speech over the original.
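
A minimal sketch of the cycle-consistency idea underlying CycleGAN-based enhancement: mapping CLP speech features to typical speech and back should reconstruct the input. The single-layer "generators" are placeholders, not the paper's architecture, and the adversarial losses are omitted.

    # Toy illustration of the cycle-consistency loss used in CycleGAN-style mapping
    # between CLP and typical speech features; generators here are placeholders.
    import torch

    G_clp_to_typ = torch.nn.Linear(80, 80)   # placeholder generator: CLP -> typical
    G_typ_to_clp = torch.nn.Linear(80, 80)   # placeholder generator: typical -> CLP

    clp_feats = torch.randn(16, 80)          # a batch of CLP spectral frames (toy data)
    cycled = G_typ_to_clp(G_clp_to_typ(clp_feats))
    cycle_loss = torch.nn.functional.l1_loss(cycled, clp_feats)
    print(cycle_loss.item())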

1369: DUAL APPLICATION OF SPEECH ENHANCEMENT FOR AUTOMATIC SPEECH RECOGNITION

Ashutosh Pandey, Chunxi Liu, Yun Wang, Yatharth Saraf

Paper and Video (new tab)

In this work, we exploit speech enhancement for improving a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for complex spectral mapping based speech enhancement, and find it helpful for ASR in two ways: a data augmentation technique, and a preprocessing frontend. In using it for ASR data augmentation, we exploit a KL divergence based consistency loss that is computed between the ASR outputs of original and enhanced utterances. In using speech enhancement as an effective ASR frontend, we propose a three-step training scheme based on model pretraining and feature selection. We evaluate our proposed techniques on a challenging social media English video dataset, and achieve an average relative improvement of 11.2% with speech enhancement based data augmentation, 8.3% with enhancement based preprocessing, and 13.4% when combining both.
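
The consistency loss between the ASR outputs of original and enhanced utterances can be sketched with a KL divergence over per-frame output distributions; the shapes and random logits below are placeholders, not the actual RNN-T posteriors.

    # Sketch of a KL-divergence consistency loss between the output distributions for
    # an original utterance and its enhanced version; logits are random placeholders.
    import torch
    import torch.nn.functional as F

    logits_original = torch.randn(50, 500)   # 50 frames, 500 output tokens (toy shapes)
    logits_enhanced = torch.randn(50, 500)

    # KL(P_original || P_enhanced): encourage the two output distributions to agree.
    consistency = F.kl_div(
        F.log_softmax(logits_enhanced, dim=-1),     # input must be log-probabilities
        F.softmax(logits_original, dim=-1),         # target given as probabilities
        reduction="batchmean",
    )
    print(consistency.item())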

1391: DEVELOPMENT OF CNN-BASED COCHLEAR IMPLANT AND NORMAL HEARING SOUND RECOGNITION MODELS USING NATURAL AND AURALIZED ENVIRONMENTAL AUDIO

Ram Charan Chandra Shekar, Chelzy Belitz, John Hansen

Paper and Video (new tab)

Restoration of auditory function for hearing-impaired individuals using cochlear implant (CI) technology has contributed significantly to an improved quality of life. Most clinical studies and research efforts in CI, including machine learning (ML) techniques, focus on enhancing speech perception, while limited research has considered environmental sound awareness. It is also well known that CI users experience greater challenges in effective speech recognition in noisy, reverberant, or time-varying environments. This study presents a comparative analysis of normal hearing (NH) vs. CI environmental sound recognition using classifiers trained on sound representations learned by a CNN-based sound event model. Sounds as experienced by CI listeners are recreated by auralizing electrical stimuli: CCi-MOBILE is used to generate the electrical stimuli and the Braecker Vocoder is used for auralization. Representations of natural and auralized sounds are used to model NH and CI sound recognition systems, respectively. Information related to environmental sound is extracted by analyzing F1-scores and other performance characteristics. The findings of this research can help CI researchers advance sound recognition performance, develop novel sound processing algorithms, and identify optimal CI electrical stimulation characteristics. For CI users, improved environmental sound awareness contributes to improved quality of life.

1420: TWO-STAGE AUGMENTATION AND ADAPTIVE CTC FUSION FOR IMPROVED ROBUSTNESS OF MULTI-STREAM END-TO-END ASR

Ruizhi Li, Gregory Sell, Hynek Hermansky

Paper and Video (new tab)

Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition differs from training. Hence, it is essential to make ASR systems robust to various environmental distortions, such as background noise and reverberation. In a multi-stream paradigm, improving robustness involves handling a variety of unseen single-stream conditions as well as inter-stream dynamics. Previously, a practical two-stage training strategy was proposed for multi-stream end-to-end ASR, where Stage-2 formulates the multi-stream model using features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation addresses single-stream input variability with data augmentation techniques, and Stage-2 Time Masking applies temporal masks to the UFE features of randomly selected streams to simulate diverse stream combinations. During inference, we also present adaptive Connectionist Temporal Classification (CTC) fusion with the help of hierarchical attention mechanisms. Experiments were conducted on two datasets, DIRHA and AMI, in a multi-stream scenario. Compared with the previous training strategy, substantial improvements are reported, with relative word error rate reductions of 29.7-59.3% across several unseen stream combinations.
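
The Stage-2 Time Masking step can be sketched as follows; the stream-selection probability and maximum span length are assumptions for illustration.

    # Sketch of Stage-2 time masking: for each training example, randomly pick streams
    # and zero out a temporal span of their UFE features to simulate unreliable or
    # missing streams. Span length and selection probability are assumptions.
    import numpy as np

    def stage2_time_mask(stream_feats, p_select=0.5, max_span=30, rng=np.random.default_rng(0)):
        """stream_feats: list of (n_frames, feat_dim) arrays, one per stream."""
        masked = []
        for feats in stream_feats:
            feats = feats.copy()
            if rng.random() < p_select:
                n_frames = feats.shape[0]
                span = rng.integers(1, max_span + 1)
                start = rng.integers(0, max(1, n_frames - span))
                feats[start:start + span, :] = 0.0
            masked.append(feats)
        return masked

    streams = [np.random.randn(200, 512) for _ in range(2)]   # two toy streams
    masked_streams = stage2_time_mask(streams)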

Q&A
Use the room number shown in parentheses next to each paper to join its individual Zoom breakout room.

