IEEE SLT 2021 Online Conference

Jan 19, Tue

You can join a session by clicking the session title. Clicking any paper title will show the abstract and the links to the paper and video.
Only users who registered for SLT 2021 can access the papers, videos, and Zoom sessions; to register, click here.
Session: Speaker Recognition (Join Zoom)

Chairs: Omid Sadjadi, Hung-yi Lee

09:10

1062: TRANSFORMER-BASED ONLINE SPEECH RECOGNITION WITH DECODER-END ADAPTIVE COMPUTATION STEPS

Mohan Li, Catalin Zorila, Rama Doddipatla

Paper and Video (new tab)

Transformer-based end-to-end (E2E) automatic speech recognition (ASR) systems have recently gained wide popularity, and are shown to outperform E2E models based on recurrent structures on a number of ASR tasks. However, like other E2E models, Transformer ASR also requires the full input sequence for calculating the attentions on both encoder and decoder, leading to increased latency and posing a challenge for online ASR. The paper proposes the Decoder-end Adaptive Computation Steps (DACS) algorithm to address the issue of latency and facilitate online ASR. The proposed algorithm streams the decoding of Transformer ASR by triggering an output after the confidence acquired from the encoder states reaches a certain threshold. Unlike other monotonic attention mechanisms that risk visiting the entire encoder states for each output step, the paper introduces a maximum look-ahead step into the DACS algorithm to prevent the attention from reaching the end of speech too quickly. A chunkwise encoder is adopted in our system to handle real-time speech inputs. The proposed online Transformer ASR system has been evaluated on the Wall Street Journal (WSJ) and AIShell-1 datasets, yielding a 5.5% word error rate (WER) and a 7.4% character error rate (CER), respectively, with only a minor decay in performance when compared to the offline systems.
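
For readers unfamiliar with adaptive computation steps, the following minimal Python sketch illustrates the general halting idea the abstract describes: per-frame confidences from the encoder are accumulated until a threshold is crossed or a maximum look-ahead is exhausted. The threshold and look-ahead values are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of a decoder-end adaptive-computation halting rule:
# accumulate per-frame confidences and stop once the running sum exceeds a
# threshold or a maximum look-ahead is reached. Values are assumptions.

def adaptive_halting_position(halting_probs, start, threshold=0.999, max_lookahead=16):
    """Return the encoder index at which the current output step is triggered.

    halting_probs: list of per-frame confidences in [0, 1], one per encoder state.
    start: first encoder index considered for the current output step.
    """
    accumulated = 0.0
    end = min(start + max_lookahead, len(halting_probs))
    for t in range(start, end):
        accumulated += halting_probs[t]
        if accumulated >= threshold:
            return t  # enough confidence gathered; trigger the output
    return end - 1    # look-ahead budget exhausted; trigger anyway


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.5, 0.4, 0.1, 0.05]
    print(adaptive_halting_position(probs, start=0))  # halts at index 3
```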

1098: SUPERVISED ATTENTION FOR SPEAKER RECOGNITION

Seong Min Kye, Joon Son Chung, Hoirin Kim

Paper and Video (new tab)

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of the context vector is to select the most discriminative frames for speaker recognition. However, the SAP underperforms compared to the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, which learns the context vector using classified samples. With our proposed methods, the context vector can be boosted to select the most informative frames. We show that our method outperforms existing methods in various experimental settings including short utterance speaker recognition, and achieves competitive performance over the existing baselines on the VoxCeleb datasets.
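
For context, self-attentive pooling itself can be sketched in a few lines of PyTorch: a learned context vector scores each frame, and the utterance embedding is the attention-weighted mean of the frame features. Dimensions and initialization here are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Self-attentive pooling sketch: a learned context vector scores each
    frame and the utterance embedding is the attention-weighted mean.
    The feature dimension is an illustrative assumption."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)            # frame-wise transform
        self.context = nn.Parameter(torch.randn(feat_dim))   # learnable context vector

    def forward(self, frames):                               # frames: (batch, time, feat_dim)
        keys = torch.tanh(self.proj(frames))                 # (batch, time, feat_dim)
        scores = keys @ self.context                         # (batch, time)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)     # (batch, time, 1)
        return (weights * frames).sum(dim=1)                 # (batch, feat_dim)


if __name__ == "__main__":
    pooled = SelfAttentivePooling()(torch.randn(4, 100, 512))
    print(pooled.shape)  # torch.Size([4, 512])
```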

1047: CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Haohan Guo, Shaofei Zhang, Frank Soong, Lei He, Lei Xie

Paper and Video (new tab)

End-to-end neural TTS has achieved excellent performance in reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of the corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus well designed for the voice agent, with a new recording scheme ensuring both recording quality and conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach which has an auxiliary encoder and a conversational context encoder to reinforce the information about the current utterance and its context in a conversation. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational context, with significant preference gains at both utterance level and conversation level. Moreover, we find that the model has the ability to express some spontaneous behaviors, like fillers and repeated words, which makes the conversational speaking style more realistic.

09:16

1176: STREAMING ATTENTION-BASED MODELS WITH AUGMENTED MEMORY FOR END-TO-END SPEECH RECOGNITION

Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael Seltzer

Paper and Video (new tab)

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and the quadratically growing computational cost with respect to the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint from the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to our best knowledge the lowest among streaming approaches reported so far.

1181: CROSS ATTENTIVE POOLING FOR SPEAKER VERIFICATION

Seong Min Kye, Yoohwan Kwon, Joon Son Chung

Paper and Video (new tab)

The goal of this paper is text-independent speaker verification where utterances come from `in the wild' videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilises the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset in which our method outperforms comparable pooling strategies.
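
The pair-wise idea can be illustrated with a small PyTorch sketch in which each utterance is pooled with attention weights computed against the other utterance in the pair. The scoring function and dimensions are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_attentive_pool(ref, qry):
    """Illustrative pair-wise pooling: each utterance's frames are scored
    against the mean of the *other* utterance, so the pooled embeddings
    reflect the reference-query pair rather than a single instance.

    ref, qry: (time, feat_dim) frame-level features of the two utterances.
    """
    def pool(frames, context):
        scores = frames @ context / frames.shape[-1] ** 0.5   # scaled dot-product
        weights = F.softmax(scores, dim=0).unsqueeze(-1)      # (time, 1)
        return (weights * frames).sum(dim=0)                  # (feat_dim,)

    ref_emb = pool(ref, qry.mean(dim=0))   # reference attended by query context
    qry_emb = pool(qry, ref.mean(dim=0))   # query attended by reference context
    return ref_emb, qry_emb


if __name__ == "__main__":
    a, b = torch.randn(120, 256), torch.randn(80, 256)
    e1, e2 = cross_attentive_pool(a, b)
    print(F.cosine_similarity(e1, e2, dim=0).item())  # pair-wise similarity score
```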

1166: Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis

Liangqi Liu, Jiankun Hu, Zhiyong Wu, Song Yang, Songfan Yang, Jia Jia, Helen Meng

Paper and Video (new tab)

In speech interaction scenarios, speech emphasis is essential for expressing the underlying intention and attitude. Recently, end-to-end emphatic speech synthesis greatly improves the naturalness of synthetic speech, but also brings new problems: 1) lack of interpretability for how emphatic codes affect the model; 2) no separate control of emphasis on duration and on intonation and energy. We propose a novel way to build an interpretable and controllable emphatic speech synthesis framework based on forward attention. Firstly, we explicitly model the local variation of speaking rate for emphasized words and neutral words with modified forward attention to manifest emphasized words in terms of duration. The decoder is further divided into attention-RNN and decoder-RNN to disentangle the influence of emphasis on duration and on intonation and energy. The emphasis information is injected into decoder-RNN for highlighting emphasized words in the aspects of intonation and energy. Experimental results have shown that our model can not only provide separate control of emphasis on duration and on intonation and energy, but also generate more robust and prominent emphatic speech with high quality and naturalness.

09:22

1178: CASCADE RNN-TRANSDUCER: SYLLABLE BASED STREAMING ON-DEVICE MANDARIN SPEECH RECOGNITION WITH A SYLLABLE-TO-CHARACTER CONVERTER

Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie

Paper and Video (new tab)

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, only brings a small performance gain. In view of the fact that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can be easily used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.

1170: ResNeXt and Res2Net Structures for Speaker Verification

Tianyan Zhou, Yong Zhao, Jian Wu

Paper and Video (new tab)

The ResNet-based architecture has been widely adopted to extract speaker embeddings for text-independent speaker verification systems. By introducing residual connections to the CNN and standardizing the residual blocks, the ResNet structure is capable of training deep networks to achieve highly competitive recognition performance. However, when the input feature space becomes more complicated, simply increasing the depth and width of the ResNet network may not fully realize its performance potential. In this paper, we present two extensions of the ResNet architecture, ResNeXt and Res2Net, for speaker verification. Originally proposed for image recognition, the ResNeXt and Res2Net introduce two more dimensions, cardinality and scale, in addition to depth and width, to improve the model's representation capacity. By increasing the scale dimension, the Res2Net model can represent multi-scale features with various granularities, which particularly facilitates speaker verification for short utterances. We evaluate our proposed systems on three speaker verification tasks. Experiments on the VoxCeleb test set demonstrated that the ResNeXt and Res2Net can significantly outperform the conventional ResNet model. The Res2Net model achieved superior performance by reducing the EER by 18.5% relative. Experiments on the other two internal test sets of mismatched conditions further confirmed the generalization of the ResNeXt and Res2Net architectures against noisy environments and segment length variations.
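
The scale dimension mentioned above can be pictured with a simplified Res2Net-style block in PyTorch: channels are split into groups that are convolved hierarchically, so the block aggregates multi-scale receptive fields at little extra cost. Channel counts, and the omission of the usual 1x1 bottleneck convolutions, are illustrative simplifications rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Res2NetStyleBlock(nn.Module):
    """Simplified Res2Net-style block: channels are split into `scale` groups
    and processed by 3x3 convolutions in a hierarchical fashion, mixing
    multi-scale receptive fields. Channel counts are illustrative."""

    def __init__(self, channels=64, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # one 3x3 conv per group except the first, which is passed through
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1) for _ in range(scale - 1)]
        )

    def forward(self, x):                          # x: (batch, channels, freq, time)
        splits = torch.chunk(x, self.scale, dim=1)
        out = [splits[0]]
        prev = None
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = torch.relu(conv(inp))           # each group also sees the previous one
            out.append(prev)
        return torch.cat(out, dim=1) + x           # residual connection


if __name__ == "__main__":
    y = Res2NetStyleBlock()(torch.randn(2, 64, 40, 100))
    print(y.shape)  # torch.Size([2, 64, 40, 100])
```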

1173: VAW-GAN FOR DISENTANGLEMENT AND RECOMPOSITION OF EMOTIONAL ELEMENTS IN SPEECH

Kun Zhou, Berrak Sisman, Haizhou Li

Paper and Video (new tab)

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN, that includes two VAW-GAN pipelines, one for spectrum conversion, and another for prosody conversion. We train a spectral encoder that disentangles emotion and prosody (F0) information from spectral features; we also train a prosodic encoder that disentangles emotion modulation of prosody (affective prosody) from linguistic prosody. At run-time, the decoder of spectral VAW-GAN is conditioned on the output of prosodic VAW-GAN. The vocoder takes the converted spectral and prosodic features to generate the target emotional speech. Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.

09:28

1186: STREAMING TRANSFORMER ASR WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH

Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

Paper and Video (new tab)

The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source–target attention. In this paper, we propose a novel blockwise synchronous beam search algorithm based on blockwise processing of the encoder to perform streaming E2E Transformer ASR. In the beam search, encoded feature blocks are synchronously aligned using a block boundary detection technique, where a reliability score of each predicted hypothesis is evaluated based on the end-of-sequence and repeated tokens in the hypothesis. Evaluations of the HKUST and AISHELL-1 Mandarin, LibriSpeech English, and CSJ Japanese tasks show that the proposed streaming Transformer algorithm outperforms conventional online approaches, including monotonic chunkwise attention (MoChA), especially when using the knowledge distillation technique. An ablation study indicates that our streaming approach contributes to reducing the response time, and the repetition criterion contributes significantly in certain tasks. Our streaming ASR models achieve comparable or superior performance to batch models and other streaming-based Transformer methods in all tasks considered.

1389: Embedding Aggregation for Far-Field Speaker Verification with Distributed Microphone Arrays

Danwei Cai, Ming Li

Paper and Video (new tab)

With the successful application of deep speaker embedding networks, the performance of speaker verification systems has significantly improved under clean and close-talking settings; however, unsatisfactory performance persists under noisy and far-field environments. This study aims at improving the performance of far-field speaker verification systems with distributed microphone arrays in the smart home scenario. The proposed learning framework consists of two modules: a deep speaker embedding module and an aggregation module. The former extracts a speaker embedding for each recording. The latter, based on either averaged pooling or attentive pooling, aggregates speaker embeddings and learns a unified representation for all recordings captured by the distributed microphone arrays. The two modules are trained in an end-to-end manner. To evaluate this framework, we conduct experiments on the real text-dependent far-field dataset HI-MIA. Results show that our framework outperforms the naive averaged aggregation method by 20% in terms of equal error rate (EER) with six distributed microphone arrays. Also, we find that the attention-based aggregation favors high-quality recordings and down-weights low-quality ones.
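
A minimal PyTorch sketch of attentive aggregation over per-array embeddings, in the spirit of the aggregation module described above; the embedding dimension and scoring network are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveEmbeddingAggregation(nn.Module):
    """Combine speaker embeddings from several distributed microphone arrays
    into one representation using learned attention weights, so higher-quality
    recordings can dominate the aggregate. Dimensions are assumptions."""

    def __init__(self, emb_dim=256):
        super().__init__()
        self.scorer = nn.Linear(emb_dim, 1)   # one attention score per array

    def forward(self, embeddings):            # embeddings: (batch, num_arrays, emb_dim)
        scores = self.scorer(embeddings).squeeze(-1)       # (batch, num_arrays)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)   # (batch, num_arrays, 1)
        return (weights * embeddings).sum(dim=1)           # (batch, emb_dim)


if __name__ == "__main__":
    agg = AttentiveEmbeddingAggregation()(torch.randn(8, 6, 256))  # six arrays
    print(agg.shape)  # torch.Size([8, 256])
```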

1239: FINE-GRAINED EMOTION STRENGTH TRANSFER, CONTROL AND PREDICTION FOR EMOTIONAL SPEECH SYNTHESIS

Yi Lei, Shan Yang, Lei Xie

Paper and Video (new tab)

This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of synthesized speech. Such coarse labels cannot control the details of speech emotion, often resulting in an averaged emotion expression delivery, and it is also hard to choose suitable reference audio during inference. To conduct fine-grained emotion expression generation, we introduce phoneme-level emotion strength representations through a learned ranking function to describe the local emotion details, and the sentence-level emotion category is adopted to render the global emotions of synthesized speech. With the global render and local descriptors of emotions, we can obtain fine-grained emotion expressions from reference audio via its emotion descriptors (for transfer) or directly from phoneme-level manual labels (for control). As for the emotional speech synthesis with arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from texts, which does not require any reference audio or manual label.

09:34

1209: CONVOLUTION-BASED ATTENTION MODEL WITH POSITIONAL ENCODING FOR STREAMING SPEECH RECOGNITION ON EMBEDDED DEVICES

Jinhwan Park, Chanwoo Kim, Wonyong Sung

Paper and Video (new tab)

On-device automatic speech recognition (ASR) is strongly preferred over server-based implementations owing to its low latency and privacy protection. Many server-based ASRs employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with an extremely small number of states; however, they are inefficient for single-stream implementations in embedded devices. In this study, a highly efficient convolutional model-based ASR with monotonic chunkwise attention is developed. Although temporal convolution-based models allow more efficient implementations, they demand a long filter length to avoid looping or skipping problems. To remedy this problem, we added positional encoding, while shortening the filter length, to a convolution-based ASR encoder. It is demonstrated that the accuracy of the short filter-length convolutional model is significantly improved. In addition, the effect of positional encoding is analyzed by visualizing the attention energy and encoder outputs. The proposed model achieves a word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.
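
For illustration, the sketch below shows the standard sinusoidal positional encoding and how such an encoding could be added to a convolutional encoder's output; the feature dimension and sequence length are assumptions, and this is not necessarily the exact encoding used in the paper.

```python
import math
import torch

def sinusoidal_positional_encoding(length, dim):
    """Standard sinusoidal positional encoding; returns a (length, dim) tensor
    that can be added to frame features so that short-filter convolutions still
    see explicit position information. `dim` is assumed even."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)        # (length, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                       # (dim/2,)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


if __name__ == "__main__":
    frames = torch.randn(1, 200, 256)                            # (batch, time, feat)
    frames = frames + sinusoidal_positional_encoding(200, 256)   # broadcast over batch
    print(frames.shape)  # torch.Size([1, 200, 256])
```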

1157: SYNTH2AUG: CROSS-DOMAIN SPEAKER RECOGNITION WITH TTS SYNTHESIZED SPEECH

Yiling Huang, Yutian Chen, Jason Pelecanos, Quan Wang

Paper and Video (new tab)

In recent years, Text-To-Speech (TTS) has been used as a data augmentation technique for speech recognition to help complement inadequacies in the training data. Correspondingly, we investigate the use of a multi-speaker TTS system to synthesize speech in support of speaker recognition. In this study we focus the analysis on tasks where a relatively small number of speakers is available for training. We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance and can be combined effectively with multi-style training. Additionally, we explore the effectiveness of different types of text transcripts used for TTS synthesis. Results suggest that matching the textual content of the target domain is a good practice, and if that is not feasible, a transcript with a sufficiently large vocabulary is recommended.

1074: SUPERVISED AND UNSUPERVISED APPROACHES FOR CONTROLLING NARROW LEXICAL FOCUS IN SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS

Slava Shechtman, Raul Fernandez, David Haws

Paper and Video (new tab)

Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests that demonstrate we are able to successfully realize controllable focus while maintaining the same, or higher, naturalness over an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.

09:40

1245: LEARNING TO COUNT WORDS IN FLUENT SPEECH ENABLES ONLINE SPEECH RECOGNITION

George Sterpu, Christian Saam, Naomi Harte

Paper and Video (new tab)

Sequence-to-sequence models, in particular the Transformer, achieve state-of-the-art results in automatic speech recognition. Practical usage is however limited to cases where full utterance latency is acceptable. In this work we introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting. We use the cumulative word sum to dynamically segment speech and enable its eager decoding into words. Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets of English and Mandarin speech show that the online system performs comparably with the offline one when having a dynamic algorithmic delay of 5 segments. Furthermore, we show that the estimated segment length distribution resembles the word length distribution obtained with forced alignment, although our system does not require an exact segment-to-word equivalence. Taris introduces a negligible overhead compared to a standard Transformer, while the local relationship modelling between inputs and outputs grants invariance to sequence length by design.
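
A hypothetical sketch of an auxiliary word-counting head in the spirit of the description above: each encoder frame predicts a non-negative increment whose running sum is trained to match the number of words, providing a signal for dynamic segmentation. The layer sizes and loss are simplifying assumptions, not the Taris recipe.

```python
import torch
import torch.nn as nn

class WordCountHead(nn.Module):
    """Hypothetical auxiliary head: each encoder frame predicts an increment in
    [0, 1] and the cumulative sum acts as a running word-count estimate that
    can drive dynamic segmentation. Dimensions are illustrative."""

    def __init__(self, enc_dim=256):
        super().__init__()
        self.proj = nn.Linear(enc_dim, 1)

    def forward(self, enc_out):                                      # (batch, time, enc_dim)
        increments = torch.sigmoid(self.proj(enc_out)).squeeze(-1)   # (batch, time)
        return increments.cumsum(dim=1)                              # running count estimate


def count_loss(cumulative, num_words):
    """L1 penalty between the final cumulative estimate and the true word count."""
    return (cumulative[:, -1] - num_words.float()).abs().mean()


if __name__ == "__main__":
    counts = WordCountHead()(torch.randn(4, 300, 256))
    print(count_loss(counts, torch.tensor([12, 8, 20, 15])).item())
```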

1241: MULTI-FEATURE LEARNING WITH CANONICAL CORRELATION ANALYSIS CONSTRAINT FOR TEXT-INDEPENDENT SPEAKER VERIFICATION

Zheng Li, Miao Zhao, Lin Li, Qingyang Hong

Paper and Video (new tab)

In order to improve the performance and robustness of text-independent speaker verification systems, various speaker embedding representation learning algorithms have been developed. Typically, exploring manifold kinds of features to describe speaker-related embeddings is a common approach, such as introducing more acoustic features or different resolution scale features. In this paper, a new multi-feature learning strategy with a canonical correlation analysis (CCA) constraint is proposed to learn intrinsic speaker embeddings, which maximizes the correlation between two features from the same utterance. Based on the multi-feature learning structure, the CCA constraint layer and the CCA loss are utilized to explore the correlation representation between two kinds of features and alleviate the redundancy. Two multi-feature learning strategies are studied, using the pairwise acoustic features, and the pair of short-term and long-term features. Furthermore, we improve the long short-term feature learning structure by replacing the LSTM block with a Bidirectional-GRU (B-GRU) block and introducing more dense layers. The effectiveness of these improvements is shown on the VoxCeleb 1 evaluation set, the noisy VoxCeleb 1 evaluation set and the SITW evaluation set.
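
As a rough illustration of a correlation-based constraint, the sketch below maximizes the per-dimension Pearson correlation between two embeddings of the same utterance. This is a simplified stand-in, not the full CCA objective used in the paper.

```python
import torch

def correlation_loss(feat_a, feat_b, eps=1e-8):
    """Simplified correlation-maximization loss: returns the negative mean
    per-dimension Pearson correlation between two embeddings of the same
    utterance, to be minimized alongside the main objective.

    feat_a, feat_b: (batch, dim) embeddings from the two feature branches.
    """
    a = feat_a - feat_a.mean(dim=0, keepdim=True)
    b = feat_b - feat_b.mean(dim=0, keepdim=True)
    cov = (a * b).mean(dim=0)
    corr = cov / (a.std(dim=0) * b.std(dim=0) + eps)
    return -corr.mean()


if __name__ == "__main__":
    print(correlation_loss(torch.randn(32, 192), torch.randn(32, 192)).item())
```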

1409: GraphPB: GRAPHICAL REPRESENTATIONS OF PROSODY BOUNDARY IN SPEECH SYNTHESIS

Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, Jing Xiao

Paper and Video (new tab)

This paper introduces a graphical representation approach of prosody boundary (GraphPB) in the task of Chinese speech synthesis, intending to parse the semantic and syntactic relationship of input sequences in a graphical domain to improve prosody performance. The nodes of the graph embedding are formed by prosodic words, and the edges are formed by the other prosodic boundaries, namely the prosodic phrase boundary (PPH) and the intonation phrase boundary (IPH). Different graph neural networks (GNN), such as the Gated Graph Neural Network (GGNN) and Graph Long Short-term Memory (G-LSTM), are utilised as graph encoders to exploit the graphical prosody boundary information. A graph-to-sequence model is proposed, formed by a graph encoder and an attentional decoder. Two techniques are proposed to embed sequential information into the graph-to-sequence text-to-speech model. The experimental results show that the proposed approach can encode the phonetic and prosody rhythm of an utterance. The mean opinion score (MOS) of these GNN models shows comparable results to state-of-the-art sequence-to-sequence models, with better performance in the aspect of prosody. This provides an alternative approach for prosody modelling in end-to-end speech synthesis.

09:46

1328: BENCHMARKING LF-MMI, CTC AND RNN-T CRITERIA FOR STREAMING ASR

Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

Paper and Video (new tab)

In this work, we perform comprehensive evaluations of automatic speech recognition (ASR) accuracy and efficiency with three popular training criteria for latency-controlled streaming ASR applications: LF-MMI, CTC and RNN-T. In recognizing challenging social media videos in 7 languages, with training data sized from 3K to 14K hours, we conduct large-scale controlled experimentation across each training criterion with identical datasets and encoder model architecture, and find that RNN-T models have a consistent advantage in word error rate (WER) while CTC models have a consistent advantage in inference efficiency measured by real-time factor (RTF). Additionally, for different training criteria, we selectively examine various modeling strategies including modeling units, encoder architectures, pre-training, etc. To our best knowledge, this is the first comprehensive benchmark of these three widely-used ASR training criteria on real-world streaming ASR applications over multiple languages.

1350: IMPROVING SPEAKER RECOGNITION WITH QUALITY INDICATORS

Hrishikesh Rao, Kedar Phatak, Elie Khoury

Paper and Video (new tab)

Nuisance factors such as short duration, noise and transmission conditions still pose accuracy challenges to state-of-the-art automatic speaker verification (ASV) systems. To address this problem, we propose a no-reference system that consumes quality indicators encapsulating information about the duration of speech, acoustic events and codec artifacts. These quality indicators are used as estimates to measure how close a given speech utterance would be to a high-quality speech segment uttered by the same speaker. The proposed measures, when fused with a baseline ASV system, are found to improve the performance of speaker recognition. The experimental study carried out on the NIST SRE 2019 dataset shows a relative decrease of 9.6% in equal error rate (EER) compared to the baseline.

1439: HIERARCHICAL PROSODY MODELING FOR NON-AUTOREGRESSIVE SPEECH SYNTHESIS

Chung-Ming Chien, Hung-yi Lee

Paper and Video (new tab)

Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By explicitly providing prosody features to the TTS model, the style of synthesized utterances can thus be controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposed a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features. The proposed method outperforms other competitors in terms of audio quality and prosody naturalness on objective and subjective evaluation.

09:52

1351: ALIGNMENT RESTRICTED STREAMING RECURRENT NEURAL NETWORK TRANSDUCER

Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-feng Yeh, Christian Fuegen, Michael L Seltzer

Paper and Video (new tab)

There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.

1423: AUDIO ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF AUDIO REPRESENTATION

Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Yen-Hao Chen, Shang-Wen Li, Hung-yi Lee

Paper and Video (new tab)

Self-supervised speech models are powerful speech representation extractors for downstream applications. Recently, larger models have been utilized in acoustic model training to achieve better performance. We propose Audio ALBERT, a lite version of the self-supervised speech representation model. We apply the light-weight representation extractor to two downstream tasks, speaker classification and phoneme classification. We show that Audio ALBERT achieves performance comparable with massive pre-trained networks in the downstream tasks while having 91% fewer parameters. Moreover, we design probing models to measure how much the latent representations can encode the speaker’s and phoneme’s information. We find that the representations encoded in internal layers of Audio ALBERT contain more information for both phoneme and speaker than the last layer, which is generally used for downstream tasks. Our findings provide a new avenue for using self-supervised networks for better performance and efficiency.
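
The parameter saving described above comes from ALBERT-style cross-layer weight sharing; the sketch below shows the general idea in PyTorch, where a single transformer encoder layer is reused at every depth and the intermediate outputs remain available for probing. Hyper-parameters are illustrative and not those of Audio ALBERT.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style cross-layer weight sharing: one transformer encoder layer
    is reused for every depth step, cutting parameters roughly by the number
    of layers. Hyper-parameters here are illustrative assumptions."""

    def __init__(self, dim=768, heads=12, num_repeats=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.num_repeats = num_repeats

    def forward(self, x, return_all=False):    # x: (batch, time, dim)
        hidden = []
        for _ in range(self.num_repeats):      # same weights applied at every depth
            x = self.layer(x)
            hidden.append(x)
        return hidden if return_all else x     # internal layers can be probed too


if __name__ == "__main__":
    out = SharedLayerEncoder()(torch.randn(2, 50, 768))
    print(out.shape)  # torch.Size([2, 50, 768])
```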

1375: WHISPERED AND LOMBARD NEURAL SPEECH SYNTHESIS

Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan

Paper and Video (new tab)

It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) pre-training and fine-tuning a model for each style; 2) Lombard and whisper speech conversion through a signal-processing-based approach; 3) multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles; 2) although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard or whisper voice is used to pretrain it, the SV model can be used as a style encoder for generating different style embeddings as input to the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.

09:58

Q&A
Use the room number in parentheses to join the individual Zoom breakout room for each paper.

1284: NEURAL MOS PREDICTION FOR SYNTHESIZED SPEECH USING MULTI-TASK LEARNING WITH SPOOFING DETECTION AND SPOOFING TYPE CLASSIFICATION

Yeunju Choi, Youngmoon Jung, Hoirin Kim

Paper and Video (new tab)

Several studies have proposed deep-learning-based models to predict the mean opinion score (MOS) of synthesized speech, showing the possibility of replacing human raters. However, inter- and intra-rater variability in MOSs makes it hard to ensure the high performance of the models. In this paper, we propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model using the following two auxiliary tasks: spoofing detection (SD) and spoofing type classification (STC). In addition, we use the focal loss to maximize the synergy between SD and STC for MOS prediction. Experiments using the MOS evaluation results of the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction. Our proposed model achieves up to an 11.6% relative improvement in performance over the baseline model.
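
For reference, here is a hedged sketch of how a focal loss can be combined with MOS regression in a multi-task objective; the task weights and gamma value are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss, down-weighting well-classified examples.
    logits: (batch, classes), targets: (batch,) class indices. gamma is illustrative."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()


def multitask_loss(mos_pred, mos_target, sd_logits, sd_target, stc_logits, stc_target,
                   w_sd=0.1, w_stc=0.1):
    """Sketch of a multi-task objective combining MOS regression with spoofing
    detection (SD) and spoofing type classification (STC); the weights are assumptions."""
    mos_loss = F.mse_loss(mos_pred, mos_target)
    return mos_loss + w_sd * focal_loss(sd_logits, sd_target) \
                    + w_stc * focal_loss(stc_logits, stc_target)


if __name__ == "__main__":
    print(focal_loss(torch.randn(8, 2), torch.randint(0, 2, (8,))).item())
```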

10:04

Q&A
Use the room number in parentheses to join the individual Zoom breakout room for each paper.

10:10
