Audio-visual (AV) Automatic Speech Recognition (ASR) refers to the problem of recognizing speech using both audio and video information. Seminal work in psychology has shown that speech perception is not a purely auditory process: listeners also track and recognize the spatiotemporal visual patterns associated with lip and mouth movements.
Over the years, this correlation between audio and visual information has occasionally been explored by the speech and computer vision communities in order to develop ASR systems that are more robust when the auditory environment is adverse (e.g. low-quality audio, background noise, multiple speakers). However, AV ASR has been studied only on relatively small datasets, most of which were collected under laboratory conditions.
TalkingHeads goes beyond the state-of-the-art by addressing the problem of AV ASR in videos collected from real-world multimedia databases, using end-to-end Deep Learning methods.
On 7 September 2017, we are co-organizing a workshop on Lip-reading using Deep Learning methods as part of the BMVC 2017 conference at Imperial College London. Submit your papers and/or extended abstracts!
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art.
Accepted at Interspeech-2017 (Stockholm, Sweden)
T Stafylakis and G Tzimiropoulos, “Combining Residual Networks with LSTMs for Lipreading”, ISCA Interspeech 2017.
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.
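As a rough illustration of how data flows through such a pipeline, the sketch below tracks tensor shapes through a spatiotemporal convolutional front-end, a per-frame ResNet, and a bidirectional LSTM back-end. The kernel sizes, strides, channel widths and frame count used here are illustrative assumptions, not the exact values of the published model.

```python
# Hypothetical shape bookkeeping for an ST-CNN -> ResNet -> BiLSTM pipeline.
# All hyperparameters below are assumptions chosen for illustration.

def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def lipreading_shapes(T=29, H=112, W=112):
    # 3D convolutional front-end: temporal kernel 5 (stride 1, pad 2),
    # spatial kernel 7 (stride 2, pad 3) -> preserves T, halves H and W.
    t = conv_out(T, 5, 1, 2)
    h = conv_out(H, 7, 2, 3)
    w = conv_out(W, 7, 2, 3)
    frontend = (t, 64, h, w)        # (time, channels, height, width)

    # 2D ResNet applied per frame, ending in global average pooling:
    # each frame collapses to a single 512-d feature vector.
    resnet = (t, 512)

    # Bidirectional LSTM over the frame sequence (256 units per direction),
    # followed by temporal aggregation and a softmax over 500 words.
    bilstm = (t, 2 * 256)
    logits = (500,)
    return frontend, resnet, bilstm, logits
```

The point of the 3D front-end is that it preserves the temporal resolution while downsampling space, so the recurrent back-end sees one feature vector per video frame.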
T Stafylakis and G Tzimiropoulos, “Deep word embeddings for visual speech recognition”, IEEE ICASSP 2018.
In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to word recognition, while suppressing other sources of variability such as speaker, pose and illumination. The system comprises a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs, and is trained on the Lipreading In-The-Wild database. We first show that the proposed architecture goes beyond the state-of-the-art on closed-set word identification, attaining an 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings to model words unseen during training: we deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on unseen words. The experiments demonstrate that word-level visual speech recognition is feasible even when the target words are not included in the training set.
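To make the low-shot idea concrete, here is a minimal stand-in for such an evaluation protocol: enroll one or a few embeddings per unseen word and classify queries by cosine similarity to the class means. The paper models the embeddings with PLDA; this nearest-class-mean sketch is a deliberately simplified substitute, and all names and dimensions are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale a vector to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def enroll(examples_per_word):
    """examples_per_word: dict word -> (n_i, d) array of embeddings.
    Returns dict word -> unit-norm class mean (the low-shot 'model')."""
    return {w: normalize(e.mean(axis=0)) for w, e in examples_per_word.items()}

def classify(models, query):
    """Return the enrolled word whose mean is most cosine-similar to query."""
    q = normalize(query)
    return max(models, key=lambda w: float(models[w] @ q))
```

With a single enrollment example per word, this reduces to nearest-neighbour matching; PLDA additionally models within-word variability, which is why the paper uses it instead.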
T Stafylakis and G Tzimiropoulos, “Zero-shot keyword spotting for visual speech recognition in-the-wild”, ECCV 2018.
Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods.
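A toy version of the scoring idea might look as follows: map the query word to a phoneme sequence, embed it, and take the maximum cosine similarity against sliding windows of the visual feature sequence. The G2P lookup table and mean-pooled embeddings below are hypothetical stand-ins for the paper's learned sequence-to-sequence and recurrent models.

```python
import numpy as np

# Toy grapheme-to-phoneme mapping; the paper learns this with a
# sequence-to-sequence model, here it is a hypothetical lookup table.
G2P = {"hello": ["HH", "AH", "L", "OW"]}

def keyword_embedding(word, phone_vectors):
    """Average the (assumed, pre-trained) phoneme vectors of the keyword."""
    return np.mean([phone_vectors[p] for p in G2P[word]], axis=0)

def kws_score(visual_feats, query, win=4):
    """Max cosine similarity between the query embedding and sliding-window
    averages of per-frame visual features (shape (T, d))."""
    q = query / np.linalg.norm(query)
    best = -1.0
    for t in range(len(visual_feats) - win + 1):
        seg = visual_feats[t:t + win].mean(axis=0)
        best = max(best, float(seg @ q / np.linalg.norm(seg)))
    return best  # threshold this to decide keyword presence
```

Because the keyword representation is built from phonemes rather than trained per-word weights, the same scoring applies to words never seen during training, which is the zero-shot setting of the paper.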
S Petridis, T Stafylakis, P Ma, F Cai, G Tzimiropoulos, M Pantic, “End-to-end Audiovisual Speech Recognition”, IEEE ICASSP 2018.
Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one per modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU, and the fusion of the streams/modalities takes place via another 2-layer BGRU. A slight improvement in classification rate over end-to-end audio-only and MFCC-based models is reported in clean audio conditions and at low levels of noise. In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
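A minimal sketch of the two-stream fusion layout, with fixed random linear maps standing in for the per-modality ResNet/BGRU encoders (the encoders, dimensions and seeds here are assumptions for illustration only):

```python
import numpy as np

def encode(stream, d_out, seed):
    """Stand-in for a per-modality encoder: a fixed random linear map.
    In the actual model this would be a ResNet (video) or 1D-conv
    front-end (raw audio) followed by a 2-layer BGRU."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((stream.shape[1], d_out)) / np.sqrt(stream.shape[1])
    return stream @ W  # (T, d_out)

def fuse(audio, video, d=64):
    """Encode each modality, then concatenate per time step; a further
    2-layer BGRU would consume this (T, 2d) sequence in the full model."""
    a = encode(audio, d, seed=0)   # audio: (T, d_audio) frame-rate features
    v = encode(video, d, seed=1)   # video: (T, d_video) mouth-ROI features
    assert a.shape[0] == v.shape[0], "streams must be temporally aligned"
    return np.concatenate([a, v], axis=1)
```

The key design point the abstract describes is that fusion happens on learned features per time step, so the fusion BGRU can weigh the modalities differently as audio quality changes.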
N Brummer, A Silnova, L Burget, T Stafylakis, “Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model”, ISCA Odyssey 2018.
Embeddings in machine learning are low-dimensional representations of complex input patterns, with the property that simple geometric operations like Euclidean distances and dot products can be used for classification and comparison tasks. The proposed meta-embeddings are special embeddings that live in more general inner product spaces. They are designed to propagate uncertainty to the final output in speaker recognition and similar applications. The familiar Gaussian PLDA model (GPLDA) can be re-formulated as an extractor for Gaussian meta-embeddings (GMEs), such that likelihood-ratio scores are given by Hilbert space inner products between Gaussian likelihood functions. GMEs extracted by the GPLDA model have fixed precisions and do not propagate uncertainty. We show that a generalization to heavy-tailed PLDA gives GMEs with variable precisions, which do propagate uncertainty. Experiments on NIST SRE 2010 and 2016 show that the proposed method applied to i-vectors without length normalization is up to 20% more accurate than GPLDA applied to length-normalized i-vectors.
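The scoring rule described above can be written down concretely. Assuming a standard normal prior and meta-embeddings represented by Gaussian natural parameters (a, B), the same-vs-different log-likelihood ratio reduces to closed-form Gaussian integrals; the sketch below implements that computation, though the exact parameterization of the published model may differ.

```python
import numpy as np

def log_expectation(a, B):
    """log E_{x ~ N(0, I)} exp(a'x - x'Bx/2) for a Gaussian meta-embedding
    with natural parameters (a, B): a standard Gaussian integral giving
    -0.5*logdet(I+B) + 0.5*a'(I+B)^{-1}a."""
    M = np.eye(len(a)) + B
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * logdet + 0.5 * a @ np.linalg.solve(M, a)

def gme_llr(e1, e2):
    """Same-vs-different log-likelihood-ratio score: pool the natural
    parameters (same-speaker hypothesis) and compare with scoring each
    embedding alone (different-speaker hypothesis)."""
    (a1, B1), (a2, B2) = e1, e2
    return (log_expectation(a1 + a2, B1 + B2)
            - log_expectation(a1, B1) - log_expectation(a2, B2))
```

Under GPLDA every embedding would share the same fixed precision B; the heavy-tailed generalization in the paper makes B vary per embedding, which is how uncertainty propagates into the score.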
KA Lee, H Sun, S Aleksandr, W Guangsen, T Stafylakis, G Tzimiropoulos, et al. “The I4U submission to the 2016 NIST speaker recognition evaluation”, NIST SRE 2016 Workshop, 2016.
The I4U submission to SRE’16 resulted from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The submitted results were based on the fusion of multiple classifiers. Considerable effort was devoted to two major challenges, namely test-duration variability and the dataset shift from the Switchboard and Mixer corpora to the new Call My Net dataset.
KA Lee, V Hautamäki, T Kinnunen, A Larcher, C Zhang, A Nautsch, T Stafylakis, G Tzimiropoulos, et al. “The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016”, ISCA Interspeech 2017.
The 2016 speaker recognition evaluation (SRE’16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE’16, the result of collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Considerable effort was devoted to two major challenges, namely unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents the shared view of the sixteen research groups on recent advances, major paradigm shifts, and the common tool chain used in speaker recognition, as witnessed in SRE’16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.
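Score-level fusion of many sub-systems, as used in such joint submissions, typically amounts to a calibrated linear combination of the per-system scores. The sketch below shows only that final step; the weights would normally be trained by logistic regression on a held-out development set, and all values here are illustrative.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights, bias):
    """Linear score fusion: a weighted sum of per-subsystem detection
    scores plus an offset, mapped through a sigmoid so the fused score
    can be read as a calibrated posterior probability."""
    s = np.asarray(subsystem_scores) @ np.asarray(weights) + bias
    return 1.0 / (1.0 + np.exp(-s))
```

A fused system benefits when the sub-systems make complementary errors, which is one motivation for the large-scale collaboration the paper discusses.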
Marie Curie Research Fellow
Themos Stafylakis is a Marie Curie Research Fellow in the School of Computer Science at the University of Nottingham.
He was a Postdoctoral Researcher at the Centre de Recherche Informatique de Montréal (CRIM) in Canada, working with Dr Patrick Kenny on the problems of Speaker Recognition and Speaker Diarization. His main research interests are Deep Learning and Bayesian approaches to Speech and Speaker Recognition, and Computer Vision.
Georgios (Yorgos) Tzimiropoulos is an Assistant Professor in the School of Computer Science at the University of Nottingham and a member of the Computer Vision Laboratory.
He has worked on the problems of object detection and tracking, alignment and pose estimation, and recognition with humans and faces being the focal point of his research. He has approached these problems mainly using tools from Mathematical Optimization and Machine Learning. His current focus is on Deep Learning.