Audio-visual (AV) Automatic Speech Recognition (ASR) refers to the problem of recognizing speech using both audio and video information. Seminal work in psychology has shown that speech perception is not a purely auditory process: listeners also track and recognize the spatiotemporal visual patterns associated with lip and mouth movements.
Over the past years, this correlation between audio and visual information has occasionally been explored by the speech and computer vision communities in order to develop ASR systems that are more robust in noisy auditory environments (e.g. low-quality audio, background noise, and multiple speakers). However, AV ASR has so far been studied only on relatively small datasets, most of which were collected under laboratory conditions.
TalkingHeads goes beyond the state of the art by addressing AV ASR in videos collected from real-world multimedia databases, using end-to-end Deep Learning methods.
On 7 September 2017, we are co-organizing a workshop on Lip-reading using Deep Learning methods as part of the BMVC-2017 conference at Imperial College London. Submit your papers and/or extended abstracts!
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory (LSTM) networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains 83.0% word accuracy, a 6.8% absolute improvement over the previous state of the art.
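To make the pipeline concrete, here is a minimal PyTorch sketch of the kind of architecture described above: a spatiotemporal (3D) convolutional front-end, a residual 2D trunk applied per frame, and a bidirectional LSTM back-end over the frame-feature sequence. All layer sizes, kernel shapes, and the single residual block are illustrative assumptions for clarity, not the exact published configuration.

```python
# Hypothetical sketch of a 3D-conv + residual + BiLSTM lipreading network.
# Layer sizes are illustrative assumptions, not the paper's exact config.
import torch
import torch.nn as nn

class LipreadingNet(nn.Module):
    def __init__(self, num_classes=500, hidden=256):
        super().__init__()
        # Spatiotemporal (3D) convolution over the grayscale mouth-region clip
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3),
                         stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # One residual 2D block applied per frame (a stand-in for a ResNet trunk)
        self.res_block = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Bidirectional LSTM over the per-frame feature sequence
        self.blstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                # x: (B, 1, T, H, W)
        x = self.front3d(x)              # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = torch.relu(x + self.res_block(x))  # residual connection per frame
        x = self.pool(x).view(b, t, c)         # (B, T, 64) frame features
        x, _ = self.blstm(x)                   # (B, T, 2*hidden)
        return self.fc(x.mean(dim=1))          # average over time -> word logits

model = LipreadingNet()
logits = model(torch.randn(2, 1, 29, 112, 112))  # two 29-frame mouth clips
```

Averaging the BiLSTM outputs over time and classifying into one of 500 words mirrors the word-level (rather than sentence-level) formulation of the task; a sequence-level loss such as CTC would be the natural alternative for continuous recognition.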
Accepted at Interspeech-2017 (Stockholm, Sweden)
Marie Curie Research Fellow
Themos Stafylakis is a Marie Curie Research Fellow in the School of Computer Science at the University of Nottingham.
He was a Postdoctoral Researcher at the Centre de Recherche Informatique de Montréal (CRIM) in Canada, working with Dr Patrick Kenny on the problems of Speaker Recognition and Speaker Diarization. His main research interests are Deep Learning and Bayesian approaches to Speech and Speaker Recognition and Computer Vision.
Georgios (Yorgos) Tzimiropoulos is an Assistant Professor in the School of Computer Science at the University of Nottingham and member of the Computer Vision Laboratory.
He has worked on the problems of object detection and tracking, alignment and pose estimation, and recognition with humans and faces being the focal point of his research. He has approached these problems mainly using tools from Mathematical Optimization and Machine Learning. His current focus is on Deep Learning.