Workshop on

“Lip-Reading using deep learning methods”


BMVC 2017

7th September 2017

Imperial College London

– Introduction –

Lip-reading is of major importance for a wide range of applications, such as silent dictation, speech recognition in noisy environments, improved hearing aids and biometrics. It lies at the intersection of computer vision and speech recognition, the two fields that pioneered deep learning methods. The aim of the workshop is to bring together researchers working on audiovisual speech recognition and lip-reading, so that they can disseminate their work and exchange views on the new possibilities that the advent of deep learning methods has created for the field.

– Keynote speakers –

Prof Andrew Zisserman & Joon Son Chung

University of Oxford & Google DeepMind, UK

Learning to lip read by watching TV

In this talk we describe a sequence-to-sequence model for lip reading. It is able to recognize phrases and sentences spoken by a talking face, with or without audio, in an open-world scenario with an unrestricted vocabulary. An important aspect of training such models is the availability of a large dataset of aligned input and output sequences. We describe how, in this case, the dataset was built automatically from TV broadcasts, using a form of self-supervision for the alignment. The trained model can indeed read lips, with a performance that exceeds human ability, and we also show that lip reading can improve the performance of automated speech recognition. We will further describe how the architecture can be repurposed to generate talking faces synchronized with speech.

– Invited speaker –

Dr Helen L Bear

University of East London, UK

Visual speech processing: What’s the problem?

Visual speech processing, or computer lip reading, has recently benefited from developments in deep learning. This renewed interest in a difficult problem has given our community two complementary approaches to researching the visual channel of the speech signal: we can develop end-to-end systems, or we can pursue a deeper understanding of the information carried in this channel. The advantage of the second, more difficult, approach is that, in addition to improving lip-reading systems, the resulting knowledge has the potential to be applied in new domains.

– Programme –

7 September 2017 / Imperial College London – Huxley building

14:25 – 14:30

14:30 – 15:30
Keynote by Prof. Andrew Zisserman and Joon Son Chung

15:30 – 16:00
Invited Speaker: Dr. Helen L. Bear

16:00 – 16:30

16:30 – 17:20
Spotlight Session

17:20 – 18:30
Poster Session

– Accepted Papers & Extended Abstracts –

Papers

[1] M Kubokawa and T Saitoh, “Intensity Correction Effect for Lip Reading”
[2] H L Bear, “Visual gesture variability between talkers in continuous visual speech”
[3] K Thangthai, H L Bear and R Harvey, “Comparing phonemes and visemes with DNN-based lipreading”
[4] H L Bear and S Taylor, “Visual speech recognition: aligning terminologies for improved understanding”

Extended Abstracts

[5] J-C Hou, S-S Wang, Y Tsao and H-M Wang, “Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network”
[6] L Liu, G Feng and D Beautemps, “Inner lips features extraction based on CLNF with hybrid dynamic template for cued speech”
[7] C Wright and D Stewart, “Real-World DataSets for Lip-Based Research”
[8] G Sterpu and N Harte, “Lipreading Sentences with Sequence to Sequence Models: Preliminary Results”
[9] T Stafylakis and G Tzimiropoulos, “Visual word recognition using Residual Networks and LSTMs”
[10] S Petridis, Y Wang, Z Li and M Pantic, “End-to-end multi-view lipreading”

– Call for Papers & Extended Abstracts –

The topics covered by the workshop are listed below:


– Deep Learning methods for Lip-Reading

– Audiovisual Speech Recognition and fusion methods

– Combinations of probabilistic and Deep Learning methods for Lip-Reading

– Visual units for Lip-Reading

– Audiovisual biometrics

– Tracking methods for Lip-Reading

– Multi-View Lip-Reading

– Audiovisual Speech Synthesis

– Lip-Reading and Audiovisual Databases


The workshop organizers further encourage paper submissions that are not covered by the above list but are related to Lip-Reading. The templates for regular papers are the same as those of the main conference.
The deadlines for regular papers are given below; each falls at 11:59 pm (Pacific Time) on the corresponding day.

Paper submission deadline

13 July 2017

Paper acceptance notification

18 July 2017

Camera ready

25 July 2017


Presentations not accompanied by a paper are also encouraged, in order to promote dissemination of recent and ongoing research. In this case, a 2-page extended abstract is required. Extended abstracts are not considered publications and will not be included in the conference proceedings.

The deadline for submission of extended abstracts is 24 August 2017.

Papers and extended abstracts should be submitted through the workshop's CMT site; each submission will be reviewed by at least two reviewers.

– Organising Committee –

Dr Themos Stafylakis

Marie-Curie Fellow
Computer Vision Laboratory

Dr Themos Stafylakis is a Marie Curie Fellow working on audiovisual speech recognition at the Computer Vision Laboratory of the University of Nottingham. He has a strong publication record in audio-based speech and speaker recognition, the result of a five-year post-doc at CRIM (Montreal, Canada) under the supervision of Patrick Kenny. He is currently working on lip-reading and audiovisual speech recognition using deep learning methods.

Dr Stavros Petridis

Research Fellow
Intelligent Behaviour Understanding Group (i-bug)


Dr Stavros Petridis is a research fellow at Imperial College London (i-bug group). He has worked on a wide range of human behaviour understanding problems, including audiovisual laughter recognition, facial expression recognition, age/gender recognition, native-speaker recognition and face re-identification. He is currently working on deep learning models for audiovisual fusion, lip-reading and facial expression recognition.

Dr Georgios Tzimiropoulos

Assistant Professor
Computer Vision Laboratory

Dr Georgios Tzimiropoulos is an Assistant Professor in Computer Vision at the University of Nottingham. He has worked on the problems of object detection and tracking, alignment and pose estimation, and recognition, with humans and faces as the focal point of his research. In this work he has used a variety of tools from mathematical optimization and machine learning. His current focus is on deep learning.

Prof Maja Pantic

Intelligent Behaviour Understanding Group (i-bug)

Prof Maja Pantic is Professor of Affective and Behavioral Computing at Imperial College London and at the University of Twente, and the leader of the i-bug group. She has published more than 250 technical papers in the areas of machine analysis of facial expressions, machine analysis of human body gestures, audiovisual analysis of emotions and social signals, and human-centered machine interfaces. She is also active in Lip-Reading and Audio-Visual speech recognition.

– Scientific Programme Committee –

Dr Gerasimos Potamianos, University of Thessaly, Greece

Dr Andrew Senior, Google DeepMind, London, UK

Dr Takeshi Saitoh, Kyushu Institute of Technology, Japan

Dr Helen L Bear, University of East London, UK

Dr Darryl Stewart, Queen’s University Belfast, N. Ireland, UK

Xiaohua Huang, University of Oulu, Finland

Yannis Assael, Google DeepMind, London, UK

Brendan Shillingford, Google DeepMind, London, UK

Joon Son Chung, University of Oxford, UK

– Venue –

The workshop on “Lip-Reading using deep learning methods” is part of BMVC 2017, which is hosted at Imperial College London.

For more information on BMVC 2017, please visit the conference website.