Detail of the student project

Topic: Lip activity detector
Department: Department of Cybernetics
Supervisor: Ing. Jan Čech, Ph.D.
Announce as: DP, BP, SOP, PRO
Description: lip activity
A lip activity detector (also called a speech activity detector) [1] is an algorithm which automatically identifies whether a person in a video is speaking at a given time. This task is important in the audio-visual diarization problem [2,3], i.e. recognizing who speaks when, and in audio-visual cross-modal identity recognition and learning. A typical situation is that multiple people are in the camera's field of view while an audio signal is perceived; the question then is who is speaking.

The speech activity detector can be visual only (based on observing the statistics of intensity variations in the mouth region), or audio-visual (exploiting the synchrony between the two modalities [4]). A basic study of these approaches is needed, together with a detailed analysis of the impact of video resolution and camera viewpoint. We will also provide a simple ground-truth-annotated dataset containing examples of positive (speaking) and negative (non-speaking, listening) cases.
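To make the visual-only idea concrete, a minimal sketch of such a detector follows: score each frame by the mean absolute intensity change inside the mouth region, smooth the scores over time, and threshold. The function names, the fixed mouth box, and the thresholding scheme are illustrative assumptions, not part of the provided code; a real system would take the mouth region from a facial landmark detector such as [5] or [6].

```python
import numpy as np

def lip_activity_scores(frames, mouth_box):
    """Per-frame-transition score: mean absolute intensity change in the mouth ROI.

    frames: sequence of grayscale frames, each a 2-D array (H, W).
    mouth_box: (top, bottom, left, right) pixel bounds of the mouth region
               (hypothetical fixed box; in practice taken from landmarks).
    Returns an array of len(frames) - 1 scores.
    """
    t, b, l, r = mouth_box
    rois = np.asarray([np.asarray(f, dtype=np.float64)[t:b, l:r] for f in frames])
    diffs = np.abs(np.diff(rois, axis=0))   # frame-to-frame intensity change
    return diffs.mean(axis=(1, 2))          # one scalar score per transition

def detect_speaking(scores, threshold, win=5):
    """Label each transition as speaking/non-speaking.

    Smooths the scores with a moving average (to bridge brief pauses
    between lip movements) and applies a fixed threshold.
    """
    kernel = np.ones(win) / win
    smoothed = np.convolve(scores, kernel, mode="same")
    return smoothed > threshold
```

The threshold would have to be calibrated on the annotated dataset mentioned above, and the raw intensity statistic is sensitive to head motion and lighting, which is exactly why the impact of resolution and viewpoint needs the systematic study described here.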

Code for precise localization of facial landmarks will be provided [5,6].

If this work is successful, the task can be naturally extended to a broader audio-visual diarization algorithm, which would integrate multiple audio-visual cues, e.g. video identity from face recognition, visual identity from clothing models, audio identity from voice recognition, direction estimates from a microphone array, etc. Therefore this topic can be chosen as a semester, bachelor, or master project.

References

[1] K. C. van Bree. Design and realisation of an audiovisual speech activity detector. Technical Report PR-TN 2006/00169, Philips Research Europe, 2006.
[2] Felicien Vallet, Slim Essid, and Jean Carrive. A Multimodal Approach to Speaker Diarization on TV Talk-Shows. IEEE Trans. on Multimedia, 15(3), 2013.
[3] Athanasios Noulas, Gwenn Englebienne, and Ben J.A. Kroese. Multimodal Speaker Diarization. IEEE Trans. on PAMI, 34(1), 2012.
[4] Einat Kidron, Yoav Y. Schechner, and Michael Elad. Pixels that sound. In CVPR, 2005.
[5] Jan Čech, Vojtěch Franc, and Jiří Matas. A 3D Approach to Facial Landmarks: Detection, Refinement, and Tracking. In Proc. ICPR, 2014.
[6] M. Uricar, V. Franc, and V. Hlavac. Detector of Facial Landmarks Learned by the Structured Output SVM. In Proc. VISAPP, 2012. http://cmp.felk.cvut.cz/~uricamic/flandmark/
Realization form: software, technical report
Date: 13.05.2014
Responsible person: Petr Pošík