This study proposes a new modular architecture for multimodal talker localization in video conferencing environments. Data streams from the sensing devices are first decoupled, and purpose-specific localization methods locate the talker independently. The individual localization results are then combined using data fusion techniques to form the final estimate of the talker's location. The proposed architecture has the advantage of flexibility: an additional localization modality can be added simply by duplicating a functional module in the architecture with a new sensor and its associated localizer. The architecture is tested with three localization modalities: one audio localizer and two different video localizers. The results demonstrate that the modular architecture yields a multimodal localization method that outperforms the audio and video localizers when each is used as a stand-alone localization method.
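To illustrate the modular structure described above, the following is a minimal sketch of how such a pipeline might be organized: each sensing modality gets its own localizer module, and a fusion stage combines their independent estimates. All class names, the example coordinates, and the confidence-weighted averaging rule are hypothetical stand-ins; the paper does not specify its fusion technique, so weighted averaging is used here purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocalizationEstimate:
    """One module's estimate: an (x, y) talker position plus a confidence weight."""
    position: Tuple[float, float]
    confidence: float


class Localizer:
    """Base class for a purpose-specific localization module (one per sensor)."""
    def locate(self, frame) -> LocalizationEstimate:
        raise NotImplementedError


class AudioLocalizer(Localizer):
    """Placeholder audio localizer (e.g. microphone-array based)."""
    def locate(self, frame) -> LocalizationEstimate:
        # Hypothetical output for illustration only.
        return LocalizationEstimate(position=(1.2, 0.8), confidence=0.6)


class VideoLocalizer(Localizer):
    """Placeholder video localizer (e.g. face- or motion-based)."""
    def __init__(self, default: Tuple[float, float], confidence: float):
        self.default = default
        self.confidence = confidence

    def locate(self, frame) -> LocalizationEstimate:
        return LocalizationEstimate(position=self.default, confidence=self.confidence)


def fuse(estimates: List[LocalizationEstimate]) -> Tuple[float, float]:
    """Confidence-weighted average of the individual estimates
    (one of many possible fusion rules; not the paper's stated method)."""
    total = sum(e.confidence for e in estimates)
    x = sum(e.position[0] * e.confidence for e in estimates) / total
    y = sum(e.position[1] * e.confidence for e in estimates) / total
    return (x, y)


if __name__ == "__main__":
    # One audio and two video localizers, mirroring the three modalities tested;
    # adding a modality means appending another module to this list.
    modules: List[Localizer] = [
        AudioLocalizer(),
        VideoLocalizer(default=(1.0, 1.0), confidence=0.8),
        VideoLocalizer(default=(1.1, 0.9), confidence=0.7),
    ]
    frame = None  # stand-in for a synchronized block of sensor data
    estimates = [m.locate(frame) for m in modules]
    print("Fused talker position:", fuse(estimates))
```

The point of the sketch is the decoupling: each localizer runs on its own sensor stream and only the lightweight estimates reach the fusion stage, so a new modality can be added without modifying the existing modules.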

Additional Metadata
Conference Proceedings - 3rd IEEE International Workshop on Haptic, Audio and Visual Environments and their Applications - HAVE 2004
Citation
Lo, D. (David), Goubran, R., & Dansereau, R. (2004). Multimodal talker localization in video conferencing environments. In Proceedings - 3rd IEEE International Workshop on Haptic, Audio and Visual Environments and their Applications - HAVE 2004 (pp. 195–200).