Multimodal diarization: towards robustness and fairness in the wild

Yannis Tevissen

Speaker diarization, the task of automatically determining "who spoke when" in an audio or video recording, is one of the pillars of modern conversation analysis systems. Television broadcast content is very diverse and covers nearly every type of conversation, from calm discussions between two people to impassioned debates and wartime interviews. The archiving and indexing of this content, carried out by the Newsbridge company, requires robust and fair processing methods. In this work, we present two new methods for improving system robustness via fusion approaches. The first method focuses on voice activity detection, a necessary pre-processing step for every diarization system. The second is a multimodal approach that takes advantage of the latest advances in natural language processing. We also show that recent advances in diarization systems make the use of speaker diarization realistic, even in critical sectors such as the analysis of large audiovisual archives or home care for the elderly. Finally, this work presents a new method for evaluating the algorithmic fairness of speaker diarization, with the aim of making its use more responsible.

Authors

Yannis Tevissen

Related Organizations

Bibliographic Reference: Yannis Tevissen. Diarisation multimodale : vers des modèles robustes et justes en contexte réel. Intelligence artificielle [cs.AI]. Institut Polytechnique de Paris, 2023. Français. ⟨NNT : 2023IPPAS014⟩. ⟨tel-04345081⟩

HAL Collection: STAR - Dépôt national des thèses électroniques

HAL Identifier: 4345081

Institution: Télécom SudParis

Laboratory: Services répartis, Architectures, MOdélisation, Validation, Administration des Réseaux

Published in: France
