Audio Augmented Reality aims to integrate virtual audio content into the user's acoustic environment, creating an immersive audio experience. The commercial availability of augmented reality headsets such as Apple Vision Pro has further motivated interest in this research field. To synthesize binaural spatial audio that can recreate the perception of distance, direction, and acoustic cues, knowledge of specific acoustic parameters of the user's environment is a prerequisite. Acoustic parameters can be divided into two categories: global parameters associated with the room's geometry, reverberation time, and wall materials, and local parameters concerning the location of each sound source. With the help of room acoustic simulators, these parameters are used to simulate room impulse responses. These room impulse responses can then be convolved with dry speech signals to synthesize binaural spatial audio with a perception of realism. However, estimating these acoustic parameters is a challenge. Previous research has attempted to address this problem through cumbersome and time-consuming in-situ measurements, which are often impractical. In this thesis, we tackle this challenge by leveraging supervised machine-learning techniques using speech recordings as input. Our primary focus is on cuboid rooms with static acoustic scenarios. In the initial part of our work, we develop a multi-task neural network for room parameter estimation. We then assess its robustness using real-world data. In the second part, we shift our focus towards virtually supervised learning. This approach involves training machine learning models exclusively on simulated data. The rationale behind this strategy is rooted in the limited availability of task-specific real datasets within this domain. To ensure generalization, the training dataset should closely resemble the scenarios encountered in the test datasets.
To bridge the gap between simulated training data and real test data, we improve the realism of the open-source room acoustics simulator Pyroomacoustics by implementing an extended image source method. This improved simulator is then used to train neural networks for the tasks of room parameter estimation and sound source localization. We employ several real test datasets to assess the benefit of training the systems with the improved simulator. Our experiments show that generalization improves on both tasks compared with systems trained for the same task on less realistic training data. To the best of our knowledge, this is one of the first studies to explore virtually supervised learning for the task of global and local room acoustic parameter estimation.
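As a rough illustration of the pipeline outlined in the abstract (simulate a room impulse response for a cuboid room, then convolve it with a dry signal), the following is a minimal sketch in plain Python. It assumes the basic image source method with a single frequency-independent reflection coefficient on all six walls, omnidirectional source and receiver, and nearest-sample delays. It is not the extended image source method developed in the thesis and does not use the Pyroomacoustics API; the function names, room dimensions, positions, and the three-sample "dry" signal are all hypothetical.

```python
import math
from itertools import product

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def shoebox_rir(room, src, mic, beta, fs=16000, max_index=2, length=4000):
    """Toy image source method for a cuboid room.

    room, src, mic: (x, y, z) tuples in metres; src and mic must differ.
    beta: one reflection coefficient shared by all walls (an assumption).
    max_index: range of the image lattice index per axis, not a strict
    cap on reflection order.
    """
    rir = [0.0] * length
    indices = range(-max_index, max_index + 1)
    for n in product(indices, repeat=3):
        for p in product((0, 1), repeat=3):
            # Image source coordinate on each axis: (1 - 2p) * s + 2 * n * L
            image = [(1 - 2 * p[k]) * src[k] + 2 * n[k] * room[k] for k in range(3)]
            # Number of wall reflections undergone along this image path
            reflections = sum(abs(n[k] - p[k]) + abs(n[k]) for k in range(3))
            dist = math.dist(image, mic)
            delay = round(dist / SPEED_OF_SOUND * fs)  # nearest-sample delay
            if delay < length:
                # Spherical spreading 1 / (4 * pi * d), attenuated per reflection
                rir[delay] += beta ** reflections / (4 * math.pi * dist)
    return rir


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Hypothetical 5 m x 4 m x 3 m room with one source and one microphone.
room = (5.0, 4.0, 3.0)
src = (1.0, 1.0, 1.5)
mic = (3.0, 2.0, 1.5)
rir = shoebox_rir(room, src, mic, beta=0.8)

dry = [1.0, 0.0, 0.5]     # stand-in for a dry speech signal
wet = convolve(dry, rir)  # reverberant signal at the microphone
```

In this simplified model the global parameters (room dimensions, wall reflection) and the local parameters (source position) are exactly the inputs of `shoebox_rir`, which is why estimating them from recordings is what makes such a simulation possible. A realistic simulator would additionally model frequency-dependent absorption, fractional delays, and source and receiver directivity.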
Authors
- Bibliographic Reference: Prerak Srivastava. Realism in virtually supervised learning for acoustic room characterization and sound source localization. Machine Learning [cs.LG]. Université de Lorraine, 2023. English. ⟨NNT : 2023LORR0184⟩. ⟨tel-04313405⟩
- Department: Department of Natural Language Processing & Knowledge Discovery
- Funding: INRIA
- HAL Collection: CNRS - Centre national de la recherche scientifique; INRIA - Institut National de Recherche en Informatique et en Automatique; STAR - Dépôt national des thèses électroniques; INRIA Nancy - Grand Est; Publications du LORIA; TESTALAIN1; Université de Lorraine; INRIA 2; Laboratoire Lorrain de Recherche en Informatique et ses Applications; Department of Natural Language Processing & Knowledge Discovery; Thèses de doctorat soutenues à l'Université de Lorraine
- HAL Identifier: 4313405
- Institution: Institut National de Recherche en Informatique et en Automatique; Université de Lorraine
- Laboratory: Inria Nancy - Grand Est; Laboratoire Lorrain de Recherche en Informatique et ses Applications
- Published in: France
Table of Contents
- List of figures
- List of tables
- List of acronyms
- Introduction
- Motivation
- Research context
- Objective and contributions
- Contributions
- Multichannel room parameter estimation using multiple viewpoints
- Improved simulation and its effect on room parameter estimation
- Improved simulation and its effect on speaker localization
- List of published papers
- Structure of the thesis
- Background
- Microphones and loudspeakers
- Microphones
- Frequency response
- Directivity
- Ambisonics
- Loudspeakers
- Frequency response
- Directivity
- Digital signal model
- Analog-to-digital conversion
- Signal model and terminologies
- Sound representations
- Discrete Fourier transform
- Short-time Fourier transform
- Spherical harmonics
- Discrete spherical harmonic transform
- Acoustics
- Sound wave propagation
- Room acoustics
- Reflection and scattering
- Room impulse response
- Image source method
- RIR perception and reverberation time
- Deep learning
- Conclusion
- State of the art
- Room parameter estimation
- Pre-deep learning methods
- Deep learning methods
- Sound source localization
- Signal-processing-based methods
- Machine learning methods
- Virtually supervised learning
- Room acoustics simulators
- Wave-based simulation
- Geometric-acoustics based simulation
- Room acoustics simulation libraries
- RIR and audio datasets
- RIR datasets
- Binaural room impulse response (BRIR) datasets
- Smart-home datasets
- Audio challenge datasets
- Audio-visual datasets
- Synthetic datasets
- Summary
- Multichannel room parameter estimation using multiple viewpoints
- Training data
- RIR simulation
- Mixture generation
- Neural network model
- Input features
- Loss function
- Fusion of the estimates
- Hyperparameters and training
- Alternative DNN architectures
- Experiments and results
- Baseline system and evaluation metric
- Simulated data
- Real data
- Summary
- Extended image source method and implementation under Pyroomacoustics
- Functioning of Pyroomacoustics
- Extended ISM
- Extended ISM implementation in Pyroomacoustics
- Directivity datasets
- SOFA format
- DIRPAT and other datasets
- DSHT and interpolation
- Frequency domain RIR construction
- Added features
- Qualitative analysis of obtained simulated RIRs
- Comparison between the original and the modified version of Pyroomacoustics
- Similarity to measured RIRs
- Further enhancements and improvements
- Summary
- Impact of simulation realism on virtually supervised learning
- Room parameter estimation
- Simulated data
- Training sets used for the ablation study
- RIR simulation and mixture generation
- Training and hyperparameters
- Experiments and results
- Simulated test set
- Real test set
- Results
- Sound source localization
- Angle of arrival estimation
- DOA estimation on real test sets
- Scenario-based data generation
- RIR simulation
- Mixture generation
- Model selection and hyperparameters
- Experiments and results
- Baseline system and evaluation metric
- Simulated test data
- Real test data
- Summary
- Conclusion and perspectives
- Conclusion
- Perspectives
- Advanced Pyroomacoustics simulator
- Room parameter estimation
- Sound source localization
- Extended summary (in French)
- Multichannel room parameter estimation using multiple viewpoints
- Training data
- Neural network model
- Input features
- Loss function
- Fusion of the estimates
- Hyperparameters and training
- Results and experiments
- Simulated data
- Real data
- Summary
- Extended ISM and implementation under Pyroomacoustics
- Extended ISM
- Directivity datasets
- Frequency domain RIR construction
- Similarity to measured RIRs
- Summary
- Impact of simulation realism on virtually supervised learning
- Room parameter estimation
- Training sets used for the ablation study
- RIR simulation and mixture generation
- Results
- Sound source localization
- Localization on real test sets
- Scenario-based data generation
- Model selection and hyperparameters
- Experiments and results
- Summary
- Perspectives
- Advanced Pyroomacoustics simulator
- Room parameter estimation
- Sound source localization
- Bibliography