Single channel speech separation using minimum mean square error estimation of sources' log spectra
We present an approach for separating two speech signals when only a single recording of their linear mixture is available. The log spectra of the sources are estimated from the mixture's log spectrum using a minimum mean square error (MMSE) approach. The estimator is derived under the assumption that each source is modelled by a set of Gaussian subsources, which are related to the mixture through the MIXMAX approximation. The resulting estimator has a closed form and is expressed in terms of the means and variances of the Gaussian subsources. To obtain the two subsources most likely to have generated the mixture, we use the estimation-detection technique. We also show that binary mask filtering, which has been used empirically in speech separation techniques without mathematical justification, is in fact a simplified form of the MMSE estimator. The proposed technique is compared with the binary mask on male-male, female-female, and female-male mixtures. The experimental results, measured in segmental SNR, show that the MMSE estimator outperforms binary mask filtering.
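As an illustration of the kind of closed-form estimator described above, the following is a minimal sketch for a single frequency bin. It assumes the usual reading of the MIXMAX approximation, namely that the mixture's log spectrum z is approximately max(x, y) of the two sources' log spectra, and that the selected subsources are independent Gaussians in that bin. The function names, the `floor` parameter, and the numerical guards are hypothetical choices made for this example and are not taken from the paper, whose exact expressions may differ.

```python
import math


def _pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def _cdf(u):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def mmse_log_spectrum(z, mu_x, sd_x, mu_y, sd_y):
    """Sketch of E[x | z] for one bin under z = max(x, y).

    x ~ N(mu_x, sd_x^2) and y ~ N(mu_y, sd_y^2) are the assumed
    Gaussian subsources; all names here are illustrative.
    """
    ux = (z - mu_x) / sd_x
    uy = (z - mu_y) / sd_y
    a = _pdf(ux) / sd_x * _cdf(uy)  # density of the event "x is the max"
    b = _pdf(uy) / sd_y * _cdf(ux)  # density of the event "y is the max"
    # Posterior probability that x dominates this bin; 0.5 is an
    # arbitrary fallback when both terms underflow (assumption).
    w = a / (a + b) if (a + b) > 0.0 else 0.5
    # Mean of x truncated to x < z, used when y dominates the bin.
    cx = max(_cdf(ux), 1e-300)  # guard against division by zero
    below = mu_x - sd_x * _pdf(ux) / cx
    return w * z + (1.0 - w) * below


def binary_mask_log_spectrum(z, mu_x, mu_y, floor=0.0):
    """Hard-decision counterpart: keep z where x's subsource mean
    dominates, otherwise return a hypothetical floor value."""
    return z if mu_x >= mu_y else floor
```

In this sketch, shrinking the variances drives the weight `w` toward 0 or 1, so the soft estimate approaches the hard keep-or-discard decision of the mask, which is one way to read the abstract's claim that binary masking is a simplified form of the MMSE estimator. A production version would likely work in the log domain for numerical stability.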
Conference: 17th IEEE International Workshop on Machine Learning for Signal Processing, MLSP-2007
Radfar, M.H., & Dansereau, R. (2007). Single channel speech separation using minimum mean square error estimation of sources' log spectra. Presented at the 17th IEEE International Workshop on Machine Learning for Signal Processing, MLSP-2007. doi:10.1109/MLSP.2007.4414294