Robust sub-band speech feature extraction using multiresolution convolutional neural networks
Convolutional neural networks (CNNs), a class of deep neural networks, have recently been used for acoustic modeling and for feature extraction combined with acoustic modeling in speech recognition systems. In this paper, we propose using a CNN for robust feature extraction from the noisy speech spectrum. In the proposed method, the CNN input is the noisy speech spectrum and its targets are the denoised log Mel filter-bank energies (LMFBs); the CNN thus learns to extract robust features from the speech spectrum. A drawback of the CNN in this setting is its fixed frequency resolution. We therefore propose using multiple CNNs with different convolution filter sizes, providing different frequency resolutions for feature extraction from the speech spectrum. We name this method multiresolution CNN (MRCNN). Recognition results on the Aurora 2 database show that CNNs outperform deep belief networks (DBNs): the CNN achieves a 20% average relative improvement in recognition accuracy over the DBN. The MRCNN yields a further 1% average relative improvement over the CNN.
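The core idea of the MRCNN can be sketched as parallel convolutions over the spectrum with different filter sizes, whose outputs are pooled and concatenated into one feature vector. The following is a minimal NumPy illustration, not the paper's implementation: the filter lengths (8, 16, 32), the number of filters per bank, the max-pooling step, and the random weights are all assumptions made for the sketch.

```python
import numpy as np

def conv1d_valid(x, w):
    # naive "valid" 1-D convolution of signal x with filter w
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def multiresolution_features(spectrum, filter_banks):
    # filter_banks: list of (num_filters, filter_len) arrays;
    # each bank's filter length sets its frequency resolution
    feats = []
    for bank in filter_banks:
        for w in bank:
            y = conv1d_valid(spectrum, w)
            feats.append(np.max(y))  # max-pool each filter's output over frequency
    return np.array(feats)          # concatenated multiresolution feature vector

rng = np.random.default_rng(0)
spectrum = rng.random(129)  # one frame of a 256-point magnitude spectrum (hypothetical)
# three banks with different filter sizes -> different frequency resolutions
banks = [rng.standard_normal((4, k)) * 0.1 for k in (8, 16, 32)]
feats = multiresolution_features(spectrum, banks)
print(feats.shape)  # 3 banks x 4 filters -> (12,)
```

In a trained system these filters would be learned, and the pooled outputs would feed the layers regressing onto the denoised LMFB targets; the sketch only shows how differing filter sizes yield a combined multiresolution representation.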