Keyword Spotting Based On Decision Fusion

Authors

  • M. Sowmya, Department of ECE, JNTUA College of Engineering, Ananthapuramu, Andhra Pradesh, India

Keywords:

Automatic speech recognition (ASR), Keyword spotting, Decision fusion, WPCA, HMM-ANN method.

Abstract

Automatic speech recognition (ASR) technology is now available in most handsets, and keyword spotting plays a vital role in it. Keyword-spotting performance degrades significantly in real-world environments because of background noise. Since visual features are largely unaffected by acoustic noise, they offer a complementary source of information. In this paper, an audio-visual integration scheme is proposed that combines audio features with visual features, using decision fusion to adapt to varying noise conditions. Visual features are extracted through facial landmark localization using both geometry-based and appearance-based features. To avoid similarities among the textons, a spatiotemporal lip feature (SPTLF) is used, which maps the features into an intra-class subspace. The dimensionality of the lip features is reduced using WPCA. A hybrid HMM-ANN method is proposed for integrating the audio and visual features, in which a neural network generates the adaptive fusion weights. A parallel two-step keyword-spotting strategy is provided to avoid overlap between audio and visual keywords. Experimental results on the dataset demonstrate that the proposed HMM-ANN method improves performance compared with state-of-the-art networks.
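Below is a minimal, illustrative sketch (not the authors' implementation) of two steps mentioned in the abstract: WPCA-style dimensionality reduction of lip features and adaptive decision fusion of audio and visual keyword scores. It assumes WPCA can be read as whitened PCA and that the fusion weight is driven by an audio-reliability estimate such as SNR; function names like `reduce_lip_features` are hypothetical and chosen only for this example.

```python
# Illustrative sketch of lip-feature reduction and adaptive audio-visual fusion.
# Assumptions (not from the paper): WPCA is treated as whitened PCA, and the
# fusion weight comes from a simple logistic function of estimated audio SNR.
import numpy as np


def reduce_lip_features(X, n_components=32, eps=1e-8):
    """Project lip feature vectors (rows of X) onto the top principal
    components and whiten them to unit variance per component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    variances = (S[:n_components] ** 2) / max(len(X) - 1, 1)
    projected = Xc @ components.T
    return projected / np.sqrt(variances + eps)  # whitening step (assumption)


def fusion_weight(snr_db, w=0.35, b=-3.0):
    """Map an estimated audio SNR (dB) to a fusion weight in (0, 1):
    clean audio -> weight near 1 (trust audio), noisy audio -> near 0."""
    return 1.0 / (1.0 + np.exp(-(w * snr_db + b)))


def fuse_keyword_scores(audio_score, visual_score, snr_db):
    """Adaptive decision fusion: convex combination of the audio and visual
    keyword scores, weighted by the estimated audio reliability."""
    lam = fusion_weight(snr_db)
    return lam * audio_score + (1.0 - lam) * visual_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lip_features = rng.normal(size=(200, 128))      # e.g., SPTLF descriptors
    reduced = reduce_lip_features(lip_features, 32)
    print("reduced lip features:", reduced.shape)
    # Same keyword hypothesis scored by the audio model and the visual model.
    print("clean (20 dB):", fuse_keyword_scores(-4.2, -6.0, snr_db=20.0))
    print("noisy (0 dB): ", fuse_keyword_scores(-9.5, -6.0, snr_db=0.0))
```

In the paper the adaptive weights are produced by a neural network inside the hybrid HMM-ANN framework; the logistic mapping above merely stands in for that learned weighting to keep the sketch self-contained.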

References

  1. P. Wu, H. Liu, X. Li, T. Fan, and X. Zhang, "A novel lip descriptor for audio-visual keyword spotting based on adaptive decision fusion," IEEE Trans. Multimedia, vol. 18, no. 3, Mar. 2016.
  2. P. Motlicek, F. Valente, and I. Szoke, "Improving acoustic based keyword spotting using LVCSR lattices," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2012, pp. 4413–4416.
  3. Z. Zhou, G. Zhao, and M. Pietikäinen, "Towards a practical lipreading system," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 137–144.
  4. S. T. Shivappa, B. D. Rao, and M. M. Trivedi, "Audio-visual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 882–894, Oct. 2010.
  5. J.-S. Lee and C. H. Park, "Adaptive decision fusion for audio-visual speech recognition," in Speech Recognition, Technologies and Applications, 2008, pp. 275–296.
  6. V. Estellers, M. Gurban, and J.-P. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, no. 4, pp. 1145–1157, May 2012.
  7. H. Liu, T. Fan, and P. Wu, "Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction," in Proc. IEEE Int. Conf. Robot. Autom., May–Jun. 2014, pp. 6644–6651.
  8. G. Zhao, M. Barnard, and M. Pietikäinen, "Lipreading with local spatiotemporal descriptors," IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1254–1265, 2009.
  9. J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009.
  10. J. Wang et al., "Locality-constrained linear coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3360–3367.
  11. X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," Int. J. Comput. Vis., vol. 107, no. 2, pp. 177–190, 2014.

Published

2017-08-31

Section

Research Articles

How to Cite

[1] M. Sowmya, "Keyword Spotting Based On Decision Fusion," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 5, pp. 544-550, July-August 2017.