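Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
*Email: shaoyubin@kust.edu.cn
Received: 2022-05-11,
Published online: 2023-08-07,
Published in print: 2023-07-30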
Yingjie HUA, Jing LIU, Yubin SHAO, et al. Language Identification in Battlefield Environments[J]. Acta Armamentarii, 2023, 44(7): 2197-2206. DOI: 10.12382/bgxb.2022.0367.
To maintain high language identification performance in battlefield environments, a language identification method based on gray-scale transformation of the spectrogram is proposed. Band-pass filtering is introduced according to how speech information and battlefield noise are distributed over the spectrogram. A logarithmic gray-scale spectrogram is extracted in line with the characteristics of human hearing. An automatic color level algorithm is applied to suppress noise and enhance language information in the spectrogram, and a residual neural network model is used for training and identification. Experimental results show that, compared with linear gray-scale spectrogram features, identification accuracy improves by 46% under Predator fighter cockpit noise at -10 dB, and that identification performance also improves substantially in the other noise environments.
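The front end described in the abstract (band-pass step, logarithmic gray-scale spectrogram, automatic color level stretch) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `log_gray_spectrogram`, the 300-3400 Hz band, the frame and hop sizes, and the 1% clip fraction are all assumptions chosen for illustration, the band-pass step is approximated by keeping only the spectrogram rows inside the band, and the residual network classifier is omitted.

```python
import numpy as np


def log_gray_spectrogram(signal, fs, frame_len=512, hop=256,
                         band=(300.0, 3400.0), clip=0.01):
    """Return a uint8 gray image of shape (kept frequency bins, frames).

    All default parameter values are illustrative assumptions,
    not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    # Split the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # Magnitude spectrum of every frame, arranged as (bins, frames).
    mag = np.abs(np.fft.rfft(frames, axis=1)).T
    # Band-pass step (approximation): keep only bins inside the assumed band.
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    mag = mag[(freqs >= band[0]) & (freqs <= band[1])]
    # Logarithmic compression, mimicking the ear's loudness response.
    log_mag = np.log10(mag + 1e-10)
    # Auto-levels stretch: clip the darkest and brightest fractions,
    # then map the remaining range linearly onto 0..255.
    lo, hi = np.quantile(log_mag, [clip, 1.0 - clip])
    stretched = np.clip((log_mag - lo) / max(hi - lo, 1e-12), 0.0, 1.0)
    return np.round(stretched * 255).astype(np.uint8)
```

Clipping the extreme fractions before stretching is what lets the stretch push low-energy background toward black while spreading the remaining dynamic range over the speech-dominated region. The resulting image could then be fed to any image classifier, such as a residual network.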