Speech recognition means converting the content and meaning of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition, whose goal is to identify or verify the person speaking rather than the content of what is said. The purpose of speech recognition is to make the machine understand human spoken language, which involves two things: first, understanding the speech word by word and converting it into written language; second, comprehending and responding correctly to the commands or requests contained in the spoken language, rather than merely converting every word accurately.
In 1952, Davis and others at AT&T Bell Laboratories developed the Audry system, the first speaker-dependent speech recognition system capable of recognizing the ten English digits. In 1956, Olson, Belar, and others at RCA Laboratories in Princeton, USA, developed a system that recognized 10 monosyllabic words, using spectral parameters obtained from a band-pass filter bank as recognition features. In 1959, Fry and Denes attempted to build a phoneme recognizer for 4 vowels and 9 consonants, using spectrum analysis and pattern matching to make decisions, which greatly improved the efficiency and accuracy of speech recognition. Since then, computer speech recognition has attracted researchers in many countries into the field. In the 1960s, Matin and others in the Soviet Union proposed endpoint detection for speech, which significantly raised the level of speech recognition, and Vintsyuk proposed dynamic programming, which proved indispensable in later recognition work. The important achievements of the late 1960s and early 1970s were linear predictive coding (LPC) and dynamic time warping (DTW), which effectively solved the problems of speech-signal feature extraction and matching utterances of unequal length; in the same period, vector quantization (VQ) and hidden Markov model (HMM) theory were proposed. Combined with speech synthesis, speech recognition frees people from the keyboard and replaces it with an easy, natural, and humanized input method; voice input is gradually becoming a key human-machine interface technology in information technology.
One: the status quo of speech recognition technology development - classification of speech recognition systems

Speech recognition systems can be classified according to the restrictions placed on the input speech. Considering the relationship between the speaker and the recognition system, recognition systems fall into three categories:
(1) Specific-person speech recognition system. It considers only the recognition of one particular person's voice.
(2) Non-specific-person speech recognition system. The recognized speech is independent of the speaker; usually a large database of speech from many different speakers is used to train the recognition system.
(3) Multi-person recognition system. It can recognize the voices of a group of people, and is also called a specific-group speech recognition system; the system requires training only on the voices of the group of people to be recognized.
If the speaking style is considered, recognition systems can also be divided into three categories:
(1) Isolated-word speech recognition system. An isolated-word recognition system requires a pause after each word is entered.
(2) Connected-word speech recognition system. A connected-word input system requires each word to be pronounced clearly, although some co-articulation between adjacent words begins to appear.
(3) Continuous speech recognition system. Continuous voice input is natural, fluent speech, in which a great deal of co-articulation and accent variation appears.
If the vocabulary size of the recognition system is considered, recognition systems can likewise be divided into three categories:
(1) Small-vocabulary speech recognition system, usually covering dozens of words.
(2) Medium-vocabulary speech recognition system, usually covering several hundred to several thousand words.
(3) Large-vocabulary speech recognition system, usually covering several thousand to tens of thousands of words. As the computing power of computers and digital signal processors and the accuracy of recognition systems improve, the boundaries of this classification keep shifting: what counts as a medium-vocabulary system today may be considered a small-vocabulary system in the future. These different restrictions also determine the difficulty of a speech recognition system.
Two: the current status of speech recognition technology - a summary analysis of speech recognition methods

At present, the representative speech recognition methods mainly include dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ), artificial neural networks (ANN), and support vector machines (SVM).
Dynamic time warping (DTW) is a simple and effective method for non-specific-person speech recognition. Based on the idea of dynamic programming, it solves the problem of matching templates of different pronunciation lengths, and it is one of the earliest and most commonly used algorithms in speech recognition. When the DTW algorithm is applied, the pre-processed and framed test speech signal is compared with the reference speech templates; a distance measure gives the similarity between the two templates, and the best alignment path is chosen.
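As a hedged illustration, the dynamic-programming core of DTW can be sketched as follows (assuming, for simplicity, one-dimensional feature sequences and an absolute-difference local distance; a real recognizer would compare frame-wise spectral feature vectors with a vector distance):

```python
def dtw_distance(a, b):
    """Accumulated alignment cost between sequences a and b via dynamic programming."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])            # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],     # stretch template
                                 cost[i][j - 1],     # stretch input
                                 cost[i - 1][j - 1]) # one-to-one match
    return cost[n][m]
```

In an isolated-word recognizer of this style, the input utterance is compared against every stored template and the word whose template yields the smallest DTW distance is output.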
The hidden Markov model (HMM) is a statistical model used in speech signal processing. It evolved from the Markov chain, so it is a statistical recognition method based on a parametric model. Its pattern library is formed by repeated training: what is stored is not pre-recorded pattern samples but the model parameters that match the training output signal with the highest probability. During recognition, the likelihood between the speech sequence to be recognized and the HMM parameters is computed, and the state sequence corresponding to the maximum likelihood is taken as the recognition output. For these reasons it is an ideal speech recognition model.
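To make the "state sequence corresponding to the maximum likelihood" concrete, here is a minimal Viterbi decoding sketch over a discrete-observation HMM (the state names, probabilities, and observation symbols in any use of it are invented for illustration; real systems use continuous acoustic observations and log probabilities):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for obs and its probability."""
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # pick the predecessor state that maximizes the path probability
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best], V[-1][best]
```

In a word recognizer of this kind, one HMM is trained per vocabulary word, and the word whose model assigns the observation sequence the highest likelihood is chosen as the output.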
Vector quantization (VQ) is an important signal compression method. Compared with HMM, vector quantization is mainly suitable for small-vocabulary, isolated-word speech recognition. In this process, several scalar values from a speech waveform or its feature parameters are combined into a vector and quantized as a whole in a multi-dimensional space: the vector space is divided into several small regions, a representative vector is found for each region, and any vector that falls into a region during quantization is replaced by that region's representative vector. Designing a vector quantizer means training a good codebook from a large number of signal samples, finding a distortion measure that works well in practice, and building the best quantization system so that the least amount of search and distortion computation achieves the largest possible average signal-to-noise ratio.
In practical applications, a variety of methods have also been studied to reduce complexity, including memoryless vector quantization, vector quantization with memory, and fuzzy vector quantization.
The artificial neural network (ANN) is a speech recognition method proposed in the late 1980s. It is essentially an adaptive nonlinear dynamic system that simulates the principles of human neural activity; its adaptability, parallelism, robustness, fault tolerance, and learning ability, together with its powerful classification and input-output mapping capabilities, are very attractive for speech recognition. The method is an engineering model that simulates the thinking mechanism of the human brain, and in this respect it is the opposite of HMM. Its classification and decision-making ability and its handling of uncertain information are widely recognized, but its ability to describe dynamic time signals is still unsatisfactory: an MLP classifier can usually only solve static pattern-classification problems and does not handle time series. Although scholars have proposed many structures with feedback, these are still insufficient to characterize the dynamic properties of time series such as speech signals. Because ANNs cannot describe the temporal dynamics of speech well, they are often combined with traditional recognition methods so that the respective strengths of each overcome the shortcomings of both HMM and ANN. In recent years, significant progress has been made on recognition algorithms that combine neural networks with hidden Markov models; their recognition rate is close to that of pure HMM systems, and they further improve the robustness and accuracy of speech recognition.
The support vector machine (SVM) is a learning machine based on statistical learning theory. It uses structural risk minimization (SRM), which effectively overcomes the shortcomings of traditional empirical risk minimization by balancing training error against generalization ability. It performs excellently on small-sample, nonlinear, and high-dimensional pattern-recognition problems and has been widely used in the field of pattern recognition.
Three: The development status of speech recognition technology - foreign research

Research on speech recognition can be traced back to the Audry system of AT&T Bell Labs in the 1950s, the first system able to recognize the ten English digits.
Real progress, however, came when speech recognition was taken up as a major research subject in the late 1960s and early 1970s. This was firstly because the development of computer technology made the hardware and software needed for speech recognition possible. More importantly, linear predictive coding (LPC) and dynamic time warping (DTW) were proposed, effectively solving the problems of speech-signal feature extraction and matching utterances of unequal length. Speech recognition in this period was based mainly on the template-matching principle, and the research field was limited to specific-person, small-vocabulary, isolated-word recognition; specific-person isolated-word recognition systems based on linear prediction cepstrum and DTW were realized, and vector quantization (VQ) and hidden Markov model (HMM) theory were proposed.
As applications expanded, constraints such as small vocabulary, specific speakers, and isolated words needed to be relaxed, which brought many new problems. First, the expansion of the vocabulary makes the selection and construction of templates difficult. Second, in continuous speech there are no obvious boundaries between phonemes, syllables, or words, and each pronunciation unit exhibits co-articulation strongly affected by its context. Third, in non-specific-person recognition, the acoustic characteristics of different people saying the same thing differ greatly, and even the same person speaking the same content at different times or in different physiological and psychological states shows large differences. Fourth, the speech to be recognized contains background noise or other interference. For these reasons, the original template-matching method is no longer applicable.
A huge breakthrough in laboratory speech recognition research came in the late 1980s: researchers finally overcame the three major obstacles of large vocabulary, continuous speech, and non-specific speakers, integrating all three features into one system for the first time. The Sphinx system of Carnegie Mellon University is the most typical example: it was the first high-performance non-specific-person, large-vocabulary continuous speech recognition system.
During this period, speech recognition research advanced further, and its most salient feature was the successful application of the HMM model and artificial neural networks (ANN) to speech recognition. The wide adoption of the HMM model should be attributed to the efforts of scientists such as Rabiner at AT&T Bell Laboratories, who turned the originally difficult pure-mathematical HMM model into an engineering tool that more researchers could understand and use, making statistical methods the mainstream of speech recognition technology.
The statistical approach shifts the researcher's view from the micro to the macro level: instead of deliberately pursuing ever finer speech features, it builds the best speech recognition system from the perspective of overall (statistical) averages. In acoustic modeling, the Markov-chain-based speech sequence modeling method, the HMM, effectively captures both the short-term stationarity and the long-term time-varying nature of speech signals, and sentence models for continuous speech can be constructed from a set of basic modeling units, achieving relatively high modeling accuracy and flexibility. At the linguistic level, the statistical co-occurrence probabilities of words in large real-world corpora, that is, N-gram statistical models, are used to disambiguate the confusable and homophonic words produced by recognition. In addition, artificial neural network methods and language-processing mechanisms based on grammar rules have also been applied in speech recognition.
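The N-gram idea can be illustrated with a toy bigram model that estimates P(w2 | w1) from co-occurrence counts (the tiny corpus in any example is invented; real systems train on large-scale corpora and apply smoothing to handle unseen word pairs):

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate bigram probabilities P(w2 | w1) from a list of sentences."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[w1][w2]
    totals = defaultdict(int)                       # occurrences of w1 as a left word
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            counts[w1][w2] += 1
            totals[w1] += 1
    # normalize counts into conditional probabilities
    return {w1: {w2: c / totals[w1] for w2, c in nxt.items()}
            for w1, nxt in counts.items()}
```

Such probabilities let the recognizer prefer, among acoustically confusable or homophonic candidates, the word sequence that is more likely in the language.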
In the early 1990s, many well-known large companies such as IBM, Apple, AT&T, and NTT invested heavily in practical research on speech recognition systems. Speech recognition has a good evaluation metric, namely recognition accuracy, and this indicator improved continuously in laboratory research through the mid and late 1990s. Representative systems include IBM's ViaVoice, Dragon Systems' NaturallySpeaking, the Nuance Voice Platform from Nuance, Microsoft's Whisper, and Sun's VoiceTone.
Among them, IBM developed the Chinese ViaVoice speech recognition system in 1997, and the following year developed ViaVoice'98, which can recognize regional accents such as Shanghainese, Cantonese, and Sichuanese. It comes with a basic vocabulary of 32,000 words, expandable to 65,000 words, also includes commonly used office terms, and has a correction mechanism; its average recognition rate can reach 95%. The system is highly accurate at recognizing news speech and is currently a representative Chinese continuous speech recognition system.
Four: Development status of speech recognition technology - domestic research

China's speech recognition research started in the 1950s, but it has developed rapidly in recent years, and its level has gradually moved from the laboratory toward practical use. Since the implementation of the National 863 Program in 1987, the National 863 Intelligent Computer Expert Group has set up special projects for speech recognition research, rolled over every two years. The level of China's speech recognition research is basically in step with that of other countries, and in Chinese speech recognition it has its own characteristics and advantages, reaching the international advanced level. Institutions such as the Institute of Automation and the Institute of Acoustics of the Chinese Academy of Sciences, Tsinghua University, Peking University, Harbin Institute of Technology, Shanghai Jiaotong University, University of Science and Technology of China, Beijing University of Posts and Telecommunications, and Huazhong University of Science and Technology have all conducted speech recognition research; the representative units are the Department of Electronic Engineering of Tsinghua University and the State Key Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences.
The speech technology and special chip design research group of the Department of Electronic Engineering, Tsinghua University, has developed a non-specific-person Chinese digit-string continuous speech recognition system with a recognition accuracy of 94.8% (variable-length digit strings) and 96.8% (fixed-length digit strings). With a 5% rejection rate, the recognition rate reaches 96.9% (variable-length) and 98.7% (fixed-length), which is among the best results internationally, and its performance is close to the practical level. The 5000-word non-specific-person continuous speech recognition system for postal package address verification developed by the same group reaches a recognition rate of 98.73%, with a top-three recognition rate of 99.96%; it can recognize both Mandarin and Sichuan dialect, meeting practical requirements.
The Institute of Automation of the Chinese Academy of Sciences and its affiliated Pattern Technology (Pattek) company jointly released the "Tianyu" series of Chinese speech products, PattekASR, targeting different computing platforms and applications, ending the monopoly that foreign companies had held over Chinese speech recognition products since 1998.
Five: the current status of the development of speech recognition technology - problems to be solved

The performance of a speech recognition system is affected by many factors, including differences between speakers' pronunciations, speaking style, environmental noise, fading of the transmission channel, and so on.
There are four specific problems to be solved:
① Enhancing the robustness of the system: even when conditions become very different from those during training, the system's performance should not degrade abruptly.
② Increasing the adaptability of the system: the system must be able to adapt stably and continuously to changing conditions. Speakers differ in age, gender, accent, speaking rate, speech intensity, pronunciation habits, and so on, and the system should be able to eliminate these differences and recognize speech stably.
③ Finding better language models: the system should obtain as many constraints as possible from the language model, so as to counter the impact of growing vocabularies.
④ Dynamic modeling: current speech recognition systems assume in advance that segments and words are independent of each other, but in reality vocabulary and phoneme cues require integrating the characteristics of articulatory (vocal-organ) movement models. Dynamic modeling should therefore be performed so that this information can be integrated into the speech recognition system.
Six: Development status of speech recognition technology - the latest development of speech recognition systems

By now, the recognition accuracy of small- and medium-vocabulary non-specific-person speech recognition systems has exceeded 98%, and the accuracy of specific-person systems is even higher. These technologies can already meet the requirements of common applications. Thanks to the development of large-scale integrated circuit technology, these complex speech recognition systems can now be made into dedicated chips and mass-produced. In the economically developed Western countries, a large number of speech recognition products have entered the market and the service sector. Some user switchboards, telephones, and mobile phones already include voice-dialing functions, and there are voice notepads, voice-controlled smart toys, and other products combining speech recognition and speech synthesis. Using spoken-dialogue systems based on speech recognition, people can inquire about air tickets, travel, and banking information over the telephone network. Surveys show that up to 85% of users are satisfied with the performance of such voice information query services. It can be predicted that within the next 5 years, speech recognition systems will be applied even more widely, and new speech recognition products will keep appearing on the market. The role of speech recognition in mail sorting is also increasingly apparent, with attractive prospects for development. Postal departments in some developed countries already use such systems, and speech recognition is gradually becoming a new technology for mail sorting: it overcomes the shortcomings of manual sorting that relies solely on the sorter's memory, solves the problem of high personnel costs, and improves the efficiency and quality of mail processing.
In terms of education, the most direct application of speech recognition technology is to help users better practice language skills.
Another development branch of speech recognition is telephone speech recognition, in which Bell Labs is a pioneer. Telephone speech recognition will make it possible to implement telephone inquiries, automatic call connection, and special services such as travel information. With a voice query system based on speech understanding, banks can provide customers with 24-hour telephone banking services. In the securities industry, with a telephone speech recognition system, a user who wants to check the market can simply say the stock name or code; after the system confirms the request, it automatically reads out the latest price, which greatly benefits the user. At present, the 114 directory inquiry desk relies on a large amount of manual service; with voice technology, a computer could automatically answer the user's request and then play back the queried phone number, saving human resources.