The MITRE Audio Hot Spotting Prototype: Using Multiple Speech and Natural Language Processing Technologies

Qian Hu, Stanley Boykin, Fred Goodman, Warren Greiff, Margot Peet
The MITRE Corporation

Audio contains more information than is conveyed by the text transcript produced by an automatic speech recognizer. Information such as who is speaking, the vocal effort used by each speaker, and the presence of certain non-speech background sounds is lost in a simple speech transcript. In addition, because of variable noise conditions, speaker variability, and the limitations of automatic speech recognizers, speech transcripts can be full of errors. Deletion errors can prevent users from finding what they are looking for in audio or video data, while insertion and substitution errors can be misleading or confusing. Audio Hot Spotting technology allows a user to automatically locate regions of interest in an audio or video file that meet his or her specified criteria. In a query, users may search for keywords or phrases, speakers, keywords spoken by a particular speaker, non-verbal speech characteristics, or non-speech signals of interest. To extract more and better information from multimedia data, we have incorporated multiple speech technologies and natural language processing techniques into the MITRE Audio Hot Spotting prototype currently under development.

We focused on finding words that are information rich and machine recognizable (i.e., content words). The MITRE Audio Hot Spotting prototype examines the speech recognizer output and creates an index list of content words. Short-duration and weakly stressed words, for example, are much more likely to be misrecognized. To eliminate words that are information poor and prone to misrecognition, our index-generation algorithm takes the following factors into consideration: a) absolute word length, b) the number of syllables, c) the recognizer's own confidence score, d) the part of speech (e.g., verb, noun), determined by a POS tagger with some heuristic rules, and e) the word's frequency of occurrence. Experiments we have conducted indicate that the resulting index list typically covers less than 10% of the total words spoken, while more than 90% of the indexed words are actually spoken and correctly recognized.

The prototype allows the user to query the system by keywords or phrases, either by selecting them from the index list or by entering them manually. If matches are found, the system displays the recognized text and allows the user to play the audio or video in the vicinity of the match. In addition, the user can query and retrieve segments spoken by a particular speaker. We achieved this capability by integrating and extending a research speaker identification algorithm. Based on the speaker identification results, the system automatically computes the number of times the speaker spoke and the total duration of that speech. We combined large-vocabulary, speaker-independent, continuous-speech recognition with speaker identification to refine lexical queries by a particular speaker. For example, the user can ask for instances of the word "terrorism" spoken only by the President. More recently, we have experimented with algorithms that detect information-bearing background sounds, such as applause and laughter, which can also be queried and retrieved by users.
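To make the index-generation step described above concrete, the following is a minimal sketch of a content-word filter that weighs word length, syllable count, recognizer confidence, part of speech, and relative frequency. The token fields, thresholds, syllable counter, and POS tag set are illustrative assumptions, not the prototype's actual parameters or implementation.

```python
# Illustrative content-word index filter in the spirit of the approach above.
# Thresholds and field names are assumptions for illustration only.
from dataclasses import dataclass

CONTENT_POS = {"NOUN", "VERB", "PROPN", "ADJ"}   # assumed content-bearing POS tags


@dataclass
class Token:
    word: str           # recognized word
    confidence: float   # recognizer confidence score, 0.0-1.0
    pos: str            # part-of-speech tag from an external tagger


def syllable_count(word: str) -> int:
    """Crude vowel-group count; a real system would use a pronunciation lexicon."""
    vowels = "aeiouy"
    groups, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            groups += 1
        prev = is_vowel
    return max(groups, 1)


def build_index(tokens, min_len=4, min_syllables=2, min_conf=0.85, max_rel_freq=0.01):
    """Keep words that are long, multi-syllabic, confidently recognized,
    content-bearing, and not overly frequent in the transcript."""
    total = len(tokens) or 1
    counts = {}
    for t in tokens:
        counts[t.word.lower()] = counts.get(t.word.lower(), 0) + 1

    index = []
    for t in tokens:
        if (len(t.word) >= min_len
                and syllable_count(t.word) >= min_syllables
                and t.confidence >= min_conf
                and t.pos in CONTENT_POS
                and counts[t.word.lower()] / total <= max_rel_freq):
            index.append(t)
    return index
```

In practice, each factor would be tuned against recognizer error patterns; the hard thresholds shown here simply stand in for whatever weighting the prototype applies.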
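Similarly, the speaker-refined lexical query (e.g., "terrorism" spoken only by the President) can be pictured as intersecting time-stamped word hypotheses from the recognizer with speaker-labeled segments from speaker identification. The data structures and field names below are assumptions for illustration, not the prototype's interfaces.

```python
# Illustrative sketch: restrict keyword hits to segments attributed to one speaker.
from dataclasses import dataclass
from typing import List


@dataclass
class WordHyp:
    word: str
    start: float   # seconds into the file
    end: float


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float


def find_word_by_speaker(words: List[WordHyp],
                         segments: List[SpeakerSegment],
                         query: str,
                         speaker: str) -> List[WordHyp]:
    """Return hypotheses of `query` whose midpoint falls inside a segment
    attributed to `speaker`."""
    hits = []
    for w in words:
        if w.word.lower() != query.lower():
            continue
        mid = (w.start + w.end) / 2.0
        if any(s.speaker == speaker and s.start <= mid <= s.end for s in segments):
            hits.append(w)
    return hits


# Hypothetical usage: locate "terrorism" in segments labeled "President".
# hits = find_word_by_speaker(words, segments, "terrorism", "President")
```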