Slide 30 of 49
Notes:
The basic reference point was the standard speech evaluation data, which was used to benchmark speech recognition systems with large vocabularies of between five and sixty thousand words. The recognition systems were carefully tuned to this evaluation, and the results could be considered close to optimal for the state of speech recognition research at the time. In these evaluations, we typically saw word error rates ranging from eight to twelve percent, depending on the test set. Note that word error rate was defined as the sum of insertions, substitutions and deletions, divided by the number of words in the reference transcript. This value can be larger than one hundred percent, and it was regarded as a better measure of recognizer accuracy than the percentage of words correct (i.e. words correct = 100% - deletions - substitutions), which ignores insertions.
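As a rough illustration of the two measures described above, the following Python sketch aligns a reference transcript with a recognizer hypothesis and counts substitutions, deletions and insertions. It is not the scoring tool used in the evaluations: the function names are hypothetical, and splitting on whitespace with no text normalization is a simplifying assumption.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-error alignment."""
    # table[i][j] holds (errors, subs, dels, ins) for ref[:i] versus hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)   # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)   # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no new error
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    return table[-1][-1][1:]


def word_error_rate(reference, hypothesis):
    """Word error rate in percent: (S + D + I) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


def words_correct(reference, hypothesis):
    """Words correct in percent: 100% minus deletions and substitutions."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, _ = align_counts(ref, hyp)
    return 100.0 * (len(ref) - subs - dels) / len(ref)
```

For a one-word reference "hello" and the hypothesis "oh hello there now", this sketch reports a word error rate of 300 percent while words correct stays at 100 percent, which shows why the insertion-aware measure was preferred.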
A transcript of TV broadcast data, read by an average reader and re-recorded in a speech lab under good acoustic conditions with a close-talking microphone, produced an estimated word error rate between ten and seventeen percent for speech recognition systems that were not tuned for the specific language and domain in question.
Speech that had been recorded by a professional narrator in a TV studio and that did not include any music or other noise gave an error rate of around twenty percent. Part of the increased error rate was due to poor segmentation of utterances, which left the speech recognizer unable to tell where an utterance started or ended. This problem was not present in the lab-recorded data. Different microphones and environmental acoustics also contributed to the higher error rate.
Speech recognition on C-Span broadcast data showed a doubling of the word error rate to forty percent. While speakers in this data set were mostly constant and always close to the microphone, other noises and verbal interruptions degraded the accuracy of the recognition.
The dialog portions of broadcast documentary videos yielded recognition word error rates of fifty to sixty-five percent, depending on the video data. The signal for these sections contained many more environmental noises, as well as speech recorded outdoors.
The evening news was initially recognized with an overall error rate of sixty-five percent. This rate included recognition accuracy for commercials and introductions as well as the actual news program. As more training data became available, that error rate dropped to around fifty percent. With the more accurate Sphinx-III system, the error rate was reduced even further, down to about twenty-four percent in the 1997 broadcast news evaluation.
A full one-hour documentary video including commercials and music raised the word error rate to seventy-five percent.
Worst of all were commercials, which were recognized with an eighty-five percent error rate due to the large amounts of music in the audio channel as well as the unusual speech characteristics and singing in the spoken portion.