[slightly off-topic] programmable speech recognition software

Fernando Pereira pereira at research.att.com
Fri Oct 22 23:34:27 EDT 1999


In article <000001bf1c51$78160160$352d153f at tim>, Tim Peters
<tim_one at email.msn.com> wrote:
> > As many words as possible should be recognized in the recorded audio
> > stream - speaker-independent and in German! (I can hear you shudder)
> 
> Speaker-independent is hard.  From a recording is hard.  If people aren't
> speaking into a high-quality microphone, it's very hard.  If it's not
> intentional speech (i.e., if people don't know they're talking to a
> computer -- they're just chatting), it's very hard.  The good news is that
> German is easy (compared to the rest of this).
> 
> > In unclear cases - there will probably be lots of them - the closest
> > interpretations would be OK.
> 
> I doubt you can do this yourself and get acceptable results with any
> available SR software.  It's an extremely difficult problem.  Taking
> software designed for close-talking-microphone, speaker-dependent, "dictate
> into Word" applications, and trying to apply it to a radically harder task,
> is like hoping (say) Perl can be used to get real work done <wink>.  The
> closest Dragon Systems gets is described at:
> 
>     http://www.dragonsys.com/products/audiomining/index.html
> 
> Note that there's no product mentioned there -- Dragon has appropriate
> technology, and that's what the page talks about.  This is so bleeding edge
> there are no pre-packaged applications for sale.  Dragon would be happy to
> build one for you, though, in return for an insignificant percentage of
> Germany's gross national product <wink>.
According to a recently conducted NIST evaluation, current experimental
large-vocabulary recognizers achieve 70-80% word accuracy on
unrestricted American radio and TV news broadcasts. That includes
studio news-announcer speech (accuracy is often much higher there),
phone call-ins, field recordings (often poor, with varied background
noise), interviews and round-table discussions. This level of accuracy
is sufficient for searching news stories by content: with appropriate
methods to compensate for recognition errors, search from automatic
transcriptions is almost as effective as search from manual
transcriptions (for those interested, see the paper by Singhal and
Pereira in the proceedings of ACM SIGIR 99). Recognition runtime for
these results varies between roughly real time and three times slower
than real time, depending on hardware, recognizer and desired accuracy.
As far as I know, none of these recognizers is available as a product,
although some may be close. In my experience, the critical element is
not the actual recognizer code -- although that matters quite a bit for
speed, memory usage and flexibility in using different kinds of speech
models -- but rather the choice of speech and language modeling
techniques for the particular task, the selection of training data, and
the actual model-creation recipe.
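
For those curious how "word accuracy" figures like the ones above are
scored, here is a rough sketch (mine, not the NIST scoring tools, and
the two sample strings are invented) of the usual recipe: align the
recognizer output against a reference transcription by edit distance
over words, and report one minus the word error rate.

    def word_error_rate(reference, hypothesis):
        """Word-level Levenshtein distance, normalized by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for i in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                if ref[i - 1] == hyp[j - 1]:
                    cost = 0
                else:
                    cost = 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution or match
        return dp[len(ref)][len(hyp)] / float(len(ref))

    reference  = "the federal reserve raised interest rates again today"
    hypothesis = "the federal reserve raced interest rate again today"
    accuracy = 1 - word_error_rate(reference, hypothesis)
    print("word accuracy: %.0f%%" % (100 * accuracy))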

The situation becomes noticeably worse for spontaneous speech over
uncontrolled channels. However, even then it is often possible to do
topic classification of spoken utterances with fairly high accuracy,
for a relatively small set of topics, by training appropriate
classifiers. The point is that for some tasks it is not necessary to
recognize every word, only to determine what the speaker is talking
about, which can be identified from several cues in the utterance. The
kind of audio data-mining application that Dragon Systems has
announced, and that Tim mentions above, relies on this, I believe.
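
To make the "several cues" point concrete, here is a toy illustration
(entirely my own construction, with made-up training sentences): a
simple naive-Bayes-style classifier pools evidence from every
recognized word, so a few misrecognized words rarely flip the winning
topic.

    import math

    def train(labelled_docs):
        """labelled_docs: list of (topic, text) pairs with hand-assigned topics."""
        counts, totals, priors = {}, {}, {}
        for topic, text in labelled_docs:
            priors[topic] = priors.get(topic, 0) + 1
            word_counts = counts.setdefault(topic, {})
            for word in text.lower().split():
                word_counts[word] = word_counts.get(word, 0) + 1
                totals[topic] = totals.get(topic, 0) + 1
        return counts, totals, priors

    def classify(text, model):
        counts, totals, priors = model
        vocab_size = len(set([w for c in counts.values() for w in c]))
        best_topic, best_score = None, None
        for topic in counts:
            # log prior plus smoothed log-likelihood of each recognized word
            score = math.log(priors[topic])
            for word in text.lower().split():
                p = (counts[topic].get(word, 0) + 1.0) / (totals[topic] + vocab_size)
                score = score + math.log(p)
            if best_score is None or score > best_score:
                best_topic, best_score = topic, score
        return best_topic

    model = train([
        ("weather", "rain showers expected as a cold front moves in tonight"),
        ("finance", "stocks fell and bond yields rose as the market closed lower"),
    ])
    # "yields" misrecognized as "fields", yet the topic still comes out right
    print(classify("bond fields rose and stocks fell", model))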

-- F



