Mark Liberman,
Shelties On Alki Story Forest, Language Log, 26 Nov 2019.
Last week I gave a talk at an Alzheimer's Association workshop on "Digital Biomarkers". Overall I told a hopeful story, about the prospects for a future in which a few minutes of interaction each month, with an app on a smartphone or tablet, will give effective longitudinal tracking of neurocognitive health.
But I emphasized the fact that we're not there yet, and that some serious research and development problems stand in the way. In particular, the current state of the art in speech recognition is not yet good enough for reliable automated evaluation of spoken responses.
Speech-based tasks have been part of standard neuropsychological test batteries for many decades, because speaking engages many psychological and neurological systems, offering many (sometimes subtle) clues about what might be going wrong. One of many such tasks is describing a picture, for which the usual target is the infamous Cookie Theft picture.
It's past time to replace this image with pictures that are less dated and culture-bound — and in any case, we'll need multiple pictures for longitudinal tracking — but this is the one that clinical researchers have mostly been using. Whatever the source of the speech to be analyzed, many obvious measures — word counts, sentence structure, word frequency and concreteness, etc. — depend on a transcript, which at present is supplied by human labor.
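To make the dependence on a transcript concrete, here is a minimal Python sketch of a few of the simpler measures (word count, vocabulary size, type-token ratio, mean sentence length). The function name `transcript_measures`, the tokenization rules, and the sample sentence are all assumptions made for illustration; they are not taken from any clinical protocol or from the tools discussed in the post.

```python
import re
from collections import Counter

def transcript_measures(transcript: str) -> dict:
    """Compute a few simple lexical measures from a plain-text transcript."""
    # Split into sentences on terminal punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    # Lowercased word tokens: runs of letters and apostrophes (a crude tokenizer).
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(words)
    n_words = len(words)
    return {
        "word_count": n_words,
        "distinct_words": len(counts),
        "type_token_ratio": len(counts) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }

# Made-up sample text, standing in for a human-produced transcript.
sample = "The boy is on the stool. The stool is tipping over."
measures = transcript_measures(sample)
print(measures)
```

Every number here is computed from the text alone, so any recognition error in an automatic transcript flows directly into the measures.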
We've tried many speech-to-text solutions, including open-source packages and commercial APIs. And the technology is not quite there yet.
Later:
We can expect that general speech-to-text technology will continue to improve.
But the most important remedy is language models that are better adapted to specific tasks and speaker populations. A system's "language model" encodes its expectations about what's likely to be said. And current ASR systems are more dependent on their language models than they should be, compensating for the weakness of their acoustic analysis. The good news is that we know very well how to create and incorporate improved language models, if we have large enough amounts of good-quality transcriptions from sources similar to the target application.
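As a rough illustration of what "expectations about what's likely to be said" means, the sketch below trains two toy bigram language models with add-one smoothing, one on generic sentences and one on picture-description sentences, and compares the log-probability each assigns to a new description. This is only a toy: production ASR systems use far larger n-gram or neural models, and the two tiny corpora, the function names, and the test sentence are invented for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, vocab

def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total

# Invented miniature corpora: generic text vs. task-like picture descriptions.
general = ["the weather is nice today", "please call me later", "the meeting is at noon"]
task = ["the boy is taking a cookie", "the stool is tipping over",
        "the water is overflowing in the sink"]

g_ctx, g_bi, g_vocab = train_bigram(general)
t_ctx, t_bi, t_vocab = train_bigram(task)

# Shared vocabulary size (plus one slot for unseen words) keeps the comparison fair.
vocab_size = len(g_vocab | t_vocab) + 1

test_sentence = "the boy is on the stool"
general_lp = log_prob(test_sentence, g_ctx, g_bi, vocab_size)
task_lp = log_prob(test_sentence, t_ctx, t_bi, vocab_size)
print(f"general model: {general_lp:.2f}  task-adapted model: {task_lp:.2f}")
```

The task-trained model assigns the new description a higher log-probability than the generic one, which is the sense in which transcripts from sources similar to the target application help a recognizer prefer the right words.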