Posted on 2006-03-01, 00:00. Authored by Rong Jin, Alexander Hauptmann
The problem of title generation involves finding the essence of
a document and expressing it in only a few words. The results
of a query to the Informedia Digital Video Library are
summarized through an automatically generated title for each
retrieved news story. When the document is errorful, as with
speech-recognized broadcast news stories, the title creation
challenge becomes even greater. We implemented a set of title
word selection strategies and evaluated them on an
independent test corpus of 579 broadcast news documents,
comparing manual transcription results to automatically
recognized speech using the CMU Sphinx speech recognition
system with a 64000-word broadcast news language model.
Using a training collection of 21190 transcribed broadcast
news stories, we trained several systems to produce appropriate
title words: a Naïve Bayesian approach with full vocabulary, a
Naïve Bayesian approach with limited vocabulary, a nearest
neighbor approach and an extractive approach. The F1 results
show that the nearest neighbor approach is a quick and easy
way of generating good titles for speech-recognized documents
(F1 = 15.2%), while a Naïve Bayesian approach with limited
vocabulary also does well on our F1 measure (F1 = 21.6%),
which ignores word order in the titles. Overall, the results show
that title generation for speech-recognized news documents is
possible at a level approaching the accuracy of titles generated
for perfect text transcriptions. One surprising phenomenon is
that the extractive approach performs slightly better for
speech-recognized documents than for manual transcripts.
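
The F1 measure mentioned above ignores word order, so a generated title can be scored as a bag of words against the reference title. The following is a minimal sketch of one such order-insensitive F1, not necessarily the exact formula used by the authors. The function name `title_f1`, the whitespace tokenization, the lowercasing, and the handling of repeated words through multiset overlap are all assumptions made for illustration.

```python
from collections import Counter


def title_f1(generated: str, reference: str) -> float:
    """Order-insensitive F1 between a generated and a reference title.

    Assumption: titles are tokenized on whitespace and lowercased, and
    repeated words are matched as a multiset (each occurrence counts once).
    """
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())

    # Multiset intersection: number of generated words also in the reference.
    overlap = sum((gen & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(gen.values())  # share of generated words that are correct
    recall = overlap / sum(ref.values())     # share of reference words that were recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Same words in a different order still score as a perfect match.
    print(title_f1("senate passes budget bill", "bill budget passes senate"))
    # Partial overlap: 2 shared words out of 3 generated and 5 reference words.
    print(title_f1("president visits china", "president arrives in china today"))
```

Because only word overlap is counted, a title with the right words in a scrambled order scores the same as a fluent one, which is why a word-selection method can do well on this measure without producing readable titles.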