Carnegie Mellon University

Title Generation for Spoken Broadcast News using a Training Corpus

journal contribution
posted on 2006-03-01, 00:00 authored by Rong Jin, Alexander Hauptmann
The problem of title generation involves finding the essence of a document and expressing it in only a few words. The results of a query to the Informedia Digital Video Library are summarized through an automatically generated title for each retrieved news story. When the document is errorful, as with speech-recognized broadcast news stories, the title creation challenge becomes even greater. We implemented a set of title word selection strategies and evaluated them on an independent test corpus of 579 broadcast news documents, comparing manual transcription results to automatically recognized speech using the CMU Sphinx speech recognition system with a 64000-word broadcast news language model. Using a training collection of 21190 transcribed broadcast news stories, we trained several systems to produce appropriate title words, namely: a Naïve Bayesian approach with full vocabulary, a Naïve Bayesian approach with limited vocabulary, a nearest neighbor approach, and an extractive approach. The F1 results show that the nearest neighbor approach is a quick and easy way of generating good titles for speech-recognized documents (F1 = 15.2%), while the Naïve Bayesian approach with limited vocabulary also does well on our F1 measure, which ignores word order in the titles (F1 = 21.6%). Overall, the results show that title generation for speech-recognized news documents is possible at a level approaching the accuracy of titles generated for perfect text transcriptions. One surprising phenomenon is that the extractive approach performs slightly better for speech-recognized documents than for manual transcripts.
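The nearest neighbor approach described above can be sketched as follows: represent each training story as a term vector, find the training story most similar to the new document, and reuse that story's title. This is a minimal illustration assuming TF-IDF weighting and cosine similarity; the function names and the exact similarity measure are illustrative choices, not the authors' reported implementation.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build simple TF-IDF vectors (sparse dicts) for tokenized documents."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs, df, n

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_neighbor_title(query_tokens, train_docs, train_titles):
    """Reuse the title of the most similar training story."""
    vecs, df, n = tf_idf_vectors(train_docs)
    # Vectorize the query using the training corpus' document frequencies;
    # words unseen in training are dropped.
    tf = Counter(query_tokens)
    qvec = {w: tf[w] * math.log(n / df[w]) for w in tf if w in df}
    best = max(range(len(train_docs)), key=lambda i: cosine(qvec, vecs[i]))
    return train_titles[best]
```

For example, given two tokenized training stories with titles, a query about a storm would be assigned the title of the weather story. Because the approach copies a human-written title wholesale, the generated titles are always fluent, which is consistent with the paper's observation that it is a quick and easy way to produce good titles.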
