The Heinz Electronic Library Interactive On-Line System (HELIOS): Building A Digital Archive Using Imaging, OCR, and Natural Language Processing Technologies
Journal contribution posted on 01.01.1995 by Gabrielle Michalek
In February 1994, Carnegie Mellon University (CMU) embarked on an ambitious project to convert one million pages of the congressional papers of Senator John Heinz (R-PA) into digital format and to provide access to these papers through innovative information retrieval software developed at CMU. Named in memory of the late Senator, the Heinz Electronic Library Interactive On-Line System (HELIOS) supports full-page digital images and uses natural language processing (NLP) technology to search large quantities of unstructured text. HELIOS will allow researchers to access the Heinz papers through the campus network as well as through the Internet.
Over one million dollars was donated by the Heinz Family Foundation, Heinz Company Foundation, and Heinz Endowments to support the establishment of the H. John Heinz III Archives and the digitization project. This assistance has made it possible to advance the principles of digital preservation and access for archival collections. In addition to the Heinz gift, CMU has committed $450,000 in matching resources to the project, primarily in the form of permanent full-time staff salaries, archival equipment, and rental of a processing facility.
Our goal is to develop a digital archive that will serve as a model for the archival profession. We expect to create an archival information technology environment that dramatically increases the depth of indexing and the quality of retrieval beyond what archiving resources have traditionally allowed. To create the HELIOS database, documents are scanned, converted to ASCII form via OCR, verified and organized, and indexed using the CLARIT natural language processing software. The project will develop three graphical user interfaces in a Microsoft Windows environment: a scanning interface, an archivist/verification interface, and an end-user interface.
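The processing steps named above (scan, OCR to ASCII, verify, index) can be illustrated with a minimal Python sketch. It is not the HELIOS implementation: the `Page` record, the function names, and the file paths are hypothetical, the OCR step is a stand-in that takes already-recognized text, and a plain inverted index substitutes for the CLARIT software, whose interface is not described here.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    """One scanned page (hypothetical record, not the HELIOS schema)."""
    page_id: str
    image_path: str        # assumed location of the full-page image
    text: str = ""         # ASCII text produced by the OCR step
    verified: bool = False


def ocr(page: Page, recognized_text: str) -> Page:
    """Stand-in for OCR: store the recognized text, reduced to ASCII."""
    page.text = recognized_text.encode("ascii", "ignore").decode("ascii")
    return page


def verify(page: Page) -> Page:
    """Stand-in for archivist verification: accept any non-empty page."""
    page.verified = bool(page.text.strip())
    return page


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: list[Page]) -> dict[str, set[str]]:
    """Map each term to the ids of the verified pages that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page in pages:
        if not page.verified:
            continue  # unverified pages are kept out of the index
        for term in tokenize(page.text):
            index[term].add(page.page_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Illustrative data only; the texts and paths are invented for the sketch.
pages = [
    ocr(Page("p1", "scans/p1.tif"), "Clean Air Act amendments"),
    ocr(Page("p2", "scans/p2.tif"), "Medicare hearing notes"),
    ocr(Page("p3", "scans/p3.tif"), ""),  # blank page fails verification
]
index = build_index([verify(p) for p in pages])
print(sorted(search(index, "clean air")))
```

Keeping each stage as a separate function mirrors the separation between the scanning, verification, and end-user interfaces: a page reaches the index only after it has passed verification.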
HELIOS represents a significant technological breakthrough with the potential to transform the work of archivists by helping them overcome the major challenges they face, including the inability to: 1. create good finding aids and indexes that provide deep access to paper collections, 2. provide effective retrieval from paper archives, given the inherent diversity and size of these one-of-a-kind files, and 3. offer broad public access to archives, because researchers must visit them in order to use them effectively. Archivists have resisted the use of information technology because they lack appropriate tools to automatically process large amounts of text for retrieval. HELIOS will offer such a tool. Many problems in the management and preservation of digital archives remain to be solved, but CMU intends to work with the larger archival and library community to help establish standard practices for digitizing paper archives and to develop the information management tools that will give scholars and students state-of-the-art access to them.