From Supercomputer to Static Site: Boiling Down Big Research Data for Preservation and Usability
"Print & Probability" is an interdisciplinary and inter-institutional project to develop new techniques for visual anomaly detection in the OCR of early printed books. By detecting damaged letterforms that create consistent aberrations, the project aims to make it possible to infer letterpress printers directly and at scale.
This presentation will detail the unique data management issues posed by the resulting more than 13 billion character images, and how CMU Libraries plans to publish extracts of these data that are both sustainable and usable. The team’s research software engineer will outline the design and technologies behind the project’s management pipeline: a REST API to a database managed at the Pittsburgh Supercomputing Center, which stores and filters image data and metadata from the automated extraction pipeline; and a Vue.js-based web interface for assessing results and providing new annotations for model training. Finally, this talk will present plans to distill this massive research database into two outputs: a data deposit of interest to computer scientists and digital humanities researchers, and a sustainable static site presenting a human-and-machine-curated collection of distinctive early type, usable by historians and librarians of rare books.
Presented at code4lib 2020, Pittsburgh, PA.