<p>"Print & Probability" is an interdisciplinary
and inter-institutional project to develop new techniques for visual anomaly
detection in the OCR of early printed books. By detecting damaged letterforms
that create consistent aberrations, the project aims to infer directly, and at
scale, which letterpress printers produced a given book.</p>
<p>This presentation will detail the unique data management
issues posed by the resulting collection of more than 13 billion character images, and how CMU
Libraries is planning to publish extracts of these data that are both
sustainable and usable. The team’s research software engineer will outline the
design and technologies behind their data management pipeline: a REST API
to a database managed at the Pittsburgh Supercomputing Center that stores and
filters image data and metadata from the automated extraction pipeline; and a
Vue.js-based web interface to assess results and provide new annotations for
model training. Finally, this talk will present plans to distill this massive
research database into a data deposit of interest to computer scientists and
digital humanities researchers, as well as a sustainable static site that
presents a human-and-machine-curated collection of distinctive early type
usable by historians and librarians of rare books.</p><p>Presented at code4lib 2020, Pittsburgh, PA.</p>