This README.txt file was generated on 2020-01-07 by Matthew D. Lincoln

#
# General instructions for completing README:
# For sections that are non-applicable, mark as N/A (do not delete any sections).
# Please leave all commented sections in README (do not delete any text).
#

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Frankenstein Variorum Collations

This dataset contains TEI-XML encoded versions of Mary Shelley's Frankenstein at various stages of the collation process. It documents the first phase (2018-2019) of the Frankenstein Variorum, a project to collate the textual changes between multiple editions of Frankenstein.

We began preparing a new digital encoding of the novel Frankenstein by returning to its first online text in Stuart Curran’s and Jack Lynch’s Pennsylvania Electronic Edition. That hypertext edition represented groundbreaking digital scholarship in the era of web 1.0, deploying an interface for reading the 1818 and 1831 texts in parallel, using HTML frames now deprecated by the World Wide Web Consortium. That edition prepared the novel in hundreds of distinct HTML files, each representing a few paragraphs at a time, to provide its comparison view of the 1818 and 1831 editions. The Pennsylvania Electronic Edition also gathered many hundreds of files of context, including editions of related poems like “The Witch of Atlas” and “The Revolt of Islam” together with scholarly articles, maps, glosses and annotations.

The Pennsylvania Electronic Edition was partially curated by Romantic Circles in a version of TEI, the XML language of the Text Encoding Initiative recommended for sustainable transfer and long-range storage of digital editions. According to Neil Fraistat, the HTML publication of the 1818 and 1831 editions has become Romantic Circles’ most-visited site. However, the HTML “skeleton” of the Pennsylvania Electronic Edition proved difficult to convert to TEI, and the TEI first produced from the HTML consisted of minimal TEI renderings of HTML tags, mainly presentational rather than semantic markup. Though the TEI provides critical apparatus markup for storing alternate versions of passages and for storing multiple editions in a single XML document, the first TEI edition of Frankenstein for Romantic Circles preserved the 1818 and 1831 texts in separate documents. A representation of the texts in comparison appears via Juxta Commons, but the Juxta algorithm has problems differentiating long texts.

Our work on the project has involved returning to the code of Curran’s and Lynch’s electronic editions of the 1818 and 1831 texts, and converting its HTML tags into simple XML marking the structure of the document. New with our Variorum is a text-based digital edition of the 1823 publication supervised by William Godwin, the first edition to show Mary Wollstonecraft Shelley’s name on the title page. We prepared this plain text edition from OCR of the 1823 edition, derived via ABBYY FineReader, and formatted like our plain texts of the 1818 and 1831. Throughout this process we have been correcting our new and restored digital texts against photo facsimiles of the originals.

We then prepared all editions to be compared with one another with computer-aided collation. To create the TEI variorum, we prepared all the print editions with the same XML elements, and then we “flattened” those elements as self-closing milestone markers for collation, because the collation process needs to be able to locate alterations that collapse or open up new paragraphs and chapters. We similarly flattened the markup of the Shelley-Godwin Archive texts, and we wrote an algorithm in Python to exclude page surface and line markers from the collation, because our process compares what we think of as semantic structures: paragraph, chapter, and volume boundaries matter, whereas page boundaries and lineation do not. When the editions are thus prepared in comparable “flat” XML, we process them with CollateX, which locates the points of variance (or “deltas”) and outputs these in TEI XML critical apparatus markup. We have devised a structure that we think of as the “spine” of the edition, created from the TEI critical apparatus, to point to specific locations in the manuscript notebooks. This provides a way to build a reading interface of the novel that highlights “hotspots” of variance in the print editions and that links into relevant passages in the Notebooks.
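
The flattening step described above can be illustrated with a minimal Python sketch. It is not the project's own code: the sID and eID attribute names, the generated ids, and the flatten function are assumptions chosen for illustration, and the regular expression only handles simple, well-formed markup (no ">" inside attribute values).

```python
import re

# Matches start tags, end tags, and self-closing tags.
# Comments and processing instructions are left alone because
# their names do not begin with a letter.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*?)(/?)>")


def flatten(xml_text):
    """Rewrite paired start/end tags as self-closing milestone markers.

    ASSUMPTION: sID/eID are hypothetical attribute names that tie a
    start marker to its matching end marker.
    """
    counters = {}   # running count per element name, used to build ids
    open_ids = []   # stack of (name, id) for elements still open

    def replace(match):
        closing, name, attrs, self_closing = match.groups()
        if self_closing:
            # Already an empty element (for example a page break): keep it.
            return match.group(0)
        if closing:
            open_name, marker_id = open_ids.pop()
            if open_name != name:
                raise ValueError(f"unexpected </{name}>, expected </{open_name}>")
            return f'<{name} eID="{marker_id}"/>'
        counters[name] = counters.get(name, 0) + 1
        marker_id = f"{name}{counters[name]}"
        open_ids.append((name, marker_id))
        return f'<{name}{attrs} sID="{marker_id}"/>'

    return TAG.sub(replace, xml_text)
```

Because every start and end tag becomes an independent empty element, a paragraph break that exists in one edition but not in another can surface as an ordinary difference in the token stream rather than as a conflict in the document tree.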

We first prepared the “skeleton” of the new TEI edition, a structure fundamentally different from the TEI currently featured at Romantic Circles. We include versions of the little-studied 1823 and “Thomas” editions. We hope our edition will inspire fresh investigations of longstanding questions about Frankenstein’s transformations, such as the extent of Godwin’s interventions in the text in 1823, how many of these persist in the 1831 text, and how the alterations Mary Shelley made in her Thomas copy marginalia diverge from the version of the text she prepared in 1831.


#
# Authors: Include contact information for at least the
# first author and corresponding author (if not the same),
# specifically email address, phone number (optional, but preferred), and institution.
# Contact information for all authors is preferred.
#

2. Author Information
<create a new entry for each additional author>

Corresponding Author Contact Information
    Name: Elisa Beshero-Bondar
    Institution: University of Pittsburgh, Greensburg
    Email: ebb8@pitt.edu

Author Contact Information (if applicable)
    Name: Rikk Mulligan
    Institution: Carnegie Mellon University
    Email: rmulligan@andrew.cmu.edu

Author Contact Information (if applicable)
    Name: Raffaele Viglianti
    Institution: University of Maryland, College Park
    Email: rviglian@umd.edu

---------------------
DATA & FILE OVERVIEW
---------------------

#
# Directory of Files in Dataset: List and define the different
# files included in the dataset. This serves as its table of
# contents.
#

Directory of Files:

1. collation-chunks.zip: Edition files prepared to be processed with CollateX.
Each of the five versions of Frankenstein is portioned into 33 "chunk" files. These XML files are portioned to represent aligned start and end points in each of the five versions of Frankenstein, and the segments are prepared to isolate comparable units for the best automated collation results. These files were prepared by Elisa Beshero-Bondar and Rikk Mulligan for machine-assisted collation with CollateX.

2. Part3.5-allWitnessIM_collation_to_xml.py: Python code written and maintained by Elisa Beshero-Bondar with assistance from David Birnbaum and Raffaele Viglianti. This code parses XML markup as strings of text and instructs CollateX to ignore particular tags for the purposes of comparison. It also contains the normalizing algorithm that indicates which different strings of text are to be treated as identical (such as the ampersand and the word "and").

3. collated-data.zip: Corrected Collation Data: These files represent carefully corrected output of machine-assisted collation with CollateX, prepared for the current edition prototype (as of January 2020) of the first third of the novel. Collated data files represent collation units encoded in XML markup following the TEI critical apparatus, holding aligned collation data on five versions of Frankenstein. Elisa Beshero-Bondar worked on correcting and improving CollateX alignments in these files. These files serve to construct the "spine" of the Variorum Edition.

4. unready-collated-data.zip: Uncorrected Collation Data: These files hold collation output files that remain to be corrected, together with a Readme file that documents how these files are currently used by the Frankenstein Variorum project team. They are currently used internally to produce HTML views for the project team to study comparisons across most of the novel.

5. standoff-spine.zip: Spine data file for the Frankenstein Variorum interface. The spine file holds data pointers that indicate locations in each of the five edition versions to coordinate the Variorum edition. Spine data are prepared with an XSLT pipeline process in the fv-postCollation repository (https://github.com/FrankensteinVariorum/fv-postCollation) by Elisa Beshero-Bondar with assistance from Raffaele Viglianti.

6. variorum-chunks.zip: Variorum edition files. These files represent print editions prepared from non-TEI sources for the Frankenstein Variorum interface. Data for these files are prepared by Elisa Beshero-Bondar with assistance from Raffaele Viglianti and Rikk Mulligan.
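
The normalization described in item 2 can be sketched as follows. This is an illustration under stated assumptions rather than an excerpt from Part3.5-allWitnessIM_collation_to_xml.py: the IGNORED_TAGS list, the function names, and the witness id in the usage note are hypothetical. The one external convention relied on is CollateX's pre-tokenized JSON input, in which each token carries its original text under "t" and an optional normalized form under "n".

```python
import re

# ASSUMPTION: a hypothetical list of markers to leave out of the comparison.
# The project's actual list lives in the script named in item 2.
IGNORED_TAGS = ("pb", "lb", "surface")
IGNORE = re.compile(r"<(?:%s)\b[^<>]*/>" % "|".join(IGNORED_TAGS))

# A token is either one whole tag or one run of non-space text.
TOKEN = re.compile(r"<[^<>]+>|[^<>\s]+")


def normalize(token):
    """Return the form of a token used for comparison."""
    if token in ("&", "&amp;"):
        return "and"
    return token


def tokenize(flat_xml):
    """Split flat XML into token dicts with original ("t") and normalized ("n") forms."""
    tokens, pending = [], ""
    for raw in TOKEN.findall(flat_xml):
        if IGNORE.fullmatch(raw):
            # Keep the marker in the text, but out of the comparison,
            # by attaching it to the next token's original form.
            pending += raw
            continue
        tokens.append({"t": pending + raw, "n": normalize(raw)})
        pending = ""
    if pending and tokens:
        tokens[-1]["t"] += pending  # trailing markers attach to the last token
    return tokens


def witness(siglum, flat_xml):
    """Build one witness entry in the shape of CollateX's JSON input."""
    return {"id": siglum, "tokens": tokenize(flat_xml)}
```

For example, witness("f1818", 'the cottage &amp; <pb n="12"/>the garden') and witness("f1831", "the cottage and the garden") yield identical "n" sequences, so a collator comparing normalized forms would report no variance there, while the page-break marker survives in the "t" value for later output.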

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

#
# Software: If specialized software(s) generated your data or
# are necessary to interpret it, please provide for each (if
# applicable): software name, version, system requirements,
# and developer.
#If you developed the software, please provide (if applicable):
#A copy of the software’s binary executable compatible with the system requirements described above.
#A source snapshot or distribution if the source code is not stored in a publicly available online repository.
#All software source components, including pointers to source(s) for third-party components (if any)

1. Software-specific information:

Name: CollateX
Version: 1.7.1
System Requirements:
Open Source? (Y/N): Y

(if available and applicable)
Executable URL:
Source Repository URL: https://collatex.net/
Developer:
Product URL:
Software source components:


#
# Dates of Data Collection: List the dates and/or times of
# data collection.
#

3. Date of data collection (single date, range, approximate date):

2016-10-15 to 2019-12-15