
Acquiring Digital Archives in the Field at Princeton

As a digital archivist on Mudd’s Technical Services team, I spend a fair amount of my time looking at screens like the one pictured here.

results of virus scan
21st century mold

I briefly panicked when I came across this screen while processing a restricted University Archives collection last year. It shows the output of ClamTK, the default virus scanner for our customized Ubuntu Linux digital archives workstation that I wrote about previously. How, in a collection of nearly 7,000 files spread across more than 800 subfolders, was I supposed to identify, assess, and possibly remove 34 individual viruses? The theatrics of the term “threats” were, fortunately, more dramatic than the actual threats themselves: embedded links in several PDF documents that the software flagged as PUAs, or potentially unwanted applications. I reviewed the specifics of each file, and afterwards packaged the bundle of documents for our secure storage location.

I joked with a few of my colleagues that handling digital archives might require archivists to become epidemiologists on the spot. The fortunate aspect of the above scenario was that it happened in our processing room, which meant that I was able to thoroughly research the issue, weigh the considerations, and then make a decision. I could only have wished for such calm and contained circumstances two weeks ago when I went to acquire 50 gigabytes of historical materials from the Princeton Plasma Physics Laboratory.

Typically when I go on site to acquire digital archives from University offices or departments, I gather background information on the software and hardware of the machine holding the content and I arrive with a Plan A, a Plan B, and a Plan C. But the challenge of archiving this particular set of documents made me reach for a Plan Z. In the words of Chris Cane, Manager of Digital Strategy and Visual Communication at PPPL:

“Jarrett M. Drake, Digital Archivist at the Seeley G. Mudd Manuscript Library, Princeton University, reached out to us regarding archiving the PPPL digital images and publications. 50GB worth of materials was a daunting task to archive, using state-of-the-art software tools and a fast, portable USB disk drive attached to an underpowered Mac. Through difficulties in downloading and ‘bagging’ the archive, Jarrett resorted to his scripting skills to solve the problem, and stayed on task to complete the archiving process with calmness and alacrity, and with minimal fuss.  Astonishingly, PPPL’s entire digital collection was available online in under a week!”

Chris Cane, Manager of Digital Strategy and Visual Communication, Princeton Plasma Physics Laboratory. Photo taken from the October 29, 2012, issue of PPPL Weekly, which is located at AC332, Princeton Plasma Physics Laboratory Records.

Chris, along with the rest of the PPPL staff whom I met that day, is awesome because he describes seemingly impossible tasks as merely “difficult.” He refers to ‘bagging’ the archive, which is the process we use to extract files and create cryptographic hash values for them. Creating such a value, also known as a checksum, requires a program that applies an algorithm to each file’s sequence of bits (0s and 1s) and generates an alphanumeric value that is, for practical purposes, unique to that file’s contents.
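To make the idea of a checksum concrete, here is a minimal sketch in Python. It is not the program used at PPPL; the function name `file_checksum`, the choice of SHA-256, and the one-megabyte chunk size are all assumptions made for illustration. The point it shows is that every bit of the file has to be read before the value can be produced.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hexadecimal checksum of the file at `path`.

    Hypothetical helper for illustration; the algorithm and chunk size
    are assumptions, not details taken from the acquisition itself.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read the file in fixed-size chunks so that large files
        # never have to fit in memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_checksum("report.pdf")` on a hypothetical file returns a 64-character string. Changing even one bit of the file yields a different string, which is what makes the value useful for confirming later that a file has not been altered.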

As you can imagine, there are more than a few bits in 50GB worth of materials (billions and billions of them, actually), so running this program would have taken a lifetime under the circumstances. Further complicating the procedure was that I could only access the archive itself through a remote connection on an Apple Mac mini desktop computer, which is hardly an ideal scenario in which to ask a program to read billions and billions of bits and create a unique value for them.

Screenshot of the January 29, 1999, issue of Hotline, which is located at AC332, Princeton Plasma Physics Laboratory Records. Image may or may not depict actual scenarios.

So I deviated from my plan on the fly and cooked up a small script on the command line that acquired the archive much more efficiently, all while preserving the materials’ original order and filesystem metadata. Chris dropped off the external hard drive the next day, and by the next week we had processed and described the digital materials and linked to them through the collection’s finding aid.
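The script itself is not reproduced in this post. Purely as an illustration of the general idea, copying a folder tree while keeping its arrangement and file timestamps might look something like the Python sketch below. The function name `acquire_tree` and the use of `shutil.copytree` with `shutil.copy2` are assumptions, not the commands that were run at PPPL.

```python
import shutil
from pathlib import Path


def acquire_tree(source, destination):
    """Copy `source` to `destination`, keeping folder layout and timestamps.

    Hypothetical helper for illustration only. Returns a list of
    (relative path, timestamp preserved?) pairs for every copied file.
    """
    source = Path(source)
    destination = Path(destination)

    # copytree recreates the folder hierarchy as-is; copy2 attempts to
    # carry over file metadata such as modification times.
    # The destination must not already exist.
    shutil.copytree(source, destination, copy_function=shutil.copy2)

    report = []
    for path in sorted(source.rglob("*")):
        if path.is_file():
            relative = path.relative_to(source)
            target = destination / relative
            # Compare whole seconds, since some filesystems store
            # timestamps at a coarser resolution than others.
            same_mtime = int(path.stat().st_mtime) == int(target.stat().st_mtime)
            report.append((relative.as_posix(), same_mtime))
    return report
```

A copy like this skips the expensive step of reading every file a second time to compute checksums, so those values can be generated later on a better-equipped workstation. How much metadata survives depends on the operating system and on the filesystem of the destination drive.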

The process concluded much more smoothly and predictably than it began, fortunately, but I think the PPPL acquisition highlights the rapidly changing environment of modern archival programs. Digital archives demand that archivists be light on their feet, open to alternative solutions, and creative under pressure. Many thanks to Chris and the PPPL staff (and whoever provided the delicious bagels I snacked on while writing the script) for affording us this learning opportunity and, most importantly, for donating a truly magnificent collection!

