Using open source tools in a newspaper digitization workflow

At the GLBT Historical Society we’re diligently digitizing more than 1,500 issues of the Bay Area Reporter, the San Francisco-based weekly newspaper that has served the LGBT community since 1971. Thanks to a generous grant from the Bob Ross Foundation, we purchased a shiny new scanner that can accommodate newspaper spreads, and we set about digitizing the paper to the specifications put forth by the National Digital Newspaper Program (NDNP) and the California Digital Newspaper Collection (CDNC). When the project is complete, we’ll have created a publicly accessible, full-text-searchable collection of more than three decades’ worth of LGBT and California history, written week by week.

The software that accompanied our scanning hardware appeared well suited to the task, with image-processing capabilities like deskewing, optical character recognition (OCR), and image format conversion baked in. In practice, however, we quickly realized that while this software worked well for small projects and one-off scans, it was not sufficient for the large-scale effort before us, mainly because it did not allow us to shift this processor-intensive work to off-hours. We set out to construct a digital workflow using free, open-source tools that would replicate these image-processing tasks but could churn through large batches of newspaper scans at night and over the weekend, freeing up precious work hours for staff, interns, and volunteers to move quickly from one newspaper issue to the next.
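To make the idea of an unattended batch concrete, here is a minimal sketch in Python. It is an illustration only, not our actual toolchain. It assumes ImageMagick (its legacy `convert` command) and Tesseract are installed, that raw scans are TIFF files in a single folder, and that a searchable PDF is an acceptable OCR output. The function names, the folder layout, and the 40% deskew threshold are all hypothetical choices made for the example.

```python
from pathlib import Path
import subprocess


def build_commands(scan_dir, out_dir):
    """Return, for each TIFF scan, the list of commands that would process it.

    Nothing is executed here, so the plan can be inspected before a long run.
    """
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    jobs = []
    for tiff in sorted(scan_dir.glob("*.tif")):
        deskewed = out_dir / f"{tiff.stem}_deskewed.tif"
        # Tesseract takes an output *base* name and appends the extension itself.
        ocr_base = out_dir / tiff.stem
        jobs.append([
            # Assumption: ImageMagick straightens the page image.
            ["convert", str(tiff), "-deskew", "40%", str(deskewed)],
            # Assumption: Tesseract writes a searchable PDF next to it.
            ["tesseract", str(deskewed), str(ocr_base), "pdf"],
        ])
    return jobs


def run_batch(scan_dir, out_dir):
    """Run every planned command in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for steps in build_commands(scan_dir, out_dir):
        for cmd in steps:
            subprocess.run(cmd, check=True)
```

A script like this could be started by hand at the end of the day or triggered by a system scheduler such as cron, so that the day's scans are processed overnight. Separating the planning step from the execution step is a design choice for the sketch: it lets someone review the queued work before committing the machine to hours of processing.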