Using open source tools in a newspaper digitization workflow

At the GLBT Historical Society, we’re diligently digitizing more than 1,500 issues of the Bay Area Reporter, the San Francisco-based weekly newspaper that’s been serving the LGBT community since 1971. Thanks to a generous grant from the Bob Ross Foundation, we purchased a shiny new scanner that could accommodate newspaper spreads, and we set about digitizing the paper to the specifications put forth by the National Digital Newspaper Program (NDNP) and the California Digital Newspaper Collection (CDNC). When the project is complete, we’ll have created a publicly accessible, full-text-searchable collection of more than three decades’ worth of LGBT and California history, written week by week.

The software that accompanied our scanning hardware appeared well suited for the task, with image-processing capabilities like deskewing, optical character recognition (OCR), and image format conversion baked in. In practice, however, we quickly realized that while this software worked well for small projects and one-off scans, it was not sufficient for the large-scale effort before us, mainly because it did not allow us to shift this processor-intensive work to off hours. We set out to construct a digital workflow using free, open-source tools that would replicate these image-processing tasks but could also churn through large batches of newspaper scans at night and over the weekend, freeing up precious work hours for staff, interns, and volunteers to move quickly from one newspaper issue to the next.
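To make the idea of off-hours batch processing concrete, here is a minimal Python sketch. It is not the workflow we built; it only illustrates the general pattern of running a chain of processing steps over every scan in a folder and skipping scans that were finished on an earlier run. The names `command_step` and `process_batch`, the `{src}` and `{dst}` placeholders, the `.done` marker files, and the `*.tif` pattern are all assumptions made for illustration, and the actual command-line tools and flags are deliberately left for you to fill in.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

# A step takes a source file and an output directory and returns the file it produced.
Step = Callable[[Path, Path], Path]


def command_step(command: List[str], suffix: str) -> Step:
    """Wrap an external command-line tool as a processing step.

    `command` is a hypothetical argument list containing the placeholders
    "{src}" and "{dst}"; substitute the real tool and flags you use.
    """
    def step(src: Path, out_dir: Path) -> Path:
        dst = out_dir / (src.stem + suffix)
        args = [a.replace("{src}", str(src)).replace("{dst}", str(dst))
                for a in command]
        subprocess.run(args, check=True)  # raises if the tool exits with an error
        return dst
    return step


def process_batch(in_dir: Path, out_dir: Path, steps: Iterable[Step],
                  pattern: str = "*.tif") -> List[Path]:
    """Run every step, in order, on each scan that has no completion marker yet."""
    out_dir.mkdir(parents=True, exist_ok=True)
    steps = list(steps)
    finished = []
    for scan in sorted(in_dir.glob(pattern)):
        marker = out_dir / (scan.stem + ".done")
        if marker.exists():
            continue  # already processed on an earlier run
        current = scan
        for step in steps:
            current = step(current, out_dir)  # each step feeds the next
        marker.touch()  # record completion so reruns skip this scan
        finished.append(current)
    return finished
```

A script like this could be started by whatever scheduler the operating system provides, so the processor-intensive work happens overnight while the scanner workstation stays free during the day. Because each finished scan leaves a marker file behind, an interrupted run can simply be started again.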

Tumblr Image Bot: A friendly social media robot

I, for one, welcome our new robot overlords.

The ARChive of Contemporary Music website features many image galleries depicting items from the collection, including great album and book covers, 45-rpm adaptors, punk flyers, and more. Since the launch of the site in May 2014, web traffic to the galleries has been relatively low: about a third as many users visit them as hit the homepage. The ARC’s social media posts also have relatively low reach and engagement (for example, tweets average one interaction each).

As an ARC employee and the developer of the ARC website, I thought that by repurposing interesting, fun, and quirky digital content in the context of social media, perhaps we could better engage followers, attract new users, and drive new traffic to the site, potentially attracting new donors to the non-profit archive.

This was my idea when I was dreaming up a final project for LIS 664 – Programming for Cultural Heritage. By the end of the semester, I had written some Python scripts that, in conjunction with free web services, allowed me to put this idea to the test.