
Scanning journals

I’ve recently begun scanning my journals using my iPhone and the Scanner Pro app, and it’s working out fairly well. My process:

  • Using the built-in iPhone camera app, I long press to lock focus and exposure (this saves time so it doesn’t have to autofocus each time), then photograph each page of the journal. It’s not as high quality as it would be if I used an actual scanner, but it’s much, much faster, and far more portable.
  • After I’m done photographing, I open Scanner Pro and select the images from the camera roll, then use the Black & White Document setting to process them into a PDF.
  • From Scanner Pro, I export the PDF to Dropbox.

The resulting PDF is nice and clean and easy to read, and the files aren’t too big (150 pages is usually between 80 and 200 megs — for me, very much worth the space to preserve important documents).

A concocted example:

input.jpg

That’s before (the image is straight from my iPhone camera, no postprocessing), and this is after Scanner Pro is done with it:

scanner-pro.jpg

I should add that ordinarily, with actual journals there wouldn’t be as much empty border around the content.

One hitch I’ve run into is that Scanner Pro chokes on anything larger than around 150 pages (it crashes), so I do long journals in chunks.

For that reason and a few other small annoyances, I’ve been looking into replacing Scanner Pro with a desktop-based script that takes a list of photos and processes them into a nice black-and-white PDF. ImageMagick gets me part of the way there with this command:

convert input.jpg -threshold 50% -blur 1x1 output.jpg

Here’s what it looks like for the above note card scan, at 30%, 50%, and 70% threshold, respectively:

imagemagick.jpg
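
For the batching part, here’s a rough, untested sketch of what a wrapper around that same convert command could look like: it runs each photo through the threshold and blur, then has ImageMagick merge the results into a single PDF. The filenames, output name, and fixed 50% threshold are all just placeholders.

#!/usr/bin/env python3
# Rough sketch: run each photo through the same ImageMagick threshold/blur
# step as above, then have ImageMagick assemble the pages into one PDF.
# Paths, output name, and the fixed 50% threshold are placeholders.

import subprocess
import sys
from pathlib import Path

def process_pages(photos, output_pdf, threshold="50%"):
    processed = []
    for photo in photos:
        photo = Path(photo)
        out = photo.parent / (photo.stem + "-bw.jpg")
        subprocess.run(
            ["convert", str(photo), "-threshold", threshold,
             "-blur", "1x1", str(out)],
            check=True,
        )
        processed.append(str(out))
    # ImageMagick will also happily merge the processed pages into one PDF.
    subprocess.run(["convert", *processed, output_pdf], check=True)

if __name__ == "__main__":
    process_pages(sorted(sys.argv[1:]), "journal.pdf")

Run it with the page photos as arguments (something like python3 scan.py page-*.jpg, whatever the script ends up being called) and it spits out journal.pdf.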

At some point I’ll try turning that into a Python script that dynamically evaluates each page and adjusts the threshold as necessary to get the best result (a rough sketch of the idea is below). Until then, though, I’m still using Scanner Pro.
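
Here’s what I mean by the dynamic-threshold part, using Pillow to compute an Otsu-style threshold for each page, binarize it, and save everything out as a single PDF. Again, the filenames are placeholders and I haven’t actually put this into use.

#!/usr/bin/env python3
# Rough sketch of the dynamic-threshold idea: pick a per-page threshold with
# Otsu's method instead of a fixed 50%, binarize each photo, and save the
# whole set as one PDF with Pillow. Not the script I'm actually using.

import sys
from PIL import Image

def otsu_threshold(gray):
    """Return an Otsu threshold (0-255) for a grayscale Pillow image."""
    hist = gray.histogram()                 # 256 bins for "L" mode
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0
    weight_bg = 0
    best_t, best_var = 0, 0.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(path):
    gray = Image.open(path).convert("L")
    t = otsu_threshold(gray)
    return gray.point(lambda p, t=t: 255 if p > t else 0).convert("1")

if __name__ == "__main__":
    pages = [binarize(p) for p in sorted(sys.argv[1:])]
    pages[0].save("journal.pdf", save_all=True, append_images=pages[1:])

Otsu’s method just picks the threshold that best separates the dark ink from the light background on each page, which should handle uneven lighting better than a fixed 50%.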



On digital Greek and Latin texts

A good blog post by Gregory Crane (editor-in-chief of the Perseus Digital Library at Tufts) back in February about the Digital Loeb Classical Library and the digitization of Greek and Latin texts:

We need transcriptions of public domain print editions to provide a starting point for work. These editions do not have to be the most up-to-date and they do not even have to be error free (99% may be good enough rather than 99.95%). If the community has the ability to correct and augment and to add features such as are described above and to receive recognition for that work, then the editions will evolve rapidly and outperform closed editions. If no community emerges to improve the editions, then the edition is good enough for current purposes. This model moves away from treating the community as a set of consumers and towards viewing members of the community as citizens with an obligation to contribute as well as to use.

The post has links to some fascinating projects I didn’t know about, like the Open Philology Project and the Homer Multitext Project.

