Ben Crowder / Blog

Inside the Mormon Texts Project

Here’s a bit more in-depth detail on the Mormon Texts Project and our process for digitizing books and getting them on Project Gutenberg.

Why we do what we do

First, we’re doing this to make more Mormon books available through Project Gutenberg. Why PG? They’ve been around for almost forty years, they have lots of mirrors, and they use the lowest common denominator (plain text) so that everyone can read the texts.

We’re also doing this because people in other countries often don’t have easy access to these books. Google Books, for example, often limits full view of these books to readers in the U.S.

Another reason is that most of these books aren’t available in Braille editions, and screen readers can almost always handle plain text files.

Choosing the books

We’re looking for books published before 1923 so that we know they’re in the public domain in the U.S. (Books published in or after 1923 are most likely still in copyright. There are exceptions, but it’s a hassle to figure out which books fall in that category, so for MTP we’re restricting ourselves to pre-1923 books. Luckily that still gives us almost a hundred years’ worth of Mormon books to choose from.)

As for selecting which pre-1923 Mormon books to do, we started with a short list of books that I’d heard of and wanted to digitize (Joseph Smith As Scientist, The Life of Heber C. Kimball, etc.). We’ve taken a few reader requests as well, and I’ve added to our list by searching for other books by some of these same authors (Orson F. Whitney, B.H. Roberts, John A. Widtsoe, etc.). I should add that we’re not interested in digitizing anti-Mormon books. We’re doing this to build the kingdom, not to try to tear it down.

When we finish with the books currently on our list, I plan to start working through A Mormon Bibliography to find other books to digitize.

Getting the books

A lot of Mormon books are already available on Google Books and the Internet Archive, which makes our job a lot easier. We can download page images (as PDFs) and unproofed OCRed text from both sites. And I’ve checked with Project Gutenberg and they have no problem with our using Google Books images as our original source, as long as the book is pre-1923.

For books not on Google Books or the Internet Archive, we’ll have to scan them ourselves and then run the page images through OCR software.

Copyright clearance

At this point I take the images of the title page and its verso (the back of the title page, which usually has the copyright statement) and submit them to Project Gutenberg for copyright clearance via their website. They’ve always responded within a few days to let me know whether we’re clear. (Since we only do pre-1923 works, we always get clearance.)

Digitizing

This is where most of the work takes place. I split the book into batches of one to five pages each (for a book from Google Books, I usually go with five pages, since that’s how Google gives me the OCRed text) and assign the batches to volunteers. They go through the OCRed text for their assigned pages, comparing it to the page images to eliminate typos and make sure the text matches the original. They also make sure the final text follows our MTP formatting guidelines.
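The batching described above can be sketched in a few lines of Python. This is only an illustration, not the script we actually use: the function names are made up, and the round-robin assignment is an assumption (in practice I hand out batches by hand).

```python
def split_into_batches(page_count, batch_size=5):
    """Return (first_page, last_page) tuples covering pages 1..page_count.

    batch_size is limited to 1-5 pages to match the batch sizes
    described in the post.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if not 1 <= batch_size <= 5:
        raise ValueError("batch_size must be between 1 and 5")
    batches = []
    for first in range(1, page_count + 1, batch_size):
        # The last batch may be shorter than batch_size.
        last = min(first + batch_size - 1, page_count)
        batches.append((first, last))
    return batches


def assign_batches(batches, volunteers):
    """Hand out batches round-robin (an assumed policy, for illustration)."""
    if not volunteers:
        raise ValueError("need at least one volunteer")
    assignments = {name: [] for name in volunteers}
    for i, batch in enumerate(batches):
        assignments[volunteers[i % len(volunteers)]].append(batch)
    return assignments
```

For example, split_into_batches(12) gives three batches, with the last one covering only pages 11 and 12.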

I’ve been emailing the OCRed text and page numbers to the volunteers, who can then go to Google Books to see the page images and use Notepad, TextEdit, or another text editor to edit the OCRed text. Once I finish Unbindery, the app I’m building for this, everything will be in the app and they won’t have to download anything. A system like that has worked pretty well for PGDP (Project Gutenberg’s Distributed Proofreaders).

(Why are we not just using PGDP, then? Mostly because I wanted a cleaner user interface. And it’s nice knowing that if we need any special features, I can easily add them to Unbindery.)

Proofing

After everyone finishes and returns their batches, I collate the batch text files into a single file and make a quick pass through the text making sure that things generally look right. Then we make a final, more thorough pass to make sure everything is formatted correctly and that we didn’t miss anything.
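The collating step is simple enough to script. Here is a minimal sketch, assuming the returned batches are saved in one folder with zero-padded names like batch_001.txt so that sorting by filename puts them in page order. The naming scheme and the function name are hypothetical, not something MTP prescribes.

```python
from pathlib import Path


def collate_batches(batch_dir, output_path):
    """Join every batch_*.txt file in batch_dir into one text file.

    Files are joined in filename order, so the names must be
    zero-padded (batch_001.txt, batch_002.txt, ...). Returns the
    number of batch files that were joined.
    """
    batch_files = sorted(Path(batch_dir).glob("batch_*.txt"))
    with open(output_path, "w", encoding="utf-8") as out:
        for batch_file in batch_files:
            text = batch_file.read_text(encoding="utf-8")
            # Make sure each batch ends with exactly one newline so the
            # last line of one batch doesn't run into the next.
            out.write(text.rstrip("\n") + "\n")
    return len(batch_files)
```

The result still needs the quick read-through described above, since a script can’t tell whether a paragraph was split across two batches.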

I should add here that with Unbindery, each batch will be proofed twice before it’s considered complete, which will help with accuracy.

Releasing

Once we finish proofing, I go to the Project Gutenberg upload page, fill out the form, submit the finished text, and wait. It usually takes only a couple of days (sometimes just a few hours) before the text is up on Project Gutenberg and available for download.

And then we start the process all over again with another book.

Volunteering with MTP

If this sort of thing interests you, email me. We’d love to have you.