Ben Crowder

Blog: #project-gutenberg


I’m getting a bit of a nostalgia kick reading through the Standard Ebooks process. I haven’t made anything with them (though they do good work and I’m reading two of their editions right now), but years ago — in a former life, it seems — I used to make ebook editions of old books.

As far as I can tell, the first ebook I made was Chesterton’s The Ball and the Cross, which I typed up by hand and posted to Project Gutenberg. Around that time I worked on a handful of other books for PG, including Henry Sweet’s An Icelandic Primer, which was much more involved (Old Icelandic characters, tables, etc.) and incredibly fun.

After that I worked on several more books as part of the Mormon Texts Project and also started making EPUB and Kindle editions of other books (like the 1812/1815 edition of Grimms’ fairy tales and George MacDonald’s The Light Princess). Those were quite fun, too.

Five or so years ago I stopped, partly from working on other things, partly from repetitive strain injuries. (Even with Vim macros to help, there’s still a multitude of repetitive keystrokes in cleaning up texts, at least for me.) Reading about Standard Ebooks and writing this post, though, has tempted me to get back into it. I built Fledge years ago as an attempt to script away more of the repetitive work, and I suspect wiser use of both it and Vim might be enough to minimize the RSI.

On a related note, I’ve been wanting to rewrite md2epub. It’s a decent-enough Python script that takes Markdown files and turns them into an EPUB, and it’s worked well. But it’s an old codebase, and I don’t like the name anymore, and it could be faster, and I have a few ideas on how to make it more ergonomic, so I’m planning to dub it Caxton and rewrite it in Go or Rust. (Primarily so I’ll have an easier way to make EPUB editions of my fiction.) This part is the most likely to actually happen, I think.

Reply via email or via office hours

Recommended: Standard Ebooks. They’re doing the same kind of thing I’ve done: making nice EPUB/Kindle editions of Project Gutenberg texts (though my efforts have of course been at a much smaller scale, and far more sporadic). Even better, Standard Ebooks has good typography standards and they’re proofing the books against original scans. This is a good project.


Inside the Mormon Texts Project

Here’s a bit more in-depth detail on the Mormon Texts Project and our process for digitizing books and getting them on Project Gutenberg.

Why we do what we do

First, we’re doing this to make more Mormon books available through Project Gutenberg. Why PG? They’ve been around for almost forty years, they have lots of mirrors, and they use the lowest common denominator (plain text) so that everyone can read the texts.

We’re also doing this because people in other countries often don’t have easy access to these books. Google Books is only available in the U.S., for example.

Another reason is that most of these books aren’t available in Braille editions, and screenreaders are mostly guaranteed to be able to read plain text files.

Choosing the books

We’re looking for books published before 1923 so that we know they’re 100% public domain. (Books published in or after 1923 are most likely still in copyright. There are exceptions, but it’s a hassle to figure out which books fall in that category, so for MTP we’re restricting ourselves to pre-1923 books. Luckily that still gives us almost a hundred years’ worth of Mormon books to choose from.)

As for selecting which pre-1923 Mormon books to do, we started with a short list of books that I’d heard of and wanted to digitize (Joseph Smith As Scientist, The Life of Heber C. Kimball, etc.). We’ve taken a few reader requests as well, and I’ve added to our list by searching for other books by some of these same authors (Orson F. Whitney, B.H. Roberts, John A. Widtsoe, etc.). I should add that we’re not interested in digitizing anti-Mormon books. We’re doing this to build the kingdom, not to try to tear it down.

When we finish with the books currently on our list, I plan to start working through A Mormon Bibliography to find other books to digitize.

Getting the books

A lot of Mormon books are already available on Google Books and the Internet Archive, which makes our job a lot easier. We can download page images (as PDFs) and unproofed OCRed text from both sites. And I’ve checked with Project Gutenberg and they have no problem with our using Google Books images as our original source, as long as the book is pre-1923.

For books not on Google Books or the Internet Archive, we’ll have to scan them ourselves and then run the page images through OCR software.

Copyright clearance

At this point I’ll take the images for the title page and verso (the page right after the title page, which usually has the copyright statement) and submit them to Project Gutenberg for copyright clearance, via their website. They’ve always responded within a few days letting me know if we’re clear. (Since we only do pre-1923 works, we always get clearance.)


Proofing is where most of the work takes place. I split the book up into batches of one to five pages each (if we’re doing a book from Google Books, I usually go with five pages since that’s how Google gives the OCRed text to me) and begin to assign batches to volunteers. They then take their assigned pages and go through the OCRed text, comparing it to the page images to eliminate typos and make sure we’re digitizing things correctly. They also make sure the final text follows our MTP guidelines for formatting and such.
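For the curious, the batching step can be sketched in a few lines of Python. This is only an illustration, not the script I actually use: the function names are made up, and handing batches out round-robin is an assumption (in practice I assign them by hand).

```python
def make_batches(first_page, last_page, batch_size=5):
    """Split an inclusive page range into consecutive (start, end) batches
    of at most batch_size pages each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    start = first_page
    while start <= last_page:
        end = min(start + batch_size - 1, last_page)
        batches.append((start, end))
        start = end + 1
    return batches


def assign_batches(batches, volunteers):
    """Hand batches out to volunteers in turn (round-robin is an assumption,
    not the project's actual rule)."""
    if not volunteers:
        raise ValueError("need at least one volunteer")
    assignments = {name: [] for name in volunteers}
    for i, batch in enumerate(batches):
        assignments[volunteers[i % len(volunteers)]].append(batch)
    return assignments


if __name__ == "__main__":
    # Hypothetical 12-page book and two hypothetical volunteers.
    batches = make_batches(1, 12)
    for name, pages in assign_batches(batches, ["alice", "bob"]).items():
        print(name, pages)
```

The last batch is simply whatever pages are left over, so a 12-page book in batches of five ends with a two-page batch.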

I’ve been emailing the OCRed text and page numbers to the volunteers (who can then just go to Google Books to see the page images and use Notepad or TextEdit or another text editor to edit the OCRed text), but once I finish Unbindery, everything will be in the app and they won’t have to download anything. That kind of in-app workflow has worked pretty well for PGDP (Distributed Proofreaders).

(Why are we not just using PGDP, then? Mostly because I wanted a cleaner user interface. And it’s nice knowing that if we need any special features, I can easily add them to Unbindery.)


After everyone finishes and returns their batches, I collate the batch text files into a single file and make a quick pass through the text making sure that things generally look right. Then we make a final, more thorough pass to make sure everything is formatted correctly and that we didn’t miss anything.
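Collating is mostly a matter of putting the batches back in page order and making sure none went missing. Here is a minimal sketch in Python, assuming each returned batch is represented as a (first page, last page, text) tuple and that batches are joined with a blank line; both of those are assumptions for the example, not a description of my actual files.

```python
def collate(batches):
    """Join proofed batches in page order.

    batches: iterable of (first_page, last_page, text) tuples.
    Raises ValueError if a batch is missing or overlaps its neighbor.
    """
    ordered = sorted(batches, key=lambda b: b[0])
    if not ordered:
        return ""
    expected = ordered[0][0]  # page number the next batch should start on
    parts = []
    for first, last, text in ordered:
        if first != expected:
            raise ValueError(
                f"expected a batch starting at page {expected}, got {first}"
            )
        parts.append(text.strip("\n"))
        expected = last + 1
    # One blank line between batches, one trailing newline at the end.
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    returned = [
        (6, 10, "Text of pages six through ten.\n"),
        (1, 5, "Text of pages one through five.\n"),
    ]
    print(collate(returned), end="")
```

Joining with a blank line will split any paragraph that runs across a batch boundary, which is one of the things the quick pass afterward is meant to catch.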

I should add here that with Unbindery, each batch will be proofed twice before it’s considered complete, which will help with accuracy.


Once we finish the proof, I go to the Project Gutenberg upload page, fill out the form, submit the finished text, and wait. It usually only takes a couple days — sometimes just a few hours — before the text is up on Project Gutenberg and available for download.

And then we start the process all over again with another book.

Volunteering with MTP

If this sort of thing interests you, email me. We’d love to have you.


Mormon Digitization Project, resurrected

I’m resurrecting the Mormon Digitization Project, which I blogged about nine months ago and then abandoned while I went and got married. (I feel justified.)

Project page: Mormon Digitization Project

Brief recap: the goal is to find pre-1923 Mormon books (out of copyright), scan them, OCR them, clean up the OCRed text, and release the plain text files on Project Gutenberg (along with ePub editions, possibly PDFs, and possibly Lulu editions as well).

I’m starting with John A. Widtsoe’s book Joseph Smith As Scientist and will go from there. If you have any suggestions/requests, leave them in the comments (or email them to me). If I get enough people helping out, we’ll be able to tackle a few books at a time.

Process-wise, I’m thinking about trying Bite-Size Edits for at least part of the cleanup. There’s also a remote possibility I’ll use PGDP, but I really, really don’t like their interface. Right now I’m planning to track things using email and a Google Spreadsheet.

Yes, this will be kind of similar to the Mormon Documentation Project, but they don’t seem to be doing the types of books we’ll be doing. (I did use their text for the Standard Works web app and for this D&C reader’s edition I’m still working on, though.)


Mormon Digitization Project

Not too long ago I downloaded Eucalyptus, a slick new ebook reader for the iPhone. I love it. I didn’t think anything could knock Stanza down from being king of the hill in my ebook-reading world, but Eucalyptus did it, and with style.

Caveat: Eucalyptus can only read books from Project Gutenberg. But that’s not really a problem for me, since most of what I wanted to read was on there anyway. (Well, most of what I wanted to read that already happened to be free.)

Fast forward to this morning. I’m Mormon, and I want to read more Mormon-related texts. I searched around on Project Gutenberg but only found six or seven books — the Book of Mormon (of course), James E. Talmage’s Jesus the Christ and The Story of Mormonism, and then some outsider and/or anti works. Hardly anything.

I want to change that.

There are lots of public domain (pre-1923) texts related to the Church which would be valuable to make available for free, so my new goal is to start digitizing them and putting them into Project Gutenberg. (So I can read them in Eucalyptus.)

Yes, yes, I’m aware that there are already places like GospeLink with plenty of these texts. That’s great, but I want Mormon books in Project Gutenberg, and so far that hasn’t really happened. It’s been seven years since I submitted The Story of Mormonism to Project Gutenberg, and the number of Mormon-related texts added since then (if any) is paltry at best.

I’m going to start building a list of the books I think should be added, and if you have any additions, let me know. (The only real stipulation is that there has to be at least one edition of the book published before 1923, to ensure that it’s out of copyright.) First on my list is John A. Widtsoe’s Joseph Smith As Scientist. I also plan to add the D&C, Pearl of Great Price, and eventually the Journal of Discourses.

I’ll also be developing my Unbindery web app as part of this, and I’ll need volunteers to help with proofreading. When that part is ready, I’ll let you know, but if any of you do want to help out, shoot me an email and I’ll add you to the list.

Last but not least: I like naming things, mainly so I have a way to talk about them. To that end, then, I’m going to call this the Mormon Digitization Project. Here we go.
