Ben Crowder / Blog

Blog: #project-gutenberg

Eons ago in 2006, I apparently — I have no memory of this and happened to stumble across it tonight — wrote up a small piece for the Project Gutenberg Volunteers’ Voices page about my experience with PG.

In case the page ever goes away (yes, quite aware of the irony should that happen), here’s the text of it, with the referenced projects linked:

I’ve been a book lover ever since the day I learned to read. Several years ago I discovered Project Gutenberg while surfing the net and was delighted to find so many good books freely available. I downloaded all the etexts I was interested in and read quite a few of them. After a few years, I decided to get more involved, so I started proofing with Distributed Proofreaders. I liked that a lot — I was a newspaper editor in high school for two years — but I felt an itch to try to produce etexts on my own. I didn’t have a scanner, however, so the only solution I could see at the time was to find a book and start typing it in by hand. I’m a relatively fast typist and I figured it wouldn’t take that long.

So, I went to my university library, found a pre-1923 edition of G.K. Chesterton’s The Ball and the Cross (Chesterton is one of my favorite writers), and began typing. It took much longer than I expected — certainly over 30 hours, perhaps even close to 50. When I finished, I came across a page on the PG site that mentioned there should be two spaces between sentences. I looked at the etext I’d just typed in and realized in horror that I’d used single spaces the whole way through. :) [1] I had been sure that PG used single spaces, convinced that I’d read it in one of the PG docs, which had taken a little while to get used to since I normally use two spaces. But all the PG etexts I checked had two spaces between sentences, so I began the monotonous task of adding an extra space between each sentence (and being very careful not to add spaces in where they shouldn’t be). Several hours later the book was finally done. I’d gotten copyright clearance before I started, so I soon submitted it and within a few days I saw those lovely words in my inbox, “Posted (#5265, Chesterton)”.

[1] Ben was right both times: people have posted advocating both one space and two. Either would have been accepted! –jt
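
As an aside, the monotonous single-to-double-space pass described above can be roughly scripted. What follows is a minimal Python sketch, not a record of how the book was actually fixed. The function name `double_space_sentences` and the `ABBREVIATIONS` set are made up for illustration, and it assumes plain ASCII punctuation and that a sentence starts with a capital letter.

```python
import re

# Hypothetical list of abbreviations that end in a period but do not end a
# sentence. A real text would need a longer list (initials, titles, etc.).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

# A word ending in . ! or ? (optionally followed by a closing quote or
# bracket), then exactly one space, then what looks like a sentence start.
GAP = re.compile(r"""(\S+[.!?]["')\]]?) (?=["'(\[]?[A-Z])""")


def double_space_sentences(text):
    """Return text with two spaces after each apparent sentence end."""
    def widen(match):
        word = match.group(1)
        if word in ABBREVIATIONS:
            return match.group(0)  # leave the single space alone
        return word + "  "
    return GAP.sub(widen, text)


print(double_space_sentences("Mr. Smith went home. He slept."))
```

Gaps that already have two spaces are left alone, so running it twice is harmless. Initials such as "G.K." would still be treated as sentence ends unless they are added to the list, which is exactly the kind of place where care is needed.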

Since then, I’ve been addicted to producing etexts. Languages interest me greatly, so I found an Old Icelandic primer that someone had scanned in, OCRed the images using DocMorph (it didn’t take as long as I thought it would, and the output was decent enough to work with), and realized I would have a problem entering in the foreign characters (o’s with hooks underneath, etc.). Thank heavens for Unicode. Vim (my editor of choice) has fairly good Unicode support and it didn’t take long to make a list of the Unicode codes for the Icelandic characters.

As noted, I use Vim for all my editing. I can rewrap lines to 65 characters by typing “gq”, I can use regular expressions for search and replace (very handy), I can edit in Unicode when I need to, and I can speed things up greatly by making keyboard mappings for repetitive tasks. (On one text I was working on, I had to add a blank line between each paragraph. Each was numbered, but the blank lines had somehow been taken out before I got the text, so I started going through and adding them in by hand. The file was 30,000 lines long, however, and I quickly realized it would take a long time. I then noted which keys I was pressing to add the blank line between each paragraph, mapped them to [this bit got lost somewhere along the way], and held the key down while Vim zipped through the rest of the file. That sped things up by a factor of over a hundred.)
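
The two chores mentioned above (restoring blank lines between numbered paragraphs, and rewrapping to 65 characters) can also be sketched outside Vim. This is a rough Python analogue, not the lost Vim mapping. It assumes each paragraph starts on a line beginning with a number and a period (e.g. "12. "), which the original text does not specify, and the function names are illustrative only.

```python
import re
import textwrap

# Assumption: a numbered paragraph begins with digits, a period, and a space.
PARAGRAPH_START = re.compile(r"^\d+\.\s")


def separate_numbered_paragraphs(lines):
    """Insert a blank line before each numbered paragraph that lacks one."""
    out = []
    for line in lines:
        # Only add a blank line if the previous line exists and is not blank.
        if PARAGRAPH_START.match(line) and out and out[-1].strip():
            out.append("")
        out.append(line)
    return out


def rewrap(text, width=65):
    """Rewrap each blank-line-separated paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


lines = ["1. First paragraph", "continues here.", "2. Second.", "3. Third."]
print("\n".join(separate_numbered_paragraphs(lines)))
```

Python's `textwrap.fill` follows its own wrapping rules, so its output will not always match what Vim's `gq` produces line for line.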

My university library is well-stocked and has lots of old books, so I usually rely on it when I need to get TP&Vs (title page and verso images) for texts I’m not typing in myself. I still don’t have a scanner, so I either find already-existing texts on the Internet and reformat them for Project Gutenberg (after getting permission, of course), or find page images on the net and OCR them myself, or type the books in by hand. Typing in by hand takes a long time and so I prefer the first two methods.

Volunteering with Project Gutenberg has been extremely satisfying. The people are wonderful to work with, the work is fun, and it feels very good to know that one is making a difference in the world.

One unimportant point I feel the urge to clarify: I absolutely do not use two spaces between sentences anymore. (I use three! Sometimes four when I’m feeling exuberant!) (But alas, no, my typographic preferences very much inhibit such unbridled playfulness.)


Reply via email or office hours

I’m getting a bit of a nostalgia kick reading through the Standard Ebooks process. I haven’t made anything with them (though they do good work and I’m reading two of their editions right now), but years ago — in a former life, it seems — I used to make ebook editions of old books.

As far as I can tell, the first ebook I made was Chesterton’s The Ball and the Cross, which I typed up by hand and posted to Project Gutenberg. Around that time I worked on a handful of other books for PG, including Henry Sweet’s An Icelandic Primer, which was much more involved (Old Icelandic characters, tables, etc.) and incredibly fun.

After that I worked on several more books as part of the Mormon Texts Project and also started making EPUB and Kindle editions of other books (like the 1812/1815 edition of Grimms’ fairy tales and George MacDonald’s The Light Princess). Those were quite fun, too.

About five years ago I stopped, partly from working on other things, partly from repetitive strain injuries. (Even with Vim macros to help, there’s still a multitude of repetitive keystrokes in cleaning up texts, at least for me.) With reading about Standard Ebooks and writing this post, though, I’m tempted to get back into it. I built Fledge years ago as an attempt to script away more of the repetitive work, and I suspect wiser use of both it and Vim might be enough to minimize the RSI.

On a related note, I’ve been wanting to rewrite md2epub. It’s a decent-enough Python script that takes Markdown files and turns them into an EPUB, and it’s worked well. But it’s an old codebase, and I don’t like the name anymore, and it could be faster, and I have a few ideas on how to make it more ergonomic, so I’m planning to dub it Caxton and rewrite it in Go or Rust. (Primarily so I’ll have an easier way to make EPUB editions of my fiction.) This part is the most likely to actually happen, I think.



Recommended: Standard Ebooks. They’re doing the same kind of thing I’ve done: making nice EPUB/Kindle editions of Project Gutenberg texts (though my efforts have of course been at a much smaller scale, and far more sporadic). Even better, Standard Ebooks has good typography standards and they’re proofing the books against original scans. This is a good project.



Inside the Mormon Texts Project

Here’s a bit more in-depth detail on the Mormon Texts Project and our process for digitizing books and getting them on Project Gutenberg.

Why we do what we do

First, we’re doing this to make more Mormon books available through Project Gutenberg. Why PG? They’ve been around for almost forty years, they have lots of mirrors, and they use the lowest common denominator (plain text) so that everyone can read the texts.

We’re also doing this because people in other countries often don’t have easy access to these books. Full-view scans on Google Books are often only viewable from within the U.S., for example.

Another reason is that most of these books aren’t available in Braille editions, and screen readers can almost always handle plain text files.

Choosing the books

We’re looking for books published before 1923 so that we know they’re 100% public domain. (Books published in or after 1923 are most likely still in copyright. There are exceptions, but it’s a hassle to figure out which books fall in that category, so for MTP we’re restricting ourselves to pre-1923 books. Luckily that still gives us almost a hundred years’ worth of Mormon books to choose from.)

As for selecting which pre-1923 Mormon books to do, we started with a short list of books that I’d heard of and wanted to digitize (Joseph Smith As Scientist, The Life of Heber C. Kimball, etc.). We’ve taken a few reader requests as well, and I’ve added to our list by searching for other books by some of these same authors (Orson F. Whitney, B.H. Roberts, John A. Widtsoe, etc.). I should add that we’re not interested in digitizing anti-Mormon books. We’re doing this to build the kingdom, not to try to tear it down.

When we finish with the books currently on our list, I plan to start working through A Mormon Bibliography to find other books to digitize.

Getting the books

A lot of Mormon books are already available on Google Books and the Internet Archive, which makes our job a lot easier. We can download page images (as PDFs) and unproofed OCRed text from both sites. And I’ve checked with Project Gutenberg and they have no problem with our using Google Books images as our original source, as long as the book is pre-1923.

For books not on Google Books or the Internet Archive, we’ll have to scan them ourselves and then run the page images through OCR software.

Copyright clearance

At this point I’ll take the images for the title page and verso (the page right after the title page, which usually has the copyright statement) and submit them to Project Gutenberg for copyright clearance, via their website. They’ve always responded within a few days letting me know if we’re clear. (Since we only do pre-1923 works, we always get clearance.)

Digitizing

This is where most of the work takes place. I split the book up into batches of one to five pages each (if we’re doing a book from Google Books, I usually go with five pages since that’s how Google gives the OCRed text to me) and begin to assign batches to volunteers. They then take their assigned pages and go through the OCRed text, comparing it to the page images to eliminate typos and make sure we’re digitizing things correctly, also ensuring that the final text follows our MTP guidelines for formatting and such.
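
The batching step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual MTP tooling: the function names are made up, and the round-robin assignment is an assumption (the real assignments were handled by hand over email).

```python
def make_batches(first_page, last_page, batch_size=5):
    """Split an inclusive page range into (start, end) batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size - 1, last_page))
        for start in range(first_page, last_page + 1, batch_size)
    ]


def assign_batches(batches, volunteers):
    """Hand out batches to volunteers in turn (round-robin, an assumption)."""
    assignments = {name: [] for name in volunteers}
    for i, batch in enumerate(batches):
        assignments[volunteers[i % len(volunteers)]].append(batch)
    return assignments


batches = make_batches(1, 12)
print(batches)
print(assign_batches(batches, ["ann", "bo"]))
```

The last batch is simply shorter when the page count is not a multiple of the batch size, which matches the "one to five pages" range described above.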

I’ve been emailing the OCRed text and page numbers to the volunteers (who can then just go to Google Books to see the page images and use Notepad, TextEdit, or another text editor to edit the OCRed text), but once I finish Unbindery, everything will be in the app and they won’t have to download anything. That kind of web-based setup has worked pretty well for PGDP.

(Why are we not just using PGDP, then? Mostly because I wanted a cleaner user interface. And it’s nice knowing that if we need any special features, I can easily add them to Unbindery.)

Proofing

After everyone finishes and returns their batches, I collate the batch text files into a single file and make a quick pass through the text making sure that things generally look right. Then we make a final, more thorough pass to make sure everything is formatted correctly and that we didn’t miss anything.
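
Collating the returned batches is the most mechanical part of this step. Here is a minimal Python sketch of it, under the assumption (not stated above) that each batch comes back as a UTF-8 text file named after its starting page, such as `pages-6-10.txt`. The file naming and the function names are hypothetical.

```python
import re
from pathlib import Path


def page_key(path):
    """Sort key: the first number in the filename (assumed starting page)."""
    match = re.search(r"\d+", path.name)
    return int(match.group(0)) if match else 0


def collate(batch_dir, output_path):
    """Join all batch files in page order, separated by one blank line.

    Returns the number of batch files that were combined.
    """
    batch_files = sorted(Path(batch_dir).glob("*.txt"), key=page_key)
    parts = [f.read_text(encoding="utf-8").strip("\n") for f in batch_files]
    Path(output_path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(batch_files)
```

Sorting on the numeric page rather than the raw filename keeps `pages-11-12.txt` from landing ahead of `pages-6-10.txt`. The output file should live outside the batch directory so a second run does not pick it up as a batch.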

I should add here that with Unbindery, each batch will be proofed twice before it’s considered complete, which will help with accuracy.

Releasing

Once we finish the proof, I go to the Project Gutenberg upload page, fill out the form, submit the finished text, and wait. It usually only takes a couple of days (sometimes just a few hours) before the text is up on Project Gutenberg and available for download.

And then we start the process all over again with another book.

Volunteering with MTP

If this sort of thing interests you, email me. We’d love to have you.



Mormon Digitization Project, resurrected

I’m resurrecting the Mormon Digitization Project, which I blogged about nine months ago and then abandoned while I went and got married. (I feel justified.)

Project page: Mormon Digitization Project

Brief recap: the goal is to find pre-1923 Mormon books (out of copyright), scan them, OCR them, clean up the OCRed text, and release the plain text files on Project Gutenberg (along with ePub editions, possibly PDFs, and possibly Lulu editions as well).

I’m starting with John A. Widtsoe’s book Joseph Smith As Scientist and will go from there. If you have any suggestions/requests, leave them in the comments (or email them to me). If I get enough people helping out, we’ll be able to tackle a few books at a time.

Process-wise, I’m thinking about trying Bite-Size Edits for at least part of the cleanup. There’s also a remote possibility I’ll use PGDP, but I really, really don’t like their interface. Right now I’m planning to track things using email and a Google Spreadsheet.

Yes, this will be kind of similar to the Mormon Documentation Project, but they don’t seem to be doing the types of books we’ll be doing. (I did use their text for the Standard Works web app and for this D&C reader’s edition I’m still working on, though.)



Mormon Digitization Project

Not too long ago I downloaded Eucalyptus, a slick new ebook reader for the iPhone. I love it. I didn’t think anything could knock Stanza down from being king of the hill in my ebook-reading world, but Eucalyptus did it and with style.

Caveat: Eucalyptus can only read books from Project Gutenberg. But that’s not really a problem for me, since most of what I wanted to read was on there anyway. (Well, most of what I wanted to read that already happened to be free.)

Fast forward to this morning. I’m Mormon, and I want to read more Mormon-related texts. I searched around on Project Gutenberg but only found six or seven books — the Book of Mormon (of course), James E. Talmage’s Jesus the Christ and The Story of Mormonism, and then some outsider and/or anti works. Hardly anything.

I want to change that.

There are lots of public domain (pre-1923) texts related to the Church which would be valuable to make available for free, so my new goal is to start digitizing them and putting them into Project Gutenberg. (So I can read them in Eucalyptus.)

Yes, yes, I’m aware that there are already places like GospeLink with plenty of these texts. That’s great, but I want Mormon books in Project Gutenberg, and so far that hasn’t really happened. It’s been seven years since I submitted The Story of Mormonism to Project Gutenberg, and the number of Mormon-related texts added since then (if any) is paltry at best.

I’m going to start building a list of the books I think should be added, and if you have any additions, let me know. (The only real stipulation is that there has to be at least one edition of the book published before 1923, to ensure that it’s out of copyright.) First on my list is John A. Widtsoe’s Joseph Smith As Scientist. I also plan to add the D&C, Pearl of Great Price, and eventually the Journal of Discourses.

I’ll also be developing my Unbindery web app as part of this, and I’ll need volunteers to help with proofreading. When that part is ready, I’ll let you know, but if any of you do want to help out, shoot me an email and I’ll add you to the list.

Last but not least: I like naming things, mainly so I have a way to talk about them. To that end, then, I’m going to call this the Mormon Digitization Project. Here we go.

