Well, I’m 99% better. The cough is virtually nonexistent, though it still pops up from time to time (but very rarely). It’s very nice to be normal again.

I downloaded Clara OCR this morning. I’m thinking about OCRing Henry Sweet’s An Icelandic Primer and submitting it to Project Gutenberg (after cleaning up the text, of course). Clara OCR looks like the most promising free OCR software, but I must have been doing it wrong, since it would only recognize the letters I’d trained it to recognize (the ones I’d clicked on). Training it is a lot of fun, by the way. I could spend hours doing it. I’m serious. But the point isn’t to click on every single letter on the page. I also tried GOCR, with limited success (but more than with Clara OCR). Its output will take a lot of clean-up. I’ve also got to figure out how to enter the Icelandic characters. Perhaps I’ll have to use Unicode.

I don’t think it’s going to work after all. I’ll have to look around and see if there are any other free OCR programs available. I’m rather interested in finding out what options there are for entering foreign characters. Unicode, yes — are there any others? I think Unicode is the only real option. Is mixing ASCII and Unicode kosher? Perhaps the best way to put something like the Icelandic grammar online is to do it in XML/HTML using normal text for the most part and Unicode entities when necessary.
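The "normal text plus entities" idea can be sketched in a few lines of Python. This is only an illustration, not part of my actual workflow: the function name to_html_entities and the sample word are made up for the example. It relies on Python's standard "xmlcharrefreplace" error handler, which leaves ASCII alone and turns everything else into a numeric character reference.

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Plain ASCII passes through unchanged, so the result is safe to paste
    into an HTML or XML file saved in any ASCII-compatible encoding.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample word containing o with ogonek (U+01EB).
sample = "h\u01ebnd"
print(to_html_entities(sample))  # prints: h&#491;nd
```

As for whether mixing is kosher: in UTF-8 the ASCII characters keep their usual single-byte values, so a mostly-ASCII file with a few Unicode characters (or entities) is perfectly ordinary.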

I’ve realized that I won’t have time to work on Proofread after all. I’m involved in too many other things at the moment. I really want to see it finished, though, so I’m going to try to find someone else who’s interested in it. If you think you’d like to try your hand at it (it’s written in C using Gtk+), let me know. It’s already usable, and all it really needs is some bug fixes before it’s solid.

I’m OCRing the Icelandic primer through DocMorph, a free web OCR service. It’s rather slow (by nature of uploading the images and so on), but it works.

It took about an hour to OCR the Icelandic primer. There’ll be a lot of cleanup. I’m thinking I’ll put it in HTML using the Unicode entities for the non-ASCII characters.

I’ve cleaned up all the prefatory material and the first thirteen pages. It takes a devilishly long time, especially since there are so many italics (which I put underscores around in the text).

I’m using xterm with the -u8 setting. (The full command is xterm -u8 -fg gray -bg black -geometry 90x31+220+175 -fn -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1.) Once I have that running, I open the text file in Vim. I decided to rename all the OCRed files to .ice (from .txt), so that I could put an au BufNewFile,BufRead *.ice set encoding=utf-8 line in my .vimrc (that tells Vim to treat all .ice files as UTF-8).

To enter the foreign characters, I just have to type Ctrl-V followed by a ‘u’ and then the hex code for the character (an o with an ogonek (ǫ), for example, is 01EB, so I type "Ctrl-V u 01EB" to produce it). So far I haven’t run into a character I can’t reproduce with Unicode. (There is an o with an ogonek and a diaeresis on top of it, but luckily there’s a combining diaeresis, so I just insert the o/ogonek and then the diaeresis (which is hex 0308, FYI).)

The cleaning-up process is going along very well. I’m very pleased. It’s slow, of course, but it’s a lot more fun than typing in The Ball and the Cross. I’m even learning a bit about Old Icelandic in the process. Right now I’m making a text file out of the primer, but the really useful one will be the HTML conversion (which I’ll do at the end). I’m thinking about an XML version as well, but that can wait. Vim has pretty good Unicode support, provided that you have the right font on your terminal. Yudit is nice for checking the files and making sure they’re okay.
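For the curious, here is a rough Python sketch of the two conventions described above: building the o with ogonek and diaeresis from U+01EB plus the combining diaeresis U+0308, and turning the underscore-marked italics into HTML. The constant names, the underscores_to_italics helper, and the use of <i> tags are assumptions made for this example. The eventual HTML conversion may well look different.

```python
import re
import unicodedata

O_OGONEK = "\u01eb"             # the character typed as Ctrl-V u 01EB
COMBINING_DIAERESIS = "\u0308"  # the character typed as Ctrl-V u 0308

# Base letter first, combining mark second, exactly as typed in Vim.
o_ogonek_diaeresis = O_OGONEK + COMBINING_DIAERESIS

# List the code points that make up the composed letter.
for ch in o_ogonek_diaeresis:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# prints:
# U+01EB LATIN SMALL LETTER O WITH OGONEK
# U+0308 COMBINING DIAERESIS


def underscores_to_italics(line: str) -> str:
    """Turn _word_ markers into <i>word</i> (hypothetical helper)."""
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", line)


print(underscores_to_italics("_hestr_ means horse"))
# prints: <i>hestr</i> means horse
```

The regular expression is deliberately naive. It assumes underscores only ever appear in matched pairs around italic text, which holds for the cleaned-up files only if that convention was followed consistently.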