If you are a user of CONTENTdm and have recently switched to version 5, you have no doubt discovered the trials and tribulations of trying to replicate routine processes that worked under older versions. The new version is riddled with bugs, and the more complex your projects are, the more problems you have likely run into. Since the release this past spring, OCLC has issued two updates that I am aware of, and another is expected this coming fall.
At Rose-Hulman, we do not operate our own server, but rather piggyback on Indiana State University’s collaborative project, Wabash Valley Visions and Voices, a digital memory project dedicated to documenting and preserving, in print, pictures, and sound, the history and heritage of the Wabash Valley region of west central Indiana and east central Illinois. Our biggest undertaking over the last several years has been to digitize our entire collection of student yearbooks. With about two part-time student workers at any time during the school year and myself conducting quality control, it is a slow process. What makes it even more time-consuming is that we scan each page three times: once for the pdf, once for a master tiff, and once for OCR, which produces text files that then need to be formatted with break tags for better viewing in the page and text layout. I know there are better ways of doing this, but after weighing scan quality, file sizes, and display options, I have chosen the long and tedious route.
Although slow, this process worked quite well for us until the release of version 5. While I have been faced with countless roadblocks, newly discovered bugs, and lots of hair pulling and teeth gnashing, I may just have discovered a better, faster approach to digitizing yearbooks, one that may allow us to finish the rest of them within the year or shortly thereafter.
First, let me point out a few of the roadblocks I have run into. When creating simple document compound objects built from pdf files, I am no longer able to import transcript files. It used to work and should still work, but it does not; the text simply is not there. The pdf files were not scanned with the text embedded in them; I tried that in the past and was not happy with the OCR results. I also recently tried uploading the tiff images and telling CONTENTdm to use the pdf files as display images, but then it places the pages out of order. In fact, telling it to use any external display images places the pages out of order. External text files can be imported only with jpg or tiff images, not pdf. So I finally accepted the idea of using the tiff images, but I was still having problems with pages being out of order. What I discovered is that with version 5, if you are going to use transcript files, you have to have a transcript file for every single page, not just for the pages whose text you would like included. The solution was to create a text file for each page that did not have a transcript before. A simple file containing only the word “the” (a stop word) did the trick. A copy of that file had to be saved for every page that lacked a transcript.
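Saving that placeholder by hand for every page gets tedious, so the step can be scripted. What follows is only a minimal Python sketch of the idea, not anything provided by CONTENTdm. It assumes the page images sit in one folder, the transcripts sit in another, and each transcript carries the same base name as its image with a .txt extension. The function name and the folder names are hypothetical, so adjust them to match however your own project is laid out.

```python
from pathlib import Path

# A stop word, used as the body of every placeholder transcript.
PLACEHOLDER_TEXT = "the"


def fill_missing_transcripts(image_dir, transcript_dir,
                             image_suffixes=(".tif", ".tiff")):
    """Create a placeholder .txt for every page image that lacks a transcript.

    Assumes (hypothetically) that a transcript shares its image's base name,
    e.g. page001.tif -> page001.txt. Existing transcripts are never touched.
    Returns the list of placeholder paths created, in file-name order.
    """
    image_dir = Path(image_dir)
    transcript_dir = Path(transcript_dir)
    transcript_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for image in sorted(image_dir.iterdir()):
        # Skip subfolders and anything that is not a page image.
        if not image.is_file() or image.suffix.lower() not in image_suffixes:
            continue
        transcript = transcript_dir / (image.stem + ".txt")
        if not transcript.exists():
            transcript.write_text(PLACEHOLDER_TEXT, encoding="utf-8")
            created.append(transcript)
    return created
```

Calling fill_missing_transcripts("scans", "transcripts") on a yearbook's folders would leave the real OCR transcripts alone and add a one-word file for each remaining page. Running it a second time creates nothing new, so it is safe to rerun after adding more scans.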
The good news. Yes, there is some good news from all of this. For the remainder of the yearbooks, we may only need to create master tiff images and then use CONTENTdm’s built-in OCR feature. My initial tests show that clear text in simple fonts works well, whereas fancy fonts or text that is too small results in lots of errors. So it may be that for newer yearbooks, we can simply create a tiff image and OCR the text upon import. For some of the older yearbooks with fancy fonts, we may still need to create external OCR transcript files. So far, I am encouraged; the true test will come in the next few weeks as I start importing these books.