Digitization of Modulus (student yearbook) Now Complete Through 1980

February 4, 2010

Rose-Hulman Modulus Yearbooks

The digitization of the Modulus, the student yearbook for Rose-Hulman Institute of Technology, is moving along at a swift pace thanks to our diligent student worker.  Elizabeth is producing high-quality scans very quickly.

We have scanned and put online all yearbooks through 1980.  1982, 1985 and 1987 are already online.  1981 and 1983 are scanned, and 1984 is in the works.  We hope to have the 1980s finished by the end of the month and the remaining 10 completed by the end of the school year.

We are able to work much faster than in the past because we now make only one scan per page, a high-resolution TIFF, from which the OCR feature in CONTENTdm does a decent job of importing the text.

http://www.rose-hulman.edu/Archives/modulus.html


This Week’s Agenda

January 18, 2010

http://thisthatotherthing.wordpress.com/2010/01/18/this-week’s-agenda
Yep, have to work on MLK Day.  But that gives me a jump on the week.  That’s my story and I’m sticking to it.
•    Post Part 2 to my blog series “The Small College Librarian.”  Part 2 is titled “So Many Jobs, So Little Time”
•    Work with Office of Institutional Planning and Assessment to complete online survey on students’ use of Web 2.0 services
•    Work with student worker to continue adding yearbooks. – Will upload 1960, 1961, 1964, and probably 1966 which she started scanning.
•    Set up new scanner when it arrives (hopefully today).
•    Create new LibGuide
•    Work on updating AtoZ records (hopefully will be able to finish the A – C tab).
•    Look into evaluations of Serials Solutions Products


UPDATE on “Editing OCR Transcript Field in CONTENTdm”

December 15, 2009

http://thisthatotherthing.wordpress.com/2009/12/15/update-on-edit…d-in-contentdm/
To update an earlier post titled “Editing OCR Transcript Field in CONTENTdm,” I would like to confirm that editing the transcripts in the Project Client before uploading is at least 10 times faster than editing them afterward!  It is faster because within the CONTENTdm Project Client, you can go from one page to another VERY quickly.  With normal fonts, most pages do not need editing at all.  If you wait until after you have uploaded a compound object and then edit each page in the web administrator module, you spend lots of time waiting for each page to open and close, far more time than you actually spend editing any pages.  I just uploaded and checked the OCR of our 1948 yearbook and it took no time at all.


Editing OCR Transcript Field in CONTENTdm

October 22, 2009

Are you a user of CONTENTdm 5.x?  Have you fooled around with the OCR feature for TIFF images?  If you have not, or if you have and found it frustrating, here are some tips.  First, it works best on text set in basic, easy-to-read fonts; the larger the better.  As with most OCR software, the smaller the font, the more likely there will be mistakes in the OCR text.  The same goes for fancy fonts.  We scanned a yearbook from 1901 that used a font similar to Old English, and we had to make corrections on almost every line.  But even with good clear text, there are bound to be issues.  Sometimes images are interpreted as text, and a string of strange characters is entered into the transcript field.

Here is the BIG TIP!! If you are building a compound object of many pages, such as a yearbook, and using the OCR feature, edit the transcript fields for each page while still in the CONTENTdm Project Client BEFORE uploading the object and its files.  This method is much faster than editing the transcript fields once the object has been uploaded.  I found this out the hard way.  I uploaded about 5 yearbooks and then had my students find each page in the web administrator module.  This is very inefficient, as you have to first search for the page, then open and edit it, and then close it.  All of this is slow for each page.  A much quicker way is to do it right in the CONTENTdm Project Client, after you have built the object but before you upload it.  You can edit one page right after another much faster.


CONTENTdm and LOTS of Patience

September 10, 2009

If you are a user of CONTENTdm and have recently switched to version 5, you have no doubt discovered the trials and tribulations of trying to replicate some of the routine processes you used under older versions.  Version 5 is wracked with bugs, and the more complex your projects are, the more problems you have probably run into.  Since its release this past spring, OCLC has released two version updates that I am aware of, and another one is expected this coming fall.  At Rose-Hulman, we do not operate our own server, but rather piggyback on Indiana State University’s collaborative project, the Wabash Valley Visions and Voices, a digital memory project dedicated to documenting and preserving the history and heritage of the Wabash Valley region of west central Indiana and east central Illinois in print, pictures, and sound.

Our biggest undertaking over the last several years has been to digitize our entire collection of student yearbooks.  With about two part-time student workers at any time during the school year and myself conducting quality control, it is a slow process.  To make it even more time consuming, each page is scanned three times: once for pdf, once for a master tiff, and once for OCR into text files, which then need to be formatted with break tags for better viewing in the page and text layout.  I know there are better ways of doing this, but after weighing the quality of the scans, file sizes, and display options, I have chosen the long and tedious route.
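For anyone curious what the break-tag formatting step amounts to, here is a minimal sketch in Python.  It is only an illustration, not the procedure we actually used: the function name is made up, and it assumes the break tag is a plain `<br>` appended to the end of every line of the OCR text.

```python
def add_break_tags(text):
    """Append a <br> tag to every line of OCR text.

    Assumption: one <br> per line ending is enough for display;
    the real transcripts may have needed different markup.
    """
    lines = text.splitlines()
    # Strip trailing spaces so the tag sits right after the last word.
    return "\n".join(line.rstrip() + "<br>" for line in lines)
```

Run over each page's text file, something like this would keep the line breaks of the printed page visible when the transcript is shown next to the image.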

Although slow, this process worked out quite well for us until the release of version 5.  While I have been faced with countless roadblocks, newly discovered bugs, and lots of hair pulling and teeth gnashing, I may just have discovered a better, faster approach to digitizing yearbooks, one that just may allow us to finish the rest of them within the year or shortly thereafter.

Let me point out a few of the roadblocks I have run into.  First, when creating simple document compound objects built from pdf files, I am no longer able to import transcript files.  It used to work and should still work, but it does not; the text simply is not there.  The pdf files were not scanned with the text embedded in them; I have tried that in the past and was not happy with the OCR results.  I’ve also recently tried uploading the tiff images and telling CONTENTdm to use the pdf files as display images, but then it places the pages out of order.  In fact, telling it to use any external display images places the pages out of order.  External text files can only be imported with jpg or tiff images, not pdf.  So I finally accepted the idea of using the tiff images, but I was still having problems with pages being out of order.  What I discovered is that with Version 5, if you are going to use transcript files, you have to have a transcript file for every single page, not just those with text you would like included.  So the solution was to create a text file for every page that didn’t have a transcript before.  A simple file containing only the word “the” (a stop word) did the trick.  That one file had to be saved once for each page that needed a transcript.
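Saving that placeholder file by hand for every page is tedious, so here is a rough sketch of how the step could be scripted in Python.  It rests on assumptions that are mine and not anything CONTENTdm requires: that all of a yearbook's page images sit in one folder, that each transcript has the same base name as its image with a .txt extension, and that the function name and folder layout are purely hypothetical.

```python
from pathlib import Path

# The single stop word used as the placeholder transcript.
PLACEHOLDER_TEXT = "the"


def fill_missing_transcripts(page_dir, image_suffixes=(".tif", ".tiff")):
    """Create a placeholder .txt next to every page image that lacks one.

    Assumption: transcripts share the image's base name (page001.tif
    pairs with page001.txt).  Existing transcripts are never touched.
    Returns the list of placeholder files that were created.
    """
    created = []
    for image in sorted(Path(page_dir).iterdir()):
        if image.suffix.lower() not in image_suffixes:
            continue  # skip anything that is not a page image
        transcript = image.with_suffix(".txt")
        if not transcript.exists():
            transcript.write_text(PLACEHOLDER_TEXT + "\n", encoding="utf-8")
            created.append(transcript)
    return created
```

Calling `fill_missing_transcripts("1948_yearbook")` (a made-up folder name) before building the compound object would leave one transcript per page, which is what Version 5 seems to want.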

The good news.  Yes, there is some good news from all of this.  For the remainder of the yearbooks to be completed, we may only need to create master tiff images and then use CONTENTdm’s built-in OCR feature.  My initial tests show that clear text with simple fonts works well, whereas fancy fonts or text that is too small result in lots of errors.  So it may be that for newer yearbooks, we can simply create a tiff image and OCR the text upon import.  For some of the older yearbooks with fancy fonts, we may still need to create external OCR transcript files.  So far, I am encouraged.  The true test will come in the coming weeks as I start importing these books.

