
 

The Winner of the 2007 IT Division Jo Ann Clifton Student Award

  Missouri Sunset

Forging cultural heritage collections online

The story of An American Tale

 
 
Candidate for
M.A. Information Resources & Library Science
University of Arizona
Tucson, Arizona

Windmill at Sunset - Boone County [Missouri],
Photo Credit:  Duane Perry, Columbia, Missouri 
(http://www.missouri.gov/mo/mophotos/sunsets)



Contents
Introduction

1.0 Initial goals and work undertaken

1.1 Goals

1.2 Project design

1.2.1  Metadata

1.2.2  Copyright protection

1.2.3  Custodianship

1.2.4  Image collection

2.0 The virtual tour

3.0 Lessons learned

3.1  Planning is crucial

3.2  Experience counts

3.3 Choose wisely

3.4 Be flexible

3.5 Keep a sense of humor


Introduction

In the heartland of 19th-century America, Missouri welcomed emigrants from states and countries, near and far, as the central crossroads of the nation.   The digital collection, An American Tale:  19th-century Folkways to Missouri, was created by the author to document that migrant experience through the heritage of one individual, and to understand the process of constructing a cultural heritage collection online.

The purpose of this paper is to reflect upon the extensive planning and execution required to create, from the ground up, the digital repository of three migrant pathways to Missouri, in order to understand best practices in building an online digital collection.

The reader will learn in part 1.0 the initial goals of the project and the work which was undertaken.  Part 2.0 will describe corresponding outputs from the effort, through a virtual tour of the finished collection.  Part 3.0 will evaluate lessons learned from that endeavor. 

Like the journey of early settlers to Missouri, the road to constructing a premier digital collection is fraught with danger:  potholes, treacherous stream crossings, dangerous wildlife, bad equipment, limited funds, and all kinds of weather. Through the experience of the author, the reader may take away valuable lessons to begin the journey to building a premier digital collection.


 

1.0  Initial goals and work undertaken

1.1  Goals

Project goals for the 8-week digital collection project were to:  1) digitize 30 primary and secondary sources from research collected over the past ten years by the author, 2) create an open access collection online of the digitized images with relevant metadata, 3) create an online guide which would include interpretive and educational materials pertaining to the subject, and 4) use the project as a platform to understand the decision issues associated with organizing, describing, indexing, classifying, digitizing, presenting and retrieving items in building a digital collection.

The scope of the collection consisted of thirty vintage photographs, and primary and secondary records, uncovered by the author through correspondence with individuals or on-site research at local cemeteries, public libraries, academic libraries, county courthouses, state departments of health, state historical societies, or federal archives.  Types of records selected were photographs, correspondence, vital records, census records, naturalization and immigration records, church records, military records, newspaper clippings, court, land and tax records.

Three discrete, topical themes formed the intellectual boundary of the project:   1)  Slaveholder from Virginia, 2) Union Soldier from Hesse-Darmstadt and 3) Farmer from Iowa.  Selection of the material was guided by the mission of the collection:  to document the 19th-century migrant experience to Missouri, through the author’s ancestral heritage.  Corresponding records of patriarchs Henry M. Ogden (1792-1888), Philip P. Wilhelm (1827-1909), and Jacob Peters (1831-1918), were selected. 

The NISO standard for building digital libraries, entitled "A framework of guidance for building good digital collections," served as the framework for constructing the digital repository (http://www.niso.org/framework/Framework2.html).  The three intended audiences for An American Tale  were academic historians, graduate students, and family historians.

 

1.2  Project Design

Goals set for the project, above, dictated requirements for its design.  The content management system used as the container for the collection was the open-source Greenstone Digital Library software (www.greenstone.org), assigned to all students enrolled in Digital Libraries, the course for which the project was undertaken.

A pilot walk-through of sample records revealed the need for a taxonomy to uniquely identify each object.  After research and experimentation, a naming standard was created for each file, using the family name, generation number, pedigree placement, and record type.  File naming followed the ISO 9660, Level 2 convention, which allowed file names of up to 31 characters (http://en.wikipedia.org/wiki/ISO_9660).  Within that limit, the project used only lower case characters a-z, numerical digits, and the special characters period, underscore, and hyphen.  Spaces or any other special characters were not used.  The reader will learn later in this paper how critical the early formation of a sound file naming convention is to a project's later success.
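The naming rules described above lend themselves to an automated check before any files are loaded.  The following is a minimal sketch in Python, not a tool used in the project:  the function name and the sample file names are hypothetical, and the 31-character limit is applied to the whole name as the paper describes it.

```python
import re

# Limit the paper cites for its file names (treated here as an assumption
# about the whole name, extension included).
MAX_LENGTH = 31

# Characters the project allowed: lower case a-z, digits, period,
# underscore, and hyphen.
ALLOWED_NAME = re.compile(r"[a-z0-9._-]+")


def check_file_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    if not name:
        return ["empty name"]
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"too long: {len(name)} characters (limit {MAX_LENGTH})")
    if not ALLOWED_NAME.fullmatch(name):
        # Collect whatever is left after removing every allowed character.
        bad = sorted(set(re.sub(r"[a-z0-9._-]", "", name)))
        problems.append(f"disallowed characters: {bad}")
    return problems


# Hypothetical sample names, for illustration only.
for candidate in ["ogden_henry_m_1816.jpg", "Ogden Henry M 1816.jpg"]:
    print(candidate, "->", check_file_name(candidate) or "ok")
```

Running such a check over a handful of pilot files would flag spaces, capital letters, and over-long names before any metadata has been attached to them.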

1.2.1  Metadata

The collection required a simple metadata standard with modest granularity, due to the simple nature of the collection and the limited experience of its builder.   The metadata standard selected for the project was Dublin Core (DC), which provides standard accessibility and expanded use of the collection.  Use of the DC standard retains the context of each record, and provides a 'footprint' for rights status and digital provenance.   Dublin Core is also compliant with the Open Archives Initiative Protocol for Metadata Harvesting (http://www.openarchives.org/OAI/openarchivesprotocol.html).

Full bibliographic detail of the preserved items, including structural, administrative, and descriptive metadata, is detailed in a Microsoft Excel spreadsheet file which accompanies the digitized records. Implementation of standard encoding practices for metadata will facilitate sharing with others among federated archives. Library of Congress Subject Heading authorities were used to standardize descriptive terms.

1.2.2  Copyright protection

Of the six possible Creative Commons licenses available to individuals, the project used the "Attribution Non-commercial No Derivatives" license (www.creativecommons.org).  Others may download works in the collection, on the condition that users cite their source, do not alter the material in any way, and do not reuse it for commercial purposes. Access to the original physical photographs and print records is available to the general public, with prior request for permission in writing.

1.2.3  Custodianship

Custodianship of high-quality digital master copies of the original records is retained by the author on compact discs, and on the author’s local hard drive.  FastSum Integrity Control was used to ensure data integrity of master files through back-up and any future migration (www.fastsum.com).    Lower resolution digital surrogates of the high-quality digital master copies reside in the Greenstone Digital Library database for public viewing. 
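The integrity check described above can be illustrated with a short sketch.  It is not FastSum and makes no claim about how FastSum works internally:  it assumes SHA-256 digests from Python's standard library, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file name in the folder to its digest."""
    return {
        item.name: file_digest(item)
        for item in sorted(folder.iterdir())
        if item.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List names from the manifest that are missing or no longer match."""
    current = build_manifest(folder)
    return [name for name, digest in manifest.items() if current.get(name) != digest]
```

A manifest built when the master files are first written to disc can be compared against the same folder after a back-up or migration; an empty list from changed_files suggests the masters came through unaltered.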

1.2.4 Image collection

Records were scanned using an A4-standard AcerScan 620U Prisa USB flatbed image scanner.  Maximum resolution of 600x1,200 dots per inch provided adequate viewing of the objects.   Images were manipulated to ensure consistency in size using Microsoft Paint.  No part of the original digitized record was cut, cropped, or altered in any way in the manipulation process.  TEI-P-5 Guidelines (version 0.4.1, July 2006) for processing and creating images were used to guide digitization of photographic or photocopy images, created for uploading to Greenstone (http://www.tei-c.org/release/doc/tei-p5-doc/html/).   

Finally, as part of project management of the digital library construction, an 8-week timeline was created toward work completion, auxiliary personnel were identified, equipment needs were assessed, a proposed budget was assembled, and project metrics or means for evaluating the process were created. 

In summary, the ‘magic formula’ for creating a premier digital library collection was to clearly state project goals, identify the scope and selection policy of the collection, then target the main audience.  Next, the best metadata standard was considered, and copyright rules suitable to the collection were identified.   Then, ownership and access conditions of the collection were ascertained, and software and hardware requirements were refined. Finally, a clear timeline was created, needed personnel were hand-picked, a flexible budget was formulated, and plans to measure success were defined.


 

2.0:  The virtual tour

The finished Greenstone library collection represents a simple mock-up of 30 artifacts, representative of a grander vision of what could be an extensive collection on 19th-century records of immigrants to Missouri.  It includes three main features:  1) orienting remarks, 2) search features, and 3) browse features. 

The home page, or the About page in Greenstone vernacular, shown in Exhibit 2.1, orients the user to the purpose, scope, selection process, and arrangement of the collection.

The collection is divided up into 3 searchable modules shown on the About page top task bar:  the titles a-z of the artifact, the subjects entailed, and its coverage period in Missouri history.  The home, help, and preferences buttons lead the user back to the University of Arizona master site, help the user form search arguments, and allow the user to select foreign language options, textual or graphical interface, and toggle search preferences.

Finally, the search tool bar found in the middle of the page lets the user enter search terms belonging to titles in one of the three discrete topical themes:  Slaveholder from Virginia, Soldier from Hesse-Darmstadt, or Farmer from Iowa.

Exhibit 2.1:  Collection Home Page
Greenstone About Page



On the Titles a-z page, each bookshelf icon represents a single document or photograph, sorted alphabetically by topical theme, surname, and then record title, as shown below in Exhibit 2.2.  Where did all of that information come from to fill the Titles a-z index?  The secret is in the metadata.  The elegance of Greenstone lies behind the scenes, buried in the metadata assigned to each record.  What the user will not see while navigating An American Tale are the 15 Dublin Core metadata elements which describe each record.
Exhibit 2.2:  Titles A-Z Page
Titles a-z page


Take, for example, the 1863 Certificate of Disability for Discharge issued to Sergeant Philip P. Wilhelm (20th from the top of the list), who was released from the Union Army's Company E, 37th Ohio Volunteer Infantry, at Louisville, Kentucky.  All that we know from the Titles A-Z entry, above, is the name of the topical theme, the name of the person concerned, his dates of birth and death, and the title of the record:  Certificate of Disability for Discharge.  But when we pull back the curtain on Greenstone (see Exhibit 2.3), we see the hidden Dublin Core elements tagged to the certificate, like the record creator, selected Library of Congress Subject Headings, a full description of the document, who donated it, its date in universal format, the type of record, its file name, where the document originated, its native language, and its time period in Missouri history.

Exhibit 2.3  Greenstone Metadata Screenshot
Certificate of disability metadata for Sgt. Wilhelm


Wherever the certificate travels, shown in Exhibit 2.4, below, through data harvesting or other means, users will have full Dublin Core metadata to know its provenance.  Sergeant Wilhelm may not appreciate the world knowing about his indelicate disease contracted at the Battle of Fayetteville.  But the world will have accessible proof that he was there at the Battle, thanks to descriptive, administrative, and structural metadata which comply with generally-accepted metadata harvesting standards.

Exhibit 2.4:  Document Object - Certificate of Disability for Discharge for Phillip P. Wilhelm, 12 January 1863
Certificate of Disability - Sgt Wilhelm
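To make the idea of harvestable metadata concrete, the sketch below serializes a small Dublin Core record in the oai_dc XML form that metadata harvesters commonly exchange.  It is an illustration only and does not reproduce Greenstone's own export:  the function name is hypothetical, the record holds just a few of the certificate's fields taken from this paper, and the language value is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the oai_dc metadata format.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# The 15 elements of simple Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_oai_dc(record: dict[str, list[str]]) -> str:
    """Serialize a record (element name -> list of values) as oai_dc XML."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        for value in values:
            child = ET.SubElement(root, f"{{{DC}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Partial record for illustration; "en" is an assumed value.
certificate = {
    "title": ["Certificate of Disability for Discharge"],
    "date": ["1863-01-12"],
    "language": ["en"],
    "rights": ["Creative Commons Attribution Non-commercial No Derivatives"],
}
print(to_oai_dc(certificate))
```

Because every value travels inside a named Dublin Core element, a harvester that has never seen the collection can still tell the title from the date or the rights statement.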


The Subjects Page, shown below in Exhibit 2.5,  represents some of the most powerful browsing capability within the An American Tale website.  Users may browse through detailed Library of Congress Subject Heading authorities to identify the specific document or photograph sought after.  Four pages of subject headings give the user over 70 topics and sub-topics from which to choose.

Exhibit 2.5:  Greenstone Subjects Page
Subjects screenshot


For example, an interest in carte-de-visite photographs (pronounced cart-du-viZEET), popular during the American Civil War, will net three finds in the photographic medium shown in Exhibit 2.6, below:  one each for Henry M. Ogden, Mary Frances Turpin Ogden, and Captain John James Ogden.
Once again, clicking on the thumbnail photograph brings an enlarged, full-screen version of the image into view.  A navigational icon in the lower right-hand corner then allows the viewer to zoom in on the text for better viewing.

Exhibit 2.6 Subject Thumbnails
Carte de visite photographs index


The final searchable module is the Coverage Page (Exhibit 2.7), which outlines four periods in Missouri history:  a)  1812-1819 Territorial Missouri, b)  1860-1877 Civil War and Reconstruction Missouri, c) 1878-1899 Outlaw and Volunteer Missouri and d) 1900-1929 World's Fair and Lindbergh Missouri.  The only period not represented is 1820-1859 Statehood Missouri, for which no documents or photographs exist in the present collection. 

Exhibit 2.7:  Coverage Page
Coverage screenshot


Once a bookshelf icon is selected on the Coverage Page, for example, for Territorial Missouri, two thumbnail images appear for that time frame.  Both are legal records associated with the 1816 marriage of Katharine Smith to Henry M. Ogdon [sic]:  an Affidavit of Age of Majority (Exhibit 2.8), and a Marriage Bond, for financial remuneration to the bride's father, should the groom, 24-year-old Mr. Ogdon, choose to flee from the altar.

Exhibit 2.8:  Affidavit of age of majority -  Katharine Smith, November 4, 1816, Bedford County, Virginia
Catherine Smith declaration



The finished Greenstone library collection represents a compact vision of what could be an extensive collection of 19th-century records of immigrants to Missouri.   A magic elixir of careful collection design and development helped to distill the collection of cultural heritage artifacts into a navigable website.

In the final section, 3.0, the reader will learn shortcomings and successes of the project in an effort to understand best practices in building a premier digital collection online.


 


3.0:  Lessons learned

The reader may take away five important lessons from the project:  1) planning is crucial, 2) experience counts, 3) choose wisely, 4) be flexible, and 5) keep a sense of humor. 

3.1  Planning is crucial

Digitization requires a material long-term financial investment, and draws on often limited organizational resources.  Common sense dictates that results be planned well, and measured to make the return on the initial investment eminently clear.  Tools like a project schedule, a prospective budget, a mid-project review, and a plan for a file naming convention served to ease the project's execution.  The reader should have a roadmap to where he or she is going.  The timeline used in the project, shown in part in Exhibit 3.1, helped the author to frame the project by letting benchmarks lead the way.  Instead of wondering through the project what remained to be done, the author needed only to refer to the Project Schedule.  That is of particular value later in the project, when the tsunami of fine details and unresolved issues presses to take over the project.  A project schedule is a valuable way to stay on track.

 Exhibit 3.1:  Excerpt from Project schedule  (in U.S. dollars)


| Deadline Date | Activity | Budgeted Duration (hours) | Actual Duration (hours) | Personnel* & Material Resources | Budgeted Cost ($) | Actual Cost ($) | Cost Variance ($) |
|---|---|---|---|---|---|---|---|
| Oct 8 | Training in Greenstone Digital Library (GDL) software. | 8 | 50 | Project lead | 200 | 1,250 | +1,050 |
| Oct 10 | Survey file records and images;  Group photographs/records for selection in 3 patrilineal groupings, retaining provenance - women will be categorized by their maiden names. | 10 | 10 | Project lead | 250 | 250 | 0 |
| ... | | | | | | | |
| Nov 16 | Evaluate responses to test launch;  make adjustments to digital library, accordingly. | 8 | 0 | Project lead | 200 | 0 | -200 |
| Dec 4 | Project Launch:  Publish proposed digital library project; track site usage statistics with ChangeDetection  [Santa Cruz, California:  FreeFind.com]. | 1 | 0 | Project lead | 25 | 0 | -25 |
| Dec 6 | Prepare final report; include comparison of work hours logged against time budgeted | 5 | 9 | Project lead | 125 | 225 | +100 |
| Total | | 147 hours | 220 hours | | $4,425 | $5,500 | +$1,075 |
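The cost columns in Exhibit 3.1 appear to follow from the hours at a flat rate.  The sketch below recomputes them for the rows shown in the excerpt.  The $25 hourly rate is an inference from those rows (8 budgeted hours against $200), not a figure stated in the paper, and the class and field names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: flat rate inferred from the excerpt (8 hours -> $200).
HOURLY_RATE = 25


@dataclass
class ScheduleRow:
    deadline: str
    budgeted_hours: int
    actual_hours: int

    @property
    def budgeted_cost(self) -> int:
        return self.budgeted_hours * HOURLY_RATE

    @property
    def actual_cost(self) -> int:
        return self.actual_hours * HOURLY_RATE

    @property
    def cost_variance(self) -> int:
        # Positive means over budget, negative means under budget.
        return self.actual_cost - self.budgeted_cost


# Only the rows visible in the excerpt; the full schedule has more.
EXCERPT = [
    ScheduleRow("Oct 8", 8, 50),
    ScheduleRow("Oct 10", 10, 10),
    ScheduleRow("Nov 16", 8, 0),
    ScheduleRow("Dec 4", 1, 0),
    ScheduleRow("Dec 6", 5, 9),
]

for row in EXCERPT:
    print(f"{row.deadline:<8}{row.budgeted_cost:>7}{row.actual_cost:>7}{row.cost_variance:>+7}")
```

Because the excerpt omits rows, the variances computed here will not sum to the total shown in the exhibit.  The point is only that a per-row variance makes an overrun, such as the Greenstone training, visible as soon as the hours are logged.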

 


Planning a notional budget upfront forced the author to reflect upon actual inputs to the project.  How much would be needed for office supplies, long-distance calls, or extra equipment?  Has a small amount of contingency funds been embedded in the budget to cover inevitable surprise expenses, like the subscription to FastSum Integrity authentication software?  A working budget forces one to ponder the various pieces which go into the effort.

A plan to evaluate day-to-day journaling of activities mid-point through the project helped enormously in planning the second part of the project.  By keeping a log to which one could refer mid-point, the author was able to pinpoint logistical problems immediately, and prepare for their resolution in the second part of the project.

Inadequate planning on the author's part in arriving at a file naming convention meant repeating tasks four or five times, work that better preparation would have avoided.  File names went through 6 iterations before an adequate name construction could be used, as shown in the example below, Exhibit 3.2.  First, the files were named for consistency, by family surname, to provide some order to the chaos.  Then, it became clear that some unique object identifier was necessary to reduce confusion regarding similar documents.  Thus, a taxonomy was born to assign numbers to each file.  The unique identifier consisted of an assigned surname number, a generation number, a unique individual number, a document type, then the numbered order of the document type.

In the example in Exhibit 3.2, the Civil War Pension record JPEG file was assigned the surname number 23 because the first individual bearing that surname in the pedigree chart (Rosa May Wilhelm) was numbered 23.  Phillip P. Wilhelm was the sixth generation back, his unique individual number on the pedigree chart was 46, the document type was a military record, assigned the number 7, and the record was the first of its kind.  Together these form the identifier 23.6.46-7.1.

 Exhibit 3.2:  Iterations of one file name for a Civil War Pension record

| Iteration | Change | File name |
|---|---|---|
| 1 | consistency | wilhelm_phillip_p_page_1_1898_jpeg |
| 2 | taxonomy | 23.6.46-7.1_wilhelm_phillip_p_page_1_1898 |
| 3 | ISO 9660-shortened | 23.6.46-7.1_wilhelm_pp_page_1 |
| 4 | low-resolution surrogate | 23.6.46-7.1_wilhelm_page_1_lo |
| 5 | without term 'page' | 23.6.46-7.1_wilhelm_1_lo |
| 6 | without period (.) | 23646-71_wilhelm_1_lo |

But then the filename became too long.  File naming outlined in the project's initial research proposal dictated that the ISO 9660, Level 2 convention for naming files would be followed, as mentioned before in section 1.2, which allows file names of up to 31 characters.  In file naming, only lower case characters a-z, numerical digits, and special characters period, underscore, and hyphen would be used.   Spaces or any other special characters would not be used.  Thus, the file name was shortened in its third iteration.
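The taxonomy and the length limit can be sketched together.  The code below assembles the identifier from its five parts and measures a name against the 31-character limit, using the second and third iterations from Exhibit 3.2.  The function names are hypothetical, and the length is counted without a file extension, which is an assumption.

```python
MAX_LENGTH = 31  # limit the paper cites for file names


def object_identifier(surname_no: int, generation: int, individual_no: int,
                      doc_type: int, sequence: int) -> str:
    """Join the five taxonomy numbers in the pattern used in Exhibit 3.2."""
    return f"{surname_no}.{generation}.{individual_no}-{doc_type}.{sequence}"


def file_name(identifier: str, *parts: str) -> str:
    """Append descriptive parts to the identifier with underscores."""
    return "_".join([identifier, *parts])


def fits(name: str) -> bool:
    """True when the name is within the length limit."""
    return len(name) <= MAX_LENGTH


identifier = object_identifier(23, 6, 46, 7, 1)
second = file_name(identifier, "wilhelm", "phillip", "p", "page", "1", "1898")
third = file_name(identifier, "wilhelm", "pp", "page", "1")

for name in (second, third):
    print(name, len(name), "fits" if fits(name) else "too long")
```

Measuring a few candidate names this way before adopting a pattern would have shown at the outset that the identifier leaves only about twenty characters for descriptive text.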

For its fourth iteration, a new name was needed for the low-resolution surrogates of the digital master files, which were the files actually uploaded to Greenstone.  The suffix 'lo' was appended to the file name, still within the 31-character limit.  Then, after creating, building, and previewing the Greenstone collection, the author learned that the term "page" in the file name confused Greenstone and aborted compilation of the library.  Thus, in its fifth iteration, the term 'page' was removed from the file name.

Finally, the author learned that Greenstone reads a file name up until the first period to extract the name of the file, and then stops reading the name.  Typically, one would call a file filename dot JPEG, or filename dot BITMAP, and so forth.  But Greenstone stopped reading the filename after the first dot, which in our example was 23.  Thus, the file name was recorded in Greenstone as 23 along with all other files prefixed '23,' excluding the rest of the file name, creating confusion for the user with multiple files titled "23."  Therefore, in its sixth and final iteration, periods were removed from file names to allow Greenstone to properly index the entire name of the file. 
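The effect described above is easy to reproduce outside Greenstone.  The sketch below is not Greenstone's code; it simply imitates the behavior the author observed, reading a name only up to its first period, and shows why names sharing the prefix 23 collided.  The function names, the .jpg extension, and the second sample file are assumptions made for illustration.

```python
from collections import Counter


def name_before_first_period(file_name: str) -> str:
    """Imitate the observed behavior: keep only the text before the first period."""
    return file_name.split(".", 1)[0]


def colliding_titles(file_names: list[str]) -> dict[str, int]:
    """Return each truncated title shared by more than one file, with its count."""
    counts = Counter(name_before_first_period(name) for name in file_names)
    return {title: count for title, count in counts.items() if count > 1}


def remove_inner_periods(file_name: str) -> str:
    """Drop every period except the one before the extension."""
    if "." not in file_name:
        return file_name
    stem, _, extension = file_name.rpartition(".")
    return stem.replace(".", "") + "." + extension


# The second name is a hypothetical sibling record, for illustration only.
fifth_iteration = [
    "23.6.46-7.1_wilhelm_1_lo.jpg",
    "23.6.46-7.2_wilhelm_1_lo.jpg",
]
sixth_iteration = [remove_inner_periods(name) for name in fifth_iteration]

print(colliding_titles(fifth_iteration))
print(colliding_titles(sixth_iteration))
```

With the inner periods removed, each file keeps its full identifier up to the extension, so the truncated titles no longer collide.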

Through each iteration of renaming files, all associated metadata needed to be reloaded, as well.  It wasn't just a matter of renaming one file. The new digital library designer would do best by testing a small sample of about 5 files, and running them through the whole process, including compiling the library, before naming all files.  Technical anomalies as with the term 'page' or the 'dot' were unavoidable.  But beware of rushing into a project without giving serious thought to planning the file names.

Time invested in planning upfront nets tremendous savings later in the project.

3.2 Experience counts

With that thought in mind, lesson 2 teaches us that experience counts, or rather that inexperience can result in painful revisions and delays. 
Problems encountered in the project were, perhaps, typical of a first effort in creating an online digital library collection.  Problems arising from inexperience in digital project planning, as mentioned, and problems of a technical nature were common in the effort.

Inexperience in prepping the documents and retaining their provenance added time.  With little training, creating a taxonomy for the first time added time.  Understanding new and complicated standards, like the TEI-P-5 Guidelines, added time.  Poor equipment selection added time in requiring that some documents be outsourced.  Inexperience with Library of Congress Subject Headings meant long and laborious dissection of appropriate headings for collection objects which added time.

Several technical problems resulted from inexperience. Resolution of scanned images was of mediocre quality due to the age of the scanner and poor technique, and troubleshooting the Greenstone Library Interface in constructing a basic digital library with Greenstone was an ongoing battle that a more experienced builder would not have had to endure.


3.3 Choose wisely

The free, open access Greenstone Digital Library software is a welcome solution for many collections which would otherwise not be mounted to the World-Wide Web.  The power of posting to the world an item plus its full bibliographic record for later data harvesting is the stuff of which librarians dream.  But the number of difficulties to be surmounted, and the limited support documentation, made the selection one to reflect upon.  For a beginner, Greenstone proved a very cramped space in which to build a repository.  Its assets are great for universality of metadata, but they come at a high cost.  Dated and unreadable user manuals meant repeated combing through computerese written in awkward English.

Grand designs of "reading rooms" or side-by-side ASCII text translations for each object in An American Tale, FAQs, a Contact Us page, Chronological Lifeline, and User Guide all fell flat against the hard-to-understand capabilities served up in Greenstone.  If the author were to embark on the same project again, the author would invest more time in learning about the potential of other content management systems like DSpace, WebPress, Drupal, Mambo, PostNuke, or Plone.  They may not explicitly be identified as "digital library" software, or automatically support library standards like MARC, Dublin Core, or the Open Archives Initiative Protocol for Metadata Harvesting, but their features and documentation may be better suited to the non-programmer making a first effort at creating a digital library.

3.4 Be flexible

Revisions during the course of the project meant a better end-product. Original scanning techniques were changed mid-project to improve clarity.  Tighter chronological groups of records meant more refined presentation.  Outsourcing oversized documents meant inclusion of critical objects. Therefore, flexibility resulted in a more polished digital library.


3.5 Keep a sense of humor

Above all else, the reader should remember to try to maintain a sense of humor.  The task of mounting a digital library to the World-Wide Web is no small feat.  The ability to stand back and laugh at one's foibles or mishaps will only help to keep the project on track, as well as its creator.




 


Conclusion

In Forging cultural heritage collections online:  The story of An American Tale, the reader learned how to lay the foundation for building an online digital collection of cultural heritage artifacts, by first defining one's requirements then defining one's design.  The reader learned what the finished product might look like using the Greenstone Digital Library platform, and its assorted features.  Finally, as a result of the 8-week effort, the reader learned about 5 important lessons concerning digital library design which may save time and heartache in any future attempts.

Like the journey of early settlers to Missouri, the road to constructing a premier digital collection is fraught with danger.   Through mistakes made in the author's own experience, the reader may take away many valuable lessons to begin the journey to building a collection of universally-accessible artifacts of cultural heritage worthy of preservation for generations to come.


 


To visit the

An American Tale:  19th century Folkways to Missouri

digital library, please go to:

http://research.sbs.arizona.edu/gsdl/cgi-bin/library?a=p&p=about&c=anameri1&l=en&ct=1&qt=0&qto=3&uq=1164808403312



 

©2008 INFORMATION TECHNOLOGY DIVISION/SPECIAL LIBRARIES ASSOCIATION