Friday, August 19, 2011

A Quiet Time for Research

Allison Druin, Associate Dean for Research, iSchool

There is a quiet time for research. It’s the quiet of night, when some of our best writing seems to emerge. It’s the morning hours, when an idea can finally come to you even while the birds still seem asleep. It’s the summer months, when it seems so hot the pavement will melt, but there is sanctuary on campus in the life-saving air-conditioned buildings. It’s that important quiet time that lets us ponder, write, take time for conferences, explore new ideas in the lab, and enjoy the time for research.

These quiet research moments are just as important to the research process as the busier, louder times of endless campus visitors, energetic classes to teach, and the steady rhythm of campus meetings. It’s during those in-between moments that iSchool research gets covered by outlets from the Washington Post to Fox5 News, as was the case for Assistant Professor and new HCIL Director Jen Golbeck and her work on social network analysis of Facebook. It’s during these quiet moments that Assistant Professor Kari Kraus published an op-ed piece in the New York Times discussing her ideas and research on digital preservation. It’s during these quiet times that we can learn of new grant funding, as Assistant Professor Bo Xie recently did when she received an NIH grant for her work on understanding older adults’ e-health literacy. And it’s during these times that we can give keynote talks at international conferences, as Associate Professor Jimmy Lin did at the International Conference on Weblogs and Social Media.

It’s also during those times that we can make progress exploring our research with colleagues near and far. This is what Bruce Ambacher, a Visiting Professor here at the iSchool who coordinates and teaches in the archives specialization of the MLS program, did. What follows is an update on his work during that quiet time for research:


In April I blogged about the summer that lay ahead for my work with a group of digital cultural heritage curators “field testing” draft International Organization for Standardization (ISO) standard 16363 at three digital repositories in Europe and three in the United States. The round of tests proved to be just as busy, extensive, and illuminating as expected, although some of the results were somewhat unexpected.

At the heart of this decades-long effort is a clear understanding that data is the lifeblood of information, whether that information is science, business, technology, or culture. Stewardship of data so that they are available and useful to users both today and in the future is an ongoing challenge, due to the rapidly changing technological landscape, the increasingly interdisciplinary and collaborative nature of users and their disciplines, and the uncertainty over the best preservation procedures.

The tests began on June 2nd and concluded on July 7th, with breaks for travel, holidays, and schedule gaps. The three European sites were the United Kingdom Data Archive (UKDA), focusing on social science data; the National Computer Center for Higher Education (CINES), focusing on digitized theses and dissertations; and Data Archiving and Networked Services (DANS), focusing on statistical data. The European repositories benefited from resources provided by the European Commission through the Alliance for Permanent Access. These resources were used to compensate for staff time spent analyzing the metrics of the draft standard, accumulating the supporting documentation, completing the metrics document, and hosting the audit team.

The three repositories in the United States that volunteered to participate were the Center for International Earth Science Information Network (CIESIN), a center within the Earth Institute at Columbia University that focuses on the interaction of the social, natural, and information sciences; the National Space Science Data Center (NSSDC), NASA's archive for space science mission data; and the digital archives component of the Kentucky Department of Libraries and Archives (KDLA). Because of funding limitations, none of the U.S. repositories were compensated in any way for their efforts.

At each repository the audit focused on specific aspects of its digital data life-cycle operations that incorporated its long-term preservation and access functions. We did not attempt to audit the entire life-cycle operations of any repository.

The test audits were a “learning” experience for both the repositories and the auditors. The test audit teams in Europe generally numbered ten to twelve. In the United States the audit teams were smaller, since only one European auditor could schedule the time for those audits. A real audit would have only one to three auditors, depending on its complexity. The test audits also included much more informal interaction between the audit team and the repository staff than a real audit would, to enhance the learning experience for both groups, to clarify the meaning of the standard’s metrics, and to expand on the answers the repository gave to a particular metric. One repository’s assessment of the test process was put online.

This interaction also extended to collegial meals and socializing, which normally would not be part of an audit. But, heck, it was a test, and we pretty much knew each other, or knew of each other, beforehand. We even got to enjoy a few out-of-the-ordinary sights, including one leg of the Tour England bicycle race in Colchester and a highly choreographed demonstration against the austerity cuts sweeping Europe in Montpellier, France. There were nearly as many gendarmes as demonstrators, and all went well until the demonstrators stepped outside the apparently agreed-upon path for the march, at which point they were put back in place with speed and determination.

Even though these were only test audits, the process revealed extensive similarities and shortcomings among the six repositories. The most troubling observation is that long-term data stewardship is not yet guaranteed. We cannot yet be confident that our digital data will endure, in a usable and retrievable form, into the indefinite future. This should cause us all great concern.

This concern comes out of the following:
1. None of the repositories had yet achieved their full preservation potential. Their primary focus was still on current access. The fact that old data still existed was largely due to format stability. At least one repository was still accepting, saving, and providing data in a format that had barely changed in more than four decades; its insistence on accepting data only in that format is a major factor in that format’s stability.

2. Virtually all of the repositories made format stability a key component of long-term preservation, by specifying the format(s), and in some cases even the data elements and their metadata definitions, they would accept, and by working to keep those formats stable. This is not a realistic approach over the long term; a minimal sketch of this kind of ingest-time format restriction appears after this list.

3. The heavy reliance upon “soft money” in the United States to initiate digital data curation, and the resulting scattershot nature of the repositories’ data curation plans and activities. This creates a wide gap between goals and realities. A data program that depends upon grants, gifts, and one-time allocations to fund certain aspects of a digital preservation program is an uneven program, with too many foci, varying amounts of activity, and wide gaps of no activity in between.

4. Official government repositories have not realized the full potential of using their records management authority to inventory and schedule agency data series and to determine which should become long-term records under their control. Some government repositories also have an inspection authority that enables them to inspect the agencies at their level of government to ensure compliance, e.g., the transfer of digital data when it is scheduled.

5. Some repositories lacked a logical, well-reasoned, end-to-end Content Information life-cycle that reflected the necessary communication and understanding between the data creator/provider and the repository staff, especially as it related to how the repository would meet its data preservation requirements. Unfortunately, this sometimes reflected the fact that the life-cycle process was often not well defined and not applied consistently by either the data creators/providers or the repositories.

6. A core issue in the digital information community is the relationship within the “Designated Community” among the creator/producer, the repository/custodian, and the user. This concept extends to how much metadata and other explanatory documentation the producer and the repository must create and/or accumulate to enable the user to understand the data in the future. The more discipline-specific knowledge the community expects the user to have, the less documentation and explanation the producer and repository must supply for the data to remain understandable into the future. Unfortunately, what is considered common knowledge today, even within a narrowly defined Designated Community, may not be common knowledge in the future, and the data may be misunderstood. Applying too narrow a definition of the user community can limit future use of the data. This concept is fully developed in the Reference Model for an Open Archival Information System, ISO 14721.

7. Some of the staff of some of the repositories used terms from the draft standard without clearly understanding what those terms meant. It became apparent during some audits that staff did not understand the role of the repository in transforming the creator/producer’s Submission Information Package (SIP) into an Archival Information Package (AIP), the form of the data that the repository preserves into the indefinite future and that it uses, possibly in part or in combination with other AIPs, to create the Dissemination Information Package (DIP); a simple sketch of this SIP-to-AIP-to-DIP flow also appears after this list. If staff do not understand the terminology, can we be sure they understand the concepts and practices embodied within it?

8. The overwhelming focus of the repositories, even those that have been operating for more than four decades, is access. For too many, that means current access, not long-term access. While they would not knowingly take actions that would corrupt the digital bits or prevent future access, they see their primary role as providing access to their current users, not long-term data stewardship for the yet-unborn user in the indefinite future.
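
To make point 2 concrete, here is a minimal sketch, in Python, of the kind of ingest-time format restriction these repositories rely on. The accepted-format registry, magic numbers, and function name are hypothetical illustrations of the practice, not any repository’s actual policy or code.

```python
# Hypothetical sketch of an ingest-time format check: a repository that
# pins long-term preservation to format stability accepts a submission
# only if it matches one of a small set of approved formats.
import pathlib

# Illustrative registry: file extension -> leading "magic" bytes, if any.
ACCEPTED_FORMATS = {
    ".csv": None,        # plain text; nothing to verify beyond the extension
    ".tif": b"II*\x00",  # little-endian TIFF signature
    ".pdf": b"%PDF-",    # PDF header
}

def accept_submission(path: str) -> bool:
    """Return True only if the file matches an approved, stable format."""
    suffix = pathlib.Path(path).suffix.lower()
    if suffix not in ACCEPTED_FORMATS:
        return False                  # format not on the approved list
    magic = ACCEPTED_FORMATS[suffix]
    if magic is None:
        return True                   # no signature to check
    with open(path, "rb") as f:
        return f.read(len(magic)) == magic
```

Keeping such a list short is exactly what makes the formats “stable,” and exactly why the approach does not scale over the long term: the formats a registry refuses today may be the ones producers actually use tomorrow.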
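
And to make point 7 concrete, here is a similarly hedged sketch of the SIP-to-AIP-to-DIP flow defined in the OAIS Reference Model (ISO 14721). The fields below are simplified stand-ins for the OAIS information package structures, not a real repository’s implementation.

```python
# Hypothetical sketch of the OAIS package flow: the repository transforms a
# producer's Submission Information Package (SIP) into an Archival
# Information Package (AIP) on ingest, and later builds a Dissemination
# Information Package (DIP), possibly combining several AIPs.
import hashlib
from dataclasses import dataclass, field

@dataclass
class SIP:
    producer: str
    content: bytes     # the data object as submitted
    description: dict  # whatever metadata the producer supplied

@dataclass
class AIP:
    content: bytes       # the preserved data object
    fixity_sha256: str   # fixity information computed on ingest
    provenance: list = field(default_factory=list)
    description: dict = field(default_factory=dict)

def ingest(sip: SIP) -> AIP:
    """Transform a SIP into an AIP, adding fixity and provenance."""
    return AIP(
        content=sip.content,
        fixity_sha256=hashlib.sha256(sip.content).hexdigest(),
        provenance=[f"received from {sip.producer}"],
        description=dict(sip.description),
    )

def disseminate(aips: list) -> dict:
    """Build a DIP from one or more AIPs after verifying their fixity."""
    for aip in aips:
        assert hashlib.sha256(aip.content).hexdigest() == aip.fixity_sha256
    return {
        "contents": [aip.content for aip in aips],
        "descriptions": [aip.description for aip in aips],
    }
```

A staff member who can trace this flow, however simplified, understands why the AIP, not the SIP, is what the repository preserves, and why the DIP is derived rather than stored.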

All of the repositories viewed the test audit as a positive experience. More importantly, they viewed draft ISO 16363 as a necessary and beneficial international standard. The companion handbook for auditors, draft ISO 16919, Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories, has been approved as an ISO standard. Following a few editorial changes and the incorporation of changes recommended in the first ballot, draft ISO 16363 will be submitted for final voting.

International efforts that trace their roots back to 1996, through task force reports, international studies and recommendations, the Reference Model for an Open Archival Information System, Trustworthy Repositories Audit and Certification, and now Requirements for Audit and Certification of Trustworthy Digital Repositories and its companion Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories, have created the guidelines and standards. Will digital data curators who wish to provide for both long-term preservation and meaningful access to today’s digital heritage into the indefinite future use them? That is the proverbial sixty-four-dollar question.