I will start off with some background, talk briefly about a couple of exciting new storage technologies, mention some key characteristics of the Universal Preservation Format, and finally update you on where we've been, what we're doing, and where we go next.
Sponsored by the WGBH Educational Foundation and funded in part by a grant (97-029) from the National Historical Publications and Records Commission of the National Archives, the Universal Preservation Format initiative advocates a format for the long-term storage of electronically generated media. Dave MacCarn, Chief Technologist at WGBH, is the architect of UPF. He and Mary Ide, Director of the Media Archives and Preservation Center at WGBH, are the Project Directors. I am the Project Coordinator, with feet planted not-always-so-firmly in both engineering and archival camps.
Working with representatives from standards organizations, hardware and software companies, museums, academic institutions, archives and libraries, this project will produce and publish a document called a Recommended Practice. This document will be submitted to the Society of Motion Picture and Television Engineers, also known as SMPTE, suggesting guidelines for engineers when designing computer applications that involve or interact with digital storage. We expect to make the process of preserving and accessing electronic records (both original and migrated) more efficient, more cost-effective, and simpler.
Once upon a time, you could "get at" most media through sheer cleverness. With analog media, such as a record or a film slide, there is an "analogy" between process and form. In practical terms, even without playback equipment, you could simulate the media experience. For example, I recall building a phonograph player, using rolled-up cardboard to amplify the sound and a sewing needle for a stylus. I can tell you, it was not very popular with my parents, whose records I sometimes borrowed for my prototype.
Digital can refer literally to both fingers on a hand and to numbers.
"A digital computer performs calculations and logical operations with quantities represented as digits, usually in the binary number system."
To get at digital media, you need some form of decoder, too often the exact decoder.
There are presently over a dozen proprietary digital storage formats. We propose to add one more: the UPF. The format would be "platform-independent," meaning that it would load into applications that adopted this standard, regardless of the computer's operating system.
The Universal Preservation Format standard would serve as a universal decoder, co-existing and interchanging with proprietary formats in the same way that, say, ASCII or plain text and RTF or "rich text format" co-exist with Word or WordPerfect formats in your word processors.
I don't need to remind you about the value of standards. Just think about them the next time you replace a light bulb in your living room lamp.
Technical standards for information services and systems benefit us in a number of ways:
They make information systems easier to use and less expensive to operate, and they promote competition which lowers prices.
Another benefit: they make our lives easier. To paraphrase from the National Information Standards Organization, standards achieve "compatibility between equipment, data, practices, and procedures so information can be made easily and universally available."
One standard that has made all of our lives easier is acid-free paper. Established in 1984 by the National Information Standards Organization, Z39.48 set the requirements for the durability and longevity of paper. Paper that complies with this standard will last several hundred years.
What made this standard a reality, particularly the 1992 revision, were joint efforts among paper makers, publishers, printers, and the preservation community.
The UPF is sounding a similar call for cooperation and communication between engineers and archivists.
Digital information really consists of just binary code (zeros and ones). When these zeros and ones are arranged in a particular way, you build digital objects. These objects can be data types, such as video or music, and they can be information about the data types, which is fashionably called "metadata," but which most of us know as cataloging information.
Information, represented by sequences of zeros and ones, ideally lasts forever.
What does not last forever is the medium or material upon which these zeros and ones make their home. There are limitations on how much you can store on a digital medium and how long it will maintain its integrity on that medium.
And yet the medium for digital storage may be a red herring. There are very promising storage technologies on the horizon.
One technology being marketed now stores digital information on paper. Developed by a Massachusetts company, Cobblestone Software, this product, PaperDisk, will process any type of computer file -- text, spreadsheets, graphics, even audio and video -- and print it out in the form of bar codes. These bar codes can then be scanned through a common image scanner and translated back into a digital file. I demo-ed this software at home with my $150 inkjet printer and $100 scanner, and it worked well for simple text files. Understand, this is a brand new product, guaranteed to improve with age.
More exciting for storage capacity and permanence is the Pancake Disc, developed by Los Alamos National Laboratory. Norsam Technologies, which sells this product, claims that Pancake Discs will last thousands of years without any special storage requirements. Norsam is issuing two products on the Pancake Disc, one for digital storage and one for analog storage.
The analog product, called HD-Rosetta, is an intriguing variation on microfiche technology. You access the stored records as a series of images through a high-powered microscope. HD-Rosetta can record about 90,000 8½ x 11 analog images on a two-inch Pancake Disc.
The other product, called the HD-ROM, is being developed with IBM, and can presently store 650 gigabytes, which is 1,000 times the capacity of today's CD-ROMs. Eventually these disks may hold up to 12 times that amount. Norsam's technology uses charged particle beams, rather than laser beams, to write data onto disks. Or to put it another way, it uses a smaller pencil.
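The arithmetic behind those figures is easy to check. Here is a minimal sketch, assuming a CD-ROM holds about 650 megabytes and counting 1 gigabyte as 1,000 megabytes; the variable names are mine, for illustration only.

```python
# Assumptions (not from Norsam): a CD-ROM holds about 650 megabytes,
# and 1 gigabyte is counted as 1,000 megabytes.
CD_ROM_MB = 650
HD_ROM_GB = 650      # present capacity quoted in the talk
MB_PER_GB = 1_000

hd_rom_mb = HD_ROM_GB * MB_PER_GB
ratio = hd_rom_mb // CD_ROM_MB     # how many CD-ROMs fit on one HD-ROM
future_gb = HD_ROM_GB * 12         # "up to 12 times that amount"

print(f"One HD-ROM holds as much as {ratio} CD-ROMs")
print(f"Projected capacity: {future_gb} GB, or {future_gb / MB_PER_GB} TB")
```

Under those assumptions the ratio comes out to 1,000, and twelve times the present capacity is 7,800 gigabytes, a little under 8 terabytes.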
Because storage technologies like these are advancing at a breath-taking rate, the UPF will not recommend one storage vehicle. Our concern is with how the information is stored, regardless of medium and the computer operating system you are using.
Both PaperDisk and the Pancake Disc are hybrid technologies. Perhaps it will take such a combination of digital and analog technologies to serve the needs of preservationists as they migrate to new computer solutions.
A hybrid approach might also fulfill requirements of a preservation standard.
For example, information required to identify and decode the rest of the stored data might be placed on these Pancake disks in both digital and analog formats.
But what about this UPF? What does the thing look like?
I will give a simplified overview of the important characteristics: the Wrapper, Unique Identifiers, the Rosetta Stone, and a Media Compiler.
The wrapper -- or container -- is a file format for storing the media content, or "essence," along with the information that describes it. Think of it as a kind of digital fajita or burrito, with the basic ingredients as the "essence" and the optional hot sauce as its metadata.
The wrapper is both a file format and a framework.
Anyone familiar with the Dublin Core metadata initiative, specifically the Warwick Framework Architecture, may understand frameworks as a method for managing data. Material describing certain objects may either be embedded in the source or be referenced to files or storage areas external to the source. This information might include domain-specific descriptions, terms and conditions for document use, pointers to all manifestations of a document, or archival responsibility.
Standards for long-term digital storage should carry forth the traditional practices of analog archives, specifically provenance and original order. The framework must be robust, allowing certain types of metadata to be embedded with the media and others to be referenced externally.
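To make the wrapper idea concrete, here is a minimal sketch of a container that holds the essence together with embedded metadata and pointers to externally held metadata. It is an illustration under my own assumptions, not the UPF file format: the class name, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Wrapper:
    """Hypothetical container: the media 'essence' plus its metadata."""
    essence: bytes
    # Metadata carried inside the wrapper itself.
    embedded: Dict[str, str] = field(default_factory=dict)
    # Metadata kept elsewhere; each value is a pointer to its location.
    external: Dict[str, str] = field(default_factory=dict)

    def describe(self, key: str) -> str:
        """Return embedded metadata, or a pointer to the external copy."""
        if key in self.embedded:
            return self.embedded[key]
        if key in self.external:
            return f"see {self.external[key]}"
        raise KeyError(key)


# Made-up sample record, for illustration only.
record = Wrapper(
    essence=b"placeholder for the video bytes",
    embedded={"title": "Interview, reel 1", "provenance": "Donor collection"},
    external={"terms_of_use": "rights/interview-reel-1.txt"},
)
print(record.describe("title"))
print(record.describe("terms_of_use"))
```

Provenance travels inside the wrapper here, while the terms of use live in a separate file that the wrapper only points to. Which kinds of metadata belong on which side is exactly the sort of question a Recommended Practice would have to settle.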
Identifying digital objects as unique entities is essential to establishing archival integrity. As files are modified, you need to distinguish the offspring from the parent, but also map the "blood lines," so to speak. The UPF is looking at initiatives dealing with unique identifiers and expects to include such a system or systems in our Recommended Practice. Basically, each object carries an ID that is unique within its container. Different properties of an object might indicate the name or author of the object, a comment, a copyright notice, etc. Formats and relationships may be defined in a digital "Rosetta stone."
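One way to picture parents, offspring, and blood lines is a record that carries its own ID and the ID of the object it was derived from. The sketch below is only an illustration: the UPF has not chosen an identifier scheme, so the use of Python's uuid module, and every name in the code, is my assumption.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MediaObject:
    """Hypothetical object record with a unique ID and a link to its parent."""
    name: str
    parent_id: Optional[str] = None
    # Assumption: a random UUID stands in for whatever ID scheme is adopted.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def derive(parent: MediaObject, name: str) -> MediaObject:
    """Create an offspring that remembers which object it came from."""
    return MediaObject(name=name, parent_id=parent.object_id)


def lineage(obj: MediaObject, registry: Dict[str, MediaObject]) -> List[str]:
    """Walk the parent links back to the original, newest first."""
    names = [obj.name]
    while obj.parent_id is not None:
        obj = registry[obj.parent_id]
        names.append(obj.name)
    return names


master = MediaObject("master tape transfer")
edit = derive(master, "edited program")
clip = derive(edit, "web clip")
registry = {o.object_id: o for o in (master, edit, clip)}

print(lineage(clip, registry))
# ['web clip', 'edited program', 'master tape transfer']
```

The offspring never overwrites the parent; it only points back to it, so the blood line can be reconstructed at any time from the registry.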
The UPF uses the digital Rosetta stone to get at the range of data types held in a digital storage bank.
It would serve as a key, defining data types and encapsulating algorithms for deciphering those files. Jeff Rothenberg, in an article for Scientific American, suggests encapsulating software with the stored digital media as a way to get at the media through time. MacCarn proposes the use of platform-independent algorithms to decode the stored file types.
MacCarn's Rosetta stone might state in effect, "This system uses MARC, which is defined as such-and-such" or "This system was originally recorded on 422 Video, which is defined as so-and-so." In addition, the Rosetta stone might include some form of mapping among multimedia file formats or even classification or cataloging systems. The Rosetta stone would also serve as a registry for unique identifiers.
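In programming terms, a Rosetta stone of this kind resembles a lookup table that pairs each data type with a plain-language definition and a decoding routine. The following is a minimal sketch under my own assumptions, not MacCarn's design: the two data types, including the toy run-length format, are invented purely for illustration.

```python
from typing import Callable, Dict, NamedTuple


class Entry(NamedTuple):
    """One Rosetta stone entry: what the type is, and how to decode it."""
    definition: str
    decode: Callable[[bytes], bytes]


def decode_plain(data: bytes) -> bytes:
    """Bytes stored as-is need no decoding."""
    return data


def decode_run_length(data: bytes) -> bytes:
    """Toy format: pairs of (count, value) bytes; a trailing odd byte is ignored."""
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        count, value = data[i], data[i + 1]
        out.extend(bytes([value]) * count)
    return bytes(out)


# Hypothetical registry; a real one would define real formats.
ROSETTA: Dict[str, Entry] = {
    "plain": Entry("bytes stored as-is", decode_plain),
    "toy-rle": Entry("pairs of (count, value) bytes", decode_run_length),
}


def read(data_type: str, stored: bytes) -> bytes:
    """Look up the stored type in the Rosetta stone and decode it."""
    return ROSETTA[data_type].decode(stored)


print(ROSETTA["toy-rle"].definition)
print(read("toy-rle", bytes([3, 65, 2, 66])))  # b'AAABB'
```

The point of the sketch is that the definition and the decoder are stored side by side, so a future reader who knows only the name of the data type can still recover the content.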
The actual moving of data would be performed by a media compiler. It would remove the baggage of the acquisition format as it imported data into the archive. It would optionally export whatever metadata you needed from the archive. Specifically, you could pre-select which set of relationships or media formats you wish to transport for a given need, such as Internet access.
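The two jobs of the media compiler, stripping acquisition baggage on the way in and exporting a pre-selected subset on the way out, can be sketched as two small filters. This is a simplified illustration under my own assumptions; the function names, field names, and sample values are hypothetical and do not describe an actual UPF component.

```python
from typing import Dict, Iterable


def import_item(raw: Dict[str, object],
                acquisition_keys: Iterable[str]) -> Dict[str, object]:
    """Drop fields that only made sense in the acquisition format."""
    drop = set(acquisition_keys)
    return {key: value for key, value in raw.items() if key not in drop}


def export_item(item: Dict[str, object],
                wanted: Iterable[str]) -> Dict[str, object]:
    """Export only the fields pre-selected for a given need."""
    keep = set(wanted)
    return {key: value for key, value in item.items() if key in keep}


# Made-up capture record, for illustration only.
captured = {
    "essence": b"placeholder for the media bytes",
    "title": "Evening news",
    "rights": "in-house use only",
    "tape_track_layout": "hypothetical deck-specific value",
}

archived = import_item(captured, acquisition_keys=["tape_track_layout"])
web_copy = export_item(archived, wanted=["essence", "title"])

print(sorted(archived))   # ['essence', 'rights', 'title']
print(sorted(web_copy))   # ['essence', 'title']
```

The archive keeps everything except the deck-specific field, while the copy prepared for Internet access carries only the essence and its title.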
A universal standard for digital storage could have an enormous impact upon the selective distribution of your media collections. Because the relational integrity among your data objects would be built-in, you could very easily "package" information. For example, you could extract certain media objects, along with their associated text files, based on a scholar's search patterns. These materials could then be burned into a CD-ROM or transferred onto some other portable storage vehicle, and then loaned to the scholar for a fee or sold outright.
On September 22, 1997, the Society of Motion Picture and Television Engineers assigned the UPF an official Study Group (ST13.14). Titled "Requirements for a Universal Preservation Format" and chaired by Dave MacCarn, the group first met to establish an agenda and to hash out a statement of objectives, which includes gathering input from the archival community through surveys and meetings like this one.
On December 9th, Dave MacCarn and I attended the first SMPTE work study forum. Joined by Robin Dale of the Research Libraries Group, we met with about 20 SMPTE engineers at the Sony headquarters in San Jose, California to discuss the components of the UPF with respect to the stated needs and concerns of archivists, as expressed in our User Survey.
Recently we published results of our survey on our web site, as well as follow-up questions. We will include your commentaries in future site revisions. We urge you all to read these often inspiring commentaries from some very respected people in this field.
This material will accompany us to Atlanta, Georgia when we take part in the next SMPTE study group meeting. This meeting will be held on March 12th at the Turner Entertainment building, and we invite all archivists in the Atlanta area to attend.
The success of any initiative dealing with digital preservation hinges on the quality of communication between professionals who manage media collections and leaders within the digital storage industry. By concentrating on elemental concepts of how data and information about that data might be stored through time, the Universal Preservation Format initiative is attempting to construct a bridge between engineers and information scientists, between those who make and market technical specifications and those who use them.