9.10.D : Acquired Data: Archiving Images

In preserving the historic record of observations made with DEIMOS, the most challenging of the three kinds of information to be archived is of course the image data, the product of the observing run. The challenge consists not in the complexity of the data, but in their sheer volume.

How to store image data

Although many modern database vendors tout their ability to store binary objects such as images (Sybase calls these BLOBs, Binary Large OBjects), in practice database engineers generally do not choose to exploit this feature. (As one database engineer has said, "You can store large images in relational databases. You can also fit a cow into your refrigerator if you try really hard, but it doesn't do much for the cow or the refrigerator.") The preferred method in practice is to store a library of header information in the relational database, and to use this library to reference a file-system based archive of large image files.
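
As a sketch of this two-part design (purely illustrative, in Python with present-day tools; the table and column names are invented, not a proposed DEIMOS schema):

    import sqlite3

    conn = sqlite3.connect("archive.db")
    # The database holds only the header library; the pixel data live in
    # ordinary files (or on jukebox media) pointed to by image_path.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS image_headers (
            id          INTEGER PRIMARY KEY,
            object      TEXT,   -- OBJECT keyword
            dateobs     TEXT,   -- DATE-OBS keyword
            exptime     REAL,   -- EXPTIME keyword, seconds
            image_path  TEXT    -- where the FITS file lives in the archive
        )
    """)
    conn.commit()

All searching then happens against this table; the file-system archive (or jukebox) is touched only when a matching image is actually fetched.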

FITS headers map very well to records in a traditional RDBMS table. An archive of millions of FITS headers would not strain the limits of today's database engines; with proper maintenance and indexing, very rapid and intelligent searches on million-record tables are routinely done throughout the information systems world today. Now all we have to do is include in each record a reference to the location of the corresponding image file, or adequate information to instruct a jukebox to retrieve the file.
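
Loading such a record is a matter of copying keywords out of the FITS header and recording the file's location. A sketch using the modern astropy FITS reader (an anachronism here, chosen only for brevity; the keywords shown are standard FITS, and the example path is hypothetical):

    import sqlite3
    from astropy.io import fits

    def ingest(path, conn):
        """File one image's header in the database, with a pointer to the file."""
        header = fits.getheader(path)
        conn.execute(
            "INSERT INTO image_headers (object, dateobs, exptime, image_path)"
            " VALUES (?, ?, ?, ?)",
            (header.get("OBJECT"), header.get("DATE-OBS"),
             header.get("EXPTIME"), path))
        conn.commit()

    # e.g. ingest("/archive/1996-03-13/d0042.fits", conn)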

How much storage is needed for image data

One image (single beam) from the full 8-CCD array (2Kx4K pixels per CCD) is 128MB (at 16 bits per pixel). A dual-beam observation would be twice that size. We might expect to acquire as many as 100 images in a single night, and an observing run might be 3 nights. These numbers pose an obvious problem even at the stage of immediate image capture, storage, and quick-look reduction. When we consider archiving them, there are very few existing technologies that will do the job.
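
The arithmetic is easy to check; a minimal sketch in Python, using the figures above (100 dual-beam images per night is the worst case assumed here):

    pixels_per_ccd  = 2048 * 4096              # 2K x 4K pixels
    bytes_per_image = 8 * pixels_per_ccd * 2   # 8 CCDs, 16 bits (2 bytes) per pixel
    print(bytes_per_image / 2**20)             # 128.0 -- MB per single-beam image
    print(2 * bytes_per_image / 2**20)         # 256.0 -- MB per dual-beam observation
    print(100 * 2 * bytes_per_image / 1e9)     # ~26.8 -- GB/night for 100 dual-beam
                                               #          images (the ~26GB used below)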

Suppose that we can really acquire, say, 26GB of archive-worthy data every night (which won't actually happen, but we should plan for the worst case), and that DEIMOS is in use one quarter of the time (since there are 4 instruments planned for Keck-II). If in addition every night of the year were perfect, we would have 91 nights of DEIMOS observing per year, and would end up with 2366GB of image data to archive every year -- a daunting prospect.

Realistically, Keck observing logs indicate that only about 38% of telescope time is spent actually acquiring images with Keck I. If we assume that we will do a little better with Keck II and make good use of at least 50% of our observing time, then we can reduce our yearly acquired data volume to 1200GB. We can further estimate that some nights will be lost to bad weather and engineering, perhaps a fifth of all available nights; 4/5 of 1200GB brings us down to 960GB, about two-fifths of our original estimate.
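
The same chain of estimates, written out (the rounding follows the text; all figures are GB):

    nights_per_year = 365 // 4                 # DEIMOS's quarter of Keck-II: 91 nights
    worst_case = 26 * nights_per_year          # every night perfect: 2366 GB/year
    realistic  = round(worst_case * 0.50, -2)  # 50% of time spent acquiring: ~1200 GB
    after_losses = realistic * 4 / 5           # a fifth of nights lost: 960 GB
    print(worst_case, realistic, after_losses) # 2366 1200.0 960.0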

In either case, it is unreasonably costly to archive this quantity of data on traditional hard drives: the acquisition of the equipment, though appallingly expensive, would be less of a difficulty than the maintenance of such an enormous disk farm. The only technology capable of dealing economically with this volume of data is database + archival media jukebox, and today's more affordable jukebox technology is not capable of handling our worst-case volume of data. Our more optimistic estimate, however, is within the reach of today's tools.

Automated removable media (jukebox) technology

Jukebox technology is the obvious choice for the library of images. However, the volume of data involved disqualifies some of today's options. The largest CDROM jukebox of reasonable (<$100K) cost that I know of holds 500 CDs. At today's limit of 600MB per CDROM, this (300GB) is pathetically inadequate. If we assume availability of the oft-predicted 10x (6GB) CDROM (which may or may not materialize by the time we are ready to archive), the resulting 3000GB is only about 18 months' work under our worst-case estimate. Under our more optimistic estimate, however, 3000GB would represent several years' work and would provide a reasonable mid-term solution. Even a lesser achievement, say a 4GB CDROM format, would be acceptable if our more optimistic estimate is accurate.

A tape jukebox using high-speed, high-density cartridge tapes such as the DEC (Quantum) DLT may seem like a better option today, since the cartridges hold up to 40GB and provide relatively rapid seek times (as compared to other tape transports such as Exabyte). Still, the largest jukebox I have seen for DLT cartridges held only 10 units (400GB), only a slight improvement over CDROM jukes in storage space. The seek time is not rapid when compared to CDROM media, nor are these jukeboxes cheap ($10-15K); and the media are more fragile than CDROM. For a long-term historical archive we should choose the most robust, damage-resistant media possible.
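
To put the options side by side (capacities as quoted above; the years-of-data column uses the optimistic 960GB/year estimate):

    yearly_gb = 960                              # optimistic estimate from above
    options = [("500-slot CDROM juke, 600MB discs", 500 * 0.6),
               ("500-slot CDROM juke, 6GB discs",   500 * 6.0),
               ("500-slot CDROM juke, 4GB discs",   500 * 4.0),
               ("10-slot DLT juke, 40GB cartridges", 10 * 40.0)]
    for name, gb in options:
        print(f"{name}: {gb:4.0f} GB = {gb / yearly_gb:.1f} years of data")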

Data Compression

We might at this point ask whether compression couldn't solve our problem for us. However, DEIMOS spectral frames are bad candidates for compression: almost all the imaging surface is relevant, and simple lossless schemes (like RLE) are not going to work due to the "no background" quality of the images and their fine texture. Sky image frames might be subjected to compression, but since sky background and background noise are part of the historical record, compression schemes which impose a noise floor (for example) cannot be used. From the available lossless algorithms we can expect no better than about 30% compression, not enough to solve the problem over any reasonable period.
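
The effect is easy to see in miniature. A hedged sketch, compressing a synthetic 16-bit frame dominated by fine-grained noise (the flat level and noise parameters are invented for illustration; real DEIMOS frames will differ):

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-in for a spectral frame: a flat level of ~1000 counts
    # plus Gaussian noise, so nearly every pixel differs from its neighbours.
    frame = rng.normal(1000, 30, size=(2048, 4096)).astype(np.uint16)
    raw = frame.tobytes()
    compressed = zlib.compress(raw, level=9)
    print(f"lossless (zlib): {len(compressed) / len(raw):.0%} of original size")

On data like these, a lossless compressor recovers only the redundancy in the quieter high-order bytes; the noise-dominated low-order bits are essentially incompressible, which is why no lossless scheme can approach the order-of-magnitude savings we would need.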

Long Term Planning

Media density and formats change fairly rapidly in the computing industry. The challenge for archive maintainers is to choose a technology mature enough for reliability, yet not so mature as to be in imminent danger of "product death". There is always the risk of committing to a particular storage medium which does not survive the extreme evolutionary pressures of the market. To have a format become obsolete within a few months of our commitment to it would be unfortunate.

However, we must assume that no choice of media and format, however optimal at the time, will endure for the (indefinite) lifetime of our archive. At some point the technology will become obsolete or the volume of our data will exceed its practical capacity; we will then have to "migrate" the data to a new medium and format. This expense (of periodic migration) must be accepted and planned for, if an archive of lasting historical value is our aim.

If we acknowledge that any technology choice for archiving will have at most a 5-year lifetime in practice, then a CDROM jukebox is probably the optimal choice for the first generation. The 5-inch CDROM format is as close to universal as the industry offers; CDROMs are robust, and the robotics to handle them are well-understood and mass-produced. The initial cost of a 500-CD jukebox of good quality is on the order of $25-30K.

Relationship of Archive and Backup

Keck already enforces a mandatory backup of all acquired images using the Save-the-Bits software package. Unfortunately STB is a hack which uses the Berkeley lpd protocol to move images around; not only is Berkeley lpd unlikely to work exactly the same under future Solaris releases, but lpd-type protocols in general are not appropriate for images of DEIMOS size.

How, then, does the archiving concept interact with the need for reliable backups of each night's observing? It seems foolish to run two completely separate software packages. If we assume that all images are archived, and that all image headers are stored in our database, we have taken care of all backup issues already. One problem remains: the duplication of media. Backup media obviously want to remain on Mauna Kea, whereas the archival jukebox/WWW-server is likely to be elsewhere in Hawaii or even on the mainland.

Copying very large CDROMs may be time-consuming, and although it could conceivably be done during the day, daytime is also when scheduled downtime and network interruptions occur, either of which could prevent the duplication. Ideally we would like to make two identical copies on the spot, during observing, with no additional use of network bandwidth or human time. If we can acquire hardware that duplicates signals to two identical SCSI devices, making them look like one device, then this is a practical and ideal strategy. We need to investigate the existence and/or cost of such hardware (effectively RAID for CDROM).
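
In the absence of such hardware, the idea can at least be prototyped in software. A sketch that simply writes each incoming image to two destinations in one pass (the paths are hypothetical; the real thing would duplicate at the SCSI or driver level, not in an application loop):

    def tee_image(src_path, dest_a, dest_b, chunk=1 << 20):
        """Copy one image to two destinations in a single pass."""
        with open(src_path, "rb") as src, \
             open(dest_a, "wb") as a, \
             open(dest_b, "wb") as b:
            while True:
                block = src.read(chunk)
                if not block:
                    break
                a.write(block)   # archive copy
                b.write(block)   # backup copy, written in the same pass

    # e.g. tee_image("d0123.fits", "/archive/d0123.fits", "/backup/d0123.fits")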


de@ucolick.org
De Clarke
UCO/Lick Observatory
University of California