format supporting legacy software
across the biological disciplines is a
Gordian knot. Convincing software
developers to make this a high priority
is a difficult proposition. Implementing it across hundreds of legacy packages and fielding it flawlessly in thousands of laboratories is no trivial task. Ideally, presenting images simultaneously in their legacy formats
and in a new advanced format would
mitigate the technical, social, and logistical obstacles. However, this must
be accomplished without duplicating
the pixels in secondary storage.
One proposal is to mount an HDF5
file as a VFS (virtual file system) so that
HDF5 groups become directories and
HDF5 datasets become regular files.
Such a VFS, built on FUSE (Filesystem in Userspace), would execute simultaneously across user-process space and operating-system space. This hyperspace would manage all HDF-VFS file activity by intercepting, interpreting, and dynamically rearranging
legacy image files. A single virtual file
presented by the VFS could be composed of several concatenated HDF5
datasets, such as a metadata header
dataset and a pixel dataset. Such a VFS
file could have multiple simultaneous
filenames and legacy formats depending on the virtual folder name that
contains it, or the software application
attempting to open it.
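The group-to-directory, dataset-to-file mapping underlying such a VFS can be sketched in a few lines. The following is a minimal illustration using the h5py bindings; it only enumerates an HDF5 hierarchy as POSIX-style paths rather than actually mounting it, and the file and dataset names are hypothetical:

```python
# Sketch: how an HDF-VFS might map HDF5 structure onto a directory tree.
# Names such as "scan01" and "pixels" are illustrative, not a standard.
import h5py
import numpy as np

def hdf5_as_paths(filename):
    """Return POSIX-style paths: groups as directories, datasets as files."""
    paths = []
    def visit(name, obj):
        if isinstance(obj, h5py.Group):
            paths.append("/" + name + "/")   # group -> directory
        else:
            paths.append("/" + name)         # dataset -> regular file
    with h5py.File(filename, "r") as f:
        f.visititems(visit)
    return sorted(paths)

# Build a tiny example file.
with h5py.File("demo.h5", "w") as f:
    grp = f.create_group("scan01")
    grp.create_dataset("pixels", data=np.zeros((4, 4), dtype=np.uint8))
    grp.create_dataset("metadata", data=np.frombuffer(b"header", dtype=np.uint8))

print(hdf5_as_paths("demo.h5"))
# -> ['/scan01/', '/scan01/metadata', '/scan01/pixels']
```

A real HDF-VFS would hand this same mapping to FUSE callbacks (readdir, open, read) instead of returning a list.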
The design and function of an HDF-VFS offer several possibilities. First,
non-HDF5 application software could
interact transparently with HDF5 files.
PDF files, spreadsheets, and MPEGs
would be written and read as routine
file-system byte streams. Second, this
VFS, when combined with transparent
on-the-fly compression, would act as
an operationally usable compressed
tarball. Third, the VFS could be designed with unique features, such as interpreting incoming files as image files. Community-based legacy image format filters would rearrange legacy image files. For
example, the pixels would be stored as
HDF5 datasets in the appropriate dimensionality and modality, and the
related metadata would be stored as a
separate HDF5 1D byte dataset. When
legacy application software opens the
legacy image file, the virtual file is dynamically recombined and presented
by the VFS to the legacy software in the
same byte order as defined by the legacy image format. The fourth possibility
is to endow the VFS with archival and
performance analysis tools that could
transparently provide those services to
legacy application software.
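The split-and-recombine behavior of such a legacy-format filter can be sketched concretely. The toy "legacy format" below (a fixed 16-byte header followed by raw 8-bit pixels) is invented for illustration; the point is that the recombined byte stream is identical to the original:

```python
# Sketch of a legacy-format filter: split a toy legacy image (an invented
# format: 16-byte header, then raw 8-bit pixels) into an HDF5 pixel dataset
# plus a 1D byte metadata dataset, then recombine them byte-for-byte when
# legacy software asks for the original file.
import h5py
import numpy as np

HEADER_LEN = 16  # hypothetical fixed-size header for the toy format

def ingest(legacy_bytes, h5name, shape):
    with h5py.File(h5name, "w") as f:
        header = np.frombuffer(legacy_bytes[:HEADER_LEN], dtype=np.uint8)
        pixels = np.frombuffer(legacy_bytes[HEADER_LEN:], dtype=np.uint8)
        f.create_dataset("metadata", data=header)               # 1D byte dataset
        f.create_dataset("pixels", data=pixels.reshape(shape))  # dimensional dataset

def present(h5name):
    """Recombine the datasets in the byte order the legacy format defines."""
    with h5py.File(h5name, "r") as f:
        return f["metadata"][()].tobytes() + f["pixels"][()].tobytes()

original = bytes(range(16)) + bytes(64)   # header + 8x8 pixel payload
ingest(original, "img.h5", shape=(8, 8))
assert present("img.h5") == original      # byte-stream identical
```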
To achieve the goal of an exemplary image design with wide, long-term support, we offer the following recommendations for consideration through a formal standards process:
1. Permit and encourage scientific communities to continually evolve
their own image designs. They know
the demands of their disciplines best.
Implementing community image formats through HDF5 provides these
communities flexible routes to a common image model.
2. Adopt the archival community’s
recommendations on archive-ready
datasets. Engaging the digital preservation community from the outset, rather
than as an afterthought, will produce
better long-term image designs.
3. Establish a common image model. The specification must be conceptually simple and should merely distinguish the image’s pixels from the
various metadata. The storage of pixels should be in an appropriate dimensional dataset. The encapsulation of
community metadata should be in 1D
byte datasets or attributes.
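A minimal sketch of this common image model follows: pixels in an appropriately dimensional dataset, community metadata encapsulated as a 1D byte dataset, and universal metadata (rank, dimensions, pixel modality) as HDF5 attributes. The dataset and attribute names here are assumptions for illustration, not a standard:

```python
# Sketch of the common image model: pixels, community metadata, and
# universal metadata attributes. All names are illustrative.
import h5py
import numpy as np

pixels = np.zeros((16, 32, 32), dtype=np.uint8)  # 3D image stack
community_xml = b"<microscope><objective>63x</objective></microscope>"

with h5py.File("image.h5", "w") as f:
    dset = f.create_dataset("pixels", data=pixels)
    dset.attrs["rank"] = 3                        # universal metadata
    dset.attrs["dimensions"] = pixels.shape
    dset.attrs["pixel_modality"] = "uint8 grayscale"
    f.create_dataset("community_metadata",        # 1D byte encapsulation
                     data=np.frombuffer(community_xml, dtype=np.uint8))

with h5py.File("image.h5", "r") as f:
    assert int(f["pixels"].attrs["rank"]) == 3
    assert f["community_metadata"][()].tobytes() == community_xml
```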
4. The majority of the metadata is
uniquely specific to the biological community that designs it. The use of binary or XML is an internal concern of the
community creating the image design;
however, universal image metadata
will overlap across disciplines, such
as rank, dimensionality, and pixel modality. Common image nomenclature
should be defined to bridge metadata
namespace conversions to legacy formats.
5. Use RDF (Resource Description Framework)15 as the primary mechanism to manage the association of pixel datasets and the community metadata. A Subject-Predicate-Object-Time
tuple stored as a dataset can benefit
from HDF5’s B-tree search features.
Such an arrangement provides useful
time stamps for provenance and generic logging for administration and
performance testing. The definition
of RDF predicates and objects should
follow the extensible design strategy
used in the organization of the NFS (Network File System) version 4 protocol.
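One way to realize such a tuple store is sketched below: Subject-Predicate-Object-Time rows in a chunked HDF5 compound dataset (HDF5 tracks chunked datasets with internal B-trees). The field names and example provenance entries are assumptions, not a standard:

```python
# Hedged sketch: SPOT (Subject-Predicate-Object-Time) tuples stored as a
# chunked HDF5 compound dataset. Field names and values are illustrative.
import h5py
import numpy as np

str_dt = h5py.string_dtype()  # variable-length strings
spot_dt = np.dtype([("subject", str_dt), ("predicate", str_dt),
                    ("object", str_dt), ("time", np.float64)])

tuples = np.array([
    ("/scan01/pixels", "derived_from", "/raw/frame001", 1.0),
    ("/scan01/pixels", "created_by", "deconvolve-2.1", 2.0),
], dtype=spot_dt)

with h5py.File("prov.h5", "w") as f:
    # Chunked and extendable, so new provenance rows can be appended.
    f.create_dataset("provenance", data=tuples, maxshape=(None,), chunks=True)

def _s(x):
    # h5py may return str or bytes for stored strings, depending on version.
    return x.decode() if isinstance(x, bytes) else x

def lookup(h5name, subject, predicate):
    """Return (object, time) pairs matching a subject and predicate."""
    with h5py.File(h5name, "r") as f:
        rows = f["provenance"][()]
    return [(_s(r["object"]), float(r["time"])) for r in rows
            if _s(r["subject"]) == subject and _s(r["predicate"]) == predicate]

assert lookup("prov.h5", "/scan01/pixels", "created_by") == [("deconvolve-2.1", 2.0)]
```

The time field supplies the provenance timestamps and generic logging mentioned above.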
6. In some circumstances it will
be desirable to define adjuncts to the
common image model. An example is
MPEG video, where the standardized
compression is the overriding reason
to store the data as a 1D byte stream
rather than decompressing it into the
standard image model as a 3D YCbCr
pixel dataset. A proprietary image format is another type of adjunct, requiring 1D byte encapsulation rather than translation into the common image model. In this scenario, such images are merely flagged, and routine archival methods are applied.
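An adjunct of this kind amounts to opaque 1D byte encapsulation plus a flag. The sketch below uses a stand-in byte string for an MPEG stream, and the "adjunct_type" attribute name is an assumption for illustration:

```python
# Sketch of an adjunct: an already-compressed stream kept as a 1D byte
# dataset and merely flagged, rather than decompressed into the common
# image model. The attribute name "adjunct_type" is illustrative.
import h5py
import numpy as np

mpeg_bytes = b"\x00\x00\x01\xba" + b"\x00" * 60   # stand-in for an MPEG stream

with h5py.File("video.h5", "w") as f:
    dset = f.create_dataset("video", data=np.frombuffer(mpeg_bytes, dtype=np.uint8))
    dset.attrs["adjunct_type"] = "MPEG"           # flag: do not translate

with h5py.File("video.h5", "r") as f:
    assert f["video"].attrs["adjunct_type"] == "MPEG"
    assert f["video"][()].tobytes() == mpeg_bytes  # byte stream preserved
```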
7. Provide a comprehensively tested
software API in lockstep with the image
model. Lack of a common API requires
each scientific group to develop and
test the software tools from scratch or
borrow them from others, resulting in
not only increased cost for each group,
but also increased likelihood of errors
and inconsistencies among implementations.
8. Implement HDF5 as a virtual file
system. HDF-VFS could interpret incoming legacy image file formats by
storing them as pixel datasets and encapsulated metadata. HDF-VFS could
also present such a combination of
HDF datasets as a single legacy-format
image file, byte-stream identical. Such
a file system would allow legacy applications to access and interact with
the images through standard file I/O
calls, obviating the requirement and
burden of legacy software to include,
compile, and link HDF5 API libraries
in order to access images. The duality
of presenting an image as a file and
an HDF5 dataset offers a number of
intriguing possibilities for managing
images and non-image datasets such
as spreadsheets or PDF files, or managing provenance without changes to
legacy application software.
9. Make the image specification
and software API freely accessible and
available without charge. Preferably,
such software should be available under an open source license that allows
a community of software developers to
contribute to its development. Charging the individual biological imaging
communities and laboratories adds