computing vision. We should not only
become early adopters of semantic
computing technologies and infrastructure in our research projects but
we should also actively develop and
evolve them. At Microsoft Research, we
are taking some first steps toward this
vision by investing in projects
that can demonstrate the benefits of
semantic computing technologies in
research. We are therefore attempting to build an ecosystem of research
tools and services as demonstrations
of these ideas and concepts.
We focus here on the role of the researcher as an “extreme information
worker,” meaning a technology user
with expectations and requirements
at a scale not yet required by the business community. We believe information representation, management,
and processing tools, in combination
with automation technologies, will
greatly help researchers in their work.
We are therefore taking small steps
toward developing semantics-aware
tools and services. Here, we describe
some of the work we are doing in supporting the scholarly communications
life cycle through semantic computing
technologies.
Semantic Annotations and Metadata
in Word. The authoring stage is perhaps
the best time to capture an author’s
intentions and to record the meaning
of the words as they are being written. Natural language may not always
be adequate to convey the meaning
of a word or an expression, especially
in the scientific world. In many disciplines, domain-specific ontologies are
therefore being created by experts to
address this issue, but they have not so
far been incorporated into productivity tools like Microsoft Office.
Figure 3. Semantic annotations in Word. (Figure callouts: support for annotations straight from within Word; domain-specific ontology; annotations travel with the document; can be used to improve domain-specific discovery of information, cross-linking, and so forth.)
Figure 4. A simple “Chemistry zone” in a Word document and the CML representation
(in pseudo-XML) stored inside the OOXML document.
<cml ...>
  <molecule ...>
    <atomArray>
      <atom elementType="C" ... />
      <atom elementType="H" ... />
      ...
    </atomArray>
    <bondArray>
      <bond ... />
      <bond ... />
      ...
    </bondArray>
  </molecule>
</cml>
In collaboration with Phil Bourne
and Lynn Fink at the University of
California, San Diego, we worked toward
a plug-in for Word 2007 (part of
the BioLit project; http://biolit.ucsd.edu/)
that allows authors to annotate
words or sentences with terms from
an ontology (for example, Gene Ontology; http://www.geneontology.org/).
The annotations are stored as part of
the Office Open XML (OOXML) representation of the document. (OOXML
has been accepted as an ISO standard;
more information can be found at
http://openxmldeveloper.org/.) Tools
and services can now extract the annotations simply by opening the OOXML
package, without human intervention
and without Word even being installed.
As a result, documents can be better categorized, indexed, and searched, with the
author’s intent always closely associated with the text.
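To make the extraction step concrete, here is a minimal sketch in Python. It relies on one known fact, that an OOXML package is a ZIP archive of XML parts, and on several assumptions. The `annotation` element, its `term` and `ontology` attributes, and the choice of a `customXml/` part as the storage location are hypothetical and are not taken from the BioLit plug-in. The sample package is built in memory and only mimics the layout of a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical annotation format: these element and attribute names are
# illustrative assumptions, not the schema used by the BioLit plug-in.
SAMPLE_ANNOTATIONS = """<annotations>
  <annotation term="GO:0006915" ontology="Gene Ontology">apoptosis</annotation>
  <annotation term="GO:0005634" ontology="Gene Ontology">nucleus</annotation>
</annotations>"""


def build_sample_package() -> bytes:
    """Create a tiny ZIP archive that mimics the layout of an OOXML package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Placeholder body part; a real document.xml is far richer.
        package.writestr("word/document.xml", "<document/>")
        # Assumed location for the annotations (a custom XML part).
        package.writestr("customXml/item1.xml", SAMPLE_ANNOTATIONS)
    return buffer.getvalue()


def extract_annotations(package_bytes: bytes) -> list[tuple[str, str, str]]:
    """Return (text, ontology, term) triples from every custom XML part."""
    found = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            if not (name.startswith("customXml/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            for node in root.iter("annotation"):
                found.append((node.text, node.get("ontology"), node.get("term")))
    return found


if __name__ == "__main__":
    for text, ontology, term in extract_annotations(build_sample_package()):
        print(f"{text!r} -> {ontology} term {term}")
```

Nothing in the sketch requires Word: the standard `zipfile` and `xml.etree.ElementTree` modules are enough to open the package and read its parts, which is what allows indexing and search services to process annotated documents unattended.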
The ability to easily annotate terms
from within Word is a first step in producing documents that semantically
relate to the body of knowledge in a
domain. In this way, information can
easily become part of a data mesh as it
is being generated (see Figure 3). The
source code for the plug-in is now available as open source (see http://ucsdbiolit.codeplex.com/)
for the community
to extend further or simply use as the basis for a new generation of semantics-oriented authoring tools.
Chemistry in Word. We are investigating, in collaboration with Peter
Murray-Rust, Jim Downing, and Joe
Townsend from the University of Cambridge, the introduction of chemistry
drawing functionality into Word documents (see http://research.microsoft.com/en-us/projects/chem4word/).
Rather than just having images of
chemical structures, we would like to
preserve the chemistry-related semantics in a machine-processable manner. For that reason, we are using the
Chemistry Markup Language (CML) in
our investigations; instances of CML
would be embedded inside OOXML
documents. We believe an ecosystem
of chemistry-related tools and services
can then emerge to enable the automatic processing of documents, making the authoring process an easy but
increasingly valuable part of the research life cycle.
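As a rough illustration of what "machine-processable" could mean here, the sketch below parses a small hand-written CML-style fragment and recovers the atom counts and bonded element pairs. The molecule (methane), the atom ids, and the helper name `summarize_molecule` are assumptions chosen only for illustration. The fragment is not the content of any figure or of an actual Word document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace commonly used for CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written sample for methane; ids and bond orders are illustrative.
METHANE_CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="methane">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="H"/>
      <atom id="a3" elementType="H"/>
      <atom id="a4" elementType="H"/>
      <atom id="a5" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
      <bond atomRefs2="a1 a3" order="1"/>
      <bond atomRefs2="a1 a4" order="1"/>
      <bond atomRefs2="a1 a5" order="1"/>
    </bondArray>
  </molecule>
</cml>"""


def summarize_molecule(cml_text: str) -> tuple[dict[str, int], list[tuple[str, str]]]:
    """Return element counts and the element pair joined by each bond."""
    root = ET.fromstring(cml_text)
    # Map each atom id to its element symbol.
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.iterfind(".//cml:atom", CML_NS)
    }
    counts = dict(Counter(atoms.values()))
    bonds = []
    for bond in root.iterfind(".//cml:bond", CML_NS):
        # atomRefs2 holds the two atom ids joined by the bond.
        first, second = bond.get("atomRefs2").split()
        bonds.append((atoms[first], atoms[second]))
    return counts, bonds


if __name__ == "__main__":
    element_counts, bonded_pairs = summarize_molecule(METHANE_CML)
    print(element_counts)
    print(bonded_pairs)
```

An image of the same structure would support none of this: a program could not count the atoms or follow the bonds without first reconstructing the chemistry from pixels.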
As an example, consider the water
molecule (H2O). In a Word document,
it appears as a series of characters,