suggests the Deep Web alone can generate more than one billion pages of
valuable structured data. The result
is an astounding number of distinct
structured data sets, most still waiting to be exposed more effectively to
users.
This structured data differs from
data stored in traditional relational
databases in several ways:
Data in “page context” must be extracted. Consider a database embedded in an HTML table (such as local
coffeehouses in Seattle and the U.S.
presidents in Figure 1). To the user
the data set appears to be structured,
but a computer program must be able
to automatically distinguish it from,
say, a site’s navigational bar that also
uses an HTML table. Similarly, a Web
form that gives access to an interesting Deep Web database, perhaps containing all Starbucks locations in the
world, is not that different from a form
offering simple mailing-list signup.
The computer program might also
have to automatically extract schema
information in the form of column labels, which sometimes appear in the first
row of an HTML table and sometimes do not exist at all. Moreover, the
subject of a table may be described in
the surrounding text, making it difficult to extract. There is nothing akin
to traditional relational metadata that
leaves no doubt as to how many tables
there are and what the relevant schema information is for each of them.
No centralized data design or data-quality control. In a traditional database, the relational schema provides
a topic-specific design that must be
observed by all data elements. The
database and the schema may also
enforce certain quality controls (such
as observing type consistency within
a column, disallowing empty cells,
and constraining data values to a certain legal range). For example, the set
of coffeehouses may have a column
called year-founded containing
integers constrained to a relatively
small range. Neither data design nor
quality control exists for Web data; for