example is, given an entity, return a
set of possible properties—attributes
and relationships—that may be associated with it. Such a service would
be useful for both information-extraction tasks and query expansion.
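As a rough illustration of such an entity-to-properties service, the following minimal sketch assumes a small in-memory collection of extracted tables and proposes, for a given entity, the column headers that co-occur with it. The table data, the function name properties_for, and the counting heuristic are all hypothetical and are not the method used in our projects.

```python
from collections import Counter

# Hypothetical toy corpus of extracted tables (made-up values):
# each table has a header row and a list of data rows.
TABLES = [
    {"header": ["city", "population", "mayor"],
     "rows": [["Springfield", "30000", "A. Smith"],
              ["Shelbyville", "25000", "B. Jones"]]},
    {"header": ["city", "population", "area"],
     "rows": [["Springfield", "30000", "66"]]},
    {"header": ["team", "home city"],
     "rows": [["Isotopes", "Springfield"]]},
]


def properties_for(entity, tables, top_k=5):
    """Return up to top_k column names that co-occur with the entity.

    Assumption: a column header in any table that mentions the entity
    is a candidate property; candidates are ranked by table count.
    """
    counts = Counter()
    for table in tables:
        header = table["header"]
        for row in table["rows"]:
            if entity in row:
                entity_col = row.index(entity)
                # Every other column in this table is a candidate property.
                counts.update(
                    name for i, name in enumerate(header) if i != entity_col
                )
                break  # count each table at most once
    return [name for name, _ in counts.most_common(top_k)]
```

Calling properties_for("Springfield", TABLES) ranks "population" first because it appears in two of the toy tables that mention the entity. A real service would also need to resolve entity mentions and normalize attribute names, which this sketch ignores.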
Structured data from other sources. Some of the principles of our previous projects are useful for extracting
structured data from other growing
sources on the Web:
Socially created data sets. These
data sets (such as encyclopedia articles, videos, and photographs) are
large and interesting and exist mainly in site-specific silos, so integrating them with information extracted
from the wider Web would be useful;
Hypertext-based data models. In these
models, page authors use
combinations of HTML elements
(such as a list of hyperlinks) to perform
certain data-model tasks (such as indicating that all entities pointed to by
the hyperlinks belong to the same
set); this category can be considered a
generalization of the observation that
HTML tables are used to communicate relations; and
Office-style documents. These documents (such as spreadsheets and
slide presentations) contain their own
structured data. Because they are
complicated, extracting information
from them can be difficult, but the structured data
they contain makes them a tantalizing target.
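The hypertext-based case can be made concrete with a small sketch. Assuming that an HTML list (ul or ol) containing two or more hyperlinks signals that the link targets belong to one set, the code below collects one such set per list using Python's standard html.parser module. The class name LinkListExtractor, the helper link_sets, and the two-link threshold are illustrative assumptions, not a description of a deployed extractor.

```python
from html.parser import HTMLParser


class LinkListExtractor(HTMLParser):
    """Collects, for each <ul>/<ol>, the hrefs of the links inside it."""

    def __init__(self):
        super().__init__()
        self.sets = []     # one list of hrefs per completed HTML list
        self._stack = []   # currently open lists, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append([])
        elif tag == "a" and self._stack:
            href = dict(attrs).get("href")
            if href:
                # Attribute the link to the innermost open list.
                self._stack[-1].append(href)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._stack:
            links = self._stack.pop()
            # Assumed heuristic: a set needs at least two members.
            if len(links) >= 2:
                self.sets.append(links)


def link_sets(html):
    """Return the list of hyperlink sets found in an HTML string."""
    parser = LinkListExtractor()
    parser.feed(html)
    parser.close()
    return parser.sets
```

Links that appear outside any list are ignored, and a nested list yields its own set separate from its parent. Deciding whether a given set is meaningful (rather than, say, a navigation menu) is the hard part and is left out of this sketch.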
Creating and publishing structured data.
The projects we’ve described
are reactive in the sense that
they try to leverage data already on
the Web. In a complementary line of
work, we created Google Fusion Tables,13
a service that aims to facilitate
the creation, management, and publication
of structured data, enabling
users to upload tabular data files,
including spreadsheets and CSV, of up
to 100MB. The system provides ways
to visualize the data (maps, charts,
timelines) along with the ability to
query by filtering and aggregating the
data. Fusion Tables enables users to
integrate data from multiple sources
by performing joins across tables that
may belong to different users. Users
can keep the data private, share it with
a select set of collaborators, or make
it public. When the data is made public, search
engines are able to crawl the tables,
thereby providing additional incentive
to publish data. Fusion Tables
also includes a set of social features
(such as collaborators conducting
detailed discussions of the data at
the level of individual rows, columns,
and cells). For notable uses of Fusion
Tables, go to https://sites.google.com/site/fusiontablestalks/stories.
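The three query operations mentioned above (filtering, aggregating, and joining tables) can be sketched in a few lines of plain Python. This is not the Fusion Tables API; the function names, the representation of a table as a list of dictionaries, and the sample data are assumptions made purely for illustration.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def aggregate_sum(rows, group_key, value_key):
    """Sum value_key within each distinct value of group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += float(row[value_key])
    return dict(totals)


def join(left, right, key):
    """Inner join of two tables on a shared column name."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Made-up tables, as they might look after loading two users' CSV uploads.
sales = [
    {"region": "north", "amount": "10"},
    {"region": "south", "amount": "5"},
    {"region": "north", "amount": "2.5"},
]
managers = [{"region": "north", "manager": "Ana"}]

north_only = filter_rows(sales, lambda r: r["region"] == "north")
totals = aggregate_sum(sales, "region", "amount")
joined = join(sales, managers, "region")
```

Here the join keeps only the two "north" sales rows, each extended with the manager column from the second table. A hosted service must additionally handle access control across owners and much larger inputs, neither of which is modeled here.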
Conclusion
Structured data on the Web poses
several technical challenges: it is difficult
to extract, typically disorganized, and
often messy. The centralized control
enforced by a traditional database
system avoids all of them, but centralized control also misses out on the
main virtues of Web data—that it can
be created by anyone and covers every
topic imaginable. We are only starting
to see the benefits that might accrue
from these virtues. In particular, as illustrated by Web Tables synonym finding and schema auto-suggest, we see
the results of large-scale data mining
of an extracted (and otherwise unobtainable) data set.
It is often argued that only select
Web-search companies are able to
carry out research of the flavor we’ve
described here. This argument holds
mostly for research projects involving
access to logs of search queries. Although
the research described here was made
easier by having access to a large Web
index and computational infrastructure, much of it can be conducted
at academic institutions as well,
particularly when it involves such challenges as extracting the meaning of
tables on the Web and finding interesting combinations of such tables.
ACSDb is freely available to researchers outside of Google
(https://www.eecs.umich.edu/michjc/acsdb.html);
we also expect to make additional
data sets available to foster related research.
References
1. Barbosa, L. and Freire, J. Siphoning Hidden-Web data
through keyword-based interfaces. In Proceedings
of the Brazilian Symposium on Databases, 2004,
309–321.
2. Bergman, M.K. The Deep Web: Surfacing hidden value.
Journal of Electronic Publishing 7, 1 (2001).
3. Cafarella, M.J., Halevy, A.Y., and Khoussainova, N. Data
integration for the relational Web. Proceedings of the
VLDB Endowment 2, 1 (2009), 1090–1101.
4. Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., and
Zhang, Y. Web Tables: Exploring the power of tables
on the Web. Proceedings of the VLDB Endowment 1, 1
(Aug. 2008), 538–549.
5. Cafarella, M.J., Halevy, A.Y., Zhang, Y., Wang, D.Z., and
Wu, E. Uncovering the relational Web. In Proceedings
of the 11th International Workshop on the Web and
Databases (Vancouver, B.C., June 13, 2008).
6. Callan, J.P. and Connell, M.E. Query-based sampling
of text databases. ACM Transactions on Information
Systems 19, 2 (2001), 97–130.
7. Cars.com (faq); http://siy.cars.com/siy/qsg/
faqgeneralinfo.jsp#howmanyads
8. Cazoodle apartment search; http://apartments.
cazoodle.com/
9. Chang, K.C.-C., He, B., and Zhang, Z. Toward
large-scale integration: Building a metaquerier
over databases on the Web. In Proceedings of the
Conference on Innovative Data Systems Research
(Asilomar, CA, Jan. 2005).
10. Chen, H., Tsai, S., and Tsai, J. Mining tables from
large-scale HTML texts. In Proceedings of the
18th International Conference on Computational
Linguistics (Saarbrücken, Germany, July 31–Aug. 4,
2000), 166–172.
11. Elmeleegy, H., Madhavan, J., and Halevy, A. Harvesting
relational tables from lists on the Web. Proceedings of
the VLDB Endowment 2, 1 (2009), 1078–1089.
12. Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl,
B., and Pollak, B. Towards domain-independent
information extraction from Web tables. In
Proceedings of the 16th International World Wide Web
Conference (Banff, Canada, May 8–12, 2007), 71–80.
13. Gonzalez, H., Halevy, A., Jensen, C., Langen, A.,
Madhavan, J., Shapley, R., Shen, W., and Goldberg-Kidon, J. Google Fusion Tables: Web-centered data
management and collaboration. In Proceedings of the
SIGMOD ACM Special Interest Group on Management
of Data (Indianapolis, 2010). ACM Press, New York,
2010, 1061–1066.
14. He, B., Patel, M., Zhang, Z., and Chang, K.C.-C.
Accessing the Deep Web. Commun. ACM 50, 5 (May
2007), 94–101.
15. Ipeirotis, P.G. and Gravano, L. Distributed search over
the Hidden Web: Hierarchical database sampling and
selection. In Proceedings of the 28th International
Conference on Very Large Databases (Hong Kong, Aug.
20–23, 2002), 394–405.
16. Limaye, G., Sarawagi, S., and Chakrabarti, S.
Annotating and searching Web tables using entities,
types, and relationships. Proceedings of the VLDB
Endowment 3, 1 (2010), 1338–1347.
17. Madhavan, J., Ko, D., Kot, L., Ganapathy, V.,
Rasmussen, A., and Halevy, A.Y. Google’s Deep Web
Crawl. Proceedings of the VLDB Endowment 1, 1
(2008), 1241–1252.
18. Madhavan, J., Cohen, S., Dong, X.L., Halevy, A.Y.,
Jeffery, S.R., Ko, D., and Yu, C. Web-scale data
integration: You can afford to pay as you go. In
Proceedings of the Second Conference on Innovative
Data Systems Research (Asilomar, CA, Jan. 7–10,
2007), 342–350.
19. Ntoulas, A., Zerfos, P., and Cho, J. Downloading
textual Hidden Web content through keyword queries.
In Proceedings of the Joint Conference on Digital
Libraries (Denver, June 7–11, 2005), 100–109.
20. Raghavan, S. and Garcia-Molina, H. Crawling the
Hidden Web. In Proceedings of the 27th International
Conference on Very Large Databases (Rome, Italy,
Sept. 11–14, 2001), 129–138.
21. Trulia; http://www.trulia.com/
22. Wang, Y. and Hu, J. A machine-learning-based
approach for table detection on the Web. In
Proceedings of the 11th International World Wide Web
Conference (Honolulu, 2002), 242–250.
23. Zanibbi, R., Blostein, D., and Cordy, J. A survey of table
recognition: Models, observations, transformations,
and inferences. International Journal on Document
Analysis and Recognition 7, 1 (2004), 1–16.
Michael J. Cafarella (michjc@umich.edu) is an assistant
professor of computer science and engineering at the
University of Michigan, Ann Arbor, MI.
Alon Halevy (halevy@google.com) is Head of the
Structured Data Management Research Group, Google
Research, Mountain View, CA.
Jayant Madhavan (jayant@google.com) is a senior
software engineer at Google Research, Mountain View, CA.
© 2011 ACM 0001-0782/11/0200 $10.00