NAGA should be extended to better
capture the context of the user and the
data. User context requires personalized and task-specific LMs that consider current location, time, short-term
history, and intention in the user’s digital traces. Data context calls for LMs
for entity-relationship graphs, aiming
to better model complex patterns beyond single facts (edges) and consider
types; and
Efficient search. Evaluating complex
query predicates over graphs is computationally difficult. Moreover, because
results must be ranked, the system should avoid materializing overly
large numbers of results and instead
compute only the top-k results efficiently.16
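As a rough illustration of computing only the top-k results without materializing the full result set, the following minimal Python sketch keeps a bounded min-heap over a lazily generated stream of scored answers. The function names (`top_k`, `candidate_answers`) and the scores are hypothetical and are not taken from NAGA. The sketch only bounds memory to k entries; it still scores every candidate and does not implement the early-termination techniques that a dedicated top-k query processor would use.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

ScoredAnswer = Tuple[float, str]  # (score, answer label); hypothetical format


def top_k(scored_results: Iterable[ScoredAnswer], k: int) -> List[ScoredAnswer]:
    """Return the k highest-scoring answers, best first, using O(k) memory."""
    if k <= 0:
        return []
    heap: List[ScoredAnswer] = []  # min-heap holding the best k seen so far
    for item in scored_results:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new answer beats the weakest retained one: swap it in.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


def candidate_answers() -> Iterator[ScoredAnswer]:
    """Hypothetical lazy stream of scored query answers (made-up values)."""
    yield from [(0.31, "a"), (0.92, "b"), (0.15, "c"), (0.77, "d"), (0.58, "e")]


best = top_k(candidate_answers(), k=2)
print(best)
```

Because the answers are consumed from a generator, at most k of them are held in memory at any time, regardless of how many candidates the query produces.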
On a grander scale is the question of which is the most appropriate
paradigm. The three avenues toward
comprehensive knowledge harvesting—Semantic, Statistical, and Social Web—are by no means mutually
exclusive. The projects outlined here
combine aspects of several of these directions. Deeper understanding of feedback between and synergies from the
three paradigms is an overriding theme
of great potential value to researchers.
Semantic-Web sources can be powerful
bootstrap tools for large-scale Statistical-Web mining. Statistical-Web tools
may produce many false hypotheses,
but they can be assessed by Social-Web
platforms with large communities of
users that engage in human-computing
tasks. Social-Web endeavors in turn are
often grassroots catalysts for developing high-value knowledge repositories
that eventually become Semantic-Web
assets; examples are Wikipedia and derived knowledge bases (such as YAGO
and DBpedia).
Conclusion
We have presented motivations for
and approaches toward integrating
the historically separated DB and IR
methodologies. While deep DB/IR integration may be wishful thinking, at
least for the time being, we observe
strong trends toward adopting IR concepts in the DB world and vice versa.
In addition to applications that must
be able to manage structured and unstructured data or highly heterogeneous information sources, we also
see increasing interest and success in
extracting entities and relationships
from text sources. The envisioned
path toward automatically building
and growing comprehensive knowledge bases with expressive search and
ranking capabilities may take a long
time to mature. In any case, it is an
exciting and rewarding challenge that
should appeal to and benefit from innovation in several research communities, most notably DB and IR.
Acknowledgments
Our work on knowledge harvesting is
supported by the Excellence Cluster
“Multimodal Computing and Interaction” (www.mmci.uni-saarland.de), funded by the German Science Foundation.
References
1. Agichtein, E. Scaling information extraction to
large document collections. IEEE Data Engineering
Bulletin 28, 4 (Dec. 2005), 3–10.
2. Amer-Yahia, S. and Lalmas, M. XML search:
Languages, INEX, and scoring. ACM SIGMOD Record
35, 4 (Mar. 2006), 16–23.
3. Anyanwu, K., Maduko, A., and Sheth, A. SPARQ2L:
Towards support for subgraph extraction queries
in RDF databases. In Proceedings of the 16th
International Conference on World Wide Web (Banff,
Canada, May 8–12). ACM Press, New York, 2007,
797–806.
4. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J.,
Cyganiak, R., and Ives, Z. DBpedia: A nucleus for
a Web of open data. In Proceedings of the Sixth
International Semantic Web Conference (Pusan,
Korea, Nov. 11–15). Springer, Berlin/Heidelberg, 2007,
722–735.
5. Banko, M., Cafarella, M., Soderland, S., Broadhead, M.,
and Etzioni, O. Open information extraction from the
Web. In Proceedings of the 20th International Joint
Conference on Artificial Intelligence (Hyderabad,
India, Jan. 6–12, 2007), 2670–2676; www.ijcai.org.
6. Cafarella, M., Re, C., Suciu, D., and Etzioni, O.
Structured querying of Web text data: A technical
challenge. In Proceedings of the Third Biennial
Conference on Innovative Data Systems Research
(Asilomar, CA, Jan. 7–10, 2007), 225–234;
www.crdrdb.org.
7. Chakrabarti, S. Dynamic personalized PageRank in
entity-relation graphs. In Proceedings of the 16th
International Conference on World Wide Web (Banff,
Canada, May 8–12). ACM Press, New York, 2007,
571–580.
8. Cheng, T., Yan, X., and Chang, K. EntityRank:
Searching entities directly and holistically. In
Proceedings of the 33rd International Conference on
Very Large Data Bases (Vienna, Austria, Sept. 23–27).
ACM Press, New York, 2007, 387–398.
9. Cohen, W. Integration of heterogeneous databases
without common domains using queries based
on textual similarity. In Proceedings of the ACM
SIGMOD International Conference on Management
of Data (Seattle, June 2–4). ACM Press, New York,
1998, 201–212.
10. Cunningham, H. An introduction to information
extraction. In Encyclopedia of Language and
Linguistics, Second Edition, K. Brown et al., Eds.,
Elsevier, Amsterdam, 2005.
11. DeRose, P., Shen, W., Chen, F., Doan, A.-H., and
Ramakrishnan, R. Building structured Web
community portals: A top-down, compositional, and
incremental approach. In Proceedings of the 33rd
International Conference on Very Large Data Bases
(Vienna, Austria, Sept. 23–27). ACM Press, New York,
2007, 399–410.
12. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M.,
Shaked, T., Soderland, S., Weld, D., and Yates, A.
Unsupervised named-entity extraction from the Web:
An experimental study. Artificial Intelligence 165, 1
(June 2005), 91–134.
13. Fuhr, N. and Rölleke, T. A probabilistic relational
algebra for the integration of information retrieval
and database systems. ACM Transactions on
Information Systems 15, 1 (Jan. 1997), 32–66.
14. Fuhr, N. Probabilistic Datalog: A logic for powerful
retrieval methods. In Proceedings of the 18th Annual
International ACM SIGIR Conference on Research
and Development in Information Retrieval (Seattle,
July 9–13). ACM Press, New York, 1995, 282–290.
15. Getoor, L. and Taskar, B., Eds. Introduction to
Statistical Relational Learning. MIT Press,
Cambridge, MA, 2007.
16. Ilyas, I., Beskales, G., and Soliman, M. A survey
of top-k query-processing techniques in relational
database systems. ACM Computing Surveys 40, 1
(Oct. 2008), 1–58.
17. Ipeirotis, P., Agichtein, E., Jain, P., and Gravano, L.
Towards a query optimizer for text-centric tasks. ACM
Transactions on Database Systems 32, 4 (Nov. 2007).
18. Kasneci, G., Suchanek, F., Ifrim, G., Ramanath,
M., and Weikum, G. NAGA: Searching and ranking
knowledge. In Proceedings of the 24th International
Conference on Data Engineering (Cancun, Mexico,
Apr. 7–12). IEEE Computer Society, Washington, D.C.,
2008, 953–962.
19. Navarro, G. and Baeza-Yates, R. Proximal nodes: A
model to query document databases by content and
structure. ACM Transactions on Information Systems
15, 4 (1997), 400–435.
20. Nie, Z., Ma, Y., Shi, S., Wen, J.-R., and Ma, W.-Y.
Web object retrieval. In Proceedings of the 16th
International Conference on World Wide Web (Banff,
Canada, May 8–12). ACM Press, New York, 2007,
81–90.
21. Sarawagi, S. Information extraction. Foundations and
Trends in Databases 1, 3 (2008), 261–377.
22. Shen, W., Doan, A.-H., Naughton, J., and
Ramakrishnan, R. Declarative information extraction
using Datalog with embedded extraction predicates.
In Proceedings of the 33rd International Conference
on Very Large Data Bases (Vienna, Austria, Sept.
23–27). ACM Press, New York, 2007, 1033–1044.
23. Suchanek, F., Kasneci, G., and Weikum, G. YAGO: A
large ontology from Wikipedia and WordNet. Journal
of Web Semantics 6, 3 (2008), 203–217.
24. Suchanek, F., Kasneci, G., and Weikum, G. YAGO: A
core of semantic knowledge. In Proceedings of the
16th International Conference on World Wide Web
(Banff, Canada, May 8–12). ACM Press, New York,
2007, 697–706.
25. Theobald, M., Bast, H., Majumdar, D., Schenkel, R.,
and Weikum, G. TopX: Efficient and versatile top-k
query processing for semistructured data. VLDB
Journal 17, 1 (Jan. 2008), 81–115.
26. Wu, F. and Weld, D. Automatically refining the
Wikipedia infobox ontology. In Proceedings of the
17th International Conference on World Wide Web
(Beijing, Apr. 21–25). ACM Press, New York, 2008,
635–644.
27. Wu, F. and Weld, D. Autonomously semantifying
Wikipedia. In Proceedings of the 16th ACM
Conference on Information and Knowledge
Management (Lisbon, Nov. 6–10). ACM Press, New
York, 2007, 41–50.
28. Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., and Ma,
W.-Y. Simultaneous record detection and attribute
labeling in Web data extraction. In Proceedings of
the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (Philadelphia,
PA, Aug. 20–23). ACM Press, New York, 2006,
494–503.
Gerhard Weikum (weikum@mpi-inf.mpg.de) is a
scientific director leading the research group on
databases and information systems at the Max Planck
Institute for Informatics, Saarbruecken, Germany.
Gjergji Kasneci (kasneci@mpi-inf.mpg.de) is a doctoral
student at the Max Planck Institute for Informatics,
Saarbruecken, Germany.
Maya Ramanath (ramanath@mpi-sb.mpg.de) is a
researcher at the Max Planck Institute for Informatics,
Saarbruecken, Germany.
Fabian Suchanek (suchanek@mpi-inf.mpg.de) is a
researcher at the Max Planck Institute for Informatics,
Saarbruecken, Germany.
© 2009 ACM 0001-0782/09/0400 $5.00
Communications of the ACM | April 2009 | Vol. 52 | No. 4