[Figure callout] The user types in “commercio electronico” to search for BI information about e-commerce in Spanish-speaking regions.

describe a framework that addresses
some of the needs for Web searching in a multilingual world. As outlined in Figure 1, the framework
consists of domain collections,
meta-search, statistical language
processing, and Web-page summarization, categorization, and visualization.
Because regions and languages differ,
any prospective search-engine developer
must conduct a careful domain analysis
before building a Web portal in a particular
language. To ensure comprehensive
coverage, the analysis should review
existing Web portals and technologies, including the characteristics of
the language, and select an area or
theme for which significant Web
resources in the language have been
developed. The review should cover
regional search engines, government
and business Web sites, and news
Web sites to select the relevant Web
content needed to build a domain-specific collection or to support meta-search.
Important keywords and URLs relevant to the chosen domain are gathered as seed queries or hyperlinks to
build the collection.
Faced with a large user base and
growing Web content, many non-English Web search engines and portals struggle to organize their
content in a way that supports convenient browsing and searching. For
example, Sina.com.cn includes more
than 700 hyperlinks on its home
page, each annotated with long textual descriptions in a small font, making browsing difficult, especially for inexperienced Web users. Pre- and
post-retrieval analysis is thus needed to alleviate information overload.
Modules supporting such analysis include encoding
conversion, summarization, categorization, and visualization. Encoding conversion is necessary when people in different regions and
countries use different written versions of the same language.
For example, the traditional and simplified versions of
Chinese differ enormously in written formats, leading
to two different input formats in information-retrieval
systems; hence, they require encoding conversion for
searching across the two language versions. Web-page
summarization uses linguistic and heuristic techniques
to extract key sentences from the page to represent a
summary of the article [8].

Figure 2. Screenshots from SBizPort. Callouts: retrieved pages are categorized into folders labeled with key phrases; click to summarize in 3 to 5 sentences; retrieved pages' titles and abstracts are listed; the summary is listed on the left, while the original page is on the right; the SOM visualizer categorizes about 40 Web pages onto two regions and displays hyperlinks on the right.

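As a rough illustration of extracting key sentences, the following Python sketch scores each sentence by the average frequency of its content words, gives the lead sentence a small bonus, and keeps the top few in their original order. The scoring heuristic, the stop-word list, and the names `split_sentences` and `summarize` are assumptions made for illustration; they are not taken from the technique reported in [8].

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real portal would need one per language.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on", "it"}


def content_words(text):
    """Lowercase the text and return its word tokens minus stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=3):
    """Return up to max_sentences key sentences, kept in original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(index):
        tokens = content_words(sentences[index])
        if not tokens:
            return 0.0
        average_frequency = sum(freq[t] for t in tokens) / len(tokens)
        # Assumed heuristic: the first sentence of a page often states its topic.
        position_bonus = 1.5 if index == 0 else 1.0
        return average_frequency * position_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore reading order
    return [sentences[i] for i in chosen]
```

Calling `summarize(page_text, max_sentences=5)` would return at most five sentences. The sketch assumes languages with whitespace-delimited words and Western sentence punctuation; other languages would need a different tokenizer and sentence splitter.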
Categorization helps organize search results into groups that are easier to understand. To
assist the categorization process, lexicons built by a
statistics-based mutual-information approach can
provide meaningful phrases in different languages. A
neural-network approach called “Kohonen self-organizing map” can be used to categorize and visualize
Web pages, helping users navigate on a 2D jigsaw
map to identify the set of similar pages or find relevant
pages.
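One common way to realize a statistics-based mutual-information approach is to score adjacent word pairs by pointwise mutual information (PMI), so that pairs occurring together more often than chance predicts surface as candidate phrases. The Python sketch below does this for pre-tokenized documents. The exact formula used to build the portals' lexicons is not given here, so PMI is a stand-in, and the function name `pmi_phrases` and the `min_count` threshold are assumptions for illustration.

```python
import math
from collections import Counter


def pmi_phrases(token_lists, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    token_lists: iterable of documents, each a list of word tokens.
    min_count:   assumed cutoff; pairs seen fewer times are skipped
                 because their probability estimates are unreliable.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        # PMI = log2( P(left, right) / (P(left) * P(right)) )
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The top-ranked pairs could then serve as folder labels for categorized results. Text in languages written without spaces between words, such as Chinese, would first need to be segmented into tokens.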
Based on the framework, three prototype search
portals—in Chinese, Spanish, and Arabic—were
developed [3, 6]. The Chinese Web portal (CBizPort)
helps users search and browse for business intelligence
(BI) in mainland China, Hong Kong, and Taiwan.