The user types in “commercio electronico” to search for BI information about e-commerce in Spanish-speaking regions.

describe a framework that addresses some of the needs for Web searching in a multilingual world. As outlined in Figure 1, the framework consists of domain collections, meta-search, statistical language processing, and Web-page summarization, categorization, and visualization.

Because regions and languages differ, any prospective search-engine developer must conduct a careful domain analysis before building a Web portal in a particular language. To ensure comprehensive coverage, the analysis should review existing Web portals and technologies, take into account the characteristics of the language, and select an area or theme for which significant Web resources in the language have been developed. The review should cover regional search engines, government and business Web sites, and news Web sites in order to select the relevant Web content needed to build a domain-specific collection or to support meta-search. Important keywords and URLs relevant to the chosen domain are then gathered as seed queries or hyperlinks to build the collection.

Managing a large user base and growing Web content, many non-English Web search engines and portals are challenged to organize their content properly to support convenient browsing and searching. For example, Sina.com.cn includes more than 700 hyperlinks on its home page, each annotated with long textual descriptions in a small font, making browsing difficult, especially for inexperienced Web users. Pre- and post-retrieval analysis is thus needed to alleviate information overload.

Figure 2. Screenshots from SBizPort. Callouts: retrieved pages are categorized into folders labeled with key phrases; a click summarizes a page in three to five sentences; the titles and abstracts of retrieved pages are listed; the summary is shown on the left, with the original page on the right; the SOM visualizer categorizes about 40 Web pages into two regions and displays hyperlinks on the right.

Modules supporting such analysis include encoding conversion, summarization, categorization, and visualization. Encoding conversion is necessary when a language is used by people in multiple regions and countries using different versions of the same language. For example, the traditional and simplified versions of Chinese differ enormously in written formats, leading to two different input formats in information-retrieval systems; hence, they require encoding conversion for searching across the two language versions. Web-page summarization uses linguistic and heuristic techniques to extract key sentences from the page to represent a summary of the article [8].
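As a rough illustration of extracting key sentences, the sketch below scores each sentence by the average frequency of its content words and gives the lead sentence a small positional boost. This is a minimal sketch under stated assumptions, not the summarizer described in [8]: the stop-word list, the 1.5 boost, and the function names are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real portal would need one per language.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on", "it"}


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep word tokens that are not stop words."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=3):
    """Return up to max_sentences key sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for idx, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            continue
        # Average word frequency, so long sentences are not favored by length alone.
        score = sum(freq[w] for w in words) / len(words)
        if idx == 0:
            score *= 1.5  # assumed heuristic: the lead sentence often states the topic
        scored.append((score, idx))
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]


sample = (
    "E-commerce is growing in Spain. Online shops sell goods. "
    "E-commerce sales in Spain rose. The weather was nice."
)
summary = summarize(sample, max_sentences=2)
```

On this made-up sample, the two sentences that share the frequent terms are kept and the off-topic ones are dropped. A production summarizer would add language-specific sentence splitting and further cues such as headings or cue phrases.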

Categorization helps organize search results in different groups that are understood more easily. To assist the categorization process, lexicons built by a statistics-based mutual-information approach can provide meaningful phrases in different languages. A neural-network approach called the “Kohonen self-organizing map” can be used to categorize and visualize Web pages, helping users navigate on a 2D jigsaw map to identify the set of similar pages or find relevant pages.
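To make the idea of a statistics-based lexicon concrete, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), a standard measure of how much more often two words co-occur than chance would predict. The text does not give the exact formula used to build the lexicons, so PMI is an assumption made for this example; the function name, the `min_count` threshold, and the sample tokens are also hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Pairs seen fewer than min_count times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring pairs first: these are the candidate lexicon phrases.
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Made-up, accent-free sample tokens for illustration only.
tokens = (
    "comercio electronico crece en la region y el "
    "comercio electronico en la red crece"
).split()
phrases = pmi_bigrams(tokens)
```

Because the measure depends only on co-occurrence counts, the same routine applies to any language once the text is tokenized. For Chinese, where words are not separated by spaces, the same statistic would have to be computed over character sequences instead.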

Based on the framework, three prototype search portals (in Chinese, Spanish, and Arabic) were developed [3, 6]. The Chinese Web portal (CBizPort) helps users search and browse for business intelligence (BI) in mainland China, Hong Kong, and Taiwan.

References:

http://Sina.com.cn
