Mining the web for acronyms using the duality of patterns and relations
Abstract
The Web is a rich source of information, but this information is scattered and hidden in the diversity of web pages. Search engines are windows to the web. However, the current search engines, designed to identify pages with specified phrases, have very limited power. For example, they cannot search for phrases related in a particular way (e.g. books and their authors). In this paper we present a solution for identifying a set of inter-related information on the web using the duality concept. Duality problems arise when one tries to identify a pair of inter-related phrases such as (book, author), (name, email) or (acronym, expansion) relations. We propose a solution to this problem that iteratively refines mutually dependent approximations to their identifications. Specifically, we iteratively refine i) pairs of phrases related in a specific way, and ii) the patterns of their occurrences in web pages, i.e. the ways in which the related phrases are marked in the pages. We cast light on the general solution of the duality problems in the web by concentrating on one paradigmatic duality problem, i.e. identifying (acronym, expansion) pairs in terms of the patterns of their occurrences in the web pages. The solution to this problem involves two mutually dependent duality problems: 1) the duality between the related pairs and their patterns, and 2) the duality between the related pairs and the acronym formulation rules.
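
The iterative refinement described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`learn_patterns`, `apply_patterns`, `mine`), the representation of a pattern as a (separator, closer) pair of punctuation strings, the restriction to expansions that precede their acronym, and the sample pages are all assumptions made for illustration. The second duality, between pairs and acronym formulation rules, is reduced here to one fixed rule (the acronym consists of the initial letters of the expansion's words), whereas the abstract describes those rules as being refined as well.

```python
import re


def initials_match(acronym, words):
    """Toy formation rule (an assumption): one word per acronym letter,
    each word starting with the corresponding letter."""
    return len(words) == len(acronym) and all(
        w[0].upper() == c for w, c in zip(words, acronym.upper())
    )


def learn_patterns(pairs, pages):
    """Pairs -> patterns: record how known pairs are marked in the pages."""
    patterns = set()
    for acronym, expansion in pairs:
        # expansion, 1-3 non-word characters, acronym, one closing character
        rx = re.compile(
            re.escape(expansion) + r"(\W{1,3})" + re.escape(acronym) + r"(\W)"
        )
        for page in pages:
            for m in rx.finditer(page):
                patterns.add((m.group(1), m.group(2)))
    return patterns


def apply_patterns(patterns, pages):
    """Patterns -> pairs: use known markings to find candidate pairs."""
    pairs = set()
    for separator, closer in patterns:
        rx = re.compile(
            re.escape(separator) + r"([A-Z]{2,})" + re.escape(closer)
        )
        for page in pages:
            for m in rx.finditer(page):
                acronym = m.group(1)
                # Candidate expansion: the words just before the marking,
                # one word per letter of the acronym.
                words = re.findall(r"\w+", page[: m.start()])[-len(acronym):]
                if initials_match(acronym, words):
                    pairs.add((acronym, " ".join(words)))
    return pairs


def mine(seed_pairs, pages, rounds=5):
    """Alternate between the two steps until no new pairs are found."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        patterns |= learn_patterns(pairs, pages)
        new_pairs = apply_patterns(patterns, pages)
        if new_pairs <= pairs:
            break
        pairs |= new_pairs
    return pairs, patterns


# Hypothetical sample pages, invented for this sketch.
pages = [
    "The World Wide Web (WWW) is large.",
    "Images use Portable Network Graphics (PNG) files.",
    "The Domain Name System [DNS] maps names; "
    "Portable Network Graphics [PNG] is a format.",
]
pairs, patterns = mine({("WWW", "World Wide Web")}, pages)
for acronym, expansion in sorted(pairs):
    print(acronym, "=", expansion)
```

Starting from the single seed pair for WWW, the first round learns the parenthesis marking and uses it to find PNG. The second round learns the square-bracket marking from the new PNG pair, which in turn yields DNS. Neither the bracket pattern nor the DNS pair is reachable from the seed in one step, which is the mutual dependence the duality concept refers to.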