YellowPager: a tool for ontology-based mining of service directories from web sources
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Tampere, Finland
SESSION: Demo session
Pages: 458
Year of Publication: 2002
ISBN:1-58113-561-0
Authors
Prashant Choudhari  SUNY Stony Brook, Stony Brook, NY
Hasan Davulcu  SUNY Stony Brook, Stony Brook, NY
Abhishek Joglekar  SUNY Stony Brook, Stony Brook, NY
Akshay More  SUNY Stony Brook, Stony Brook, NY
Saikat Mukherjee  SUNY Stony Brook, Stony Brook, NY
Supriya Patil  SUNY Stony Brook, Stony Brook, NY
I. V. Ramakrishnan  SUNY Stony Brook, Stony Brook, NY
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/564376.564499

ABSTRACT

The web has established itself as the dominant medium for electronic commerce. Realizing that its global reach provides significant market and business opportunities, service providers both large and small are advertising their services on the web. A number of them operate their own web sites promoting their services at length, while others are merely listed in a referral site. Aggregating all of the providers into a queryable service directory makes it easy for customers to locate the one best suited to their needs.

YellowPager is a tool for creating service directories by mining web sources. Service directories created by YellowPager have several merits compared to those generated by existing practices, which typically require participation by service providers (e.g. Verizon's SuperYellowPages.com). Firstly, the information content will be rich. Secondly, since the process is automated and repeatable, the content can always be kept current. Finally, the same process can be readily adapted to different domains.

YellowPager builds service directories by mining the web through a combination of keyword-based search engines, web agents, text classifiers and novel extraction algorithms.

The extraction is driven by a services ontology consisting of a taxonomy of service concepts, their associated attributes (such as names and addresses), and type descriptions for the attributes. In addition, the ontology associates an extractor function with each attribute. Applying the function to a web page identifies all the occurrences of the attribute in that page.

YellowPager's mining algorithm consists of a training step followed by classification and extraction steps. In the training step, a classifier is trained to identify web pages relevant to the service of interest. The classification step searches for the particular service of interest using a keyword-based web search engine and retrieves all the matching web pages. From these pages, the relevant ones are identified using the classifier. The final step is the extraction of the attribute values associated with the service from these pages. Each web page is parsed into a DOM tree and the extractor functions are applied. All of the attributes corresponding to a service provider must then be aggregated correctly. This can pose difficulties, especially when a page contains multiple service providers. Using a novel concept of scoring and conflict resolution to prevent erroneous associations of attributes with service provider entities in the page, the algorithm aggregates all the attribute occurrences correctly. The extractor function may not be complete, in the sense that it cannot always identify all the attributes in a page. By exploiting the regularity of the sequence in which attributes occur in referral pages, the mining algorithm automatically learns generalized patterns to locate attributes that the extractor function misses. The distinguishing aspects of YellowPager's extraction algorithm are: (i) it is unsupervised, and (ii) the attribute values in the pages are extracted independently of any page-specific relationships that may exist among the markup tags.

YellowPager has been used by a large pet food producer to build a directory of veterinarian service providers in the United States. The resulting database was found to be much larger and richer than those found in Vetquest, Vetworld, and the Super Yellow Pages.

YellowPager is implemented in Java and is interfaced to Rainbow, a library utility in C that is used for classification. The tool will demonstrate the creation of a service directory for any service domain by mining web sources.
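
The idea of an ontology that pairs each attribute with an extractor function can be illustrated with a minimal sketch. This is not YellowPager's code (the tool itself is written in Java and works on DOM trees); the names `ServiceConcept`, `regex_extractor`, and `VETERINARIAN`, the regular expressions, and the sample page text are all assumptions made up for illustration, and plain-text regex matching stands in for whatever the real extractor functions do.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# An extractor takes page text and returns every occurrence of one attribute.
Extractor = Callable[[str], List[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build a simple extractor from a regular expression (illustrative only)."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


@dataclass
class ServiceConcept:
    """Hypothetical ontology node: a service concept and its attribute extractors."""
    name: str
    extractors: Dict[str, Extractor]

    def extract(self, page_text: str) -> Dict[str, List[str]]:
        # Apply each attribute's extractor function to the page.
        return {attr: fn(page_text) for attr, fn in self.extractors.items()}


# Assumed attributes and patterns; a real ontology would carry type descriptions too.
VETERINARIAN = ServiceConcept(
    name="veterinarian",
    extractors={
        "phone": regex_extractor(r"\(\d{3}\) \d{3}-\d{4}"),
        "zip_code": regex_extractor(r"\b\d{5}\b"),
    },
)

# Made-up sample text standing in for a retrieved web page.
page = "Example Animal Clinic, 12 Main St, Stony Brook, NY 11790. Call (631) 555-0100."
result = VETERINARIAN.extract(page)
print(result)
```

Keeping the extractors as plain functions keyed by attribute name means a new service domain only needs a new `ServiceConcept` instance. The sketch leaves out the harder steps the abstract describes: associating attributes with the right provider when a page lists several, and learning patterns for attributes the extractors miss.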
