ABSTRACT
In this paper, we present the Infocious Web search engine [23]. Our goal in creating Infocious is to improve the way people find information on the Web by resolving ambiguities present in natural language text. We achieve this by performing linguistic analysis on the content of the Web pages we index, a departure from existing Web search engines, which return results based mainly on keyword matching. This additional linguistic processing gives Infocious two main advantages. First, Infocious gains a deeper understanding of the content of Web pages, so it can better match users' queries with indexed documents and thereby improve the relevance of the returned results. Second, based on its linguistic processing, Infocious can organize and present the results to the user in more intuitive ways. We describe the linguistic processing technologies incorporated in Infocious and how they help users find information on the Web more efficiently. We discuss the components of the Infocious architecture and how each of them benefits from the added linguistic processing. Finally, we experimentally evaluate the performance of a component that uses linguistic information to categorize Web pages.
REFERENCES
[2] Altavista Inc. http://www.altavista.com.
[3] Ask Jeeves Inc. http://www.ask.com.
[4] Autonomy Inc. http://www.autonomy.com.
[6] Brainboost. http://www.brainboost.com.
[10] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Proceedings of WWW-96, 6th International Conference on the World Wide Web, San Jose, US, 1996.
[14] J. Cho and A. Ntoulas. Effective change detection using sampling. In Proceedings of the Twenty-eighth International Conference on Very Large Databases (VLDB), August 2002.
[16] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407, September 1990.
[17] The Open Directory Project. http://www.dmoz.org.
[18] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[19] S. T. Dumais. Latent semantic indexing (LSI) and TREC-2. In The Second Text Retrieval Conference (TREC-2), 1994.
[20] Excite Inc. http://www.excite.com.
[21] Google Incorporated. http://www.google.com.
[23] Infocious Incorporated. http://www.infocious.com.
[24] Inquira Inc. http://www.inquira.com.
[25] Inxight Inc. http://www.inxight.com.
[26] iPhrase Inc. http://www.iphrase.com.
[28] B. Katz, J. Lin, D. Loreto, W. Hildebrandt, M. Bilotti, S. Felshin, A. Fernandes, G. Marton, and F. Mora. Integrating web-based and corpus-based techniques for question answering, November 2003.
[29] C. Li, J.-R. Wen, and H. Li. Text classification using stochastic keyword generation. In Twentieth International Conference on Machine Learning (ICML), pages 464--471, 2003.
[30] Lycos Inc. http://www.lycos.com.
[32] O. A. McBryan. GENVL and WWWW: Tools for taming the web. In First International Conference on the World Wide Web, CERN, Geneva, Switzerland, May 1994.
[33] R. Mihalcea. Bootstrapping large sense tagged corpora. In Proceedings of the 3rd International Conference on Language Resources and Evaluations (LREC), Las Palmas, Spain, May 2002.
[34] MSNSearch. http://www.msnsearch.com.
[36] A. Ntoulas, P. Zerfos, and J. Cho. Downloading hidden web content. Technical report, UCLA, 2004. Available at http://oak.cs.ucla.edu/~ntoulas/pubs/ntoulas_hidden_web_extended.pdf.
[37] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Database Group, Computer Science Department, Stanford University, November 1999. http://dbpubs.stanford.edu/pub/1999-66.
[38] A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing, pages 133--142, 1996.
[40] START natural language question answering system. http://www.ai.mit.edu/projects/infolab/.
[41] Teoma. http://www.teoma.com.
[44] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13:260--267, 1967.
[46] Yahoo! Inc. http://www.yahoo.com.
INDEX TERMS
Primary Classification:
H. Information Systems
  H.3 INFORMATION STORAGE AND RETRIEVAL
    H.3.1 Content Analysis and Indexing
Additional Classification:
C. Computer Systems Organization
  C.3 SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS
H. Information Systems
  H.3 INFORMATION STORAGE AND RETRIEVAL
    H.3.3 Information Search and Retrieval
    H.3.7 Digital Libraries
Keywords:
concept extraction, crawling, indexing, information retrieval, language analysis, linguistic analysis of web text, natural language processing, part-of-speech tagging, phrase identification, web search engine, web searching, word sense disambiguation