ABSTRACT
This paper discusses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, which often present harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Both fast and robust, the software has been in use for the past two years as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating between languages on Web pages.
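To make the "well-known n-gram based algorithm" concrete, the following is a minimal sketch of one classic variant: rank-ordered character n-gram profiles compared with an "out-of-place" distance. It is an illustration under stated assumptions, not the software described above. The function names (`ngram_profile`, `out_of_place`, `guess_language`), the profile size of 300, and the n-gram lengths 1 to 3 are all hypothetical choices, and the distance shown is the standard rank-difference measure, not the new similarity measure or the heuristics the abstract mentions.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank.

    Assumption: tokens are split on whitespace and padded with one space
    on each side, so word boundaries contribute to the n-grams.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f" {token} "
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language
    profile are charged a fixed maximum penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def guess_language(text, lang_profiles, top_k=300):
    """Return the key of the language profile closest to the text."""
    doc = ngram_profile(text, top_k=top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )
```

In use, one would build `lang_profiles` once from a training sample per language (for example `{"en": ngram_profile(english_text), "pt": ngram_profile(portuguese_text)}`) and then call `guess_language(page_text, lang_profiles)` on the text extracted from each page. Web pages with little running text yield short, noisy profiles, which is presumably where additional heuristics become useful.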