Language identification in web pages
Source: Symposium on Applied Computing
Proceedings of the 2005 ACM Symposium on Applied Computing
Santa Fe, New Mexico
Session: Document engineering (DE)
Pages: 764-768
Year of Publication: 2005
ISBN: 1-58113-964-0
Authors
Bruno Martins, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
Mário J. Silva, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
Sponsor
SIGAPP: ACM Special Interest Group on Applied Computing
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1066677.1066852

ABSTRACT

This paper discusses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, often presenting harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Both fast and robust, the software has been in use for the past two years, as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating different languages on Web pages.
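
The abstract does not spell out which n-gram algorithm is used, so the following is only a minimal sketch of one well-known technique of this kind: the rank-order n-gram profile method of Cavnar and Trenkle (1994) with its "out-of-place" distance. The function names, the profile size, the word-padding character, and the toy training samples are all assumptions made for illustration; the paper's own heuristics and its new similarity measure are not reproduced here.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the character n-grams (n = 1..max_n) of a text by frequency.

    Assumption: words are split on whitespace and padded with "_" so that
    word-boundary n-grams are captured; top_k = 300 is an illustrative size.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the ranking is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


if __name__ == "__main__":
    # Toy training samples (hypothetical); real profiles need far more text.
    samples = {
        "en": "the quick brown fox jumps over the lazy dog",
        "pt": "a rápida raposa castanha salta sobre o cão preguiçoso",
    }
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    print(guess_language("o cão salta sobre a raposa", profiles))
```

On Web pages, the text passed to such a classifier would first have to be extracted from the markup, and short or mixed-language pages are where a plain rank-order distance is most likely to fail, which is presumably where the heuristics mentioned in the abstract come in.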


