Mining the web to create minority language corpora
Source: Conference on Information and Knowledge Management
Proceedings of the tenth international conference on Information and knowledge management
Atlanta, Georgia, USA
Session: Corpus Linguistics
Pages: 279-286
Year of Publication: 2001
ISBN: 1-58113-436-3
Authors
Rayid Ghani  Carnegie Mellon Univ., and Accenture Technology Labs
Rosie Jones  Carnegie Mellon University, Pittsburgh, PA
Dunja Mladenić  J. Stefan Inst., Slovenia and Carnegie Mellon Univ.
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 39,   Citation Count: 4
DOI: http://doi.acm.org/10.1145/502585.502633

ABSTRACT

The Web is a valuable source of language-specific resources, but collecting, organizing, and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries to collect documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled as relevant or irrelevant by an automatic language classifier, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion and exclusion terms that help retrieve documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also describe experiments that use a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Finally, we present experiments applying the same approach to multiple languages, showing that it generalizes to a variety of languages.
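
The odds-ratio query generation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes a common smoothed log odds ratio over document frequencies. The function names (`odds_ratio_scores`, `build_query`), the whitespace tokenization, the smoothing constant, and the `+term -term` query syntax are all assumptions made for illustration.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Return a smoothed log odds ratio for every word seen in either set.

    Assumed definition (not taken from the paper):
    log[ P(w|rel) * (1 - P(w|irr)) / ((1 - P(w|rel)) * P(w|irr)) ],
    where P(w|class) is the smoothed fraction of documents containing w.
    """
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
    # Document frequency: count each word at most once per document.
    rel_df = Counter(w for d in relevant_docs for w in set(d.lower().split()))
    irr_df = Counter(w for d in irrelevant_docs for w in set(d.lower().split()))
    scores = {}
    for word in set(rel_df) | set(irr_df):
        # Smoothing keeps both probabilities strictly between 0 and 1.
        p_rel = (rel_df[word] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr_df[word] + smoothing) / (n_irr + 2 * smoothing)
        scores[word] = math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant_docs, irrelevant_docs, k=1):
    """Build a query with k inclusion terms and k exclusion terms."""
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    # Highest-scoring words become inclusion terms (ties broken alphabetically).
    include = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    # Lowest-scoring words become exclusion terms (ties broken alphabetically).
    exclude = sorted(scores, key=lambda w: (scores[w], w))[:k]
    # The "+word -word" syntax is a hypothetical search-engine convention.
    return " ".join(["+" + w for w in include] + ["-" + w for w in exclude])
```

In a feedback loop of the kind the abstract describes, each query's results would be labeled by the language classifier, appended to the relevant or irrelevant set, and the scores recomputed before the next query is generated.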


