Mining the web to create minority language corpora
Source
Conference on Information and Knowledge Management
Proceedings of the tenth international conference on Information and knowledge management
Atlanta, Georgia, USA
Session: Corpus Linguistics
Pages: 279-286
Year of Publication: 2001
ISBN: 1-58113-436-3
Authors
Rayid Ghani, Carnegie Mellon Univ., and Accenture Technology Labs
Rosie Jones, Carnegie Mellon University, Pittsburgh, PA
Dunja Mladenić, J. Stefan Inst., Slovenia and Carnegie Mellon Univ.
ABSTRACT
The Web is a valuable source of language-specific resources, but collecting, organizing, and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also describe experiments that start from a handful of words elicited from a user instead of initial documents, and show that the methods perform similarly. Finally, we present experiments applying the same approach to multiple languages, showing that it generalizes to a variety of languages.
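To make the odds-ratio idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes documents are plain strings split on whitespace, that term probabilities are estimated from document frequency with additive smoothing (the `smoothing` value is an arbitrary choice), and that the search engine accepts a hypothetical `+term` / `-term` syntax for inclusion and exclusion. The function names `odds_ratio_scores` and `build_query` are illustrative only.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Score each term by the log odds ratio of appearing in a relevant
    (target-language) document versus an irrelevant one.

    Probabilities are smoothed document-frequency estimates; the smoothing
    scheme is an assumption of this sketch.
    """
    def doc_freq(docs):
        counts = Counter()
        for doc in docs:
            # Count each term at most once per document.
            counts.update(set(doc.lower().split()))
        return counts

    pos_df = doc_freq(relevant_docs)
    neg_df = doc_freq(irrelevant_docs)
    n_pos, n_neg = len(relevant_docs), len(irrelevant_docs)

    scores = {}
    for term in set(pos_df) | set(neg_df):
        p_pos = (pos_df[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[term] + smoothing) / (n_neg + 2 * smoothing)
        scores[term] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def build_query(relevant_docs, irrelevant_docs, length=2):
    """Build a query from the `length` highest-scoring terms (inclusion)
    and the `length` lowest-scoring terms (exclusion).

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    include = sorted(scores, key=lambda t: (-scores[t], t))[:length]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:length]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])
```

In a feedback loop of the kind the abstract describes, `relevant_docs` and `irrelevant_docs` would be the documents acquired so far, as labeled by the language classifier, and `build_query` would be called again after each retrieval round with the enlarged sets.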