Name-ethnicity classification from open sources
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Research track papers
Pages 49-58
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Anurag Ambekar, Stony Brook University, Stony Brook, NY, USA
Charles Ward, Stony Brook University, Stony Brook, NY, USA
Jahangir Mohammed, Stony Brook University, Stony Brook, NY, USA
Swapna Male, Stony Brook University, Stony Brook, NY, USA
Steven Skiena, Stony Brook University, Stony Brook, NY, USA
ABSTRACT
The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with accuracy on individual groups comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.