Name-ethnicity classification from open sources
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Research track papers
Pages 49-58  
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Anurag Ambekar  Stony Brook University, Stony Brook, NY, USA
Charles Ward  Stony Brook University, Stony Brook, NY, USA
Jahangir Mohammed  Stony Brook University, Stony Brook, NY, USA
Swapna Male  Stony Brook University, Stony Brook, NY, USA
Steven Skiena  Stony Brook University, Stony Brook, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 48,   Downloads (12 Months): 162,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1557019.1557032

ABSTRACT

The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with individual group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.

