"I know what you did last summer": query logs and user privacy
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM conference on information and knowledge management
Lisbon, Portugal
Poster session
Pages: 909-914
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Rosie Jones, Yahoo! Research, Sunnyvale, CA
Ravi Kumar, Yahoo! Research, Sunnyvale, CA
Bo Pang, Yahoo! Research, Sunnyvale, CA
Andrew Tomkins, Yahoo! Research, Sunnyvale, CA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 118,   Citation Count: 5
DOI: http://doi.acm.org/10.1145/1321440.1321573

ABSTRACT

We investigate the subtle cues to user identity that may be exploited in attacks on the privacy of users in web search query logs. We study the application of simple classifiers to map a sequence of queries into the gender, age, and location of the user issuing the queries. We then show how these classifiers may be carefully combined at multiple granularities to map a sequence of queries into a set of candidate users that is 300-600 times smaller than random chance would allow. We show that this approach remains accurate even after removing personally identifiable information such as names/numbers or limiting the size of the query log.
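The narrowing step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `User` record, its attribute names, the toy population, and the helpers `narrow_candidates` and `reduction_factor` are all hypothetical, and the `predicted` values stand in for the outputs of gender, age, and location classifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    gender: str
    age_bucket: str
    zip3: str  # hypothetical coarse location: first three ZIP digits


def narrow_candidates(users, predicted):
    """Keep only users whose attributes match every predicted value."""
    return [
        u for u in users
        if all(getattr(u, attr) == value for attr, value in predicted.items())
    ]


def reduction_factor(users, candidates):
    """How many times smaller the candidate set is than the full pool."""
    if not candidates:
        raise ValueError("no candidates left; a prediction may be wrong")
    return len(users) / len(candidates)


# Toy population: 2 genders x 3 age buckets x 4 locations, 5 users each.
population = [
    User(f"u-{g}-{a}-{z}-{i}", g, a, z)
    for g in ("f", "m")
    for a in ("18-24", "25-44", "45+")
    for z in ("940", "100", "606", "331")
    for i in range(5)
]

# Assumed classifier outputs for one sequence of queries.
predicted = {"gender": "f", "age_bucket": "25-44", "zip3": "940"}
candidates = narrow_candidates(population, predicted)
factor = reduction_factor(population, candidates)
```

In this toy setting each attribute alone removes only part of the pool, while the three together leave a small fraction of it. Exact-match filtering is a simplification; it discards the true user whenever any single prediction is wrong, which is one reason a real attack would need to combine classifiers more carefully.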

We also present a new attack in which a real-world acquaintance of a user attempts to identify that user in a large query log, using personal information. We show that combinations of small pieces of information about terms a user would probably search for can be highly effective in identifying the sessions of that user.
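As a rough illustration of why combinations of small facts are effective, the following sketch counts how many terms known to an acquaintance appear in each session. It is an assumption-laden toy, not the attack evaluated in the paper: the session data, the term list, and the `matching_sessions` helper are invented for the example.

```python
def matching_sessions(sessions, known_terms, min_hits=2):
    """Return {session_id: hit_count} for sessions whose queries
    mention at least min_hits of the known terms."""
    known = {t.lower() for t in known_terms}
    hits = {}
    for session_id, queries in sessions.items():
        # Collect every distinct word used anywhere in the session.
        words = {w for q in queries for w in q.lower().split()}
        count = len(known & words)
        if count >= min_hits:
            hits[session_id] = count
    return hits


# Hypothetical scrubbed log: session ids mapped to query strings.
sessions = {
    "s1": ["cheap flights lisbon", "sunnyvale pottery class"],
    "s2": ["pottery wheel reviews", "weather"],
    "s3": ["lisbon hotels", "greyhound rescue sunnyvale"],
}

# Terms the acquaintance expects the target to search for.
terms = ["pottery", "sunnyvale", "lisbon"]
hits = matching_sessions(sessions, terms, min_hits=3)
```

Each term on its own matches more than one session here, but requiring all three leaves a single session, which mirrors the abstract's point that individually weak cues become identifying in combination.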

We conclude that known schemes to release even heavily scrubbed query logs that contain session information have significant privacy risks.

