ABSTRACT
This article introduces a methodology for automatically organizing document collections into thematic categories for Personal Information Management (PIM) through efficient, privacy-preserving collaborative sharing of machine learning models. Our objective is to combine multiple independently learned models from several users into an ensemble-based decision model that takes the knowledge of multiple users into account in a decentralized manner, for example, in a peer-to-peer overlay network. The corresponding supervised (classification) and unsupervised (clustering) methods achieve high accuracy by restrictively leaving out uncertain documents rather than assigning them with low confidence to inappropriate topics or clusters. We introduce a formal probabilistic model for the resulting ensemble-based meta methods and explain how it can be used for constructing estimators and for goal-oriented tuning. Comprehensive evaluation results on different reference data sets illustrate the viability of our approach.
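The restrictive ensemble idea can be illustrated with a minimal sketch. It assumes each user's model returns a signed confidence score for a document and that the scores are combined by simple averaging. The abstract does not specify the combination rule, so the averaging, the `threshold` parameter, and the toy `keyword_model` helper are illustrative assumptions rather than the article's actual method.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a user's model maps a document to a signed confidence
# score (positive = belongs to the topic, negative = does not).
Model = Callable[[str], float]


def restrictive_meta_classify(
    doc: str, models: Sequence[Model], threshold: float
) -> Optional[int]:
    """Return +1 or -1 if the averaged score is confident enough, else None."""
    if not models:
        return None
    avg = sum(m(doc) for m in models) / len(models)
    if avg > threshold:
        return 1
    if avg < -threshold:
        return -1
    return None  # uncertain document is left out instead of being forced into a class


def keyword_model(keyword: str, weight: float) -> Model:
    """Toy stand-in for one peer's learned classifier (illustrative only)."""
    return lambda doc: weight if keyword in doc.lower() else -weight


# Three "peers" share their models; only the models are exchanged, not the documents.
models = [
    keyword_model("football", 1.0),
    keyword_model("goal", 0.5),
    keyword_model("league", 0.8),
]

docs = [
    "Football league goal highlights",
    "The goal of this budget report",
    "Quarterly budget report",
]

decisions = [restrictive_meta_classify(d, models, threshold=0.5) for d in docs]
```

Raising `threshold` makes the ensemble abstain on more documents, trading coverage for accuracy on the documents it does label.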