ABSTRACT
If Kolmogorov complexity [25] measures the information in one object, and information distance measures the information shared by two objects, how do we measure the information shared by many objects? This paper provides an initial pragmatic study of this fundamental data mining question. First, Em(x1, x2, ..., xn) is defined as the minimum amount of thermodynamic energy needed to convert from any xi to any xj. With this definition, several theoretical problems are solved. Second, our newly proposed theory is applied to select a comprehensive review and a specialized review from many reviews: (1) core feature words, expanded words, and dependent words are extracted; (2) comprehensive and specialized reviews are selected according to the information among them. This method of selecting a single review can be extended to select multiple reviews as well. Finally, experiments show that this comprehensive and specialized review mining method, based on our new theory, can do the job efficiently.
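Kolmogorov complexity is not computable, so any practical use of a quantity such as Em has to rely on an approximation. The following is a minimal sketch, not the method of the paper: it assumes that a general-purpose compressor (here zlib) can stand in for Kolmogorov complexity, and it assumes the common heuristic K(y | x) ≈ C(xy) - C(x), where C is compressed length. The function names csize, conditional_estimates and em_estimate are hypothetical and chosen only for illustration.

```python
import zlib
from typing import List


def csize(data: bytes) -> int:
    """Compressed length in bytes; a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def conditional_estimates(items: List[bytes]) -> List[int]:
    """For each item x_i, estimate the cost of producing all items given x_i.

    Assumption (not from the paper): K(y | x) is approximated by
    C(x + y) - C(x), with y taken to be the concatenation of every item.
    """
    joined = b"".join(items)
    return [max(0, csize(x + joined) - csize(x)) for x in items]


def em_estimate(items: List[bytes]) -> int:
    """Worst-case conversion cost over all starting items (illustrative only)."""
    if not items:
        raise ValueError("at least one item is required")
    return max(conditional_estimates(items))


if __name__ == "__main__":
    reviews = [
        b"The battery lasts all day and the screen is sharp.",
        b"Great screen, but the battery drains quickly under load.",
        b"Shipping was slow; the camera itself works fine.",
    ]
    print(conditional_estimates(reviews))
    print(em_estimate(reviews))
```

Under these assumptions, a set of near-identical objects yields a small estimate, because each object already carries most of what is needed to reproduce the others, while unrelated objects yield a large one. Real compressors add fixed overhead and have limited windows, so the numbers are only meaningful for comparison, not as absolute values.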
REFERENCES
[1] C. Ané and M. Sanderson. Missing the forest for the trees: Phylogenetic compression and its implications for inferring complex evolutionary histories. Systematic Biology, 54(1):146--157, 2005.
[2] T. Arbuckle, A. Balaban, D. Peters, and M. Lawford. Software documents: Comparison and measurement. In The Nineteenth International Conference on Software Engineering and Knowledge Engineering, July 2007.
[3] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88(4):048702, 2002.
[4] C. Bennett, P. Gacs, M. Li, P. Vitányi, and W. Zurek. Information distance. IEEE Transactions on Information Theory, 44(4):1407--1423, July 1998.
[5] C. Bennett, M. Li, and B. Ma. Chain letters and evolutionary histories. Scientific American, 288(6):76--81, June 2003.
[6]
[7]
[8] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker. Shared information and program plagiarism detection. IEEE Transactions on Information Theory, 50(7):1545--1550, July 2004.
[9] A. Chernov, A. Muchnik, A. Romashchenko, A. Shen, and N. Vereshchagin. Upper semi-lattice of binary strings with the relation "x is simple conditional to y". Theoretical Computer Science, 271(1-2):69--95, January 2002. doi:10.1016/S0304-3975(01)00032-9
[10] R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523--1545, 2005.
[11]
[12]
[13] M. C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed dependency parses from phrase structure parses. In The Fifth International Conference on Language Resources and Evaluation (LREC), May 2006.
[14] K. Emanuel, S. Ravela, E. Vivant, and C. Risi. A combined statistical-deterministic approach of hurricane risk assessment. Manuscript, Program in Atmospheres, Oceans, and Climate, MIT, 2005.
[15] M. Gamon, A. Aue, S. C. Oliver, and E. Ringger. Pulse: Mining customer opinions from free text. In International Symposium on Intelligent Data Analysis (IDA), pages 121--132, October 2005.
[16] M. Hayashida and T. Akutsu. Image compression-based approach to measuring the similarity of protein structures. In The 6th Asia-Pacific Bioinformatics Conference, pages 221--230, 2008.
[17]
[18]
[19] S. Kirk and S. Jenkins. Information theory-based software metrics and obfuscation. Journal of Systems and Software, 72:179--186, 2004.
[20]
[21] A. Kraskov, H. Stögbauer, R. Andrzejak, and P. Grassberger. Hierarchical clustering using mutual information. Europhys. Lett., 70(2):278--284, 2005.
[22]
[23] M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149--154, 2001.
[24] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250--3264, 2004.
[25]
[26]
[27]
[28] M. Nykter, N. Price, M. Aldana, S. Ramsey, S. Kauffman, L. Hood, O. Yli-Harja, and I. Shmulevich. Gene expression dynamics in the macrophage exhibit criticality. PNAS, 105(6):1897--1900, 2008.
[29] M. Nykter, N. Price, A. Larjo, T. Aho, S. Kauffman, O. Yli-Harja, and I. Shmulevich. Critical networks exhibit maximal information diversity in structure-dynamics relationships. Physical Review Letters, 100:058702 (1-4), 2008.
[30] H. Otu and K. Sayood. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(6):2122--2130, 2003.
[31] H. Pao and J. Case. Computing entropy for ortholog detection. In International Conference on Computational Intelligence, December 2004.
[32] D. Parry. Use of Kolmogorov distance identification of web page authorship, topic and domain. In Workshop on Open Source Web Inf. Retrieval, 2005.
[33]
[34] S. Rahmati and J. Glasgow. Noise tolerance of universal similarity metric applied to protein contact maps comparison in two dimensions. Manuscript, Queen Univ, 2008.
[35]
[36]
[37]
[38] W. Taha, S. Crosby, and K. Swadi. A new approach to data mining for software design. Manuscript, Rice Univ, 2006.
[39]
[40]
[41]
[42] X. Zhang, Y. Hao, X. Zhu, and M. Li. Information distance from a question to an answer. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007. doi:10.1145/1281192.1281285
[43]