ABSTRACT
Version detection is critical in many application scenarios, including software clone identification, Web page ranking, plagiarism detection, and peer-to-peer searching. A natural and commonly used approach to version detection analyzes the similarity between files. Most techniques proposed so far apply hard thresholds to similarity measures. However, defining a threshold value is problematic for several reasons: in particular, (i) the appropriate threshold differs from one similarity function to another, and (ii) the value is not semantically meaningful to the user. To overcome this problem, we propose a version detection mechanism for XML documents based on Naïve Bayesian classifiers. Our approach thus turns the detection problem into a classification problem. In this paper, we present the results of experiments on synthetic data showing that our approach produces very good results in terms of both recall and precision.
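To make the shift from thresholding to classification concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes each pair of documents has already been reduced to a vector of similarity scores, and it assumes a Gaussian Naïve Bayes model over those scores. The two features (content and structure similarity), the class labels, and the training values are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(samples, labels):
    """Estimate a prior and per-feature Gaussian parameters for each class."""
    model = {}
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        prior = len(rows) / len(samples)
        # One (mean, variance) pair per similarity feature; the small floor
        # keeps the variance strictly positive.
        stats = [(mean(col), max(pvariance(col), 1e-6)) for col in zip(*rows)]
        model[cls] = (prior, stats)
    return model


def log_gaussian(x, mu, var):
    """Log-density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def classify(model, sample):
    """Return the class with the highest log-posterior score for the sample."""
    def score(cls):
        prior, stats = model[cls]
        return math.log(prior) + sum(
            log_gaussian(x, mu, var) for x, (mu, var) in zip(sample, stats)
        )
    return max(model, key=score)


# Hypothetical labeled pairs: (content similarity, structure similarity).
train = [
    (0.95, 0.90), (0.88, 0.93), (0.91, 0.85),  # pairs that are versions
    (0.20, 0.35), (0.30, 0.15), (0.12, 0.25),  # pairs that are unrelated
]
labels = ["version"] * 3 + ["not_version"] * 3

model = fit_gaussian_nb(train, labels)
print(classify(model, (0.90, 0.88)))
print(classify(model, (0.20, 0.20)))
```

In this sketch the user never picks a cutoff per similarity function: the decision boundary is learned from labeled pairs, which is the motivation the abstract gives for replacing hard thresholds.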