XML version detection
Source
Document Engineering archive
Proceedings of the 2007 ACM Symposium on Document Engineering
Winnipeg, Manitoba, Canada
SESSION: XML documents
Pages: 79-88
Year of Publication: 2007
ISBN: 978-1-59593-776-6
Authors
Deise de Brum Saccol  Universidade Federal do Rio Grande do Sul
Nina Edelweiss  Universidade Federal do Rio Grande do Sul
Renata de Matos Galante  Universidade Federal do Rio Grande do Sul
Carlo Zaniolo  University of California
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1284420.1284441

ABSTRACT

The problem of version detection is critical in many important application scenarios, including software clone identification, Web page ranking, plagiarism detection, and peer-to-peer searching. A natural and commonly used approach to version detection relies on analyzing the similarity between files. Most of the techniques proposed so far rely on the use of hard thresholds for similarity measures. However, defining a threshold value is problematic for several reasons: in particular (i) the threshold value is not the same when considering different similarity functions, and (ii) it is not semantically meaningful for the user. To overcome this problem, our work proposes a version detection mechanism for XML documents based on Naïve Bayesian classifiers. Thus, our approach turns the detection problem into a classification problem. In this paper, we present the results of various experiments on synthetic data that show that our approach produces very good results, both in terms of recall and precision measures.
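
To make the idea of replacing a hard similarity threshold with a classifier concrete, here is a minimal sketch in Python. It is not the authors' implementation. The two similarity features (Jaccard overlap of element names and of text tokens), the toy training vectors, the variance-smoothing constant, and all function and class names are assumptions chosen only for illustration; the abstract does not say which similarity functions or training data the paper uses. A small Gaussian Naive Bayes model is written by hand so the example needs only the standard library.

```python
import math
import xml.etree.ElementTree as ET


def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def features(xml_a, xml_b):
    """Two illustrative (assumed) similarity scores for a pair of XML strings."""
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    tags_a = [e.tag for e in root_a.iter()]
    tags_b = [e.tag for e in root_b.iter()]
    words_a = " ".join(root_a.itertext()).split()
    words_b = " ".join(root_b.itertext()).split()
    return [jaccard(tags_a, tags_b), jaccard(words_a, words_b)]


class GaussianNaiveBayes:
    """Hand-written Gaussian Naive Bayes over numeric feature vectors."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            prior = len(rows) / len(X)
            params = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                # Small constant (an assumption) keeps the variance non-zero.
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-3
                params.append((mean, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, params = self.stats[label]
            total = math.log(prior)
            for v, (mean, var) in zip(x, params):
                total += -0.5 * math.log(2 * math.pi * var)
                total -= (v - mean) ** 2 / (2 * var)
            return total

        return max(self.stats, key=log_posterior)


# Toy training set (made up): feature vectors labelled True when the pair
# is a version of the same document and False otherwise.
X_train = [[1.0, 0.8], [0.9, 0.7], [1.0, 0.6],
           [0.2, 0.0], [0.3, 0.1], [0.1, 0.05]]
y_train = [True, True, True, False, False, False]
model = GaussianNaiveBayes().fit(X_train, y_train)

v1 = "<book><title>XML basics</title><author>Ann Lee</author></book>"
v2 = "<book><title>XML basics revised</title><author>Ann Lee</author></book>"
other = "<invoice><total>42</total><client>Bob</client></invoice>"

print(model.predict(features(v1, v2)))
print(model.predict(features(v1, other)))
```

The user never picks a cutoff value: the classifier combines the similarity scores and returns a class label, which is the sense in which detection becomes classification.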



