ABSTRACT
In this paper, we propose a set of similarity metrics for manipulating collections of values occurring in XML documents. Following the data model presented in the TAX algebra, we treat an XML element as a labeled ordered rooted tree. Considering that XML nodes can be either atomic, i.e., they contain single values such as short character strings, dates, etc., or complex, i.e., nested structures that contain other nodes, we propose two types of similarity metrics: MAVs, for atomic nodes, and MCVs, for complex nodes. In the first case, we suggest the use of several application-domain-dependent metrics. In the second case, we define metrics for complex values that are structure dependent, and can be distinctly applied for it and collections of values. We also present experiments showing the effectiveness of our method.
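The split between atomic and complex nodes can be illustrated with a minimal sketch. It is not the MAV or MCV definitions from the paper, which the abstract does not give. It assumes a generic string similarity (Python's `difflib.SequenceMatcher`) as a stand-in for an atomic-value metric, and a simple best-match average over children as a stand-in for a complex-value metric. It also ignores child order, although the data model is an ordered tree. The function names `atomic_similarity`, `is_atomic`, and `node_similarity` are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def atomic_similarity(a: str, b: str) -> float:
    """Stand-in atomic-value metric (assumption): normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_atomic(node: ET.Element) -> bool:
    """Treat an element with no child elements as an atomic node."""
    return len(node) == 0


def node_similarity(x: ET.Element, y: ET.Element) -> float:
    """Stand-in complex-value metric (assumption): recursive, structure-dependent score."""
    if x.tag != y.tag:
        return 0.0  # different labels are treated as not comparable
    if is_atomic(x) and is_atomic(y):
        return atomic_similarity(x.text or "", y.text or "")
    if is_atomic(x) or is_atomic(y):
        return 0.0  # one atomic, one complex: structures do not match
    # For each child of x, take its best-matching child of y, then
    # normalize by the larger child count so missing children lower the score.
    best = [max(node_similarity(cx, cy) for cy in y) for cx in x]
    return sum(best) / max(len(x), len(y))
```

For example, comparing `<author><name>John Smith</name><city>Porto Alegre</city></author>` with a copy whose name is spelled `Jon Smith` yields a score slightly below 1, while comparing an element with itself yields exactly 1. A domain-specific metric (for dates or person names, say) could replace `atomic_similarity` without changing the recursive part.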
CITED BY 6
Carina F. Dorneles, Carlos A. Heuser, Viviane Moreira Orengo, Altigran S. da Silva, Edleno S. de Moura, A strategy for allowing meaningful and comparable scores in approximate matching, Proceedings of the sixteenth ACM Conference on Information and Knowledge Management, November 06-10, 2007, Lisbon, Portugal
Deise de Brum Saccol, Nina Edelweiss, Renata de Matos Galante, Carlo Zaniolo, XML version detection, Proceedings of the 2007 ACM Symposium on Document Engineering, August 28-31, 2007, Winnipeg, Manitoba, Canada
Carlos A. Heuser, Francisco N. A. Krieser, Viviane Moreira Orengo, SimEval: a tool for evaluating the quality of similarity functions, Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling, November 01-01, 2007, Auckland, New Zealand
Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura, A strategy for allowing meaningful and comparable scores in approximate matching, Information Systems, v.34 n.8, p.740-756, December 2009