A Novel XML Document Structure Comparison Framework Based-On Subtree Commonalities and Label Semantics

26 Pages. Posted: 23 Jun 2018. Publication Status: Accepted.

Joe M. Tekli

University of São Paulo (USP) - ICMC Computer Science and Statistics Institute

Richard Chbeir

University Pau & Pays Adour

Abstract

XML similarity evaluation has become a central issue in the database and information communities, with applications ranging from document clustering and version control to data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them rely on techniques for computing the edit distance between tree structures, since XML documents are commonly modeled as Ordered Labeled Trees. Yet a thorough investigation of current approaches led us to identify several similarity aspects, namely subtree-related structural and semantic similarities, that are not sufficiently addressed when comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework that deals with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar subtrees) and allows the end user to adjust the comparison process according to her requirements. Our framework consists of four main modules for i) discovering the structural commonalities between subtrees, ii) identifying subtree semantic resemblances, iii) computing tree-based edit operation costs, and iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy than alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
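To make the notion of tree edit distance between Ordered Labeled Trees concrete, the following is a minimal sketch of the classic recursive forest formulation with unit insertion and deletion costs. It is not the framework described in the abstract: the names `tree`, `edit_distance` and `relabel_cost` are hypothetical, and the pluggable `relabel_cost` only hints at where a label-semantics measure could be substituted for plain label equality.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) pair."""
    return (label, tuple(children))

def edit_distance(t1, t2, relabel_cost=None):
    """Edit distance between two ordered labeled trees.

    Insertions and deletions cost 1 each (an assumption of this sketch).
    relabel_cost(a, b) defaults to 0 for equal labels and 1 otherwise.
    """
    if relabel_cost is None:
        relabel_cost = lambda a, b: 0 if a == b else 1

    @lru_cache(maxsize=None)
    def forest_dist(f, g):
        # f and g are ordered forests, i.e. tuples of trees.
        if not f and not g:
            return 0
        if not f:
            # Insert the rightmost root of g; its children join the forest.
            _, kids = g[-1]
            return forest_dist(f, g[:-1] + kids) + 1
        if not g:
            # Delete the rightmost root of f; its children join the forest.
            _, kids = f[-1]
            return forest_dist(f[:-1] + kids, g) + 1
        (lv, kv), (lw, kw) = f[-1], g[-1]
        return min(
            forest_dist(f[:-1] + kv, g) + 1,   # delete rightmost root of f
            forest_dist(f, g[:-1] + kw) + 1,   # insert rightmost root of g
            forest_dist(kv, kw)                # match the two rightmost roots:
            + forest_dist(f[:-1], g[:-1])      # children vs. children, rest vs. rest
            + relabel_cost(lv, lw),
        )

    return forest_dist((t1,), (t2,))

# Two small XML-like documents that differ in a single element label.
doc_a = tree("book", tree("title"), tree("author"))
doc_b = tree("book", tree("title"), tree("writer"))
print(edit_distance(doc_a, doc_b))
```

With the default cost, relabeling `author` to `writer` counts as a full edit. Passing a `relabel_cost` that returns a small value for semantically close labels would lower the distance between such documents. This plain recursion is meant for small trees only.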

Keywords: XML, Semi-structured Data, Structural Similarity, Tree Edit Distance, Semantic Similarity, Information Retrieval, Vector Space Model

Suggested Citation

Tekli, Joe M. and Chbeir, Richard, A Novel XML Document Structure Comparison Framework Based-On Subtree Commonalities and Label Semantics (2012). Available at SSRN: https://ssrn.com/abstract=3198935 or http://dx.doi.org/10.2139/ssrn.3198935

Joe M. Tekli (Contact Author)

University of São Paulo (USP) - ICMC Computer Science and Statistics Institute

Rua Luciano Gualberto, 315
São Paulo, 14800-901
Brazil

Richard Chbeir

University Pau & Pays Adour
