An Empirical Survey of Linked Data Conformance
42 Pages Posted: 23 Jun 2018 Publication Status: Accepted
There has been a recent, tangible growth in RDF published on the Web in accordance with the Linked Data principles and best practices, the result of which has been dubbed the "Web of Data". Linked Data guidelines are designed to facilitate ad hoc re-use and integration of conformant structured data—across the Web—by consumer applications; however, thus far, systems have yet to emerge that convincingly demonstrate the potential applications for consuming currently available Linked Data. Herein, we compile a list of fourteen concrete guidelines as given in the "How to Publish Linked Data on the Web" tutorial. Thereafter, we evaluate conformance of current RDF data providers with respect to these guidelines. Our evaluation is based on quantitative empirical analyses of a crawl of ~4 million RDF/XML documents constituting over 1 billion quadruples, where we also look at the stability of hosted documents for a corpus consisting of nine monthly snapshots from a sample of 151 thousand documents. Backed by our empirical survey, we provide insights into the current level of conformance with respect to various Linked Data guidelines, enumerating lists of the most (non-)conformant data providers. We show that certain guidelines are broadly adhered to (esp. use HTTP URIs, keep URIs stable), whilst others are commonly overlooked (esp. provide licencing and human-readable meta-data). We also compare PageRank scores for the data-providers and their conformance to Linked Data guidelines, showing that both factors negatively correlate for guidelines restricting use of RDF features, while positively correlating for guidelines encouraging external linkage and vocabulary re-use. Finally, we present a summary of conformance for the different guidelines, and present the top-ranked data providers in terms of a combined PageRank and Linked Data conformance score.
Keywords: semantic web, linked data, web of data, pagerank, web publishing, data quality, web science
Suggested Citation: Suggested Citation