Classification in Networked Data: a Toolkit and a Univariate Case Study

41 Pages Posted: 9 Oct 2008

Sofus Macskassy

Fetch Technologies, Inc

Foster Provost

New York University

Date Written: 2004

Abstract

This paper presents NetKit, a modular toolkit for classification in networked data, and a case study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where entities whose labels are to be estimated are linked to entities for which the label is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components, and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how the toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple but important special case of univariate network classification, for which the only information available is the structure of class linkage in the network (i.e., only links and some class labels are available). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate clearly that simple network-classification models perform remarkably well, well enough that they should be used regularly as baseline classifiers for studies of relational learning for networked data. The results also show that there are a small number of component combinations that excel, and that different components are preferable in different situations, for example when few versus many labels are known.
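
The three-component framework described in the abstract can be illustrated with a minimal Python sketch. This is not NetKit's code or API. The function names, the toy graph, and the specific choices made here (a class prior as the local classifier, a weighted neighbor vote as the relational classifier, and a fixed number of synchronous update rounds as the collective inference procedure) are all assumptions picked for illustration. It uses only links and a few known labels, as in the univariate setting.

```python
from collections import defaultdict

def class_prior(labels):
    """Local classifier (assumed): with no attributes, use the class prior."""
    counts = defaultdict(float)
    for c in labels.values():
        counts[c] += 1.0
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def neighbor_vote(node, graph, beliefs, prior):
    """Relational classifier (assumed): weighted average of neighbor beliefs."""
    scores = {c: 0.0 for c in prior}
    for nbr, weight in graph[node].items():
        for c, p in beliefs[nbr].items():
            scores[c] += weight * p
    total = sum(scores.values())
    if total == 0.0:
        return dict(prior)  # isolated node: fall back to the prior
    return {c: s / total for c, s in scores.items()}

def collective_inference(graph, labels, rounds=20):
    """Collective inference (assumed): repeat synchronous belief updates."""
    prior = class_prior(labels)
    beliefs = {}
    for node in graph:
        if node in labels:
            # Known labels are clamped to certainty and never updated.
            beliefs[node] = {c: 1.0 if c == labels[node] else 0.0 for c in prior}
        else:
            beliefs[node] = dict(prior)
    unknown = [n for n in graph if n not in labels]
    for _ in range(rounds):
        # Compute every update from the previous round's beliefs, then apply.
        updated = {n: neighbor_vote(n, graph, beliefs, prior) for n in unknown}
        beliefs.update(updated)
    return {n: max(beliefs[n], key=beliefs[n].get) for n in unknown}

# Hypothetical toy network: two triangles joined by the edge c-d.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
labels = {"a": "x", "b": "x", "e": "y", "f": "y"}  # c and d are unlabeled

predictions = collective_inference(graph, labels)
print(predictions)  # {'c': 'x', 'd': 'y'}
```

Because each component is a separate function, any one of them can be swapped out while the other two stay fixed, which is the kind of component-wise comparison the abstract describes.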

Keywords: relational learning, network learning, collective inference, collective classification, networked data

Suggested Citation

Macskassy, Sofus and Provost, Foster, Classification in Networked Data: a Toolkit and a Univariate Case Study (2004). Information Systems Working Papers Series, 2004. Available at SSRN: https://ssrn.com/abstract=1281319

Sofus Macskassy (Contact Author)

Fetch Technologies, Inc

2041 Rosecrans Ave
Suite 245
El Segundo, CA 90245
United States

HOME PAGE: http://www.fetch.com

Foster Provost

New York University

44 West Fourth Street
New York, NY 10012
United States
