Variable Selection with Big Data based on Zero Norm and via Sequential Monte Carlo

24 Pages. Posted: 21 May 2019

Jin-Chuan Duan

National University of Singapore

Date Written: April 22, 2019

Abstract

Selecting a subset from many potential explanatory variables in linear regressions has long been a subject of research interest, and the matter has become more important in the era of big data, when many more variables are available and accessible. Of late, l_1-norm penalty based techniques such as the Lasso of Tibshirani (1996) have become very popular. However, the variable selection problem in its natural setting is a zero-norm penalty problem, i.e., a penalty on the number of variables as opposed to the l_1-norm of the regression coefficients. The popularity of the l_1-norm penalty and its variants has more to do with computational considerations, because selection with the zero-norm penalty is a highly demanding combinatorial optimization problem when the number of potential variables becomes large. We devise a sequential Monte Carlo (SMC) method as a practical and reliable tool for zero-norm variable selection problems; selecting, say, the best 20 out of 1,000 potential variables can be completed on a typical multi-core desktop computer in a couple of minutes. The methodological essence is to recognize that the selection problem is equivalent to the task of sampling from a discrete probability function defined over all possible combinations comprising, say, k regressors out of p >= k potential variables, where the peak of this function corresponds to the optimal combination. The solution technique generates samples sequentially until the final sample represents the target probability function. With the final SMC sample in place, we apply extreme value theory to assess how likely, and to what extent, the maximum R^2 has been achieved. We also demonstrate through a simulation study the method's reliability and its superiority vis-a-vis the adaptive Lasso.
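
The sampling view described in the abstract can be illustrated with a toy sketch. The code below is not the paper's algorithm: it assumes a tempered target proportional to exp(gamma * R^2) over k-subsets, multinomial resampling, and single-variable swap moves accepted by a Metropolis rule. The names `r_squared` and `smc_select`, the tempering schedule `gammas`, and the simulated data are all hypothetical choices made for illustration, and the extreme value assessment of the final sample is omitted.

```python
import numpy as np

def r_squared(X, y, subset):
    """R^2 of an OLS fit (with intercept) of y on the columns in `subset`."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

def smc_select(X, y, k, n_particles=200, gammas=(0, 5, 20, 80, 320),
               n_moves=5, seed=0):
    """Toy tempered SMC over k-subsets; target at each stage is exp(gamma * R^2).

    Assumed design (not taken from the paper): uniform initial sample,
    multinomial resampling, symmetric swap proposals.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Stage gamma = 0: draw k-subsets uniformly at random.
    particles = [tuple(sorted(rng.choice(p, size=k, replace=False)))
                 for _ in range(n_particles)]
    scores = np.array([r_squared(X, y, s) for s in particles])
    for g_prev, g in zip(gammas[:-1], gammas[1:]):
        # Reweight toward the sharper target, then resample.
        logw = (g - g_prev) * scores
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = [particles[i] for i in idx]
        scores = scores[idx]
        # Metropolis moves: swap one selected variable for an unselected one.
        for _ in range(n_moves):
            for i in range(n_particles):
                s = list(particles[i])
                outside = np.setdiff1d(np.arange(p), s)
                s[rng.integers(k)] = rng.choice(outside)
                prop = tuple(sorted(s))
                prop_score = r_squared(X, y, prop)
                accept = np.exp(min(0.0, g * (prop_score - scores[i])))
                if rng.random() < accept:
                    particles[i], scores[i] = prop, prop_score
    top = int(np.argmax(scores))
    return particles[top], float(scores[top])

# Simulated data: 3 relevant variables (columns 2, 11, 17) out of 30.
data_rng = np.random.default_rng(42)
X = data_rng.standard_normal((200, 30))
y = 3.0 * X[:, 2] - 2.0 * X[:, 11] + 2.0 * X[:, 17] + data_rng.standard_normal(200)

best, score = smc_select(X, y, k=3)
print("selected columns:", [int(j) for j in best], "R^2:", round(score, 3))
```

Raising gamma in stages concentrates the particles near the subset with the highest R^2, which mirrors the abstract's point that the peak of the target function corresponds to the optimal combination. A realistic implementation would need a more careful tempering schedule and cheaper R^2 updates than refitting by least squares at every proposal.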

Keywords: Lasso, tempering, extreme value, regression

JEL Classification: C1

Suggested Citation

Duan, Jin-Chuan, Variable Selection with Big Data based on Zero Norm and via Sequential Monte Carlo (April 22, 2019). Available at SSRN: https://ssrn.com/abstract=3377038 or http://dx.doi.org/10.2139/ssrn.3377038

Jin-Chuan Duan (Contact Author)

National University of Singapore

Mochtar Riady Building
15 Kent Ridge Drive
Singapore, 119245
Singapore