Selection and Fusion of Categorical Predictors with $L_{0}$-Type Penalties

Abstract

In regression modelling, categorical covariates have to be coded. Depending on the number of categorical covariates and on the number of levels they have, the number of coefficients can become huge. To reduce the model complexity, coefficients of similar categories should be fused and coefficients of non-influential categories should be set to zero. To this end, Lasso-type penalties on the differences of coefficients are a standard approach. However, the clustering and selection performance of this approach is sometimes poor, especially when the adaptive weights are badly conditioned or do not exist. In some situations, there is no incentive to cluster similar categories. To overcome this, an $L_{0}$ penalty on the differences of coefficients is proposed, where the $L_{0}$ ‘norm’ is defined as the number of non-zero entries in a vector. The proposed penalty favours clusters of categories that share the same effect on the response variable, while the estimation accuracy is comparable to that of Lasso-type penalties. Numerical experiments within the framework of generalized linear models are promising. For illustration, data on the unemployment rates in Germany are analyzed.
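To make the definition of the $L_{0}$ ‘norm’ on differences concrete, the following minimal Python sketch counts the non-zero pairwise differences between the coefficients of one categorical predictor and contrasts that count with the corresponding Lasso-type (absolute value) penalty. It is an illustration only, not the authors' implementation: the function names, the tolerance `tol`, the use of all pairwise differences, and the reference category fixed at zero are assumptions that the abstract does not spell out.

```python
from itertools import combinations


def l0_norm(values, tol=1e-12):
    """Count the entries whose absolute value exceeds tol (the L0 'norm')."""
    return sum(1 for v in values if abs(v) > tol)


def pairwise_differences(coefs):
    """Return all pairwise differences between category coefficients.

    Assumption: a reference category with coefficient 0 is prepended, so
    fusing a category with the reference amounts to removing its effect.
    """
    levels = [0.0] + list(coefs)
    return [b - a for a, b in combinations(levels, 2)]


def l0_fusion_penalty(coefs, tol=1e-12):
    """L0-type penalty: number of non-zero pairwise differences."""
    return l0_norm(pairwise_differences(coefs), tol)


def l1_fusion_penalty(coefs):
    """Lasso-type penalty: sum of absolute pairwise differences."""
    return sum(abs(d) for d in pairwise_differences(coefs))


# Hypothetical coefficients for a predictor with four levels (one reference).
distinct = [0.5, 1.0, 1.5]  # every category has its own effect
fused = [0.0, 1.0, 1.0]     # two clusters: {reference, 1} and {2, 3}

print(l0_fusion_penalty(distinct))  # 6 non-zero differences
print(l0_fusion_penalty(fused))     # 4 non-zero differences
print(l1_fusion_penalty(fused))
```

In this sketch the $L_{0}$-type value depends only on how many pairs of categories differ, not on how large the differences are, which is why fusing categories lowers it.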

Keywords

adaptive Lasso; best subset selection; GLMs; model selection

