Abstract
Knowing mathematical properties of machine learning models can provide effective guidance in feature engineering and selection. This paper presents mathematical propositions for identifying redundant variables in decision tree models. The first proposition demonstrates that if one variable is an order-preserving one-to-one mapping of another, the performance of the model remains unaffected when either of the two variables is removed. The second proposition shows that if one variable is an order-preserving but many-to-one mapping of another, so that the mapping reduces the variable's cardinality, the model's performance remains unchanged when the variable with the lower cardinality is removed. We provide formal mathematical proofs for both propositions and support our findings with simulation-based experiments. These results demonstrate that, within the standard CART-style framework of axis-aligned, greedily constructed decision trees, common order-preserving transformations, such as min–max scaling or logarithmic transformation, do not alter the set of feasible partitions and therefore leave predictive performance unchanged. Furthermore, the results suggest that rank-based correlation measures, such as Spearman's rank correlation coefficient, can serve as an effective tool for identifying redundant variables under this modeling framework.
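As a concrete illustration of the first proposition, the following sketch (our own, not taken from the paper; the synthetic data and parameters are illustrative) trains a CART-style tree on a feature and on its logarithm. Since the log is a strictly monotone bijection on positive values, the two encodings admit the same axis-aligned partitions, so the greedy splitter produces identical predictions, and Spearman's rank correlation flags the pair as perfectly redundant.

```python
# Minimal sketch, assuming scikit-learn's CART implementation.
# Illustrates that an order-preserving one-to-one transform (log)
# leaves a decision tree's predictions unchanged, and that
# Spearman's rho identifies the two features as redundant.
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=500)            # strictly positive feature
y = np.sin(x / 10.0) + rng.normal(0, 0.1, 500)   # noisy target

x_raw = x.reshape(-1, 1)
x_log = np.log(x).reshape(-1, 1)                 # order-preserving bijection

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_raw, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_log, y)

# The same greedy splits are feasible under both encodings,
# so the induced partitions and predictions coincide.
print(np.allclose(tree_raw.predict(x_raw), tree_log.predict(x_log)))  # True

# Spearman's rho is exactly 1 for a monotone one-to-one pair,
# so either feature can be dropped without loss.
rho, _ = spearmanr(x, np.log(x))
print(rho)  # 1.0
```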
