Abstract
Knowing mathematical properties of machine learning models can provide effective guidance in feature engineering and selection. This paper presents mathematical propositions for identifying redundant variables in decision tree models. The first proposition demonstrates that if one variable is an order-preserving one-to-one mapping of another, the performance of the model remains unaffected when either of the two variables is removed. The second proposition shows that if one variable is an order-preserving but many-to-one mapping of another, so that the mapping reduces the variable's cardinality, the model's performance remains unchanged when the variable with the lower cardinality is removed. We provide formal mathematical proofs for both propositions and support our findings with simulation-based experiments. These results demonstrate that, within the standard CART-style framework of axis-aligned, greedily constructed decision trees, common order-preserving transformations, such as min–max scaling or logarithmic transformation, do not alter the set of feasible partitions and therefore leave predictive performance unchanged. Furthermore, the results suggest that rank-based correlation measures, such as Spearman's rank correlation coefficient, can serve as an effective tool for identifying redundant variables under this modeling framework.
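As a concrete illustration of the first proposition, the following sketch (our own, not taken from the paper; the synthetic data and parameters are illustrative) trains a CART-style tree on a feature and on its logarithm. Since the log is a strictly monotone bijection on positive values, the two encodings admit the same axis-aligned partitions, so the greedy splitter produces identical predictions, and Spearman's rank correlation flags the pair as perfectly redundant.

```python
# Minimal sketch, assuming scikit-learn's CART implementation.
# Illustrates that an order-preserving one-to-one transform (log)
# leaves a decision tree's predictions unchanged, and that
# Spearman's rho identifies the two features as redundant.
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=500)            # strictly positive feature
y = np.sin(x / 10.0) + rng.normal(0, 0.1, 500)   # noisy target

x_raw = x.reshape(-1, 1)
x_log = np.log(x).reshape(-1, 1)                 # order-preserving bijection

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_raw, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_log, y)

# The same greedy splits are feasible under both encodings,
# so the induced partitions and predictions coincide.
print(np.allclose(tree_raw.predict(x_raw), tree_log.predict(x_log)))  # True

# Spearman's rho is exactly 1 for a monotone one-to-one pair,
# so either feature can be dropped without loss.
rho, _ = spearmanr(x, np.log(x))
print(rho)  # 1.0
```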
