Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and subsequently reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, as a result of the way that training data and predictor variables are randomly selected for use in constructing each tree and splitting each node, it is also well known that if too few trees are generated, variable importance rankings tend to differ between model runs. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that either a large ntree for a single model run or values averaged across multiple model runs with fewer trees is sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees that are required to achieve stable variable importance rankings.
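The multiple-runs strategy described above can be sketched in Python with scikit-learn. This is a minimal illustration, not the authors' implementation: the synthetic dataset, the number of runs, and the trees-per-run values are all assumed for demonstration, and scikit-learn's `feature_importances_` corresponds to the mean decrease in Gini (MDG) measure mentioned in the keywords.

```python
# Sketch: averaging Random Forest variable importance (MDG) across
# multiple model runs with a modest ntree, as an alternative to a
# single run with a very large ntree. All parameter values are
# illustrative assumptions, not taken from the letter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a high-dimensional remote sensing data set
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

n_runs = 25   # number of model runs (assumed value)
ntree = 100   # trees per run (assumed value)

# Collect per-run importance values; each run uses a different seed,
# so the random selection of samples and split variables differs.
importances = np.zeros((n_runs, X.shape[1]))
for run in range(n_runs):
    rf = RandomForestClassifier(n_estimators=ntree, random_state=run)
    rf.fit(X, y)
    importances[run] = rf.feature_importances_

# Mean importance across runs, and the resulting variable ranking
mean_importance = importances.mean(axis=0)
ranking = np.argsort(mean_importance)[::-1]  # most to least important
print(ranking)
```

In practice, stability could be assessed by recomputing `mean_importance` as runs accumulate and stopping once the ranking no longer changes between successive averages.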

Additional Metadata
Keywords Computational modeling, Convergence, Data models, Mean decrease in accuracy (MDA), Mean decrease in Gini (MDG) index, Random forest, Remote sensing, Stability analysis, Systematics, Variable reduction, Vegetation
Persistent URL dx.doi.org/10.1109/LGRS.2017.2745049
Journal IEEE Geoscience and Remote Sensing Letters
Behnamian, A. (Amir), Millard, K. (Koreen), Banks, S.N. (Sarah N.), White, L. (Lori), Richardson, M., & Pasher, J. (Jon). (2017). A Systematic Approach for Variable Selection With Random Forests: Achieving Stable Variable Importance Values. IEEE Geoscience and Remote Sensing Letters. doi:10.1109/LGRS.2017.2745049