Abstract
In the present research, we found that different preprocessing options and parameterizations of classification and regression trees alter their model fit and directly affect their applicability for end-users. In terms of applicability, classification trees react differently to pruning than regression trees: at high pruning levels, classification trees focus on the extreme values of the response variable, whereas regression trees are more likely to predict intermediate values. Furthermore, when applying cross-validation with a high number of folds, modellers are likely to find one model that outperforms the others in terms of reliability. Models were assessed based on the coefficient of determination, the percentage of Correctly Classified Instances, and Cohen's Kappa statistic for each parameterization. We found positive correlations (
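The evaluation loop described above (varying pruning strength, scoring each fit under k-fold cross-validation with Cohen's Kappa) can be sketched as follows. This is a minimal illustration using scikit-learn's cost-complexity pruning; the dataset, parameter grid, and fold count are assumptions for demonstration, not the authors' actual setup.

```python
# Illustrative sketch: vary the pruning strength of a classification tree and
# score each parameterization with Cohen's Kappa under 10-fold cross-validation.
# Dataset (iris) and ccp_alpha grid are hypothetical stand-ins.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, cohen_kappa_score

X, y = load_iris(return_X_y=True)
kappa = make_scorer(cohen_kappa_score)

for ccp_alpha in (0.0, 0.01, 0.05):  # increasing cost-complexity pruning
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10, scoring=kappa)
    print(f"ccp_alpha={ccp_alpha}: mean kappa={scores.mean():.3f}")
```

For a regression tree the same loop would swap in `DecisionTreeRegressor` and score with the coefficient of determination (`scoring="r2"`) instead of Kappa.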
