Abstract
To address the problem of climate change emissions from the transport sector, many countries are promoting electric vehicles (EVs). To support such efforts, it is essential to know what influences the choice of an EV over a traditional internal combustion engine vehicle (ICEV). To study this, a discrete choice experiment was developed, and 2,015 valid responses were gathered from Canadian adults with a driver’s license. In place of a more traditional analysis, a machine learning approach, XGBoost, was applied. However, two key issues were addressed with respect to its application. First, a practical question related to how best to split the training and testing data was examined. A new technique based on the Coyote optimization algorithm (COA) is developed that automatically determines the split that leads to the greatest prediction accuracy. The policy-relevant results of the analysis found that an individual’s Climate Change-Stage of Change (CC-SoC) and the price ratio of EVs to ICEVs are the most important direct influences. The interaction effect of the first two (CC-SoC and price ratio) is also influential. However, this leads to the second key issue: interpretability. Although high prediction accuracy (87.1%) was achieved, the black-box nature of the approach limits its policy relevance. As such, this research applied a technique, Accumulated Local Effects (ALE), that can determine the strength and direction of influence of the variable. This research demonstrates how machine learning can be applied to a policy-relevant question and provide information that is useful to policy decision makers.
The transportation sector is one of the largest sources of greenhouse gas (GHG) emissions around the world ( 1 ). This sector contributes roughly 24% of global CO2 emissions ( 2 ) and approximately a quarter of global fossil fuel consumption ( 3 ). Therefore, governments have been trying to reduce GHG emissions from this sector ( 4 ). Fossil fuel use is also a major source of air pollution ( 5 ). One proposed way to reduce GHG emissions, oil dependency, and air pollution is to replace internal combustion engine vehicles (ICEVs) with electric vehicles (EVs) ( 6 ). Although many governments have implemented EV promotion policies such as EV subsidies ( 7 ), fuel taxation ( 8 ), and improved charging infrastructure ( 9 ), the EV share of the global vehicle stock remains low, reaching only about 1% in 2020 ( 1 ). Therefore, it is essential to investigate the factors affecting EV preference/choice.
Different methods have been used to investigate determinants of EV preferences. Discrete choice models are conventional methods to analyze EV adoptions ( 10 ). Although discrete choice models are well-understood and transparent for interpretation, it has been demonstrated that artificial intelligence techniques (e.g., machine learning) often outperform discrete choice models with respect to prediction accuracy ( 11 ). Machine learning techniques can capture non-linear interactions between dependent and independent variables. Moreover, robust machine learning methods are highly flexible in structure and can result in higher accuracies than traditional techniques with predetermined (fixed) structures ( 12 ). Further, machine learning techniques can model problems including many features (independent variables), and they are powerful methods in pattern recognition and big data analysis ( 13 ).
Recently, machine learning techniques have been applied to predict who prefers EVs and to determine which parameters contribute most to electric vehicle adoption. For instance, Bas et al. ( 14 ) developed a model to predict individuals’ electric vehicle adoption. As the survey was conducted in only one state of the USA, it is not known how generalizable the results are. Different machine learning classifiers were used to solve the prediction problem, such as Gradient Boosting Machine, Deep Learning, and the Generalized Linear Model. The results indicated that Gradient Boosting Machine outperformed the other classification techniques with regard to prediction accuracy, with an area under the curve (AUC) of 92.8%. A further analysis using the most accurate classifier (i.e., Gradient Boosting Machine) revealed that county, next vehicle engine type preference, tax deductions, EV price, and EV range were influential parameters of a high willingness to adopt electric vehicles. Attitudes toward EVs and environmental attitudes were also found to have an influence. Subsequently, Local Interpretable Model-Agnostic Explanations (LIME) was applied to interpret the results of the machine learning techniques. However, LIME may not be an appropriate method for policy making since it cannot represent the influence of variables on intention to buy EVs for all respondents.
Similarly, Bas et al. ( 15 ) proposed a classification model to predict potential electric vehicle purchasers in the United States, using the same survey as Bas et al. ( 14 ). The leading model identified attitudes toward EVs, household income, and environmental concerns as the top factors. Although the influential parameters of EV adoption could be determined, how these parameters affected the response variable (EV adoption) was not clear.
Although machine learning techniques are accurate and can capture non-linear relations with features, two major issues are associated with these techniques. The first is a practical use problem: there is no standard procedure to determine the optimal percentage of training and testing data sets for different prediction problems. The second is a policy relevance issue: machine learning techniques are black-box tools and they are, therefore, hard to interpret ( 16 ).
With regard to the first issue, data sets should be divided into training and testing data sets. Training data is used to train the model. Then, unseen data (testing data) is used to evaluate the performance of the prediction techniques ( 17 ). Nonetheless, determining the optimal split between training and testing data sets has been a persistent concern. One possible approach is to check a limited number of preselected percentages for training data (e.g., 90%, 80%, 70%) and take the share that leads to the highest testing data accuracy as the optimal split ( 18 ). However, checking a limited number of candidate values cannot guarantee that the optimal value of a continuous-ranged parameter is found ( 19 ).
The second problem with machine learning techniques is their black-box nature. In other words, machine learning techniques are hard to interpret, and they cannot represent in which direction independent variables affect dependent variables. Researchers generally apply decision tree-based machine learning techniques (e.g., Random Forest and Gradient Boosting Machine) to solve this problem since these techniques can determine the relative influence of each independent variable on the dependent variable. Although decision tree–based techniques can prioritize variables based on their relative influence, they cannot determine how these variables (negatively, positively, linearly, quadratically, etc.) affect the problem’s response variable.
To solve this problem, researchers have developed white-box prediction techniques such as metaheuristic programming techniques (e.g., Water Cycle Programming [ 20 ], Biogeography-based Programming [ 21 ], Soccer League Competition Algorithm [ 19 ], Marine Predator Programming [ 22 ], and Coyote Optimization Programming [ 23 ]) and M5tree (e.g., Wang and Witten [ 24 ]). However, metaheuristic programming techniques cannot model classification problems and can only be applied to regression problems. Moreover, M5tree divides the data into smaller groups and cannot, therefore, represent the direction of influence of the variables for all the respondents. To address these limitations, Accumulated Local Effects (ALE) was developed ( 25 ) and is increasingly used ( 26 , 27 ), though its use in the transportation field has been limited ( 28 ). This research applies ALE to demonstrate how it can make black-box ML results useful from a policy and practice perspective.
In this study, eXtreme Gradient Boosting (XGBoost), a robust tree-based ensemble machine learning technique, was applied to predict who will buy EVs, with respect to both plug-in hybrid electric vehicles (PHEV) and battery electric vehicles (BEV). Using the XGBoost, variables could be ranked based on their influence on individuals’ intention to buy EVs and the interaction of variables could be examined. Then, a new method is introduced to determine the optimal share of training and testing data sets. After finding the optimal share of data sets, XGBoost was performed to calculate the importance of each parameter on the intention to buy EVs. Moreover, XGBoost was used to understand the interaction of parameters, and the interactions with the highest impact on the likelihood of buying EVs were scrutinized. Ultimately, ALE was employed to make the results interpretable.
Methods
This study attempted to investigate which parameters influence intention to buy EVs using an ensemble learning approach. The methodology flowchart is illustrated in Figure 1. As can be seen, a discrete choice experiment (DCE) was designed and used to collect a data set. A new method was introduced to maximize the model’s accuracy by optimizing the share of training and testing data sets. After optimizing the data split, an ensemble learning method (XGBoost) was performed. Consequently, the variables and their interactions were ranked based on their relative influence on intention to buy an EV. Ultimately, ALE was used to understand how the model’s independent variables and their interactions affect intention to buy EVs.

Figure 1. Flowchart of the methodology.
In this section, the data collection and preparation processes are initially presented. Then, the introduced method and conventional technique to tune the share of training and testing data sets are described. Afterward, the methods used for classification and interpretation processes are explained.
Data and Survey
To reach the objectives of this study, a survey built around a DCE was administered online in spring 2021. Only people aged 18 or older (legal adult age in Canada) who held a driver’s license could participate, as the survey was related to car purchases. The survey included trap questions to detect participants who did not pay attention to the questions; such respondents were removed from the data set. In the end, the responses of 2,015 participants remained.
Socio-Demographic and Environmental Attitudes
The socio-demographic information and environmental attitudes of respondents are shown in the Appendix (Table A1). Gender, region, age, employment status, education, ethnicity, household income, vehicle ownership, urbanization, and Climate Change-Stage of Change (CC-SoC) were the characteristics of respondents collected during the survey. The correlation matrix of these variables is shown in Table 1. As can be seen, the correlations between variables are weak (the largest, between household income and employment status, is 0.26) and, therefore, all variables were included in the final prediction model. CC-SoC is a strong measure to capture behavior and attitudes with respect to personal climate emissions ( 29 ). It was, therefore, applied to capture the environmental attitudes of participants. The possible responses for CC-SoC were as follows: CC-SoC1: I am not concerned about climate change. CC-SoC2: I am concerned about climate change, but I do not plan to reduce my emissions. CC-SoC3: I would like to reduce my emissions, but I don’t know how. CC-SoC4: I would like to reduce my emissions, and will do so in the future. CC-SoC5: I have already reduced my emissions significantly.
Further, responses of Quebec residents (the predominantly francophone province) were classified into two categories to consider the impact of language and culture: French Quebecers (QC-French) and English Quebecers (QC-English).
Table 1. Correlation Matrix of Socio-Demographic Information and Environmental Attitudes of Respondents
Note: CC-SoC = Climate Change-Stage of Change.
Discrete Choice Experiment (DCE)
The survey included a discrete choice experiment (DCE) before questions on socio-demographics. Half the respondents saw the Climate Change-Stage of Change question before the DCE and half afterwards to control for any influence on their response from having already seen such a question. DCEs present survey participants with hypothetical tasks (choice situations). The attributes of alternatives are defined by an experimental design. Participants need to select between alternatives based on their preferences. In this study, the DCE was designed according to Wang et al. ( 30 ), which had the same underlying experiment and precise econometric model. Each respondent was asked to select between two alternatives (an EV and an ICEV) in 12 different choice situations (tasks). In each task, the EV could be BEV or PHEV.
For the DCE design, four attributes were considered in the utility function, including purchase price, fuel costs, electric vehicle range, and CO2 emissions. Therefore, in each task, the purchase price, annual fuel cost, and annual GHG emission of alternatives varied. The alternatives were labeled as an ICEV, a PHEV, and a BEV. Respondents needed to select between an ICEV label and an EV label, which could be a BEV or a PHEV. For each vehicle, the annual driven distance was assumed to be 20,000 km, following the actual Natural Resources Canada (NRCan) label. The monthly CO2 emission and monthly fuel cost were calculated based on the distance driven and the unit CO2 emission and unit fuel cost of vehicles, reported by NRCan. The current investigation adopted a D-efficient design in Ngene for the discrete choice experiment. The levels used for attributes are summarized in Table 2.
Table 2. Discrete Choice Experiment Attributes and Their Levels.
Note: ICEV = internal combustion engine vehicle; PHEV = plug-in hybrid electric vehicle; BEV = battery electric vehicle; EV = electric vehicle; na = not applicable.
Vehicle labels were used to present alternative attributes in the survey. In the designed labels, eight different framings were used to present GHG information (CO2 emissions) since it was demonstrated that GHG information framing significantly influences individual vehicle engine choices ( 31 ). For further discussion on framing, please see Daziano et al. ( 31 ) or Wang et al. ( 30 ). Each respondent was randomly assigned to one treatment in which all the labels used one framing technique for the two-choice tasks. In this experiment, the labels were: current NRCan label, societal-goal framing with color (best performing label from Wang et al. [ 30 ]), pressure gauge label, new emoticons, goal-oriented with patriotism, dirty air, tree, and thumbs up/down. For more information about these framings, please see Ji et al. ( 32 ). Figure 2 provides an example of the first and second framings. Apart from Figure 2a, all of the emissions are framed with respect to the Government of Canada’s objective to reduce GHG emissions by 30% below 2005 levels by 2030, and these framings were designed based on this goal (for more information, please see Wang et al. [ 30 ]). Examples of the other framings are shown in the Appendix (Figure A1).

Figure 2. New-vehicle labels with different framing techniques designed to analyze the impacts of different GHG framings on individuals’ vehicle choices: (a) frame1: NRCan label and vehicle choice (current label in Canada); and (b) frame2: −30% societal goal with color label and vehicle choice.
Optimizing the Percentage of Training and Testing Data Sets
In a prediction process, the data set should be divided into training and testing data. In this investigation, a new method, called COA-XGBoost, was developed to optimize the data split ratio. The method recently proposed by Nguyen et al. ( 18 ) to optimize the share of training and testing data sets was treated as the conventional method against which the new method was compared.
The Conventional Method for Data Split Optimization
Split optimization is a research question that was recently addressed by Nguyen et al. ( 18 ). In the current study, different split ratios were taken into account for training, validation, and testing data sets, including 98/1/1, 90/5/5, 85/7.5/7.5, 80/10/10, 75/12.5/12.5, 70/15/15, 65/17.5/17.5, 60/20/20, 55/22.5/22.5, 50/25/25, 45/27.5/27.5, 40/30/30, and 35/32.5/32.5. In these split ratios, the percentages of validation and testing data were set equal, according to Majidifard et al. ( 33 ), since validation data acts as testing data in the split ratio optimization. For each split ratio, the model was run 10 times using different seeds to randomly split the training, validation, and testing data. The split ratio that led to the highest average validation data accuracy (over the different seeds) was considered the optimal split ratio. Finally, the model was run with the optimal split ratio, and the testing data accuracy was calculated.
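The search procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the toy data, and the majority-class stand-in model are all hypothetical, and in the study's setting the scoring function would instead fit XGBoost on the training share and return its validation accuracy.

```python
import random
from statistics import mean

# Validation (= testing) percentages of the 13 split ratios listed in the text.
HOLDOUT_PERCENTS = [1, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, 27.5, 30, 32.5]

def split_three_ways(rows, holdout_pct, seed):
    """Shuffle rows and return (train, validation, test) with equal-sized holdouts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_hold = max(1, round(len(shuffled) * holdout_pct / 100))
    valid = shuffled[:n_hold]
    test = shuffled[n_hold:2 * n_hold]
    train = shuffled[2 * n_hold:]
    return train, valid, test

def majority_class_accuracy(train, evaluate):
    """Hypothetical stand-in model: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return mean(1.0 if y == majority else 0.0 for _, y in evaluate)

def best_holdout_percent(rows, fit_and_score, n_seeds=10):
    """Return the holdout percentage with the highest mean validation accuracy."""
    averages = {}
    for pct in HOLDOUT_PERCENTS:
        scores = []
        for seed in range(n_seeds):  # one random split per seed
            train, valid, _ = split_three_ways(rows, pct, seed)
            scores.append(fit_and_score(train, valid))
        averages[pct] = mean(scores)
    best = max(averages, key=averages.get)
    return best, averages

# Toy data (assumption): (features, label) pairs, 70% of labels equal to 1.
toy_rows = [((i,), 1 if i % 10 < 7 else 0) for i in range(400)]
best_pct, avg_by_pct = best_holdout_percent(toy_rows, majority_class_accuracy)

# Final step: evaluate once on the untouched testing data at the chosen split.
train, _, test = split_three_ways(toy_rows, best_pct, seed=0)
test_accuracy = majority_class_accuracy(train, test)
```

Because only 13 candidate ratios are evaluated, the result is the best of those candidates and not necessarily the best split over the continuous range.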
The Introduced Method for Data Split Optimization
In this study, a new hybrid method was proposed to optimize the share of training and testing data sets to maximize prediction accuracy. In the hybrid method, XGBoost, a powerful machine learning technique, was applied for classification, and the Coyote optimization algorithm (COA), a robust metaheuristic optimization algorithm, was employed to find the solution that resulted in the highest prediction accuracy. Initially, the hyperparameters of XGBoost were tuned using Grid Search and K-fold cross-validation. The hyperparameters of COA were tuned following the procedure described by Naseri et al. ( 34 ) for metaheuristic algorithms. The XGBoost library in Python was used to run XGBoost, and COA was coded in Python according to Pierezan and Dos Santos Coelho ( 35 ). Then, an optimization framework was developed to maximize the prediction accuracy. The optimization model of the proposed approach is presented in Equations 1 to 5:
where
COA is a metaheuristic algorithm, developed in 2018, that is based on the social behavior and interactive experience of coyotes (Canis latrans). Each coyote represents a solution vector (its social condition), and the quality of that solution gives its fitness value. The first step is to randomly assign coyotes to different groups (packs). The coyotes in each group are ranked by fitness, and the fittest coyote is called the Alpha. Each coyote is then influenced by the Alpha of its group and by its group mates. By doing so, solution vectors are moved toward the solution vectors of their group and toward the best solution in that group. Cultures are also transferred between groups by exchanging coyotes with those from other groups. In this way, solution vectors do not accumulate around local optima, and a larger area of the search space is checked. In the end, weaker coyotes die, and new generations replace them ( 35 , 36 ).
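The mechanics described above can be illustrated with a heavily simplified, one-dimensional sketch. It is not the COA of Pierezan and Dos Santos Coelho ( 35 ) in full, nor the code used in this study: the birth-and-death and pack-exchange rules are simplified, the function name and default parameters are hypothetical, and a toy objective stands in for the validation accuracy of XGBoost.

```python
import random
from statistics import median

def coa_maximize(objective, low, high, n_packs=4, n_coyotes=5, n_iter=60, seed=0):
    """Simplified one-dimensional coyote-style search; returns (best_x, best_fitness)."""
    rng = random.Random(seed)
    # Each coyote is a candidate solution stored as [position, fitness].
    packs = []
    for _ in range(n_packs):
        pack = []
        for _ in range(n_coyotes):
            pos = rng.uniform(low, high)
            pack.append([pos, objective(pos)])
        packs.append(pack)

    for _ in range(n_iter):
        for pack in packs:
            alpha = max(pack, key=lambda c: c[1])[0]   # fittest coyote of the pack
            tendency = median(c[0] for c in pack)      # the pack's "cultural tendency"
            for coyote in pack:
                a, b = rng.sample(pack, 2)
                # Move under the influence of the alpha and of the pack tendency.
                cand = (coyote[0] + rng.random() * (alpha - a[0])
                        + rng.random() * (tendency - b[0]))
                cand = min(high, max(low, cand))       # keep the candidate in bounds
                fit = objective(cand)
                if fit > coyote[1]:                    # greedy acceptance
                    coyote[0], coyote[1] = cand, fit
            # Birth and death (simplified): a pup replaces the weakest coyote if fitter.
            mother, father = rng.sample(pack, 2)
            if rng.random() < 0.1:
                pup = rng.uniform(low, high)           # occasional random pup
            else:
                pup = 0.5 * (mother[0] + father[0])
            fit = objective(pup)
            weakest = min(pack, key=lambda c: c[1])
            if fit > weakest[1]:
                weakest[0], weakest[1] = pup, fit
        # Occasionally two packs exchange one coyote each.
        if n_packs > 1 and rng.random() < 0.005 * n_coyotes ** 2:
            p1, p2 = rng.sample(range(n_packs), 2)
            i, j = rng.randrange(n_coyotes), rng.randrange(n_coyotes)
            packs[p1][i], packs[p2][j] = packs[p2][j], packs[p1][i]

    best = max((c for pack in packs for c in pack), key=lambda c: c[1])
    return best[0], best[1]

def toy_objective(fraction):
    """Toy stand-in for validation accuracy, peaking at a holdout fraction of 0.3."""
    return -(fraction - 0.3) ** 2

best_x, best_f = coa_maximize(toy_objective, 0.01, 0.5)
```

In a setting like the one studied here, `objective` would take a candidate validation/testing share, split the data accordingly, fit XGBoost on the training share, and return the validation accuracy, so the search runs over a continuous range instead of a fixed list of ratios.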
Classification Process
After finding the optimal data split ratio, an ensemble learning technique (i.e., XGBoost) was applied to predict who prefers EVs in the choice tasks and to investigate which parameters affect EV preference. That is, the individuals’ choices (EV or ICEV) were considered the response variable to determine which parameters influence the intention to buy an EV. Therefore, the dependent variable was a binary (EV or ICEV) variable. The independent variables were: the ratio of EV to ICEV price, the ratio of EV to ICEV fuel cost, the ratio of EV to ICEV GHG emission, battery range of EV, EV engine type (BEV or PHEV), GHG framing (treatment), and socio-demographic variables of respondents (i.e., gender, region, age, employment status, education, ethnicity, household income, vehicle ownership, urbanization, and Climate Change-Stage of Change [CC-SoC]). Since the survey included 2,015 participants, and each participant completed 12 choice tasks, the number of observations was 2,015 × 12 = 24,180.
XGBoost was performed to model the classification problem for four reasons. First, XGBoost is an ensemble learning technique and it can, therefore, present the relative influence of variables on the EV purchase likelihood. Second, XGBoost prioritizes the variables’ interactions according to the influence they have on the response variables. Third, problems with high complexity can generally be solved with high accuracy using this technique ( 37 ). Fourth, XGBoost has outperformed several machine learning techniques for prediction accuracy. For instance, Kim ( 28 ) compared the prediction accuracy of XGBoost with random forest and artificial neural networks on a travel mode choice prediction problem. The results showed that the accuracy of XGBoost was 2.7% and 8.1% higher than random forest and artificial neural networks, respectively. Similarly, XGBoost was found to be more accurate than logistic regression ( 38 ), support vector machine ( 39 ), decision tree ( 40 ), and so forth.
XGBoost is a tree-based prediction technique using boosting modeling in the prediction process. XGBoost is characterized by prediction based on parallel processing and rapid learning ( 41 ). This technique can be used for classification and regression problems, and it generally reaches high prediction accuracy in complicated problems ( 37 ). XGBoost is more accurate than conventional Gradient Boosting as it uses a more accurate approximation ( 42 ). XGBoost generates some weak learners (decision trees) and tries to reach a powerful learner by combining the weak decision trees. An optimization problem is modeled during each iteration of XGBoost to minimize the prediction error. Moreover, the result of the mentioned optimization problem and the residuals are applied to optimize tree structures in each iteration. Residuals are the difference between target values and predicted values. Further, the objective function of the optimization problem contains a regularized term that controls the model’s over-fitting. The second-order and first-order gradient statistics are applied to solve the XGBoost optimization problem and optimize the model’s structure ( 43 ).
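For reference, the regularized objective minimized at boosting iteration t can be written as follows. This is the standard second-order formulation from the XGBoost literature, with constant terms omitted; the notation is introduced here for illustration and is not defined elsewhere in this paper.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i \, f_t(x_i) + \tfrac{1}{2} \, h_i \, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2},
\qquad
g_i = \frac{\partial \, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

Here l is the loss function, f_t is the tree added at iteration t, g_i and h_i are the first-order and second-order gradient statistics for observation i, T is the number of leaves in the tree, w is the vector of leaf weights, and γ and λ are regularization parameters. The term Ω is the regularization term that controls over-fitting.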
Since XGBoost is a tree-based ensemble technique, it can determine the relative influence of each independent variable on the dependent variable. Moreover, XGBoost can prioritize the variables’ interactions based on their influence on the response variable. XGBoost was modeled using the xgbfir Python library to analyze the interactions of independent variables and detect the most important ones. The library can identify the interactions’ importance based on the split points in the XGBoost structure ( 44 ).
Interpreting Black-Box Machine Learning Techniques
Although XGBoost can present the relative influence of independent variables and their interactions on the response variable, it cannot show how these variables affect the response variable. To this end, ALE was used in this study to interpret the results of XGBoost. ALE is a plot-based analysis to interpret black-box prediction tools, developed by Apley and Zhu ( 25 ). ALE divides the range of an independent variable into intervals. For the observations falling in each interval, the difference between the model predictions at the upper and lower bounds of the interval is calculated and averaged. These differences are then accumulated across intervals, and the result is centered so that the mean effect is zero ( 26 ). More details about ALE can be found in Apley and Zhu ( 25 ).
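A minimal sketch of a first-order ALE computation for one numeric variable is given below. It is an illustration under stated assumptions and not the code used in this study: the function name is hypothetical, a toy linear function stands in for the fitted XGBoost model, intervals are taken from empirical quantiles, and the centering uses interval midpoints weighted by interval size, which is one common choice. Categorical variables such as CC-SoC need additional handling that is not shown.

```python
from statistics import mean

def ale_1d(predict, rows, feature, n_bins=5):
    """First-order ALE of one numeric feature; returns (edges, centered ALE at edges).

    rows is a list of feature lists; predict maps one row to a number.
    """
    values = sorted(r[feature] for r in rows)
    # Quantile-based edges so that each interval holds a similar number of rows.
    edges = sorted({values[round(k * (len(values) - 1) / n_bins)]
                    for k in range(n_bins + 1)})
    effects, counts = [], []
    for k in range(1, len(edges)):
        lo, hi = edges[k - 1], edges[k]
        # The first interval is closed on the left so the minimum value is included.
        inside = [r for r in rows
                  if (lo < r[feature] <= hi) or (k == 1 and r[feature] == lo)]
        diffs = []
        for r in inside:
            upper, lower = list(r), list(r)
            upper[feature], lower[feature] = hi, lo
            # Local effect: prediction change across the interval, other features fixed.
            diffs.append(predict(upper) - predict(lower))
        effects.append(mean(diffs) if diffs else 0.0)
        counts.append(len(inside))
    # Accumulate the averaged local effects, starting from zero at the first edge.
    ale = [0.0]
    for effect in effects:
        ale.append(ale[-1] + effect)
    # Center so that the size-weighted mean effect is zero.
    offset = sum(0.5 * (ale[k] + ale[k + 1]) * counts[k]
                 for k in range(len(counts))) / sum(counts)
    return edges, [a - offset for a in ale]

def toy_model(row):
    """Toy stand-in for a fitted model: linear in both features."""
    return 2.0 * row[0] + row[1]

toy_rows = [[i / 10, (i * 7) % 5] for i in range(50)]
edges, ale_values = ale_1d(toy_model, toy_rows, feature=0)
```

For the toy linear model, the ALE curve of the first feature rises by twice the width of each interval, so the recovered effect matches the coefficient of that feature, and values below and above zero indicate a lower-than-average and higher-than-average effect, respectively.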
Results and Discussions
In this section, the optimal split ratio is initially presented. Then, the relative influence of variables and their interactions on the intention to buy EVs are introduced. Finally, ALE results are indicated to understand how variables affect the intention to buy EVs.
Optimal Split Ratio
As discussed, different values were considered in the conventional model to find the optimal share of testing data. The results of the conventional model are shown in Figure 3. In this figure, the range of validation data accuracy and the average validation data accuracy over different seeds are presented. As can be seen, the highest average validation data accuracy was achieved when the percentage of validation data was 7.5%. Accordingly, the optimal percentage of testing data was 7.5% in the conventional split ratio optimization model. The accuracy on the testing data for this optimal split ratio (7.5%) was then calculated, and the model reached a testing data accuracy of 83.6%.

Figure 3. Results of the conventional model for split ratio optimization.
The hyperparameters of COA and XGBoost were first tuned. In XGBoost, the number of estimators, minimum data in leaves, maximum depth, and learning rate were 200, 11, 6, and 0.3, respectively. In COA, the optimal number of packs and the number of coyotes in each pack were 10 and 8, respectively. Then, the method introduced in this study (i.e., COA-XGBoost) was performed, reaching a validation accuracy of 87.1%. In the optimal solution, the validation data percentage was 3.93%. Therefore, the percentage of testing data was also set to 3.93%, and the accuracy on the testing data was then calculated. In the proposed optimal solution, the testing data accuracy was 87.4%. Accordingly, applying COA-XGBoost to optimize the split ratio increased the testing data accuracy by 3.8 percentage points (from 83.6% to 87.4%), which indicates that the introduced method is well suited to maximizing the accuracy of prediction models. Since COA-XGBoost obtained a higher accuracy, it was used to determine the relative influence of variables, and the results are presented in the following part.
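For readers who wish to reproduce a similar setup, the tuned values reported above could be written as a configuration like the one below. The XGBoost keys follow the parameter names of the xgboost scikit-learn interface; mapping "minimum data in leaves" to `min_child_weight` is an assumption, and the COA key names are hypothetical.

```python
# Tuned values reported in the text, keyed by xgboost's scikit-learn-style names.
# Assumption: "minimum data in leaves" corresponds to min_child_weight.
xgb_params = {
    "n_estimators": 200,
    "min_child_weight": 11,
    "max_depth": 6,
    "learning_rate": 0.3,
}

# COA settings reported in the text (key names are hypothetical).
coa_params = {"n_packs": 10, "n_coyotes_per_pack": 8}

# Total number of candidate solutions evaluated per COA iteration.
population_size = coa_params["n_packs"] * coa_params["n_coyotes_per_pack"]
```

With the xgboost package installed, such a dictionary could be passed as `xgboost.XGBClassifier(**xgb_params)`.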
Contribution of Variables to Increase the Intention to Buy EVs
COA-XGBoost was performed to determine the relative influence of variables on intention to buy an EV, and the results are illustrated in Figure 4. As can be seen, CC-SoC, as a strong measure to assess the level of concern about climate change, is the variable with by far the most influence on vehicle choice. It can, therefore, be postulated that concern about climate change plays a crucial role in the intention to buy EVs. Bas et al. ( 14 ) investigated the effects of different variables on electric vehicle adoption, and the results showed that awareness of environmental protection was one of the most important variables in explaining the high willingness to adopt EVs. Moreover, previous studies demonstrated that people with different levels of CC-SoC are statistically different in willingness to buy EVs ( 29 , 30 ).

Figure 4. The relative influence of socio-demographic, environmental concern, and vehicle attributes on individuals’ intention to buy EVs.
The EV to ICEV purchase price ratio is the second most influential parameter of vehicle choice, with a contribution to the likelihood of buying an EV of roughly 9%. It has been demonstrated that monetary incentives, such as income tax deductions for EVs and vehicle price, significantly influence individuals’ choice when selecting between an ICEV and an EV ( 14 , 30 ). These findings are consistent with the outcomes of the current research. This result suggests that reducing the difference in vehicle costs (either by increasing the costs of ICEVs or reducing the costs of EVs) is a policy-relevant lever.
Contextual and individual variables are the next most influential, followed by region, age, and education based on the relative influence on intention to buy an EV. GHG information framing is the sixth most influential parameter on vehicle choice, with an importance weight of 6.73%. Interestingly, the impact of GHG information framing is much greater than that of some vehicle attributes, such as the EV to ICEV fuel price ratio, EV battery range, the ratio of EV to ICEV GHG emissions, and EV engine type (BEV or PHEV). Therefore, it can be concluded that it is not sufficient to simply present GHG emissions; they need to be presented in effective ways.
Further, the interaction of variables was ranked based on their impact on vehicle choices using XGBoost. The top-ranked variables’ interactions are shown in Figure 5. The interaction of CC-SoC and EV to ICEV purchase price ratio has the strongest influence on the intention to buy EVs.

Figure 5. The relative influence of variables’ interactions on individuals’ intention to buy EVs.
Interpretation of Machine Learning Results
In general, machine learning methods are considered black-box tools because the direction in which the inputs influence the output variable cannot be obtained directly ( 45 ). As a result, the results of machine learning techniques may be misinterpreted ( 46 ). Further, interpretation of machine learning results is essential to provide policy makers with the required information ( 47 ). That is, the interpreted results of machine learning techniques can be used to set new policies to promote EVs. Interpretation of machine learning may also be an avenue for researchers to increase the prediction accuracy of their models by exploiting prior knowledge, along with the other benefits of interpretation ( 48 ). In other words, researchers can detect which variables have the highest influence on the response variable (i.e., the intention to buy EVs) and how these variables affect it. Therefore, the top variables can be used to better identify prospective EV buyers.
In this section, the results of ALE for variables are first presented. Then, the outcomes of ALE for top-ranked interactions are discussed. It should be mentioned that size in ALE plots represents the number of observations in each group.
ALE Results for Variables
The effect of CC-SoC (the most influential parameter on intention to buy EVs) on the likelihood of EV purchase is shown in Figure 6. In ALE figures (e.g., Figure 6), the bar chart shows the number of data samples (size) in each group, and the bar chart values are shown on the right axis. For example, the number of data samples in CC-SoC1 is 2,664 (222 individuals and 12 choice tasks for each: 222 × 12 = 2,664). The trend line indicates the EV purchase likelihood, and the probability values can be seen on the left axis. The EV purchase likelihood of two groups can be compared by subtracting their corresponding values. For instance, if the EV purchase likelihood of group 1 and group 2 is −2% and 4.2%, respectively, the EV purchase likelihood of group 2 is 6.2 percentage points (4.2 − (−2) = 6.2) higher than that of group 1. Further, the value of zero on the left axis corresponds to the average EV purchase likelihood of all respondents.

Figure 6. Effect of CC-SoC on EV purchase likelihood where: 1 = not concerned; 2 = concerned, but do not plan to reduce emissions; 3 = concerned, but do not know what to do; 4 = concerned, and planning to reduce emissions; 5 = concerned, and have significantly reduced emissions.
As shown in Figure 6, individuals who stated “they have reduced emissions (CC-SoC5)” or “they will reduce their emissions (CC-SoC4)” are more likely to purchase EVs. Knowing what percentage of the population is at these stages of change will help policy makers tune their approaches. On the other hand, people who are not concerned about climate change (CC-SoC1) and who are concerned but do not plan to reduce their emissions (CC-SoC2) are significantly less likely to buy EVs. Specifically, the likelihood of buying an EV for the CC-SoC1 and CC-SoC2 groups is roughly 26% and 28% less, respectively, than that of the CC-SoC5 group. Wang et al. ( 30 ) investigated the influence of CC-SoC on willingness to pay for EVs. Their investigation showed that the CC-SoC5 group was the most likely to buy an EV, followed by the CC-SoC4, CC-SoC3, CC-SoC1, and CC-SoC2 groups. Moreover, individuals who stated they are in the CC-SoC1, CC-SoC2, and CC-SoC3 stages were found to be statistically different from the CC-SoC4 group when comparing their intention to buy an EV. However, the difference between people who stated “they have reduced emissions significantly (CC-SoC5)” and “they will reduce their emissions (CC-SoC4)” was not statistically significant. Therefore, the results of this study are in harmony with the results presented by Wang et al. ( 30 ), where Multinomial Logit was applied.
The effect of the EV to ICEV purchase price ratio on the likelihood of EV purchase is shown in Figure 7. Increasing the purchase price ratio from 1.22 to 1.56 sharply reduces the intention to buy an EV: the likelihood of buying an EV falls by approximately 20%. Beyond a ratio of 1.56, further increases in the price ratio reduce the intention to buy EVs only slightly. The probability of EV preference reaches its lowest level when the purchase price ratio is 2.82. The likelihood of buying an EV decreases by over 30% if the ratio is increased from 1.22 to 2.82. A more detailed look at Figure 7 reveals that the EV to ICEV purchase price ratio should not exceed 1.78 if the aim is to increase EV adoption, because the effect of the price ratio on intention to buy an EV is zero at a ratio of 1.78, and ratios above 1.78 reduce the intention to buy EVs.

Figure 7. Effect of EV to ICEV purchase price ratio on EV purchase likelihood.
The effect of age on the likelihood of EV purchase is displayed in Figure 8. The intention to buy EVs does not change significantly for those younger than 33 years old. After 38 years of age, the intention to choose EVs declines, reaching its minimum for people aged between 50 and 53 years. Conversely, the willingness to buy an EV increases gradually with age from 58 to 87 years. The oldest group is expected to be the wealthiest group, and it might be much easier for them to afford EVs, which are more expensive than ICEVs. A previous study demonstrated that, in Canada, people aged 50 to 59 have the least willingness to pay for EVs, and those under 40 or over 60 are more willing to pay for EVs ( 30 ). Those results are in line with the results obtained in this study. The additional benefit here is that the non-linear relationships can be seen at a finer level.

Figure 8. Effect of age on EV purchase likelihood.
The impact of GHG information framing on the likelihood of EV purchase is presented in Figure 9. Among the GHG information framings, presenting the emissions with respect to a national goal along with the newly developed emojis (frame 4, which integrates color, injunctive norms, and air quality) is the most effective, followed by highlighting the information with color only (frame 2; red for high emissions, blue for low emissions). These two frames increase the probability of selecting an EV by 12% and 9%, respectively, compared with the current mock-up (i.e., frame 1; NRCan label with CO2 emissions in g/km). It can, therefore, be deduced that GHG information framings on labels (e.g., frame 4 and frame 2) play a critical role in motivating individuals to buy EVs. Further, frame 2 (societal goal with color label) was the most effective framing in the previous study ( 30 ). Nonetheless, frame 4 (integrating color, injunctive norms, and air quality), which was developed in this study, can increase the intention to buy EVs by 4% compared with frame 2.

Figure 9. Effect of GHG framing on EV purchase likelihood. Frame 1 = current NRCan label; 2 = societal goal with color; 3 = societal goal as pressure gauge; 4 = societal goal with new emoticons; 5 = steps toward societal goal with patriotism; 6 = societal goal as dirty air scale; 7 = societal goal as tree health; 8 = societal goal with thumbs up/down.
Frame 3 (pressure gauge label) and frame 5 (patriotic goal label) are the third and fourth most effective treatments for enhancing EV adoption, increasing the EV selection likelihood by nearly 6% and 4%, respectively, compared with the current label in Canada (frame 1). Frame 1 (current label in Canada) is the weakest frame for EV adoption. Similarly, frame 6 (dirty air label), frame 7 (tree scale label), and frame 8 (thumbs up/down label) are not effective treatments for increasing the intention to buy EVs.
ALE Results for Variable Interactions
To better understand how variables jointly influence EV adoption, the first-level interactions of variables were extracted using XGBoost (shown in Figure 5). ALE was then used to examine how these interactions influence the likelihood of buying EVs. In this section, ALE results for the top-ranked variable interactions are presented.
The impact of the most influential interaction (CC-SoC/EV to ICEV purchase price ratio) on the probability of choosing an EV is shown in Figure 10. As can be seen, the EV to ICEV purchase price ratio does not considerably affect the choices of those not concerned about climate change (CC-SoC1), since they are more likely to buy ICEVs. Likewise, decisions of the CC-SoC2 group (concerned but not planning to reduce their emissions) are not significantly influenced by the EV to ICEV purchase price ratio. The CC-SoC2 group is slightly more likely to buy an EV when the EV price is approximately the same as that of the ICEV. The CC-SoC3 group prefers EVs when the EV to ICEV purchase price ratio is at its lowest level (1.21). Therefore, the CC-SoC3 group prefers PHEVs to BEVs, and its members are not inclined to pay a considerable amount of money to reduce their emissions. The CC-SoC4 group is more likely to buy EVs when the EV to ICEV purchase price ratio is between 1.78 and 1.9. The ratio of 1.78 corresponds to the task in which the EV option is a BEV with a price of $48,000 and the ICEV price is $27,000. The ratio of 1.9 corresponds to the task with a PHEV and an ICEV with purchase prices of $38,000 and $20,000, respectively. Therefore, this group prefers both PHEVs and BEVs to ICEVs, provided the price ratio does not exceed 1.9. The people who stated that they had “reduced their emissions” (CC-SoC5) are more likely to buy BEVs. In other words, the CC-SoC5 group prefers BEVs to both PHEVs and ICEVs. Interestingly, the CC-SoC5 group is willing to pay up to 1.82 times the price of an ICEV for GHG emissions reductions.
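The price ratios quoted above can be reproduced directly from the task prices. The short sketch below uses the prices stated in the text; the helper names `price_ratio` and `premium_percent` are hypothetical and only serve to show the arithmetic, including how a ratio translates into a percentage premium over the ICEV price.

```python
def price_ratio(ev_price, icev_price):
    """EV to ICEV purchase price ratio."""
    return ev_price / icev_price


def premium_percent(ratio):
    """Percentage by which the EV price exceeds the ICEV price."""
    return (ratio - 1.0) * 100.0


# Task prices as stated in the text.
bev_task = price_ratio(48_000, 27_000)   # BEV at $48,000 vs. ICEV at $27,000
phev_task = price_ratio(38_000, 20_000)  # PHEV at $38,000 vs. ICEV at $20,000

print(round(bev_task, 2), round(phev_task, 2))  # 1.78 1.9
print(round(premium_percent(bev_task)))         # 78
```

A ratio of 1.78 therefore corresponds to an EV priced roughly 78% above the ICEV.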

Figure 10. Effect of the interaction of CC-SoC and EV to ICEV purchase price ratio on EV purchase likelihood.
Conclusions
The study aimed to predict who will buy EVs and to identify the determinants of EV adoption. To this end, a DCE was designed, and a data set of 24,180 observations was collected from 2,015 respondents. Applying a new hybrid method (COA-XGBoost), the results showed that the Climate Change-Stage of Change (CC-SoC) was the most influential parameter in individuals’ decisions. With regard to interactions between variables, the CC-SoC/EV to ICEV price ratio interaction was the most important, based on its relative influence on the likelihood of buying an EV.
To address the question of data splitting (optimizing the share of training and testing data sets), a new hybrid method (COA-XGBoost) was introduced to maximize the model’s accuracy. COA-XGBoost found the optimal percentage of testing data to be 3.92%, whereas the conventional split ratio optimization model found an optimal percentage of 7.5%. Using COA-XGBoost, the model reached a testing accuracy of 87.4%, which was 3.8% higher than that of the conventional method of optimizing the split ratio. Therefore, the proposed technique could enhance the prediction power significantly by finding the optimal data split.
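The idea of treating the testing share as a tunable quantity can be illustrated with a small sketch. This is not COA-XGBoost: a plain grid search stands in for the Coyote optimization algorithm, a one-dimensional threshold rule stands in for XGBoost, and the data are synthetic. All function names and parameter values below are hypothetical.

```python
import random


def make_data(n, seed=0):
    """Synthetic binary data (illustrative only): label 1 when a noisy score is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x + rng.gauss(0.0, 0.3) > 0 else 0
        data.append((x, y))
    return data


def fit_threshold(train):
    """Stand-in model: threshold at the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def accuracy(threshold, test):
    """Share of test observations classified correctly."""
    hits = sum(1 for x, y in test if (1 if x > threshold else 0) == y)
    return hits / len(test)


def search_split(data, test_fractions, seed=0):
    """Grid search over candidate testing shares; returns (best, best_acc, results)."""
    results = {}
    for frac in test_fractions:
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)  # same shuffle for every candidate
        n_test = max(1, round(len(shuffled) * frac))
        test, train = shuffled[:n_test], shuffled[n_test:]
        results[frac] = accuracy(fit_threshold(train), test)
    best = max(results, key=results.get)
    return best, results[best], results


candidates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
best_frac, best_acc, all_results = search_split(make_data(500), candidates)
```

A metaheuristic such as COA would search the same objective (testing accuracy as a function of the testing share) over a continuous range rather than a fixed grid. With a small testing share the accuracy estimate rests on few observations, so repeating the search over several seeds may be worthwhile.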
Since machine learning techniques are black-box tools, ALE was applied to interpret the results and determine the direction in which variables, and interactions between variables, influence the likelihood of EV preference. ALE results showed that individuals who self-identified as being at the top of the Climate Change-Stage of Change (CC-SoC4 and CC-SoC5), French Quebecers, those older than 58, and those who had completed a doctoral degree (e.g., Ph.D.) or a degree in medicine, dentistry, veterinary medicine, or optometry had the highest intention to buy EVs. In addition, individuals who do not own a vehicle, part-time workers, females, people of Oceania origins, and individuals with household incomes of over $200,000 are more likely to prefer EVs to ICEVs. The results suggested that the purchase price of EVs should not be more than 78% higher than that of ICEVs, since EV to ICEV purchase price ratios higher than 1.78 reduce the intention to buy EVs. For GHG framing, frame 4 (integrating color, injunctive norms, and air quality) is the most effective treatment for increasing the likelihood of buying EVs.
In terms of policy, the government can concentrate more on those who are more likely to buy EVs (e.g., CC-SoC4 and CC-SoC5, females, and French Quebecers) to increase the share of EVs in the Canadian vehicle stock. Moreover, the current mock-up (NRCan label) should be replaced with more effective labels to attract more car purchasers to buy EVs. The results of this study suggest that the integration of color, pressure gauge, emoticon, patriotic goal, dirty air figures, trees, and thumbs up/down can better present GHG information and attract customers to choose more sustainable options. Further, the emoticon label maximizes the intention to buy EVs, followed by the color, pressure gauge, and patriotic goal labels. Therefore, replacing the current label (NRCan label) with these labels can considerably increase the EV purchase likelihood.
Supplemental Material
Supplemental material, sj-docx-1-trr-10.1177_03611981231169533 for Interpretable Machine Learning Approach to Predicting Electric Vehicle Buying Decisions by Hamed Naseri, E.O.D. Waygood, Bobin Wang and Zachary Patterson in Transportation Research Record
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: H Naseri, E-O-D Waygood, B Wang, Z Patterson; data collection: H Naseri, E-O-D Waygood, and B Wang; analysis and interpretation of results: H Naseri, E-O-D Waygood; draft manuscript preparation: H Naseri, E-O-D Waygood, B Wang, Z Patterson. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by (a) the Fonds de recherches du Québec – Nature et Technologie (FRQNT) (grant number 2019-GS-261583); (b) Trottier Energy Institute, Ph.D. Excellence Scholarship; (c) Social Sciences and Humanities Research Council (grant number 435-2020-1292); and (d) Fonds de recherches du Québec – Nature et Technologie (FRQNT) (grant number 322727).
Supplemental Material
Supplemental material for this article is available online.
