Modeling Tenant’s Credit Scoring Using Logistic Regression

Abstract

This study applies multivariable logistic regression to develop a credit scoring model based on tenants’ characteristics; the tenants’ credit history is not considered. Rental information was collected from a landlord company in Malaysia. Because separation was detected in the training data, the parameters of the multivariable logistic regression were estimated by penalized maximum likelihood estimation with a ridge penalty. The factors initially considered to affect tenants’ credit scores were gender, age, marital status, monthly income, household income, expense-to-income ratio, number of dependents, previous monthly rent, and number of months late payment. The marital status factor was then excluded from the logistic regression model because of its low significance to the model. A tenant’s credit score was derived from the tenant’s estimated probability of defaulting. The main factors of the tenant’s credit score are the number of months late payment, the expense-to-income ratio, gender, previous monthly rent, and age. There is no underfitting or overfitting in the proposed credit scoring model, which means the model’s bias and variance are low.

Keywords

credit scoring, logistic regression, penalized maximum likelihood

Introduction

According to the World Bank Group (2022), Malaysia ranks 55th out of 157 countries on the Human Capital Index. To achieve high-income and developed country status, Malaysia will need to advance further in education, health and nutrition, and social protection outcomes. The key priority areas include improving the quality of schooling, rethinking nutritional interventions to reduce childhood stunting, and providing adequate social welfare protection for household investments in human capital formation.

The issue of affordable housing has always been a hot topic in many countries around the world, including Malaysia. Household income in Malaysia is classified into three categories: B40, M40, and T20. B40 represents the bottom 40% of Malaysian households, whose household income is below RM4,850 per month. M40 is the middle 40% of households, with a household income between RM4,850 and RM10,959 per month. Finally, T20 represents the top 20% of households, with a household income of at least RM10,960 per month (Department of Statistics Malaysia, 2020). The housing issue has become more serious as 20% of the households in the M40 group, or about 600,000 households, have slipped into the B40 group as a result of the Covid-19 crisis (The Star, 2021).

According to the Central Bank of Malaysia (2018), the key reasons for housing loan rejection include insufficient income to support debt repayment, adverse credit history, and inadequate income or financial documentation. In Malaysia, the government has introduced several housing schemes to help the M40 and B40 groups own a house, such as Perumahan Rakyat 1 Malaysia (PR1MA), Program Perumahan Rakyat (PPR), and the Rent-to-Own scheme (Liu & Ong, 2021). However, because the number of units is limited, not all low-income households will benefit from the housing schemes. In addition, 60% of affordable home loan applications are rejected by banks and financial institutions due to the applicants’ age or poor credit scores (The Sun Daily, 2021). A credit score is a creditworthiness indicator used by banks and financial institutions to determine their potential borrowers’ likelihood of defaulting on a loan. The higher the loan applicant’s credit score, the higher the chance of the loan application being approved.

In Malaysia, the Central Credit Reference Information System (CCRIS) is a system created by the Central Bank of Malaysia to synthesize the credit information of borrowers and is available to every financial institution. The CCRIS report shows the outstanding loans, special attention accounts, and the number of approved or rejected loan or credit facility applications made in the past 12 months, but without providing a credit score (Ebekozien et al., 2019). Besides, Malaysians can obtain their credit reports with credit scores through the private credit reporting agencies in Malaysia, such as Credit Tip-Off Service (CTOS) and RAM Credit Information Sdn. Bhd. (RAMCI). In the United States, the FICO score created by Fair Isaac Corporation (FICO) and VantageScore introduced by United States national consumer reporting agencies (“NCRAs”), that is, Experian, Equifax, and TransUnion, are the common credit scores used (Albanese, 2021). The FICO, VantageScore, and CTOS credit scores utilize similar factors, that is, payment history, credit amounts owed, length of credit history, credit mix, and new credit but with different proportions.

In the past, credit bureaus such as FICO and Experian used credit history as the only factor of the credit score. A credit scoring model that depends only on credit history cannot produce credit scores for individuals with little or no credit history. As a result, some credit bureaus have generated credit scoring models using additional non-financial data, for example, the use of rental payment records by Experian and the use of utility data, evictions, and other variables by FICO (Djeundje et al., 2021). Njuguna and Sowon (2021) reviewed the research papers that use non-financial data such as rental payment records, utility data, criminal history, and delinquency.

Besides, some papers utilized other non-financial data such as individual characteristics, loan characteristics, and behavioral variables to compute the probability of default or credit score. Lin et al. (2017) stated that gender, age, marital status, educational level, working years, company size, monthly payment, loan amount, debt-to-income ratio, and delinquency history play a significant role in loan defaults in China. Chamboko and Bravo (2019) found gender, age, income earned, debt-to-income ratio, loan terms, and the number of past missed payments to be significant factors in Zimbabwe. On the other hand, Adzis et al. (2020) and Saha et al. (2021) concluded that house equity, age, gender, ethnicity, location, type of occupation, guarantor availability, and loan characteristics such as payment-to-income (PTI) ratio, original loan balance, loan tenure, loan interest rate, and loan-to-value (LTV) ratio are the significant factors that influence loan default in Malaysia. Individuals who lack credit histories will have a better chance of getting a loan if financial institutions use non-financial data to develop the credit scoring model.

A wide range of papers applied machine learning methods such as neural networks, support vector machines, logistic regression, and genetic programming to develop credit scoring models (Louzada et al., 2016). Besides, some papers applied hybrid credit scoring models. For example, Munkhdalai et al. (2020) used a hybrid credit scoring model with neural networks and logistic regression, whereas Kumar et al. (2021) used a hybrid credit scoring model with neural networks and the k-means algorithm.

Problem Statement

Individuals with very little credit history, or thin files, are referred to as “credit unscored,” and those without any credit history are referred to as “credit invisible” (Njuguna & Sowon, 2021). In Malaysia, the B40 category in rural areas is usually classified as “credit unscored” or “credit invisible” because its members have no credit records, or have poor credit scores due to insufficient credit history, to support their housing loan applications. Hence, they normally rent a property since they cannot afford to own it. However, their rental payment records are not taken into account in housing loan applications. In addition, no agency in Malaysia has introduced credit scoring for tenants.

Nowadays, the use of credit scores has extended from banks to other areas such as rental property and car and home insurance (Njuguna & Sowon, 2021). For example, TransUnion introduced “ResidentScore,” which utilizes rental data to predict the likelihood of eviction. Furthermore, Turner and Walker (2019) showed that adding rental payment data as a factor in the FICO or VantageScore credit scoring models tends to dramatically reduce the number of individuals who cannot be scored.

In addition, some studies generated credit scoring models without using credit history. Berg et al. (2020) proposed a credit scoring model using only digital footprint variables such as device type, operating system, and email host. To create a different model for comparison, they also considered credit bureau scores together with the digital footprint variables. They concluded that digital footprint variables complement rather than substitute for credit bureau information. Additionally, email usage variables such as the fraction of emails sent in certain periods, the fraction of emails sent or received from non-top financial product providers, and the number of contacts sent were used to build a credit scoring model (Djeundje et al., 2021). The same study also found that a model that incorporates email usage and psychometric variables performs better than a model that incorporates only individual characteristics. Shema (2019) generated a credit scoring model based only on mobile airtime recharge or top-up history. However, no study has generated a credit scoring model that does not depend on credit history in Malaysia.

In this research, we focus on computing the tenant’s credit score, especially the credit score of the B40 group who rent a house. The credit scoring model proposed in this study does not depend on the credit history of tenants but is based on the tenant’s individual characteristics, monthly rent, and rent payment behavior. This credit scoring will increase the confidence of future property owners and developers to select the “credit unscored” or “credit invisible” B40 group as their potential customers. Furthermore, this credit scoring will make it possible to score more members of the low-income group with limited credit history and hence might increase the approval rate of their loan applications. Multivariable logistic regression is implemented to develop a credit scoring model for the tenants in this study. The parameters of the multivariable logistic regression model are estimated using the maximum likelihood method, and the performance of the proposed model is also evaluated. In Section “Methodology,” the methodology implemented in this study is explained in detail. Next, the obtained results are presented and discussed in Section “Results and discussion.”

Methodology

Data

In this study, the factors initially considered to affect tenants’ credit scores are their gender, age, marital status, monthly income, household income, expense-to-income ratio, number of dependents, previous monthly rent, and number of months late payment. Hence, this information was collected from a landlord company in Malaysia. In this study, rent paid after 1 week is considered a late payment. Moreover, a tenant who makes late payments for more than 2 months is assumed to default on rent; otherwise, the tenant is assumed not to default. A total of 33 records were collected, of which 7 (21.21%) are considered default. The statistical description of the collected data is shown in Table 1. As shown in Table 1, 93.94% of the respondents are in the B40 group and 81.82% of them are single. The collected data were next transformed into numerical data based on Tables 2 and 3 and were then split into training data (70%) and testing data (30%).

Table 1.

Statistical Description of Data.

Factors affecting credit score	Category	Percentage (%)
Gender	Female	45.45
	Male	54.55
Age	18–25	60.61
	26–35	27.27
	36–45	6.06
	46–55	3.03
	56–65	3.03
Marital status	Single	81.82
	In a relationship/engaged/divorced/widow	15.15
	Married	3.03
Monthly income (RM)	<1,200	27.27
	1,200–1,999	15.15
	2,000–2,999	39.39
	3,000–3,999	9.09
	4,000–4,999	3.03
	5,000–5,999	3.03
	6,000–7,999	0.00
	≥8,000	3.03
Household income group	B1	60.61
	B2	18.18
	B3	6.06
	B4	9.09
	M1	3.03
	T20	3.03
Expense-to-income ratio	≤0.35	18.18
	0.36–0.49	15.15
	0.5–0.99	57.58
	≥1	9.09
Number of dependents	0	63.64
	1–2	24.24
	3–4	6.06
	≥5	6.06
Previous monthly rent (RM)	1–499	48.48
	500–749	45.45
	750–999	6.06
Number of months late payment	0	48.48
	1	21.21
	2	9.09
	3	9.09
	4	12.12

Table 2.

Category of Factors Affecting Credit Score.

Factors affecting credit score	Code	Category
Gender	0	Female
	1	Male
Age	1	18–25
	2	26–35
	3	36–45
	4	46–55
	5	56–65
	6	>65
Marital status	1	Single
	2	In a relationship/engaged/divorced/Widow
	3	Married
Monthly income (RM)	1	<1,200
	2	1,200–1,999
	3	2,000–2,999
	4	3,000–3,999
	5	4,000–4,999
	6	5,000–5,999
	7	6,000–7,999
	8	≥8,000
Household income group	1	B1
	2	B2
	3	B3
	4	B4
	5	M1
	6	M2
	7	M3
	8	M4
	9	T20
Expense-to-income ratio	1	≤0.35
	2	0.36–0.49
	3	0.5–0.99
	4	≥1
Number of dependents	1	0
	2	1–2
	3	3–4
	4	≥5
Previous monthly rent (RM)	1	0
	2	1–499
	3	500–749
	4	750–999
	5	1,000–1,249
	6	≥1,250
Number of months late payment	0–6

Table 3.

Income level for household income decile group. Adapted from Household Income and Basic Amenities Survey Report (p. 75), by (Department of Statistics Malaysia, 2020). (https://bit.ly/HISReportMalaysia). Copyright 2020 by Department of Statistics Malaysia. Adapted with permission.

Household income decile group	Income level (RM)
B40
B1	<2,500
B2	2,500–3,169
B3	3,170–3,969
B4	3,970–4,849
M40
M1	4,850–5,879
M2	5,880–7,099
M3	7,100–8,699
M4	8,700–10,959
T20
T1	10,960–15,039
T2	>15,039
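The encoding in Tables 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names `household_income_code` and `is_default` are hypothetical, the income thresholds are read from Table 3, and all T20 households are assumed to share the single code 9 listed in Table 2.

```python
from bisect import bisect_right

# Lower bounds (RM) of the groups B2, B3, B4, M1, M2, M3, M4, and T20 from Table 3.
GROUP_LOWER_BOUNDS = [2500, 3170, 3970, 4850, 5880, 7100, 8700, 10960]


def household_income_code(income_rm: float) -> int:
    """Return the Table 2 code (1 = B1, ..., 8 = M4, 9 = T20) for a household income."""
    # Count how many group lower bounds the income has reached, then shift to start at 1.
    return bisect_right(GROUP_LOWER_BOUNDS, income_rm) + 1


def is_default(months_late: int) -> int:
    """Label a tenant as default (1) when rent was paid late for more than 2 months."""
    return 1 if months_late > 2 else 0
```

For example, a household income of RM4,849 maps to code 4 (B4), while RM4,850 maps to code 5 (M1).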

Multivariable Logistic Regression

Logistic regression expresses the log odds of a binary outcome (event success) as a linear combination of the independent variables, or factors, considered (Botes, 2013). The factors considered in logistic regression can be either qualitative or quantitative. Logistic regression transforms this linear combination into the probability of event success, which ranges from zero to one. Therefore, logistic regression can also serve as a classifier that labels an event as a success or failure based on the computed probability. Multivariable logistic regression is logistic regression with more than one factor. The logit of the multivariable logistic regression is (Grant et al., 2019)

g (x) = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p},

(1)

where $x_{1}, \dots, x_{p}$ are the independent variables or factors considered, which are gender, age, marital status, monthly income, household income, expense-to-income ratio, number of dependents, previous monthly rent, and number of months late payment, $p$ is the number of factors considered, and $β_{0}, \dots, β_{p}$ are the regression parameters.

Let the binary dependent variable $y = 1$ indicate that the tenant defaults and $y = 0$ indicate that the tenant does not default, and let the conditional probability of default be $π (x)$ . Then

P (y = 1 | x) = π (x),

(2)

P (y = 0 | x) = 1 - π (x) .

(3)

On the other hand, the logit transformation of $π (x)$ is the natural log of the odds, which can be written as

logit (π (x)) = \ln (\frac{π (x)}{1 - π (x)}) .

(4)

Setting the logit transformation of $π (x)$ equal to $g (x)$ and solving for $π (x)$ , the logistic regression is given by

π (x) = \frac{1}{1 + e^{- g (x)}},

(5)

which is also a sigmoid function (Zou et al., 2019).
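Equations 4 and 5 are inverses of each other, which a short numerical check can make concrete. The sketch below is illustrative only; the function names `logit` and `sigmoid` are chosen here and do not come from the study.

```python
import math


def logit(p: float) -> float:
    """Natural log of the odds, Equation 4 (requires 0 < p < 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(g: float) -> float:
    """Probability recovered from the logit, Equation 5."""
    return 1.0 / (1.0 + math.exp(-g))


# A logit of zero corresponds to even odds, that is, a probability of 0.5.
p_even = sigmoid(0.0)
# Applying the logit and then the sigmoid returns the original probability.
p_roundtrip = sigmoid(logit(0.2))
```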

Multicollinearity arises when several independent variables are highly correlated with each other. This problem can lead to the logistic regression model overfitting the data (Bolton, 2010). In this study, the absence of multicollinearity among the factors in the training data set was verified by checking that the absolute Spearman’s correlation coefficient of every pair of factors in the training data set was less than .8 (Marime et al., 2020).
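The pairwise check described here can be sketched as follows. This is a minimal pure-Python illustration, assuming Spearman’s coefficient is computed as the Pearson correlation of average ranks; the helper names and the toy values are hypothetical and are not the study’s data.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def highly_correlated_pairs(columns, threshold=0.8):
    """Return the factor pairs whose absolute coefficient reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(spearman(columns[a], columns[b])) >= threshold
    ]


# Made-up encoded values for three factors (not the study's data).
toy = {
    "age": [1, 2, 2, 3, 1, 4],
    "monthly_income": [1, 3, 2, 4, 1, 5],
    "gender": [0, 1, 0, 1, 1, 0],
}
flagged = highly_correlated_pairs(toy)
```

In this toy example only the pair of age and monthly income is flagged; an empty list would correspond to the situation reported for the training data.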

Maximum Likelihood Estimation

Maximum likelihood estimation can be utilized to estimate the parameters of logistic regression, $β$ by maximizing the likelihood function. The likelihood function for logistic regression with binary outcome dependent variable, $y$ can be expressed as (Bolton, 2010)

ℓ (β) = \prod_{i = 1}^{n} π (x_{i})^{y_{i}} [1 - π (x_{i})]^{1 - y_{i}},

(6)

where $n$ is the total number of independent observations. Because the natural log is a strictly increasing function, maximizing the likelihood function is equivalent to maximizing the log likelihood function, and since the latter is much easier to compute, the natural log of the likelihood function is maximized for parameter estimation (Bolton, 2010). The natural log of the likelihood function, $L (β)$ , can be defined as

L (β) = \sum_{i = 1}^{n} \{y_{i} \ln [π (x_{i})] + (1 - y_{i}) \ln [1 - π (x_{i})]\} .

(7)
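Equation 7 can be evaluated directly once the predicted probabilities are known. The sketch below is illustrative only: the function name `log_likelihood` and the outcome and probability values are hypothetical and are not taken from the study.

```python
import math


def log_likelihood(y, probabilities):
    """Equation 7: sum of y*ln(pi) + (1 - y)*ln(1 - pi) over all observations."""
    return sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, probabilities)
    )


# Made-up outcomes and predicted default probabilities (not the study's data).
y = [1, 0, 0, 1]
good_fit = [0.9, 0.2, 0.1, 0.8]
poor_fit = [0.5, 0.5, 0.5, 0.5]

ll_good = log_likelihood(y, good_fit)
ll_poor = log_likelihood(y, poor_fit)
```

Probabilities that agree better with the observed outcomes give a larger (less negative) log likelihood, and exponentiating the result recovers the product in Equation 6.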

In theory, maximum likelihood estimates do not exist if the data is separated (Albert & Anderson, 1984). The data of a logistic regression fall into one of three mutually exclusive and exhaustive classes, that is, complete separation, quasi-complete separation, and overlap (Botes, 2013). Separation, whether complete or quasi-complete, occurs most frequently under the same conditions that lead to small-sample and sparse-data bias, such as the presence of a rare outcome and multicollinearity among independent variables (Mansournia et al., 2018). Konis (2007) discussed the existing methods for separation detection and proposed a new method for separation detection using linear programming. The proposed linear program to be solved is

D = [\begin{matrix} 1 & x_{11} & \dots & x_{p 1} \\ 1 & x_{12} & \dots & x_{p 2} \\ \dots & \dots & \dots & \dots \\ 1 & x_{1 n} & \dots & x_{pn} \end{matrix}],

(8)

β = [\begin{matrix} β_{0} \\ β_{1} \\ ⋮ \\ β_{p} \end{matrix}],

(9)

\begin{matrix} \text{maximize} & \sum_{j = 1}^{p + 1} X_{1 j} β - \sum_{j = 1}^{p + 1} X_{0 j} β \\ \text{subject to} & X_{0} β \leq 0, \\ & X_{1} β \geq 0, \\ & β \text{ is unrestricted,} \end{matrix}

(10)

where $X_{0}$ is the submatrix of $D$ when $y_{i} = 0$ while $X_{1}$ is the submatrix of $D$ when $y_{i} = 1$ , whereas $X_{0 j}$ and $X_{1 j}$ are the elements in column $j$ of $X_{0}$ and $X_{1}$ , respectively. If the solution of the linear program is unbounded, the data is separated. Otherwise, the data is overlapped where a zero vector is the only feasible solution for the linear program. This linear program was solved to detect separation in this study.
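One way to run this check numerically is sketched below with `scipy.optimize.linprog`. It is a hedged variant rather than the study’s implementation: instead of testing for an unbounded solution, each component of β is assumed to be restricted to the interval from −1 to 1, so the program always has a finite optimum, and a strictly positive optimum is then read as separation. The function name `is_separated`, the tolerance, and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def is_separated(X, y, tol=1e-6):
    """Detect complete or quasi-complete separation with the linear program (10).

    Assumption: beta is boxed into [-1, 1] so the program has a finite optimum;
    a strictly positive optimum stands in for an unbounded solution.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    D = np.column_stack([np.ones(len(X)), X])  # design matrix with intercept column
    X0, X1 = D[y == 0], D[y == 1]

    # linprog minimizes, so negate the objective sum(X1 beta) - sum(X0 beta).
    c = -(X1.sum(axis=0) - X0.sum(axis=0))
    # Constraints X0 beta <= 0 and X1 beta >= 0, both written as A_ub beta <= 0.
    A_ub = np.vstack([X0, -X1])
    b_ub = np.zeros(len(D))
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1.0, 1.0)] * D.shape[1])
    return bool(result.status == 0 and -result.fun > tol)


# Toy data with one factor (not the study's data).
x = [[1.0], [2.0], [3.0], [4.0]]
separated = is_separated(x, [0, 0, 1, 1])   # the classes split cleanly between 2 and 3
overlapped = is_separated(x, [0, 1, 0, 1])  # the classes interleave
```

The box constraint does not change the verdict, because any nonzero feasible β can be rescaled to fit inside the box while keeping a positive objective value.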

In this research, the parameters of the logistic regression were estimated by maximum likelihood estimation with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm through the Scikit-learn library in Python. If the training data is overlapped, the objective function of the maximum likelihood estimation is the natural log of the likelihood function, $L (β)$ , without a penalty. If the training data is separated, penalized maximum likelihood estimation is applied, and the objective function combines $L (β)$ with a ridge penalty, which is

\begin{matrix} \text{minimize } F (β) = - L (β) + λ R (β), \\ R (β) = \sum_{j = 1}^{p} β_{j}^{2}, \end{matrix}

(11)

where $λ$ is the positive regularization strength and $R (β)$ is the ridge penalty (Duffy & Santner, 1989).
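The effect of the ridge penalty on separated data can be illustrated with Scikit-learn’s `LogisticRegression`, which applies an L2 penalty by default. The sketch below is not the study’s code: the toy data are made up, and it should be noted that Scikit-learn is parameterized by `C`, the inverse of the regularization strength, so how $λ$ in Equation 11 maps onto `C` is an assumption that the text does not specify.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, completely separated data with one factor (not the study's data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The default penalty is L2 (ridge). C is the inverse of the regularization
# strength, so a smaller C means a stronger penalty.
strong = LogisticRegression(C=1.0, solver="lbfgs").fit(X, y)
weak = LogisticRegression(C=1000.0, solver="lbfgs", max_iter=10000).fit(X, y)

slope_strong = float(strong.coef_[0, 0])
slope_weak = float(weak.coef_[0, 0])
```

With the stronger penalty the fitted slope stays moderate, whereas weakening the penalty lets the slope grow, which reflects the nonexistence of finite unpenalized estimates under separation.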

Factor Reduction

The closer a logistic coefficient, $β$ , is to zero, the less influence the predictor, $x$ , has in predicting the logit, $g (x)$ (Starkweather, 2011). Therefore, a factor whose logistic coefficient is close to zero was tested for its significance to the logistic regression. The Likelihood Ratio, Wald, and Score Tests are the statistical tests commonly used to test the significance of coefficients in logistic regression. The null hypothesis states that the coefficients of the removed factors are equal to zero, while the alternative hypothesis states that at least one of these coefficients is not zero (Sur & Candes, 2019). The three tests are asymptotically equivalent and optimal when the sample size is large (Rayner, 1997). However, the Score Test differs from the Likelihood Ratio Test and the Wald Test in that it is based on the distribution of the derivative of the log likelihood function and does not require computing the maximum likelihood estimates of the coefficients of the logistic regression (Bolton, 2010).

According to Agresti (2018), the Wald Test is the least reliable of the three tests when the sample size is small to moderate. Therefore, the Likelihood Ratio Test was applied rather than the Wald Test in this study to test the significance of the logistic coefficient, since the training data set is small. Under the null hypothesis, the Likelihood Ratio Test statistic approximately follows a chi-square distribution with degrees of freedom equal to the number of factors removed, so the null hypothesis is not rejected when the statistic does not exceed the chi-square critical value at the chosen significance level (Botes, 2013). The Likelihood Ratio Test statistic can be defined as

Likelihood Ratio Test Statistic = - 2 \ln (\frac{L_{0}}{L_{1}}),

(12)

where $L_{0}$ is the maximized likelihood of the simpler model, from which the tested factors have been removed, while $L_{1}$ is the maximized likelihood of the model that includes all factors.
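When a single factor is removed, the test has one degree of freedom, and the chi-square tail probability can be written with the complementary error function. The sketch below covers only that one-factor case; the function name, the 5% significance level, and the log likelihood values are hypothetical and are not results from the study.

```python
import math


def likelihood_ratio_test_one_factor(loglik_reduced, loglik_full, alpha=0.05):
    """Likelihood ratio test for removing a single factor (one degree of freedom).

    Takes maximized log likelihoods, so the statistic -2 ln(L0 / L1) becomes
    -2 * (ln L0 - ln L1).
    """
    statistic = -2.0 * (loglik_reduced - loglik_full)
    # Upper tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value, p_value < alpha


# Hypothetical log likelihoods (not results from the study).
statistic, p_value, reject_null = likelihood_ratio_test_one_factor(-10.30, -10.25)
```

A small statistic such as this one gives a large p-value, so the null hypothesis that the removed factor has a zero coefficient is not rejected and the simpler model is retained.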

Tenant’s Credit Scoring Model

In this research, we proposed a credit scoring model with a minimum credit score of zero and a maximum score of 100. For the proposed model, the tenant with lower probability of default will have higher credit score. Since $π (x)$ is the probability of the tenant defaulting, the probability of the tenant not defaulting is equal to $1 - π (x)$ . Hence, the proposed credit score of the tenant is

Credit Score = 100 (1 - π (x)) .

(13)
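Equation 13 combines directly with Equation 5. The following sketch is illustrative only; the function name `credit_score` and the logit values are hypothetical and are not fitted values from the study.

```python
import math


def credit_score(g: float) -> float:
    """Equation 13: 100 * (1 - pi(x)), with pi(x) the sigmoid of the logit g(x)."""
    probability_of_default = 1.0 / (1.0 + math.exp(-g))
    return 100.0 * (1.0 - probability_of_default)


# Hypothetical logit values (not fitted values from the study).
score_even_odds = credit_score(0.0)   # probability of default of 0.5
score_low_risk = credit_score(-3.0)   # strongly negative logit, low default risk
score_high_risk = credit_score(3.0)   # strongly positive logit, high default risk
```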

Predictive Performance of Model

In this research, a tenant in the testing data was classified as default if the computed probability of default satisfied $π (x) \geq 0.5$ , and as not default otherwise. Furthermore, the classification performance measures of the logistic regression, that is, accuracy, precision, and recall, were computed from the confusion matrix (Hackeling, 2017).

Confusion Matrix = [\begin{matrix} TN & FP \\ FN & TP \end{matrix}],

(14)

where $TN$ represents the number of true negatives, $FP$ is the number of false positives, $FN$ represents the number of false negatives, and $TP$ is the number of true positives.

Accuracy = \frac{TN + TP}{TN + TP + FN + FP}

(15)

Precision = \frac{TP}{FP + TP}

(16)

Recall = \frac{TP}{FN + TP}

(17)
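Equations 15 to 17 translate directly into code. In the sketch below, the function name `classification_metrics` and the four counts are hypothetical and serve only to illustrate the arithmetic.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, and recall from Equations 15 to 17.

    Precision and recall are undefined when their denominators are zero;
    this sketch does not handle that case.
    """
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "precision": tp / (fp + tp),
        "recall": tp / (fn + tp),
    }


# Hypothetical counts for ten tenants (not taken from the study's figures).
metrics = classification_metrics(tn=6, fp=0, fn=2, tp=2)
```

With these counts the accuracy is 0.8, the precision is 1.0 because there are no false positives, and the recall is 0.5 because half of the actual defaults are missed.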

There are two fundamental causes of the machine learning classifier’s prediction error, that is, the model’s bias and its variance (Hackeling, 2017). A model with high variance overfits the training data, while a model with high bias underfits the training data. Overfitting occurs if the accuracy of training data is significantly higher than testing data, while underfitting occurs if the accuracy of both training and testing data is low (Gu et al., 2016).

Besides, the area under the receiver operating characteristic (ROC) curve (AUC) was computed to determine the ability of the model to distinguish the default and not default classes. ROC curve plots the classifier’s true positive rate $(TPR)$ against its false positive rate $(FPR)$ (Hackeling, 2017). In this research, AUC was calculated using the trapezoidal rule, which can be written as (Bamber, 1975)

TPR = Recall = \frac{TP}{FN + TP},

(18)

FPR = \frac{FP}{FP + TN},

(19)

\begin{matrix} AUC \approx \frac{1}{2 m} [TPR (FPR_{0}) + 2 TPR (FPR_{1}) \\ + 2 TPR (FPR_{2}) + \dots \\ + 2 TPR (FPR_{m - 1}) + TPR (FPR_{m})], \end{matrix}

(20)

where $m$ is the number of equal-width subintervals into which the $FPR$ range from zero to one is divided.
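The trapezoidal rule can be sketched for ROC points that are not necessarily equally spaced; with $m$ equal-width subintervals it reduces to Equation 20. The function name `auc_trapezoid` and the ROC points below are hypothetical and do not represent the study’s curve.

```python
def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule.

    fpr and tpr are matching sequences of ROC points sorted by increasing FPR.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        # Each trapezoid has the average of its two TPR values as its height.
        area += width * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Hypothetical ROC points (not the study's curve).
perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # perfect classifier
chance = auc_trapezoid([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])   # diagonal line
```

A perfect classifier gives an area of 1, and a classifier no better than chance gives an area of 0.5.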

The higher the AUC, the better the model is at distinguishing between the default and not default classes. However, the risk of overfitting is higher if the AUC of the training data is higher than the AUC of the testing data (Nusinovici et al., 2020). Hence, the accuracy and AUC of both the training and testing data were compared to check whether the model is overfitting or underfitting.

The flow chart of the research methodology can be summarized in Figure 1.

Figure 1.

Flow chart of tenant’s credit scoring model.

Results and Discussion

Logistic Regression Results

The Spearman’s correlation matrix of training data is presented in Table 4. From Table 4, we can conclude that there is no multicollinearity among the factors in the training data set as the correlation coefficient of any two factors is less than .8. In addition, separation was detected in the training data set by solving the linear program (Equation 10). Thus, the parameters of the logistic regression model were estimated using the penalized maximum likelihood estimation with ridge regression. In this study, the regularization strength of the penalty was set as one.

Table 4.

Spearman’s Correlation Matrix of Training Data.

Factor, $x$	1	2	3	4	5	6	7	8	9
1	1.00	.37	.27	.21	.23	−.03	.01	.16	.13
2	.37	1.00	.03	.47	.21	−.21	−.37	−.44	.05
3	.27	.03	1.00	.01	−.02	.06	.05	.01	−.05
4	.21	.47	.01	1.00	.59	−.15	−.08	.00	−.50
5	.23	.21	−.02	.59	1.00	.21	.24	.10	−.44
6	−.03	−.21	.06	−.15	.21	1.00	.15	.30	.06
7	.01	−.37	.05	−.08	.24	.15	1.00	.14	.01
8	.16	−.44	.01	.00	.10	.30	.14	1.00	−.34
9	.13	.05	−.05	−.50	−.44	.06	.01	−.34	1.00

It is noticed that only the logistic coefficient of marital status is close to zero among the 9 factors. The result of the Likelihood Ratio Test of marital status is visualized in Figure 2. As shown in Figure 2, the marital status factor can be removed as it is insignificant to the model. The comparison of the logistic coefficients with all factors and without marital status is presented in Table 5. From Table 5, there is not much difference between logistic coefficients of the factors before and after the marital status factor was removed from the model. This result is consistent with the result of the Likelihood Ratio Test.

Figure 2.

Result of likelihood ratio test.

Table 5.

Comparison of Logistic Coefficient with All Factors Versus Without Marital Status.

Factor, $x$	Coefficient, $β$ (all factors)	Coefficient, $β$ (without marital status)
Gender	0.4189	0.4167
Age	0.3309	0.3302
Marital status	−0.0737	—
Monthly income	−0.1287	−0.1300
Household income	−0.1175	−0.1172
Expense-to-income ratio	0.4816	0.4814
Number of dependents	−0.1276	−0.1277
Previous monthly rent	−0.3567	−0.3566
Number of months late payment	1.2891	1.2930

Let $x_{i, j}$ be the factor $x_{i}$ at category $j$ and $g_{1} (x)$ be the part of the linear equation $g (x)$ (Equation 1) that does not include the factor $x_{i}$ . Since the odds equal $e^{g (x)} = e^{g_{1} (x) + β_{i} x_{i}}$ , the common term $e^{g_{1} (x)}$ cancels in the ratio, and the odds ratio of factor $x_{i}$ at category $j_{2}$ to category $j_{1}$ can be written as

Odds Ratio = e^{β_{i} (x_{i, j_{2}} - x_{i, j_{1}})} .

(21)

Let $Δ x_{i} = x_{i, j_{2}} - x_{i, j_{1}}$ , then the odds ratio $= e^{β_{i} Δ x_{i}}$ . When $x_{i}$ increases by one unit, $Δ x_{i} = 1$ , then the odds ratio $= e^{β_{i}}$ . The analysis of the logistic coefficient without the marital status factor is presented in Table 6. According to Table 6, the number of months late payment, the expense-to-income ratio, gender, previous monthly rent, and age are the main factors of the tenant’s credit score.

Table 6.

Analysis of Logistic Coefficient.

Factor, $x$	Coefficient value, $β$	Sign	Ranking	Odds ratio (increase by one unit)
Gender	0.4167	+	3	1.5170
Age	0.3302	+	5	1.3913
Monthly income	−0.1300	−	6	0.8781
Household income	−0.1172	−	8	0.8894
Expense-to-income ratio	0.4814	+	2	1.6183
Number of dependents	−0.1277	−	7	0.8802
Previous monthly rent	−0.3566	−	4	0.7000
Number of months late payment	1.2930	+	1	3.6436
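The odds ratios and the ranking in Table 6 can be reproduced from the coefficients with Equation 21 for a one-unit increase. The sketch below copies the coefficients from Table 6; ranking the factors by the absolute value of their coefficients is an assumption inferred from the table, and small differences in the fourth decimal place are expected because the published coefficients are rounded.

```python
import math

# Coefficients of the model without marital status, copied from Table 6.
coefficients = {
    "Gender": 0.4167,
    "Age": 0.3302,
    "Monthly income": -0.1300,
    "Household income": -0.1172,
    "Expense-to-income ratio": 0.4814,
    "Number of dependents": -0.1277,
    "Previous monthly rent": -0.3566,
    "Number of months late payment": 1.2930,
}

# Odds ratio for a one-unit increase in each factor: exp(beta).
odds_ratios = {factor: math.exp(beta) for factor, beta in coefficients.items()}

# Rank factors by the absolute size of their coefficients (first = most influential).
ranking = sorted(coefficients, key=lambda f: abs(coefficients[f]), reverse=True)
```

For instance, each additional month of late payment multiplies the odds of default by roughly 3.64, while a one-category increase in previous monthly rent multiplies them by roughly 0.70.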

Notice that a positive logistic coefficient indicates that the higher the $x_{i}$ value, the higher the probability of default, and vice versa. Therefore, male tenants are more likely to default than female tenants, which is similar to the findings of Lin et al. (2017), Chamboko and Bravo (2019), Adzis et al. (2020), and Saha et al. (2021). Adzis et al. (2020) recommended that Malaysian lenders create a new mortgage loan product specifically for female borrowers, since women are more responsible in their loan repayment than men. Besides, we can conclude that the credit score of the tenant decreases when the tenant’s number of months late payment, expense-to-income ratio, or age increases. Meanwhile, the probability of default increases if the tenant’s previous monthly rent, monthly income, number of dependents, or household income decreases.

The default probability increases with age, possibly because tenants become more likely to forget or miss a rental payment as they get older. On the other hand, the more dependents a tenant has, the greater the tenant’s sense of responsibility to pay rent, which may explain why the default rate decreases with the number of dependents. The decrease in the probability of default with rent may reflect tenants choosing a rent level according to their ability to pay.

Predictive Performance of Logistic Regression

The confusion matrix of testing data and confusion matrix of training data are shown in Figure 3(a) and (b), respectively. The predictive performances of logistic regression on the testing and training data are also summarized in Table 7. From Table 7, we can say that neither underfitting nor overfitting occurs in this research since the accuracy of testing data is slightly lower than the accuracy of training data and the AUC of both training data and testing data are the same.

Figure 3.

Confusion matrix: (a) confusion matrix of testing data and (b) confusion matrix of training data.

Table 7.

Predictive Performance of Logistic Regression on Testing Data and Training Data.

Data set	Testing data	Training data
Accuracy	0.8	1.0
Precision	1.0	1.0
Recall	0.5	1.0
AUC	1.0	1.0

Conclusion

This study proposes a credit scoring model for tenants in Malaysia. This study found that the number of months late payment, monthly rent, and the tenant’s individual characteristics, that is, the expense-to-income ratio, gender, and age, were the main factors affecting the credit score. The proposed credit scoring will increase the confidence of future property owners and developers to select the low-income group, especially the B40 group, as their potential customers. Besides, this credit scoring will reduce the number of people in the low-income group with limited credit history who cannot be scored and might increase their probability of getting a loan. Lastly, this paper can be a reference for future research to develop a credit scoring system that does not depend on credit history.

Footnotes

Acknowledgements

Communication of this research is made possible through monetary assistance by Ministry of Higher Education (MOHE) via Fundamental Research Grant Scheme (FRGS) (FRGS/1/2021/STG06/UTHM/02/1).

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received financial support for the research, authorship, and/or publication of this article: Communication of this research is made possible through monetary assistance by Ministry of Higher Education (MOHE) via Fundamental Research Grant Scheme (FRGS) (FRGS/1/2021/STG06/UTHM/02/1).

ORCID iD

Siti Suhana Jamaian

References

Adzis, A. A., Lim, H. E., Yeok, S. G., & Saha (2020). Malaysian residential mortgage loan default: A micro-level analysis. Review of Behavioral Finance, 13(5), 663–681.

Agresti (2018). An introduction to categorical data analysis. John Wiley & Sons.

Albanese (2021). Big Data & Big Errors. Student Borrower Protection Center.

Albert, & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1), 1–10.

Bamber (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4), 387–415.

Berg, Burg, Gombovic, & Puri (2020). On the rise of fintechs: Credit scoring using digital footprints. The Review of Financial Studies, 33(7), 2845–2897.

Bolton (2010). Logistic regression and its application in credit scoring [Unpublished doctoral dissertation]. University of Pretoria.

Botes (2013). Comparing logistic regression methods for completely separated and quasi-separated data [Unpublished doctoral dissertation]. University of Pretoria.

Central Bank of Malaysia. (2018). Risk developments and assessment of financial stability in 2017.

Chamboko, & Bravo, J. M. (2019). Frailty correlated default on retail consumer loans in Zimbabwe. International Journal of Applied Decision Sciences, 12(3), 257–270.

Department of Statistics, Malaysia. (2020). Household income and basic amenities survey 2019 report. Department of Statistics Malaysia.

Djeundje, V. B., Crook, Calabrese, & Hamid (2021). Enhancing credit scoring with alternative data. Expert Systems with Applications, 163, 113766.

Duffy, D. E., & Santner, T. J. (1989). On the small sample properties of norm-restricted maximum likelihood estimators for logistic regression models. Communications in Statistics-Theory and Methods, 18(3), 959–980.

Ebekozien, Abdul-Aziz, A. R., & Jaafar (2019). Housing finance inaccessibility for low-income earners in Malaysia: Factors and solutions. Habitat International, 87, 27–35.

Grant, S. W., Hickey, G. L., & Head, S. J. (2019). Statistical primer: Multivariable regression considerations and pitfalls. European Journal of Cardio-Thoracic Surgery, 55(2), 179–185.

Gu, Wylie, B. K., Boyte, S. P., Picotte, Howard, D. M., Smith, & Nelson, K. J. (2016). An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data. Remote Sensing, 8(11), 943.

Hackeling (2017). Mastering machine learning with scikit-learn. Packt Publishing Ltd.

Konis, K. P. (2007). Linear programming algorithms for detecting separated data in binary logistic regression models [Unpublished doctoral dissertation]. University of Oxford.

Kumar

Shanthi

Bhattacharya

(2021). Credit score prediction system using deep learning and k-means algorithms. Journal of Physics: Conference Series, 1998(1), 012027.

20.

Lin

Zheng

(2017). Evaluating borrower’s default risk in peer-to-peer lending: Evidence from a lending platform in china. Applied Economics, 49(35), 3538–3545.

21.

Liu

Ong

H. Y.

(2021). Can malaysia’s national affordable housing policy guarantee housing affordability of low-income households? Sustainability, 13(16), 8841.

22.

Louzada

Ara

Fernandes

G. B.

(2016). Classification methods applied to credit scoring: Systematic review and overall comparison. Surveys in Operations Research and Management Science, 21(2), 117–134.

23.

Mansournia

M. A.

Geroldinger

Greenland

Heinze

(2018). Separation in logistic regression: Causes, consequences, and control. American Journal of Epidemiology, 187(4), 864–870.

24.

Marime

Magweva

Dzapasi

F. D.

(2020). Demographic determinants of financial literacy in the Masvingo Province of Zimbabwe. PM World Journal, 9(Iv), 1–19.

25.

Munkhdalai

Lee

J. Y.

Ryu

K. H.

(2020). A hybrid credit scoring model using neural networks and logistic regression. In Pan

J. S.

Tsai

P. W.

Jain

(Eds.), Advances in intelligent information hiding and multimedia signal processing (pp. 251–258). Springer.

26.

Njuguna

Sowon

(2021). Poster: A scoping review of alternative credit scoring literature. In ACM sigcas conference on computing and sustainable societies (COMPASS) (COMPASS’21, June 28 July 02, 2021, Virtual Event, Australia. ACM, New York, NY, USA, (pp. 437–444).

27.

Nusinovici

Tham

Y. C.

Yan

M. Y. C.

Ting

D. S. W.

Sabanayagam

Wong

T. Y.

Cheng

C. -Y.

(2020). Logistic regression was as good as machine learning for predicting major chronic diseases. Journal of Clinical Epidemiology, 122, 56–69.

28.

Rayner

(1997). The asymptotically optimal tests. Journal of the Royal Statistical Society: Series D (The Statistician), 46(3), 337–345.

29.

Saha

Lim

H. -E.

Siew

G. -Y.

(2021). Housing loan repayment behaviour in malaysia: An analytical insight. International Journal of Business and Economics, 20(2), 141–159.

30.

Shema

(2019). Effective credit scoring using limited mobile phone data. In Proceedings of the tenth international conference on information and communication technologies and development, Ahmedabad, India, (pp. 1–11). Association for Computing Machinery, New York.

31.

Starkweather

(2011). Sharpening occam’s razor: Using Bayesian model averaging in R to separate the wheat from the chaff. Benchmarks RSS Matters.

32.

Sur

Candes

E. J.

(2019). A modern maximum-likelihood theory for high- ‘ dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525.

33.

The Star. (2021). Roughly 600,000 families went from M40 to B40 due to pandemic, says Tok Pa.

34.

The Sun Daily. (2021). BMF offers recommendations to put affordable homes within reach of B40 families.

35.

Turner

Walker

(2019). Potential impacts of credit reporting public housing rental payment data. http://dx.doi.org/10.2139/ssrn.3615881

36.

World Bank Group. (2022). The world bank in Malaysia: Overview.

37.

Zou

Tian

Shen

(2019). Logistic regression model optimization and case analysis [Conference session]. 7th International Conference on Computer Science and Network Technology, Dalian, China, pp. 135–139, IEEE.