
Abstract

BACKGROUND:

HER2, ER, PR, and ERBB2 play a vital role in the treatment of breast cancer. They are important predictive and prognostic biomarkers of the disease.

OBJECTIVE:

We aim to obtain a biomarker-specific prediction of overall survival that indicates patients' survival and death risk.

METHODS:

Survival analysis is performed on data classified using Classification and Regression Tree (CART) analysis. Hazard ratios and confidence intervals are computed by maximum likelihood estimation (MLE) and by the Bayesian approach with the Cox proportional hazards (CPH) model in univariate and multivariable settings. CART is validated with the Brier score, and accuracy and sensitivity are obtained using the k-nn classifier.

RESULTS:

Using CART analysis, the cut-off values of the continuous-valued biomarkers HER2, ER, PR, and ERBB2 are obtained as 14.707, 8.128, 13.153, and 6.884, respectively. The Brier score of CART is 0.16, which supports the validity of the methodology. Survival analysis provides statistically significant survival estimates for all four biomarkers.

CONCLUSIONS:

Breast cancer patients whose HER2 value is below its cut-off value and whose ER, PR, and ERBB2 values are above their cut-off values are at lower risk of death than patients on the opposite side of these cut-off values for the same biomarkers.

Keywords

Cancer biomarker, Bayesian, boosting, survival, classification

1. Introduction

Breast cancer is currently the leading cause of cancer death among women and the most commonly diagnosed malignancy among them [1, 2]. According to GLOBOCAN 2020, there were an estimated 2,261,419 new cases of breast cancer (24.5% of all new cancer cases in women) and 684,996 deaths (15.5% of all cancer deaths in women) worldwide in 2020 [1]. The age-standardized rates (ASR) of incidence and mortality for 2020 are estimated at 47.8 and 13.6 per 100,000 population, respectively [1]. Nearly one in four cancer cases in women worldwide is breast cancer [3]. In developed countries such as the US, the rate of invasive breast cancer increased by 0.3% per year between 2007 and 2016, whereas the death rate decreased by 1.3% per year between 2013 and 2017 [4]. Treatment choices for breast cancer include surgery, radiation, chemotherapy, targeted therapy, hormone therapy, and immunotherapy (e.g., checkpoint inhibitors) [4]. Despite these many treatment strategies, deaths remain numerous. A biomarker-specific pathological strategy may improve outcomes in breast cancer and reduce death rates.

Human Epidermal growth factor Receptor 2 (HER2), Estrogen Receptor (ER), Progesterone Receptor (PR), and ERBB2 are the prominent biomarkers of breast cancer [5]. Tumors whose cells carry estrogen receptors (ER+) and progesterone receptors (PR+) are more likely to respond to hormonal therapy than tumors that are ER- or PR- [6]. Approximately 75% of breast cancers are found to be ER+ [7], and 65% to 75% are found to be PR+ [8]. Tamoxifen and Raloxifene have long been used to prevent breast cancer. These hormonal therapies are used for women at high risk of breast cancer [9]. More recently, aromatase inhibitors (e.g., Anastrozole, Exemestane, and Letrozole) have been used to prevent advanced ER+ breast cancer [9]. If the tumor is found to be hormone receptor-negative, hormonal therapy will not work. In that case, it can be treated with chemotherapy, surgery, or radiation therapy. ER- breast cancer can be treated with specific kinase inhibitors [10]. Approximately 20% of breast cancers carry HER2 receptors, and these are found to spread faster and be more aggressive than other types of breast cancer [11]. They are categorized as HER2+, HER2-, and HER2 borderline (HER2B). Drugs designed for HER2-type breast cancer are called targeted therapies [12]. Some targeted drugs used for HER2+ disease, such as trastuzumab (Herceptin), have been shown to decrease the risk of breast cancer recurrence [13]. Given with chemotherapy after surgery, this medication has become a standard treatment for breast cancer [14]. Patients with triple-negative breast cancer whose tumors express the PD-L1 protein can be treated with the chemotherapy drug nab-paclitaxel together with the checkpoint inhibitor Atezolizumab [15].

There is considerable scope for further treatments and for reducing the mortality rate. Biomarker-specific treatments exist; nevertheless, outcomes fall short of expectations. The biomarker profile also varies from one geographical location to another [16]. We therefore need an integrated study of the biomarkers to assess the risk of disease from the continuous values of HER2, ER, PR, and ERBB2. In this article, we carry out an integrated analysis of the overall survival (OS) of breast cancer patients by performing classification on the biomarkers HER2, ER, PR, and ERBB2. Knowing the risk level of a disease is the first step towards its treatment, and this study addresses that step. We report cut-off values of the biomarkers for fast survival estimation in breast cancer patients.

In [17], a Bayesian approach was used to search for classification trees in CART analysis; that article motivated the present paper. Here, we first classified the integrated breast cancer data using CART in a biomarker-specific way, which is what makes the analysis unique. We then applied the classical (MLE) and Bayesian approaches to the classified data to obtain survival estimates that describe the chance of survival and the risk of death associated with variation in the values of breast cancer biomarkers. The same methodology can be applied to any other disease to obtain a fast, biomarker-specific prediction of survival.

The purpose of this article is to obtain the optimal threshold values for four prominent breast cancer biomarkers, HER2, ER, PR, and ERBB2, treating the continuous biomarker values as predictor variables and the event status as a categorical response variable, in order to draw conclusions on OS. The article is organized as follows. Section ‘Introduction’ gives a brief outline of breast cancer, its biomarker-specific treatments, and the motivation for the methodology. Section ‘Material and methods’ describes the data, the study design, and the methodology with its validation techniques. Section ‘Results’ contains the numerical estimates obtained from the methodology and their interpretation. Section ‘Conclusion’ discusses the implications of the numerical results and the methodologies.

Figure 1.

Flowchart for selection of data to make study on overall survival.

2. Material and methods

2.1 Material and study design

We studied overall survival for breast cancer patients using datasets of clinical data and continuous-valued genomic data. The flowchart for the selection of the six datasets that met our criteria is shown in Fig. 1. These six independent datasets (GSE3494, GSE7390, GSE16446, GSE20685, GSE20711 and GSE48390) together include 1037 unique breast cancer patients. Dataset GSE3494 (236 patients) was collected in Uppsala (Sweden) from 1 January 1987 to 31 December 1989, published in PNAS, and approved by the Karolinska Institute, Stockholm, Sweden [18]. Gene expression dataset GSE7390 (198 patients) is taken from [19]. Its patients were diagnosed between 1980 and 1998, and the data were validated by two independent auditors at all six clinical centers [19]. Dataset GSE16446 (107 patients) is taken from [20], in which microarray profiles were assessed using pre-epirubicin biopsies and validation was done with the MDACC 2003-0321 neoadjuvant trials. Dataset GSE20685 (327 patients) is taken from [21], with a diagnosis period of 1991 to 2004 at the Koo Foundation Sun-Yat-Sen Cancer Center (KFSYSCC). Dataset GSE20711 (88 patients) is taken from [22], in which the Infinium Methylation Platform was used to profile breast tumors at single-CpG resolution. Dataset GSE48390 (81 patients) is taken from [23], with a diagnosis period of January 2007 to December 2010 for breast cancer patients from Taiwan. The six datasets come from different sources but cover homogeneous breast cancer patients, and all are based on microarray experiments performed on a similar platform. They can be downloaded from https://www.ncbi.nlm.nih.gov/geo/ using their accession numbers.

In this article, we provide a methodology for drawing conclusions on OS for breast cancer patients. All six datasets are combined into a single dataset of 1037 patients for meta-analysis. Some conversions were made: the unit of survival time (overall survival), given in days, months, or years in the six independent datasets, was converted to the single unit ‘day’. Overall survival, time to relapse, last follow-up, follow-up duration, and survival time are all written in the single notation OS. For the event status, one indicates death and zero indicates alive or censored. The probe IDs that identify the biomarkers in the genomic data are 216836_s_at for HER2, 205225_at for ER, 219197_s_at for PR, and 203497_at for ERBB2.

Breast cancer varies from one geographical location to another [16]. The microarray data studied here come from different geographical locations but share the same scenario and study type. An integrated study with representation from different locations is therefore expected to give overall survival estimates that reflect the actual behavior of the biomarkers HER2, ER, PR, and ERBB2. We performed a classification study on the integrated data, using CART analysis to detect threshold values for the four biomarkers. CART analysis yields the cut-off value of each biomarker (predictor variable) on overall survival, which can support better and faster breast cancer treatment. The classified data are then analyzed with the Cox proportional hazards (CPH) model to obtain the estimates (hazard ratio, confidence interval, and $p$-value) by the maximum likelihood method and by the Bayesian approach, using the open-source R software. The CPH model is used to assess the effects of the biomarkers (covariates) on the occurrence of the event (death) and to make predictions on OS. All these estimates are reported in Table 2.

We followed a similar strategy to validate the CART analysis. We selected 85% of the entire dataset of 1037 patients as training data for the CART analysis. The cut-off values obtained from the training data were used to classify the testing data of sample size 157 (15%). We also computed the Brier score on the testing data to check the accuracy of the methodology in terms of risk, i.e., mean squared error. The Brier score is obtained in R with the package ‘DescTools’.

2.2 Methods

CART is used in this study, in R, to obtain threshold values of the biomarkers. CART is preferred because it requires no assumption about the distribution of the predictor variables. It is suitable for highly skewed or multimodal numerical data and is an optimal tree-building technique [24]. The model used for the CART analysis is the regression model, in which the biomarkers are the covariates and the outcome is the event (death). Model trees were preferred for classifying the predictor variables, and CART was chosen because we want a binary classification of the continuous predictor variables in the data.
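To illustrate how a tree-based method turns a continuous biomarker into a binary cut-off, the following minimal Python sketch searches a single predictor for the threshold that minimizes the within-group sum of squared errors of a 0/1 event indicator. It is not the R code used in this study: the function name `best_split` and the toy values are hypothetical, and a full CART implementation additionally handles several predictors, recursive splitting, and pruning.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split x < threshold.

    x: continuous predictor values (e.g. one biomarker)
    y: 0/1 event indicator (1 = death)
    Candidate thresholds are midpoints between consecutive distinct x values.
    """
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [yy for xx, yy in pairs if xx < threshold]
        right = [yy for xx, yy in pairs if xx >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data (hypothetical): events cluster at high predictor values.
x = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 0, 1, 1, 1]
threshold, error = best_split(x, y)
print(threshold, error)
```

On this toy input the two groups are perfectly separated at the midpoint between 4.0 and 10.0, so the selected threshold leaves no within-group error.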

After the CART analysis, the study proceeds on the classified data. Parameters are estimated using the CPH model, on which Cox regression analysis (CRA) is based. The robustness of the CPH model lets us obtain estimates of the survival function that approximate those of the correct parametric model. With it, we obtain the MLE of the unknown parameters on asymptotic grounds [25]. Inferences from the MLE depend on the likelihood of the data. The Cox regression model is the most commonly used regression model for survival data, as it requires no hypothesis or assumption about the nature or shape of the underlying distribution [26]. The CPH model is used in this study to compare the biomarkers HER2, ER, PR, and ERBB2 on OS. The performance of these biomarkers is assessed by the Hazard Ratio (HR), Confidence Interval (CI), and $p$-value.

The hazard function $H\left(t\right)$ is expressed by

$\displaystyle H\left(t\right)=h_{0}\left(t\right)e^{\sum_{i=1}^{p}\beta_{i}X_{i}}$ (1)

where $t$ is the survival time, $H\left(t\right)$ is the hazard function for $p$ covariates $X_{1},X_{2},\ldots,X_{p}$, $h_{0}\left(t\right)$ is the baseline hazard at time $t$, and $\beta=\left(\beta_{1},\ldots,\beta_{p}\right)$ denotes the coefficients that measure the influence of the predictors.

The hazard ratio for two covariate vectors $y_{1}$ and $y_{2}$ is the ratio of their respective hazard rates. It is expressed by

$\displaystyle\textit{HR}=\frac{h_{y_{1}}\left(t\right)}{h_{y_{2}}\left(t\right)}=\frac{h_{0}\left(t\right)e^{y_{1}\beta}}{h_{0}\left(t\right)e^{y_{2}\beta}}=\frac{e^{y_{1}\beta}}{e^{y_{2}\beta}}$ (2)

The hazard ratio quantifies the degree of difference between the groups. If $\textit{HR}>1$, the event's occurrence (i.e., risk) increases by $\left(\textit{HR}-1\right)\times 100\%$; if $\textit{HR}<1$, it decreases by $\left(1-\textit{HR}\right)\times 100\%$; and $\textit{HR}=1$ indicates a lack of association.
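As a small worked illustration of this interpretation, the Python sketch below converts a hazard ratio into a percentage change in risk relative to the reference group, using the standard reading (HR - 1) x 100%. The function name `risk_change_percent` is hypothetical and the input values are illustrative only.

```python
def risk_change_percent(hr):
    """Percent change in hazard relative to the reference group.

    Positive values mean increased risk, negative values decreased risk,
    and 0 means no association.
    """
    if hr <= 0:
        raise ValueError("a hazard ratio must be positive")
    return (hr - 1.0) * 100.0


for hr in (3.0, 0.64, 1.0):
    print(hr, round(risk_change_percent(hr), 1))
```

For example, a hazard ratio of 3.0 corresponds to a 200% increase in risk, and a hazard ratio of 0.64 to a 36% decrease.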

Table 1

Measures of efficiency

Measure	Expression	Interpretation
Accuracy	$\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{FN}+\textit{TN}+\textit{FP}}$	Proportion of assessments correctly classified among all classified assessments.
Sensitivity	$\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$	Probability of correctly identifying patients in whom the target condition (event) occurs.
Specificity	$\frac{\textit{TN}}{\textit{TN}+\textit{FP}}$	Probability of correctly identifying patients in whom the target condition does not occur.

TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
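The measures of efficiency can be computed directly from the four confusion-matrix counts. The following Python sketch uses the standard definitions, with accuracy taken as the correctly classified assessments (TP + TN) divided by all assessments; the function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # correctly classified / all
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```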

In addition, the receiver operating characteristic (ROC) curve is used to measure the accuracy of the resulting classification. Predictions obtained by Bayesian survival analysis (BSA) have less bias and smaller standard errors than those from Cox regression analysis (CRA), whether the sample size is large or small [27]. BSA uses the Bayesian method of estimation, in which inferences are exact. The Bayesian approach to estimation combines prior information with new knowledge from the experimental (observed) data [28]. Bayesian analysis can use informative or non-informative priors. An informative prior is based on earlier experience, studies, or expert views; it is not dominated by the likelihood and is used to find the posterior distribution. When prior information is unavailable, a non-informative prior density is used, which is dominated by the likelihood function. It has minimal influence on the posterior distribution [29, 30], although an improper posterior can be obtained with it [29]. We used two estimation methods, MLE and Bayesian, to estimate the parameters. Prior information plays a vital role in Bayesian survival analysis. Here, the regression coefficients are the parameters of the CPH model, and their prior distribution is taken as normal. The Markov Chain Monte Carlo (MCMC) method is used to sample the parameters from the posterior density. For Bayesian estimation, we used the function ‘survMC’ from the package ‘SurviMChd’ with 10000 iterations, which provides survival estimates using the CPH model with MCMC.

The MCMC method is used with multi-dimensional data to characterize a sampling distribution. Monte Carlo refers to random sampling from a probability distribution to estimate parameters, whereas a Markov chain is a systematic method for generating a sequence of random variables in which the current variable is generated from the previous one. In this article, MCMC is used to update the information about the parameters to be estimated by sampling from the posterior density. Many samples can be generated from the posterior density in this way, and characteristics of interest are estimated by taking expectations over them. MCMC has many applications in statistics. For example, the Gibbs sampler, a powerful simulation technique, is an MCMC method.
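The idea of drawing each new value from the previous one and then averaging the draws can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC methods. This Python example is only an illustration and is not the sampler implemented in ‘SurviMChd’: the function name `metropolis`, the step size, the seed, and the standard normal target (standing in for a one-dimensional posterior density) are all assumptions.

```python
import math
import random


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_density: log of the (possibly unnormalized) target density
    step: standard deviation of the Gaussian proposal
    """
    rng = random.Random(seed)
    current = start
    current_logp = log_density(current)
    samples = []
    for _ in range(n_samples):
        # Markov step: the proposal depends only on the current value.
        proposal = current + rng.gauss(0.0, step)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


# Toy target (assumption): a standard normal log-density.
draws = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=20000, seed=1)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(variance, 2))
```

The Monte Carlo part is the final averaging step: the sample mean and variance of the draws approximate the corresponding expectations under the target density.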

Risk prediction is one of the key tasks of the methodology. The Brier score (BS) is used in this article to validate the accuracy of the classification model on its testing data. It expresses validation in terms of the mean squared error between predicted probabilities and observed outcomes. The expression for the Brier score is

$\displaystyle\textit{BS}=\frac{1}{n}\sum_{j=1}^{n}\left(p_{j}-o_{j}\right)^{2}$ (3)

where $p_{j}$ is the predicted probability (a value between 0 and 1) and $o_{j}$ denotes the actual outcome (0 or 1). The Brier score measures the precision of probabilistic estimation: the lower its value, the better the prediction.
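The Brier score formula translates directly into code. The Python sketch below is a plain implementation of Eq. (3), not the ‘DescTools’ routine used in the study; the function name `brier_score` and the toy probabilities are hypothetical.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n


# Hypothetical predicted death probabilities and actual outcomes.
predicted = [0.1, 0.8, 0.3, 0.9]
observed = [0, 1, 0, 1]
score = brier_score(predicted, observed)
print(round(score, 4))
```

As a point of comparison, a forecast that always predicts a probability of 0.5 scores 0.25 regardless of the outcomes, so informative predictions should score below that value.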

Figure 2.

Classification tree for Overall Survival (OS) and death on overall data.

Recently, boosting has been used as an optimization technique to estimate potentially high-dimensional models, whether additive or linear [31, 32, 33]. The ‘mboost’ (model-based boosting) package in R provides statistical modeling that uses a boosting algorithm to minimize the empirical risk function $\frac{1}{n}\sum_{i=1}^{n}\rho\left(Y_{i},f\left(X_{i}\right)\right)$ [31], where $\left(X_{1},Y_{1}\right),\ldots,\left(X_{n},Y_{n}\right)$ are the observations in the dataset. Here $Y_{i}$ is the one-dimensional response variable (death status), and $X_{i}$ is the multi-dimensional predictor variable consisting of the biomarkers HER2, ER, PR, and ERBB2.

The boosting algorithm is based on the functional gradient descent (FGD) algorithm. It can be applied to the Cox model using the negative gradient of the partial likelihood by $L_{2}$-boosting [34]. $L_{2}$-boosting is often used in regression, particularly with a high number of covariates or predictor variables. In this case, the loss function is defined as

$\displaystyle\rho_{L_{2}}\left({Y,f}\right)=\frac{|Y-f|^{2}}{2}$ (4)

which yields population minimizer

$\displaystyle f_{L_{2}}^{\ast}\left(x\right)=E\left[Y|X=x\right]=p\left(x\right)=P\left[Y=1|X=x\right]$ (5)

$f^{\ast}\left(\cdot\right)$ can be estimated with boosting by minimizing the empirical risk function mentioned above. The iterations follow a sequential scheme in function space: the $m^{th}$ iteration depends only on the previous $(m-1)^{th}$ iteration.
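The sequential scheme can be sketched as componentwise $L_{2}$-boosting with simple linear base learners. This Python example is an illustration under stated assumptions and is not the ‘mboost’ implementation: it uses the squared-error loss of Eq. (4) on a toy linear problem rather than the Cox partial likelihood, and the function name `l2_boost`, the step length `nu`, the number of iterations, and the data are all hypothetical.

```python
def l2_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2-boosting with linear base learners.

    X: list of rows, each holding p predictor values (assumed centered)
    y: list of responses
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coef = [0.0] * p
    fitted = [intercept] * n
    for _ in range(n_iter):
        # For the L2 loss, the negative gradient is the ordinary residual.
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        best_j, best_beta, best_sse = None, 0.0, float("inf")
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(v * v for v in col)
            if sxx == 0:
                continue
            # Least-squares fit of the residual on predictor j alone.
            beta = sum(v * r for v, r in zip(col, resid)) / sxx
            sse = sum((r - beta * v) ** 2 for v, r in zip(col, resid))
            if sse < best_sse:
                best_j, best_beta, best_sse = j, beta, sse
        if best_j is None:
            break
        # Update only the best-fitting component, shrunk by nu.
        coef[best_j] += nu * best_beta
        fitted = [f + nu * best_beta * row[best_j] for f, row in zip(fitted, X)]
    return intercept, coef


# Toy data (hypothetical): y = 1 + 2 * x0, while x1 is irrelevant.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.5], [1.0, -1.0], [2.0, 0.5]]
y = [-3.0, -1.0, 1.0, 3.0, 5.0]
intercept, coef = l2_boost(X, y)
print(round(intercept, 3), [round(c, 3) for c in coef])
```

Each iteration depends only on the fit from the previous iteration, and the shrinkage factor `nu` keeps individual updates small so that the fit improves gradually.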

After classification, its efficiency can be measured by the accuracy rate, sensitivity, and specificity. These measures are defined in terms of the numbers of true positives, false positives, true negatives, and false negatives, as shown in Table 1.

3. Results

In this article, an integrated study has been done on 1037 unique breast cancer patients. The flowchart for the selection and integration of the six datasets (GSE3494, GSE7390, GSE16446, GSE20685, GSE20711 and GSE48390) is shown in Fig. 1. We made predictions on OS by detecting the cut-off values of the biomarkers through CART classification analysis on the integrated dataset. The cut-off values obtained for the prominent biomarkers HER2, ER, PR, and ERBB2 are 14.707, 8.128, 13.153, and 6.884, respectively. Figure 2 shows that HER2 is the node that splits first, and hence HER2 is a stronger predictor than the other three biomarkers. The ROC curve obtained to measure the accuracy of the model in CART analysis gives an AUC value of 0.82 (shown in Fig. 3), which is relatively high.

Figure 3.

ROC curve for Overall Survival (OS) and death on 85% training data.

In the dataset of 1037 patients (Fig. 2), 226 (22.51%) of the 1004 patients with $\textit{HER2}<14.707$ and 19 (57.58%) of the 33 patients with $\textit{HER2}\geqslant 14.707$ died. Among patients with $\textit{HER2}<14.707$, 220 (23.6%) of the 932 patients with $\textit{PR}<13.153$ and 6 (8.33%) of the 72 patients with $\textit{PR}\geqslant 13.153$ died. Among patients with $\textit{HER2}\geqslant 14.707$, 7 (100%) of the 7 patients with $\textit{ER}<8.128$ and 12 (46.15%) of the 26 patients with $\textit{ER}\geqslant 8.128$ died. Among patients with $\textit{HER2}<14.707$ and $\textit{PR}<13.153$, 6 (60%) of the 10 patients with $\textit{ERBB2}<6.884$ and 214 (23.21%) of the 922 patients with $\textit{ERBB2}\geqslant 6.884$ died.

Table 2

Estimates using MLE and Bayesian for univariate and multivariable

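Response ($n=1037$): Overall survival (OS)
Method	Analysis	Estimate	HER2	ER	PR	ERBB2
Maximum likelihood estimator	Univariate analysis	HR	2.91	0.60	0.29	0.21
Maximum likelihood estimator	Univariate analysis	95% CI	(1.82, 4.65)	(0.46, 0.79)	(0.13, 0.64)	(0.09, 0.47)
Maximum likelihood estimator	Univariate analysis	$p$-value	$<0.0001$	0.0003	0.0025	0.0002
Maximum likelihood estimator	Multivariable analysis	HR	2.99	0.62	0.35	0.22
Maximum likelihood estimator	Multivariable analysis	95% CI	(1.87, 4.80)	(0.47, 0.83)	(0.15, 0.79)	(0.10, 0.49)
Maximum likelihood estimator	Multivariable analysis	$p$-value	$<0.0001$	0.0010	0.0114	0.0002
Bayesian	Univariate analysis	HR	2.9	0.60	0.28	0.20
Bayesian	Univariate analysis	95% CI (HPD)	(1.73, 4.38)	(0.45, 0.78)	(0.10, 0.55)	(0.1, 0.36)
Bayesian	Multivariable analysis	HR	3.00	0.64	0.35	0.26
Bayesian	Multivariable analysis	95% CI (HPD)	(1.36, 4.40)	(0.46, 0.82)	(0.10, 0.64)	(0.08, 0.50)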

Table 3

Accuracy and sensitivity for different $k$

k	1	2	3	4	5	6	7	8	9	10
Accuracy	0.772	0.772	0.772	0.771	0.771	0.771	0.771	0.764	0.764	0.764
Sensitivity	0.995	0.995	0.995	0.996	0.996	0.996	0.996	1	1	1

After the CART analysis, parameter estimates are obtained on the classified data by the maximum likelihood method and the Bayesian approach using the CPH model (shown in Table 2). With the Bayesian approach and all covariates together (multivariable), the HR (95% CI) of HER2 for death is 3.0 (1.63, 4.40): patients with $\textit{HER2}\geqslant 14.707$ have a 200% higher risk of death than patients with $\textit{HER2}<14.707$. The HR (95% CI) of ER is 0.64 (0.46, 0.82): patients with $\textit{ER}\geqslant 8.128$ have a 36% lower risk of death than patients with $\textit{ER}<8.128$. The HR (95% CI) of PR is 0.35 (0.10, 0.64): patients whose PR value exceeds the cut-off value 13.153 have a 65% lower risk of death than patients below it. The HR (95% CI) of ERBB2 is 0.26 (0.08, 0.50): patients with $\textit{ERBB2}\geqslant 6.884$ have a 74% lower risk of death than patients with $\textit{ERBB2}<6.884$.

We followed a similar strategy on training data to validate the CART analysis. The training data are a random selection of 880 patients (85%) from the entire dataset of 1037 patients. Using CART analysis on the training data, the threshold (cut-off) values obtained for the biomarkers HER2, ER, PR, and ERBB2 are 14.993, 5.063, 13.153, and 11.152, respectively. Compared with the entire dataset, the cut-off values for HER2, ER, and ERBB2 differ, while the cut-off value for PR is unchanged. The cut-off values obtained from the training data are then used to classify the testing data (15% of the 1037 patients). We computed the Brier score on the classified testing data, and the corresponding prediction error is shown in Fig. 4. The Brier score measures the accuracy of a binary prediction. The Brier score for death in the testing data is 0.16, a low value that indicates relatively high accuracy of our methodology.

Figure 4.

Representation of prediction error for Brier score.

The ‘mboost’ package in R supports parameter estimation with bootstrap, k-fold cross-validation, and subsampling techniques. We used boosting with cross-validation on the training data to assess the partial contribution of the biomarkers HER2, ER, PR, and ERBB2 in breast cancer patients, using the Cox model with a linear predictor. The partial effect of the biomarkers is shown in Fig. 5.

Figure 5.

Partial contribution of biomarkers on breast cancer patients.

The efficiency of CART is measured by accuracy and sensitivity (correct measurement of risk) using the k-nn classifier in R with the package ‘caret’. Sensitivity describes the probability of correctly predicting the occurrence of the event for patients above or below the cut-off values of the biomarkers HER2, ER, PR, and ERBB2. Let $k$ be the chosen number of neighbors. The variation in accuracy and sensitivity for different values of $k$ is shown in Table 3, and the graph in Fig. 6 shows the relation between error and $k$. Smaller values of $k$ give lower error, that is, higher accuracy for CART.
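The role of $k$ in the k-nn classifier can be seen in a minimal sketch: a query point receives the majority label of its $k$ nearest training points. This Python example is illustrative only and does not reproduce the ‘caret’ workflow; the function name `knn_predict`, the Euclidean distance, and the toy data are assumptions.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Squared distances preserve the ordering of Euclidean distances.
    neighbours = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups labelled 0 and 1.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, [1.1, 1.0], k=3))
print(knn_predict(train_x, train_y, [5.0, 5.1], k=3))
```

Repeating such predictions over a held-out set for several values of $k$ and tabulating accuracy and sensitivity gives a comparison of the kind reported in Table 3.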

Figure 6.

Variation of error with K.

4. Conclusion

This study presented numerical threshold (or cut-off) values for the biomarkers HER2, ER, PR, and ERBB2, obtained by CART analysis on the integrated dataset. It also shows that HER2 is a more robust biomarker than the others.

CART models are a flexible approach for specifying the conditional distribution of a response variable given a vector of predictor variables. CART builds a binary tree with a greedy algorithm that recursively partitions the predictor space into subsets within which the distribution of the response variable becomes more homogeneous. The classification tree is first grown large and then pruned to the tree with the minimum cross-validation error estimate. CART became widely popular in statistics after Breiman, Friedman, Olshen, and Stone [35]. Brown [36] provided a linear programming solution for the linear split of predictor variables in CART by stochastic search. The Bayesian approach has been used to find CART models [17]. Oliver and Hand [37] presented an experimental comparison of various pruning and Bayesian model methods based on CART. CART has also been applied to identify novel prognostic factors, the impact of clinical variables on survival, and interactions between covariates [38].

On the data classified by CART, OS was compared using MLE with the Cox model and using the Bayesian approach, since BSA shows better results than CRA [27]. We observe that breast cancer patients whose HER2 value is below its cut-off value and whose ER, PR, and ERBB2 values are above their cut-off values are at lower risk of death than patients on the opposite side of these cut-off values for the same biomarkers. The HR (a lower HR means a lower death risk) for values above the cut-off decreases in the order HER2, ER, PR, and ERBB2. This holds for both the univariate and multivariable cases with MLE and with the Bayesian approach. The Brier score is used to verify the skill of probability forecasts for binary outcomes [39]. Here we used the Brier score to validate our methodology, and it shows that the accuracy of the technique is relatively high. We also applied another algorithm, boosting, to the same data; it is completely independent of the CART analysis. The partial effect or contribution (partial likelihood estimate) of the biomarkers HER2, ER, PR, and ERBB2 on death indicates non-linearity. In this work, we have provided results based on an integrated study of six datasets from six different geographical locations. Hence this methodology can be applied efficiently to any similar breast cancer data with the same model and the same prominent biomarkers HER2, ER, PR, and ERBB2. The intention of this work is not to compare one study with another. The purpose is to integrate different studies and create comprehensive results that give a holistic picture.

Footnotes

Acknowledgments

The authors are deeply indebted to the editor-in-chief Prof. Sudhir Srivastava and learned reviewers for their valuable comments leading to improving the contents and presentation of the original manuscript.

Authors’ contributions

Conception: GKV and AB.

Interpretation or analysis of data: PK.

Preparation of the manuscript: AB and PK.

Revision for important intellectual content: GKV.

Supervision: GKV and AB.

References

Hyuna

Jacques

Rebecca

L.S.

Mathieu

Isabelle

and Ahmedin

, Jemaland Freddie, Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 71 (2021), 209–249.

Ferlay

Ervik

Lam

Colombet

Mery

Piñeros

Znaor

Soerjomataram

and Bray

, Global cancer observatory: cancer today, Lyon, France: International Agency for Research on Cancer (2018), 1–6.

Bray

Ferlay

Soerjomataram

Siegel

R.L.

Torre

L.A.

and Jemal

, Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 68 (2018), 394–424.

Siegel

Miller

and Jemal

, Cancer facts & figures 2020, Atlanta: American Cancer Society (2020), 1–76.

Dai

Xiang

and Bai

, Cancer hallmarks, biomarkers and breast cancer molecular subtypes, Journal of Cancer 7 (2016), 1281–1284.

Clinton

S.K.

Giovannucci

E.L.

and Hursting

S.D.

, The world cancer research fund/american institute for cancer research third expert report on diet, nutrition, physical activity, and cancer: impact and future directions, The Journal of Nutrition 150 (2020), 663–671.

Anderson

W.F.

Chatterjee

Ershler

W.B.

and Brawley

O.W.

, Estrogen receptor breast cancer phenotypes in the surveillance, epidemiology, and end results database, Breast Cancer Research and Treatment 76 (2002), 27–36.

Colomer

Beltran

Dorcas

Cortes-Funes

Hornedo

Valentin

Vargas

Mendiola

and Ciruelos

, It is not time to stop progesterone receptor testing in breast cancer, Journal of Clinical Oncology 23 (2005), 3868–3869.

Obiorah

and Jordan

, Progress in endocrine approaches to the treatment and prevention of breast cancer, Maturitas 70 (2011), 315–321.

10.

Uray

and Brown

P.H.

, Chemoprevention of hormone receptor-negative breast cancer: new approaches needed, Clinical Cancer Prevention 188 (2010), 147–162.

11.

Bhattacharjee

Rajendra

Dikshit

and Dutt

, HER2 borderline is a negative prognostic factor for primary malignant breast cancer, Breast Cancer Research and Treatment 181 (2020), 225–231.

12.

Mohamed

Krajewski

Cakar

and Ma

C.X.

, Targeted therapy for breast cancer, The American Journal of Pathology 183 (2013), 1096–1112.

13.

Slamon

Eiermann

Robert

Pienkowski

Martin

Press

Mackey

Glaspy

Chan

Pawlicki

et al., Adjuvant Trastuzumab in HER2-Positive Breast Cancer, New England Journal of Medicine 365 (2011), 1273–1283.

14.

Bang

Y.-J.

Van Cutsem

Feyereislova

Chung

H.C.

Shen

Sawaki

Lordick

Ohtsu

Omuro

Satoh

et al., Trastuzumab in combination with chemotherapy versus chemotherapy alone for treatment of HER2-positive advanced gastric or gastro-oesophageal junction cancer (ToGA): a phase 3, open-label, randomised controlled trial, The Lancet 376 (2010), 687–697.

15.

Schmid

Adams

Rugo

H.S.

Schneeweiss

Barrios

C.H.

Iwata

Diéras

Hegg

S.-A.

Shaw Wright

et al., Atezolizumab and nab-paclitaxel in advanced triple-negative breast cancer, New England Journal of Medicine 379 (2018), 2108–2121.

16.

Nattinger

A.B.

Gottlieb

M.S.

Veum

Yahnke

and Goodwin

J.S.

, Geographic variation in the use of breast-conserving treatment for breast cancer, New England Journal of Medicine 326 (1992), 1102–1107.

17. Chipman H.A., George E.I. and McCulloch R.E., Bayesian CART model search, Journal of the American Statistical Association 93 (1998), 935–948.

18. Miller L.D., Smeds, George, Vega V.B., Vergara, Ploner, Pawitan, Hall, Klaar, Liu E.T. et al., An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival, Proceedings of the National Academy of Sciences 102 (2005), 13550–13555.

19. Desmedt, Piette, Loi, Wang, Lallemand, Haibe-Kains, Viale, Delorenzi, Zhang, d’Assignies M.S. et al., Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multi-center independent validation series, Clinical Cancer Research 13 (2007), 3207–3214.

20. Desmedt, Di Leo, De Azambuja, Larsimont, Haibe-Kains, Selleslags, Delaloge, Duhem, Kains J.-P., Carly et al., Multifactorial approach to predicting resistance to anthracyclines, Journal of Clinical Oncology 29 (2011), 1578–1586.

21. Kao K.-J., Chang K.-M., Hsu H.-C. and Huang A.T., Correlation of microarray-based breast cancer molecular subtypes and clinical outcomes: implications for treatment optimization, BMC Cancer 11 (2011), 1–15.

22. Dedeurwaerder, Desmedt, Calonne, Singhal S.K., Haibe-Kains, Defrance, Michiels, Volkmar, Deplus, Luciani et al., DNA methylation profiling reveals a predominant immune component in breast cancers, EMBO Molecular Medicine 3 (2011), 726–741.

23. Huang C.C., Tu S.H., Lien H.H., Jeng J.Y., Huang C.S., Huang C.J., Lai L.C. and Chuang E.Y., Concurrent gene signatures for Han Chinese breast cancers, PloS One 8 (2013), e76421.

24. Lewis R.J., An introduction to classification and regression tree (CART) analysis, in Annual meeting of the Society for Academic Emergency Medicine in San Francisco, California 14 (2000).

25. Calle M.L., Hough, Curia and Gómez, Bayesian survival analysis modeling applied to sensory shelf life of foods, Food Quality and Preference 17 (2006), 307–312.

26. Ahmed F.E., Vos P.W. and Holbert, Modeling survival in colon cancer: a methodological review, Molecular Cancer 6 (2007), 1–12.

27. Omurlu I.K., Ozdamar and Ture, Comparison of Bayesian survival analysis and Cox regression analysis in simulated and breast cancer data sets, Expert Systems with Applications 36 (2009), 11341–11346.

28. Wong, Lam and Lo, Bayesian analysis of clustered interval-censored data, Journal of Dental Research 84 (2005), 817–821.

29. Lindley, The Bayesian Approach to Statistics, California Univ Berkeley Operations Research Center, Tech. Rep. (1980).

30. Bhattacharjee, Application of Bayesian Approach in Cancer Clinical Trial, World Journal of Oncology 5 (2014), 109–112.

31. Hothorn and Bühlmann, Model-based boosting in high dimensions, Bioinformatics 22 (2006), 2828–2829.

32. Bühlmann and Yu, Boosting With the L2 Loss: Regression and Classification, Journal of the American Statistical Association 98 (2003), 324–339.

33. Bühlmann et al., Boosting for high-dimensional linear models, The Annals of Statistics 34 (2006), 559–583.

34. Bühlmann, Hothorn et al., Boosting algorithms: Regularization, prediction and model fitting, Statistical Science 22 (2007), 477–505.

35. Breiman, Friedman J.H., Olshen R.A. and Stone C.J., Classification and Regression Trees, Belmont, CA: Wadsworth International Group (1984), 151–166.

36. Brown C.E., Pittard C.L. and Park, Classification trees with optimal multivariate decision nodes, Pattern Recognition Letters 17 (1996), 699–703.

37. Oliver J.J. and Hand, On pruning and averaging decision trees, in Machine Learning Proceedings 1995, Elsevier (1995), 430–437.

38. Hess K.R., Abbruzzese M.C., Lenzi, Raber M.N. and Abbruzzese J.L., Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma, Clinical Cancer Research 5 (1999), 3403–3410.

39. Epstein E.S., A scoring system for probability forecasts of ranked categories, Journal of Applied Meteorology 8 (1969), 985–987.

Thresholding of prominent biomarkers of breast cancer on overall survival using classification and regression tree
