Machine Learning-driven QSAR and Docking Pipeline for Identification of Amyloid Beta-A4 Inhibitors in Alzheimer’s Disease

Abstract

Background/Purpose

Alzheimer’s disease (AD) is a long-term neurodegenerative condition that leads to the gradual deterioration of nerve cells. The goal of this study is to use computational drug discovery (CDD) techniques to identify lead compounds against Amyloid-beta A4 (AβA4), a potential therapeutic target for AD.

Materials and Methods

Quantitative structure-activity relationship (QSAR) modeling was used to compare machine learning (ML) models for predicting the potency of candidate compounds, expressed as the negative logarithm of IC₅₀ (pIC₅₀); the predictions were then validated by molecular docking based on binding affinity. A non-redundant dataset of 1,241 compounds for AβA4 was retrieved from the ChEMBL database. The compounds were encoded with 881 PubChem substructure fingerprints, and 42 ML models were built and compared. The Kennard–Stone algorithm was employed to select a diverse set of 30 compounds from the set of active inhibitors for molecular docking. An application programming interface (API), named NeuroIC₅₀, was developed and deployed.

Results

The histogram-based gradient boosting regression (HGBR) model achieved the best performance among the regression models, with a root mean square error (RMSE) of 0.73, an R² value of 0.65, and a computation time of 0.78. The random forest regression (RFR)-HGBR-derived Gini index revealed the most important features in the compounds, including PubChemFP23, PubChemFP405, and PubChemFP577.

Conclusion

The lead compound (CHEMBL5080033), with a predicted pIC₅₀ of 8.67 and a binding energy of −7.6 kcal/mol, was identified. This ML-based QSAR modeling and docking approach is an effective strategy for accelerating drug discovery.

Keywords

Alzheimer’s, machine learning, regression, quantitative structure-activity relationship, drug discovery

Introduction

Alzheimer’s disease (AD) is a slowly progressing neurodegenerative disorder, resulting in memory problems, reduced cognitive ability, and difficulty in carrying out daily activities.¹ It is caused by the build-up of amyloid plaques and neurofibrillary tangles, which leads to the death of brain cells and synaptic dysfunction.² AD progresses through stages from mild to severe, with symptoms worsening over time.³ Global estimates of AD and other forms of dementia suggest that the burden on society is escalating at a concerning pace. In 2006, the worldwide prevalence was 26.6 million, projected to quadruple by 2050.⁴ Later estimates are higher still: 44.35 million people were affected in 2013, a figure expected to reach 135.46 million by 2050,⁶ and 51.6 million were affected in 2019.⁵ Interventions delaying disease onset and progression by just 1 year could significantly reduce the global burden.⁷ However, there are not many effective drugs for the treatment of AD. Despite the high prevalence, the symptoms associated with AD, and the significant unmet medical need, many pharmaceutical companies have moved away from drug discovery, development, and research for AD spectrum disorders. The reasons include high costs, high failure rates in clinical trials, a poor understanding of the disease’s underlying mechanisms, and the lengthy time required for drug development.⁸

Amyloid-β (Aβ) plaque accumulation is a pathological hallmark of AD and occurs 15–20 years prior to the onset of clinical symptoms. These plaques trigger a series of biological events, including neuroinflammation and synaptic dysfunction, along with disturbances in tau pathology, oxidative stress, protein clearance, mitochondrial function, and calcium homeostasis.⁹ The amyloidogenic pathway produces Aβ peptides, notably Aβ40 and Aβ42, and the accumulation and aggregation of Aβ are associated with AD.¹⁰ Targeting Aβ remains a promising therapeutic approach, despite recent setbacks. Existing AβA4 inhibitors include Lecanemab, Aducanumab, Valiltramiprosate, Donanemab, and others.¹¹

Accurately predicting the activity of modulators of Amyloid-beta A4 (AβA4) is a complex task because of the diverse nature of transporters and their interactions with inhibitors. To address this complexity, the incorporation of machine learning (ML) techniques into quantitative structure-activity relationship (QSAR) has proven to be a valuable tool. QSAR plays a pivotal role in establishing relationships between chemical structures and biological activity, using statistical or ML models.¹²

ML is a branch of artificial intelligence (AI) that focuses on developing and using algorithms capable of learning from unprocessed data in order to perform designated tasks effectively.¹³ ML algorithms perform tasks such as classification, regression, and clustering across extensive datasets. In the pharmaceutical realm, a diverse array of ML techniques has been used to predict novel molecular descriptors, biological functions, activities, interactions, and adverse effects of drugs.¹² Examples of regression models include decision trees, random forests, linear regression, and neural networks.¹⁴ Popular classification models include logistic regression, support vector machines, and naive Bayes.¹⁵ The Lazy Predict library in Python¹⁶ can be employed to find the best-performing model for classification or regression on a given dataset without any parameter tuning. LazyClassifier compares the performance of several classification models, while LazyRegressor does the same for regression models; in both cases, the models are fitted on a training set and scored on a held-out test set (here, an 80% training set and a 20% test set).¹⁷

We present a comprehensive QSAR pipeline aimed at predicting AβA4 inhibition using 1,241 non-redundant compounds. Interpretable learning methods like random forest and molecular descriptors, such as molecular fingerprints, were applied to uncover the inhibitory activity of AβA4 in accordance with Organization for Economic Co-operation and Development (OECD) guidelines. Additionally, molecular docking studies were conducted on selected active AβA4 inhibitors. This combined ligand- and structure-based approach aims to provide insights into the design of selective AβA4 inhibitors, offering a novel therapeutic strategy for AD. To support this, we developed NeuroIC₅₀, a computational server that predicts the inhibition potency, negative logarithm of IC₅₀ (pIC₅₀) of ligands using a QSAR model. Researchers can input ligands as SMILES, screen inhibitory potential, and validate efficacy through pIC₅₀ values. This tool aims to advance the identification of potent AβA4 inhibitors and improve AD treatment strategies.

Materials and Methods

Dataset

A dataset of inhibitors targeting human AβA4 (Target ID CHEMBL2487) was compiled from the ChEMBL database, comprising a total of 8,968 bioactivity data points derived from 1,498 compounds. We refined the SMILES notations of the compounds using ChemAxon’s standardizer, adhering to the parameters set forth in the study by Simeon.¹⁸ Initially, we built the dataset from various bioactivity measurement types, sorted by decreasing data size: IC₅₀, K_i, %activity, %inhibition, EC₅₀, and so forth. We chose to focus on the IC₅₀ values for further investigation, as they represented the largest subset with 1,256 compounds. Upon closer examination, we identified nine compounds lacking reported IC₅₀ values and canonical SMILES. These were excluded from the study, leaving 1,247 compounds. Additionally, we retained redundant compounds with differing bioactivity values only if the standard deviation of their IC₅₀ values was less than two, which resulted in a refined dataset of 1,243 compounds. We converted the IC₅₀ values into pIC₅₀ values, using the relation pIC₅₀ = −log₁₀(IC₅₀), to obtain normalized data suitable for the ML pipeline. We categorized compounds with pIC₅₀ values >5 as active, <5 as inactive, and =5 as intermediate. We removed the intermediate class, leading to a final dataset of 1,241 compounds. A summary of this pipeline is illustrated in Figure 1.
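The conversion and labeling steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: it assumes that IC₅₀ values are supplied in nanomolar and converted to molar before the logarithm is taken (the text does not state the unit), it rounds pIC₅₀ to two decimals before comparing with the threshold of 5, and the compound names and IC₅₀ values are hypothetical.

```python
import math

def pic50_from_nm(ic50_nm):
    """Return pIC50 = -log10(IC50 in mol/L) for an IC50 given in nM (assumed unit)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

def activity_class(pic50, threshold=5.0):
    """Label a compound as active (>5), inactive (<5), or intermediate (=5)."""
    value = round(pic50, 2)  # rounding avoids floating-point noise at the threshold
    if value > threshold:
        return "active"
    if value < threshold:
        return "inactive"
    return "intermediate"

# Hypothetical records: (compound name, IC50 in nM)
records = [("cmpd_A", 2.1), ("cmpd_B", 10_000.0), ("cmpd_C", 85_000.0)]

labeled = []
for name, ic50_nm in records:
    pic50 = pic50_from_nm(ic50_nm)
    label = activity_class(pic50)
    if label != "intermediate":  # the intermediate class is removed
        labeled.append((name, round(pic50, 2), label))

for row in labeled:
    print(row)
```

With these made-up inputs, the compound at exactly 10 µM falls on the threshold and is dropped, mirroring the removal of the intermediate class.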

Figure 1.

Workflow of Quantitative Structure-activity Relationship (QSAR) Modeling and Molecular Docking for Investigating an Inhibitory Activity.

Molecular Descriptors of Inhibitors

Molecular descriptors are quantitative or qualitative representations that characterize molecules based on their structure, connectivity, and physicochemical characteristics. They play a crucial role in QSAR studies. These descriptors can be computed using graphical user interface (GUI)-based software like Dragon, PaDEL-Descriptor, QuBiLS-MIDAS, QuBiLS-MAS, and CODESSA. Alternatively, they can be computed using R or Python libraries, such as ChemoPy, PyDPI, RDKit, and rcdk. Additionally, web servers like BioTriangle and ChemDes offer molecular descriptor calculation services online.¹⁹⁻²¹ Fingerprint descriptors capture information about the underlying substructures inherently present in a molecule. In this study, we utilized PaDEL-Descriptor for computing 881 PubChem fingerprints.²² Additionally, we computed four molecular descriptors that define Lipinski’s rule of five, comprising molecular weight (MW), logarithm of octanol/water partition coefficient (ALogP), number of hydrogen bond donors (nHBDon), and number of hydrogen bond acceptors (nHBAcc), using RDKit.

Feature Selection and Data Splitting

Collinearity refers to a condition where descriptor pairs exhibit intercorrelation, which results in model complexity and potential bias. To address this issue, we employed the corr() function from the Python library pandas to calculate the pairwise correlation among the descriptors. We excluded descriptors with a Pearson’s correlation coefficient greater than the threshold of 0.7 from further study.²³ Following this, we split the dataset into two sets, namely, the training set and the test set, with the former comprising 80% and the latter comprising 20% of the original dataset. We utilized the train_test_split() function from the sklearn.model_selection module of the Python library scikit-learn to perform the data split.
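The filtering and splitting logic can be illustrated with a small standard-library sketch. It is a plain-Python stand-in for the pandas and scikit-learn calls named above, not the study's code. The greedy keep-first rule for choosing which member of a correlated pair to drop, the treatment of constant columns, the rounding-up of the test-set size, and the fingerprint column names are all assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)

def drop_collinear(columns, threshold=0.7):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # rounding up is an assumption
    return indices[n_test:], indices[:n_test]

# Hypothetical fingerprint bits for six compounds
columns = {
    "fp_a": [0, 1, 0, 1, 1, 0],
    "fp_b": [0, 1, 0, 1, 1, 0],  # duplicate of fp_a, so r = 1
    "fp_c": [1, 1, 0, 0, 1, 0],
}
kept = drop_collinear(columns)
train_idx, test_idx = split_indices(1241)
print(kept, len(train_idx), len(test_idx))
```

Because the greedy rule depends on column order, a different ordering of the descriptors can retain a different (equally uncorrelated) subset.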

Multivariate Analysis

Supervised learning trains a model on labeled data so that it can make predictions about unseen or future data.²⁴ In this study, the Python library LazyPredict was used to construct multiple regression models and evaluate their performance; these models predict the continuous response variable (pIC₅₀) from the predictor variables (fingerprint descriptors). This study primarily employs the histogram gradient boosting regressor (HGBR). HGBR is an ensemble method grounded in gradient boosting principles: decision trees are built sequentially, and each tree attempts to correct the errors made by its predecessors. It integrates histogram-based techniques for accelerated training on extensive datasets.²⁵ The HistGradientBoostingRegressor class from the Python library scikit-learn was used to construct the models, providing an efficient implementation of the HGBR algorithm.²⁶ For a better understanding of the key substructures of the compounds responsible for the modulatory activity, we extracted informative descriptors from the HGBR model using the built-in feature importance estimator. Specifically, the mean decrease of the Gini index (MDGI) was employed to identify the most important descriptors (descriptors with the highest MDGI values).
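To make the sequential error-correcting idea concrete, the following toy sketch implements gradient boosting for squared error with one-split trees (stumps) over binary fingerprint bits. It is a deliberately simplified stand-in and not scikit-learn's HistGradientBoostingRegressor: with 0/1 features each "histogram" has only two bins, there is no regularization or early stopping, and the data, function names, and hyperparameters are hypothetical.

```python
def fit_stump(X, residuals):
    """Pick the binary feature whose 0/1 split gives the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        off = [r for row, r in zip(X, residuals) if row[j] == 0]
        on = [r for row, r in zip(X, residuals) if row[j] == 1]
        if not off or not on:
            continue  # this feature does not split the data
        mean_off = sum(off) / len(off)
        mean_on = sum(on) / len(on)
        sse = (sum((r - mean_off) ** 2 for r in off)
               + sum((r - mean_on) ** 2 for r in on))
        if best is None or sse < best[0]:
            best = (sse, j, mean_off, mean_on)
    return best

def fit_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(y) / len(y)  # initial prediction: mean of the targets
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        best = fit_stump(X, residuals)
        if best is None:
            break
        _, j, mean_off, mean_on = best
        stumps.append((j, mean_off, mean_on))
        preds = [p + learning_rate * (mean_on if row[j] == 1 else mean_off)
                 for row, p in zip(X, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}

def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    value = model["base"]
    for j, mean_off, mean_on in model["stumps"]:
        value += model["learning_rate"] * (mean_on if row[j] == 1 else mean_off)
    return value

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: three fingerprint bits; the third bit is irrelevant
X = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0],
     [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
y = [5.0 + 2.0 * row[0] - 1.0 * row[1] for row in X]

model = fit_boosting(X, y)
baseline_mse = mse(y, [model["base"]] * len(y))
boosted_mse = mse(y, [predict(model, row) for row in X])
print(baseline_mse, boosted_mse)
```

Each round fits the residuals left by the rounds before it, so the training error shrinks step by step; the learning rate controls how much of each correction is applied.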

Validation of QSAR Models

Model validation plays a crucial role in confirming that a fitted model can reliably predict responses for future or unseen subjects. In this study, we evaluated the performance of the QSAR models using three statistical parameters, namely, R², root mean square error (RMSE), and computation time. R² quantifies the proportion of variance in the dependent variable explained by the independent variables. RMSE assesses the average magnitude of the errors between predicted values and actual values. Computation time refers to the time required to train the model and make predictions. While time is an important consideration in real-world applications, it does not directly affect model accuracy. A shorter time is preferred, especially for real-time applications or when dealing with large datasets. In summary, models with higher R², lower RMSE, and shorter computation time are preferred.
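For reference, the two accuracy metrics can be written out directly from their standard definitions. The sketch below is illustrative only; the experimental and predicted pIC₅₀ values are hypothetical and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot

# Hypothetical experimental and predicted pIC50 values
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]

print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

R² can be negative: a model whose squared error exceeds that of always predicting the mean has SS_res > SS_tot, which is why some entries in a model comparison may fall below zero.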

Server Deployment

We have developed and successfully deployed NeuroIC₅₀, a web application built with the Streamlit framework for predicting the bioactivity of chemical compounds targeting AβA4, a crucial target in AD research. The application offers a comprehensive, interactive interface that seamlessly integrates various functionalities, such as molecular descriptor calculation and ML-based prediction. Users can easily upload their molecular data in text format, after which the application computes molecular descriptors utilizing the PaDEL-Descriptor tool, executed via a subprocess, employing the same PubChem fingerprint calculation protocol used during model development. These calculated descriptors serve as input for a pre-trained ML model, which then predicts bioactivity in terms of pIC₅₀ values. The results are dynamically displayed on the app and can be downloaded in comma-separated values (CSV) format, ensuring an efficient and user-friendly experience for bioactivity prediction tasks.

Results

Chemical Space of AβA4 Inhibitors

We carried out a chemical space analysis of AβA4 inhibitors to gain insights into the structure-activity relationship (SAR) by examining Lipinski’s rule of five descriptors, consisting of MW, the logarithm of the partition coefficient between n-octanol and water (ALogP), the nHBDon, and the nHBAcc. MW, a key size indicator, plays a key role in a compound’s ability to pass through lipid membranes, while ALogP, a commonly used indicator of a compound’s lipophilicity, is important for determining membrane permeability. nHBDon and nHBAcc describe the hydrogen bonding potential of a compound, relevant for assessing hydrogen bond formation capability.

Visualization of the chemical space of AβA4 inhibitors, particularly ALogP as a function of MW (Figure 2), revealed a dense clustering of inhibitors in the MW range of 200–600 Da and ALogP values between 2 and 7. ALogP, nHBAcc, nHBDon, and MW are shown in Figure 3A–3D, respectively. The Mann–Whitney U test was used to statistically compare the descriptors between active and inactive compounds. The analysis shows significant differences in nHBDon and nHBAcc: active compounds exhibit lower values of both descriptors than inactive compounds. In contrast, while there were observable differences in ALogP and MW, these were not statistically significant based on the Mann–Whitney U test.
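The comparison between active and inactive compounds can be sketched as follows. This is a minimal standard-library version of the Mann–Whitney U test using the normal approximation without tie or continuity correction, so its p-values are only approximate; library routines such as SciPy's scipy.stats.mannwhitneyu handle those refinements. The hydrogen bond donor counts are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, approximate two-sided p-value) for two independent samples."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_value

# Hypothetical hydrogen bond donor counts
active_nhbdon = [1, 2, 1, 0, 2, 1]
inactive_nhbdon = [3, 2, 4, 3, 5, 2]

u_stat, p_value = mann_whitney_u(active_nhbdon, inactive_nhbdon)
print(u_stat, p_value)
```

A rank-based test is a natural choice here because descriptor counts such as nHBDon are discrete and need not be normally distributed.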

Figure 2.

Chemical Space of Amyloid-beta A4 (AβA4) Inhibitors. Active and Inactive are Shown in Red and Blue, Respectively.

Figure 3.

Boxplot of Amyloid-beta A4 (AβA4) Inhibitors Using Lipinski’s Rule of Five Descriptors. (A) Box Plots of ALogP, (B) Box Plots of Number of Hydrogen Bond Acceptors (NumHAcceptors), (C) Box Plots of Number of Hydrogen Bond Donors (NumHDonors), and (D) Box Plots of Molecular Weight.

QSAR Model for Predicting AβA4 Inhibitory Activity

We utilized the dataset consisting of 1,241 compounds to construct QSAR models. We generated molecular descriptors for this dataset using the PaDEL-Descriptor module in Python, which offers a wide range of fingerprint descriptors. For this study, we explicitly employed 881 PubChem fingerprint descriptors for model development and benchmarking. Later, we performed feature selection to eliminate collinear descriptors, and 42 models were built using an 80/20 train-test split. Performance results are summarized in Table 1, with the best models characterized by high R² values, low RMSE, and minimal overfitting.

Table 1.

Performance Summary of Training Set for Predicting Negative Logarithm of IC₅₀ (pIC₅₀).

Model Name	Adjusted R²	R²	RMSE	Time Taken
Hist gradient boosting regressor	−0.20	0.65	0.73	0.78
Nu SVR	−0.30	0.62	0.76	0.21
K neighbors regressor	−0.30	0.62	0.76	0.04
SVR	−0.31	0.61	0.76	0.20
LGBM regressor	−0.32	0.61	0.77	0.22
Random forest regressor	−0.35	0.60	0.77	1.83
XGB regressor	−0.35	0.60	0.78	0.23
Gradient boosting regressor	−0.36	0.60	0.78	0.77
MLP regressor	−0.38	0.59	0.78	3.51
Bagging regressor	−0.41	0.58	0.79	0.34
Bayesian ridge	−0.85	0.45	0.91	0.22
Poisson regressor	−0.85	0.45	0.91	0.05
Elastic net CV	−0.87	0.45	0.91	8.09
Lasso CV	−0.90	0.44	0.92	3.60
Ridge CV	−0.92	0.43	0.92	0.07
SGD regressor	−0.95	0.42	0.93	0.06
Ada boost regressor	−1.00	0.41	0.94	0.78
Orthogonal matching pursuit	−1.08	0.38	0.96	0.03
Orthogonal matching pursuit CV	−1.08	0.38	0.96	0.07
Ridge	−1.10	0.38	0.97	0.05
Gamma regressor	−1.11	0.38	0.97	0.04
Tweedie regressor	−1.17	0.36	0.98	0.04
Huber regressor	−1.20	0.35	0.99	0.13
Extra trees regressor	−1.23	0.34	0.99	2.39
Transformed target regressor	−1.31	0.32	1.01	0.04
Linear regression	−1.31	0.32	1.01	0.11
Extra tree regressor	−1.40	0.29	1.03	0.05
Decision tree regressor	−1.48	0.27	1.05	0.09
Lasso Lars CV	−1.51	0.26	1.06	0.35
Linear SVR	−1.52	0.25	1.06	0.49
Lasso Lars IC	−1.63	0.22	1.08	0.20
Lars CV	−1.65	0.22	1.09	0.44
Passive aggressive regressor	−2.26	0.04	1.20	0.04
Elastic net	−2.45	−0.02	1.24	0.04
Dummy regressor	−2.45	−0.02	1.24	0.03
Lasso Lars	−2.45	−0.02	1.24	0.04
Lasso	−2.45	−0.02	1.24	0.03
Quantile regressor	−2.49	−0.03	1.25	0.27
RANSAC regressor	−8.09	−1.69	2.01	0.89
Gaussian process regressor	−47.37	−13.30	4.64	0.47
Kernel ridge	−73.73	−21.09	5.76	0.10
Lars	−7,072,599,183,785,335,650,237,567,467,520.00	−2,090,282,349,863,682,157,478,515,048,448.00	1,772,896,344,250,828.25	0.11

Note: CV: Cross-validation; IC: Information criterion; LGBM: Light gradient boosting machine; MLP: Multi-layer perceptron; RANSAC: Random sample consensus; RMSE: Root mean square error; SGD: Stochastic gradient descent; SVR: Support vector regression; XGB: Extreme gradient boosting.

The histogram gradient boosting regressor (HGBR) achieved the highest training set R² (0.65) and the lowest RMSE (0.73), while nu support vector regression yielded an R² of 0.62 and an RMSE of 0.76, as shown in Table 1. Upon evaluation on the test set, these two models exhibited consistent performance across both sets, suggesting they learned patterns applicable to new data and are less prone to overfitting. Although the HGBR showed a negative adjusted R² value (−0.20), this does not reflect poor performance; rather, the metric penalizes additional features that do not contribute sufficient explanatory gain. In the scatterplot (Figure 4), predicted pIC₅₀ values are plotted against experimental pIC₅₀ values. Most data points cluster around the diagonal line, indicating that the HGBR effectively captures the relationship between molecular descriptors and pIC₅₀ values. This positive correlation reflects the model’s accuracy across the dataset.
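The effect of the feature penalty can be illustrated with the standard adjusted R² formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples scored and p the number of descriptors. The sketch below is illustrative only: the values n = 249 and p = 176 are hypothetical, since the number of descriptors retained after filtering and the size of the scored set are not reported here.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("adjusted R² needs n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Hypothetical sizes: many descriptors relative to the number of scored samples
many_features = adjusted_r2(0.65, n_samples=249, n_features=176)

# Same R² with only a handful of descriptors
few_features = adjusted_r2(0.65, n_samples=249, n_features=5)

print(round(many_features, 2), round(few_features, 2))
```

With the same R² of 0.65, the adjusted value is negative when the descriptor count approaches the sample count and stays close to 0.65 when only a few descriptors are used, which is the behavior described above.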

Figure 4.

Plot of Experimental Versus Predicted Negative Logarithm of IC₅₀ (pIC₅₀) Values for Models Constructed with 881 PubChem Fingerprint Descriptors.

Mechanistic Interpretation of Feature Importance

Feature importance analysis helps identify key features that contribute towards bioactivity. When evaluating the relative importance of features in models using the HGBR algorithm, two parameters mainly come into play: (a) SHapley Additive exPlanations (SHAP) and (b) Gini index (i.e., variance of the responses). The latter was selected as a metric for ranking important features (i.e., the MDGI) for predicting the pIC₅₀ of AβA4 inhibitors (Figure 5A). Table 2 lists the substructure fingerprints, along with their respective descriptions.

Figure 5.

(A) Plot of Mechanistic Feature Importance as Exemplified by the Gini Index. (B) Plot Showing the Distribution of Active Amyloid-beta A4 (AβA4) Inhibitors (Gray Circles) and the Diversity Set (Red Circles) Selected for Molecular Docking.

Table 2.

List of Top PubChem Substructure Fingerprints and Their Corresponding Description.

Fingerprints	Description
PubChemFP23	>=1F
PubChemFP405	O(~C)(~C)
PubChemFP577	C:C-N-C:C
PubChemFP697	C-C-C-C-C-C(C)-C
PubChemFP438	C(-C)(-N)(=N)
PubChemFP615	N-C-C-O-C
PubChemFP704	O=C-C-C-C-C-C-C
PubChemFP535	O=C-C-C
PubChemFP542	O-C:C-[#1]
PubChemFP596	N=C-N-C-C

In Figure 5A, the top-ranking feature is >=1F (PubChemFP23), which indicates the presence of at least one fluorine atom. In drug design, fluorine’s small size and electronic properties allow it to influence drug-receptor interactions through hydrophobic, electrostatic, and hydrogen-bond interactions, often leading to enhanced drug activity.²⁷ The second most important feature is O(~C)(~C) (PubChemFP405), which refers to an oxygen atom bonded to two carbon atoms. Oxygen atoms in various functional groups, including ethers and esters, exhibit distinct electron density configurations that influence hydrogen bonding directionality. Allylic structures containing oxygen atoms can enhance drug potency and toxicity. These unconventional hydrogen bonds, involving oxygen as an acceptor, contribute significantly to ligand-receptor interactions and are crucial in drug discovery.²⁸

The third most important feature is C:C-N-C:C (PubChemFP577), a nitrogen atom single-bonded to two carbon atoms, which resembles the pattern of enamines. Enamines are central to biological catalysts like aldolases and small-molecule amine organocatalysts.²⁹ The fourth most important feature is C-C-C-C-C-C(C)-C (PubChemFP697), a branched alkane with a 2-methylheptane skeleton. Introducing methyl groups into small molecules has become an important strategy for lead compound optimization, improving metabolic stability and generating new uses for existing drugs.³⁰ The fifth most important feature is C(-C)(-N)(=N) (PubChemFP438), resembling guanidine. Tryptophan-derived guanidine compounds, such as TGN2, have shown potential as multifunctional agents, inhibiting both BACE1 and amyloid aggregation while demonstrating neuroprotective effects.³¹ The sixth most important feature is N-C-C-O-C (PubChemFP615), resembling amino alcohols or amino ethers. Tailor-made amino acids, including their modifications like diamines and amino alcohols, are key structural features in many blockbuster drugs, demonstrating their importance across various therapeutic areas.³² The seventh, eighth, ninth, and tenth most important features are O=C-C-C-C-C-C-C (PubChemFP704), corresponding to heptanoic (enanthic) acid; O=C-C-C (PubChemFP535), corresponding to propanoic acid; O-C:C-[#1] (PubChemFP542), corresponding to the enol ether functional group; and N=C-N-C-C (PubChemFP596), characteristic of substituted guanidine derivatives, respectively.³³

Discussion

Our study evaluates a diverse set of chemical compounds for their ability to inhibit AβA4, a therapeutic target in AD. Using a combination of ML-based QSAR modeling and molecular docking, we evaluated the bioactivity of candidate compounds (Figure 5B). Our study shows that CHEMBL5080033 has a predicted pIC₅₀ of 8.67 and a binding energy of −7.6 kcal/mol, making it a promising lead for future therapeutic development.

The chemical space analysis of AβA4 inhibitors highlighted that active compounds tend to have lower numbers of hydrogen bond donors and acceptors, suggesting that steric and hydrogen-bonding features influence binding interactions. We also observed that the molecular descriptors, such as MW and lipophilicity, cluster within ranges of favorable pharmacokinetic properties, further supporting their potential as drug-like candidates.

HGBR outperformed the other 41 models in predicting pIC₅₀ values, as shown by its R² of 0.65 and RMSE of 0.73. We identified key substructural fingerprints contributing to inhibitory activity via the HGBR-derived Gini index, including fluorine atoms (PubChemFP23), oxygen ethers/esters (PubChemFP405), enamines (PubChemFP577), branched alkanes (PubChemFP697), guanidine motifs (PubChemFP438), and amino alcohols/ethers (PubChemFP615). These functional groups enhance ligand-receptor interactions through a combination of hydrogen bonding, electrostatic, and hydrophobic interactions, which provides insights into SARs and helps in lead optimization.

Molecular docking studies further support our QSAR findings, showing stable interactions between selected inhibitors and the AβA4 target through hydrogen bonds and hydrophobic contacts. These results suggest that the identified lead compounds not only possess favorable predicted bioactivity but also form energetically stable complexes with their target.

While the current study provides strong computational evidence for promising AβA4 inhibitors, further experimental validation is necessary to confirm biological activity. Additionally, molecular dynamics simulations could be employed in future work to assess the stability and conformational flexibility of ligand-bound complexes, which could help in understanding the mechanistic basis of inhibition and supporting the advancement of these compounds toward clinical viability.

Conclusion

In conclusion, we utilized 881 PubChem fingerprint descriptors to develop QSAR models and compared their performance. The histogram gradient boosting regressor (HGBR) performed best among the constructed models, indicating that it could capture the feature space of AβA4 inhibitors. Utilizing the Gini index, a built-in feature importance estimator of the HGBR, we were able to identify key features essential for AβA4 inhibition: PubChemFP23, PubChemFP405, PubChemFP577, PubChemFP697, PubChemFP438, PubChemFP615, PubChemFP704, PubChemFP535, PubChemFP542, and PubChemFP596. The insights gained from this research are expected to serve as general guidelines for the development of novel AβA4 inhibitors.

Footnotes

Abbreviations

AβA4: Amyloid-beta A4; AD: Alzheimer’s disease; AI: Artificial intelligence; ALogP: Logarithm of octanol/water partition coefficient; API: Application programming interface; CDD: Computational drug discovery; HGBR: Histogram gradient boosting regressor; MDGI: Mean decrease of the Gini index; ML: Machine learning; MW: Molecular weight; nHBAcc: Number of hydrogen bond acceptors; nHBDon: Number of hydrogen bond donors; pIC₅₀: Negative logarithm of IC₅₀; QSAR: Quantitative structure-activity relationship; RMSE: Root mean square error.

Authors’ Contribution

SRK designed and performed the experiments. SRK and IMS analyzed the data. SRK wrote the original draft, and SRK and IMS contributed to the review and editing of the manuscript. SPD conceptualized, reviewed, and edited the manuscript and supervised the study.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical Approval

Ethical approval was obtained from the relevant ethics committee or Institutional Review Board (IRB).

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Informed Consent

The participant has provided informed consent for the submission of the article to the journal.

ORCID iD

Sugapriya Dhanasekaran

References

1. Budson and O’Connor MK. What is Alzheimer’s disease? In: Budson AE and O’Connor MK (eds) Six steps to managing Alzheimer’s disease and dementia. Oxford University Press; 2021, pp. 11–18.
2. Uddin. Alzheimer’s disease and you: can Alzheimers abduct consciousness? J Neurol Disord 2017; 5(5): e123.
3. Bennet, Jeyaraj JPG, Subha, et al. Alzheimer’s disease recognition and detection using machine learning. 2024 5th international conference on smart electronics and communication (ICOSEC). Trichy, India: IEEE; 2024, pp. 1395–1401.
4. Alzheimer’s Association. 2014 Alzheimer’s disease facts and figures. Alzheimer’s Dement 2014; 10(2): e47–e92.
5. Brookmeyer, Johnson, Ziegler-Graham, et al. Forecasting the global burden of Alzheimer’s disease. Alzheimer’s Dement 2007; 3(3): 186–191.
6. Lastuka, Bliss, Breshock, et al. Societal costs of dementia: 204 countries, 2000–2019. J Alzheimer’s Dis 2024; 101(1): 277–292.
7. Javaid, Giebel, Khan, et al. Epidemiology of Alzheimer’s disease and other dementias: rising global burden and forecasted trends. F1000Research 2021; 10: 425.
8. Miller. Is pharma running out of brainy ideas? Science 2010; 329(5991): 502–504.
9. Azargoonjahromi. The duality of amyloid-β: its role in normal and Alzheimer’s disease states. Mol Brain 2024; 17(1): 44.
10. Sgourakis, Yan, McCallum, et al. The Alzheimer’s peptides Aβ40 and 42 adopt distinct conformations in water: a combined MD/NMR study. J Mol Biol 2007; 368(5): 1448–1457. doi:10.1016/j.jmb.2007.02.093
11. Lowe, Duggan Evans, Shcherbinin, et al. Donanemab (LY3002813) dose-escalation study in Alzheimer’s disease. Alzheimers Dement (NY) 2021; 7(1): e12112. doi:10.1002/trc2.12112
12. Charbuty and Abdulazeez. Classification based on decision tree algorithm for machine learning. J Appl Sci Technol Trends 2021; 2: 20–28. doi:10.38094/jastt20165
13. Sheng, Zhang, Huang, et al. Comparison of conventional mathematical model and machine learning model based on recent advances in mathematical models for predicting diabetic kidney disease. Digit Health 2024; 10: 20552076241238093. doi:10.1177/20552076241238093
14. Maulud and Abdulazeez AM. A review on linear regression comprehensive in machine learning. J Appl Sci Technol Trends 2020; 1: 140–147. doi:10.38094/jastt1457
15. Shah, Patel, Sanghvi, et al. A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment Hum Res 2020; 5: 12. doi:10.1007/s41133-020-00032-0
16. Raza, Aslam, Sher, et al. Autonomic performance prediction framework for data warehouse queries using lazy learning approach. Appl Soft Comput 2020; 91: 106216. doi:10.1016/j.asoc.2020.106216
17. Tang, Kong, Huang, et al. Large language models can be lazy learners: analyze shortcuts in in-context learning. In: Findings of the association for computational linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics; 2023, pp. 4645–4657. doi:10.18653/v1/2023.findings-acl.284
18. Shoombuatong, Prathipati, Owasirikul, et al. Towards the revival of interpretable QSAR models. In: Roy (ed) Advances in QSAR modeling: applications in pharmaceutical, chemical, food, agricultural and environmental sciences. Cham: Springer International Publishing; 2017, pp. 3–55.
19. García-Jacas, Marrero-Ponce, Acevedo-Martínez, et al. QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J Comput Chem 2014; 35(18): 1395–1409. doi:10.1002/jcc.23640
20. Valdés-Martiní, Marrero-Ponce, García-Jacas, et al. QuBiLS-MIDAS: a parallel free-software for molecular descriptor computation based on two- and three-dimensional structures. J Cheminform 2017; 9: 35.
21. Dong, Cao, Miao, et al. ChemSAR: an online pipelining platform for molecular SAR modeling. J Cheminform 2015; 7: 60.
22. Kuhn. Building predictive models in R using the caret package. J Stat Softw 2008; 28(5): 1–26.
23. McKinney. pandas: a foundational Python library for data analysis and statistics. Python High Perform Sci Comput 2011; 14(9): 1–9.
24. James, Witten, Hastie, et al. An introduction to statistical learning: with applications in R. New York: Springer; 2013.
25. Breiman. Random forests. Mach Learn 2001; 45(1): 5–32. doi:10.1023/A:1010933404324
26. Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011; 12: 2825–2830.
27. Gupta SP. Role of fluorine in drug design and drug action. Lett Drug Des Discov 2019; 16(1). doi:10.2174/1570180816666190130154726
28. Plachinski and Yoon TP. Single-atom editing with light. Science 2024; 386(6717): 27. doi:10.1126/science.ads2595
29. Farghaly, Alosaimy, Al-Qurashi, et al. The most recent compilation of reactions of enaminone derivatives with various amine derivatives to generate biologically active compounds. Mini Rev Med Chem 2023; 24(8): 793–843. doi:10.2174/1389557523666230913164038
30. Kim, Semenya and Castagnolo. Antimicrobial drugs bearing guanidine moieties: a review. Eur J Med Chem 2021; 216: 113293. doi:10.1016/j.ejmech.2021.113293
31. Gomes, Varela, Pires, et al. Synthetic and natural guanidine derivatives as antitumor and antimicrobial agents: a review. Bioorg Chem 2023; 138: 106600. doi:10.1016/j.bioorg.2023.106600
32. Han, Konno, Sato, et al. Tailor-made amino acids in the design of small-molecule blockbuster drugs. Eur J Med Chem 2021; 220: 113448. doi:10.1016/j.ejmech.2021.113448
33. Karunanidhi, Thomas, van Weeghel, et al. Heptanoic and medium branched-chain fatty acids as anaplerotic treatment for medium chain acyl-CoA dehydrogenase deficiency. Mol Genet Metab 2023; 140(3): 107689. doi:10.1016/j.ymgme.2023.107689
