Sage Journals: Discover world-class research

Abstract

With the rise in the use of Android smartphones, there has been a proportional surge in the proliferation of malicious applications (apps). As mobile phone users are at a heightened risk of data theft, detecting malware on Android devices has emerged as a pressing concern within the realm of cybersecurity. Conventional techniques, such as signature-based routines, are no longer sufficient to safeguard users from the continually evolving sophistication and swift behavioral modifications of novel varieties of Android malware. Hence, there has been a significant drive in recent times towards leveraging machine learning (ML) models and methodologies to identify and generalize malicious behavioral patterns of mobile apps for detecting malware. This paper proposes Deep learning (DL) based on new and highly reliable classifier, deep neural decision forest (DNDF) for detecting Android malware. Two datasets were used: Drebin and 2014 for comparison with previous studies, and TUANDROMD collected in 2021 for detecting the latest threats with advanced obfuscation and morphing techniques. We have also calculated the time-consuming and computational resources taken by our classifier. After conducting a thorough performance evaluation, our proposed approach attained impressive results on two datasets. The empirical findings reveal that the proposed DBN and DNDF models demonstrated exceptional performance, achieving an accuracy of 99%, a sensitivity of 1, and an AUC value of 0.98%. The metrics we obtained are comparable to those of state-of-the-art ML-based Android malware detection techniques and several commercial antivirus engines.

Keywords

Android malware detection deep learning auto-encoder deep belief neural network deep neural decision forest

1. Introduction

The year 2008 marked the initial release of the Android operating system (OS) as an open-source initiative in its present format. Since then, it has witnessed significant acclaim from users due to its ability to be tailored to individual preferences and its minimal hardware demands. In 2022, the number of active Android devices drew close to the staggering figure of 2,257.17 million, reaffirming its position as the most widely used operating system globally [1]. The growth of Android OS adoption is depicted in Fig. 1, which highlights the swift rate of expansion. The substantial expansion of Android’s user base can be attributed to its extensive integration into various technologies, such as mobile phones, the Internet of Things (IoT), Internet of Multimedia Things (IoMT), industrial IoT (IIoT), connected vehicles, and smart home appliances.

The substantial proliferation of the Android OS has resulted in significant security concerns. One particularly crucial challenge is the rise of malicious apps that contain or implant malware [2]. The Google Play Store is the primary source for Android users to access and download apps. Over the years, it has witnessed tremendous growth, with the number of apps available on the platform skyrocketing from a mere 16,000 in December 2009 to a staggering 2,257.17 million, by August 2022 [3]. The fast-paced expansion of the Google Play Store led to phases of inadequate security evaluations of the apps featured in the platform, thereby leading to numerous significant incidents where millions of users downloaded malware-infected apps [4, 5]. During these occurrences, the security measures on the Play Store may have faltered in detecting instances of malware or potentially unwanted programs. Alternatively, it is possible that these apps did not undergo the necessary screening process at all.

As per [4], the Google Play Store accounts for just 87% of the total Android apps downloaded. The remaining 13% of Android apps downloaded originate from various sources such as alternative markets, older apps backups, and direct downloads through browsers, among others. However, it is worth noting that these sources are typically considered unreliable. In numerous instances, attackers promote these non-Play Store downloads as a free edition of an otherwise paid app. Attackers also attempt to disguise their harmful apps as useful ones. In a more recent occurrence, attackers developed apps that could bypass the security protocols of the Google Play Store. While these apps may not contain malware upon installation, they are designed to secretly download and inject malware into Android devices later in time [6]. During the aforementioned occurrence, a banking trojan was spread through an app downloaded more than 300,000 times. According to a study cited in [7], only 35% of Android users carefully read all the permissions requested by an app before proceeding with its installation. According to the same study [7], only 77% of Android users have declined to install an app due to the permissions it requires, at least once. These statistics suggest a significant lack of user awareness regarding the potential risks of installing apps without thoroughly reviewing and comprehending the requested access permissions.

The security challenges above underscore the critical importance of developing a robust and dependable system for detecting malware. The malware detection solution must be capable of automatically learning and extracting intricate features from input data without requiring manual feature engineering. This can be particularly useful in the case of Android malware detection, where the malware may have sophisticated and constantly evolving behaviors that are difficult to capture using traditional handcrafted features.

To cope with the above challenges, this study presents a novel approach for android malware detection using four DL-based classifiers, including Convolutional neural networks (CNN), Autoencoder (AE), Deep Belief Neural network (DBN), and Deep neural decision forest (DNDF). Our research involved two different datasets Drebin 2014 [8] and TUANDROMD 2021 [9] to assess the effectiveness of our proposed approach against established benchmarks and to facilitate comparison with prior studies. Using an old dataset allowed us to compare our results with previous studies’ results. Using a new dataset provided an advantage in detecting the latest threats, as modern adversaries often employ advanced obfuscation and morphing techniques that render traditional methods ineffective. In addition to evaluating the performance metrics of the DL-based classifiers, we have also computed their respective training times and the required computational resources on both datasets. This analysis provides a comprehensive overview of the efficiency and scalability of the classifiers, which is crucial for practical deployment in real-world scenarios. To the best of our knowledge, there is a dearth of prior research that has conducted a comparative analysis of computational resources and time consumption among CNN, AE, DBN, and DNDF classifiers for the purpose of Android malware detection. Additionally, there is a lack of studies utilizing appropriate metrics to evaluate the performance of these classifiers on the given datasets. Our research contribution includes the following key points:

•
To develop four novel DL-based classifiers specifically designed for Android malware detection and classification, achieving promising results. This approach can serve as a foundation for further research in this domain, contributing to the advancement of the field.
•
To compare the four DL-based classifiers in terms of accuracy, precision, recall, MAE, MSE, RMSE, R2 score, AUC, and sensitivity, as well as the time rate and computational resources required. This analysis provides insights into each classifier’s strengths and weaknesses and helps identify the most efficient and effective solution for Android malware detection.
•
To evaluate the performance of the four classifiers on two different datasets, Drebin and TUANDROMD, shows the effectiveness of the classifiers on both old and new datasets and provides insights into the evolution of Android malware over time.

The paper is structured as follows: In Section 2, we provide an overview of the relevant literature. In Section 3, we outline the research design, including the methods employed for data collection and analysis. Section 4 presents the study’s findings. Section 5 is dedicated to analyzing the results, discussing their implications, and establishing connections to existing literature and research objectives. Finally, Section 6 concludes the study by summarizing the main findings, emphasizing the contributions of the study, and providing suggestions for further research.

Figure 1.
The number of internet users worldwide between 2012 and 2022, categorized by operating system, is presented in the following data [1].

2. Related work

2.1 Background

The following section provides a brief introduction to Android apps and presents a review of the current literature on detecting malware in such apps. Android apps are distributed in the form of an Android Application Package (APK) file, which is a compressed archive file capable of being installed on IoT devices running on the Android platform. The APK creator employs the Android SDK’s dx tool to generate the dex file to transform the compiled Java class files. This tool merges, rearranges, and optimizes various class files to minimize size and enhance execution speed. The dex file, created by converting the compiled Java class files using the dx tool of the Android SDK, serves as the centerpiece of the APK. The Dalvik virtual machine is a crucial component of the Android platform that allows the interpretation and execution of APK files. These APKs contain sections containing vital information about all the classes in the file. To analyze the APK, this paper proposes using the classes.dex file, which serves as a representation of the APK.

2.2 Conventional approaches to detecting malware on android

There are three main types of approaches for detecting Android malware: static ,dynamic and hybrid detection, which are considered to be Conventional approaches. Static detection appraoch, such as those based on signatures, permissions, bytecode, and hybrid static analysis, examine static APKs for potential malicious features without requiring app execution. To create accurate and reliable signatures, signature-based methods rely on identifying particular character sequences or semantic patterns in the source file [10]. Because of their efficiency, commercial anti-malware products frequently utilize signature-based methods. Permissions serve as the initial security checkpoint for identifying malware on the Android platform. When it comes to detecting possible malware, the permission-based approach primarily scrutinizes the sensitive or dubious permissions that are requested by the software developer and declared in the AndroidManifest.xml file. Examples of permission-based methods include Kirin, developed by Enck et al. [11], and PUMA, developed by Sanz et al [12]. Within the dex bytecode file, there is a wealth of semantic information regarding appl behaviors, including API calls and data flows. The bytecode-based method involves the disassembly of dex files to analyze APKs. Examples of such methods include DroidAPIMiner [13] and MamaDroid [14]. Hybrid static analysis methods, like Drebin [8] and FlowDroid [15], use multiple files to extract various types of features and enhance detection accuracy. In conclusion, while static analysis methods are quick and effective, they may be ineffective against malware that has undergone code obfuscation or encryption [16].

The use of dynamic detection methods involves running APKs in isolated environments such as sandboxes, simulators, or virtual machines, and monitoring their behavior to determine if they are malicious [17]. By tracking their actions, it is possible to identify and prevent potential security threats. The TaintDroid system is a dynamic analysis tool for Android that operates at the system-wide level and detects malware by monitoring the movement of sensitive data [18]. In contrast DroidScope [19], reconstructs semantic information at both the Linux OS and Java Dalvik levels to perform dynamic analysis of APKs. These methods enable the identification and prevention of potential security threats in Android systems. Dynamic detection methods employ diverse features, including system calls, network traffic, user interaction, and system components, with the aim of identifying potential security threats. This approach is described in detail in [20]. Dynamic detection methods are highly effective in detecting potential security threats in Android systems, even when code obfuscation and encryption techniques are used. Dynamic methods can identify malicious behaviors that may be missed by static detection. This makes dynamic detection an essential tool for ensuring the security of Android systems, as noted in [21]. Despite its advantages, dynamic detection has limitations that must be considered. One issue is the low code coverage, which means that not all branches of an app can be tested. Consequently, hidden code sections that may be triggered at certain times or under specific circumstances could be missed. Additionally, dynamic detection methods are time and resource intensive, which can make them less practical in certain situations.

The hybrid detection approach combines both static and dynamic detection methods to achieve comprehensive analysis of Android APKs. This approach involves analyzing the files of static APKs and monitoring the behavior of the APKs during execution. Previous studies, as demonstrated in [22, 23, 24], have shown the effectiveness of this approach. Hybrid detection approach offer a promising solution for combating code obfuscation and encryption while also improving code coverage. By combining both static and dynamic detection techniques, this approach can effectively identify potential security threats in Android systems. However, the considerable time and resource consumption required for hybrid detection may limit its practicality in some deployment scenarios.

2.3 Android malware detection using ML-based algorithms

Traditional methods for detecting malware in Android systems have relied heavily on building up libraries of signatures and the expertise of analysts, which can be challenging to scale up to keep pace with the rapid growth of Android malware. Fortunately, ML-based technologies offer a promising approach for efficient and automated Android malware detection, providing a new perspective for addressing this challenge [25]. ML-based methods for detecting Android malware typically involve four key steps: constructing a dataset, performing feature engineering, training a model, and evaluating the model. Feature engineering is a crucial step that involves extracting informative and robust features to accurately characterize APKs. Commonly used features for characterizing APKs include features extracted from the AndroidManifest.xml file and classes.dex features, and features based on dynamic behaviors. The selection of relevant features is essential for ensuring the validity and effectiveness of the model. The AndroidManifest.xml file is a critical source of information about an APK. It provides crucial details such as the hardware requirements, requested permissions, APK components, and filtered intents. These details are vital for understanding the APK’s functionality and ensuring that it can run properly on the target device. This file is a crucial reference point for feature extraction in ML-based approaches to Android malware detection. By analyzing the AndroidManifest.xml file, it is possible to identify important characteristics of an APK that may be relevant to its security and potential risk.

As described in [26], key features for detecting Android malware are extracted by parsing the AndroidManifest.xml files. The APK’s classes.dex file can be converted into a Smali format file, which includes Dalvik commands comprising opcodes and operands. This transformation enables the extraction of critical features that can be used for ML-based detection of Android malware. In [27], n-gram features are extracted from the opcode sequences to train malware classification models, highlighting the importance of leveraging advanced features extracted from classes.dex files for ML-based detection of Android malware. For example, features such as control flow diagrams [28] and API dependency graphs [29, 30] can be extracted from disassembled classes.dex files to improve the accuracy and effectiveness of the detection models. These features provide a comprehensive representation of an APK’s behavior and potential security risks, enabling more precise identification of potential threats. As reported in the literature [31, 32], dynamic behavioral features that capture critical information about Android malware can be obtained by executing apps in an isolated environment. The following traits are encompassed within these characteristics: file operations, network operations, encryption operations, service initiation, phone calls, and system calls. The amalgamation of static and dynamic features has the potential to improve the detection efficacy of Android malware. Static features can be extracted from the APK’s code structure and functionality, while dynamic features can be collected by monitoring the APK’s behavior in a sandboxed environment. Integrating both feature sets into ML-based models enables more accurate and comprehensive identification of security threats. The selection of models in Android malware detection can vary from traditional ML-based approaches, such as Support Vector Machines (SVM) [8] and Random Forests (RF) [14], to more advanced DL-based models, such as CNN [33], DBN [34] and Long Short-Term Memory (LSTM) [35].

2.4 Android malware detection using cutting edge algorithms

In recent years, the increasing popularity of mobile devices has led to a surge in the number of Android malware attacks. To combat this issue, researchers have explored the use of DL-based techniques to develop more effective malware detection methods. These studies have proposed various DL-based architectures, such as CNN, LSTM, and RNN to detect Android malware.

Qamar et al. [36] conducted a review of papers published between 2013 and 2019, focusing on mobile malware attacks, detection techniques, and potential solutions highlighted five studies that utilized DL-based algorithms for detecting mobile malware. Among these studies, four utilized the CNN algorithm, while one utilized the Artificial Neural Networks (ANN) algorithm. The review indicated that DL-based techniques offer superior performance compared to traditional ML-based algorithms, but their computational complexity is higher, leading to higher computational costs. In a study conducted by Slan and Samet [37], malware detection research papers were classified based on various techniques. These techniques encompass a range of approaches, including signature-based, behavior-based, heuristic-based, model checking-based, DL-based, cloud-based, mobile-based, and IoT-based methodologies. This categorization provides a useful framework for understanding the various approaches to detecting malware and highlights the diversity of methods used in the field. They reviewed four papers that used DL-based approaches, including DBN, ANN, and convolutional layers, and incorporated features such as API calls, system calls, and hybrid features. However, their primary focus was not to analyze the DL-based techniques for malware detection. Pan et al. [38] conducted an SLR on Android malware detection, categorizing static analysis into four types. Under the Neural Network (NN) category, DL-based models were classified by the authors. They determined that static analysis is a reliable method for identifying Android malware. While their primary focus was on Android, this article reviews DL-based approaches for detecting mobile malware more broadly. Feizollah et al. [9] reviewed 100 papers on feature selection techniques for Android malware detection, finding that only 8 utilized feature selection algorithms. They categorized features into static, dynamic, hybrid, and application metadata. Android permissions were the most popular static feature 36%, and system calls were the most dominant dynamic feature. Nearly half of the papers that used dynamic features used network traffic data.

3. Methodology and experiments

The primary objective of this section is to gain a deep understanding of the methodology employed in the development of precise and resilient malware detection systems. To achieve this, a comprehensive overview of these vital aspects is presented, ensuring a thorough examination of their intricacies and significance.

3.1 Datasets description

Android apps are created using Java programming language and run on a customized version of the Java Virtual Machine. These apps are packaged in a Java Archive (JAR) file with the extension APK. Android app consists of four fundamental components: activities, services, broadcast receivers, and content providers. These components play pivotal roles in the structure and functionality of the app, contributing to its overall behavior and user experience. These building blocks are defined and specified in the app’s manifest file, which serves as a roadmap for the app’s functionality. The interaction and communication between these components are facilitated by a mechanism called intents, which act as messages to request an action or transfer data between them. Intents can also be filtered based on specific criteria, allowing the components to selectively respond to certain types of requests.

This study utilized two different datasets, namely Drebin [8] and TUANDROMD [39], to develop and evaluate DL-based models for Android malware detection.

3.1.1 Drebin dataset

Created by researchers at the University of GÃ¶ttingen in Germany, consists of 5,560 malware apps and 9,476 benign apps. Each app is characterized by features such as permissions requested, API calls, and network traffic, among others. The Drebin dataset has been widely used in research on Android malware detection and evaluation, making it a well-known and commonly used dataset in the field.

3.1.2 TUANDROMD dataset

Created by Nguyen and Nguyen in 2021, contains 4,468 Android APK files, including 3,565 malware samples and 903 benign samples. Each sample is represented as an APK file and includes its corresponding metadata, such as the app name, version code, package name, and permission list. The dataset was collected in 2021 and is available for download from several online sources, such as Kaggle and Zenodo.

Table 1
The volume of datasets

Dataset	Total volume	Training	Validation	Testing
Drebin	15,036	12,028	1503	1505
TUANDROMD	4468	3574	447	448

These datasets were partitioned randomly into three parts, where 80% was designated for training, 10% for validation, and the remaining 10% for testing. The training portion was utilized to train models, while the validation set was employed to refine and choose the most optimal model prior to its application on the testing set. These datasets provides a diverse set of samples, including various types of malware such as adware, spyware, and trojans, as well as benign apps. It serves as a benchmark for evaluating the performance of DL-based models for detecting Android malware, allowing researchers to develop and test new algorithms. The volume of the datasets presented in Table 1.

3.2 Data-preprocessing

In this research, we employed various techniques to address the challenges of working with unbalanced datasets and ensure the quality and reliability of the Drebin and Tezpur University Android Malware Dataset (TUANDROMD). We discretized the data and applied class imbalance treatment techniques of undersampling and oversampling to balance the datasets. Specifically, we randomly undersampled the majority class and oversampled the minority class using the Synthetic Minority Over-sampling Technique (SMOTE) with the RandomUnderSampler and SMOTE algorithms from the imbalanced-learn library in Python. By implementing these techniques, we were able to improve the performance of our classifiers on these challenging datasets. The balanced datasets that resulted were employed to train and assess the classifiers, while the performance was evaluated using several metrics such as accuracy, precision, and recall. Furthermore, we removed variables with significantly poor quality from each dataset to preserve data integrity. Thirdly, we utilized data cleansing procedures to identify and rectify any inconsistencies and errors present in the datasets. One of the crucial steps we took was to remove duplicate samples using selected unique samples with a distinct combination of package name ( $P$ ) and SHA256 hash value ( $H$ ),as shown in the Eq. (1).

$\displaystyle\text{Unique}=\ s\mid s\in S,\forall s^{\prime}\in S,(s\neq s^{% \prime}\wedge P_{s}=P_{s^{\prime}}\wedge H_{s}=H_{s^{\prime}})\Rightarrow s^{% \prime}\notin\text{Unique}$ (1)

In addition to handling duplicate samples, we also addressed the issue of missing feature values by imputing the mean or median based on the distribution of the feature values. These measures were essential for improving the overall quality and reliability of the Drebin and TUANDROMD datasets, which is crucial for accurate training of DL-based models. By ensuring the quality and reliability of these datasets, we can develop more robust and accurate models that can be applied to real-world problems. It is essential to ensure data quality and reliability before conducting experiments, particularly when dealing with complex datasets such as Drebin and TUANDROMD datasets. Moreover, these measures can be applied to other similar datasets to ensure their quality and reliability, thus contributing to the development of more accurate and reliable models.

3.3 Mini-Max normalization method

To ensure that all features in the Drebin and TUANDROMD datasets had an equal contribution to the analysis, we utilized the Minimax normalization method. Minimax normalization is a technique for scaling features that transforms numerical data into a predetermined range, typically ranging from 0 to 1. The technique is applied to each feature in the dataset to ensure that no feature dominates the analysis due to large scales or ranges. This normalization technique helps prevent inaccurate results from being produced. To ensure consistency between datasets, we utilized the Minimax normalization technique to standardize the numerical data to a common scale. This approach includes subtracting the feature’s minimum value from each data point and then dividing the outcome by the feature’s range. The Minimax normalization is given in Eq. (2).

$\displaystyle X_{\textit{norm}}=\frac{X-X_{\min}}{X_{\max}-X_{\min}}$ (2)

The equation presented here involves four variables. The variable $X$ represents the original data point, while $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values of the feature in the dataset, respectively. The variable $X_{\textit{norm}}$ represents the normalized value of $X$ by applying this method, we can ensure that all features in the Drebin and TUANDROMD datasets have an equal contribution to the analysis.

3.4 Evaluation metrics

The efficacy of the suggested techniques for identifying Android malware was assessed via statistical analysis and the computation of crucial metrics such as Pearson’s Correlation Coefficient ( $R$ ), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). The effectiveness of the proposed model was evaluated using Eqs (3) to (12) as a means of assessment.

$\displaystyle\textit{MSE}=\frac{1}{n}\sum\limits_{i=1}^{n}\left(y_{i,\exp}-y_{% i,\text{pred}}\right)^{2}.$ (3)

$\displaystyle\textit{RMSE}=\sqrt{\sum_{i=1}^{n}\frac{\left(y_{i,\exp}-y_{i,% \text{pred}}\right)^{2}}{n}}$ (4)

$\displaystyle R\%=\frac{n\left(\sum_{i=1}^{n}y_{i,\exp}\times y_{i,\text{pred}% }\right)-\left(\sum_{i=1}^{n}y_{i,\exp}\right)\left(\sum_{i=1}^{n}y_{i,\text{% pred}}\right)}{\sqrt{\left[n\left(\sum_{i=1}^{n}y_{i,\exp}\right)^{2}-\left(% \sum_{i=1}^{n}y_{i,\exp}\right)^{2}\right]\left[n\left(\sum_{i=1}^{n}y_{i,% \text{pred}}\right)^{2}-\left(\sum_{i=1}^{n}y_{i,\text{pred}}\right)^{2}\right% ]}}\times 100$ (5)

$\displaystyle R^{2}\textit{bn}1-\frac{\sum_{i=1}^{n}\left(y_{i,\exp}-y_{i,% \text{pred}}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i,\exp}-y_{\textit{avg},\exp}% \right)^{2}}$ (6)

$\displaystyle\text{Accuracy}=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+% \textit{FP}+\textit{FN}+\textit{TN}}\times 100\%$ (7) $\displaystyle\text{Specificity}=\frac{\textit{TN}}{\textit{TN}+\textit{FP}}% \times 100\%$ (8) $\displaystyle\text{Sensitivity}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}% \times 100\%$ (9)

$\displaystyle\text{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}% \times 100\%$ (10)

$\displaystyle\text{Fscore}=\frac{2*\text{preision}*\text{Sensitivity}}{\text{% preision}+\text{Sensitivity}}\times 100\%$ (11)

$\displaystyle T-\text{test}=\frac{x_{1}-x_{2}}{s_{p}\cdot\sqrt{\frac{1}{n_{1}}% +\frac{1}{n_{2}}}}$ (12)

The study provides equations for these metrics where $y_{i,\exp}$ represents the expected or actual value of the target variable, and $y_{i,\text{pred}}$ represents the predicted value. In the context of Android malware detection, $y_{i,\exp}$ refers to the actual class label of an Android app (malware or benign), while $y_{i,\text{pred}}$ refers to the class label predicted by the model. The difference between the actual and predicted values is used to calculate performance metrics, such as MSE, RMSE, R, and $R^{2}$ , which provide insights into the model’s accuracy in detecting Android malware. To be more precise, true positive (TP) signifies the number of positive sentiment samples that were correctly classified, while false positive (FP) denotes the number of samples that were mistakenly classified as having a positive sentiment. Likewise, true negative (TN) refers to the number of negative sentiment samples that were accurately classified, whereas false negative (FN) indicates the number of samples that were misclassified as having a negative sentiment.

3.5 Detection of malware using the proposed architecture

A combination of four DL-based algorithms, including CNNs, AE, DBN, and DNDF, resulted in a powerful ML-based model capable of detecting Android malware. The use of multiple algorithms allowed for a comprehensive analysis of the malware data, providing a more accurate and robust detection system. This approach highlights the potential of combining different DL-based techniques to improve the performance of ML-based models. The process began with tuning each DL-based algorithm,individually on the dataset. For CNN, we used it to extract features from an Android app’s raw bytecode and its static structure, such as the manifest file and resource files. The AE and DBN were used to perform unsupervised feature learning to capture patterns in the data indicative of malware. The DNDF was used to learn the relationship between the features and the presence of malware. Finally, the trained models were used to classify new, unseen apps as either benign or malicious based on the extracted features and learned dependencies. The proposed approach offers a comprehensive solution for detecting Android malware by leveraging the power of DL. The model is capable of extracting spatial as well as temporal features from the data, which enables it to identify patterns indicative of malware presence. This not only enhances the accuracy of the predictions but also helps in gaining a better understanding of the underlying factors contributing to malware development. The model was trained using two datasets, and to optimize its hyperparameters, the Adam optimizer and a validation dataset were utilized. The DL library Keras [40], written in Python, was utilized in our study.

CNN model’s architecture comprises of two 1D convolution layers with 32 filters and a kernel size of 4. The model also includes a global average pooling layer that performs spatial average pooling over the entire feature map. To avoid overfitting, a dropout layer with a rate of 0.2 is added after the second convolution layer. The fully connected layer with 256 hidden neurons follows the convolutional layers, and a ReLU activation function is applied. RMSprop optimizer is used with a learning rate of 0.001, and the model is trained for 100 epochs with a batch size of 32 and 100 steps per epoch. Additionally, a weight decay of 0.00005 is applied to the model, and the momentum is set to 0.9. Finally, the Adam optimizer is used to optimize the cross-entropy loss function.

AE architecture utilized in this research project includes an input layer with a shape determined by the number of features, followed by an encoding dimension of 32. The encoder layers consist of two dense layers with a ReLU activation function, and the decoder layers also include two dense layers with ReLU activation. The optimizer used for training is SGD, with a learning rate of 0.001. The model is trained for 100 epochs with a batch size of 32 and 100 steps per epoch. The loss function used is Binary Crossentropy, and the classification layer includes a dense layer with a sigmoid activation function. Overall, this AE model is optimized for efficient feature encoding and reconstruction while minimizing the loss function.

DBN architecture consists of an input layer with 64 neurons and two hidden layers, each with 64 neurons and ReLU activation. A dropout rate of 20% is applied to the hidden layers to prevent overfitting. The fully connected layer also has a ReLU activation function. The Adam optimizer is used with a learning rate of 0.001 and weight decay of 0.00005. The model is trained for 100 epochs with a batch size of 32 and 100 steps per epoch. The optimization function is Adam with a momentum of 0.9.

Table 2
Comparative analysis of experimental parameter using for, CNN AE, DBN and DNDF training

Parameters	CNN	AE	DBN	DNDF
Kernel size	4	–	–	–
Global average pooling	1D	–	–	–
Dropout	0.2	0.2	0.2	0.5
Fully connected layer	Dense	Dense	Dense	Dense
Activation function	Relu	Relu	Relu	Relu
Optimizer	RMSProp	SGD	Adam	–
Learning rate	0.001	0.001	0.001	0.001
Epochs	100	100	100	100
Batch size	32	32	32	32
Steps per epochs	100	100	100	100
Weight decay	0.00005	0.00005	0.00005	0.00005
Momentum	0.9	0.9	0.9	0.9
Input layer	–	Shape (features,)	64 neurons	–
Encoding dimension	–	32	–	–
Encoder layers	–	2 Dense layers, ReLU	–	–
Decoder layers	–	2 Dense layers, ReLU	–	–
Optimize function	Adam	SGD	Adam	Adam
Loss function	–	Binary crossentropy	–	–
Classification layer	–	Dense layer, sigmoid	–	–
Number of decision trees	–	–	–	10
Depth of decision trees	–	–	–	5
Number of neurons in decision tree layers	–	–	–	32
Output activation function	Sigmoid	Sigmoid	Sigmoid	Sigmoid

DNDF model consists of 10 decision trees, each with a depth of 5. The decision tree layers have 32 neurons each, with a dropout rate of 0.5 to prevent overfitting. The ReLU activation function is utilized in the fully connected layer, while the sigmoid activation function is applied in the output layer. The optimization of the model involves using the Adam optimizer, with a learning rate set to 0.001, a batch size of 32, and a total of 100 epochs, each consisting of 100 steps. The parameters of all the DL Based models are displayed in Table 2.

4. Implementation and results

Table 3
For the proposed model to achieve optimal performance, certain environmental prerequisites must be met

Hardware	Item	Configuration
	GPU	Nvidia GeForce RTX 4070 Ti
	RAM	16 GB
	Clock speed	3.50 GHz
	Processor	AMD ryzen 5 7600X
Software	Item	Version
	Matplotlib	3.4
	Keras	2.11
	TensorFlow	2.11
	Numpy version	1.24
	Python	3.7
	Operating system	Windows 11

The performance of DL-based models was assessed on standard Android malware datasets using the Keras and TensorFlow libraries of Python. The models were subjected to statistical analysis to evaluate their impact, and the resulting data were analyzed accordingly. Table 3 displays the platform that was employed for detecting malware in Android apps. The effectiveness of the malware detection algorithms was assessed using standardized datasets of mobile malware. The Drebin dataset comprises 15,036 Android apps that belong to 179 distinct malware families, whereas the TUNADROMD dataset contains 4468 apps from 25 different malware families, as indicated in Table 1. For this study, both datasets were split into three subsets, with 80% allocated for training, 10% for testing, and 10% for validation. A random function was employed to separate the training and testing data, which were used to train the models in the fitting phase. During the subsequent testing phase, the proposed models underwent evaluation for accuracy, precision, recall and AUC using new data. The primary objective of this evaluation was to verify and establish the efficacy of these models in effectively detecting Android malware.

4.1 Evaluating the effectiveness of the proposed algorithms through performance analysis

In this study, we introduce four DL-based algorithms, namely Convolutional Neural Network (CNN), Autoencoder (AE), Deep Belief Neural Network (DBN), and Deep Neural Decision Forests (DNDF), that are utilized for the purpose of detecting and categorizing malicious Android apps.

Table 4
The table depicts the performance assessment and resource consumption of all proposed classifiers, including, accuracy (ACC), precision (PRE), recall (REC) training time (TT) in seconds, memory usage (MMU) in MB

Proposed algorithms	Dataset	ACC	PRE	REC	TT	MMU
CNN	Drebin	0.986	0.980	0.983	1.000	633.97
AE	Drebin	0.987	0.981	0.983	1.001	732.016
DBN	Drebin	0.990	0.986	0.987	1.001	625.687
DNDF	Drebin	0.989	0.994	0.977	1.000	645.925
CNN	TUANDROMD	0.986	0.989	0.994	1.001	578.35
AE	TUANDROMD	0.986	0.985	0.997	1.001	575.809
DBN	TUANDROMD	0.982	0.988	0.988	1.000	680.660
DNDF	TUANDROMD	0.984	0.988	0.991	1.001	578.929

The CNN approach demonstrated its highly effective performance in detecting malicious apps on two datasets. The model’s performance is summarized in Table 4. After 100 iterations, the CNN model exhibited exceptional performance, achieving high training and validation accuracies. The model attained training and validation accuracies of 0.997 and 0.986 on the Drebin dataset. Similarly, on the TUANDROMD dataset, the model achieved training and validation accuracies of 0.996 and 0.991. Additionally, the model achieved low training and validation losses, with a training loss of 0.011 on the Drebin dataset and 0.014 on the TUANDROMD dataset, and a validation loss of 0.056 on the Drebin dataset and 0.018 on the TUANDROMD dataset. These results indicate that the model effectively learned the patterns in the training data and demonstrated the ability to generalize to unseen data with satisfactory accuracy. To avoid overfitting, it is essential to maintain a reasonable generalization gap between the learning and validation accuracies and losses. The CNN approach detected malware on both datasets, as indicated by the minimal number of FP and FN. This finding demonstrates the CNN approach’s ability to accurately classify instances of malicious attacks as positive and benign data as negative, with few instances of misclassification. Moreover, the CNN classifiers showed low memory usage, ranging from 578 to 634 MB, and fast execution times, ranging from 1.0 to 1.001 seconds. The outcomes demonstrate that CNN classifiers are highly adept at detecting Android malware and can be trained proficiently on varied datasets.

Figure 2.

Performance comparison of AE algorithm regarding accuracy, cross-entropy loss, and CM, using drebin and TUANDROMD. Panels (a) and (b) show the accuracy, panels (c) and (d) Cross-entropy loss (e) and (f) show the confusion matrix, respectively.

The performance of the AE algorithm in recognising malware was evaluated on two datasets, as depicted in Fig. 2a–f. The algorithm demonstrated exceptional performance on the TUANDROMD dataset, achieving high training and validation accuracies rates of 0.992 and 0.988, respectively, as plotted in Fig. 2a–b. Conversely, it achieved slightly lower accuracy rate on the Drebin dataset, with training and validation accuracy rate of 0.990 and 0.988, respectively. Additionally, the performance analysis revealed that the Drebin dataset exhibited a training loss of 0.018 and a validation loss of 0.043, indicating that the model effectively learned the patterns in the training data and generalized well to unseen data with reasonable accuracy. Similarly, the TUANDROMD dataset demonstrated a training loss of 0.026 and a validation loss of 0.033 as plotted in Fig. 2c–d, suggesting that the model learned the underlying patterns in the training data with high accuracy and successfully generalized to unseen data, as evidenced by the close proximity of the validation and training losses.

The performance of the AE approach was evaluated on both datasets and corresponding confusion matrices were generated, as plotted in Fig. 2e–f. Within the Drebin dataset, the approach successfully detected malware with minimal FP and FN instances. Most of the 540 apps in the dataset were labelled malicious, indicating that the AE classifier identified them as potential threats. Also, 944 instances were classified as TN and recognized as benign apps. In the TUANDROMD dataset, a significant portion of the normal data was classified as TN, accounting for 88 of the dataset’s apps. On the other hand, 352 apps were classified as TP, accurately identifying them as malicious attacks. This indicates that the AE approach effectively distinguished malicious and benign instances with fewer misclassifications. As a result, the AE algorithm exhibited better performance on the TUANDROMD dataset than the Drebin dataset.

Table 4 and Fig. 5a–b reveal high accuracy, precision, and recall scores for the AE classifiers on both datasets, ranging from 0.986 to 0.987 for accuracy, 0.981 to 0.985 for precision, and 0.983 to 0.997 for recall. Regarding resource consumption, the AE classifiers had a higher memory footprint than the CNN classifiers, ranging from 576 to 732 MB. However, the execution time on both datasets was similar to the CNN classifiers, ranging from 1.0 to 1.001 seconds. These findings suggest that AE classifiers effectively detect Android malware but may require more memory resources than CNN classifiers.

Figure 3.

Performance comparison of DBN algorithm regarding accuracy, cross-entropy loss, and CM. Panels (a) and (b) show the accuracy, panels (c) and (d) cross-entropy loss, (e) and (f) show the confusion matrix, respectively.

The proposed DBN approach successfully detected malicious Android apps within the Drebin and TUANDROMD datasets. The model’s performance was extensively evaluated, considering accuracy, cross-entropy loss, and the percentage of correctly classified samples for each epoch, as depicted in Fig. 3a–f. After 100 iterations, the DBN model achieved high training and validation accuracies on both datasets. On the Drebin dataset, the DBN model attained training and validation accuracies of 0.991 and 0.985, respectively. Similarly, on the TUANDROMD dataset, the model achieved training and validation accuracies of 0.992 and 0.984, as evident from Fig. 3a–b. The performance of the DBN model was assessed by analyzing the training and validation losses for both datasets, as shown in Fig. 3c–d. The results indicate that the model effectively learned the underlying patterns in the training data and demonstrated reasonable accuracy when generalizing to unseen data. Specifically, the DBN model achieved a training loss of 0.028 and a validation loss of 0.046 on the Drebin dataset. In contrast, the TUANDROMD dataset attained a training loss of 0.025 and a validation loss of 0.035.

The approach employed in detecting malware within both datasets proved highly effective, resulting in a low number of FP and FN. The confusion matrix of the DBN classifier on both datasets can be seen in Fig. 3e–f. Most of the 571 apps were classified as malicious within the Drebin dataset, while in the TUANDROMD dataset, 341 apps were correctly identified as TP and labelled as malicious. According to our research, the DBN method has demonstrated high accuracy in distinguishing between instances of malicious attacks, labeled as positive, and instances of benign data, labeled as negative, with minimal misclassification occurrences. This suggests that the DBN approach is highly effective in distinguishing between malicious and benign data. The approach successfully detected malware in both datasets, exhibiting low FP and FN. In the Drebin dataset, most of the 571 apps were classified as malicious, while in the TUANDROMD dataset, 341 apps were identified as TP and labelled as malicious. These findings of our study suggest that the DBN method has been proficient in precisely categorizing instances of malicious attacks as positive and instances of benign data as negative, with only a small number of misclassifications. Empirical evidence supports the effectiveness of the fused DBN model in detecting Android malware in both datasets. Figure 3c-d presents the performance metrics of the DBN classifiers on the Drebin and TUANDROMD datasets. The DBN classifiers achieved high accuracy, precision, and recall scores, ranging from 0.982 to 0.990 for accuracy, 0.986 to 0.988 for precision, and 0.987 to 0.988 for recall, as shown in Table 4. The memory usage of the DBN classifiers was moderate, ranging from 626 to 681 MB. Additionally, the execution time for both datasets was relatively fast, with a range of 1.0 to 1.001 seconds. These results suggest that DBN classifiers effectively detect Android malware and can be trained efficiently on diverse datasets.

The DNDF approach exhibits remarkable proficiency in detecting malicious apps within both datasets. A comprehensive evaluation was conducted, considering accuracy, cross-entropy loss, and the percentage of correctly classified samples for each epoch. After 100 iterations, the DNDF model achieved exceptional training and validation accuracies on the Drebin and TUANDROMD datasets. Specifically, it attained accuracies of 0.997 and 0.988 for the Drebin dataset and 0.996 and 0.997 for the TUANDROMD dataset, respectively. The performance of the DNDF model, highlighting its ability to detect malicious apps accurately, is depicted in Fig. 4a–b. The model’s effectiveness was evaluated by analyzing both datasets’ training and validation losses. The DNDF model demonstrated efficient learning of patterns in the training data, enabling it to generalize effectively to unseen data with reasonable accuracy. On the Drebin dataset, the model achieved a training loss of 0.011 and a validation loss of 0.047. Similarly, the TUANDROMD dataset attained a training loss of 0.012 and a validation loss of 0.008. Figure 4c–d depicts the model’s efficient learning and generalization abilities.

Figure 4.

Performance comparison of DNDF algorithm regarding accuracy, cross-entropy loss, and confusion matrices, panels (a) and (b) show the accuracy, panels (c) and (d) loss (e) and (f) show the confusion matrix.

Figure 5.

The plotted graph shows the performance evaluation of the proposed model based on precision, recall, sensitivity, MAE, MSE, RMSE, R2Score, and F1Score (a), (c), (e), and (g) on the Drebin, while (b), (d), (f) and (h), TUANDROMD dataset.

The DNDF approach was further evaluated by analyzing the resulting confusion matrices on both datasets. The approach demonstrated an impressive ability to detect malware in both datasets, with a minimal number of FP and FN, as depicted in Fig. 4e–f. Within the Drebin dataset, most of the 557 apps were classified as malicious, while 930 apps were labelled benign. Besides, in the TUANDROMD dataset, 346 apps were correctly identified as TP and classified as malicious, while only 93 apps were categorized as benign. The DNDF approach successfully detected malicious apps within the Drebin and TUANDROMD datasets, achieving high training and validation accuracies. Table 4 and Fig. 5e–f confirms the strong performance of DNDF classifiers on both datasets, with high accuracy, precision, and recall scores. Regarding resource consumption, DNDF classifiers exhibited a relatively low memory footprint, ranging from 578 to 645 MB. The execution time for both datasets was also relatively fast, ranging from 1.0 to 1.001 seconds. These results suggest that DNDF classifiers effectively detect Android malware and can be efficiently trained on various datasets. Figure 6provides a comprehensive overview of the resource consumption for all four classifiers, including both time and memory. Additionally, Fig. 5g–h illustrates the comparison of ROC curves for all proposed classifiers on both datasets. Based on the ROC curve values, it is evident that all four classifiers CNN, AE, DBN, and DNDF achieved high performance on both datasets. On the Drebin dataset, all four classifiers achieved a high ROC score of 0.99, indicating a very high TP rate while keeping the FP rate low. However, on the TUANDROMD dataset, the CNN classifier achieved an ROC score of 0.99, AE achieved 0.98, and DNDF also achieved 0.99. The DBN classifier achieved a slightly lower ROC score of 0.96 on the TUANDROMD dataset.

4.2 Sensitivity analysis comparison for various proposed DCNN-based architectures

In the Android mobile malware analysis field, sensitivity analysis is a statistical approach that measures the impact of uncertainties in input data variables on the accuracy of malware detection models. This analysis helps researchers and analysts identify the most significant features that affect the detection model’s performance and optimize the model’s parameters and configuration. One way to evaluate how well a model performs is by measuring how each input variable affects the model’s output. Statistical measures such as MAE, MSE, RMSE, sensitivity, Area Under the Curve (AUC), and $R^{2}$ can be used to quantify the prediction error between the predicted values and the target class. By using these metrics, we can assess the accuracy of the model’s predictions and determine how well it performs overall. This process can improve the reliability and effectiveness of the malware detection process and facilitate the development of more robust and accurate malware detection systems. Table 5 presents the results of the prediction error obtained by applying DL-based algorithms.

The performance of a CNN classifier was evaluated on two datasets: Drebin and TUANDROMD. For Drebin, the classifier achieved an MAE of 0.0141, MSE of 0.0110, RMSE of 0.1050, and an $R^{2}$ score of 0.0110. It also had an AUC of 0.99 and a sensitivity of 0.9979. In contrast, For TUANDROMD, the classifier performed slightly better, with an MAE of 0.0133, MSE of 0.0085, RMSE of 0.0926, $R^{2}$ of 0.0460, and an AUC of 0.99. The sensitivity was perfect at 1.0, indicating that the classifier correctly identified all positive samples. Consequently, the CNN classifier performed well in both datasets, achieving high AUC values and sensitivity.

Table 5
Comparison of MAE, MSE, RMSE, $R^{2}$ score, AUC, and sensitivity scores for various models

Model	Dataset	MAE	MSE	RMSE	$R^{2}$ -Score	AUC	Sensitivity
CNN	Drebin	0.014	0.011	0.105	0.011	0.99	0.997
AE	Drebin	0.021	0.010	0.102	0.044	0.99	0.998
DBN	Drebin	0.017	0.008	0.091	0.035	0.99	0.998
DNDF	Drebin	0.012	0.009	0.099	0.020	0.98	1.0
CNN	TUANDROMD	0.013	0.008	0.092	0.046	0.99	1.0
AE	TUANDROMD	0.021	0.011	0.109	0.097	0.98	1.0
DBN	TUANDROMD	0.023	0.013	0.115	0.069	0.96	1.0
DNDF	TUANDROMD	0.018	0.013	0.117	0.046	0.99	1.0

Figure 6.

Comparison of memory consumption and time performance across four android malware classifiers.

The AE performed well on both datasets, with slightly better results on the Drebin dataset. On the Drebin dataset, the model achieved an MAE of 0.0213, an MSE of 0.0104, and an RMSE of 0.1024, while on the TUANDROMD dataset, it achieved an MAE of 0.0218, an MSE of 0.0119, and an RMSE of 0.1093. The $R^{2}$ was 0.0445 for the Drebin dataset and 0.0972 for the TUANDROMD dataset, indicating that the model could explain some of the variances in the data. The AUC was 0.99 for Drebin and 0.98 for TUANDROMD, suggesting good classification performance and the sensitivity was high in both cases, with a value of 0.9989 for Drebin and 1.0 for TUANDROMD. However, the AE’s AUC and sensitivity scores were lower than the CNN classifier’s, indicating that it may not be as effective in classifying the samples. Despite this, the AE’s performance was still reasonably good on both datasets.

The DBN classifier underwent an assessment of Drebin and TUANDROMD datasets. On Drebin, the model obtained an MAE of 0.0171, MSE of 0.0083, and RMSE of 0.0914. Conversely, on TUANDROMD, the model obtained an MAE of 0.0234, MSE of 0.0133, and RMSE of 0.1155. The $R^{2}$ 0.0350 on Drebin and 0.0698 on TUANDROMD, indicating that the model could explain more data variance than Drebin. On Drebin, the model achieved an AUC of 0.99, indicating high classification performance, and a sensitivity score of 0.9989, indicating that it identified the most positive samples. On TUANDROMD, the model achieved a slightly lower AUC of 0.96, demonstrating high classification performance, and a sensitivity score of 1.0, indicating correctly identifying all positive samples. The DBN classifier performed well on both datasets, with slightly better performance on the drebin dataset. However, the AUC and sensitivity scores were lower than those achieved by the CNN classifier, suggesting that the DBN classifier may not be as effective in classifying the samples.

Table 6

The efficacy of the proposed security system was assessed, with the drebin and other dataset serving as the basis of comparison to existing security systems

Author, Year, Reference	Datasets	Proposed approach	AUC
Jannat et al., 2019, [41]	MalGenome, Kaggle, Androguard	Random forest tree	0.93
Xu et al., 2018, [35]	Google play, VirusShare, MassVet	LSTM	0.97
McLaughlin et al., 2017, [42]	Genome, IntelSecurity, MacAfee, Google play	Deep CNN	0.87
Millar et al., 2022, [43]	Drebin dataset.	CNN	0.91
Qaisar et al., 2021, [44]	Drebin dataset.	CBR, SVM, DT	0.95
Salehi et al., 2019, [45]	Drebin dataset.	Random forest tree	0.96
Koli et al., 218, [46]	Drebin dataset.	DT	0.97
Kabakus et al, 2019, [47]	Drebin dataset.	RF with 1000 decision trees	0.98
Lou et al., 2019, [48]	Drebin dataset.	SVM	0.93
Onwuzurike et al. 2019, [49]	Drebin dataset.	Random forest tree	0.94
Zhang et al. 2019, [50]	Drebin dataset.	Random forest tree	0.96
Meng et al., 2016, [51]	Drebin dataset.	Random forest tree	0.97
Vu et al., 2021, [52]	Drebin dataset.	CNN	0.98
Alkahtani et al., 2022, [53]	Drebin dataset.	CNN-LSTM	0.97
Proposed model	Drebin dataset.	DNDF	0.98
Proposed model	TUANDROMD dataset	DNDF	0.99

The efficacy of the DNDF classifier was evaluated on two datasets, Drebin and TUANDROMD. On Drebin, the purposed model attained an MAE of 0.0128, an MSE of 0.0098, and an RMSE of 0.0990, with an AUC of 0.98 and a sensitivity score of 1.0. The $R^{2}$ 0.0204 suggests that the model accounted for a small amount of the variance in the data. In comparison, On TUANDROMD, the DNDF achieved an MAE of 0.0189, an MSE of 0.0138, and an RMSE of 0.1177, with an AUC of 0.99 and a sensitivity score of 1.0. The $R^{2}$ was 0.0467, indicating that the model could better explain the variance than Drebin. The high AUC and sensitivity scores demonstrate excellent classification performance, making it a promising tool for malware classification. Consequently, the DNDF classifier performed well on both datasets, with slightly better results on TUANDROMD.

5. Discusssion

The study yielded impressive results, showing exceptional performance metrics. The AUC value was a perfect 1.00, indicating a strong ability to differentiate between positive and negative classes, as depicted in Table 5. The accuracy rates achieved were 0.989 and 0.984, accompanied by perfect sensitivity as depicted in Table 4. Furthermore, the training time required was only 1 second, demonstrating efficiency. A comparison with other approaches, as shown in Table 7 revealed that the proposed system outperformed most state-of-the-art methods, despite being a DL model. Although the model described in Li et at. [55] achieved higher accuracy, it employed a multi-layer perceptron (MLP) with an excessively large number of features 93,324. As a consequence, this approach consumed significant computational resources and had a training time of approximately 5 seconds.

In contrast, our proposed model achieved slightly lower accuracy but with the significant advantage of requiring fewer computational resources, resulting in a speedy training of just 1 second. Typically, DL-based algorithms have a higher number of parameters that need to be optimized during training. They often demand more training data compared to traditional ML-based algorithms for achieving good performance. However, our proposed system has demonstrated exceptional performance measures, surpassing state-of-the-art approaches. The results indicate that the proposed classifiers are computationally efficient and have the potential to be deployed in mobile environments. All proposed classifiers showed inference times of less than 1.5 seconds, making them practical for real-time usage on mobile devices. Furthermore, the memory consumption and CPU/GPU utilization remained stable throughout the experiments, indicating efficient utilization of available resources by the classifiers.

Table 7
Comparison of classifier performance in terms of accuracy, time, and computational resources

Reference	Dataset	year	Features	Classifier	ACC	Training time	Memory usage (MB)
[54]	Malgenome, Drebin, McAfee	2018	100–350	–	0.984	–	–
[55]	Drebin and AMD	2019	93,324	MLP	0.997	$\leqslant$ 5s	–
[56]	AMD	2019	27	MLP	0.961	–	–
[57]	Drebin	2018	93,324	RF	0.931	–	–
[2]	Drebin-215	2022	35	RF	0.986	0.7631 $\mu$ s
Proposed	Drebin,	2023	456	DNDF	0.989	1.000 s	645.925
Proposed	TUANDROMD	2023	456	DNDF	0.984	1.000 s	578.929

Table 6 summarizes the performance comparison of various classifiers on different datasets for Android malware detection. Our proposed system, using DNDF classifiers, outperforms existing state-of-the-art approaches in terms of AUC. Additionally, our classifiers exhibit reasonable training times and memory usage. Overall, our system provides an effective and efficient approach for Android malware detection.

6. Conclusion and future work

In this study, a security system was developed and designed, incorporating the use of CNN, AE, DBN, and DNDF algorithms. The research yielded highly promising results, and the following conclusions can be drawn based on these findings. To evaluate the effectiveness of our proposed system and enable comparison with previous studies, we employed the Drebin dataset and the TUANDROMD dataset. The Drebin dataset allowed us to benchmark against earlier research. In contrast, the TUANDROMD dataset provided an advantage in detecting modern threats that employ advanced obfuscation and morphing techniques, rendering traditional methods ineffective. The DNDF model exhibited the best performance on both the Drebin and TUANDROMD datasets, achieving the lowest values for MAE, MSE, and RMSE and the highest $R^{2}$ Score, AUC, and Sensitivity values. It achieved an accuracy of 0.989 on the Drebin dataset and 0.984 on the TUANDROMD dataset while operating within limited computational resources and time. The DBN model demonstrated high accuracy, with a score of 0.990 on the Drebin dataset and 0.982 on the TUANDROMD dataset. The CNN model also showed commendable performance, achieving an accuracy of 0.986 on both datasets. The AE model had slightly lower performance in terms of MAE, MSE, RMSE, and $R^{2}$ Score compared to the other models. However, its AUC and Sensitivity scores were generally comparable to or better than the other models, especially on the TUANDROMD dataset. During the validation phase, the DBN and DNDF models displayed commendable performance on both datasets, indicated by low validation loss and high accuracy, precision, recall, and AUC scores. The CNN model also demonstrated satisfactory performance on the TUANDROMD dataset, with high scores in accuracy, precision, recall, and AUC. The experimental findings revealed that the CNN, DBN, and DNDF classifiers are efficient DL-based algorithms for detecting malware. Among these classifiers, DNDF exhibited the most effective performance, considering various metrics including sensitivity. Additionally, our proposed system demonstrated not only exceptional performance in testing outcomes when compared to existing state-of-the-art approaches, but also produced efficient results in terms of training time and computational resources. Furthermore, the study results were subjected to a comparison analysis with recent research findings, validating the strength and efficacy of our conclusions.

In our future research endeavours, we will prioritize the development of a dependable cloud-based function for updating ML and DL-based models. We also aim to optimize the classifier by reducing its memory and processing requirements. Moreover, our focus will be on enhancing the classifier’s accuracy by integrating additional datasets and classifiers into our approach.

Footnotes

Conflict of interest

The authors declare that they have no conflict of interest.

Funding

The work presented in this paper has been supported by Beijing Natural Science Foundation (No. IS23054). The author would also like to thank Prince Sultan University, Riyadh, Saudi Arabia, and Lasbela university of Agriculture water and marine sciences for their support.

References

Statista. Worldwide Internet-connected Operating System Population from 2014 to 2020; 2020. Accessed: 2023 May 7. Available from: https//www.statista.com/statistics/543185/worldwide-internet-connected-operating-system-population/.

Alani

Awad

. Paired: An explainable lightweight android malware detection system. IEEE Access. 2022; 10: 73214-73228.

Statista. Google Play Store: Number of Apps 2021; 2021. Accessed: 2021 Dec 14. https//www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store.

Kotzias

Caballero

Bilge

. How did that get in my phone? unwanted app distribution on android devices. In: 2021 IEEE symposium on security and privacy (SP). IEEE. 2021; 53-69.

Nazir

Zhu

Wajahat

Ullah

, et al. Advancing IoT security: A systematic review of machine learning approaches for the detection of IoT botnets. Journal of King Saud University-Computer and Information Sciences. 2023; 101820.

Pernet

. Android malware infected more than 300,000 devices with banking trojans. TechRepublic; 2021. [Online]; accessed on 2021 Dec 14. Available from: https//www.techrepublic.com/article/android-malware-infected-more-than-300000-devices-with-banking-trojans.

Alani

. Android users privacy awareness survey. International Journal of Interactive Mobile Technologies. 2017; 11(3).

Arp

Spreitzenbarth

Hubner

Gascon

Rieck

Siemens

. Drebin: Effective and explainable detection of android malware in your pocket. Ndss. 2014; 14: 23-26.

Feizollah

Anuar

Salleh

Wahab

AWA

. A review on feature selection in mobile malware detection. Digital Investigation. 2015; 13: 22-37.

10.

Feng

Anand

Dillig

Aiken

. Apposcopy: Semantics-based detection of android malware through static analysis. Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. 2014; 576-587.

11.

Enck

Ongtang

McDaniel

. On lightweight mobile phone application certification. Proceedings of the 16th ACM conference on Computer and communications security. 2009; 235-245.

12.

Sanz

Santos

Laorden

Ugarte-Pedrero

Bringas

Marañón

GÁ

. Puma: Permission usage to detect malware in android. CISIS/ICEUTE/SOCO Special Sessions. 2012; 189: 289-298.

13.

Aafer

Yin

. Droidapiminer: Mining api-level features for robust malware detection in android. In: Security and Privacy in Communication Networks: 9th International ICST Conference, SecureComm 2013, Sydney, NSW, Australia, September 25–28, 2013, Revised Selected Papers 9. Springer. 2013; 86-103.

14.

Mariconti

Onwuzurike

Andriotis

De Cristofaro

Ross

Stringhini

. Mamadroid: Detecting android malware by building markov chains of behavioral models. arXiv preprint arXiv:161204433. 2016.

15.

Arzt

Rasthofer

Fritz

Bodden

Bartel

Klein

, et al. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. Acm Sigplan Notices. 2014; 49(6): 259-269.

16.

Adjeroh

Iyengar

. A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR). 2017; 50(3): 1-40.

17.

Wajahat

Imran

Latif

Nazir

Bilal

. A novel approach of unprivileged keylogger detection. In: 2019 2nd international conference on computing, mathematics and engineering technologies (iCoMET). IEEE. 2019; 1-6.

18.

Enck

Gilbert

Han

Tendulkar

Chun

Cox

, et al. Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS). 2014; 32(2): 1-29.

19.

Yan

Yin

. Droidscope: seamlessly reconstructing the os and dalvik semantic views for dynamic android malware analysis. USENIX Security Symposium. 2012; 569-584.

20.

Qiu

Zhang

Luo

Pan

Nepal

Xiang

. A survey of android malware detection with deep neural models. ACM Computing Surveys (CSUR). 2020; 53(6): 1-36.

21.

Tam

Feizollah

Anuar

Salleh

Cavallaro

. The evolution of android malware and android analysis techniques. ACM Computing Surveys (CSUR). 2017; 49(4): 1-41.

22.

Mahmood

Mirzaei

Malek

. Evodroid: Segmented evolutionary testing of android apps. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 2014; 599-609.

23.

Vidas

Tan

Nahata

Tan

Christin

Tague

. A5: Automated analysis of adversarial android applications. Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices. 2014; 39-50.

24.

An adaptive semi-supervised deep learning-based framework for the detection of Android malware. Journal of Intelligent & Fuzzy Systems. 2023; 45(3): 5141-5157.

25.

Qureshi

Zhu

Zardari

Mahmood

, et al. SDN-enabled Deep Learning based Detection Mechanism (DDM) to tackle DDoS attacks in IoTs. Journal of Intelligent & Fuzzy Systems. 2023; 44(6): 10675-10687.

26.

Jamrozik

Zeller

. Droidmate: a robust and extensible test generator for android. Proceedings of the International Conference on Mobile Software Engineering and Systems. 2016; 293-294.

27.

Kang

Yerima

McLaughlin

Sezer

. N-opcode analysis for android malware classification and categorization. In: 2016 International conference on cyber security and protection of digital services (cyber security). IEEE. 2016; 1-7.

28.

Atici

Sagiroglu

Dogru

. Android malware analysis approach based on control flow graphs and machine learning algorithms. In: 2016 4th International Symposium on Digital Forensic and Security (ISDFS). IEEE. 2016; 26-31.

29.

Zhang

Duan

Yin

Zhao

. Semantics-aware android malware classification using weighted contextual api dependency graphs. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 2014; 1105-1116.

30.

Qureshi

Tunio

Zhu

Akhtar

Ullah

, et al. A hybrid DL-based detection mechanism for cyber threats in secure networks. IEEE Access. 2021; 9: 73938-73947.

31.

Rastogi

Chen

Enck

. Appsplayground: automatic security analysis of smartphone applications. Proceedings of the Third ACM Conference on Data and Application Security and Privacy. 2013; 209-220.

32.

Rasthofer

Arzt

Miltenberger

Bodden

. Harvesting runtime values in android applications that feature anti-analysis techniques. NDSS. 2016.

33.

Zhang

Chan

Jaitly

. Very deep convolutional networks for end-to-end speech recognition. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2017; 4845-4849.

34.

Wang

Cai

Cheng

. DroidDeepLearner: Identifying android malware using deep learning. In: 2016 IEEE 37th sarnoff symposium. IEEE. 2016; 160-165.

35.

Deng

Chen

. Deeprefiner: Multi-layer android malware detection system applying deep neural networks. In: 2018 IEEE european symposium on security and privacy (EuroS&P). IEEE. 2018; 473-487.

36.

Qamar

Karim

Chang

. Mobile malware attacks: Review, taxonomy & future directions. Future Generation Computer Systems. 2019; 97: 887-909.

37.

Aslan

ÖA

Samet

. A comprehensive review on malware detection approaches. IEEE Access. 2020; 8: 6249-6271.

38.

Pan

Fang

Fan

. A systematic literature review of android malware detection using static analysis. IEEE Access. 2020; 8: 116363-116379.

39.

Borah

Bhattacharyya

Kalita

. Malware dataset generation and evaluation. In: 2020 IEEE 4th conference on information & communication technology (CICT). IEEE. 2020; 1-6.

40.

Chollet

, et al. Keras. GitHub; 2015. Available from: https://github.com/fchollet/keras.

41.

Jannat

Hasnayeen

Shuhan

MKB

Ferdous

. Analysis and detection of malware in Android applications using machine learning. In: 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE). IEEE. 2019; 1-7.

42.

Millar

McLaughlin

del Rincon

Miller

. Multi-view deep learning for zero-day Android malware detection. Journal of Information Security and Applications. 2021; 58: 102718.

43.

Millar

McLaughlin

Rincon

JMD

Miller

. Android malware detection using deep learning. Artificial Intelligence and Cybersecurity: Theory and Applications. Springer. 2022; 209-246.

44.

Qaisar

. Multimodal information fusion for android malware detection using lazy learning. Multimedia Tools and Applications. 2022; 1-15.

45.

Salehi

Amini

Crispo

. Detecting malicious applications using system services request behavior. Proceedings of the 16th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. 2019; 200-209.

46.

Koli

. RanDroid: Android malware detection using random machine learning classifiers. In: 2018 technologies for smart-city energy security and power (ICSESP). IEEE. 2018; 1-6.

47.

Kabakus

. What static analysis can utmost offer for Android malware detection. Information Technology and Control. 2019; 48(2): 235-249.

48.

Lou

Cheng

Huang

Jiang

. TFDroid: Android malware detection by topics and sensitive data flows using machine learning techniques. In: 2019 IEEE 2Nd international conference on information and computer technologies (ICICT). IEEE. 2019; 30-36.

49.

Onwuzurike

Mariconti

Andriotis

Cristofaro

Ross

Stringhini

. Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version). ACM Transactions on Privacy and Security (TOPS). 2019; 22(2): 1-34.

50.

Zhang

Luo

Zhang

Pan

. An efficient Android malware detection system based on method-level behavioral semantic analysis. IEEE Access. 2019; 7: 69246-69256.

51.

Meng

Xue

Liu

Zhang

Narayanan

. Semantic modelling of android malware for effective malware comprehension, detection, and classification. Proceedings of the 25th International Symposium on Software Testing and Analysis. 2016; 306-317.

52.

Jung

. AdMat: A CNN-on-matrix approach to Android malware detection and classification. IEEE Access. 2021; 9: 39680-39694.

53.

Alkahtani

Aldhyani

. Artificial intelligence algorithms for malware detection in android-operated mobile devices. Sensors. 2022; 22(6): 2268.

54.

Yerima

Sezer

. Droidfusion: A novel multilevel classifier fusion approach for android malware detection. IEEE transactions on cybernetics. 2018; 49(2): 453-466.

55.

Mills

Niu

Zhu

Zhang

Kinawi

. Android malware detection based on factorization machine. IEEE Access. 2019; 7: 184008-184019.

56.

Yildiz

Doğru

. Permission-based android malware detection system using feature selection with genetic algorithm. International Journal of Software Engineering and Knowledge Engineering. 2019; 29(02): 245-262.

57.

Chen

Mao

Yang

Zhu

, et al. Tinydroid: a lightweight and efficient model for android malware detection and classification. Mobile information systems. 2018; 2018.

An effective deep learning scheme for android malware detection leveraging performance metrics and computational resources

Abstract

Keywords

1. Introduction

2.1 Background

2.2 Conventional approaches to detecting malware on android

2.3 Android malware detection using ML-based algorithms

2.4 Android malware detection using cutting edge algorithms

3. Methodology and experiments

3.1 Datasets description

3.1.1 Drebin dataset

3.1.2 TUANDROMD dataset

Table 1 The volume of datasets

Table 2 Comparative analysis of experimental parameter using for, CNN AE, DBN and DNDF training

Table 3 For the proposed model to achieve optimal performance, certain environmental prerequisites must be met

Table 4 The table depicts the performance assessment and resource consumption of all proposed classifiers, including, accuracy (ACC), precision (PRE), recall (REC) training time (TT) in seconds, memory usage (MMU) in MB

Table 5 Comparison of MAE, MSE, RMSE, R 2 score, AUC, and sensitivity scores for various models

Table 7 Comparison of classifier performance in terms of accuracy, time, and computational resources

Footnotes

Conflict of interest

Funding

References

Table 1
The volume of datasets

Table 2
Comparative analysis of experimental parameter using for, CNN AE, DBN and DNDF training

Table 3
For the proposed model to achieve optimal performance, certain environmental prerequisites must be met

Table 4
The table depicts the performance assessment and resource consumption of all proposed classifiers, including, accuracy (ACC), precision (PRE), recall (REC) training time (TT) in seconds, memory usage (MMU) in MB

Table 5
Comparison of MAE, MSE, RMSE, $R^{2}$ score, AUC, and sensitivity scores for various models

Table 7
Comparison of classifier performance in terms of accuracy, time, and computational resources