Abstract
In the era of ubiquitous sensors and smart devices, detecting malware is becoming an endless battle between ever-evolving malware and antivirus programs that need to process ever-increasing security related data. For malware detection, various approaches have been proposed. Among them, dynamic analysis is known to be effective in terms of providing behavioral information. As malware authors increasingly use obfuscation techniques, it becomes more important to monitor how malware behaves for its detection. In this paper, we propose a novel approach for dynamic analysis of malware. We adopt DNA sequence alignment algorithms and extract common API call sequence patterns of malicious function from malware in different categories. We find that certain malicious functions are commonly included in malware even in different categories. From checking the existence of certain functions or API call sequence patterns matched, we can even detect new unknown malware. The result of our experiment shows high enough F-measure and accuracy. API call sequence can be extracted from most of the modern devices; therefore, we believe that our method can detect the malware for all types of the ubiquitous devices.
1. Introduction
Nowadays, power-saving techniques and enhanced computing power allow us to use sensors as multifunctional devices. They can perform complex tasks and are connected to the Internet all the time. They are now playing a major role in sensing and generating big data in IoT (Internet of Things) era. Besides, the increased use of the mobile smart devices also allows us powerful ubiquitous computing. Ubiquitous sensors and smart devices are becoming critical source of data. However, as their operating system and software are almost same to the traditional Windows based system or UNIX based system, traditional malware and exploit can work on these small smart sensors and devices, the use of which is exponentially increased. Therefore, it becomes critical to manage ever-evolving malware and related security risks in the era of ubiquitous sensors and smart devices.
Various approaches have been proposed for malware detection [1–3]. Detection techniques proposed earlier were based on static analysis. Static analysis examines the binary code, analyzes all possible execution paths, and identifies malicious code without execution [4]. However, analyzing binary code turns out to be difficult nowadays. As obfuscation techniques become more sophisticated, static analysis can be bypassed by various obfuscation techniques, such as polymorphism, encryption, or packing [4]. In addition, as static analysis relies on a prebuilt signature database, it cannot easily detect new unknown malware until the signature is updated [4, 5]. Besides, some execution paths can be only explored after execution [4, 5]. To overcome these limitations of static analysis and complement it, dynamic analysis has been proposed and is widely used to achieve more effective malware detection.
Techniques based on dynamic analysis execute malware and trace its behaviors. Two major approaches in dynamic analysis are control flow analysis and API call analysis. Both approaches detect malware based on analysis of similarity between the behavior of the new and the known ones. However, malware authors try to circumvent those techniques through inserting meaningless codes or shuffling the sequence of programs.
Many of currently available API call analysis techniques fail to detect malware of such circumvention. Some techniques focus on extracting APIs that are frequently observed in malware in each class [6, 7]. They monitor APIs that are called and calculate the frequency and total number of events that certain API function called. Even though they quickly reveal the characteristics of malware in the same class, they fail to show the sequence of malware behavior and can be easily evaded by malware authors’ inserting and executing dummy and redundant API calls.
Others extract API call sequence for each class and develop static signature based on it [8–11]. They are better from the semantic view because they monitor the sequence of calls and the flow of programs. However, simply creating signatures from the extracting frequently found that call sequence for malware in each class does not allow them to detect malware in polymorphic or unknown form. It can be also evaded by malware authors’ evading tricks such as inserting redundant API calls. This incurs the need for new approaches in API call sequence analysis.
Recent few studies focus on the fact that unless the main purpose or functions of the malware are not changed, the critical low-level system call sequence does not change. Therefore, instead of extracting API call sequence for malware in each class, they propose to focus on API call sequence for certain functions of malware [12–14]. However, such approach has not been empirically well studied comparing to the API call sequence analysis techniques proposed previously. In this study, with a large set of data, we empirically study whether such approach generates superior results comparing with the previous ones.
In this study, we adopt sequence alignment algorithm which is known to perform well in extracting the similar subsequences from the different sequences. Sequence alignment algorithm will make us less confused by meaningless codes inserted in malware in its detection. Sequence alignment algorithms have been applied in various areas such as natural language processing and biometrics and have proven their excellence [15]. In this paper, we propose a new approach in API call sequence analysis with introducing sequence alignment algorithm. The rest of the paper is organized as follows. In Section 2, we review the related literature. In Section 3, we present our methodology and experiment. In Section 4, we conclude our research and suggest future research direction.
2. Literature Review
2.1. Malware Analysis
Malware analysis has made its advances as shown in Figure 1. Signature based detection was proposed in its early stage. In this stage, automatic generation of malware's signatures as much as possible was assumed to be important and this increased pattern matching speed [16, 17].

Chronicles of advances in malware analysis.
However, the signature based detection method shows following weaknesses. It requires continuous updates of signature and high maintenance cost. In addition, such method could be easily evaded by malware in polymorphic form [5]. To overcome the weakness, it embraces code normalization to capture the canonized original maliciousness [18]. Through this, capturing malicious codes that vary by polymorphic techniques applied becomes possible. However, it is still weak in detecting obfuscated malware. Besides, some execution paths could be only explored after execution [4, 5].
Malware analysis technique kept its advance due to certain needs; hence, dynamic analysis was proposed. Dynamic analysis methods are known to perform well for obfuscated malware [3]. Dynamic analysis executes malware, monitors how it behaves, and detects unknown malware that shows similar behavior to the known ones [3]. Two major dynamic analysis methods that are well known are control flow analysis and API call analysis [19, 20].
API call information shows how malware behaves. API call information can be extracted by both static and dynamic approaches. With static approach [6, 10, 11, 21], API list can be extracted from PE format of the executable files. With dynamic approach [7–9, 22–24], the APIs that are called can be observed by running the executable files (usually run on virtual machine).
There are two major ways to analyze the API call information gathered through the static approach. The first one applies simple statistical analysis, for example, counting the frequency of the called APIs, which can be used as a feature for classifying malware [6]. The second approach applies data mining or machine learning techniques to the collected API call information [21].
On the other hand, API call sequence information collected through the dynamic approach can be used for creating behavioral patterns. The information gathered through the dynamic approach can also be processed using simple statistics such as frequency counting [7] and data mining or machine learning [8, 22, 24].
Besides, researchers find other ways to process API call sequence information. Some previous studies applied API call graph [13]. There are many variations of call graph analysis. In [25], researchers adopted social network analysis methods to find meaningful features for call graph analysis. In [9], researchers calculate the similarity between API call sequences based on cosine similarity function and extended Jaccard measure. Recent works [19, 20, 26, 27] use additional information such as control flow information and API argument information to increase the accuracy in the mining process.
In Table 1, we summarize the previous empirical studies on API call analysis and their overall performances. We also compare them with our approach. In this study, we adopted dynamic method to extract the API call sequences. To get patterns that are rigor, we applied DNA sequence alignment algorithms (MSA and LCS). By using both extracted API call sequence patterns and critical API call sequences, we can detect the unknown malware or variants with high accuracy.
Comparison of previous researches.
2.2. DNA Sequence Alignment
Sequence alignment algorithms are most widely used in the area of biometrics to calculate the similarity between two or more DNA sequences or to find certain DNA subsequences from full DNA. Sequence alignment algorithms fall into two categories, which are global and local alignment [15]. Global alignment aligns entire sequence. It is useful when we attempt to find the sequence of most similar length and strings. The well-known global alignment algorithm is the Needleman-Wunsch algorithm [28]. Local alignment finds the highly similar subsequences between given sequences [15]. The Smith-Waterman algorithm is the well-known local alignment algorithm [29].
Pairwise sequence alignment methods find global alignments or the locally similar subsequences of two sequences. Multiple sequence alignment algorithm extends pairwise alignment to handle multiple sequences at the same time [15]. Since we need to handle multiple API call sequences, we applied multiple sequence alignment algorithm in this study.
3. Methodology
3.1. Environment Setup for Dynamic Analysis
For our experiment, we set up a virtual environment to run malicious programs. To trace API call sequence in runtime, we hook the program. In this study, we used the Detours hooking library [30] supported by Microsoft to trace API call sequences from Windows executable programs.
The Detours hooking process proceeds as shown in Figure 2 [30]. Unlike the original process without Detours, the hooking process goes through the Detour function and the Trampoline function. Before the target function starts, the Detour function leaves the log of the target function's name. After the Detour function ends, the Trampoline function that saves the start address of the target function begins. After the target function finishes, the Detours function also can check the end of target function. This allows us to trace API call sequence.

Hooking process of Detours.
For experiment, we set up a virtual machine environment to run and monitor any suspicious executable programs’ behavior. VirtualBox [31] was used to execute malware and observe its activity. We adopted 32-bit Windows XP Service Pack 3 as the virtual machine's operating system because most malware easily runs on a Windows XP environment. In addition, we set the maximum monitoring period as two minutes for the default value to trace each API call sequence. This monitoring period is configurable.
For DNA sequence alignment, we used the ClustalX [32]. ClustalX is widely used freeware in genome sequence analysis, such as DNA, RNA, or protein sequences. The program supports multiple sequence alignment (MSA) [33] and also provides visualization.
Figure 3 shows the result of sequence alignment by ClustalX.

Visualization of API call sequences of Trojan.PSW.Win32.Tepfer. (a) Visualization of API call sequence before applying multiple sequence alignment. (b) Visualization of API call sequences after applying multiple sequence alignment.
Figure 3(a) presents the API call sequences found in some malware samples in the same class and Figure 3(b) shows the result after applying MSA. This shows that, for those given malware samples, malware in the same family shares much common call subsequences. In Figure 3(b), each horizontal line represents an API call sequence of each malware, and the vertical line shows the common API call subsequence among malware in the same class. The API subsequence is colored according to the related functionality.
3.2. Dataset Preparation
To create a dataset, we chose 23,080 malware samples randomly from the malware dataset of the Malicia-project [34] and VirusTotal [35].
We share our dataset used in this paper online. Table 2 shows simple statistics about our dataset. The whole dataset (including malware, benign software, and their call sequence list) is accessible from the URL (http://ocslab.hksecurity.net/apimds-dataset).
Dataset description.
3.3. Limitation of Using Antivirus Program's Labeling Information
As we mentioned earlier in Sections 1 and 2, many previous studies on API call analysis classified malware based on the antivirus program vendors’ labeling. However, such labeling can lead to an error as shown in Figure 4.

An example of MSA result of Trojan-FakeAV.Win32.Security Shield.
Figure 4 shows that malware in the same label group can vary in terms of the API call sequence. Malware listed in Figure 4 is labeled as Trojan-FakeAV.Win32.Security Shield. For ease of explanation, we divide malware into groups (a), (b), and (c). While malware in groups (a) and (b) shows a similar API call sequence pattern within the group, we can only observe partial similarity between groups (a) and (b). Malware in group (c) shows a large difference in the API call sequence within the group. Group (c) also differs significantly from groups (a) and (b) in the API call sequence pattern.
As described above, malware in the same class can have quite different call sequences. On the other hand, malware in different classes can have common call sequences. For example, in our dataset, the three malware classes, “HEUR:Trojan. Win32.Generic,” “Trojan.Win32. FakeAV.lhiz,” and “Packed.Win32. Katusha.o,” show high similarity among them.
3.4. Categorization of APIs
In our dataset, 2,727 kinds of API were found; we categorized them into 26 groups according to MSDN library [36]. Table 3 shows an example of API classification. For example, CallNextHook API and SetWindowsHook API are categorized as class A, the hooking functions. Likewise, DeviceIoControl API and DvdLauncher API are categorized as class Z, the device control. If there exists an API call sequence such as [CallNextHook, DeleteFile, DeviceIoControl], the categorized API call sequence becomes [A, B, Z].
Description, example, and the number of APIs by API class.
3.5. Extraction of Longest Common Subsequence
To extract the common API call sequence pattern among malware, the longest common subsequences (LCSs) [37] were used. The formula is shown in (1). In the formula,
The LCS of malware can contain the same API call sequences that exist in benign programs. This occurs when malware uses generic APIs commonly used in any programs (e.g., API calls used in GUI). To reduce this kind of false positive error, we excluded these typical LCSs from our signature database.
3.6. Critical API Call Sequences of Known Malicious Activities
To improve the usefulness of API call sequences as features for detecting malware more accurately, we employed an additional feature. We assumed that there exists a specific API call sequence pattern that matches specific malicious activity. If such an API call sequence pattern exists, we can detect unknown attacks by checking whether a critical API call sequence pattern exists. To verify the existence of a critical API call sequence pattern of malware, we created profiles of the API call sequences of known attacks, which can be further used for unknown attack detection.
Examples of critical API call sequence patterns found in our sample malware are shown in Table 4. It should be noted that round brackets represent the OR relation. For example, in the case of IAT hooking, one possible critical API call sequence pattern is [LoadLibrary, strcmp, VirtualProtect]. In this case, strcmp can be replaced with strncmp and it becomes another possible critical API call sequence pattern [LoadLibrary, strncmp, VirtualProtect].
Malicious activities and their critical API call sequence pattern.
The following descriptions explain how certain malicious activities are carried out using the critical API call sequence pattern found. DLL injection: various methods for injecting Dynamic Link Library (DLL) into a target process exist, the most popular of which is to use CreateRemoteThread API, introduced by Jeffery Ritcher in the 1990s [38]. First, the method gets a handle of the target process by using OpenProcess. Then, it allocates some space in the target process’ memory using VirtualAllocateEx and writes the DLL's name, including full path, to the allocated memory by using WriteProcessMemory. Finally, it makes the target process reload the DLL using CreateRemoteThread. IAT hooking: the Import Address Table (IAT) contains the API's start address. By modifying the address, a different API can be called although the legitimate API is called. The process is as follows. First, it loads the target library using LoadLibrary. After finding the DLL from the IAT by comparing related APIs (e.g., strcmp), it modifies an attribute of the memory to be writable by using VirtualProtect and changes the address of the DLL. Antidebugging: there are many methods of detecting a debugging process using the API. For example, if the return value of the IsDebuggerPresent API is 1, this means that the program is being debugged. Antidebugging itself may not be a malicious activity, but it is often observed in malware. Screen capture: backdoor often captures the screen and saves it as an image file. The process is as follows. First, it gets a handle of the window using an API, such as GetDC. Then, it creates a compatible space for saving the image using CreateCompatibleDC and CreateCompatibleBitmap. After choosing the image's pointer using SelectObject, it copies the image to the memory space using BitBlt. Finally, it writes the captured image as a file using WriteFile.
We checked whether such critical API call sequence patterns and related malicious activities found in malware distinguish malware from benign programs. As shown in Table 5, we can observe that DLL injection, IAT hooking, and screen capture activities and their related API call sequences are found only in malware. However, antidebugging activity was found both in malware and benign programs. This activity was detected in malware much more frequently. This result shows the existence of unique behaviors of malware and their related critical API call sequence. We believe that this behavior-based malware classification can be effective in the dynamic analysis of API call sequences.
Scanning result of malicious behavior.
3.7. Overall Process
To summarize, our overall process is described in Figure 5. In the creating signature process, the API call sequence of malware is extracted and stored in the signature database. MSA is applied to extract the LCS. The preanalyzed critical API call sequence is also stored in the signature database.

Overall process of our methodology.
In the analysis process, the extracted API call sequence of a newly inserted program is compared with the API signatures in the database. If any LCS is matched with the signatures, the newly inserted program is classified as malware. In addition, when the newly inserted program has a matched critical API call sequence, the detected functionalities are reported as well.
3.8. Prototype Implementation
We implemented our malware detection system, API-based malware detection system (APIMDS) based on API call sequence analysis. As can be seen in Figure 6, when a new program needs to be traced, the hooking process monitors and tracks the program's API call sequences. After extracting the API call sequences from the program, the system compares them with our API call sequence database of APIMDS. If matched, APIMDS alerts the security administrator.

Overview of the proposed malware detection system, APIMDS.
3.9. Accuracy Test
We used 70% of the malware and benign programs to train the process and tested the accuracy using the remaining 30%. The test results are shown in Table 6. There is no false positive because the benign LCS sequences were excluded, as explained in Section 3.5.
Confusion matrix of classification result.
FPR (False Positive Ratio) = FP/(FP + TN) = 0.
FNR (False Negative Ratio) = FN/(FN + TP) = 0.0011.
Recall = TP/(TP + FN) = 0.998.
Precision = TP/(TP + FP) = 1.
To summarize, the precision is 1, and recall is 0.998. The
3.10. Limitations and Future Works
DNA sequence alignment algorithms consume much resources and time. MSA is known as NP-Complete, and computing LCS is known as NP-hard problem. Therefore, the computing cost of MSA and LCS is not negligible. However, most malicious programs’ size is relatively smaller than benign programs. In our experiment, computing MSA of 1,000 sequences took around 30 seconds under Intel i5 core and 8 GB memory PC environment.
In addition, to overcome such complexity issue, we adopted blacklist and whitelist based filtering, which excludes well-known benign or malicious programs. In order to reduce the computing cost, we recommend updating well-known benign programs or malicious programs list continuously from trustable sites. For example, this software list can be retrieved from the National Software Research Library [39].
Recent malware can hide their activities by using logic bomb or checking the defender's presence (e.g., checking the debugger process). Furthermore, they tend to adopt the obfuscation techniques. Therefore, detecting malware's hiding functionalities becomes challenging. Although several works try to detect logic bomb [40, 41] by using the symbolic-execution method, this method is not cheap. Our system can detect the antidebugging functionality that includes the API calls such as Sleep, GetLocalTime, and IsDebuggerPresent. However, there are other patterns to hide malware's activities. We need to enhance our system to better detect antidebugging functionality in the future.
The hooking process used in this study can only trace user-level APIs, and, therefore, the API call sequences cannot be logged if malware uses kernel-level APIs. In future work, we will consider malware using kernel-level APIs. We expect that this will improve the accuracy of the proposed system.
In the future, with the proliferation of mobile platform and the introduction of IoT (Internet of Things), we need to adjust our system for the use in those eras.
Our prototype is not well elaborated for practitioners. Hence, further study or elaboration is needed in the future.
4. Conclusion
In this paper, we proposed a novel method of API call sequence analysis. We found that antivirus vendors’ labeling of malware could be less accurate to be applied in the dynamic analysis of API call sequences. Therefore, instead of extracting API call sequence for each class, we extract API call sequence patterns from malware in different categories, focusing on common malware functions. Then, we developed a signature database and the proposed APIMDS for detecting malware based on the signature. Experimentally, our system showed promising detection results with high accuracy and extremely low error rate.
Many malware detection systems rely on the signature of a malware's static information, such as file size, process, and its artifacts. Therefore, they fail to detect new unknown malware until the signature has been updated. In contrast to the signature of the malware's static information, our signature database of dynamic information of critical API call sequence patterns allows us to generalize malware behavior and facilitates the effective detection of new unknown malware. Our method can be applied to both of the traditional PCs and new smart devices if the devices can extract and send the extracted API call sequence information to the analysis module outside. Because API call sequence is well abstracted behavioral data that can be extracted from most of the device, therefore, our method can detect the attack by analyzing the big data that are extracted from the ubiquitous devices such as sensors, smart devices, PCs, and servers.
We believe that our proposed system can be applied in a new type of cyber security intelligence system that is required to investigate ever-evolving malware.
Footnotes
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1004) supervised by the NIPA (National IT Industry Promotion Agency). This work was also supported by the ICT R&D Program of MSIP/IITP [14-912-06-002, The Development of Script-Based Cyber Attack Protection Technology].
