Abstract
According to previous management early warning and risk control methods, the efficiency of management prediction is low, the effect is poor, and the disadvantages are obvious. This paper mainly studies the C4.5 algorithm, the Apriori algorithm and the K-means algorithm. On the basis of association rules, the data produced by the above three algorithms are fused, and an early warning model is built and optimized on the fused results of the processed data. The fused data used in this model serve as the base data, and association rules are used for data mining. The experimental results show that data fusion can address the problems of management early warning and risk control, and that the method has reference value for enterprise management.
Introduction
Almost all enterprises that fall into a crisis of operation and management show financial crisis as a sign [1]. The emergence of a financial crisis is a gradual, deteriorating process that is eventually reflected in financial indicators. Therefore, as an important part of business management [2, 3], financial management naturally requires the establishment of a corresponding financial early warning system [4]. It is of great research value and practical significance to build an effective early warning model of financial crisis, to obtain early warning signals of serious deterioration in the financial situation of listed companies as soon as possible, and to meet the increasingly urgent needs of stakeholders [5, 6]. In addition, correctly predicting the financial risks of enterprises is of great practical significance for protecting the interests of investors and creditors, helping operators prevent financial crises, and supporting government supervision of the quality of listed companies and the risks of the securities market [7]. This paper proposes a management early warning and risk control method based on data fusion mining [8–10]. The method can comprehensively analyze the hidden internal relations between operation and management results and various financial data, support effective measures to promote the reform of operation and management, and improve the effectiveness and quality of management decisions.
C4.5 algorithm
In 1986, J. Ross Quinlan published a paper entitled "Induction of Decision Trees" in the journal Machine Learning, which introduced the ID3 decision tree algorithm.
Attribute metrics
Given a sample set S containing m classes, the expected information (entropy) needed to classify a sample in S is

Info(S) = −Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the proportion of samples in S that belong to class i.

The information gain Gain(S, A) of an attribute A is the reduction in entropy obtained by partitioning S on A:

Gain(S, A) = Info(S) − Info_A(S)

Among them, Info_A(S) = Σ_{v ∈ Values(A)} (|S_v| / |S|) Info(S_v), where S_v is the subset of S for which attribute A takes the value v.

The principle by which the ID3 algorithm chooses attribute A as the test attribute is to maximize the information gain Gain(S, A), so that the resulting partition needs the least additional information to classify the samples.
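As a minimal sketch of how ID3 scores attributes, the entropy and information gain can be computed as follows (the toy data and function names here are our own, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy Info(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(S, A): reduction in entropy from splitting on attribute attr_index."""
    n = len(labels)
    # Partition the labels by the value of the chosen attribute.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Toy data: one attribute (index 0) that perfectly separates the classes.
rows = [("a",), ("a",), ("b",), ("b",)]
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))             # 1.0 for a 50/50 class split
print(info_gain(rows, labels, 0))  # 1.0: the split removes all uncertainty
```

ID3 would evaluate `info_gain` for every candidate attribute and split on the one with the largest value.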
The ID3 algorithm has a drawback in using information gain as the test attribute selection criterion: the information gain measure is biased toward attributes with many values, but attributes with more values are not necessarily the best attributes.

To solve this problem, Quinlan proposed the C4.5 algorithm, which modifies the classification evaluation function by using the information gain ratio in place of the information gain. C4.5 is not only the successor of ID3, but also the basis of many decision tree algorithms. Among decision tree algorithms running on a single machine, C4.5 has both high classification accuracy and high speed.
The information gain ratio penalizes multi-valued attributes by introducing a term called split information, which measures the breadth and uniformity with which an attribute splits the data:

SplitInfo_A(S) = −Σ_{j=1}^{v} (|S_j| / |S|) log2(|S_j| / |S|)

where S_1, …, S_v are the subsets produced by partitioning S on the v values of attribute A.

The information gain ratio is the ratio of the information gain to the split information:

GainRatio(S, A) = Gain(S, A) / SplitInfo_A(S)
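A small sketch of the penalty at work (toy values of our own choosing): an attribute with many distinct values has high split information, so even a large gain shrinks once divided by it.

```python
from collections import Counter
from math import log2

def split_info(values):
    """SplitInfo_A(S): entropy of the partition induced by the attribute values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    """GainRatio(A) = Gain(A) / SplitInfo_A(S), guarding against a zero split."""
    si = split_info(values)
    return gain / si if si > 0 else 0.0

# An attribute with four distinct values splits the data very finely,
# so its split information is high and its gain is penalised.
values = ["v1", "v2", "v3", "v4"]
print(split_info(values))       # log2(4) = 2.0
print(gain_ratio(1.0, values))  # a gain of 1.0 shrinks to 0.5
```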
Based on ID3, the C4.5 algorithm adds the handling of continuous attributes and missing attribute values, and has a more mature method for tree pruning. The main idea of the C4.5 algorithm is: given a training sample set T and a list of candidate attributes T_attribute_list, recursively select the attribute with the highest information gain ratio as the test at the current node, partition T on that attribute, and build a subtree for each partition.
The specific pseudo code description of the algorithm is C4.5 form-tree(T, T_attribute_list):
C4.5 algorithm form-tree (T, T_attribute list)
C4.5 is a greedy algorithm that uses a top-down, divide-and-conquer recursive strategy to construct a decision tree. In addition to the improved classification evaluation function, it improves on ID3 in the following two aspects:
On the one hand, while ID3 can only deal with discrete values, C4.5 can also handle continuous-valued attributes. For a continuous attribute A, C4.5 sorts the observed values, considers candidate thresholds at the midpoints between adjacent values, and converts the attribute into a binary test of the form A ≤ t versus A > t, choosing the threshold t with the best score.
On the other hand, the C4.5 algorithm handles a training sample with a missing attribute value by splitting it according to all possible values of the missing attribute, dividing the instance into multiple instances belonging to different branches. During execution, a probabilistic method is adopted: each fragment is assigned a weight equal to the occurrence probability of the corresponding attribute value in the classification. In this way, the number of samples passed down a path may not be an integer but a fraction, which does not affect the computation of the algorithm.
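The continuous-attribute handling can be sketched as follows: try each midpoint between adjacent sorted values as a threshold and keep the one with the best split. For brevity this sketch scores thresholds by plain information gain rather than the gain ratio C4.5 actually uses; the data are a toy example of our own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split v <= t that maximises information gain,
    trying candidate thresholds halfway between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best = (None, -1.0)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

t, gain = best_threshold([1.0, 2.0, 7.0, 8.0], ["no", "no", "yes", "yes"])
print(t, gain)  # 4.5 1.0: the midpoint between the two classes separates them fully
```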
Basic concepts of association rules
Association rule mining is widely used in transaction databases.
Suppose I = {i1, i2, …, im} is the set of all items and D is a set of transactions (the transaction database), where each transaction T is a set of items such that T ⊆ I.
Definition 1: Association Rule.
Suppose A and B are non-empty sets of items, that is, A ⊆ I, A ≠ ∅, B ⊆ I, B ≠ ∅ and A ∩ B = ∅. Then an expression of the form A ⇒ B is called an association rule, meaning that the appearance of item subset A leads to the appearance of item subset B. A is the antecedent (precondition) of the rule and B is the consequent (result).
Definition 2: Support
Let A be a non-empty set of items, that is, A ⊆ I and A ≠ ∅. The support of rule A ⇒ B in transaction database D is the ratio of the number of transactions containing both A and B to the total number of transactions, recorded as Supp(A ⇒ B), that is,

Supp(A ⇒ B) = |{T ∈ D : A ∪ B ⊆ T}| / |D| = P(A ∪ B)

Physical significance of support: in statistical terms, the support of item set A, Supp(A), is the probability that A appears in a transaction of database D.
Definition 3: Confidence
The confidence of rule A ⇒ B in the transaction set is the ratio of the number of transactions containing both A and B to the number of transactions containing A, recorded as Conf(A ⇒ B), that is,

Conf(A ⇒ B) = Supp(A ⇒ B) / Supp(A) = P(B | A)

Physical significance of confidence: for association rule A ⇒ B, the confidence indicates how likely a transaction containing A is to also contain B; from the statistical point of view, confidence is a conditional probability, that is, the probability of B given A.
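The two measures can be computed directly from a transaction database; the toy grocery transactions below are our own illustration, not data from the paper.

```python
def support(transactions, itemset):
    """Supp: fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conf(A => B) = Supp(A ∪ B) / Supp(A)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Toy transaction database: each transaction is a set of items.
D = [{"bread", "milk"},
     {"bread", "butter"},
     {"bread", "milk", "butter"},
     {"milk"}]
print(support(D, {"bread", "milk"}))       # 2/4 = 0.5
print(confidence(D, {"bread"}, {"milk"}))  # 0.5 / 0.75 ≈ 0.667
```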
Definition 4: Strong Association Rule
The minimum support and confidence of a given association rule are MinSupp and MinConf. For association rule A ⇒ B, if Supp (A ⇒ B)≥MinSupp and Conf (A ⇒ B)≥MinConf, then association rule A ⇒ B is called strong association rule.
In statistical sense, the minimum support degree represents the lowest importance of association rules, and the minimum confidence degree represents the lowest reliability of association rules. Therefore, strong association rules are important and reliable association rules with expected value. Association rules that do not meet the above two conditions are also called weak association rules.
When the support of itemset A is not less than MinSupp, A is called a frequent itemset, or frequent set for short.
Let A and B be item sets in data set D. Given a transaction set D, the problem of mining association rules is to generate all rules whose support and confidence are greater than the user-specified minimum support (MinSupp) and minimum confidence (MinConf). When the support and confidence of a rule exceed MinSupp and MinConf respectively, the rule is considered effective.
The mining process of association rules includes two stages: the first stage finds all frequent itemsets in the data set, and the second stage generates association rules from these frequent itemsets.

In the first stage, all frequent itemsets must be found in the original data set. Frequent means that the frequency of an itemset, relative to all records, must reach a certain level; this frequency is the support defined above. Taking a 2-itemset {A, B} as an example, its support can be obtained from the definition of support above. If the support is greater than or equal to the minimum support threshold, then {A, B} is a frequent itemset. A k-itemset meeting the minimum support is called a frequent k-itemset, generally denoted Large k or Frequent k. The algorithm generates Large k+1 from the Large k itemsets until no more frequent itemsets can be found.

The second stage is to produce association rules from the frequent k-itemsets found in the first stage. Under the minimum confidence threshold, if the confidence of a rule meets the minimum confidence, the rule is called an association rule. For example, the confidence of the rule A ⇒ B generated from the frequent itemset {A, B} can be obtained from the definition of confidence above; if it is greater than or equal to the minimum confidence, then A ⇒ B is an association rule.
Apriori algorithm
Agrawal first proposed the problem of mining association rules among item sets in customer transaction databases in 1993. The core method is the Apriori algorithm, which is based on frequent-set theory. Apriori is one of the most influential algorithms for mining the frequent itemsets of association rules. It uses known frequent itemsets to derive other frequent itemsets, and is a breadth-first algorithm.
The Apriori algorithm divides association rule mining into two sub-problems: first, mining all frequent itemsets whose support is not less than the minimum support MinSupp from the transaction database D; second, generating association rules whose confidence is not less than the minimum confidence MinConf from the mined frequent itemsets. The algorithm is shown in Table 2.
Apriori algorithm
Input data: transaction database D; minimum support threshold MinSupp.
Output result: frequent item set L in D.
Step 2 finds the set L1 of frequent 1-itemsets. Steps 3–11 use Lk−1 to generate the candidate set Ck in order to find Lk. Apriori_Gen performs two actions, join and prune, that is, it generates the candidate itemset Ck by joining the frequent itemsets in Lk−1 and then pruning. The specific process is described in Table 3.
Apriori_Gen algorithm
According to the Apriori property, all subsets of a frequent itemset must also be frequent. The algorithm uses level-wise search: given candidate k-itemsets, we only need to check whether their (k−1)-subsets are frequent. The test for infrequent subsets is described in Table 4.
Test algorithm of frequent subsets
The key to the algorithm's efficiency is generating a small candidate set, that is, avoiding as far as possible the generation and counting of candidate itemsets that cannot become frequent. It exploits the basic property that any subset of a frequent itemset must also be frequent. This property is inherited by most current association rule algorithms.
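Since the pseudo code tables are not reproduced here, the join and prune steps can be sketched as follows. This is a minimal level-wise implementation of our own, using the same toy transactions as before, not the paper's Table 2–4 code.

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Level-wise frequent-itemset mining: a minimal Apriori sketch."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items) if supp(frozenset([i])) >= min_supp]
    frequent = {s: supp(s) for s in level}
    k = 2
    while level:
        # Join step: union pairs of (k-1)-itemsets that yield a k-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if supp(c) >= min_supp]
        frequent.update({c: supp(c) for c in level})
        k += 1
    return frequent

D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(D, min_supp=0.5)
print(sorted(tuple(sorted(s)) for s in freq))
# [('bread',), ('bread', 'butter'), ('bread', 'milk'), ('butter',), ('milk',)]
```

Note how {milk, butter} (support 0.25) never becomes frequent, so the 3-itemset {bread, milk, butter} is pruned without its support ever being counted.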
K-means algorithm
The K-means algorithm proposed by J. B. MacQueen in 1967 is a classical clustering algorithm, widely used in scientific research and industrial applications. K-means is an indirect clustering method based on a similarity measure between samples and belongs to the unsupervised learning methods. The task is to divide the data set into k disjoint sets of points, so that the points within each set are as homogeneous as possible. That is to say, given a set of N data points {x1, x2, …, xN}, the goal is to partition them into k clusters C1, …, Ck such that points within the same cluster are as similar as possible.
The basic idea of the algorithm is: given a database containing n data objects and the number of clusters to be generated k, randomly select k objects as the initial cluster centers, then calculate the distance between each remaining sample and every cluster center, assign each sample to the nearest center, and recompute the mean of each adjusted cluster. If the centers do not change between two adjacent iterations, the adjustment of samples is over and the clustering squared-error criterion function E has converged. The function E is as follows:

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²

Here E is the sum of the squared errors of all objects in the database, p is a point in the space representing a given data object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional). The algorithm is shown in Table 5.
K-means algorithm flow
This algorithm has good scalability. Its disadvantages are that it scans the database many times; it can only find spherical clusters, not clusters of arbitrary shape; the choice of initial centroids has a great influence on the clustering results; and the algorithm is very sensitive to noise.
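The assign-then-recompute loop described above can be sketched on 2-D points as follows (a minimal illustration with made-up points; a real implementation would handle empty clusters and restarts more carefully):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest centre, recompute means,
    and stop when the centres no longer move (i.e. E has converged)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2)
            clusters[i].append(p)
        # Recompute each centre as the mean of its cluster (keep old centre if empty).
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]
        if new == centres:  # centres unchanged: converged
            break
        centres = new
    return centres, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, clusters = kmeans(pts, k=2)
print(sorted(centres))  # [(0.0, 0.5), (10.0, 10.5)]
```

With two well-separated groups the algorithm converges in a couple of iterations regardless of which points are sampled as initial centres; on harder data the sensitivity to initialisation noted above becomes visible.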
Selection of financial indicators
It is a gradual process for an enterprise to fall into financial difficulties. The gradual deterioration of its production and operation is usually reflected quickly in the enterprise's financial statements as abnormal financial indicator data. Many factors affect the financial status of enterprises, but the data for some indicators are difficult to obtain and would require substantial human and material resources, so financial ratios with high acquisition costs are not considered. Following the principle of operability and based on the indicators provided in financial reports, this paper selects 29 financial indicators that comprehensively reflect profitability, solvency, operating capacity and cash flow to build the financial early warning model. These include four financial indicators covering company size and growth capacity that are not usually considered in the literature but which we believe have a strong influence on financial risk prediction: log(total assets), log(net assets * total shareholders' equity), growth rate of total assets, and growth rate of operating revenue. The selected indicators are shown in Table 6.
List of financial crisis early warning indicators
We use seven classification methods provided by the data mining software Weka (Bayesian network, decision tree, rule-based classification, nearest neighbor classification, multi-layer perceptron, BP neural network, and logistic regression) to establish and analyze various early warning models. A large amount of data analysis is carried out in two stages: first, risk models are built and analyzed with all financial indicators using the seven classification methods; then indicators are selected using data mining methods, and risk models are rebuilt and analyzed with the selected indicators.
The modeling process is based on the original data set without attribute selection. For each classification algorithm, two models are established respectively: the 2010–2015 data set and the 2010–2016 data set are used as training sets; the 2016 related data and the 2017 related data are used as test sets. Table 7 shows the test results of different classification methods on two sets of data.
Prediction accuracy of different classification algorithms for two different data sets before attribute selection
The experimental results show that nearest neighbor classification, multi-layer perceptron, BP neural network and logistic regression perform essentially the same. Bayesian network, decision tree and rule-based classification likewise show no significant differences among themselves; their overall accuracy is significantly lower than the first four methods, but their recognition accuracy for ST companies (about 60%) is significantly higher.
The experimental results also show that when the data from 2010 to 2015 are used as the training set to predict the data of 2016 and 2017, the prediction accuracy of most methods is lower than when the data from 2010 to 2016 are used to predict 2017. This suggests that the behavior of the stock market in 2016 and 2017 differed markedly from that in 2010–2015, which is why the prediction accuracy of the model is not ideal.
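The year-based train/test protocol described above can be sketched as follows. Everything here is a stand-in: the records are synthetic, the three "indicator" features and the 30% distress rate are invented, and a simple nearest-centroid classifier replaces the Weka models; only the splitting scheme mirrors the text.

```python
import random

# Hypothetical records: (year, indicator vector, label); label 1 = financial distress (ST).
rng = random.Random(42)

def make_record(year):
    risky = rng.random() < 0.3
    x = [rng.gauss(1.0 if risky else 3.0, 0.5) for _ in range(3)]
    return year, x, int(risky)

data = [make_record(y) for y in range(2010, 2018) for _ in range(50)]

def centroid(rows):
    return [sum(v) / len(rows) for v in zip(*rows)]

def fit(train):
    # Nearest-centroid "model": one centroid of the indicator vectors per class.
    return {c: centroid([x for _, x, y in train if y == c]) for c in (0, 1)}

def predict(model, x):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

def accuracy(model, test):
    return sum(predict(model, x) == y for _, x, y in test) / len(test)

# Protocol from the text: train on 2010-2015 to test 2016, and on 2010-2016 to test 2017.
for last_train, test_year in [(2015, 2016), (2016, 2017)]:
    train = [r for r in data if r[0] <= last_train]
    test = [r for r in data if r[0] == test_year]
    print(test_year, round(accuracy(fit(train), test), 3))
```

Because the synthetic classes here are well separated and the data distribution does not drift across years, this sketch will not reproduce the accuracy drop the paper observes between periods; it only illustrates the evaluation setup.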
Based on the data from 2010 to 2015, using the three attribute selection methods BestFirst, GreedyStepwise and LinearForwardSelection in Weka, we retain 9 attributes: earnings per share (diluted operating profit), debt asset ratio (total assets), log(net assets * total shareholders' equity), total asset growth rate, cash liability ratio, operating profit, total owners' equity (including minority shareholders' equity), net assets, etc.
For the selected 9 indicators, two models are established for each classification algorithm: 2010–2015 data set and 2010–2016 data set as training set; 2016 related data and 2017 related data as test set. The test results are given in Table 8.
Prediction accuracy of different classification algorithms for two different data sets after attribute selection
It can be seen from Table 7 and Table 8 that after attribute selection the prediction accuracy of the various models does not change much; most are slightly improved. However, the amount of data is reduced by nearly two thirds compared with that before attribute selection, so the time to build a model is greatly shortened. With the multi-layer perceptron, the modeling time drops to 16.5% of the time before attribute selection, and the modeling times of the other classification methods drop to between 24.74% and 57.57% of their pre-selection values. At the same time, the model after attribute selection is more concise and the time to classify new data is correspondingly shortened, which shows that the model after attribute selection has better applicability. In addition, note that after attribute selection, three of the four new indicators proposed in this paper are retained (log(total assets), log(net assets * total shareholders' equity), and the growth rate of total assets), which supports the inclusion of these new indicators.
Based on the analysis of the financial data of non-financial listed companies from 2010 to 2017, four new indicators are introduced: log(total assets), log(net assets * total shareholders' equity), growth rate of total assets, and growth rate of operating revenue, giving a total of 29 indicators for risk analysis. Seven different classification methods are used to model financial risk. The results show that the best four methods perform essentially the same, and that a risk early warning model built with nine representative indicators can achieve good risk prediction. Generally speaking, the data set we deal with is balanced, and the adopted machine learning methods achieve satisfactory classification performance even on the minority class of unbalanced data. Therefore, the method based on data fusion mining proposed in this paper has reference value for enterprise management early warning and risk control.
Acknowledgments
This paper is supported by the Social Science Foundation of Shaanxi Province (No. 2019S025) and the Soft Science Project of Shaanxi Province (No. 2019KRM047).
