Abstract
This study proposes a way of developing granular models based on optimized subsets of data with different sampling sizes, in which three widely used models, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Long Short-Term Memory (LSTM), are designed and transformed into granular versions that achieve good performance while retaining sufficient functionality. First, a collection of subsets is determined using different sampling methods; these subsets subsequently serve as the essential prerequisite of the proposed models. Then, the principle of justifiable granularity is applied to the design of interval information granules based on the subsets of data. The design process is associated with a well-defined optimization problem realized by achieving a sound compromise between two conflicting criteria: coverage and specificity. To evaluate the performance of the granular models, two aspects are considered: (i) the sampling methods used to determine suitable subsets of data; and (ii) the different models transformed into granular models. A series of experimental studies is conducted to verify the feasibility of the proposed granular models.
Introduction
When dealing with large amounts of data, it is necessary to pay attention to their accuracy and interpretability in order to improve the performance of system modeling. Granular computing1,2 is one of the emerging information processing tools in computational intelligence. It is capable of solving complex problems by simulating human thinking and opens a new and attractive direction for system modeling. Granular computing encompasses a series of concepts, methodologies, algorithms, and applications for reasoning and computation realized on the basis of information granules, which are genuine generalizations of numeric data (experimental evidence). 3 To be more specific, granular data modeling is commonly realized on the basis of information granules rather than numeric data, and the outputs of granular models are also expressed in the form of information granules. Information granules provide more effective solutions for expressing data by considering both completeness and accuracy, so that data analysis can be formalized at a higher level of abstraction. As collections of elements carefully designed according to their characteristics and similarity, information granules can reveal the inherent topological structure of the data and be used for the realization of system modeling.
To improve the efficiency of granular modeling, a subset of data is used to represent the characteristics of the entire dataset. Through sampling, the computational complexity can be effectively reduced and the processing speed increased. The computational burden of large-scale data makes data sampling indispensable, and it is critical to determine which sampling method to choose. 4 As long as the selected subsets represent the overall characteristics of the data, the performance of data analysis can be ensured.
It is well recognized that data modeling and processing play a significant role in traditional computational intelligence. However, a commonly encountered situation is that most data modeling is developed based on numeric data. It is both interesting and challenging to implement the design process in the framework of granular computing. In the era of big data analysis, it is not advisable to deal with individual samples; understanding the main characteristics of the entire data set is far more important than any individual sample. It is worth noting that granular computing serves as a fundamental mechanism in data analysis and modeling. Numeric data often fail to meet the requirements of completeness and accuracy when expressing uncertain information. Information granules provide more effective solutions to such uncertain problems. An information granule is a carefully designed and abstracted collection of elements based on the characteristics and similarity of the data, and it can express the meaning of the data completely and accurately. In essence, granular computing solves complex problems by dividing them into a series of sub-problems that are easier to manage.
The contribution of this paper is to develop a framework of granular data modeling and analysis based on subsets of data. There are several vital steps. First, different sampling methods are used to obtain a series of subsets of data. Second, based on these subsets, granular models with good performance can be obtained. A granular model constructed with a subset of data is anticipated not only to guarantee its functionality but also to improve its efficiency. The significance of this study is to obtain an accurate and reliable granular model by which the computational complexity can be effectively reduced. In the development of granular models, a two-objective optimization problem is encountered: the coverage criterion requires that as much of the available experimental evidence as possible be included in the obtained information granules, while specificity requires that an information granule be as specific as possible. It is therefore anticipated to achieve a sound compromise between the two conflicting criteria of coverage and specificity. In the design process, information granules are constructed as the core part of representing data, which endows the constructed model with good interpretability. To be more specific, the outputs of the granular model are presented as information granules, thereby compensating for the loss of features incurred in the sampling process.
The remainder of this paper is organized as follows. In Section II, we provide a brief review of the related works. Granular models are designed and evaluated in Section III. Experimental studies concerning a synthetic dataset and a collection of publicly available machine learning datasets are reported in Section IV. Conclusions and future research are offered in Section V.
Related works
Processing large-scale data sets is time-consuming and brings huge challenges to system modeling. Therefore, data sampling is considered an effective way to address this problem. He et al. 5 use MapReduce to sample uncertain data and propose a block-based sampling method for large data sets, assigning the entire data set to a distributed system. 6 It is well known that data sampling is a method of extracting a portion of sample units that are sufficiently representative of all samples to be studied. Through sampling, one can estimate and infer the characteristics of all samples from the analysis of the sampled units. However, if the sample size is too small, wrong conclusions may be drawn. Singh and Masuku 7 discuss in detail some traditional methods for determining a suitable sample size. Albattah 8 studies the role of data sampling in big data analysis and argues that even if we can process the entire dataset, there is no need to do so. When data sampling is conducted in machine learning, the most appropriate samples are those that yield the best performance, which indicates that increasing the sample size does not necessarily improve the accuracy of data analysis. 9 It is convincing that sampling methods can help improve the efficiency of essential data analysis and will become a necessary preprocessing step for big data in future research.
It is well known that data sampling can reduce the computational complexity of data modeling; however, it results in a loss of features to some extent. As a novel platform for representing and describing data, granular computing has emerged to deal with such complex problems.
The concept of granular computing was first proposed by Zadeh 10 in 1979. He believes that information granules exist in many knowledge fields, but the manifestation of knowledge differs across fields. Lin 11 regards the information granule as a descriptive name summarizing available information in the form of granules. In subsequent research, 12 he introduces granular computing into data mining and machine learning. Currently, many researchers focus on applying granular computing to data mining. For example, Chen et al. 13 propose a new ML-KNN algorithm under the framework of granular computing, and the results show that the model is superior to most traditional algorithms. In pattern classification based on granular computing, Panda and Tripathy 14 and Mehmood et al. 15 combine fuzzy logic, neural networks, and Support Vector Machines to develop an efficient granular information processing paradigm. The concept of the granular neural network was first proposed in 2000, 16 and it is further formalized and discussed by Pedrycz and Vukovich. 17 In recent years, granular neural networks have attracted the attention of many researchers.16,18,19 A granular neural network is formed on the basis of a given (numeric) neural network and is implemented in a way that allows it to process information granules. In order to find patterns and rules in big data analysis, many methods are used in data mining, and data sampling plays an irreplaceable role. This paper proposes granular models based on the obtained optimal subsets, which can effectively reduce unnecessary processing time while ensuring the performance of the granular models.
Design and evaluation of granular models
In this paper, we consider using representative subsets to improve the efficiency of system modeling while ensuring sufficient performance.
Figure 1 presents a framework of the proposed granular model in this paper.

Figure 1. Development framework of a granular model.
Selection of data subsets
Given a dataset composed of n-dimensional inputs and a single-dimensional output, that is, a collection of input–output pairs (xk, yk) with xk ∈ Rn and yk ∈ R, k = 1, 2, …, N, subsets of data are first selected with the aid of different sampling methods and sampling sizes.
First, a brief introduction to the three sampling methods used in this paper is given in Table 1. It should be noted that when the stratified sampling method is applied to a data set with no obvious strata, the data can simply be stratified according to its distribution.
Table 1. A brief introduction to sampling methods.
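To make the three strategies concrete, a minimal sketch of how they could be implemented is given below; the function names, the use of NumPy, and the choice of stratifying by the output value when no natural strata exist are illustrative assumptions rather than the exact procedures used in this paper.

```python
import numpy as np

def uniform_sampling(data, ratio, seed=0):
    """Simple random sampling: draw the requested fraction of rows without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=int(len(data) * ratio), replace=False)
    return data[idx]

def systematic_sampling(data, ratio):
    """Systematic sampling: keep every k-th row, with k set by the sampling ratio."""
    k = max(int(round(1.0 / ratio)), 1)
    return data[::k]

def stratified_sampling(data, ratio, n_strata=5, seed=0):
    """Stratified sampling: stratify by the output value (assumed to be the last column)
    and sample each stratum proportionally."""
    rng = np.random.default_rng(seed)
    order = np.argsort(data[:, -1])
    picked = []
    for stratum in np.array_split(order, n_strata):
        size = max(int(len(stratum) * ratio), 1)
        picked.append(rng.choice(stratum, size=size, replace=False))
    return data[np.concatenate(picked)]
```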
Based on a series of subsets obtained in this way, numeric models are constructed and subsequently transformed into their granular counterparts.
Design of interval information granules with the principle of justifiable granularity
The principle of justifiable granularity is one of the leading paradigms offering a solid guideline for forming information granules. 20 The essence of the paradigm is to form a single meaningful information granule based on available experimental evidence (data), and such a construct is required to adhere to two intuitively compelling requirements: coverage and specificity. 21
Here we assume that an information granule Ω=[a, b] represents an interval with an upper bound b and a lower bound a.
(a) Coverage (cov): The amount of numeric experimental data accumulated within the information granule should be as high as possible, which means that the information granule is well justified (supported) by the available experimental evidence. 3 Consider the elements of data located to the right of the numeric representative m (which could be the mean or median value of the experimental evidence); the coverage is determined as the cardinality of the elements contained within the range [m, b]:

cov(b) = card{ yk | yk ∈ [m, b] }
(b) Specificity (sp): Specificity measures the degree of accuracy of the information granule in expressing the experimental evidence. 22 It is expressed as a decreasing function of the distance between the upper bound b and the representative m, for example

sp(b) = exp(−α|b − m|)

The value of α has a certain impact on the calibration of the specificity criterion in the construction of information granules. It is worth noting that if α = 0, then sp = 1 and the specificity criterion is completely excluded (ignored); in this case, the upper bound is obtained as b = ymax, where ymax is the maximal value of the output data. Larger values of α emphasize the influence of the specificity criterion. 23
It is noted that the expressions of both coverage and specificity are not unique. However, an increase in coverage corresponds to a decrease in specificity.24,25 Since the two requirements are in conflict, we are interested in achieving a good balance between coverage and specificity. 26 One option is to treat this as a two-objective optimization problem and examine the resulting Pareto front. 27 Alternatively, one can consider an aggregation of the two requirements, succinctly expressed as the product of coverage and specificity:

V(b) = cov(b) × sp(b)
By maximizing V(b), the optimal upper bound b can be obtained, that is, bopt = arg maxb V(b). A similar process is carried out for the lower bound a.
In summary, the entire design process of information granules with the use of the principle of justifiable granularity is summarized in Algorithm 1.
Algorithm 1. Construction of Interval Information Granules.
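As a companion to Algorithm 1, the following sketch shows one way the interval construction could be coded, assuming the median as the numeric representative and the exponential specificity described above; restricting candidate bounds to the observed data values is also an assumption made for simplicity.

```python
import numpy as np

def optimize_upper_bound(y, alpha=2.0):
    """Find the upper bound b maximizing V(b) = cov(b) * sp(b) for the data in y."""
    y = np.asarray(y, dtype=float)
    m = np.median(y)                          # numeric representative (median here)
    best_b, best_v = m, -np.inf
    for b in np.unique(y[y >= m]):            # candidate bounds: observed values to the right of m
        cov = np.sum((y >= m) & (y <= b))     # cardinality of data falling into [m, b]
        sp = np.exp(-alpha * abs(b - m))      # specificity (exponential form assumed)
        if cov * sp > best_v:
            best_b, best_v = b, cov * sp
    return best_b

def build_interval_granule(y, alpha=2.0):
    """Construct the interval information granule [a, b] around the median of y."""
    y = np.asarray(y, dtype=float)
    b = optimize_upper_bound(y, alpha)
    a = -optimize_upper_bound(-y, alpha)      # mirror the data to optimize the lower bound
    return a, b
```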
Evaluation of granular models
To evaluate the performance of the designed granular model, the general ideas of coverage and specificity are augmented to consider both the completeness and the accuracy of the outputs of the model. Here, the coverage is defined as the proportion of the original data included in the granular outputs:

cov = (1/N) card{ k | yk ∈ Yk }

where yk is the original output of the k-th sample, Yk = [yk−, yk+] is the corresponding granular output produced by the model, and N is the number of data.
The specificity is quantified by the length of the interval, calculated as the average specificity over all intervals 28 as follows:

sp = (1/N) Σk=1,…,N (1 − |yk+ − yk−| / (ymax − ymin))

where ymax and ymin represent the maximum and minimum values of the output space, respectively.
The overall performance index is expressed as the product of the coverage and specificity criteria:

Q = cov × sp
By evaluating this performance index on subsets with different sample sizes, an optimal subset can be obtained. The evaluation process is summarized in Algorithm 2.
Algorithm 2. Evaluation of the Granular Model.
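A compact sketch of the evaluation step in Algorithm 2 could look as follows; it assumes the granular outputs are given as per-sample intervals [yk−, yk+] and uses the averaged-length form of specificity described above.

```python
import numpy as np

def evaluate_granular_outputs(y_true, lower, upper, y_min, y_max):
    """Return coverage, specificity, and the overall index Q = cov * sp."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    cov = np.mean((y_true >= lower) & (y_true <= upper))        # fraction of covered outputs
    lengths = np.clip((upper - lower) / (y_max - y_min), 0.0, 1.0)
    sp = np.mean(1.0 - lengths)                                  # average specificity of the intervals
    return cov, sp, cov * sp
```

Scanning Q over subsets of increasing sampling ratio then reveals the "knee" point used to select the optimal subset.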
Experimental studies
In this section, we present a series of experiments completed based on a synthetic dataset and several publicly available datasets coming from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) to illustrate the performance of the proposed models. The experiments are carried out on a Windows 10 operating system; the processor is an AMD Ryzen 7 4800U with Radeon Graphics at 1.80 GHz. Python 3.6 is used as the programming environment.
Synthetic data
A randomly generated data set with 2-dimensional inputs and a single-dimensional output is used in the experiments. The generating function is described as follows:
where x1 is in [0,50] and x2 is in [−10,10]. In the experiments, a dataset containing 3000 pairs of input and output data is used (following uniform distribution over the input space). The dataset is divided into a 90%–10% proportion of training and testing data sets.
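The data generation and split can be sketched as follows; the function f below is a purely hypothetical stand-in, since the paper's actual target function is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 50, 3000)                 # uniform over [0, 50]
x2 = rng.uniform(-10, 10, 3000)               # uniform over [-10, 10]

def f(x1, x2):
    """Hypothetical stand-in for the paper's target function (not reproduced here)."""
    return np.sin(x1 / 5.0) + 0.1 * x2

X = np.column_stack([x1, x2])
y = f(x1, x2)

split = int(0.9 * len(y))                     # 90%-10% training/testing split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```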
Experimental regression models and parameter settings
In this study, three regression models, namely KNN, SVM, and LSTM, are selected as the numeric models. Both KNN and SVM are classical supervised learning algorithms. KNN considers each sample, whereas SVM aims to find a function that achieves the separability of samples in the design process. Moreover, KNN struggles with high-dimensional data sets, whereas SVM is good at processing high-dimensional data. Naive KNN does not learn feature weights on its own, whereas the essence of SVM is to find weights. LSTM is a classical deep-learning recurrent neural network model that has emerged in recent years and has excellent capabilities in the field of time-series processing. 29
In the experiments, we use the grid search method, an exhaustive search over specified parameter values, together with cross-validation to avoid overfitting.30,31
First, the grid search method traverses each combination of hyperparameters in the search grid. Second, cross-validation is used to evaluate each combination and obtain an evaluation score. The scores of all combinations are compared to select the optimal hyperparameters, which are then used for model training. Grid search operates within a specified range of each parameter: the parameters are adjusted in turn according to a step length, the learner is trained with the adjusted parameters, and the parameter setting with the highest accuracy on the validation set is retained. K-fold cross-validation divides the data set into k parts, takes one part as the testing set each time, uses the remaining k−1 parts as the training set, calculates the score of the model on the testing set, and records the average score as the final score.
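With scikit-learn, this combination of grid search and k-fold cross-validation could be sketched as follows; the KNN estimator and the search range shown are illustrative rather than the exact grid used in this study.

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Illustrative grid; the exact search ranges are not reproduced here.
param_grid = {"n_neighbors": list(range(1, 31))}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(KNeighborsRegressor(), param_grid,
                      cv=cv, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)                  # training subset obtained by sampling
print(search.best_params_, search.best_score_)
```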
(a) KNN: For two n-dimensional vectors xi and xj, the Euclidean distance is computed as

d(xi, xj) = sqrt((xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)²)

For a given testing sample, the K closest samples in the training set are found according to the Euclidean distance, and the prediction is then made from these K neighbors.
Through verification and analysis with the grid search method, the best prediction performance is obtained when the value of k in KNN is set to 10.
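For completeness, the KNN prediction step can be written out directly; averaging the neighbors' outputs is assumed here as the regression rule, and k = 10 follows the grid search result.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=10):
    """Predict the output of x_query from its k nearest training samples."""
    d = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))   # Euclidean distances
    nearest = np.argsort(d)[:k]                             # indices of the k closest samples
    return np.mean(y_train[nearest])                        # average of the neighbors' outputs
```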
(b) LSTM: There are three gates in an LSTM, namely the input gate, the forget gate, and the output gate. When a piece of information enters the LSTM network, whether it is useful is judged according to pre-determined rules. I, F, and O are the degree parameters of the three gates, and g is the conventional RNN transformation of the input. It can be seen from Equation (10) that there are two outputs of the LSTM, namely the cell state c′ and the hidden state h′. c′ combines the content retained by the forget gate with the new input admitted by the input gate, that is, the content of the current cell itself. h′ is obtained through the output gate and is the content transferred to the next unit.
The prediction model uses a multi-layer LSTM structure: the input LSTM layer uses 32 units; the hidden LSTM layer uses 16 units; the output layer is a fully connected layer whose number of nodes corresponds to the length of the predicted data; and the activation function is the sigmoid function. Grid search and cross-validation are used to determine the time-window parameter. Through the verification and analysis of the grid search method, the best performance is obtained when the batch size is 10 and the number of epochs is 100.
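A Keras sketch consistent with this description is given below; the optimizer, the loss, the window size, and the windowed arrays X_train_seq and y_train_seq are assumptions made for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

window_size, n_features = 10, 2               # assumed time window and input dimensionality

model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(window_size, n_features)),  # input LSTM layer
    LSTM(16),                                                                 # hidden LSTM layer
    Dense(1, activation="sigmoid"),          # fully connected output (assumes targets scaled to [0, 1])
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train_seq, y_train_seq, batch_size=10, epochs=100, verbose=0)
```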
(c) SVM: The problem of Support Vector Regression (SVR) can be formalized as

min(w, b) (1/2)||w||² + C Σk lϵ(f(xk) − yk)

where C is the regularization constant and lϵ is the ϵ-insensitive loss function given by

lϵ(z) = 0 if |z| ≤ ϵ, and lϵ(z) = |z| − ϵ otherwise.
As the value of γ increases, the regression performance on the testing set gradually decreases and the model becomes prone to overfitting. By applying the grid search method to obtain the optimal parameters, the RBF kernel is used and the best accuracy is obtained when γ = 0.01.
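An RBF-kernel SVR tuned by grid search could be set up as follows; the parameter ranges are illustrative, with γ = 0.01 reported above as the optimal value.

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Illustrative search ranges; the paper reports gamma = 0.01 as optimal.
param_grid = {"gamma": [0.001, 0.01, 0.1, 1.0], "C": [1, 10, 100]}

search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
y_pred = search.best_estimator_.predict(X_test)
```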
Analysis of experimental results
The predicted outputs obtained with the three regression models, compared with the original output, are shown in Figure 2. Here, the neural network model is chosen as an example for analysis. Based on its numeric output, we use the principle of justifiable granularity to guide the construction of information granules, considering the determination of the upper bound b of the interval. The plots of the performance index V(b) regarded as a function of b, together with the curves of coverage and specificity, are shown in Figure 3.

Figure 2. Curves of the original data set (true values) and the outputs of the models (predicted values): (a) KNN model, (b) SVM model, and (c) LSTM model.

Figure 3. Plots of V(b) treated as a function of b, together with coverage and specificity, for the selected value α = 2.0.
The coverage curves obtained with the three regression models are included in Figure 4.

Figure 4. Plots of the coverage cov: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Evidently, as the sampling ratio increases, the coverage of the constructed granular outputs increases as well. The coverage of the granular KNN model is the best, and coverage under uniform sampling increases fastest.
Figure 5 displays the plots of the performance index Q(σj). There is a visible "knee" point with which the optimal sampling scale can be determined. For the three granular models, the optimal sampling ratio lies in the range of about 20%–40%. The best performance index is obtained with the granular KNN model, and uniform sampling helps reach the optimal value fastest.

Figure 5. Plots of the performance index Q: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Time consumption of sampling
The running time based on subsets of different sizes under each sampling method is presented in Table 2.
Table 2. Overall running time of different sampling methods.
As shown in Table 2, the running time of uniform sampling and stratified sampling is relatively long, while the time consumption of systematic sampling is the lowest.
Comparison with the numeric model
For comparative analysis, we show the numeric outputs of the neural network model in Figure 6. Here, 10% of the original data set is selected as the testing set for the experiments.

Figure 6. Comparison of the numeric model and the granular model.
The accuracy of the information granules and of the numeric outputs in predicting the original data is quantified as follows: 86.6% of the original data is included in the interval outputs, whereas only 68.7% of the predicted values obtained by the numeric model match the data in the testing set. This indicates that, in this case, the granular model performs better than the numeric model.
Experiments based on UCI machine learning data sets
In this part, we conduct experimental analysis based on four UCI data sets. Detailed information on each data set is given in Table 3.
Table 3. Features of the UCI data sets.
Boston housing data set
We report the curves of coverage in Figure 7. Increasing sampling sizes result in increasing values of coverage; the best coverage is obtained with the granular SVM model, and uniform sampling reaches the maximum value fastest.

Figure 7. Plots of coverage: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
The curves of the objective function are shown in Figure 8. For the granular KNN model, the optimal value is obtained when the sampling ratio is about 20%. For the granular SVM model, the optimal performance index is obtained when the sampling ratio is between 60% and 80%. For the granular LSTM model, the optimal target value is obtained when the sampling ratio is about 40%–60%. The best value of the objective function is obtained with the granular SVM model, and the objective function rises fastest under uniform sampling.

Figure 8. Plots of the objective function: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Therefore, for the Boston Housing data set, the best performance is obtained with the granular SVM model when using uniform sampling.
Iranian sprint data set
The coverage curves of the Iranian Sprint data set are shown in Figure 9. As the sampling ratio increases, the coverage increases gradually. The best coverage value is obtained with the granular LSTM model, and the fastest increase in coverage is obtained when stratified sampling is used.

Figure 9. Plots of coverage: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
The curves of the objective function are shown in Figure 10. For the three granular models, the best performance is obtained at sampling ratios of about 40%–60%, 50%–60%, and 15%–20%, respectively. A meaningful conclusion from the experimental results is that the best value of the objective function is obtained with the granular LSTM model. In addition, the fastest rise of the objective function occurs under systematic sampling.

Figure 10. Plots of the objective function: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
For the Iranian Sprint data set, stratified sampling and the granular LSTM model can therefore be chosen to obtain a granular model with excellent performance.
Clickstream data for online shopping data set
The coverage curves of the Clickstream Data for Online Shopping data set are shown in Figure 11. As the sample size increases, the coverage also increases, as shown by the trend of the coverage curves. The best coverage is obtained with the granular SVM model. In addition, coverage rises fastest when the subset is obtained through systematic sampling.

Figure 11. Plots of coverage: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Figure 12 displays the objective function curves of the Clickstream Data for Online Shopping data set. The "knee" of the granular KNN model lies at a sampling ratio between 40% and 60%, where the best performance is obtained. The knees of the granular SVM model and the granular LSTM model lie between 50% and 60% and between 15% and 20%, respectively. The best objective function value is obtained with the granular SVM model, and the best performance is reached first under systematic sampling.

Figure 12. Plots of the objective function: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Therefore, for the Clickstream Data for Online Shopping data set, a granular model with excellent performance can be obtained through systematic sampling and the granular SVM model.
Concrete compressive strength data set
The coverage curves of the Concrete Compressive Strength data set are shown in Figure 13. As the sampling size increases, the coverage gradually increases, as can be observed in the figure. The best coverage is obtained with the granular LSTM model. Furthermore, the maximum coverage is reached first under uniform sampling.

Figure 13. Plots of coverage: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
The curves of the objective function are shown in Figure 14. The optimal objective function value of the granular KNN model is obtained at a sampling ratio of 40%–60%. The size of the data subset required by the granular LSTM model is similar to that of the granular KNN model, while the granular SVM model requires the smallest subset, with a sampling ratio of about 40%–50%. Among the three, the best performance is obtained with the granular LSTM model, and the best objective function value is reached first under uniform sampling.

Figure 14. Plots of the objective function: (a) granular KNN model, (b) granular SVM model, and (c) granular LSTM model.
Considering the above experimental results, for the Concrete Compressive Strength data set, a granular model with excellent performance can be obtained through uniform sampling and the granular LSTM model.
Conclusions
In this paper, granular models are proposed and analyzed by selecting subsets with different sample sizes. Several regression models, including KNN, SVM, and LSTM, are utilized in the development of different granular models. The optimization process is carried out by considering two conflicting criteria: coverage and specificity. A synthetic dataset as well as four publicly available datasets from the UCI machine learning repository are used for the experimental studies. First, subsets with different sample sizes are obtained with several sampling methods. Second, a series of numeric models is built based on each subset. By using the principle of justifiable granularity, interval information granules are constructed from the outputs of the numeric models. Experiments show that for different data sets, the optimal sampling scale required under different sampling methods is not the same, and the granular model that achieves the best performance also differs. Therefore, the best granular performance can be obtained by using the best model and the best sampling method for each data set, while the required time consumption is effectively reduced.
In further studies, the proposed granular models are anticipated to be applied to the analysis of time series on the basis of information granules at different levels of abstraction. We are also interested in the segmentation of time series, in which a series is decomposed according to its similarity structure and information granules are subsequently constructed to ensure high efficiency of the model. To be more specific, the proposed granular models can be employed to predict the spread of malicious software 32 and COVID-19.33-35
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Shaanxi Natural Science Project Fund (Key Technology of Top-level Design for Intelligent Coal Mine Construction) under Grant 2019JLM-11 and in part by the Deanship of Scientific Research at King Saud University under Grant RG-1439-009.
