Abstract
In the financial anti-fraud field, negative samples are few and sparse, creating a serious class-imbalance problem. Generating negative samples consistent with the original data is a natural way to resolve this imbalance, but doing so is itself a difficult problem. This article proposes a new method to solve it. We introduce a new generation model that combines a generative adversarial network with a long short-term memory network to produce one-dimensional negative financial samples. The long short-term memory layer learns the feature associations between transaction sequences, and the generator learns to cover the real data distribution under the guidance of a time-aware adversarial discriminator. Mapping the data distribution into feature space is a common way to evaluate synthetic data; however, it ignores the relationships between the attributes of online transactions. We therefore define a comprehensive evaluation method that assesses the validity of generated samples from both data distribution and attribute characteristics. Experimental results on real bank B2B transaction data show that the proposed model achieves overall ratings about 10% higher than traditional generation models. Finally, the well-trained model is used to generate negative samples and form new datasets. Classification results on these datasets show that precision and recall are both higher than those of the baseline models. Our work has practical value and offers a new way to address imbalance problems in many fields.
Keywords
Introduction
With the rapid development of financial science and technology, online transactions have increased greatly. Online fraudulent transactions are also growing as new fraud methods arise, and more and more methods and algorithms for fraud detection have been proposed. Devi et al. 1 introduce a cost-sensitive weighted random forest algorithm to detect credit card fraud, and some scholars propose deep learning methods to detect abnormal transactions; CNNs 2 have been introduced into online transaction fraud detection with good results. However, in this scenario, class imbalance occurs: normal transactions, the majority class, contain far more samples than abnormal transactions, the minority class. Learning from such a dataset can be very difficult, especially when the transaction volume is large.
When an imbalance exists in the training data, learning algorithms tend toward the majority class and misclassify the minority class, because negative samples receive small weights during training. Furthermore, some evaluation metrics, such as accuracy and precision, yield high overall scores that mislead the analyst into believing performance is good, while the model actually performs poorly on test data, with low recall and F1-score for negative samples. This effect makes it very difficult to accurately predict one of these classes, so we must ensure that the model neither overfits nor underfits on either class. Therefore, methods that address the imbalance problem aim to balance the positive and negative samples.
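As a toy illustration of how accuracy misleads under imbalance (the class ratio and sample counts below are hypothetical, not from the paper's dataset), consider a degenerate classifier that always predicts the majority class: its accuracy looks excellent while its recall on fraud is zero.

```python
import numpy as np

# Hypothetical 99:1 imbalanced dataset; 1 = fraud (minority class).
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)        # predict "normal" for everything

accuracy = (y_pred == y_true).mean()      # 0.99, looks great
tp = np.sum((y_pred == 1) & (y_true == 1))
recall = tp / np.sum(y_true == 1)         # 0.0 -- every fraud case missed

print(accuracy, recall)
```

This is exactly the failure mode described above: a high overall score hides a model that never detects the minority class.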
Methods for the class imbalance problem can be divided into three types: data-level methods, algorithm-level methods, and hybrid techniques. Data-level methods adjust the training dataset distribution to reduce the level of class imbalance and are mainly divided into under-sampling and over-sampling. Algorithm-level methods change weights in the learning or decision process to increase the importance of the minority class; the representative approach is cost-sensitive learning. Finally, hybrid techniques combine both strategically: data sampling reduces the impact of category noise, while cost-sensitive learning reduces the model's bias toward the majority class.
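A minimal sketch of the over-sampling idea mentioned above (data sizes and feature dimensions here are hypothetical): the minority class is resampled with replacement until the two classes are balanced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 95 majority rows, 5 minority rows, 3 features.
X_maj = rng.normal(0.0, 1.0, size=(95, 3))
X_min = rng.normal(3.0, 1.0, size=(5, 3))

# Random over-sampling: draw minority rows with replacement
# until the minority class matches the majority class size.
idx = rng.integers(0, len(X_min), size=len(X_maj))
X_min_over = X_min[idx]

X_bal = np.vstack([X_maj, X_min_over])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_over))
print(X_bal.shape, y_bal.mean())  # balanced 50/50
```

Plain duplication like this balances the classes but adds no new information; that limitation is what motivates generating genuinely new negative samples with a GAN.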
The emergence of deep generation models provides inspiration for solving sample imbalance. The main deep generation models are the generative adversarial network (GAN) 3 and the variational autoencoder (VAE), 4 among others. Compared with VAE and other generation models, GAN is very flexible during training: it does not require many mathematical assumptions or approximate inference to capture the data distribution. The difficulty of model training is therefore greatly reduced, and GAN is very successful at generating complex data, including handwritten digits, faces, and CIFAR images. Moreover, GAN's adversarial loss yields higher-resolution generated images than VAE models.
In this article, we explore applying GAN to the online transaction imbalance problem by generating negative samples. To handle the time-series characteristics of transactions, we add a long short-term memory (LSTM) 5 layer to GAN, named New_GAN, to generate negative transaction samples. While the GAN learns to approximate the original data distribution, the LSTM network preserves the time-series characteristics of the generated data. At the same time, a sample consistency evaluation model is established for the generated one-dimensional transaction data. In the sample generation process, the generator captures the actual data distribution and creates samples for the negative class. The main contributions of this article are as follows:
A new generation model for online transaction samples is proposed, combining GANs with LSTM networks. The adversarial discriminator guides the generator to produce realistic time-series data by playing a min-max game;
To avoid uncontrollable and unrealistic generation, we update the objective function: a feature penalty between real and generated data is added to the generator's optimization goal during training;
We describe an evaluation metric that measures model performance comprehensively. Kernel MMD (maximum mean discrepancy) 6 and the correlation coefficient capture data distribution and sample attribute relevance, respectively.
The rest of the article is organized as follows. Related work of GAN theory and its application are discussed in the “Related work” section. In the “Model architecture” section, the model structure of negative sample data generation is proposed, and the data evaluation method is explained in the “Evaluation” section. After that, the performance of our proposed model is shown via extensive experiments in the “Experiment and evaluation” section. Finally, the “Conclusion” section concludes the article.
Related work
Our article focuses on modifying GAN to synthesize artificial data. In this section, we provide a brief review of GAN theory, GAN applications, and GAN evaluation metrics.
GAN 3 is proposed by Goodfellow in 2014 and has shown remarkable success as a framework for producing realistic-looking data. It consists of a generator network and a discriminator network; the two networks compete against each other, adjust parameters dynamically, and finally generate "realistic" data samples. There are many improvements of GAN. Wasserstein GAN (WGAN) 7 improves model performance through the objective function, largely solving the problem of unstable training and ensuring the diversity of generated samples. Radford et al. introduce CNN into GAN, improving the original GAN structure and training process and combining unsupervised with supervised learning; the model is named DCGAN. 8 Conditional GAN (CGAN) 9 addresses GAN's uncontrollable and unrealistic generation problem by conditioning the generator on additional information.
Synthesizing samples with GAN has produced good results in several fields. In e-commerce, Kumar et al. 10 propose a GAN for orders placed on e-commerce websites; once trained, the generator can produce any number of plausible orders, effectively helping managers understand the relationship between goods and customers. Tu et al. 11 apply GAN to semi-supervised learning to create a more data-efficient classifier: the GAN uses unlabeled data effectively to reduce overfitting in deep learning. Lou et al. 12 introduce supervised signals into a WGAN network for one-dimensional data augmentation, where the input is a latent sample obtained from an autoencoder, used to generate electronic device data. Esteban et al. 13 combine a recurrent neural network (RNN) model with GAN for time-series medical data and present novel evaluation methods in their paper.
In edge computing, Gu and Zhang 14 propose GANSlicing, a dynamic service-oriented software-defined mobile network slicing scheme; they use GAN to allocate resources promptly and flexibly to improve users' quality of experience. Liu 15 designs an efficient GAN computing system on a ReRAM neuromorphic engine that trains the framework online with an optimized backward computation, performing well compared with a traditional GPU accelerator. Mardani et al. 16 put forward a novel compressed sensing framework that uses GAN to train a low-dimensional manifold of diagnostic-quality images. GAN is thus widely used across fields and helps solve many problems.
How to evaluate GAN-generated results is a common problem. Generative moment matching networks (GMMN) 17 suggest directly minimizing the MMD distance to measure the quality of generated images. The Inception score 18 is the most common way to evaluate GAN; it reflects the diversity of generated samples but cannot measure how well the generator approximates the real distribution. The Fréchet inception distance (FID) 19 was recently proposed to compare distributions of Inception embeddings. Both the Inception score and FID rely on an Inception Net trained on ImageNet. The sliced Wasserstein distance (SWD) 20 is used to evaluate high-resolution GANs. Shmelkov et al. 21 add a new dimension to this problem with the performance-based measures GAN-train and GAN-test: GAN-train is the accuracy of a classifier trained on generated samples and tested on real images, while GAN-test is the accuracy of a classifier trained on the original dataset but tested on the generated set. It is suitable for conditional GANs.
This article proposes a new GAN model based on an LSTM network for financial negative samples. The model generates negative samples close to the original distribution while preserving their time-series attributes. For the generated data, a comprehensive evaluation method is introduced, and we test the generated data with a classifier model. Experiments show that our proposed model effectively improves online transaction fraud detection.
Model architecture
For dataset
The task is that we should generate dataset
The basic theory
Regular GAN
The idea of GAN is very concise and clear, which includes two adversarial models: a generative model

The structure of GAN.
To learn the generator distribution
In process of training, we train
The LSTM layer
LSTM is a kind of RNN. 5 By deliberately designing, LSTM can remember long-term information. There are three main phases in LSTM:
The forgetting stage, via the forget gate. The gate reads
Selecting memory. After receiving the information from the last neuron, the LSTM cell determines what information to put into the new neural cell. The
Then the neuron state is updated: multiply the old state by
The output phase. It determines what will be output as the current state. Through the
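The three phases above can be sketched as a single numpy LSTM step. This is a textbook-style cell for illustration, not the paper's exact implementation; the dimensions (8-dimensional input, 4 hidden units) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the forget (f),
    input (i), candidate (g), and output (o) gates."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # pre-activations, shape (4n,)
    f = sigmoid(z[0*n:1*n])          # forget gate: what to discard
    i = sigmoid(z[1*n:2*n])          # input gate: what to write
    g = np.tanh(z[2*n:3*n])          # candidate cell values
    o = sigmoid(z[3*n:4*n])          # output gate
    c = f * c_prev + i * g           # update the cell (neuron) state
    h = o * np.tanh(c)               # current hidden-state output
    return h, c

rng = np.random.default_rng(0)
d, n = 8, 4                          # 8-dim transaction, 4 hidden units
W = rng.normal(size=(4 * n, d)) * 0.1
U = rng.normal(size=(4 * n, n)) * 0.1
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):    # a toy sequence of 5 transactions
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

The cell state `c` is what carries long-term information across the transaction sequence; the gates decide at each step what to forget, what to store, and what to expose as output.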
New generation model
Model structure
Financial negative samples are found to have time-series characteristics: the occurrence of fraudulent transactions depends on time, and abnormal transactions can be found through time-series analysis. Although regular GAN can learn transaction characteristics and trading patterns, it cannot handle the time-series characteristics of trading. To make the new generation model capable of capturing the characteristics of transaction sequences as fully as possible, this article combines an RNN with regular GAN. Although an RNN can learn the feature associations between transaction sequences, its ability to learn the intrinsic features of a single transaction is similar to that of a traditional shallow neural network and cannot achieve the expected goal. An LSTM network, with its longer memory intervals, satisfies our needs. Thus, we combine an LSTM network with GAN to generate the required data. The network structure is shown in Figure 2.

New generation model structure.
In the proposed network structure, both the generator and the discriminator are based on an LSTM network and an MLP. The basic principle of the generation model is consistent with that of the original GAN model, that is, the min-max game between

The detailed structure of generator.

The detailed structure of discriminator.
Generation model
Figure 3 indicates that LSTM layer connects with the input layer and MLP layer. The generator first samples m noises
where
Moreover, after MLP layer, it maps hidden states
where
Discrimination model
In Figure 4, LSTM cells model the input features, map them to hidden states, and finally distinguish input data labeled 0 or 1 through neural networks.
For the input samples, we represent an input sequence
where
The optimization target is to minimize the objective function between the true label and the predicted probability. We adopt adversarial training of the generator and discriminator, and use the
The proposed algorithm is presented in Algorithm 1.
Update objective function
A persistent challenge in training GANs is mode collapse. For example, when training a GAN on the MNIST dataset, the trained GAN may generate only one of the 10 digits; in face image experiments, only one style of image may be generated. Arjovsky et al. 7 point out that the divergences which GANs typically minimize are not continuous and differentiable everywhere; when updating the generator's parameters, this leads to training difficulty. WGAN improves the traditional GAN by using the earth mover's distance, or Wasserstein distance. The definition of the Wasserstein distance is
where
The supremum is taken over the set of 1-Lipschitz functions
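As a rough illustration of the earth mover's distance that WGAN minimizes, the one-dimensional empirical case reduces to comparing sorted samples. This is a simplification for intuition only: WGAN itself estimates the distance through the dual form above, with a 1-Lipschitz critic network, and the distributions below are hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D earth mover's distance between two equal-size
    samples: the mean absolute difference of the sorted values."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)
close = rng.normal(0.1, 1.0, 1000)   # distribution close to the real one
far = rng.normal(3.0, 1.0, 1000)     # distribution far from the real one
print(wasserstein_1d(real, close) < wasserstein_1d(real, far))  # True
```

Unlike the Jensen-Shannon divergence of the original GAN, this distance shrinks smoothly as the generated distribution slides toward the real one, which is why it gives the generator useful gradients even when the two distributions barely overlap.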
We use LSTM to capture useful local features and generate samples. The optimization target can be described as
After adding the feature matching penalty into generator, the optimized object of generator is to minimize
As for the discriminator, it should achieve high confidence and
The proposed algorithm is presented in Algorithm 2.
Evaluation
In the "New generation model" section, the target optimization function of the proposed GAN model is a log-likelihood function. However, log-likelihood estimation is difficult to process and cannot by itself determine whether our model is well trained. At present, evaluation methods for GAN are example-based: they extract features from generated and real samples and then measure distance in feature space. 24 Evaluating only from the perspective of data distribution is not comprehensive, as it ignores the attribute characteristics of one-dimensional data. This article combines data distribution and data correlations to evaluate the generated samples comprehensively. Finally, we test our generated samples together with the raw dataset on a binary classifier; the evaluation indices indicate the quality of the generated data.
Data distribution
After training, we assume that a successful GAN network has learned the true sample data distribution, so the data distributions mapped onto feature space should also be the same. We use the Kernel MMD 6 method to calculate the data feature distribution.
MMD is maximum mean discrepancy. Based on the samples of the two distributions
Suppose that the original dataset is
MMD launched as
Expanding the formula, the form of
Since Gaussian kernel can be mapped to infinite dimensional space, and one-dimensional transaction data often has 10 to dozens of data attributes, the Gaussian kernel function
which means that the smaller the
Therefore, the objective function of the distribution measure can be written as
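The Kernel MMD computation described above can be sketched in numpy with a Gaussian kernel. The bandwidth, dimensionality, and sample sizes here are illustrative, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel: k(x, y) = exp(-||x-y||^2 / (2 sigma^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (200, 8))        # 8-dim "transactions"
good_gen = rng.normal(0, 1, (200, 8))    # matches the real distribution
bad_gen = rng.normal(2, 1, (200, 8))     # shifted distribution
print(mmd2(real, good_gen) < mmd2(real, bad_gen))  # True
```

A well-trained generator should drive this value toward zero, which is what the per-epoch MMD curve in the experiments tracks.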
Feature correlations
Unlike pictures, voice, and text data, one-dimensional transactions exhibit strong correlations between different attributes. For example, fraudulent transactions may occur frequently in East China around 10:00 am, meaning that trading time and trading location are related. Therefore, we evaluate whether the generated data attributes have the same correlations as the original data attributes.
Each transaction data can be expressed as
Mean
The correlation coefficient of attributes
Similarly, for generated sample data
If the attributes
The correlation coefficient matrix is symmetric. Obviously, the smaller the
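The attribute-correlation comparison can be sketched as follows, using the Frobenius norm of the difference between the two correlation matrices; the exact aggregation in the paper may differ, and the data below are synthetic.

```python
import numpy as np

def corr_distance(real, gen):
    """Frobenius-norm distance between the attribute correlation matrices
    of real and generated samples (rows = transactions, columns = attributes)."""
    return np.linalg.norm(np.corrcoef(real, rowvar=False)
                          - np.corrcoef(gen, rowvar=False))

rng = np.random.default_rng(0)
# Real data with a built-in attribute dependence: v2 follows v1
# (e.g. transaction amount correlated with trading time).
v1 = rng.normal(size=1000)
real = np.column_stack([v1, v1 + 0.1 * rng.normal(size=1000)])

good = np.column_stack([v1, v1 + 0.1 * rng.normal(size=1000)])  # keeps the correlation
bad = rng.normal(size=(1000, 2))                                # independent attributes
print(corr_distance(real, good) < corr_distance(real, bad))     # True
```

A generator that matches each marginal distribution but ignores the dependence between attributes would still be penalized by this term, which is exactly the gap the paper's evaluation is designed to close.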
Comprehensive evaluation
According to two evaluation methods mentioned above, final evaluation function
Among them,
When using the comprehensive evaluation indicator, multiple generation models are compared by the calculated results. The smaller a model's indicator, the better its generation effect and the closer the generated data to the original distribution.
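Assuming the equal weights used later in the experiments (0.5 each), the comprehensive indicator reduces to a weighted sum of the distribution term and the attribute-correlation term; the numeric values below are hypothetical.

```python
def comprehensive_score(mmd_value, corr_value, w1=0.5, w2=0.5):
    """Weighted sum of the distribution term (MMD) and the attribute
    correlation term. Lower scores mean generated data closer to the
    original; both weights are 0.5 as in the experiments."""
    return w1 * mmd_value + w2 * corr_value

print(comprehensive_score(0.002, 0.15))   # ~0.076
print(comprehensive_score(0.010, 0.40))   # ~0.205 -> worse generator
```
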
Experiment and evaluation
Experiment setup
Dataset description
The experimental data come from real online transaction data of a major domestic bank. The raw dataset contains 3 months of bank B2C transaction records, about 3 million transactions, of which about 100,000 are negative (fraudulent) transactions. Data characteristics such as transaction time and transaction amount are found to reflect users' transaction behavior. We select eight-dimensional features as the input sample data. As the data cannot be made public, the selected dimensions cannot be described in detail. To ensure data consistency and availability, the data are processed routinely, including data cleaning, data conversion, and data reduction.
Parameter settings
For the discriminator and generator, we use LSTM cells and multi-layer fully connected networks. There are two LSTM layers, which is more conducive to remembering information, and two fully connected layers. Since the original data to be generated are eight-dimensional, the output layer of the generator has eight nodes, as does the discriminator's input. The final layer of the discriminator does not contain any non-linear activation function. We use backpropagation to track learning conditions and adjust network parameters. Following the proposed method, we train the generation model for 100 epochs, saving one version of the dataset every 10 epochs and recording the MMD value every epoch.
Model training
Training with MMD
To track how well the generator approaches the data distribution, the model records the MMD value after each epoch during training, as shown in Figure 5.

MMD value with every epoch.
Figure 5 shows that as the training epochs increase, the MMD value between generated and real samples remains smooth and stays below 0.002.
The generated sample distribution is very close to the real sample distribution; the generator can produce "real" data. It is encouraging to observe that the likelihood of the generated samples improves with training.
Training result
After training, we record the visual results of each run. Figures 6–8 are two-dimensional visualizations of attributes v1 and v2 after training, where the y-axis is v2 and the x-axis is v1. Figures 9–11 are two-dimensional visualizations of attributes v1 and v3 after training, where the y-axis is v3 and the x-axis is v1. The training results are shown in the figures below.

V2 epoch = 10.

V2 epoch = 50.

V2 epoch = 100.

V3 epoch = 10.

V3 epoch = 50.

V3 epoch = 100.
The left side of each picture shows the distribution of the real input data, and the right side shows the generated data. The figures show that as the number of training epochs increases, the generated data get closer to the real data distribution, indicating that our model is well trained: the generator has captured the raw data distribution.
Model verification
Comprehensive evaluation
The weights of the two criteria are each set to 0.5. For the trained models, different numbers of samples are generated, and the comprehensive evaluation score is calculated. The baseline models are regular GAN 3 and VAE. 4 We train the proposed and baseline models; once they are well trained, we generate different test subsets and calculate each dataset's comprehensive score. The scores are shown in Table 1.
Comprehensive evaluation score.
GAN: generative adversarial network; VAE: variational autoencoder.
Bold numbers indicate the optimal experimental results for each dataset.
The table shows that the values of our two models under the comprehensive evaluation are all below 0.2, meaning the generated data are close to the original data distribution. On D_s3, the value is below 0.1 and the quality of the generated data is excellent. We then introduce the Wasserstein distance and penalty into New_GAN2. Its score is lower than New_GAN1's on D_s4 and D_s5: compared with New_GAN1, the score on D_s4 decreased by 10% and on D_s5 by about 20%. New_GAN2 can more closely approximate the real data, and its generated data are more controllable and realistic.
To further validate the model, we use a classification model for verification; the experimental results are described in the next section.
GAN classify
We test our model with a classification model. The generated data are added to the raw dataset to constitute a new dataset, and the quality of the generated data is tested by a binary classifier on multiple synthetic datasets. The basic evaluation indices are accuracy, precision, and recall. The confusion matrix is shown in Table 2.
The confusion matrix.
Accuracy is the ratio of samples correctly classified by the classifier to the total number of samples:
Precision is the ratio of positive samples classified correctly to the number of samples the classifier determines to be positive:
Recall is the ratio of correctly classified positive samples to the number of truly positive samples:
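The three metrics can be computed directly from the confusion matrix counts; the labels below are a small hypothetical example, not the paper's data.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the binary confusion matrix
    (positive class = 1, i.e. fraudulent transactions)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    accuracy = float((tp + tn) / len(y_true))
    precision = float(tp / (tp + fp))
    recall = float(tp / (tp + fn))
    return accuracy, precision, recall

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(classification_metrics(y_true, y_pred))  # (0.8, 0.75, 0.75)
```

On imbalanced fraud data, precision and recall on the positive class are the informative quantities, which is why the comparison below reports all three rather than accuracy alone.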
We compare the classification performance of models trained with GAN-generated data1, VAE-generated data2, New_GAN1-generated data3, and New_GAN2-generated data4. The results are presented in Figures 12–14, which compare the performance achieved by the classifier.

The classify performances of models on Dataset1.

The classify performances of models on Dataset2.

The classify performances of models on Dataset3.
As figure space is limited, we only list the data labels for the largest and smallest datasets. The figures show that the generated data improve the original model's classification performance. Compared with the baseline models GAN and VAE, our generation models have a clear advantage in augmenting negative transaction data, and our model achieves the best detection on all three test subsets. The classification accuracy of the datasets generated by the New_GAN1 and New_GAN2 models is above 95%, precision is above 70%, and recall is about 83%. Compared with VAE and the original GAN, the two proposed models improve classification results on the three datasets by about 5%, precision by about 8%, and recall by up to 10%; accuracy and recall of the New_GAN models increase by 5% on average. However, compared with New_GAN1, New_GAN2 does not significantly improve the evaluation metrics on the three datasets, which is partly due to the upper limit of the classification model itself.
Compared with the two baseline models, the generative model with LSTM networks proposed in this article is better at enhancing data classification results. Our model is superior to the existing GAN and VAE models at generating realistic samples, and the generation model New_GAN2, based on the Wasserstein distance and penalty, is better than New_GAN1.
Conclusion
Our work shows that the combination of GAN and LSTM networks achieves excellent performance. The GAN is suited to quickly capturing the characteristics of the data distribution, while the LSTM enables generating data with time-series structure without distorting the learned distribution. Our work provides a novel method to generate "realistic" financial transaction data and thereby naturally address the imbalance problem. For the generated one-dimensional sample data, we evaluate quality from three aspects: data distribution, data attributes, and classification effect.
Although our generation model achieves improved results, we encountered many difficulties during training. From the model perspective, the generator and discriminator need to be carefully balanced, and the generator easily falls into local optima, resulting in single-mode samples with insufficient diversity. In the future, we will improve the model from the perspectives of the loss function, gradient penalty, and so on. In addition, our GAN can generate data similar to real data but cannot generate potential fraud patterns that have not yet occurred in the data's latent space; the diversity of generated samples is insufficient, and it is difficult to cope well with unseen cases. From this perspective, we will work to better solve the negative sample problem in financial transactions.
Footnotes
Handling Editor: Liran Ma
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Natural Science Foundation of Shanghai (No. 19ZR1401900), Shanghai Science and Technology Innovation Action Plan Project (No. 19511101802), and National Natural Science Foundation of China (No. 61472004, 61602109).
