Abstract

Extracting more information from feature interactions is essential to improve click-through rate (CTR) prediction accuracy. Although deep learning technology can help capture high-order feature interactions, the combination of features lacks interpretability. In this paper, we propose a multi-semantic feature interaction learning network (MeFiNet), which utilizes convolution operations to map feature interactions to multi-semantic spaces to improve their expressive ability and uses an improved Squeeze & Excitation method based on SENet to learn the importance of these interactions in different semantic spaces. The Squeeze operation helps to obtain the global importance distribution of semantic spaces, and the Excitation operation helps to dynamically re-assign the weights of semantic features so that both semantic diversity and feature diversity are considered in the model. The generated multi-semantic feature interactions are concatenated with the original feature embeddings and input into a deep learning network. Experiments on three public datasets demonstrate the effectiveness of the proposed model. Compared with state-of-the-art methods, the model achieves excellent performance ($+$0.18% in AUC and $-$0.34% in LogLoss vs. DeepFM; $+$0.19% in AUC and $-$0.33% in LogLoss vs. FiBiNet).

Keywords

Click-through rate prediction, multi-semantic, feature interaction, convolution, squeeze-excitation network

1. Introduction

The click-through rate (CTR) prediction problem is currently receiving significant attention from academia and industry in the fields of computational advertising and recommender systems [1, 2, 3, 4, 5, 6]. CTR prediction estimates whether a user will click on a given advertisement. It is a typical binary classification problem. In actual scenarios, a slight increase in the accuracy of CTR prediction can bring considerable benefits to related businesses such as advertising ranking [1], advertising bidding [7], and search engines [8].

The accuracy of CTR prediction depends not only on the model structure and algorithm but also on the input data [2, 3, 8, 9]. Recent research shows that it is crucial to consider the interactions between features [6, 10, 11, 12, 13]. Therefore, designing novel and efficient feature interaction methods has become essential for model improvement [6, 12, 13, 14, 15, 16, 17, 18]. Traditional CTR prediction methods such as Logistic Regression (LR) [19] and Factorization Machine (FM) [10] are good at processing original features or low-order interactive features [4, 10, 19, 20], and deep learning methods such as DeepCrossing [11] and DeepFM [21] are better at processing high-order interaction features [5, 11, 21, 22, 23]. At present, the widely used attention mechanism helps models learn the importance of different features, thereby further improving the performance of models [24, 25, 26, 27, 28, 29].

These existing models usually map all features to a shared space, which leads to only a shallow representation of the features. In actual scenarios, a feature can express multiple meanings, and the interaction between different features often contains high-level semantic concepts such as sentiment and intention. The above phenomena can be regarded as a feature-level semantic gap [30, 31], which will be weakened in this paper by learning the multi-semantic relations during feature interactions. Figure 1 shows an example of multi-semantic feature interaction, where the user is an athlete (occupation: athletes), the item is a pair of shoes (type: shoes), and the banner is on the home page (banner-position: home page). Interaction A (occupation, type) and Interaction B (banner-position, type) are two types of interaction. The left part of Fig. 1 shows the traditional methods, i.e., Interaction A and Interaction B are embedded in the same semantic space (Space: x0y). The right side of Fig. 1 shows the method used in this paper, i.e., Interaction A and Interaction B are embedded in two semantic spaces (Space: X0Y) and (Space: X0Z). (Space: X0Y) reflects the preferences of different professional users for items, i.e., the influence of user preferences on advertising clicks, so Interaction A is more important than B. Different banner positions represent the investment of the advertiser, so (Space: X0Z) learns the impact of advertisers on advertising clicks, which is more important for Interaction B.

Figure 1.

An example of multi-semantic feature interaction.

If features are embedded in a single space, all interaction relationships are restricted to the same semantic space. However, if we embed features into different spaces before feature interaction, the number of embedding parameters will be doubled or more, and under normal circumstances the embedding parameters already account for most of the total model parameters. Therefore, in this paper, we choose to achieve semantic diversity after feature embedding. Our approach differs from explicitly combining feature embedding vectors: it automatically selects and emphasizes meaningful interaction relationships.

Based on the above core ideas, we aim to learn the impact of various semantic spaces to attenuate the semantic gap in feature interaction. We will utilize convolution operations to achieve feature interaction in various semantic spaces. Although this method adds some convolution kernel parameters to the network, it avoids a large scale of feature embedding parameters. Semantic diversity expands the interaction space from one dimension to multiple dimensions, so it is necessary to use the attention mechanism to identify important semantics and features. We use and make adjustments on Squeeze-and-Excitation Networks (SENet) to obtain global semantic information and dynamically implement weight distribution, achieving semantic diversity and feature diversity. Finally, in order not to miss any useful information, the original features and multi-semantic interactive features are input into a deep neural network (DNN) to capture high-order feature interactions effectively. To summarize, the major contributions of this paper are listed as follows:

(1) Inspired by the new perspective of semantic diversity in CTR prediction, we propose a multi-semantic convolution-based feature interaction learning network (MeFiNet), which learns the interaction of the same pair of features under different semantics to weaken the feature-level semantic gap, thus helping to mine richer and more diverse value information in feature interactions.

(2) For further performance gains, we propose an improved Squeeze & Excitation method based on SENet to learn the importance of different feature interactions in different semantic spaces. The method enhances the representation of the model in terms of semantics and features.

(3) The proposed model is evaluated on three benchmark datasets. It consistently outperforms state-of-the-art deep models on MovieLens-1M, Criteo, and Avazu datasets. Further experiments also show that the improved Squeeze & Excitation attention method, multi-semantic interaction module, and high-order interaction module all help improve the performance of the model.

The rest of this paper is organized as follows. We begin by reviewing some work relevant to our proposed model in Section 2. Then in Section 3, we introduce our model presented in this paper. Experiment results and analysis are shown in Section 4. Finally, we conclude the paper in Section 5.

2. Related work

2.1 CTR prediction models

The LR model is the earliest and most classic CTR prediction model [19]. It has a simple structure and is easy to implement but unsuitable for processing nonlinear characteristics. The FM [10] model uses the idea of latent vectors proposed in the matrix factorization model to expand each feature into a k-dimensional latent vector for feature interaction, which makes it perform better than the generalized linear model and other advanced models at the time. Therefore, many FM-based variant models, such as FFM [4] and iFFM [20], are proposed. With the successful application of deep learning in computer vision and natural language processing, many CTR models based on deep learning have been proposed in recent years to model high-order feature interactions. FNN [22] uses the FM model to pre-train the embedding vectors and input them into a multi-layer perceptron (MLP) system. DeepCrossing [11] uses the residual network structure instead of the commonly used MLP to combine features automatically. Wide&Deep (WDL) [23] is the first deep model that uses a parallel structure to train the wide and deep parts jointly to increase memory and generalization capabilities. However, the wide part’s original features and feature combinations still need to be manually designed. DeepFM [21] is different from WDL. It uses FM as a wide part to learn low-order interactions and applies end-to-end model training. However, relying on DNNs to build complex models and learn high-order interaction features is insufficient. It is recommended to implement an explicit design of learning interactive features in a deep architecture [9]. Experts also believe that the data determines the upper limit of machine learning, and the algorithm will only approach this upper limit as much as possible.

2.2 Feature engineering and representation learning

There are usually two ways to process data: feature engineering and representation learning. The former refers to the manual data processing by experienced experts to obtain satisfactory or interactive features. Commonly used methods include feature scaling, decomposition, and aggregation [32]. The latter refers to the automatic learning of useful features through models to obtain better feature representation [1, 8], such as the embedding technology and the feature combination, usually used in CTR models based on deep learning. For instance, PNN [14] models the interaction relationship of features through vectors’ inner product (IPNN) or outer product (OPNN). NFM [12] introduces the Bi-Interaction pooling operation to learn 2-order interactions, significantly facilitating the modeling of higher-order and non-linear feature interactions in the deep fully connected layers. Deep Cross Network (DCN) [6] uses the cross network to learn bounded-degree feature combinations explicitly. It generates high-degree interaction at each layer while retaining the interaction information of the previous layer. DCN V2 [13] expands the weight parameters from the original vector to the matrix and leverages low-rank techniques to improve the ability to model cross features. In addition to these interactive methods, there are also some mature models that implement feature representation learning, such as convolutional neural networks [15, 16, 17], graph neural networks [18], and knowledge graphs [33].

Features imply different semantics in different scenarios. The concept of semantics appeared in natural language processing in the early days and is currently being used more and more in other research fields such as image annotation and recommender systems. IA-MSL [34] uses images annotated in two or more related semantic spaces to train the model, which overcomes the independence of mandatory conditions caused by a single space and the problem of ignoring spatial correlation. CCPM [15] introduces the convolutional neural network (CNN) to model important semantic features. CCPM, FGCNN [16], and MeFiNet are all based on CNN for feature interactions. The difference is that CCPM simply repeats the CNN operations, FGCNN uses a fully connected layer to reorganize the vectors to gain new features based on the CNN output, while MeFiNet extends the CNN channels to multi-semantic spaces and focuses on the importance of semantic spaces and features. TFNet [35] introduces multiple tensors to implement multi-semantic spaces, which brings more additional parameters than CNN in MeFiNet.

2.3 Attention mechanism

The attention mechanism can help models distinguish the importance of features and is widely used in machine translation [36, 37], speech recognition [38, 39], and other fields. AFM [24] is the first CTR prediction model to try the attention mechanism. It calculates the attention score of interaction features. The different attention mechanisms have also been used in several state-of-the-art CTR prediction models. DIN [25] uses an attention network to capture user interests from historical behaviors. DHAN [26] exploits a hierarchical attention network to gain different importance in different dimensions. AutoInt [27] leverages a multi-head self-attention network to learn feature interactions explicitly. SENet [28] has achieved remarkable results in computer vision and is gradually paid attention to by researchers in CTR [29]. It obtains the global spatial information through the Squeeze operation and then uses the Excitation operation to capture the interdependencies between channels. Both FiBiNet [29] and MeFiNet learn the importance through SENet. However, FiBiNet only leverages SENet to learn the importance of features, while our proposed MeFiNet improves SENet to focus on the importance of each semantic and feature.

3. Proposed model

Figure 2.

The architecture of our proposed CTR prediction model MeFiNet.

As illustrated in Fig. 2, the proposed model MeFiNet consists of three parts: 1) Feature embedding module (FEM), 2) Feature interaction module (FIM), and 3) Feature fusion module (FFM).

3.1 Feature embedding module

The inputs of the CTR prediction model usually contain user information, product information, and context information. They are mostly sparse features and encoded as one-hot vectors. However, these vectors tend to be of huge dimension and highly sparse. Through embedding technologies, they are transformed from a high-dimensional sparse classification space to a low-dimensional dense continuous space. The output of the embedding layer can be expressed as:

$\displaystyle E=[e_{1};e_{2};\ldots;e_{m}],$ (1)

where $E\in R^{m\times d}$ is the set of sparse features, $m$ denotes the number of sparse features, $e_{i}\in R^{d}$ is the embedding vector of the $i^{\text{th}}$ feature, and $d$ is the embedding dimension of $e_{i}$ . In addition, we use the sum or mean operation to handle sparse multi-value features. We classify most of the dense features as sparse features, and a small number of features are normalized with Max-Min.
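As a minimal illustration of Eq. (1), the embedding lookup can be sketched with NumPy. The field count, vocabulary sizes, and random initialisation below are hypothetical values chosen for the example, not settings taken from the paper; in a real model the tables are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: m sparse fields, embedding dimension d, one vocabulary per field.
m, d = 4, 8
vocab_sizes = [10, 5, 7, 3]

# One embedding table per field (random here; learned during training in practice).
tables = [rng.normal(scale=0.01, size=(v, d)) for v in vocab_sizes]

def embed(sample):
    """Look up one row per field and stack the rows into E with shape (m, d)."""
    return np.stack([tables[i][idx] for i, idx in enumerate(sample)])

E = embed([3, 0, 6, 2])  # one category index per field
print(E.shape)  # (4, 8)
```

Looking up row `idx` of a table is equivalent to multiplying the one-hot vector of that category by the table, which is why the lookup replaces the high-dimensional sparse input.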

To enrich embedding vectors and avoid inconsistency in the gradient direction when updating parameters [16, 40, 41], we introduce another embedding matrix $E^{\prime}$ to implement the feature interaction module (FIM). Table 1 describes the notations used in the paper.

Table 1

Notation information

Notation	Explanation
$e_{i}$	The embedding vector of the $i^{\text{th}}$ feature
$m, d$	Number of sparse features, embedding dimension
$E,E^{\prime}$	The embedding matrix for FFM and FIM
$X_{e^{\prime}},U,F$	Re-input, convolution kernel set, output of MFI
$u_{c},f_{c}$	Kernel, interactive features in the $c^{\text{th}}$ semantic space
$h,h^{\prime}$	Number of features in $X_{e^{\prime}}$ and $f_{c}$
$c$	Number of semantic spaces
$d^{\prime}$	New embedding dimension of interactive features
$a_{(c,i)}$	Statistic of the $c^{\text{th}}$ space of the $i^{\text{th}}$ interactive feature
$\hat{W_{1}},\hat{W_{2}}$	Weight for the dimensionality reduction and restoration
$b_{i}$	Weights of all semantic spaces of the $i^{th}$ interactive feature
$A,B,S,X_{s}$	Outputs of Squeeze, Excitation, Re-Weight and SE
$X_{r}$	Features composed of $E$ and dense features
$\mathbb{W}^{(l)},\mathbbm{{b}}^{(l)}$	Weight and bias of the $l^{\text{th}}$ layer in FFM
$\mathbbm{a}^{(l-1)},\mathbbm{a}^{(l)}$	Input and output of the $l^{\text{th}}$ layer in FFM

3.2 Feature interaction module

The feature interaction module (FIM) consists of two sub-modules, multi-semantic feature interaction (MFI) and Squeeze & Excitation based on SENet (SE).

3.2.1 Multi-semantic feature interaction

For most models, no matter which interaction method (e.g., inner product, outer product, or Hadamard product) is used, they only consider those interactions in the same semantic space and ignore the diversity of different semantic spaces. In this section, we gradually implement the construction process of multi-semantic spaces.

The concept of ‘channel’ in CNN is used to realize the diversity of semantic space. It introduces specific hidden layers to enhance feature extraction capabilities and reduce network parameters. We extract local features through small convolution kernels. One convolution kernel is used to extract one feature pattern, and multiple convolution kernels can extract multiple different feature patterns. However, multi-order interactions based on convolution operations bring more complexity to the model. Besides, convolution operations extract only neighbor patterns, which loses most interaction information. After careful consideration, we only use 2-order feature interactions and take a special measure: before inputting the data into the convolutional layer, we rearrange them by sorting and copying the rows of the embedding matrix $E^{\prime}$ so that the two features of every pair are adjacent. Finally, the new input of the convolutional layer is $X_{e^{\prime}}=[e^{\prime}_{1},e^{\prime}_{2},e^{\prime}_{1},e^{\prime}_{3},\ldots,e^{\prime}_{m-1},e^{\prime}_{m}]$, where $X_{e^{\prime}}\in R^{h\times d}$ and $h$ is the new total number of features (for convenience, we set $h=m(m-1)$).
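The rearrangement of $E^{\prime}$ into $X_{e^{\prime}}$ can be sketched as follows. This is an illustrative reading of the description above, assuming that every unordered pair $(i, j)$ with $i<j$ is emitted in lexicographic order; the function name `build_pair_input` is hypothetical.

```python
from itertools import combinations

import numpy as np

def build_pair_input(E_prime):
    """Stack every unordered feature pair (i < j) as two adjacent rows.

    E_prime: array of shape (m, d).
    Returns an array of shape (m * (m - 1), d), i.e. two rows per pair.
    """
    m = E_prime.shape[0]
    rows = []
    for i, j in combinations(range(m), 2):
        rows.append(E_prime[i])
        rows.append(E_prime[j])
    return np.stack(rows)

E_prime = np.arange(12, dtype=float).reshape(4, 3)  # toy input: m = 4, d = 3
X = build_pair_input(E_prime)
print(X.shape)  # (12, 3)
```

With $m=4$ there are $m(m-1)/2=6$ pairs, so the stacked input has $h=m(m-1)=12$ rows, which matches the definition of $h$ in the text.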

Figure 3.

The process of multi-semantic feature interaction.

First, we expand $X_{e^{\prime}}$ to a 3D tensor and then slide the convolution kernel $u$ ($R^{\mathbbm{h}\times\mathbbm{w}\times 1}$, $\mathbbm{h}=2$, $1\leqslant\mathbbm{w}\leqslant d$) over it for feature interaction. In this process, the order and span of the interaction are determined by the height $\mathbbm{h}$ and width $\mathbbm{w}$ of the convolution kernel, and the number of convolution kernels $c$ determines the number of semantic spaces for feature interaction. Compared with the tensor operation, our interaction parameters are reduced by $d/2\sim d^{2}/2$. As shown in Fig. 3, the input $X_{e^{\prime}}$ is mapped to the multi-semantic interactive features $F$ through the convolution operation, where $X_{e^{\prime}}\in R^{h\times d\times 1}$, $F\in R^{h^{\prime}\times d^{\prime}\times c}$, and $h=2h^{\prime}$. Let $U=[u_{1},u_{2},\ldots,u_{c}]$ denote the set of kernel parameters of the filters; then the output $F=[f_{1},f_{2},\ldots,f_{c}]$ is expressed as:

$\displaystyle f_{c}=\textit{Conv}(u_{c},X_{e^{\prime}}),$ (2)

where $\textit{Conv}()$ means the convolution operation, and $u_{c}$ is the $c^{\text{th}}$ convolution kernel. $f_{c}=[f_{c_{1}},f_{c_{2}},\ldots,f_{c_{h^{\prime}}}]$ represents the set of interactive features in the $c^{\text{th}}$ semantic space, and $h^{\prime}$ represents the new number of features in each semantic space. $f_{c_{i}}\in R^{d^{\prime}}(1\leqslant i\leqslant h^{\prime})$ is the $i^{\text{th}}$ interactive feature in the $c^{\text{th}}$ semantic space and $d^{\prime}$ is the new embedding dimension of the feature.
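A small NumPy sketch of Eq. (2) may help make the shapes concrete. It rests on assumptions that the text does not state explicitly: a vertical stride of 2 (so each window covers exactly one feature pair, giving $h=2h^{\prime}$), a horizontal stride of 1 with no padding (so $d^{\prime}=d-\mathbbm{w}+1$), and cross-correlation as used by deep learning frameworks. The function name is hypothetical.

```python
import numpy as np

def multi_semantic_conv(X, U):
    """Apply c kernels of height 2 and width w to the stacked pair input.

    X: (h, d) stacked pairs, h even.  U: (c, 2, w) kernels.
    Returns F with shape (h // 2, d - w + 1, c).
    Assumed: vertical stride 2, horizontal stride 1, no padding.
    """
    h, d = X.shape
    c, kh, w = U.shape
    assert kh == 2 and h % 2 == 0
    h_out, d_out = h // 2, d - w + 1
    F = np.zeros((h_out, d_out, c))
    for p in range(h_out):
        pair = X[2 * p:2 * p + 2]            # the two embeddings of pair p, shape (2, d)
        for j in range(d_out):
            window = pair[:, j:j + w]        # local patch, shape (2, w)
            # One response per kernel, i.e. per semantic space.
            F[p, j, :] = np.tensordot(U, window, axes=([1, 2], [0, 1]))
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))    # h = 12 rows (6 pairs), d = 8
U = rng.normal(size=(3, 2, 3))  # c = 3 semantic spaces, kernel width w = 3
F = multi_semantic_conv(X, U)
print(F.shape)  # (6, 6, 3)
```

Each of the `c` kernels sees the same feature pair but produces its own response, which is how the channel axis plays the role of separate semantic spaces.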

3.2.2 Squeeze & Excitation based on SENet

Research on the attention mechanism shows that different features have different importance to the target variable. Similarly, in a multi-semantic scenario, features in different semantic spaces differ in their importance to the target variable. Specifically, due to the diversity of semantic spaces, different semantic spaces have different influences on features. In addition, it is not sufficient to assign the same weight to different interactive features even in the same space. As shown in Fig. 1, Interactions A and B emphasize the user and advertiser semantic spaces, respectively. Therefore, this paper considers that different interactive features in different semantic spaces affect the target differently, which involves both semantic diversity and feature diversity.

In the field of computer vision, SENet [28] learns the importance of different channels and uses global information to recalibrate features. Inspired by SENet, we transfer channel attention to semantic spatial attention. However, SENet focuses on semantic diversity and ignores different features in the same space. To overcome its limitation, we propose an improved Squeeze & Excitation method based on SENet, which extends the original channel attention to multiple semantic attentions with different features.

Squeeze

In multi-semantic interaction, each of the learned filters operates with a local receptive field, and consequently, every semantic space $f_{c}$ is independent of others. To explore the relevance between semantic spaces, we use Squeeze to calculate the global semantic information (semantic descriptor) of all interactive features. The semantic descriptor represents the global distribution of features in the semantic space, achieved by averaging or summing all the information in each semantic space of features. This process is expressed as:

$\displaystyle a_{(c,i)}=F_{sq}(f_{c_{i}})=\frac{1}{d^{\prime}}\sum\nolimits_{j=1}^{d^{\prime}}{f_{c_{i}}}^{j},$ (3)

where $a_{(c,i)}$ is a scalar value and represents the semantic-wise statistics of the $i^{\text{th}}$ feature in the $c^{\text{th}}$ semantic space. So the statistic vector $a_{i}$ represents the semantic descriptor of the $i^{\text{th}}$ interactive feature aggregating all semantic spaces, where $a_{i}=[a_{(1,i)},a_{(2,i)},\ldots,a_{(c,i)}]$ . And the global semantic descriptor is represented by $A=[a_{1},a_{2},\ldots,a_{h^{\prime}}]\in R^{h^{\prime}\times c}$ .
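Eq. (3) amounts to averaging over the embedding axis. A minimal sketch, assuming $F$ is stored with shape $(h^{\prime}, d^{\prime}, c)$ (the storage layout is an assumption of this example):

```python
import numpy as np

def squeeze(F):
    """Mean over the embedding axis: F (h', d', c) -> A (h', c)."""
    return F.mean(axis=1)

F = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy tensor: h' = 2, d' = 3, c = 4
A = squeeze(F)
print(A.shape)  # (2, 4)
```

Row `A[i]` is the semantic descriptor $a_{i}$ of the $i^{\text{th}}$ interactive feature, with one statistic per semantic space.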

Figure 4.

The improved Squeeze & Excitation method.

Excitation

This step is used to learn the weight of each semantic space based on the semantic descriptor. Two fully connected (FC) layers are generally used as a simple self-gating mechanism to achieve Excitation for a one-dimensional global descriptor. In this paper, the multi-semantic global descriptor with different features obtained by Squeeze is two-dimensional, as shown in Fig. 4. To reduce the computational complexity and achieve the same effect, we leverage two matrices to learn the weights. The first learned weight matrix $\hat{W_{1}}$ performs dimensionality reduction with reduction ratio $r$, which is a hyperparameter. The second weight matrix $\hat{W_{2}}$ restores the original dimensionality, and then we use the Tanh function to compress the weight values into the bounded range $(-1, 1)$. The dimensionality reduction-and-restoration operation contributes to the distribution of the attention weight, increases the weight of the important semantic space, and reduces the weight of the non-essential semantic space. This process is expressed as:

$\displaystyle b_{i}=F_{ex}(a_{i},\hat{W})=\sigma_{1}(\hat{W_{2}}\times(\hat{W_{1}}\times a_{i})),$ (4)

where $b_{i}\in R^{c}$ is the weight vector of the $i^{\text{th}}$ feature in all semantic spaces, $\sigma_{1}$ is the Tanh function, $\hat{W_{1}}\in R^{\frac{c}{r}\times c}$ and $\hat{W_{2}}\in R^{c\times\frac{c}{r}}$. The attention scores of the multi-semantic interaction features, obtained through the improved SE module, are represented as $B=[b_{1},b_{2},\ldots,b_{h^{\prime}}]\in R^{h^{\prime}\times c}$.
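Eq. (4) can be applied to all rows of $A$ at once with two matrix products. The sketch below follows the equation literally (no bias and no nonlinearity between the two matrices, Tanh at the end); the toy sizes and random weights are assumptions of the example.

```python
import numpy as np

def excitation(A, W1, W2):
    """Row-wise b_i = tanh(W2 @ (W1 @ a_i)).

    A: (h', c), W1: (c // r, c), W2: (c, c // r). Returns B with shape (h', c).
    """
    return np.tanh(A @ W1.T @ W2.T)

rng = np.random.default_rng(0)
h_prime, c, r = 3, 4, 2                 # toy sizes; r is the reduction ratio
A = rng.normal(size=(h_prime, c))
W1 = rng.normal(size=(c // r, c))       # dimensionality reduction
W2 = rng.normal(size=(c, c // r))       # dimensionality restoration
B = excitation(A, W1, W2)
print(B.shape)  # (3, 4)
```

Because the same two matrices are shared by every interactive feature, the parameter cost of this step does not grow with $h^{\prime}$.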

Re-Weighting and Aggregation

First, we perform semantic-wise multiplication between the convolution interaction features $F$ and the attention scores $B$. The output of the Re-Weight is expressed as:

$\displaystyle S=F_{rw}(F,B)=F\ast B,$ (5)

where $S\in R^{h^{\prime}\times d^{\prime}\times c}$ represents multi-semantic features with weights.

Secondly, different semantic spatial information of $S$ is aggregated:

$\displaystyle X_{s}=F_{\textit{aggr}}(S)=[x_{s_{1}},x_{s_{2}},\ldots,x_{s_{h^{\prime}}}],$ (6)

where $X_{s}\in R^{d^{\prime}\times h^{\prime}}$ is the final set of interactive features and $x_{s_{i}}\in R^{d^{\prime}}$ is the $i^{\text{th}}$ interactive feature. There are many aggregation methods, such as sum, max, and mean. We use the sum of the tensor as the final result. The significance of this operation is that the aggregated semantic information is used to enhance the feature representation and help dig out more potential information. Moreover, it also satisfies the input of the deep network.
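Eqs. (5) and (6) can be sketched together with NumPy broadcasting. The example assumes $F$ has shape $(h^{\prime}, d^{\prime}, c)$ and returns the aggregated features with one row per interactive feature, which is the transpose of the $d^{\prime}\times h^{\prime}$ layout written above; this is only a storage choice of the sketch.

```python
import numpy as np

def reweight_and_aggregate(F, B):
    """Scale each semantic space by its score, then sum over semantic spaces.

    F: (h', d', c) interactive features, B: (h', c) attention scores.
    Returns X_s with shape (h', d').
    """
    S = F * B[:, None, :]   # Eq. (5): broadcast the scores over the embedding axis
    return S.sum(axis=-1)   # Eq. (6) with sum aggregation

F = np.ones((2, 3, 4))
B = np.arange(8, dtype=float).reshape(2, 4)
X_s = reweight_and_aggregate(F, B)
print(X_s.shape)  # (2, 3)
```

Replacing `sum` with `max` or `mean` along the last axis gives the other aggregation options mentioned in the text.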

3.3 Feature fusion module

Since the low-order feature interaction still misses some helpful information, we add multiple FC layers to generate high-order feature interactions and improve the expressive ability of the model. The processing procedure is described as follows.

First, we combine the multi-semantic interactive feature $X_{s}$ and the 1-order features $X_{r}$ , where $X_{r}$ is composed of the embedding vectors $E$ and the dense features. Both $X_{s}$ and $X_{r}$ are fused and used as the input of the first layer of the deep network, which is expressed as:

$\displaystyle\mathbbm{a}^{(0)}=F_{\textit{con}}(X_{r},X_{s}),$ (7)

Then, we get $\mathbbm{a}^{(1)}$ when $\mathbbm{a}^{(0)}$ is input to the deep neural network. Similarly, we can get the output $\mathbbm{a}^{(l)}$ of the $l^{\text{th}}$ layer network.

$\displaystyle\mathbbm{a}^{(1)}=\sigma(\mathbb{W}^{(1)}\mathbbm{a}^{(0)}+\mathbbm{b}^{(1)}),$ (8)

$\displaystyle\mathbbm{a}^{(l)}=\sigma(\mathbb{W}^{(l)}\mathbbm{a}^{(l-1)}+\mathbbm{b}^{(l)}),$ (9)

where $l$ is the layer depth and $\sigma$ is the activation function. $\mathbbm{a}^{(l-1)},\mathbb{W}^{(l)},\mathbbm{b}^{(l)}$ , and $\mathbbm{a}^{(l)}$ represent the input, weights, bias, and output of the $l^{\text{th}}$ layer, respectively.

Finally, since CTR prediction is a binary classification problem, we select the Sigmoid function as the activation function.

$\displaystyle\hat{\textrm{y}}=\textit{Sigmoid}(\mathbb{W}^{(l+1)}\mathbbm{a}^{(l)}+\mathbbm{b}^{(l+1)}),$ (10)

We choose cross entropy loss as the objective function.

$\displaystyle\textit{loss}=-\frac{1}{N}\sum\nolimits_{i=1}^{N}(\textrm{y}_{i}\log(\hat{\textrm{y}}_{i})+(1-\textrm{y}_{i})\log(1-\hat{\textrm{y}}_{i})),$ (11)

where $\textrm{y}_{i}$ and $\hat{\textrm{y}}_{i}$ are the ground truth and the predicted rate of the $i^{\text{th}}$ instance, and $N$ is the total size of samples. The ultimate goal of the model is to minimize the objective function and obtain the best efficiency through continuous learning.
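The forward pass of Eqs. (8) to (10) and the loss of Eq. (11) can be sketched in a few lines of NumPy. The hidden activation $\sigma$ is not specified in this section, so ReLU is assumed here; the layer sizes, random weights, and the clipping constant `eps` are likewise assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, layers, out_W, out_b):
    """a0: (batch, n0). layers: list of (W, b) with W of shape (n_out, n_in)."""
    a = a0
    for W, b in layers:
        a = np.maximum(0.0, a @ W.T + b)        # Eqs. (8)-(9), ReLU assumed for sigma
    return sigmoid(a @ out_W.T + out_b).ravel()  # Eq. (10)

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross entropy of Eq. (11); clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

rng = np.random.default_rng(0)
a0 = rng.normal(size=(2, 5))                      # concatenated input for 2 samples
layers = [(rng.normal(size=(4, 5)), np.zeros(4)),
          (rng.normal(size=(3, 4)), np.zeros(3))]
out_W, out_b = rng.normal(size=(1, 3)), np.zeros(1)

y_hat = forward(a0, layers, out_W, out_b)
print(y_hat.shape)  # (2,)
```

As a sanity check, predicting 0.5 for every sample gives a loss of $\ln 2\approx 0.693$ regardless of the labels.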

3.4 Complexity analysis

Feature interaction module (FIM), the key module in our proposed MeFiNet, has the highest complexity among all modules and differs from other models. In the following, we will analyze the complexity of this module from space and time perspectives.

Space complexity

FIM includes two sub-modules. There are $\mathbbm{h}\mathbbm{w}c$ parameters for multi-semantic feature interaction, where $\mathbbm{h}$ and $\mathbbm{w}$ are the height and width of the convolution kernel, respectively, and $ch^{\prime}+2c^{2}/r$ parameters for SE. Thus, the number of parameters of FIM is $\mathbbm{h}\mathbbm{w}c+ch^{\prime}+{2c^{2}}/r$ . Therefore, the space complexity of FIM is $O(\mathbbm{h}\mathbbm{w}c+ch^{\prime})$ .
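The parameter count above is simple arithmetic and can be checked for concrete settings. The helper below just evaluates the stated formula; the function name and the example values are hypothetical, and $c$ is assumed to be divisible by $r$.

```python
def fim_param_count(kernel_h, kernel_w, c, h_prime, r):
    """Evaluate h*w*c + c*h' + 2*c^2/r, the FIM parameter count stated in the text."""
    mfi = kernel_h * kernel_w * c         # multi-semantic feature interaction kernels
    se = c * h_prime + 2 * c * (c // r)   # SE term as given in the text
    return mfi + se

# Example: 2x3 kernels, 8 semantic spaces, 6 feature pairs, reduction ratio 2.
total = fim_param_count(kernel_h=2, kernel_w=3, c=8, h_prime=6, r=2)
print(total)  # 160
```

The count contains no factor of the embedding vocabulary size, which is consistent with the earlier argument that adding kernels is cheaper than adding embedding tables.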

Time complexity

The size of the output feature map of each convolution kernel is $h^{\prime}\times d^{\prime}$, so the time complexity of MFI is $O(h^{\prime}d^{\prime}\mathbbm{h}\mathbbm{w}c)$. For SE, the time complexity is $O(h^{\prime}c^{2}/r)$, where $r$ is a hyperparameter. Then the total time complexity of FIM is $O(h^{\prime}c(d^{\prime}\mathbbm{h}\mathbbm{w}+c))$.

4. Experiments

In this section, we perform extensive experiments on three benchmark datasets to evaluate our proposed MeFiNet. We aim to answer the following research questions:

RQ1: Can our proposed model perform better than other competitive models?

RQ2: Are the key components in MeFiNet (i.e., SE, FIM, FFM) helpful in improving CTR prediction results?

RQ3: How do hyperparameters such as the number of semantic spaces, the dimension of feature embedding, and the width of the convolution kernel affect the performance of MeFiNet?

4.1 Experimental settings

Datasets

We conduct experiments on three real-world datasets, namely MovieLens-1M (https://grouplens.org/datasets/movielens/), Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge), and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction). 1) MovieLens-1M is a widely used dataset about user ratings in recommender systems. We binarize the score data, taking data with scores greater than 3 as positive samples and data with scores less than 3 as negative samples. To eliminate ambiguity, we delete all the data with a score of 3 in the dataset. 2) Criteo is a dataset used for CTR prediction in the Kaggle competition. It is a standard benchmark dataset for CTR prediction, including 45 million click records, 26 categorical feature fields, and 13 numeric feature fields. 3) Avazu is also a dataset used for CTR prediction in the Kaggle competition. It contains 40 million click records and 23 features. The field ‘id’ is a unique index, so we remove it from the experiment.
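The MovieLens-1M labeling rule described above can be sketched in plain Python. The tuple layout `(user, item, rating)` and the function name are assumptions made for the example.

```python
def binarize_ratings(rows):
    """Drop ratings equal to 3; label ratings above 3 as 1 and below 3 as 0.

    rows: iterable of (user, item, rating) tuples (hypothetical layout).
    """
    return [(u, i, 1 if r > 3 else 0) for u, i, r in rows if r != 3]

ratings = [(1, 10, 5), (1, 11, 3), (2, 10, 2), (3, 12, 4), (3, 13, 1)]
labeled = binarize_ratings(ratings)
print(labeled)  # [(1, 10, 1), (2, 10, 0), (3, 12, 1), (3, 13, 0)]
```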

Both the Criteo and Avazu datasets are enormous. Limited by the experimental conditions, we select about a quarter of each as the experimental data. In the experiment, each dataset is divided into a training set (64%), a validation set (16%), and a test set (20%). Table 2 summarizes the statistics of the three datasets.
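A 64/16/20 split can be produced as sketched below. Whether the data are shuffled or split chronologically is not stated, so the seeded random shuffle here is an assumption, as is the function name.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle, then split into train (64%), validation (16%), and test (20%)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumed: random shuffle with a fixed seed
    n = len(items)
    n_test = round(n * 0.20)
    n_val = round(n * 0.16)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 64 16 20
```

The same proportions result from first holding out 20% for testing and then taking 20% of the remaining 80% for validation.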

Table 2

Statistics of datasets

Dataset	#Samples	#Features	#Sparse	Positive ratio
MovieLens-1M	739,012	9	7	0.78
Criteo	10,000,000	39	26	0.22
Avazu	10,000,000	22	21	0.18

Evaluation Metrics

We evaluate model performance using two classic CTR criteria: the area under the ROC curve (AUC) and Logloss. AUC avoids the influence of the threshold used to convert predicted probabilities into categories, and it is suitable when positive and negative samples are imbalanced. The value of AUC usually ranges from 0.5 to 1, and a higher AUC indicates better model performance. Logloss (cross entropy) reflects the average deviation of the prediction results. It pays more attention to the sorting ability of the algorithm. Generally, lower Logloss indicates better performance. In the CTR prediction task, increasing the AUC by 1‰ or decreasing the Logloss by 1‰ is considered very meaningful.
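For intuition, AUC can be computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below is a quadratic-time illustration rather than an efficient implementation; library routines use a sort-based method instead.

```python
def auc(y_true, y_score):
    """Pairwise AUC: probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

score = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(score)  # 0.75
```

Because only the ordering of the scores matters, no classification threshold enters the computation.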

Baselines

We compare our model with the following methods.

• LR [19]. As the most classic linear model, it is easy to implement and effective.
• FM [10]. It is the first model that uses factorization techniques to learn 2-order interactive features.
• WDL [23]. It consists of two parts. The Deep part takes the sparse features passing through the embedding layer as input to the neural network, and the Wide part takes the manual features as the input of LR.
• DeepFM [21]. DeepFM uses FM instead of LR in the Wide part and uses the concatenated embedding vectors as the input of the DNN in the Deep part. Both parts share the same input.
• PNN [14]. It defines two feature interaction methods: the inner and outer products. We implement the inner product interaction method, IPNN.
• CCPM [15]. It pioneers the use of CNNs to learn feature interactions in CTR prediction. The feature matrix finally obtained after multiple pooling operations is used as the input of the MLP.
• FGCNN [16]. A CNN is first used to generate local patterns, and then an FC layer is introduced to recombine them to generate new features. A deep classifier adopts the structure of IPNN to learn interactions from the augmented feature space.
• FiBiNet [29]. The SENet mechanism is used to learn the weights of features dynamically, and three types of Bilinear-Interaction layers are used to learn feature interactions.

Implementation Details

The experimental hardware platform is an Intel® Core i5-4200H CPU @ 2.80 GHz 2.79 GHz with 8 GB memory, a 1 TB hard disk, and a 64-bit operating system on an x64-based processor. The experiments run on TensorFlow 1.15, and the programming language is Python 3.7.

Table 3
The overall performance of different models on three datasets

Models	MovieLens-1M		Criteo		Avazu	
	AUC	Logloss	AUC	Logloss	AUC	Logloss
LR	0.8659	0.3578	0.7831	0.4312	0.7499	0.4086
FM	0.8866	0.3298	0.7888	0.4272	0.7582	0.4039
CCPM	0.8828	0.3354	0.7920	0.4246	0.7646	0.4003
IPNN	0.8911	0.3246	0.8092	0.4104	0.7718	0.3962
WDL	0.8913	0.3241	0.8106	0.4092	0.7710	0.3969
DeepFM	0.8915	0.3234	0.8108	0.4089	0.7712	0.3968
FGCNN	0.8916	0.3240	0.7930	0.4239	0.7713	0.3965
FiBiNet	0.8921	0.3227	0.8096	0.4102	0.7716	0.3963
MeFiNet	0.8934	0.3214	0.8116	0.4084	0.7729	0.3957

4.2 Performance comparison (RQ1)

Dataset	#Samples	#Features	#Sparse	Positive ratio
MovieLens-1M	739,012	9	7	0.78
Criteo	10,000,000	39	26	0.22
Avazu	10,000,000	22	21	0.18

We compare different CTR models on MovieLens-1M, Criteo and Avazu, using AUC and Logloss as evaluation indicators. The performance is shown in Table 3, where the underlined numbers represent the best performance among all benchmark models, and the bold numbers represent the best among all models. From the table, we make the following observations.

(1) LR performs the worst because it only learns from the original features and does not consider the correlation between features. FM does not perform as well as the deep learning-based models: it adds the learning of 2-order feature interactions but does not consider high-order feature interactions.
(2) Among all deep learning-based models, CCPM only calculates local feature interactions and ignores global information, so its performance is the worst. IPNN uses inner products to model interactions between embedding features and performs best among the baselines on the Avazu dataset. Both WDL and DeepFM use a parallel structure. DeepFM uses the FM model for low-order interactions and outperforms WDL; it is the best baseline on the Criteo dataset. FGCNN uses a CNN and a fully connected recombination layer to build new features, and it performs well on the MovieLens-1M and Avazu datasets. FiBiNet distinguishes the importance of features based on SENet in the embedding layer and uses the bilinear structure for feature combination; it is the best baseline on MovieLens-1M.
(3) The proposed MeFiNet model performs best. Relative to the best baseline on each dataset, AUC increases by 0.15% and Logloss decreases by 0.4% on MovieLens-1M; AUC increases by 0.1% and Logloss decreases by 0.12% on Criteo; and AUC increases by 0.14% and Logloss decreases by 0.13% on Avazu. In actual scenarios, a slight increase in the AUC of offline models may bring additional millions of dollars to online applications yearly. Therefore, the performance advantage of MeFiNet is of great significance.
(4) Comparing the AUC and Logloss improvements of MeFiNet across the three datasets, we find that the gains on the MovieLens-1M and Avazu datasets are larger than those on the Criteo dataset. After analysis, we conclude that MeFiNet can highlight its advantages when the amount of data or the number of features is small.
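The percentages in observation (3) can be recomputed from Table 3 as relative changes against the best baseline on each dataset. The sketch below is one way to do this; the dictionary layout and the helper name `relative_gains` are our own, and the numbers are copied from Table 3.

```python
# Each row: (AUC, Logloss) for MovieLens-1M, Criteo, Avazu, copied from Table 3.
TABLE3 = {
    "LR":      (0.8659, 0.3578, 0.7831, 0.4312, 0.7499, 0.4086),
    "FM":      (0.8866, 0.3298, 0.7888, 0.4272, 0.7582, 0.4039),
    "CCPM":    (0.8828, 0.3354, 0.7920, 0.4246, 0.7646, 0.4003),
    "IPNN":    (0.8911, 0.3246, 0.8092, 0.4104, 0.7718, 0.3962),
    "WDL":     (0.8913, 0.3241, 0.8106, 0.4092, 0.7710, 0.3969),
    "DeepFM":  (0.8915, 0.3234, 0.8108, 0.4089, 0.7712, 0.3968),
    "FGCNN":   (0.8916, 0.3240, 0.7930, 0.4239, 0.7713, 0.3965),
    "FiBiNet": (0.8921, 0.3227, 0.8096, 0.4102, 0.7716, 0.3963),
    "MeFiNet": (0.8934, 0.3214, 0.8116, 0.4084, 0.7729, 0.3957),
}
DATASETS = ("MovieLens-1M", "Criteo", "Avazu")

def relative_gains(table, proposed="MeFiNet"):
    """Return {dataset: (AUC gain %, Logloss reduction %)} vs. the best baseline."""
    baselines = [row for name, row in table.items() if name != proposed]
    gains = {}
    for d, dataset in enumerate(DATASETS):
        auc_col, ll_col = 2 * d, 2 * d + 1
        best_auc = max(row[auc_col] for row in baselines)  # higher is better
        best_ll = min(row[ll_col] for row in baselines)    # lower is better
        auc_gain = 100.0 * (table[proposed][auc_col] - best_auc) / best_auc
        ll_drop = 100.0 * (best_ll - table[proposed][ll_col]) / best_ll
        gains[dataset] = (round(auc_gain, 2), round(ll_drop, 2))
    return gains

for dataset, (auc_gain, ll_drop) in relative_gains(TABLE3).items():
    print(f"{dataset}: AUC +{auc_gain}%, Logloss -{ll_drop}%")
```

The best baseline is selected per metric and per dataset, which in Table 3 is FiBiNet on MovieLens-1M, DeepFM on Criteo, and IPNN on Avazu.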

4.3 Effects of key components (RQ2)

The following is an experiment to study the impact of critical components in MeFiNet on performance. Each variant is realized by removing relevant components.

• w/o SE. We remove the SE sub-module (attention mechanism) in this variant and retain the MFI sub-module.
• w/o FIM. We remove the FIM module, and only the original feature embedding vectors are directly input into the DNN of the FFM module.
• w/o FFM. We remove the FFM module and replace it with the binary classification function, i.e., removing the high-order feature interactions and keeping only the shallow structure of the model.
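For readers who want a concrete picture of what the w/o SE variant removes, the following sketch shows a generic SENet-style squeeze, excitation and re-weighting pass over a set of feature vectors. It illustrates the standard mechanism only, not the improved variant proposed in this paper; the mean-pooling squeeze, the two weight matrices `W1` and `W2`, and the ReLU/sigmoid activations are assumptions made for the example.

```python
import math

def squeeze(features):
    """Summarise each feature vector by its mean (one scalar per vector)."""
    return [sum(v) / len(v) for v in features]

def dense(W, z):
    """Plain matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def excitation(z, W1, W2):
    """Two-layer bottleneck: ReLU on the reduced layer, sigmoid on the output."""
    hidden = [max(0.0, h) for h in dense(W1, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(W2, hidden)]

def se_reweight(features, W1, W2):
    """Scale every feature vector by its learned importance weight."""
    weights = excitation(squeeze(features), W1, W2)
    reweighted = [[w * x for x in v] for w, v in zip(weights, features)]
    return reweighted, weights

# Three illustrative feature vectors and hand-picked (not learned) weights.
features = [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
W1 = [[0.5, 0.5, 0.5]]        # reduces 3 summaries to 1 hidden unit
W2 = [[1.0], [0.0], [-1.0]]   # expands back to 3 importance scores
reweighted, weights = se_reweight(features, W1, W2)
print(weights)
print(reweighted)
```

In a trained model `W1` and `W2` are learned jointly with the rest of the network, so the weights adapt to each input instead of being fixed as they are here.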

The experimental results are shown in Table 4, from which we can get the following observations.

Table 4
The performance comparison of different components in MeFiNet

Models	MovieLens-1M		Criteo		Avazu	
	AUC	Logloss	AUC	Logloss	AUC	Logloss
MeFiNet	0.8934	0.3214	0.8116	0.4084	0.7729	0.3957
w/o SE	0.8921	0.3228	0.8108	0.4089	0.7709	0.3970
w/o FIM	0.8914	0.3241	0.8107	0.4090	0.7711	0.3967
w/o FFM	0.8870	0.3302	0.7977	0.4207	0.7658	0.3996

(1) The FFM component has the most significant impact on the results because the high-order interaction of features brings much information to the model. (a) Compared with MeFiNet, w/o FFM gets a lower AUC and higher Logloss (specifically, $-$ 0.72% in AUC and $+$ 2.74% in Logloss on MovieLens-1M; $-$ 1.71% in AUC and $+$ 3.01% in Logloss on Criteo; $-$ 0.92% in AUC and $+$ 0.99% in Logloss on Avazu). This shows that the deep neural network strongly influences the performance of CTR prediction. (b) Compared with the traditional shallow models LR and FM in Table 3, w/o FFM performs better, which reflects that the proposed FIM module brings advantages to the model.
(2) The FIM component is also crucial for the model. After removing FIM, the entire model is equivalent to a single DNN where the original features are directly input into the MLP after embedding. The result of w/o FIM is inferior to MeFiNet on the three datasets ( $-$ 0.22% in AUC and $+$ 0.84% in Logloss on MovieLens-1M; $-$ 0.11% in AUC and $+$ 0.15% in Logloss on Criteo; $-$ 0.23% in AUC and $+$ 0.25% in Logloss on Avazu). Adding an explicit feature interaction module to the deep learning framework is therefore beneficial.
(3) When the features are greatly increased, adding an attention module can effectively improve the model’s performance. Compared with MeFiNet, w/o SE lowers AUC by 0.15% and raises Logloss by 0.44% on MovieLens-1M, and lowers AUC by 0.1% and raises Logloss by 0.12% on Criteo. The SE module has the largest impact on the Avazu dataset, where AUC decreases by 0.26% and Logloss increases by 0.33%. In conclusion, adding the SE attention mechanism helps selectively emphasize important features, suppress less important ones, and improve the model’s performance.
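The relative changes quoted in observations (1) to (3) follow directly from Table 4. As a cross-check, the sketch below recomputes them against the full model and orders the variants by their mean AUC change; the helper names `ablation_changes` and `rank_by_auc_drop` are our own, and the numbers are copied from Table 4.

```python
# Each row: (AUC, Logloss) for MovieLens-1M, Criteo, Avazu, copied from Table 4.
TABLE4 = {
    "MeFiNet": (0.8934, 0.3214, 0.8116, 0.4084, 0.7729, 0.3957),
    "w/o SE":  (0.8921, 0.3228, 0.8108, 0.4089, 0.7709, 0.3970),
    "w/o FIM": (0.8914, 0.3241, 0.8107, 0.4090, 0.7711, 0.3967),
    "w/o FFM": (0.8870, 0.3302, 0.7977, 0.4207, 0.7658, 0.3996),
}

def ablation_changes(table, full="MeFiNet"):
    """Signed relative change (%) of every metric of each variant vs. the full model."""
    ref = table[full]
    return {
        variant: tuple(round(100.0 * (v - r) / r, 2) for v, r in zip(row, ref))
        for variant, row in table.items()
        if variant != full
    }

def rank_by_auc_drop(changes):
    """Variants ordered from the largest to the smallest mean AUC drop."""
    def mean_auc_change(row):
        auc_changes = row[0::2]  # AUC columns sit at even positions
        return sum(auc_changes) / len(auc_changes)
    return sorted(changes, key=lambda variant: mean_auc_change(changes[variant]))

changes = ablation_changes(TABLE4)
for variant in rank_by_auc_drop(changes):
    print(variant, changes[variant])
```

Negative AUC entries and positive Logloss entries both indicate that the variant is worse than the full model; by mean AUC drop, the order is w/o FFM, then w/o FIM, then w/o SE.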

In addition, w/o SE performs better than w/o FIM on the MovieLens-1M and Criteo datasets, while the two variants are nearly tied on Avazu (AUC 0.7709 vs. 0.7711). The FIM component consists of MFI and SE, so deleting only the SE component and retaining the MFI component amounts to adding multi-semantic feature interaction to w/o FIM. The better results on MovieLens-1M and Criteo suggest that combining the deep learning-based model with feature interaction is effective in enhancing the ability of feature representation learning.

4.4 Influence of hyper-parameters (RQ3)

Several critical parameters in MeFiNet impact the model’s performance, such as the semantic space, the embedding dimension, and the convolution kernel width. To study the impact of these hyper-parameters, we investigate how the MeFiNet model works by changing one hyper-parameter while fixing the others on three datasets in this subsection.
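The protocol of changing one hyper-parameter while fixing the others can be written down as a small configuration generator. The sketch below is illustrative only: the default values, the grid for the number of semantic spaces, and the doubling schedule for the embedding dimension are assumptions, while the kernel-width range 1 to 8 and the MovieLens-1M embedding range 4 to 64 are taken from the text.

```python
# Hypothetical default setting; all other hyper-parameters stay at these values.
DEFAULTS = {"semantic_spaces": 3, "embedding_dim": 8, "kernel_width": 2}

GRIDS = {
    "semantic_spaces": list(range(1, 9)),   # assumed range, not stated in the text
    "embedding_dim": [4, 8, 16, 32, 64],    # 4 to 64 (MovieLens-1M); doubling is assumed
    "kernel_width": list(range(1, 9)),      # 1 to 8, as in Section 4.4.3
}

def one_at_a_time(defaults, grids):
    """Yield (varied_name, config) pairs that differ from defaults in one entry only."""
    configs = []
    for name, values in grids.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults are never mutated
            config[name] = value
            configs.append((name, config))
    return configs

for name, config in one_at_a_time(DEFAULTS, GRIDS):
    print(name, config)
```

Each generated configuration would then be trained and evaluated separately, so the cost grows with the sum of the grid sizes rather than their product.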

4.4.1 Impact of semantic spaces

Different feature interactions exist in semantic spaces, such as user and advertiser spaces. These spaces are implemented based on convolutional layers in this paper. The multi-semantic spaces help obtain richer semantic interaction features but also increase the parameters that need to be optimized. We conducted experiments on three datasets to study the impact of semantic spaces. The experimental results are shown in Fig. 5, where Fig. 5a–c is about AUC, and Fig. 5d–f is about Logloss.

Figure 5.

Impact of semantic spaces.

(1) For the MovieLens-1M dataset, the model performs best when the number of semantic spaces is 3, i.e., it obtains the largest AUC and the smallest Logloss.
(2) For the Criteo dataset, as the number of semantic spaces increases, the model performance has a growing trend, and then the growth slows down. When the number of semantic spaces is 5, the model obtains the best AUC and Logloss.
(3) For the Avazu dataset, when the number of semantic spaces is 6, the values of AUC and Logloss are significantly improved, and the model performance is the best.

From the overall change of the two indicators on the three datasets, the model’s performance is first improved and then decreased when the number of semantic spaces increases. Because increasing the number of semantic spaces too much will create more parameters and lead to a more complex network structure, the model will likely perform poorly due to overfitting.

4.4.2 Impact of embedding dimensions

Generally, a larger feature embedding dimension will contain more information. However, if we blindly increase the embedding dimension, it may be less conducive to the presentation of information. We conduct relevant experiments on three datasets to explore the impact of different embedding dimensions of feature vectors on the model’s performance. For the MovieLens-1M dataset, the dimensions are from 4 to 64. For the Criteo and Avazu datasets, the dimensions are from 2 to 32. The experimental results are shown in Fig. 6, from which we can observe:

(1) As the embedding dimension increases, AUC increases and Logloss decreases. Beyond a certain dimension, the performance declines because the model has too many parameters and becomes harder to converge.
(2) For the MovieLens-1M dataset, when the dimension is 8, AUC reaches its maximum and Logloss reaches its minimum. For the Criteo dataset, the model performs best when the dimension is 8. For the Avazu dataset, the model performs best when the dimension is set to 4.
(3) We can achieve good performance by appropriately defining the dimension of embedded features. For the three datasets, it is more appropriate to set embedding dimensions as 4–8.

Figure 6.
Impact of embedding dimensions.

Figure 7.
Impact of convolution kernel widths.

4.4.3 Impact of convolution kernel widths

The width of the convolution kernel determines the perception range of the 2-order interaction. When the width increases, each feature interaction spans more embedding dimensions and therefore contains more information, while the dimension of the feature obtained after the interaction decreases. At the same time, the number of parameters in the convolution kernel grows, and the update becomes more complicated. To better observe the impact of the kernel width, we uniformly set the embedding dimension of the interactive feature to 8 and vary the width from 1 to 8. The results are shown in Fig. 7.

(1) On the MovieLens-1M dataset, the performance is the best when the width is 2, and it degrades as the kernel width increases further. For the Criteo dataset, the model performs better with widths 3 and 4 than with other widths. For the Avazu dataset, the model performs much better with widths 1–4, especially 2, than with larger widths. It can be seen that properly increasing the width of convolution interaction helps improve the model’s performance.
(2) Overall, small windows perform better than large ones. When the width is small, the convolution kernel has fewer interaction parameters, and the computational complexity of the convolution is low. At the same time, the interactive features maintain an appropriate dimension, which avoids both the convergence difficulty caused by too many parameters and the loss of information caused by an overly large width.
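The trade-off between kernel width and output dimension can be illustrated with a one-dimensional sliding window. The sketch below assumes that a kernel of width w slides along the embedding axis with stride 1 and no padding, so an 8-dimensional interaction vector yields 8 - w + 1 outputs; this layout and the function names are assumptions for illustration, since this section does not restate the exact convolution configuration.

```python
def conv_output_width(embedding_dim, kernel_width, stride=1):
    """Number of outputs of an unpadded 1-D convolution along the embedding axis."""
    if not 1 <= kernel_width <= embedding_dim:
        raise ValueError("kernel width must lie between 1 and the embedding dimension")
    return (embedding_dim - kernel_width) // stride + 1

def conv1d_valid(vector, kernel):
    """Unpadded sliding dot product (cross-correlation, as used in deep learning)."""
    w = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, vector[i:i + w]))
        for i in range(len(vector) - w + 1)
    ]

# With the embedding dimension fixed to 8, as in the experiment above:
for width in range(1, 9):
    print(width, conv_output_width(8, width))

print(conv1d_valid([1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # [3.0, 5.0, 7.0]
```

Under these assumptions, a width of 1 keeps all 8 dimensions but mixes no neighbouring dimensions, while a width of 8 collapses the interaction to a single value, which matches the qualitative discussion of small versus large windows.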

5. Conclusions

To take advantage of the rich and diverse semantic relationships in feature interactions, we propose to use multiple semantic spaces to learn feature interactions and establish a new model, MeFiNet, for CTR prediction. We use convolution operations to construct multiple semantic spaces, which helps to learn the latent semantic expression of features and avoids the uniformity and randomness of a single embedding space. To identify the importance of semantic features, we propose an improved Squeeze & Excitation method based on SENet, which uses global information to dynamically re-adjust the weight distribution. Experimental results on three public datasets show the superior performance of MeFiNet. Further experiments verify that the deep learning framework is even more effective with well-designed feature combinations as input. Learning multiple semantic feature interactions improves CTR prediction accuracy but also makes the model more complex. In the future, we will study lightweight methods to obtain more semantic information from feature interactions while making the model less complicated.

Footnotes

Acknowledgments

This work is partly supported by the Shanghai Science and Technology Innovation Action Plan Project (No. 22511100700).

Conflict of interest

No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. On behalf of my coauthors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript. This article does not contain any studies with human participants performed by any of the authors.

References

1. Ouyang, Zhang, Ren, Liu, Representation learning-assisted click-through rate prediction, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 4561–4567.
2. Pan, Jin, Liu, Shi, Atallah, Herbrich, Bowers et al., Practical lessons from predicting clicks on ads at Facebook, in: Proceedings of the 8th International Workshop on Data Mining for Online Advertising (ADKDD), 2014, pp. 1–9.
3. Yan, Chen, Zhang, JointCTR: A joint CTR prediction framework combining feature interaction and sequential behavior learning, Applied Intelligence 52(4) (2022), 4701–4714.
4. Juan, Zhuang, Chin W.-S., Lin C.-J., Field-aware factorization machines for CTR prediction, in: Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), 2016, pp. 43–50.
5. Yan, Chen, Wan, Wang, Modeling low- and high-order feature interactions with FM and self-attention network, Applied Intelligence 51(6) (2021), 3189–3201.
6. Wang, Deep & cross network for ad click predictions, in: Proceedings of the AdKDD, 2017, pp. 1–7.
7. Zhang, Yuan, Wang, Optimal real-time bidding for display advertising, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2014, pp. 1077–1086.
8. Yang, Deng, Tan, Tao, Zhang, Qin, Ding, Learning compositional, visual and relational representations for CTR prediction in sponsored search, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 2851–2859.
9. Zhang, Qin, Guo, Tang, Deep learning for click-through rate estimation, in: Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021, pp. 4695–4703.
10. Rendle, Factorization machines with libFM, ACM Transactions on Intelligent Systems and Technology (TIST) 3(3) (2012), 57.
11. Shan, Hoens T.R., Jiao, Wang, Mao, Deep crossing: Web-scale modeling without manually crafted combinatorial features, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2016, pp. 255–262.
12. Chua T.-S., Neural factorization machines for sparse predictive analytics, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2017, pp. 355–364.
13. Wang, Shivanna, Cheng, Jain, Lin, Hong, Chi, DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems, in: Proceedings of the World Wide Web Conference (WWW), 2021, pp. 1785–1797.
14. Cai, Ren, Zhang, Wen, Wang, Product-based neural networks for user response prediction, in: Proceedings of the 16th IEEE International Conference on Data Mining (ICDM), 2016, pp. 1149–1154.
15. Liu, Wang, A convolutional click prediction model, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM), 2015, pp. 1743–1746.
16. Liu, Tang, Chen, Guo, Zhang, Feature generation by convolutional neural network for click-through rate prediction, in: Proceedings of the World Wide Web Conference (WWW), 2019, pp. 1119–1129.
17. Liu, Wang, Tan, Contextual operation for recommender systems, IEEE Transactions on Knowledge and Data Engineering (TKDE) 28(8) (2016), 2000–2012.
18. Cui, Zhang, Wang, Fi-GNN: Modeling feature interactions via graph neural networks for CTR prediction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 539–548.
19. Richardson, Dominowska, Ragno, Predicting clicks: estimating the click-through rate for new ads, in: Proceedings of the 16th International Conference on World Wide Web (WWW), 2007, pp. 521–530.
20. Yan, Zhang, Zhao, Huang, An intelligent field-aware factorization machine model, in: Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 2017, pp. 309–323.
21. Guo, Tang, DeepFM: a factorization-machine based neural network for CTR prediction, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 1725–1731.
22. Zhang, Wang, Deep learning over multi-field categorical data, in: Proceedings of the European Conference on Information Retrieval (ECIR), 2016, pp. 45–57.
23. Cheng H.-T., Koc, Harmsen, Shaked, Chandra, Aradhye, Anderson, Corrado, Chai, Ispir et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (RecSys), 2016, pp. 7–10.
24. Xiao, Zhang, Chua T.-S., Attentional factorization machines: Learning the weight of feature interactions via attention networks, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 3119–3312.
25. Zhou, Zhu, Song, Fan, Zhu, Yan, Jin, Gai, Deep interest network for click-through rate prediction, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (SIGKDD), 2018, pp. 1059–1068.
26. Tan, Lang, Guo, Deep interest with hierarchical attention network for click-through rate prediction, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020, pp. 1905–1908.
27. Song, Shi, Xiao, Duan, Zhang, Tang, AutoInt: Automatic feature interaction learning via self-attentive neural networks, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 1161–1170.
28. Shen, Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
29. Huang, Zhang, FiBiNET: Combining feature importance and bilinear feature interaction for click-through rate prediction, in: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys), 2019, pp. 169–177.
30. Pang, Shen, Shao, Towards bridging semantic gap to improve semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4230–4239.
31. Deep learning for matching in search and recommendation, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 1365–1368.
32. Zheng, Casari, Feature engineering for machine learning: principles and techniques for data scientists, O’Reilly Media, Inc., 2018.
33. Guo, Zhuang, Qin, Zhu, Xie, Xiong, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering (TKDE) 34(8) (2020), 3549–3568.
34. Chen, Yuan, Yan, Tang, Rui, Chua T.-S., Towards multi-semantic image annotation with graph regularized exclusive group lasso, in: Proceedings of the 19th ACM International Conference on Multimedia (MM), 2011, pp. 263–272.
35. Liu, Wang, Tan, Shao, Huang, TFNet: Multi-semantic feature interaction for CTR prediction, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020, pp. 1885–1888.
36. Bahdanau, Cho, Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015, pp. 1–15.
37. Luong M.-T., Pham, Manning C.D., Effective approaches to attention-based neural machine translation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 1412–1421.
38. Chorowski, Bahdanau, Serdyuk, Cho, Bengio, Attention-based models for speech recognition, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
39. Bahdanau, Chorowski, Serdyuk, Brakel, Bengio, End-to-end attention-based large vocabulary speech recognition, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.
40. Guo, Tang, Guo, Han, Yang, Zhang, Order-aware embedding neural network for CTR prediction, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019, pp. 1121–1124.
41. Yang, Shen, Zhao, Operation-aware neural networks for user response prediction, Neural Networks 121(1) (2020), 161–168.

MeFiNet: Modeling multi-semantic convolution-based feature interactions for CTR prediction
