Abstract

Automatic image aesthetic evaluation is an attractive and challenging visual task. Recently, methods based on convolutional neural networks have achieved remarkable performance. However, semantic information, an intuitive prerequisite for evaluating image aesthetics, has not received enough attention regarding its importance in previous methods. How to efficiently extract semantic information and make better use of it to assist the aesthetic evaluation task remains unsolved. In this article, we propose to utilize the self-supervised model Auto-Encoder to extract semantic information in the form of multi-task learning. Then, a fusing module is prepended at the bottleneck layer to explicitly combine semantic information with aesthetic information in a pre-activated manner. Specifically, we implement a customized pooling operation to pool the semantic features extracted by Auto-Encoder and apply a weak constraint between the pooled semantic features and aesthetic information to realize the combination. The following regressor can complete aesthetic evaluation based on the semantic–aesthetic combined features. In addition, to enable our model to adapt to arbitrary aspect ratios of images, another pooling strategy called spatial pyramid pooling is adopted to obtain the image features of a fixed length. Our method achieves competitive performance on the public image aesthetic evaluation benchmark. Especially on the most commonly used metric Spearman rank-order correlation coefficient, the proposed model achieved the best performance compared with some state-of-the-art methods. Extensive ablation studies and visualization experiments were conducted to demonstrate the effectiveness of our method.

Keywords

Auto-Encoder Image Aesthetic Assessment Semantic Information

Introduction

Background

Image aesthetic assessment (IAA), which aims to quantify the beauty of a given image, is a meaningful vision task, as it has a large number of practical applications in our modern society. For example, with the explosive growth of Internet users, posting photos on social media platforms has already become a common demand in our daily life. Some applications with a well-designed IAA model can help users to choose the most appealing images, to sort similar images in the user’s photo libraries, or to optimize parameters for editing images.

In contrast to conventional visual tasks such as image classification,^1,2 object detection,^3,4 and semantic segmentation,^5,6 it is nontrivial to objectively assign an aesthetic score to a given image. As illustrated in Figure 1, for the image classification task, its ground-truth label is represented as a one-hot vector. It is a binary vector, in which only one entry corresponding to the actual category is equal to 1 while all other entries are 0s. For the IAA task, its ground truth label, called aesthetic label in this article, takes the form of a distributed representation, which reflects a score distribution over the votes obtained from numerous experts. Such a kind of aesthetic label has been adopted in mainstream IAA databases like AVA.⁷ In this article, we aim to devise a learning-based model that can automatically predict the distributed aesthetic score of a given image.

Figure 1.

Comparison between the tasks of image classification and image aesthetic evaluation.

Researchers have devoted considerable effort to the IAA task. Previous methods can be roughly divided into two categories: knowledge-driven and data-driven.

Methods of the first category^8–12 focus on designing a series of hand-crafted features following professional photography skills such as the rule of thirds, color harmony, and shallow depth of field. These manually designed features are fed into a general-purpose classifier such as a support vector machine,¹³ random forest,¹⁴ or multilayer perceptron¹⁵ for aesthetic classification. Although these works have achieved passable results, they still suffer from some limitations. First, implementing these methods requires professional knowledge of photography and aesthetics. Second, due to the subjectivity and diversity of aesthetic attributes, it is difficult even for experienced experts to design a generic hand-crafted aesthetic feature suitable for all photos. Consequently, it is arduous and time-consuming to apply the knowledge-driven methods to the IAA task.

To address the above limitations, the data-driven methods^16–23 resort to various deep neural networks (DNNs) for learning a hierarchy of aesthetic features geared toward the IAA task in an end-to-end manner. The data-driven methods, which are free from the help of labor-intensive feature engineering, have achieved higher accuracy in the IAA task, compared with the knowledge-driven methods.

Lu et al.¹⁹ initiated the success of the DNN-based IAA model. They constructed a double-column CNN (Convolutional Neural Network) model to adaptively capture layout information and fine-grained details, respectively, from global and local views of an input image.

In their subsequent work,²⁰ a deep multi-patch aggregation network was developed to extract and aggregate aesthetic features at the patch level. Jin et al.¹⁷ aimed to predict the distributed aesthetic score for a given image. A tailor-made loss called cumulative Jensen–Shannon (JS) divergence was used to drive the training. Recently, Talebi and Milanfar²² stated that the traditional cross-entropy loss ignores the relationship between the buckets of a distributed aesthetic score. They solved this problem by replacing the cross-entropy loss with an Earth Mover Distance (EMD) loss.

Related Works

Semantic information is significant for the IAA task. This is because understanding the content of an image, namely the semantic information, is an intuitively reasonable prerequisite for evaluating the aesthetic. In this article, we propose to utilize a classic self-supervised method, called Auto-Encoder, to extract the semantic information in the form of multi-task learning.

To gear toward the IAA task, we design a fusing module, which aims to combine the extracted semantic information with image aesthetics, so as to get a better aesthetic evaluation performance. After end-to-end training, the encoder will provide an aesthetic–semantic combined feature that paves the way for evaluating the aesthetic.

In this section, we briefly review some related works from the following two aspects: (a) self-supervised learning (SSL) for IAA; (b) utilization of semantic information in IAA.

Self-Supervised Learning for IAA

Some studies attempt to incorporate SSL methods to reduce the score bias caused by the subjectivity of aesthetics. Ching et al.¹⁶ presented an SSL-based IAA model, in which image inpainting serves as a pretext task. To some extent, inpainting an image will force the IAA model to understand the aesthetic, so as to provide a better initial state for fine-tuning. Following the SSL perspective, Sheng et al.²¹ attempted to extract a set of aesthetic-aware representations from images. Specifically, they designed two pretext tasks, which are trained to identify the types and strengths of editing operations applied to images.

These SSL methods^16,21 focused on designing an aesthetic-aware pretext task without using any manual annotations. Reconstruction, as a self-supervised task to extract semantic information, is trained in a multi-task manner together with aesthetic assessment in this article. This semantic information will be combined with aesthetic information of images to evaluate the image aesthetics. Auto-Encoder, which is a generative SSL architecture,²⁴ can leverage input data itself as supervisory signal. It is used in our model to extract the image semantic information without any additional labels.

Semantic Information in IAA

As a necessary prerequisite of image aesthetics assessment, the importance of semantic information in IAA is indisputable. Kao et al.¹⁸ aimed to discover effective aesthetic representations with the aid of the semantic information. To this end, a multi-task network was built to accomplish the semantic recognition and the IAA task simultaneously. They further introduced a correlation item between these two tasks for learning the inter-task relationship. Zhang et al.²³ proposed a double-subnet network, in which one subnet attends aesthetic-relevant regions by encoding the holistic information, while the other one extracts fine-grained features from these attended regions. Then, a gated information fusion module adaptively combined the extracted fine-grained features at global and local levels.

Unfortunately, the extracted semantic information in literature^18,23 is only used for guiding independent tasks like recognition or attention allocation. In other words, the semantic information, in these previous works,^18,23 only plays an implicit role in predicting the aesthetic scores.

In contrast to the existing works,^18,23 in this article, we propose to explicitly combine the semantic information with aesthetic information in a pre-activated manner. A semantic-aesthetic fusing module is designed to inject aesthetic information into semantic information so that the two are combined. Doing so helps the model to focus on aesthetic-related semantic regions, as demonstrated in the “Visualization” section. Hence, our method makes better use of semantic information to assist the task of IAA by explicitly fusing semantic and aesthetic information. The difference in the manner of using semantic information between previous methods and ours is shown in Figure 2.

Figure 2.

Difference in the manner of using semantic information between previous IAA methods and ours. (a) Previous methods. (b) Ours.

Contributions

Semantic information is important in IAA tasks as it is an intuitive prerequisite for evaluating the aesthetic. However, according to our investigation above, some previous works^16,17,21,22 ignored this important information, while the other ones^18,23 either need additional labels or are not explicit enough when utilizing semantic information. Therefore, there still exists room to improve these previous methods.

In this article, we propose a new aesthetic evaluation model that extracts and utilizes semantic information efficiently. First, a generative SSL model, that is, Auto-Encoder, is utilized to extract the semantic information without using any additional manual labels. Auto-Encoder is a classic generative self-supervised model which can use input data itself as a supervisory signal. In addition, because Auto-Encoder maintains the complete semantic information of the image, the proposed model intuitively improves performance across data sets. Second, to use the semantic information more efficiently, we propose a fusing module to explicitly combine the semantic information with image aesthetics. This module is prepended at the bottleneck layer to inject the image aesthetics into the extracted semantic information. With the fusing module, a relationship is established between the distributed aesthetic score and the pre-activated features. Thus, the combination mentioned above can be realized to assist the aesthetic evaluation. However, a conventional multi-way regressor may confuse these relationships. To address this issue, we implement a split multi-way regressor to replace the conventional one. In our experiments, the superior performance of the proposed model supports the effectiveness of our design.

Our contributions can be summarized as follows:

We propose to utilize a self-supervised model called Auto-Encoder to extract the semantic information of the images in the form of multi-task learning. This process does not require any additional manual labels.

A fusing module is designed to explicitly combine semantic information with aesthetic information in a pre-activated manner. In this module, a weak constraint is applied between the semantic features and aesthetic information to realize the combination.

The entire network is kept free of fully connected layers except for the prediction module and uses the spatial pyramid pooling layer to fix the feature dimensions before the final prediction. Therefore, our model can adapt to the image input with arbitrary aspect ratio.

Our Proposal

In this section, we first provide an overview of the proposed IAA model. Then we introduce three specific modules (i.e. auto-encoder module, semantic-aesthetic fusing module and prediction module) in detail. The loss functions that guide the process of training will also be introduced in each subsection.

Problem Definition

Let $S = \{(I_{n}, y_{n})\}_{n = 1}^{N}$ denote the training set consisting of $N$ training samples. Without loss of generality, we take the $n$th image (and its label) as an example and drop the subscript $n$. Suppose that an image $I \in R^{H \times W \times C}$ has the size $H \times W \times C$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively. Its corresponding distributed aesthetic label is denoted by $y \in R^{M}$, where $M$ is the dimension of the label $y$ (e.g. $M$ is 10 in the AVA data set⁷). The $m$th element in $y = [y_{1}, y_{2}, \dots, y_{M}]$ represents the probability of the image $I$ being rated as $m$, and $\sum_{m = 1}^{M} y_{m} = 1$. Given an aesthetic evaluation training set $S$, our goal is to train a model $f : R^{H \times W \times C} \to R^{M}$, which receives an input image $I$ and predicts a distributed aesthetic score $\hat{y} \in R^{M}$.

Overview

The main insight of our method is to incorporate an SSL method into the aesthetic evaluation network to extract semantic information and combine it with aesthetic information. The model can better understand image aesthetics with the help of semantic information. In addition, the aspect ratio of the image has to be strictly maintained, because changing it can greatly affect the aesthetics of an image. Therefore, an IAA model needs to adapt to different aspect ratios.

The overall architecture of the proposed model is illustrated in Figure 3. This model is composed of three modules: (1) the auto-encoder, which is a generative SSL module,²⁴ aims to extract semantic information from the images without using any additional manual annotations; (2) the Semantic-Aesthetic Fusing (SAF) module is designed to inject aesthetic information into semantic features in a pre-activated manner at the bottleneck layer of our model, so as to explicitly realize the combination between the semantic and the aesthetic information; and (3) the prediction module, consisting of a Spatial Pyramid Pooling (SPP) layer²⁵ and a multi-way regressor, is responsible for outputting the distributed aesthetic score.

Figure 3.

Overall architecture.

Our regressor network is composed of the fully connected layers. This is because the fully connected layer can realize a comprehensive feature integration over the spatial dimension. Thus, the features can be more accurately projected into fixed-dimensional logits.

However, the fully connected layer cannot directly adapt to input of arbitrary size. Therefore, the SPP layer is applied to pool the features by using multiple adaptive pooling layers of different sizes and concatenate their outputs to fix the feature dimension. In addition, in the whole model, the fully connected layer only exists in the regressor network. Thus, the proposed model can directly evaluate the aesthetic score of a given image regardless of its aspect ratio.

Auto-Encoder Module

Auto-encoder is a generative SSL model²⁴ composed of an encoder $E$ and a decoder $D$ . It can leverage input data itself as a supervisory signal for training. Specifically, the encoder $E$ learns to extract the semantic features from the input image, and meanwhile the decoder endeavors to reconstruct the input image from the extracted semantic features. In this procedure, the extracted features must maintain enough semantic information of the input image, which is the prerequisite for the decoder to fulfill the reconstruction task. This critical semantic information will be combined with aesthetics of images in the following module.

Hereinafter, we introduce the workflow of Auto-encoder in detail. The encoder $E$ extracts features from an input image $I$ . This procedure can be represented by $F = E (I)$ , where $F \in R^{h \times w \times c}$ denotes the extracted features. The number of channels of extracted features is denoted as $c$ , and $h \times w$ is the spatial resolution. Any CNN-based backbones for image classification tasks such as AlexNet,² VGG,²⁶ ResNet,¹ GoogLeNet,²⁷ and DenseNet²⁸ can be used here as the encoder. We take the feature extraction part of the backbone as encoder, which does not contain fully connected layers.

To facilitate pre-activation operations in the subsequent SAF module, features $F$ are separated into $M$ groups $\{G_{1}, G_{2}, \dots, G_{M}\}$, each of which contains $P$ feature maps; for example, the $m$th group is $G_{m} = \{G_{m}^{1}, G_{m}^{2}, \dots, G_{m}^{P}\}$, and thus $P = c / M$. The splitting operation is schematically illustrated in Figure 4.

Figure 4.

The procedure of splitting features into groups.

The decoder $D$ takes the extracted features $F$ as input and learns to reconstruct the original input image. This process can be expressed as $\hat{I} = D (F)$, where $\hat{I} \in R^{H \times W \times C}$ denotes the reconstructed image and $H, W, C$ are the height, width, and number of channels of $\hat{I}$. The decoder consists of six layers of transposed convolution, each of which is followed by normalization and an activation function. Its specific architecture is shown in Table 1. Note that to handle the issue of aspect ratio, the fully connected layer is excluded from the decoder.

Table 1.

Architecture of decoder.

	Layer 1	Layer 2	Layer 3	Layer 4	Layer 5	Layer 6
Kernel size	3	3	3	3	3	3
Input channel	c	48	32	16	12	8
Output channel	48	32	16	12	8	3
Stride	2	2	2	2	2	1
Padding	1	1	1	1	1	1
Output padding	1	1	1	1	1	0
Normalization	BN	BN	BN	BN	BN	None
Activation function	ReLU	ReLU	ReLU	ReLU	ReLU	Tanh

Note: Because the features $F$ are separated into $M$ groups with $P$ feature maps in each group, there are $c$ feature maps (i.e. channels) in total. BN: batch normalization.
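As a quick check of Table 1, the spatial size produced by each layer can be worked out with the standard output-size relation for transposed convolutions (the one documented for PyTorch's `ConvTranspose2d`, taken here with dilation 1). The sketch below is illustrative only: the function names are hypothetical and do not come from the original implementation. Under the settings in Table 1, each of the first five layers doubles the resolution and the sixth layer keeps it, so the decoder upsamples by a factor of 32.

```python
# (kernel size, stride, padding, output padding) for layers 1 to 6 of Table 1.
DECODER_LAYERS = [(3, 2, 1, 1)] * 5 + [(3, 1, 1, 0)]


def transposed_conv_out(size, kernel, stride, padding, output_padding, dilation=1):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def decoder_output_size(size):
    """Propagate one spatial dimension through all six decoder layers."""
    for kernel, stride, padding, output_padding in DECODER_LAYERS:
        size = transposed_conv_out(size, kernel, stride, padding, output_padding)
    return size


print(decoder_output_size(8))  # 8 -> 16 -> 32 -> 64 -> 128 -> 256 -> 256
```

For example, an 8 × 8 bottleneck feature map is expanded to 256 × 256. This matches the input resolution only if the encoder reduces the resolution by the same factor of 32, which is an assumption here rather than something stated in the text.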

The auto-encoder module uses the input image itself as a supervisory signal. The reconstruction loss between $\hat{I}$ and $I$ can be formulated by the mean square error as:

$$L_{rect}(\hat{I}, I) = \frac{1}{H \times W \times C} \| \hat{I} - I \|^{2} \tag{1}$$

By minimizing the reconstruction loss, the auto-encoder module can effectively extract the semantic information preserved in $F$ .

Semantic-Aesthetic Fusing Module

To explicitly combine the semantic information and the aesthetic information, we design a SAF module at the bottleneck layer of the proposed model. The goal of SAF module is to inject the aesthetic information into $F$ in a pre-activated manner prior to the final aesthetic prediction. To this end, we design a new operation called group average pooling (GAP), and apply it to each feature group $G_{m}$ , respectively. By doing so, an $M$ -dimensional feature vector $a$ is produced, which plays a key role in the following aesthetic pre-activation module.

Each component of $a = (a_{1}, a_{2}, \dots, a_{M})$ represents the pre-activation degree of the corresponding feature group. The process of the GAP is shown in Figure 5.

Figure 5.

Group Average Pooling (GAP). Feature $F$ is separated into $M$ groups $G_{1}, G_{2}, \dots, G_{M}$, each of which, denoted by $G_{m}$, contains $P$ feature maps $G_{m}^{1}, G_{m}^{2}, \dots, G_{m}^{P}$. GAP is conducted on each $G_{m}$ to obtain $a_{m}$. For each $G_{m}$, GAP first performs average pooling on each feature map $G_{m}^{p}$ and then averages the results over all $P$ feature maps belonging to $G_{m}$.
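The GAP operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name is hypothetical, and it assumes that the channels of $F$ are ordered so that each group occupies $P$ contiguous channels, which the text does not specify.

```python
import numpy as np


def group_average_pooling(features, num_groups):
    """Pool an (h, w, c) feature tensor into one value per channel group.

    Assumption: group m occupies the contiguous channels
    [m * P, (m + 1) * P), where P = c / M.
    """
    h, w, c = features.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    maps_per_group = c // num_groups  # P = c / M
    grouped = features.reshape(h, w, num_groups, maps_per_group)
    # Averaging over space and over the P maps in one step equals averaging
    # the per-map spatial means, because all maps have the same size.
    return grouped.mean(axis=(0, 1, 3))


# Works for any spatial size, e.g. a 4 x 6 map with M = 10 groups of P = 3 maps.
a = group_average_pooling(np.random.rand(4, 6, 30), num_groups=10)
print(a.shape)  # (10,)
```

Because the pooling averages over whatever spatial extent it receives, the output length depends only on $M$, which is what lets the module accept arbitrary aspect ratios.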

After obtaining $a$ with the semantic information, it is critical to consider how to combine it with the aesthetic information. In our model, a weak constraint is added to achieve this combination. Specifically, we develop an activation loss between $a$ and the aesthetic label $y$. The L1 distance is chosen to implement this weak constraint:

$$L_{act} = \sum_{m = 1}^{M} | y_{m} - a_{m} | \tag{2}$$

where $y_{m}$ and $a_{m}$ denote the $m$th component of the distributed aesthetic label $y$ and of the aesthetic pre-activation embedding $a$, respectively.

The main idea of our SAF module is to combine the semantic information with the aesthetic information using a weak constraint. Therefore, we do not intend to apply over-strict constraints on the similarity between distributions in this phase. L1 distance calculates the sum of the absolute value of the difference between the target value and the output value. The constraint of L1 distance is for each independent scalar value, regardless of the information of distribution. Thus, it is considered as a relatively weak constraint for distribution compared with Kullback–Leibler (KL) divergence, JS divergence, and so on. To sum up, L1 distance is more suitable for our activation loss. In the “Comparative Study” section, an experiment is conducted to test different specific forms of activation loss, and the result validates the correctness of this choice.
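The activation loss is simple enough to state directly in code. The sketch below is only an illustration with a hypothetical function name; it shows that each component contributes independently to the loss, without any term that ties the components together as a distribution.

```python
def activation_loss(label, pre_activation):
    """L1 distance between the aesthetic label y and the embedding a."""
    if len(label) != len(pre_activation):
        raise ValueError("label and pre-activation must have the same length")
    return sum(abs(y_m - a_m) for y_m, a_m in zip(label, pre_activation))


# Toy example with M = 3 score buckets.
print(activation_loss([0.1, 0.2, 0.7], [0.2, 0.2, 0.6]))  # approximately 0.2
```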

Alternatively, the fully connected layers can be used in the SAF module to obtain $a$ . However, in our model, GAP behaves more suitably for following two reasons. First, different from the prediction module, the SAF module is used for pre-activation, rather than making the final predictions of the image aesthetics. In other words, we need to establish the corresponding relationship between the feature group and the aesthetic score $y_{m}$ , but not to build a comprehensive projection from the feature $F$ to the final predictions. Therefore, introducing fully connected layers at the bottleneck layer does not conform to our weak constraint design concept of pre-activation. Moreover, several fully connected layers will bring a large number of parameters, which increases the risk of overfitting. Second, as mentioned before, it is an important ability for the IAA model to process images with arbitrary aspect ratios. In contrast to fully connected layers, the GAP operation can guarantee this more conveniently.

In addition, in our model, the value of $P$ is also a critical hyperparameter. The total dimensions of the feature $F$ at the bottleneck layer can be calculated as $D_{F} = (M \times P) \times h \times w$, where $M$ is determined by the human annotations of the specific data set, and $h$ and $w$ are determined by the size of the input image. Thus, $P$ is the only hyperparameter that can be adjusted to control the dimensions of the features $F$ in the bottleneck layer. To fit the pre-activation concept, the feature dimensions of the bottleneck layer should not be too high; otherwise they conflict with our concept of weak constraint and may increase the risk of overfitting. However, if the dimensions are too low, the representation capability of the model may not be sufficient.

Therefore, the model needs an appropriate $P$ value. Relevant comparative experiments are conducted in the “Comparative Study” section.

Prediction Module

The prediction module is composed of an SPP (Spatial Pyramid Pooling) layer²⁵ and a separated multi-way regressor. The prediction module receives the extracted features $F$ as input and predicts a distributed aesthetic score, denoted by $\hat{y} = ({\hat{y}}_{1}, {\hat{y}}_{2}, \dots, {\hat{y}}_{M})$ .

A fully connected layer is used in the regressor network to project features to $M$ -dimensional logits. Compared with convolutions, fully connected layers can realize the comprehensive interaction in the spatial dimension, so it can make more accurate predictions based on the extracted features. However, the characteristic that the fully connected layer can only receive fixed-dimensional features is in conflict with the input of arbitrary size. To solve this conflict, the SPP layer is applied here.

The specific process of SPP is shown in Figure 6. The SPP layer pools the features by using multiple sublayers. Each sublayer is composed of a different number of bins, and the pooling operation is performed in each bin (i.e. one value for each bin). The design of the combination of multiple sublayers with different sizes helps the model maintain multi-scale image information, since the number of bins in each sublayer is fixed regardless of the size of input features. The pooled features will be fixed to the same size. Then, these features can be handled by the regressor network.

Figure 6.

Spatial Pyramid Pooling (SPP). The SPP layer in this figure has three sublayers (purple, green, and blue), which consist of 1, 4, and 16 bins, respectively. No matter what the spatial size of the input feature map of this pooling layer is, its output is always a vector of length 21.
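The fixed-length property in Figure 6 can be illustrated with a small NumPy sketch for a single feature map. The function name is hypothetical, and two details are assumptions rather than statements from the text: max pooling is used inside each bin (as in the original SPP formulation), and bin boundaries are computed with floor and ceiling so that every bin is non-empty.

```python
import math

import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool one (h, w) feature map into 1 + 4 + 16 = 21 values."""
    h, w = feature_map.shape
    pooled = []
    for n in levels:  # each level splits the map into an n x n grid of bins
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(feature_map[r0:r1, c0:c1].max())
    return np.array(pooled)


# Different spatial sizes give vectors of the same length.
print(spatial_pyramid_pool(np.random.rand(8, 11)).shape)  # (21,)
print(spatial_pyramid_pool(np.random.rand(5, 13)).shape)  # (21,)
```

Applied to every channel of a feature group, this yields an embedding whose length depends only on the number of channels and bins, not on the input resolution.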

The regressor predicts the distributed aesthetic score $\hat{y}$ with these embeddings. In particular, to eliminate the confusion of aesthetic information between each group, a separated multi-way regressor is implemented to realize the final prediction. Specifically, $M$ independent regressors are used to process the feature embedding pooled from each group, instead of using a conventional joint regressor to process the entire features $F$ , because image aesthetics has been injected into each group through pre-activation in SAF module. As a result, in each group, semantic information has been combined with the corresponding aesthetic information. Thus, a relationship has been established between feature group $G_{m}$ and aesthetic score $y_{m}$ . If the feature embeddings are processed with a conventional joint regressor, it may cause the aesthetic information of different groups to interfere with each other and destroy the corresponding relationship mentioned above. If so, the prepended aesthetic pre-activation will in turn produce a negative effect on the final aesthetic prediction. In addition, due to the reduction in the number of connections, the number of parameters of the regressor will also be greatly reduced. Our ablation study in the “Ablation Study” section confirmed this conjecture. Figure 7 illustrates the difference between our separated multi-way regressor and conventional joint multi-way regressor.

Figure 7.

Comparison between conventional and separated multi-way regressor.
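To make the parameter saving concrete, the following sketch counts weights and biases under a simplifying assumption that is not taken from the text: each regressor is a single linear layer, the joint regressor maps all $M \times d$ pooled features to $M$ outputs, and each of the $M$ separated regressors maps the $d$ features of its own group to one output. The function names and the example value of $d$ are hypothetical.

```python
def joint_regressor_params(num_groups, dim_per_group):
    """One linear layer from all M * d pooled features to M outputs."""
    inputs = num_groups * dim_per_group
    return inputs * num_groups + num_groups  # weights + biases


def separated_regressor_params(num_groups, dim_per_group):
    """M independent linear layers, each from d features to one output."""
    return num_groups * (dim_per_group + 1)  # weights + bias per regressor


# Hypothetical sizes: M = 10 groups, d = 168 pooled features per group.
print(joint_regressor_params(10, 168))      # 16810
print(separated_regressor_params(10, 168))  # 1690
```

Under this assumption the separated design uses roughly $M$ times fewer parameters, since each output is connected to one group only.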

Similar to other studies,^16,29,22 EMD (Earth Mover’s Distance) is chosen as the loss function between the predicted distributed aesthetic score $\hat{y}$ and the label $y$ . EMD loss calculates the minimum cost of moving mass of one distribution to another. Unlike the traditional cross-entropy loss, EMD can measure the distance between two distributions taking the order among classes into account, which is more suitable for IAA task. EMD loss can be expressed as:

$$L_{aes}(y, \hat{y}) = \left( \frac{1}{M} \sum_{k = 1}^{M} | CDF_{y}(k) - CDF_{\hat{y}}(k) |^{r} \right)^{\frac{1}{r}} \tag{3}$$

where $CDF_{y}(k)$ denotes the cumulative distribution function $\sum_{m = 1}^{k} y_{m}$, $y_{m}$ is the $m$th component of the distributed aesthetic score, $M$ is the number of score buckets, and $r$ is the order of the norm.
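A small pure-Python sketch makes the order sensitivity of the EMD loss visible. The function name is hypothetical, the sum is normalized by the number of score buckets, and the default `r=2` is an assumption chosen for illustration, since the text does not state the value used.

```python
from itertools import accumulate


def emd_loss(label, prediction, r=2):
    """EMD between two discrete score distributions via their CDFs."""
    if len(label) != len(prediction):
        raise ValueError("distributions must have the same number of buckets")
    cdf_label = accumulate(label)       # running sums give the CDF
    cdf_pred = accumulate(prediction)
    total = sum(abs(p - q) ** r for p, q in zip(cdf_label, cdf_pred))
    return (total / len(label)) ** (1.0 / r)


truth = [1.0, 0.0, 0.0]
near = emd_loss(truth, [0.0, 1.0, 0.0])  # mass moved by one bucket
far = emd_loss(truth, [0.0, 0.0, 1.0])   # mass moved by two buckets
print(near < far)  # True
```

A cross-entropy loss would treat both wrong predictions alike, whereas the CDF-based distance penalizes the prediction that is further away in score.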

So far, all three loss functions of our method have been formulated. As mentioned before, our model is trained in the form of multi-task learning. The weighted sum of these three losses gives the total loss function $L_{t o l}$ of our method:

$$L_{tol} = \alpha_{1} L_{rect} + \alpha_{2} L_{act} + \alpha_{3} L_{aes} \tag{4}$$

where $α_{1}, α_{2}, α_{3}$ denote the weights for each loss, respectively.

Experiments

In this section, we first introduce the data sets used throughout our experiments and explain the necessary experimental details. Then, we present our quantitative results, ablation results, and visualization results, respectively.

Data Sets

Our experiments are conducted on a mainstream public data set called AVA.⁷ The AVA data set has approximately 250,000 images, and each image is voted on for aesthetic quality. The number of voters for each image ranges from 78 to 549, with an average of around 210. The aesthetic quality of each image in the AVA data set is denoted as a distributed aesthetic score ranging from 1 to 10. The distribution of the aesthetic score is normalized to the interval [0, 1] to represent the probability of each score. An example from the AVA data set is shown in Figure 8: the original image is on the left, and its corresponding distributed aesthetic score is on the right (i.e. $I$ and $y$ as introduced in the “Problem Definition” section). Moreover, the image quality assessment data set TID2013,³⁰ whose images carry MOS (mean opinion score) labels, is used for cross data set evaluation.

Figure 8.

One original image in the AVA data set⁷ and its corresponding distributed aesthetic score. Vertical axis at left is the counts of raters, and at right is the probability of each rating after normalizing.

Implementation Details

Image Preprocessing

Each original image from the data set is resized so that its short edge is 256 pixels. Note that the aspect ratio of the original image is not changed during this process. The aspect ratio is significant in the IAA task, as the aesthetics of an image deteriorate when its aspect ratio is changed. Similar to other studies,^20,22,29,31,32 random horizontal flip is used for data augmentation, because it does not change the aesthetics of the image.
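The target size for this resizing step can be computed as follows. This is a minimal sketch with a hypothetical function name; rounding the long edge to the nearest integer is an assumption, since the text does not say how fractional sizes are handled.

```python
def resized_shape(height, width, short_edge=256):
    """Scale (height, width) so the short edge equals short_edge."""
    shortest = min(height, width)
    new_height = round(height * short_edge / shortest)
    new_width = round(width * short_edge / shortest)
    return new_height, new_width


print(resized_shape(480, 640))  # (256, 341)
print(resized_shape(640, 480))  # (341, 256)
```

Both edges are scaled by the same factor, so the aspect ratio is preserved up to rounding.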

Backbone

Any CNN-based backbones for image classification tasks, such as AlexNet,² VGG,²⁶ ResNet,¹ GoogLeNet,²⁷ and DenseNet,²⁸ can serve as the encoder. VGG16 is chosen for fair comparison with other aesthetic assessment architectures.^22,33

Hyperparameters

The Stochastic Gradient Descent (SGD) optimizer with momentum is used in our experiments, in which momentum is set to 0.9. The learning rate is set to $5 \times 10^{-5}$. Our model is trained for 20 epochs. The loss weights $α_{1}$, $α_{2}$, and $α_{3}$ are set to 1, 0.1, and 1, respectively.

Software and Hardware Settings

All experiments presented below were conducted with PyTorch 1.7.0 and CUDA v11.2 on a server equipped with Intel® Xeon(R) W-2150B CPU @ 3.00GHz × 20 and GeForce RTX 2080 Ti.

Metrics

The following evaluation metrics were selected for testing. The average aesthetic score $μ$ is an important metric to measure the aesthetics of a given image $I$ . With the distributed aesthetic score $y$ , the average score $μ$ can be calculated as:

$$\mu = \sum_{m = 1}^{M} m \times y_{m} \tag{5}$$

Whether the predicted average score is close to the ground-truth is an important metric to evaluate the performance of the IAA model. The standard deviation of distributed aesthetic scores can reflect the consistency of people’s aesthetic opinions on an image. It can be calculated as:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (x_{i} - \mu)^{2}} \tag{6}$$
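Both statistics can be computed directly from the distributed score. The sketch below uses a hypothetical function name and relies on the fact that, when $y_{m}$ is the fraction of votes given to score $m$, the per-vote standard deviation equals $\sqrt{\sum_{m} y_{m} (m - \mu)^{2}}$.

```python
import math


def score_mean_and_std(distribution):
    """Mean and standard deviation of a score distribution over 1..M."""
    scores = range(1, len(distribution) + 1)
    mean = sum(m * y_m for m, y_m in zip(scores, distribution))
    variance = sum(y_m * (m - mean) ** 2 for m, y_m in zip(scores, distribution))
    return mean, math.sqrt(variance)


# Half of the votes on score 5 and half on score 6.
y = [0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0]
print(score_mean_and_std(y))  # (5.5, 0.5)
```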

Similar to other studies,^16,22,29,31 the correlations between predicted average aesthetic score and ground-truth are regarded as the most significant metrics for the IAA task. We calculate the Pearson Linear Correlation Coefficient (PLCC) as:

$$PLCC = \frac{\sum_{n = 1}^{N} (\mu_{n} - \bar{\mu}) (\hat{\mu}_{n} - \bar{\hat{\mu}})}{\sqrt{\sum_{n = 1}^{N} (\mu_{n} - \bar{\mu})^{2}} \sqrt{\sum_{n = 1}^{N} (\hat{\mu}_{n} - \bar{\hat{\mu}})^{2}}} \tag{7}$$

where ${\hat{μ}}_{n}$ and $μ_{n}$ are predicted and ground-truth average score of a given image $I_{n}$ , while $\bar{\hat{μ}}$ and $\bar{μ}$ are means of the predicted and ground-truth average scores. Another correlation-based metric is called Spearman Rank-order Correlation Coefficient (SRCC), which can be calculated as:

$$SRCC = 1 - \frac{6 \sum_{n = 1}^{N} (v_{n} - p_{n})^{2}}{N (N^{2} - 1)} \tag{8}$$

where $v_{n}$ is the rank of the ground-truth score $μ_{n}$ in the ground-truth scores, and $p_{n}$ is the rank of ${\hat{μ}}_{n}$ in the predicted score. SRCC is a nonparametric index to measure the dependence of two variables. When calculating the correlation coefficient of standard deviation, replace $μ$ with $σ$ .
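The two correlation metrics in Eqs. (7) and (8) can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' evaluation script. The rank-difference form of SRCC in Eq. (8) is exact only when there are no tied scores, so the sketch assumes distinct values; the helper name `ranks` is hypothetical.

```python
import math


def plcc(truth, pred):
    """Eq. (7): Pearson linear correlation between two score lists."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in truth))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (norm_t * norm_p)


def ranks(values):
    """1-based rank of each value in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def srcc(truth, pred):
    """Eq. (8): Spearman rank-order correlation via squared rank differences."""
    n = len(truth)
    v = ranks(truth)
    p = ranks(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(v, p))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


if __name__ == "__main__":
    truth = [4.0, 5.0, 6.0, 7.0]
    pred = [1.0, 3.0, 2.0, 4.0]
    print(plcc(truth, pred), srcc(truth, pred))
```

For data with ties, a library routine that assigns average ranks would be the safer choice.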

Besides, the accuracy of binary classification is another common metric used in some early IAA works.^{7,20,33} According to the calculated average score $μ_{n}$, we can set a threshold $t$ to divide the aesthetic quality of the input image into high or low (denoted as $C_{high}$ and $C_{low}$, respectively):

C = \begin{cases} C_{high}, & μ > t \\ C_{low}, & \text{otherwise} \end{cases}

(9)

However, note that in the real world there is no explicit boundary between beautiful and non-beautiful images, so setting an explicit threshold to divide images into two classes is debatable. Nevertheless, for comparison with previous IAA models,^{16,20,22,29,31,33} binary classification accuracy is also calculated in our experiments. Consistent with these works, $t$ is set to 5 for fair comparison.
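The thresholding rule of Eq. (9) and the resulting accuracy can be sketched as below. The code assumes that both the ground-truth and the predicted average scores are thresholded with the same $t = 5$ and that an image counts as correct when the two classes agree. The function names are illustrative.

```python
def aesthetic_class(mu, t=5.0):
    """Eq. (9): 'high' if the average score exceeds t, otherwise 'low'."""
    return "high" if mu > t else "low"


def binary_accuracy(truth_means, pred_means, t=5.0):
    """Fraction of images whose predicted class matches the ground-truth class."""
    if not truth_means or len(truth_means) != len(pred_means):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        aesthetic_class(mu, t) == aesthetic_class(mu_hat, t)
        for mu, mu_hat in zip(truth_means, pred_means)
    )
    return hits / len(truth_means)


if __name__ == "__main__":
    truth = [5.5, 4.2, 6.1, 5.0]
    pred = [5.1, 4.9, 4.8, 5.2]
    print(binary_accuracy(truth, pred))
```

Note that a score exactly equal to $t$ falls into the low class, because Eq. (9) uses a strict inequality.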

Comparison with Baselines

Table 2 compares the IAA performance of the proposed model with previous methods^{16,20,22,29,31,33} on the metrics mentioned above. Our model achieves competitive performance through its efficient use of semantic information. Specifically, the proposed model outperforms all the other baselines on SRCC (mean), SRCC (SD), and PLCC (SD), and achieves second place on PLCC (mean). Our model obtains a relatively inferior result on binary classification accuracy. Adaptive fractional dilated convolution (AFDC), in which the problem of the image aspect ratio is also properly solved, achieves the best performance in PLCC (mean) and binary classification accuracy. As mentioned before, there is no explicit boundary for classifying images as beautiful or ugly, so taking binary classification accuracy as a key metric is controversial.

Table 2.

Comparisons with baselines.

Method	SRCC (mean)	PLCC (mean)	SRCC (SD)	PLCC (SD)	Acc
DMA-Net(AlexNet)²⁰	–	–	–	–	75.41%
MNA-CNN(VGG)³³	–	–	–	–	77.10%
SSL-D-FPP(AlexNet)¹⁶	0.4940	0.5200	0.1600	0.1710	80.43%
NIMA(VGG)²²	0.5920	0.6100	0.2020	0.2050	80.60%
NIMA(Inception-V2)²²	0.6120	0.6360	0.2180	0.2330	81.51%
Ranking+reg(MobileNet-V2)²⁹	0.5420	0.5541	–	–	75.82%
Ranking+cls(MobileNet-V2)²⁹	0.5409	0.5535	–	–	75.50%
AFDC(ResNet)³¹	0.6489	0.6711	–	–	83.24%
Ours (VGG)	0.6585	0.6693	0.2649	0.2681	79.05%

The best and the second best results are highlighted in bold. “Mean” and “SD” indicate that the column is the correlation coefficient of the mean and standard deviation of the distributed aesthetic score. SRCC: Spearman rank-order correlation coefficient; PLCC: Pearson Linear Correlation Coefficient.

Ablation Study

Ablation studies were conducted to show the specific effect of each module in the proposed model. The ablation experiments in Table 3 demonstrate the effectiveness of our SAF module and the necessity of implementing the multi-way regressor in a separate manner. As shown in Table 3, the model trained without pre-activation (w/o Act) performs worst. This indicates that, without pre-activation to explicitly combine the semantic features with the aesthetic information, the raw semantic information itself is of little help to the IAA task, which supports the effectiveness of our pre-activation design. In addition, without pre-activation, the separated multi-way regressor becomes meaningless, and such a mismatched design may even further reduce the performance. In that setting, a joint regressor that can combine more information obtains better results (see the third row of Table 3). As shown in the second row of Table 3, when the separated multi-way regressor is replaced by a conventional joint regressor (w/o Split), the model still achieves relatively high performance, although slightly worse than our complete model. This is because the separated regressor cooperates better with the pre-activation and eliminates the confusion of aesthetic information between groups.

Table 3.

Results of ablation studies.

Model	SRCC	PLCC
w/o Act	0.5444	0.5524
w/o Split	0.6509	0.6656
w/o Act & w/o Split	0.6279	0.6430
Complete model	0.6585	0.6693

SRCC: Spearman rank-order correlation coefficient; PLCC: Pearson Linear Correlation Coefficient.

Comparative Study

In this section, we conducted comparative experiments with respect to the SAF module of the proposed method.

The Dimensions of the Features in Bottleneck Layer

The first comparative experiment concerns the feature dimensions of the bottleneck layer. As mentioned before, the features $F$ in the bottleneck layer are separated into $M$ groups to facilitate the pre-activation in the SAF module. That is, the dimensions of the middle-layer feature $F$ are $(M \times P) \times H \times W$. As discussed in the “Semantic-Aesthetic Fusing Module” section, $P$ is the only hyperparameter we can choose to control the dimensions of $F$. The value of $P$ should not only comply with the weak constraint concept of the SAF module but also maintain sufficient representational ability in the bottleneck layer. The results are shown in Figure 9. On the whole, the performance of the model decreases as the feature dimension increases. The reason is that an overly high feature dimension conflicts with the weak constraint concept of our SAF module and increases the influence of overfitting. However, when $P = 1$, the SRCC of the model is slightly worse than when $P = 2$. This suggests that too small a value of $P$, which may leave the bottleneck layer with insufficient capacity, also damages the performance of our model. Finally, considering both our design concept and model performance, $P$ is set to $1$ in our model.

Figure 9.

Results of comparative study for feature dimensions.
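To make the $(M \times P) \times H \times W$ layout concrete, the following sketch computes the bottleneck shape and the channel range owned by each of the $M$ groups. It assumes that the groups occupy contiguous blocks of $P$ channels, which is an assumption of this example rather than something stated in the text. The spatial size used in the demo is hypothetical.

```python
def bottleneck_shape(M, P, H, W):
    """Shape of the bottleneck feature F: (M * P) channels over an H x W map."""
    return (M * P, H, W)


def group_slices(M, P):
    """Half-open channel ranges [start, stop) for each of the M groups.

    Assumes contiguous grouping: group g owns channels g*P .. (g+1)*P - 1.
    """
    return [(g * P, (g + 1) * P) for g in range(M)]


if __name__ == "__main__":
    # M = 10 rating levels; H = W = 7 is a hypothetical spatial size.
    for P in (1, 2, 4):
        print(P, bottleneck_shape(10, P, 7, 7), group_slices(10, P)[:2])
```

The demo shows that the channel count grows linearly with $P$, which is the quantity varied in this comparative study.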

Specific Forms of Activation Losses

We conducted a comparative experiment to validate our choice of the specific form of the activation loss. The results are shown in Table 4. The model performs worst when the activation loss is the JS divergence and slightly better when the KL divergence is adopted. However, models with these over-strict constraints on the distribution perform much worse than the model using the L1 distance. The experimental results conform to our analysis in the “Our Proposal” section: the activation loss is not used to directly produce the final prediction of the distributed aesthetic score, but to explicitly combine the aesthetic information and semantic information of the image in a pre-activated manner. Therefore, compared with an over-strict distribution constraint, a weak constraint such as the L1 distance is preferable here.

Table 4.

Comparative studies of the forms of activation losses.

Model	SRCC (mean)	PLCC (mean)
JS divergence	0.6089	0.6219
KL divergence	0.6205	0.6329
L1	0.6585	0.6693

SRCC: Spearman rank-order correlation coefficient; PLCC: Pearson Linear Correlation Coefficient; JS: Jensen–Shannon; KL: Kullback–Leibler.
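For reference, the three candidate forms compared in Table 4 can be written for two discrete distributions as in the sketch below. It assumes both inputs are normalized, non-negative vectors of equal length. The small constant `eps` is added only for numerical safety and is a choice of this example. How the authors reduce or scale these terms inside the activation loss is not specified here.

```python
import math


def l1_distance(p, q):
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) with a small eps for stability."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence: symmetric average of KL to the midpoint."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid, eps) + 0.5 * kl_divergence(q, mid, eps)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(l1_distance(p, q), kl_divergence(p, q), js_divergence(p, q))
```

The L1 distance only penalizes element-wise gaps, whereas KL and JS compare the two vectors as full probability distributions, which matches the text's description of them as stricter constraints.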

Visualization

Visualization experiments were conducted to provide some intuitive understandings of the proposed model.

To better understand the working mechanism of the pre-activation applied in the SAF module, we visualize the heatmap of $a_{n}$ using Grad-CAM.³⁴ The $M$-dimensional distributed aesthetic score can be regarded as $M$ ordered classes. To adapt Grad-CAM to the IAA task, we made some modifications to the original method.³⁴ Specifically, the original Grad-CAM calculates the heatmap of the class with the highest probability in the output prediction and ignores the classes with lower probability. This is reasonable in an image recognition task, but not in the IAA task, where the other classes are also meaningful because they represent the probabilities of the other scores. Therefore, we calculate the weighted sum of the heatmaps of all classes, which can be expressed as:

H = \frac{1}{M} \sum_{m = 1}^{M} a_{m} \times H_{m}

(10)

where $a_{m}$ and $H_{m}$ are the $m$th component of $a$ and the heatmap of the $m$th rating, respectively. Thus, the final activation map can more accurately reflect the spatial position of the aesthetically critical areas in the original image. As shown in the second column of Figure 10, the heatmap of $a$ generated by the Auto-Encoder alone is almost uniform over the whole image, because the Auto-Encoder aims only to reconstruct the original image under the supervision of the $L_{2}$ distance at the pixel level. In this case, no other task needs to be completed. That is, only semantic information is extracted by the Auto-Encoder, and aesthetic information is not combined with it. In the third column, the activation loss is removed from our complete model. Besides reconstructing the original image, this model also needs to minimize the EMD loss. Compared with the second column, the model focuses on certain regions related to image aesthetics. However, without the activation loss, the combination of semantic information and aesthetic information cannot be realized. The last column shows the heatmap of $a$ for our complete model. The highlighted region focuses on the critical aesthetic-aware area in the image. As expected, our complete model successfully combines semantic information with aesthetic information with the help of the SAF module.
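The fusion in Eq. (10) can be sketched in a few lines of plain Python. The code assumes that the per-rating heatmaps $H_{m}$ have already been computed by Grad-CAM and are given as equally sized 2-D lists. The function name `fuse_heatmaps` is hypothetical, and the Grad-CAM computation itself is outside the scope of this sketch.

```python
def fuse_heatmaps(a, heatmaps):
    """Eq. (10): H = (1 / M) * sum_m a_m * H_m.

    a        -- list of M weights (the components a_m)
    heatmaps -- list of M heatmaps, each a 2-D list with identical shape
    """
    M = len(a)
    if M == 0 or len(heatmaps) != M:
        raise ValueError("need one heatmap per component of a")
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for a_m, H_m in zip(a, heatmaps):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += a_m * H_m[i][j] / M
    return fused


if __name__ == "__main__":
    a = [0.25, 0.75]
    heatmaps = [[[1.0, 0.0]], [[0.0, 1.0]]]
    print(fuse_heatmaps(a, heatmaps))
```

Unlike the original Grad-CAM, which keeps only the top class, every rating contributes to the fused map in proportion to its weight $a_{m}$.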

Figure 10.

Grad-CAM visualization. (a) The original image in the data set; (b), (c), and (d) heatmaps overlaid on the original image for the vanilla Auto-Encoder, our model without the activation loss, and our complete model, respectively.

t-SNE Visualization

To further evaluate the effectiveness of our IAA model, we visualize the ground-truth and predicted distributed aesthetic scores using t-SNE embedding.³⁵ Specifically, we selected $512$ samples from the AVA data set.⁷ Their $10$-dimensional distributed aesthetic scores are reduced to $2$ dimensions and plotted. The yellow and blue points in the figure represent the labels and predictions of the distributed aesthetic scores, respectively (i.e. $y$ and $\hat{y}$). The closer they are in the figure, the more accurate the model prediction is. In Figure 11(a), the distributed aesthetic scores are produced by a randomly initialized model. The predicted and ground-truth distributed aesthetic scores are completely separated, because the model has not learnt anything yet. In (b), the prediction is produced by the proposed model without pre-activation. $\hat{y}$ and $y$ start to mix, because the EMD loss $L_{emd}$ constrains the aesthetic prediction to be close to the label, but they are not mixed well enough. In (c), the prediction is produced by our complete model. $\hat{y}$ and $y$ are much closer to each other than in the first two models. This demonstrates that the pre-activation in our SAF module indeed helps to predict the distributed aesthetic score.

Figure 11.

t-SNE visualization: (a) is produced by a randomly initialized model; (b) is produced by our model without pre-activation; (c) is produced by the complete model.

Cross Data Sets Validation

A single data set is usually affected by the profession, personality, age, mood, and other characteristics of its raters. In other words, training on a single data set may mislead the model into fitting the raters' personal factors, which harms its ability to perform stably across different data sets. Note that when the training data set differs from the testing data set, it is meaningless to use a specific value as the boundary of binary classification, so only the correlation metrics are reported. As shown in Table 5, our method is significantly ahead of previous methods^{22,29} in both metrics. This is due to the reconstruction effect of the Auto-Encoder, which forces the model not only to learn the aesthetic information but also to maintain the complete semantic information of the original image. Therefore, when the aesthetic labels deviate between data sets, the influence of the above rater-related factors in the training set can be relieved.

Table 5.

Cross data sets evaluation.

Model	SRCC	PLCC
NIMA²²	0.4320	0.5140
ranking+reg²⁹	0.3971	0.4855
ranking+cls²⁹	0.4111	0.5009
ours	0.5194	0.5447

SRCC (mean) and PLCC (mean) shown in this table are tested in the TID2013 data set with the model trained on the AVA data set without any fine-tuning. SRCC: Spearman rank-order correlation coefficient; PLCC: Pearson Linear Correlation Coefficient.

Conclusion

In this article, we design a novel IAA model motivated by using semantic information effectively through fusing it with aesthetic information. Our model takes the self-supervised Auto-Encoder module as a branch to extract the semantic information of images without using any additional manual labels. To use this semantic information efficiently, a fusing module is prepended at the bottleneck layer of our whole model. This fusing module is designed to inject image aesthetics into the semantic features using a weak constraint. To eliminate the confusion of aesthetic information caused by this fusing module, a separated multi-way regressor is implemented to replace the conventional one. In addition, we utilize the spatial pyramid pooling layer to enable our model to adapt to arbitrary aspect ratios. Experimental results on the mainstream IAA data set demonstrate the effectiveness of the proposed method.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is partially supported by National Natural Science Foundation of China (62001099) and supported by the Fundamental Research Funds for the Central Universities of China (17D110408).

ORCID iD

Rong Huang

References

1. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp. 770–778. New York: IEEE.

2. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Adv Neur In 2012; 25: 1097–1105.

3. Redmon, Divvala, Girshick, et al. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp. 779–788. New York: IEEE.

4. Ren, Girshick, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE T Pattern Anal 2016; 39(6): 1137–1149.

5. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, 7–12 June 2015, pp. 3431–3440. New York: IEEE.

6. Ronneberger, Fischer, Brox. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, Hornegger, Wells, et al. (eds) Proceedings of the international conference on medical image computing and computer-assisted intervention (MICCAI). Cham: Springer, 2015, pp. 234–241.

7. Murray, Marchesotti, Perronnin. AVA: A large-scale database for aesthetic visual analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Providence, RI, 16–21 June 2012, pp. 2408–2415. New York: IEEE.

8. Datta, Joshi, et al. Studying aesthetics in photographic images using a computational approach. In: Leonardis, Bischof, Pinz (eds) Proceedings of European conference on computer vision (ECCV). Berlin: Springer, 2006, pp. 288–301.

9. Dhar, Ordonez, Berg. High level describable attributes for predicting aesthetics and interestingness. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Colorado Springs, CO, 20–25 June 2011, pp. 1657–1664. New York: IEEE.

10. Tang, Jing. The design of high-level features for photo quality assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), New York, 17–22 June 2006, pp. 419–426. New York: IEEE.

11. Nishiyama, Okabe, Sato, et al. Aesthetic quality classification of photographs based on color harmony. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Colorado Springs, CO, 20–25 June 2011, pp. 33–40. New York: IEEE.

12. Sun, Yao, et al. Photo assessment based on computational visual attention model. In: Proceedings of ACM international conference on multimedia (MM), Beijing, China, 19–24 October 2009, pp. 541–544. New York: ACM.

13. Cortes, Vapnik. Support-vector networks. Mach Learn 1995; 20(3): 273–297.

14. Breiman. Random forests. Mach Learn 2001; 45(1): 5–32.

15. Gardner, Dorling. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmos Environ 1998; 32(14–15): 2627–2636.

16. Ching, See, Wong L-K. Learning image aesthetics by learning inpainting. In: IEEE international conference on image processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020, pp. 2246–2250. New York: IEEE.

17. Jin, et al. Predicting aesthetic score distribution through cumulative Jensen-Shannon divergence. In: Proceedings of the AAAI conference on artificial intelligence (AAAI), vol. 32, New Orleans, LA, 2–7 February 2018.

18. Kao, Huang. Deep aesthetic quality assessment with semantic information. IEEE T Image Process 2017; 26(3): 1482–1495.

19. Lin, Jin, et al. Rating image aesthetics using deep learning. IEEE T Multimedia 2015; 17(11): 2021–2034.

20. Lin, Shen, et al. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In: Proceedings of the IEEE international conference on computer vision (ICCV), Santiago, 7–13 December 2015, pp. 990–998. New York: IEEE.

21. Sheng, Dong, Chai, et al. Revisiting image aesthetic assessment via self-supervised feature learning. In: Proceedings of the AAAI conference on artificial intelligence (AAAI), vol. 34, New York, 7–12 February 2020, pp. 5709–5716. Palo Alto, CA: AAAI Press.

22. Talebi, Milanfar. Nima: Neural image assessment. IEEE T Image Process 2018; 27(8): 3998–4011.

23. Zhang, Gao, et al. A gated peripheral-foveal convolutional neural network for unified image aesthetic prediction. IEEE T Multimedia 2019; 21(11): 2815–2826.

24. Liu, Zhang, Hou, et al. Self-supervised learning: Generative or contrastive. IEEE T Knowl Data En 2021.

25. Zhang, Ren, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE T Pattern Anal 2015; 37(9): 1904–1916.

26. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the international conference on learning representations (ICLR) (ed Bengio, LeCun), San Diego, CA, 7–9 May 2015.

27. Szegedy, Liu, Jia, et al. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, 7–12 June 2015, pp. 1–9. New York: IEEE.

28. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 21–26 July 2017, pp. 4700–4708. New York: IEEE.

29. Pfister, Kobs, Hotho. Self-supervised multi-task pretraining improves image aesthetic assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops (CVPRW), Nashville, TN, 19–25 June 2021, pp. 816–825. New York: IEEE.

30. Ponomarenko, Jin, Ieremeiev, et al. Image database TID2013: Peculiarities, results and perspectives. Signal Proces: Image 2015; 30: 57–77.

31. Chen, Zhang, Zhou, et al. Adaptive fractional dilated convolution network for image aesthetics assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, WA, 13–19 June 2020, pp. 14114–14123. New York: IEEE.

32. Liu, Chen. A-lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 21–26 July 2017, pp. 4535–4544. New York: IEEE.

33. Mai, Jin, Liu. Composition preserving deep photo aesthetics assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp. 497–506. New York: IEEE.

34. Selvaraju, Cogswell, Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision (ICCV), Venice, 22–29 October 2017, pp. 618–626. New York: IEEE.

35. Van der Maaten, Hinton. Visualizing data using t-SNE. J Mach Learn Res 2008; 9(11): 2579–2605.

Pre-Activating Semantic Information for Image Aesthetic Assessment
