Abstract
The current state of the art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets, comparing the performance of the studied variants and the impact of the FNE on a levelled playground.
Keywords
Introduction
One of the main challenges of the semantic web is vagueness, the difficulty of representing imprecise concepts. An increasing trend in the community is to use vector representations of vague concepts. Vector representations allow for the evaluation of concept similarity simply by computing a vector distance. No less important is the possibility of obtaining these vector representations automatically. The use of automated, large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web [7].
Deep learning methods are representation learning techniques which can be used to generate such vectors. The models obtained from these methods are composed of multiple processing layers that learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection and many other domains [23]. The use of deep learning vector embeddings to represent words has had a substantial impact in many natural language processing tasks [6]. Similarly, deep learning image embeddings have shown great generalisation capabilities, even between distant domains [13]. In this regard, we argue that the semantic web can significantly benefit from the use of deep learning-based embeddings.
In this paper we focus on multimodal pipelines, which tackle two problems in parallel. First, the problem of obtaining a semantically meaningful embedding of an image representing a scene. Second, the problem of obtaining a visually meaningful embedding of a sentence describing a scene. This is done through the construction of a joint embedding, representing both modalities: an image of a scene, and a caption describing it.
The joint embedding constructed can be used to correlate images with sentences easily. As an example, imagine an e-commerce website where sellers upload images of the product to sell. Some sellers may add a very accurate textual description too, while others' descriptions may be incomplete, inaccurate or missing altogether. On the other side, buyers search for the desired product by writing a free-text description of it. The use of a multimodal embedding as proposed can help to link the textual information provided by the buyer with the product that best matches it, regardless of whether an accurate description for that individual product was provided. In general, this approach can be used to automatically include representations of uncaptioned images in a semantic web.
The proposed methodology can also have an impact on semantic web technologies in the disambiguation of vague semantics. Take for instance the concept
Information retrieval is a natural way to assess the quality of joint embedding methods [14]. Image annotation (also known as caption retrieval) is the task of automatically associating an input image with a describing text. The complementary task of associating an input text with a fitting image is known as image retrieval or image search.
State-of-the-art image annotation methods are currently based on deep neural network representations, where an image embedding (a vector representation of the image) and a text embedding are projected into a joint multimodal space.
The main goal of this paper is to explore the impact of using a Full-Network embedding (FNE) [13] to generate the image embedding required by multimodal pipelines, replacing the standard one-layer embedding. We do so by integrating the FNE into the multimodal embedding pipeline defined in [21], as well as into its main variants [11,37].
The generic pipeline defined by Kiros et al. [21] was later extended by Vendrov et al. [37] and by Faghri et al. [11]; we evaluate the FNE on all of these schemes.
We report the resulting improvements in our implementation, which increase the performance of the original method [21] as well. Finally, we exhaustively test the main variations on a levelled playground, obtaining insights on the real impact on performance of each of them. Indeed, properly assessing the sources of empirical gains is a key aspect of research that should be further encouraged [25]. Evaluation is done using three publicly available datasets: Flickr8K [30], Flickr30K [42] and MSCOCO [24].
To sum up, the contributions of this paper are:
Integration of the FNE into the generic pipeline defined by Kiros et al. [21], as well as into the variants proposed by [37] and [11].
Comparative study of the impact on performance of the main variants introduced by [37] and [11] under equality of the rest of hyper-parameters.
Exhaustive study of optimal hyper-parameter configuration for the previous methods.
Novel curriculum learning process to further increase the training stability and performance of Order++ and VSE++ [11].
The rest of the paper is structured as follows. In Section 2, the main approaches existing in the literature for the image/caption retrieval problem are reviewed. This review introduces the basic methodology by Kiros et al. [21], on which this work builds. The following sections detail the methods studied, the experimental setup, the results obtained, an analysis of the MOE training behaviour, and our conclusions.
Related work
This paper builds upon the methodology described by Kiros et al. [21], which encodes captions with a recurrent neural network and images with a CNN, projecting both into a common multimodal space trained with a pairwise ranking loss.
Following the same pipeline [21], Vendrov et al. [37] propose to replace the symmetric cosine similarity with an asymmetric order-violation penalty, capturing the hierarchical relation between captions and the images they describe (see Section 3.3).
Also using two different neural networks for image and text, and the ranking loss as methodological keystone, we find the Embedding Network (EN) presented in [41] and the Word2VisualVec (W2VV) model [9]. The first approach (EN) introduces a novel neighbourhood constraint in the form of additional loss penalties that preserve the structure within each modality, while the second (W2VV) learns to project the text directly into the visual feature space.
A substantially different group of methods is based on Canonical Correlation Analysis (CCA). A first successful approach in this direction is the use of Fisher Vectors (FVs) [22]. FVs are computed with respect to the parameters of a Gaussian Mixture Model (GMM) and a Hybrid Gaussian-Laplacian Mixture Model (HGLMM). For both images and text, FVs are built using deep neural network features: a CNN for image features, and word2vec [27] for text features. A more recent approach based on the same CCA methodology [10] introduces a novel bidirectional neural network architecture. This architecture is based on two channels which share weights: one channel maps images to sentences while the other goes in the opposite direction. Losses are applied in each projection and in a middle layer. The loss in the middle layer seeks to ensure the correlation between both representations at this point; instead of the CCA objective, a more efficient Euclidean loss is used. Since both methods rely on a CNN representation of the image, the introduction of the FNE in these pipelines should be straightforward.
Attention-based models are another family of competitive solutions for tackling multimodal tasks. Dual Attention Networks (DANs) [29] currently hold the best results on the Flickr30K dataset. On a general pipeline similar to [21], DANs introduce two additional small neural networks as attention mechanisms for images and captions. This allows DANs to estimate the similarity between images and sentences by focusing on their shared semantics. In a similar fashion, the selective multimodal Long Short-Term Memory network (sm-LSTM) [15] includes a multimodal context-modulated attention scheme at each time-step. This mechanism can selectively attend to a pair of instances of image and sentence, by predicting pairwise instance-aware saliency maps for image and sentence. All attention-based methods rely on CNN representations of the images, as the previously described methods do. However, they differ in that the representations are obtained from the last convolutional layer. At this level, information on feature positions is available, allowing for the use of attention mechanisms. In contrast, the FNE obtains a compact representation of the whole image at the cost of losing the spatial information. Applying the FNE methodology to those techniques would require significant modifications to the FNE schema, and is one of our main lines of future work.
Methods
The multimodal embedding pipeline of Kiros et al. [21], with the FNE integrated into it, is depicted in the figure below.

Overview of the proposed multimodal embedding generation pipeline with the integrated Full-Network embedding. Elements colored in orange (dark grey) are components modified during the neural network training phase. During testing, only one of the inputs is provided.
The FNE [13] generates a vector representation of an input image by processing it through a pre-trained CNN, extracting the neural activations of all convolutional and fully-connected layers. After the initial feature extraction process, the FNE performs a dimensionality reduction step for convolutional activations, by applying a spatial average pooling on each convolutional filter. After the spatial pooling, every feature (from both convolutional and fully-connected layers) is standardized through the z-score, using the mean and standard deviation of each feature across the whole dataset. This standardization makes features obtained from different layers comparable in range.
The last step of the FNE is a feature discretization process. The previously standardized embedding is usually of large dimensionality (12,416 features for a VGG16 network), and is discretized into the values {-1, 0, 1}, which respectively identify features that are characteristic by absence, uncharacteristic, and characteristic by presence for a given image.
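As an illustration, the following minimal sketch reproduces the FNE post-processing, assuming the per-layer activations have already been extracted from the CNN; the discretization thresholds shown are illustrative placeholders (the actual values are derived in [13]).

```python
import numpy as np

def full_network_embedding(conv_acts, fc_acts, feat_mean, feat_std,
                           th_low=-0.25, th_high=0.15):
    """Sketch of the FNE post-processing of [13].

    conv_acts: list of arrays of shape (H, W, C), one per conv layer.
    fc_acts:   list of 1-D arrays, one per fully-connected layer.
    feat_mean, feat_std: per-feature statistics computed on the dataset.
    th_low, th_high: discretization thresholds (illustrative values).
    """
    # Spatial average pooling: one value per convolutional filter.
    pooled = [a.mean(axis=(0, 1)) for a in conv_acts]
    # Concatenate all layers into a single embedding (12,416-D for VGG16).
    emb = np.concatenate(pooled + list(fc_acts))
    # Standardize each feature (z-score) using dataset-wide statistics.
    emb = (emb - feat_mean) / feat_std
    # Discretize into {-1, 0, 1}: characteristic by absence,
    # uncharacteristic, and characteristic by presence.
    disc = np.zeros_like(emb)
    disc[emb > th_high] = 1.0
    disc[emb < th_low] = -1.0
    return disc
```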
Multimodal embedding
In our approach, we integrate the FNE with the multimodal embedding pipeline of Kiros et al. [21], replacing the one-layer image embedding originally used.
In simple terms, the pipeline training procedure consists of the optimisation of the pairwise ranking loss between the correct image-caption pair and a random pair. Assuming that a correct pair of elements should be closer in the multimodal space than a random pair, the loss is defined as

$$\ell_{SH}(i,c) = \sum_{c'} \max\big(0, \alpha - s(i,c) + s(i,c')\big) + \sum_{i'} \max\big(0, \alpha - s(i,c) + s(i',c)\big)$$

where $i$ and $c$ are the embeddings of a matching image and caption, $c'$ and $i'$ range over the contrastive (non-matching) captions and images in the batch, $\alpha$ is the margin hyper-parameter, and $s$ is the similarity function.
The similarity metric proposed in [21] is the cosine similarity; since both embeddings are L2-normalised, it reduces to the dot product $s(i,c) = i \cdot c$.
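A minimal sketch of this similarity and loss over a batch, assuming the rows of `I` and `C` are the L2-normalised embeddings of matching image-caption pairs (the margin value is an illustrative default):

```python
import numpy as np

def sh_loss(I, C, alpha=0.2):
    """Sum-of-hinges pairwise ranking loss over a batch.

    I, C: (batch, dim) L2-normalised image and caption embeddings;
          row k of I matches row k of C. alpha is the margin.
    """
    S = I @ C.T                      # cosine similarities (unit vectors)
    pos = np.diag(S)                 # similarities of the correct pairs
    # Hinge against contrastive captions and contrastive images.
    cost_c = np.maximum(0.0, alpha - pos[:, None] + S)
    cost_i = np.maximum(0.0, alpha - pos[None, :] + S)
    # The diagonal holds the correct pairs: not contrastive examples.
    np.fill_diagonal(cost_c, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_c.sum() + cost_i.sum()
```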
Multimodal order embedding
Using the same general schema, Vendrov et al. [37] propose to replace the symmetric cosine similarity with an asymmetric order-violation penalty, $E(i,c) = \lVert \max(0, c - i) \rVert^2$, used as a negative similarity $s(i,c) = -E(i,c)$. Embeddings are first mapped to the positive orthant by taking their absolute value, and the penalty reflects the hierarchical relation between a caption and the images it describes.
Notice that since image and caption embeddings are normalised to have unit L2-norm, both lie on a hypersphere centred at the coordinate origin; thus, a perfect order-embedding cannot be achieved unless both are the same vector, which is extremely unlikely to happen.
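A minimal sketch of the order-violation similarity, under our reading of [37]; the element-wise absolute value maps embeddings to the positive orthant:

```python
import numpy as np

def order_sim(I, C):
    """Negative order-violation penalty between all images and captions.

    I, C: (batch, dim) embeddings; the absolute value is applied first
    so that all coordinates live in the positive orthant, as in [37].
    Returns an (images, captions) similarity matrix.
    """
    I, C = np.abs(I), np.abs(C)
    # Penalise caption coordinates exceeding the image coordinates.
    viol = np.maximum(0.0, C[None, :, :] - I[:, None, :])
    return -(viol ** 2).sum(axis=-1)
```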
Maximum error loss
A recent contribution to the field [11] proposes to compute the loss focusing only on the worst contrasting example (i.e., the hardest negative within the batch), replacing the sums of the ranking loss with maxima:

$$\ell_{MH}(i,c) = \max_{c'} \max\big(0, \alpha - s(i,c) + s(i,c')\big) + \max_{i'} \max\big(0, \alpha - s(i,c) + s(i',c)\big)$$
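For comparison, a sketch of this maximum-of-hinges variant, which differs from the previous sum loss only in how the batch is reduced:

```python
import numpy as np

def mh_loss(I, C, alpha=0.2):
    """Max-of-hinges loss of [11]: keep only the hardest negative."""
    S = I @ C.T
    pos = np.diag(S)
    cost_c = np.maximum(0.0, alpha - pos[:, None] + S)
    cost_i = np.maximum(0.0, alpha - pos[None, :] + S)
    np.fill_diagonal(cost_c, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    # Hardest contrastive caption per image, hardest image per caption.
    return cost_c.max(axis=1).sum() + cost_i.max(axis=0).sum()
```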
Curriculum learning
Faghri et al. [11] obtain their best results using the maximum-based losses. However, these losses can make training unstable: depending on the hyper-parameters and the random initialisation of the weights, the model may fail to start learning at all (we study this behaviour in Section 6).
To address this, we define a simple curriculum learning approach that combines the robustness of the sum-based losses with the superior final performance of the maximum-based ones.
We propose to train the model using the sum of errors loss (SH or OE) during a first phase, and to then continue training with the corresponding maximum-based loss (MH or MOE) in a second phase. The hyper-parameters used in each phase are detailed in Table 1.
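Schematically, the two-phase training can be expressed as follows; `train_epoch` and `validate` are hypothetical helpers standing in for our training and validation routines, and the epoch budgets are placeholders (the actual values per method are given in Table 1):

```python
def curriculum_train(model, data, sum_loss, max_loss, train_epoch, validate,
                     epochs_phase1=200, epochs_phase2=200):
    """Sketch of the proposed curriculum learning procedure.

    Phase 1 pre-trains with the robust sum-based loss (SH or OE);
    phase 2 refines with the maximum-based loss (MH or MOE).
    train_epoch and validate are hypothetical helper callables.
    """
    best_score, best_model = None, None
    for phase_loss, epochs in ((sum_loss, epochs_phase1),
                               (max_loss, epochs_phase2)):
        for _ in range(epochs):
            train_epoch(model, data, loss_fn=phase_loss)
            score = validate(model, data)  # sum of recalls on validation
            if best_score is None or score > best_score:
                best_score, best_model = score, model
    return best_model, best_score
```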
We performed preliminary experiments using this methodology to apply a learning rate reduction, which resulted in small performance gains for some algorithms. We keep these results out of the paper, as we do not consider them conclusive enough, and to avoid overshadowing more relevant contributions.
Experiments
In this section, we evaluate the impact of using the FNE in a multimodal pipeline for both image annotation and image retrieval tasks. We extend our previous work [39], introducing the FNE in different multimodal pipelines. To properly measure the relevance of the FNE, we compare the results obtained with those of the original multimodal pipelines (i.e., using the one-layer FC7 image embedding).
We identify the different combinations of embedding and multimodal pipeline with a notation in the form of EMB-PIPE. EMB denotes the embedding, being either FNE (for the full network embedding) or FC7 (for the baselines using the last fully-connected CNN layer, fc7). PIPE denotes the loss used during training, following the naming convention of Section 4.2.
Datasets
In our experiments we use three different and publicly available datasets:
The Flickr8K dataset [30] contains 8,000 images, each of them annotated with five textual descriptions. We use the standard splits of 6,000 images for training, 1,000 for validation and 1,000 for testing.
The Flickr30K dataset [42] extends Flickr8K to roughly 31,000 images, also annotated with five captions each. Following common practice, 1,000 images are used for validation, 1,000 for testing, and the rest for training.
The MSCOCO dataset [24] contains over 120,000 images, each of them annotated with five captions. Published results differ in the splits used for training, validation and testing, which complicates direct comparisons and must be taken into account when interpreting the reported results.
Experimental setup
We investigate the impact of the FNE on the methods proposed in [11,21,37], and on the curriculum learning methodology proposed in Section 3.5. The methods are named following the convention of [11]. Notice that all losses are actually based on a hinge loss:
Sum of Hinge Loss (SH): the original pairwise ranking loss of [21], summing the hinge over all contrastive examples in the batch.
Maximum of Hinge Loss (MH): the loss of [11], using only the hardest contrastive example in the batch.
Sum of Order Embedding Loss (OE): the loss of [37], combining the sum-based hinge with the order-violation similarity.
Maximum of Order Embedding Loss (MOE): the order-violation similarity combined with the maximum-based hinge, corresponding to the Order++ variant of [11].
Pre-trained Hinge Loss (PH): our curriculum learning approach, pre-training with SH and continuing with MH.
Pre-trained Order Embedding Loss (POE): our curriculum learning approach, pre-training with OE and continuing with MOE.
The details of the hyper-parameters used in the experiments for each method can be found in Table 1.
Implementation details
The devil is in the details. To facilitate the reproducibility and interpretability of our work, in this section we provide all the details regarding our implementation. The Theano [36] based implementation we used is available at [38].
Hyper-parameter configuration for the experiments.
For MSCOCO, the word embedding dimensionality is 2,000 and the learning rate is 0.00025.
For the curriculum learning methods, values are given as first-training/second-training parameters.
During a training epoch, every image is presented once, paired with one caption chosen randomly from the five captions available. This differs from the usual approach of presenting all five captions per image in each epoch [21,39]. If all five image-caption pairs are included in the dataset, more than one correct pair for the same image may end up in the same random batch. Since the method uses all image-caption combinations in the batch as contrastive examples, a correct pair could then be wrongly used as an incorrect pair during the loss computation, adding noise to the training. By using only one caption per image, we remove this possibility. On the other hand, it is now possible (although highly unlikely, depending on the number of training epochs) that a correct caption is never used during training. In fact, the probability that a given caption is never used is $(4/5)^{E}$ for $E$ training epochs, which is in the order of $10^{-20}$ for $E = 200$.
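As a sketch, the per-epoch sampling and the probability estimate above can be written as follows (assuming `captions[i]` holds the five captions of image `i`):

```python
import random

def epoch_pairs(images, captions):
    """One training epoch: each image paired with one random caption."""
    return [(img, random.choice(captions[i]))
            for i, img in enumerate(images)]

# Probability that a given caption is never sampled in E epochs:
E = 200
p_never = (4.0 / 5.0) ** E   # ~4.1e-20 for 200 epochs
```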
The models are trained until a maximum number of epochs is reached, and the best performing model on the validation set is chosen. Notice that the result of this process is very similar to what could be obtained through an early stopping policy. In the case of baseline experiments, the maximum number of epochs is set to 200 for all our executions. In MH experiments on Flickr8K and Flickr30K, we raise the maximum number of epochs to 400, as we observed that results kept improving after 200 epochs.
In all our experiments (for both the FC7 and the FNE variants), the batch size is 128 image-caption pairs. Within the same batch, every possible alternative image-caption pair is used as a contrastive example (i.e., 127 contrastive captions per image and 127 contrastive images per caption).
Caption processing
The caption sentences are word-tokenized using the Natural Language Toolkit (NLTK) for Python [3]. As in [11,39], and in contrast to [37], we did not remove punctuation marks. Also, unlike some previous works [21,39], we do not remove long sentences from the training split. We did not observe a significant impact on performance from this reduction of the text pre-processing. These observations are aligned with conclusions from [4], where simple tokenization works equally well or better than more complex text preprocessing systems on general domain datasets. We hypothesise that the short nature of the texts, combined with the availability of multiple text instances for each image, helps the system to overcome sparsity issues.
The choice of the word embedding size and the number of GRUs has been analyzed to obtain a range of suitable parameters to test on the validation set. Previous contributions [11,21,37] set the word embedding dimensionality to 300. In our preliminary experiments, we tested word embedding dimensionalities of 300, 600, 1,024, 1,536, 2,048 and 3,072, finding that a higher dimensionality helps to obtain better results. We also found that very different dimensionalities between the word embedding and the multimodal embedding (i.e., the dimensionality of the GRU output) are detrimental.
Similarly, we explored different multimodal embedding dimensionalities (i.e., numbers of GRU units) on the validation set; the values finally used for each method are detailed in Table 1.
Image processing
For generating the image embedding we use the classical VGG16 CNN architecture [33] pretrained on ImageNet [31] as source model. This architecture is composed of 13 convolutional layers combined with pooling layers, followed by two fully-connected layers and a final softmax output layer (16 weight layers in total). Using only the activations of the last fully-connected layer before the softmax (known as fc7, of dimensionality 4,096) yields the one-layer image embedding used by our FC7 baselines.
To obtain a better representation of the image, the full network embedding resizes the image to 256 × 256 pixels and extracts 5 crops of 224 × 224 pixels (one from each corner and one from the center). Mirroring these 5 crops horizontally, we obtain a total of 10 crops, which are processed through the CNN independently. The activations collected from each of these 10 crops are averaged to obtain a single representation of the image before further processing. For the baseline we use the same process before L2-normalization. Although a similar process is common for data augmentation, notice that we are not actually doing data augmentation, since the number of training samples does not increase.
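A sketch of the crop extraction and averaging, where `cnn_activations` is a hypothetical helper returning the concatenated activations of one crop:

```python
import numpy as np

def ten_crop_embedding(image_256, cnn_activations):
    """Average CNN activations over 5 crops and their mirrors.

    image_256: array of shape (256, 256, 3). Crops are 224x224,
    taken from the four corners and the center, then mirrored.
    cnn_activations is a hypothetical helper (crop -> 1-D activations).
    """
    s = 256 - 224
    offsets = [(0, 0), (0, s), (s, 0), (s, s), (s // 2, s // 2)]
    crops = [image_256[y:y + 224, x:x + 224] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]          # horizontal mirrors
    acts = [cnn_activations(c) for c in crops]
    return np.mean(acts, axis=0)                  # one vector per image
```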
Evaluation metrics
To evaluate the image annotation and image retrieval tasks we use the following metrics: Recall@K (R@K), the fraction of queries for which the correct result appears among the top K retrieved items, with K ∈ {1, 5, 10}; and the median rank (Med r) of the first correct result.
To obtain a comparable performance metric per model, we use the sum of the recalls on both tasks. This has been done before in [39] and in [11], the latter using only R@1 and R@10. We only use the score obtained on the validation set to select the best performing model for early stopping and hyper-parameter selection.
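For reference, a sketch of how R@K and the median rank can be computed from a similarity matrix, under the simplifying assumption of one correct caption per image (caption k matching image k):

```python
import numpy as np

def retrieval_metrics(S, ks=(1, 5, 10)):
    """R@K and median rank for caption retrieval from similarities S.

    S: (n_images, n_captions) similarity matrix where caption k is the
    correct match of image k (one caption per image, for simplicity).
    """
    order = np.argsort(-S, axis=1)                  # best caption first
    ranks = np.array([np.where(order[i] == i)[0][0]
                      for i in range(S.shape[0])])  # 0-based ranks
    recalls = {k: float(np.mean(ranks < k)) for k in ks}
    med_r = float(np.median(ranks)) + 1             # 1-based median rank
    return recalls, med_r
```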
Results obtained for the Flickr8K dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good). The best result of each FC7-FNE comparison is underlined. The best results among SotA and our experiments are shown in bold.
Results from [39].
Trained for 400 epochs.
Results obtained for the Flickr30K dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good).
Single model.
CNN fine-tuned.
Results from [39].
Trained for 400 epochs.
Results obtained for the MSCOCO dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good).
Single model.
Results provided in [19].
Extra training data from validation set.
CNN fine-tuned.
Results from [39].
Table 2 shows the results of the proposed full network embedding on the Flickr8K dataset, for both image annotation and image retrieval tasks. The top part of the table includes the current state-of-the-art (SotA) results as published. The second part summarises the results published by the original contributions this work is based on. The following parts contain the results produced by us for each of the models defined in Section 4.2. Each of these blocks comprises two pairs of results: the first pair corresponds to the results obtained with a hyper-parameter configuration as close as possible to the original works (identified by the -bl suffix), and the second pair to the best configuration found in our hyper-parameter exploration.
First, let us consider the effect of all modifications in the pipeline (detailed in Sections 3 and 4.3) compared to our previous work [39]. In the first block of experiments, we can compare the results from [39] (FC7-SH-bl and FNE-SH-bl) with the ones obtained in this work for the same model (FC7-SH and FNE-SH). Notice that in FC7-SH-bl and FNE-SH-bl the hyper-parameters were already optimized for the FNE. We can see a substantial improvement in the results obtained using both the FC7 and the FNE image embeddings. With an average increase in recall of 4.75% on MSCOCO, these results globally validate the improvements made in the pipeline and the exhaustive hyper-parameter fine-tuning.
Results obtained in this work for the original pipeline of Kiros et al. [21] also improve upon the ones originally published, further validating our implementation.
These results highlight the difficulty of performing a consistent comparison between different multimodal approaches, since different authors make different choices in the settings of their experiments (and sometimes fail to detail them thoroughly). Notably, important differences arise depending on the data used for training and testing, especially when experimenting with the MSCOCO dataset, as we have seen in Section 4.1. Similarly, data augmentation techniques, a standard approach in most SotA methods, can give a boost to performance. In our experiments, we did our best to avoid such differences, or to specify them entirely when they are unavoidable. In this context, the results we provide are as comparable as possible. It is essential to keep all these considerations in mind when comparing the results we report with the ones from other publications.
Comparing the results of the family of methods based on [21] with the state of the art, we see that their relative performance increases with dataset size (larger datasets lead to more competitive performance of these methods). Since the methods tested are more data-driven (i.e., they encode fewer modelling priors than, for instance, attention-based architectures), this behaviour is to be expected.
Now, let us focus on the differences between a model and the same model using the FNE image embedding. This is the most significant contribution of this paper, as it incorporates the FNE into several multimodal embedding pipelines. We can see throughout the result tables that every method on every dataset obtains better results when using the FNE embedding than when using the FC7. Moreover, even with the original hyper-parameter configuration (sub-optimal for the FNE), the FNE obtains better results on all tests. The only exception is FNE-MOE-bl, where training problems occur with the original configuration (we analyze this issue in Section 6). Even in this case, results using an appropriate hyper-parameter selection are superior to those of the baseline (FC7-MOE-bl). Considering all the experiments on the MSCOCO dataset (including baselines), the average increase in recall using the FNE embedding is 3.7%.
Considering the methods tested in our consistent experimental setup, we see that FNE-MH tends to obtain the best results on image annotation, while FNE-POE is usually superior on image retrieval tasks. With these results, we cannot consider one method preferable to the other, except on the smallest dataset, Flickr8K, where FNE-MH is superior. In any case, the performance differences between the best versions of each method remain lower than the impact of the FNE. For instance, in the experiments on MSCOCO, the recall gap between the best and the worst method (for each task separately) is, on average, 2.1%.
Finally, we observe that the proposed curriculum learning methodology increases the already good performance of the original FC7-MOE [11] and of the FNE-MOE by 1.7% on average on MSCOCO. On the other hand, for the methods based on the cosine similarity, the benefits of the curriculum learning are not as clear.
Experiments on MOE training behaviour
When training models using the maximum order embedding loss (MOE and MOE-bl), we observed instability issues. For some configurations of hyper-parameters, the model does not start learning, even after extending the number of epochs significantly. To obtain some insights into this behaviour, we trained the same model five times with different random initialisations. The configurations tested are shown in Table 5. The combinations of learning rate, margin and absolute value are taken from the original works of [11,37].
Hyper-parameter configuration and results for the experiments on MOE training behaviour. Success indicates the number of times that experiment succeeded in starting training (i.e., score > 10) over total repetitions.
The rest of the hyper-parameters are kept the same for all experiments. The dimensionality of the word embedding is 300, and the multimodal embedding has 1,024 dimensions. The maximum number of epochs is 200. We ran all the tests on Flickr8K to minimise computational cost, although we observed this behaviour on Flickr30K and MSCOCO too.
To evaluate these experiments, we count the number of times the algorithm succeeded in starting training. We consider that a model did not train if its validation and test scores are below 10 (regular scores are higher than 200). The results obtained are shown in Table 5.
Quite surprisingly, the results do not point to a single variable as the cause of the problem. For the FC7 embedding, models did not train when the absolute value was used, independently of the learning rate and margin. The configuration that worked well with FC7 does not train with the FNE. On the other hand, the original configuration from [37] (but using the max loss) successfully trained with the FNE embedding, although this behaviour is not entirely robust, since it failed once.
These experiments show that the instability of the training does not come from the choice of embedding, but from the combination of hyper-parameter selection and parameter initialisation. While these experiments help to shed light on the problem, further work is required to completely understand its cause.
The proposed curriculum learning methodology (see Section 3.5) effectively solved this problem in all our experiments, as it initialises the network using the more robust sum loss. None of the experiments we ran using the proposed curriculum learning methodology, for any of the hyper-parameter configurations tested, failed to start training.
For the multimodal pipeline of Kiros et al. [21] and the variants proposed by [37] and [11], our experiments show that replacing the one-layer image embedding with the FNE consistently improves performance on both the image annotation and the image retrieval tasks.
The results of our comparative study of the different variants from [11,21,37] pointed out the need to properly assess the sources of empirical gains. We consider this a key aspect of research that should be further encouraged. We hope that our experimental study can help other researchers with design decisions, from the text pre-processing to the choice of loss, including ranges of optimal dimensionalities and other hyper-parameters.
Another issue we tackled was the instability of MOE models. Depending on the random initialization of the weights, the same model may start learning or not. Our experiments showed that the combination of hyper-parameters also plays a role in these difficulties. However, further study is required to get a real insight into the mechanisms causing this problem. In any case, the proposed curriculum learning method of pre-training using a sum of losses effectively alleviates this problem while increasing performance.
When compared to the current state of the art, the results obtained from the studied variants using FNE are below the results reported through other methods. This difference is often the result of using a more substantial amount of training data. Indeed, results given in [11] indicate that models based on the pipeline of [21] can obtain state-of-the-art results when using enough data.
Finally, let us remark that the FNE is directly compatible with most multimodal pipelines based on CNN embeddings. The consistent improvement in the results observed here for the variants proposed by [11,21,37] suggests that other methods could also boost their performance by incorporating the FNE. These results also encourage us to consider the modifications required to introduce attention mechanisms (as used by DANs [29] and sm-LSTM [15]) into the FNE schema, which remains as future work.
Acknowledgements
This work is partially supported by the Joint Study Agreement no. W156463 under the IBM/BSC Deep Learning Center agreement, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051), and by the Core Research for Evolutional Science and Technology (CREST) program of Japan Science and Technology Agency (JST).
