Abstract
Image caption generation algorithms allow a computer to understand a picture and produce sentences that conform to grammar rules and reflect the picture's content. Under the Encoder-Decoder framework, a CNN (Convolutional Neural Network) is widely used as the encoder to extract image features, and an RNN (Recurrent Neural Network) as the decoder to generate the description sentence. The best-known algorithm is NIC, which uses Inception-v3 as the encoder and an LSTM (Long Short-Term Memory) network as the decoder. However, the LSTM has too many parameters, and the quality of the generated sentences is not high. On the visual side, deepening the network can improve feature extraction ability, but deeper networks suffer from degradation. The NIC algorithm is therefore improved: the Inception-ResNet-v2 network is used as the encoder, and an LSTMP network (an LSTM with a recurrent projection layer) is introduced as the decoder. Taking BLEU-4, ROUGE, METEOR, and CIDEr as evaluation metrics, comparative experiments between NIC and the improved NIC are conducted on the MSCOCO and Flickr30k datasets. Experimental results show that the improved NIC algorithm outperforms the original NIC on all four metrics.
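The parameter saving from the LSTMP decoder can be illustrated with a back-of-the-envelope count. The sketch below is a minimal illustration, not taken from the paper: it assumes a standard LSTM with four gates (weights for the input and the recurrent state, plus a bias per gate), and an LSTMP that replaces the hidden state in the recurrent connections with a lower-dimensional projection. The sizes 512 and 256 are hypothetical examples.

```python
def lstm_params(n_in, n_hidden):
    # 4 gates; each gate has an (n_in + n_hidden) weight slice per hidden
    # unit plus one bias per hidden unit
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)

def lstmp_params(n_in, n_hidden, n_proj):
    # recurrent connections now consume the n_proj-dim projected state,
    # at the cost of an extra n_hidden x n_proj projection matrix
    return 4 * (n_hidden * (n_in + n_proj) + n_hidden) + n_hidden * n_proj

# hypothetical sizes: 512-dim word embeddings, 512 cells, 256-dim projection
full = lstm_params(512, 512)          # 2,099,200 parameters
projected = lstmp_params(512, 512, 256)  # 1,705,984 parameters
print(f"LSTM: {full:,}  LSTMP: {projected:,}")
```

With these example sizes the projection removes roughly 19% of the decoder's recurrent parameters while keeping the cell dimension unchanged, which is the motivation the abstract gives for swapping LSTM for LSTMP.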
