Two-person interaction recognition from bilateral silhouette of key poses

Abstract

This work proposes a key-pose-based intelligent system for recognizing human interactions in video streams. Beyond interaction recognition itself, the approach is useful for other applications such as content-based video retrieval. The main idea is to take the shape of the bilateral silhouette between the two persons and analyze it with the shape context descriptor, a popular shape descriptor in object recognition and matching tasks. First, a dictionary of randomly selected samples from all classes is collected, and the bilateral silhouette image is extracted from every sample to train a low-level classifier, called the frame classifier. Next, each frame of a test sequence is compared with these samples and assigned a class label by the frame classifier. Finally, a high-level classifier, called the sequence classifier, categorizes the interaction as a function of the labels of the frame sequence. Errors in foreground extraction can cause some frames to be misclassified. Moreover, each interaction sequence contains both frames that carry information related to the interaction and frames that do not. To tackle these problems, a normalized histogram of the frame labels is used as the action descriptor, which is robust to the misclassification of some frames. This histogram is fed to a sequence classifier such as a random decision forest (RDF), a probabilistic neural network (PNN), or a support vector machine (SVM) to perform interaction recognition. Experimental results on the SBU and UT-Interaction datasets demonstrate the strong performance of the proposed method.
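The descriptor step described above can be illustrated with a minimal sketch. It assumes the frame classifier has already produced one integer label per frame; the label values, the interaction names, and the helper names `label_histogram` and `nearest_sequence_label` are hypothetical. The sequence classifier here is a simple 1-nearest-neighbour stand-in, not the RDF, PNN, or SVM named in the abstract.

```python
from collections import Counter


def label_histogram(frame_labels, num_classes):
    """Normalized histogram of per-frame labels: one bin per frame class."""
    if not frame_labels:
        raise ValueError("sequence has no frames")
    counts = Counter(frame_labels)
    total = len(frame_labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def nearest_sequence_label(descriptor, train_descriptors, train_labels):
    """Stand-in sequence classifier: 1-nearest neighbour under L1 distance."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    best = min(range(len(train_descriptors)),
               key=lambda i: l1(descriptor, train_descriptors[i]))
    return train_labels[best]


# Hypothetical frame labels (0..3), as if produced by a frame classifier.
NUM_CLASSES = 4
train_sequences = {
    "handshake": [0, 0, 1, 1, 1, 1, 0, 0],
    "push":      [0, 2, 2, 2, 2, 3, 0, 0],
}
train_labels = list(train_sequences)
train_descriptors = [label_histogram(s, NUM_CLASSES)
                     for s in train_sequences.values()]

# A test sequence of a different length, with two frames mislabeled as class 3.
test_sequence = [0, 1, 1, 3, 1, 1, 1, 3, 0, 0]
test_descriptor = label_histogram(test_sequence, NUM_CLASSES)
predicted = nearest_sequence_label(test_descriptor, train_descriptors, train_labels)
print(predicted)
```

Because the histogram is normalized by sequence length, sequences of different durations map to descriptors of the same size, and a few mislabeled frames shift only a small fraction of the histogram mass.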

Keywords

Human interaction recognition, bilateral silhouette, key pose, high-level classifier, low-level classifier

