Movie scene segmentation using object detection and set theory

Abstract

Movie data has a prominent role in the exponential growth of multimedia data over the Internet, and its analysis has become a hot topic in computer vision. The initial step towards movie analysis is scene segmentation. In this article, we investigate this problem through a novel Convolutional Neural Network (CNN) based three-fold framework. The first fold segments the input movie into shots, the second fold detects objects in the segmented shots, and the third fold performs object-based shot matching to detect scene boundaries. Texture and color features are fused for shot segmentation, and each shot is represented by a set of detected objects acquired from a lightweight CNN model. Finally, we apply set theory with a sliding window–based approach to group similar shots and decide scene boundaries. The experimental evaluation indicates that our proposed approach outperforms the existing movie scene segmentation approaches.

Keywords

Movie analysis, multi-level decision making, scene segmentation, shot segmentation, information fusion, object detection, set theory

Introduction

The exponential growth of multimedia data has opened various research domains to researchers around the globe, among which entertainment data, such as movies, have a prominent role. In such huge video data,^1,2 efficient organization^3,4 and browsing/retrieval⁵ of a specific movie are challenging problems. To tackle these problems, the basic step is to develop schemes that parse a movie by segmenting it into shots and scenes, following its hierarchical structure shown in Figure 1. A continuous sequence of frames acquired by a single camera is called a shot, which is the basic building block of a movie. However, a single shot is insufficient to deliver any semantic meaning. A scene, on the other hand, consists of one or more consecutive shots that are semantically correlated and share the same physical setting or present a continuous action performed by the actors. Thus, users are more likely to recall a scene than a shot, because it covers a whole event or sub-story in a movie.

Figure 1.

Hierarchical structure of a movie.

Scene segmentation plays a vital role in movie data analysis and has a large number of applications, such as summarization,^6,7 indexing and retrieval,^8,9 alignment of subtitles,¹⁰ and character identification.^11,12 Due to the diverse nature of movies, the available literature on scene segmentation can be grouped into three major categories based on feature extraction: (1) visual-based, (2) audio-based, and (3) hybrid scene segmentation methods. The first algorithm for video segmentation based on credible scene boundaries was presented by Kender and Yeo.¹³ They calculated short-term shot-to-shot coherence using a color histogram to measure the similarity between new and current shots. Similarly, Rui et al.¹⁴ calculated a color histogram of the first and last frames of a shot and then used activity measure clustering to group similar shots of the video. Finally, they merged the overlapping shots to identify the scene boundaries. Rasheed and Shah¹⁵ addressed the problem of over-segmentation of scenes in action movies by calculating the length and motion content of each shot and comparing them with those of the adjacent shots; shots with high motion are then grouped into the same scene. In another work,¹⁶ they proposed a graph partitioning–based scene segmentation scheme in which each node represented a shot and the edges between the nodes represented the temporal–visual coherence. A normalized cuts technique was then applied to partition the graph into a prescribed number of scenes. However, this method was unable to segment the scenes based on semantic meaning, which is one of the main challenges in the scene segmentation literature. A statistical approach based on Markov Chain Monte Carlo (MCMC) was proposed by Zhai and Shah,¹⁷ where scenes were generated using stochastic Monte Carlo sampling. They introduced diffusion as well as merge and split updates to determine the scene boundaries.

Gu et al.¹⁸ proposed an energy minimization–based video scene segmentation scheme that formulates segmentation as an energy minimization problem, in which the global constraint is expressed as content energy and the local constraint as context energy. A text-based scene segmentation scheme was proposed by Cour et al.,¹⁰ who analyzed movie scripts and segmented the movie into scenes by aligning the common dialogues and timestamps. Due to extensive inconsistencies between the scripts and captions, the alignment rate of the text was limited, which affected the overall accuracy of the scene segmentation technique. Weng et al.¹⁹ studied the social relationships of actors in movies and segmented movie scenes based on their appearance. They built a social network of the actors' roles and then analyzed the variance in context. This method showed good results in segmenting movie scenes. Badrinarayanan et al.²⁰ used a Shot Transition Graph (STG), where each node represented a shot and the edges among them were weighted based on their similarity. The STG was then split into sub-graphs using normalized cuts to detect the scene boundaries. Baraldi et al.²¹ combined a local image descriptor and temporal clustering to divide videos into coherent scenes. Kalith et al.²² performed video segmentation using moving object detection in each frame. A semantic video scene segmentation method based on image captioning with a deep model was proposed by Ji et al.²³ First, keyframes were selected from shots and then forwarded to a deep model to generate semantic text descriptions. The generated captions for each keyframe were then utilized to check the similarity between the shots. For three-dimensional (3D) point cloud videos, a generic segmentation approach was presented by Lin et al.²⁴ that exploits geometry in RGBD data and is based on low-level features such as connectivity and compactness. They used a rough estimation of the objects in one frame to exploit temporal coherence with hierarchical structures. This hierarchical structure was used to efficiently establish temporal correspondence at different scales of object connectivity, and temporally managing the splits and merges of the objects helped to update the segmentation.

Shot-based keyframe selection is one of the main limitations of the existing literature, because representing a shot by one keyframe loses a lot of information for lengthy shots. In addition, such schemes need a special module for keyframe selection. The second major issue in these schemes is that shot similarity is measured with low-level features: the shots of a single scene are semantically related, and it is very challenging to represent semantic meaning through low-level features. To tackle these problems, we propose a scheme to segment a movie into scenes based on the objects appearing in each shot. First, the movie is segmented into shots using texture and color features, and then the objects are detected in each shot using a YOLOv3²⁵ Convolutional Neural Network (CNN) model trained on the Common Objects in Context (COCO) dataset,²⁶ which has 80 classes of everyday objects. Finally, we propose a novel set theory–based mechanism for shot matching to detect scene boundaries. The main contributions of this article are summarized as follows:

In movies, multiple cameras are used to capture a scene from different angles, and finally shots from each angle are merged to present a scene. We fused texture and color features to detect these shots in an efficient way.

Existing techniques rely on one or two keyframes to represent a shot, which results in a loss of information. These techniques also need an extra module for keyframe extraction. In contrast, we represent a shot through the set of objects detected in it. Our strategy captures as much information about a shot as possible by detecting objects in all of its frames.

Measuring shot similarity is a crucial step in scene segmentation techniques; it is usually done through feature matching, which requires extensive processing time. We therefore propose a novel set theory–based shot matching approach, which is computationally efficient compared to feature-based matching approaches.

The rest of the article is organized as follows: section “The proposed method” discusses the proposed methodology for scene detection in detail. Experimental results and discussion are presented in section “Experimental evaluation,” which is followed by the conclusion and future work in section “Conclusion.”

The proposed method

In this section, we present the proposed framework in detail. Our method has three folds: (1) shot segmentation, (2) shot-based object detection, and (3) shot matching using set theory. Each step is discussed in detail in the subsequent sections. The proposed system can segment a full-length movie of any genre into scenes for further analysis. The overall framework of the proposed system is given in Figure 2.

Figure 2.

Our proposed framework for scene segmentation in movies.

Shot segmentation

The hierarchical structure of movies assists in parsing a movie into short clips, which makes high-level video segmentation and movie analysis more effective. Using this advantage, we first segment the whole movie into shots. Low-level features, such as a color histogram, Local Binary Patterns (LBPs), and a Histogram of Oriented Gradients (HOG), perform well in shot segmentation based on frame-by-frame comparison. Inspired by the histogram-based method,²⁷ we use the HSV color space for a color histogram and LBP for texture features to compute the similarity between two consecutive frames. A single frame is represented by a feature vector FV of size 360, obtained by concatenating a color histogram of 70 bins, computed using color quantization in the HSV color model, with 290 texture features. For the texture features, LBP patterns are first extracted from a frame, of which we select only the 58 uniform patterns, and then five 2 × 2 angular structures are applied to analyze these patterns. This yields a texture feature vector of size 5 × 58 = 290. A detailed description of the feature extraction is given in Sajjad et al.²⁷ The similarity S between two consecutive frames is calculated using equations (1) and (2)

S_{(i, j)} = \sum_{b \in \text{bins}} \min \left( FV_{i}(b), FV_{j}(b) \right), \quad \text{where } S_{(i, j)} \in [0, 1]

(1)

FV_{j} \in \begin{cases} \text{same shot frames}, & \text{if } S_{(i, j)} > \text{threshold} \\ \text{different shots' frames}, & \text{otherwise} \end{cases}

(2)

where S_(i,j) is the similarity score between feature vectors FV_i and FV_j of the ith and jth frames, respectively. If the calculated similarity score between the ith and jth frames is greater than a predetermined threshold, then the jth frame is considered as related to the previous shot; otherwise, a new shot is considered to be initiated. We have set the threshold value equal to 0.9 after extensive experiments.
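The decision rule in equations (1) and (2) can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the function names `histogram_intersection` and `segment_shots` are illustrative, the feature vectors are toy 4-bin histograms rather than the 360-dimensional FV, and we assume that each vector is normalized to sum to 1 so that S stays in [0, 1].

```python
from typing import List, Sequence


def histogram_intersection(fv_i: Sequence[float], fv_j: Sequence[float]) -> float:
    """Equation (1): sum of the bin-wise minima of two feature vectors."""
    if len(fv_i) != len(fv_j):
        raise ValueError("feature vectors must have the same length")
    return sum(min(a, b) for a, b in zip(fv_i, fv_j))


def segment_shots(features: List[Sequence[float]], threshold: float = 0.9) -> List[List[int]]:
    """Equation (2): group consecutive frame indices into shots.

    A frame stays in the running shot when its similarity to the
    previous frame exceeds the threshold; otherwise a new shot starts.
    """
    if not features:
        return []
    shots = [[0]]
    for j in range(1, len(features)):
        if histogram_intersection(features[j - 1], features[j]) > threshold:
            shots[-1].append(j)
        else:
            shots.append([j])
    return shots


# Toy data (assumption): four frames with 4-bin histograms that sum to 1.
frames = [
    [0.50, 0.30, 0.10, 0.10],
    [0.48, 0.32, 0.10, 0.10],
    [0.05, 0.05, 0.45, 0.45],
    [0.05, 0.05, 0.44, 0.46],
]
print(segment_shots(frames))  # [[0, 1], [2, 3]]
```

In this toy input, the similarity between frames 1 and 2 is 0.30, well below the threshold of 0.9, so a shot boundary is placed between them.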

Shot-based object detection

The recent integration of deep learning with high computational power enables machines to successfully solve object detection,²⁸ classification,²⁹ and several other problems related to images and videos in real time.³⁰ We apply deep learning in the domain of movie scene segmentation to detect various objects in a shot. For object detection, we use a YOLOv3 CNN²⁵ model with the DarkNet backend framework, trained on the COCO dataset.²⁶ Its ability to recognize multiple objects in a single image by examining multiple regions at once with only one neural network makes it suitable for our problem. The objects contained in the COCO dataset are categorized into 80 classes that are well suited to object-based shot matching in movies. The objects include pedestrians, vehicles, signboards, and many other indoor and outdoor objects that play a vital role in movie data. In YOLOv3, object detection is treated as a regression problem, and predictions for multiple classes are made simultaneously in a single pass. It also encodes contextual information about the classes based on their appearance, which makes it applicable to our problem. The architecture of YOLOv3 uses successive convolutional filters of size 3×3 and 1×1 in a sequence of 53 convolutional layers. Each convolutional layer is followed by a non-linear activation function, the leaky rectified linear unit (Leaky ReLU), and downsampling is performed by convolutions with stride 2 rather than by max-pooling layers.²⁵ YOLOv3 has significant benefits over previous versions in terms of accuracy, and its Darknet-53 backbone is reported to process 78 fps.²⁵ Figure 3 shows some sample frames from the testing movies with detected objects using YOLOv3.

Figure 3.

Sample frames from testing movies with the detected objects of various categories.

Most scene segmentation schemes select one keyframe, or the first and last frames, to represent the whole frame sequence of a shot.^31,32 Such schemes ignore a lot of useful information when the shot is lengthy. In our scheme, we detect objects in all frames of a shot to make sure that no information is lost. The union of the objects detected in each shot is then taken to associate a set of objects with the corresponding shot using equations (3) and (4)

S [ob_{1}, ob_{2}, \dots, ob_{m}] = \sum_{i = 1}^{n} \odot [Sh (f_{i})]

(3)

S_{set} = \bigcup_{j = 1}^{m} S (ob_{j})

(4)

where ⊙ is the function that detects objects in the frames f₁, f₂, …, f_n of shot Sh; S is the collection of all detected objects ob₁, ob₂, …, ob_m; and S_set is the set of objects that appear in the shot after taking the union. The union of all detected objects of the shot is taken so that an object detected in several frames is counted only once. With this mechanism, each shot is represented as a set of objects based on its contents.
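As a hedged illustration of equations (3) and (4), the following Python sketch collects the class labels detected in every frame of a shot and keeps each label once. The detector is passed in as a callable; `fake_detect` is a hypothetical stand-in for the YOLOv3 model, and the toy "frames" are simply lists of labels, so the sketch shows only the set construction and not the detection itself.

```python
from typing import Callable, FrozenSet, Iterable, TypeVar

Frame = TypeVar("Frame")


def shot_object_set(
    frames: Iterable[Frame],
    detect: Callable[[Frame], Iterable[str]],
) -> FrozenSet[str]:
    """Union of the object labels detected in all frames of one shot."""
    labels = set()
    for frame in frames:
        # Equation (3): run the detector on every frame of the shot.
        # Equation (4): the set keeps each detected label only once.
        labels.update(detect(frame))
    return frozenset(labels)


def fake_detect(frame):
    """Hypothetical detector: each toy frame already lists its labels."""
    return frame


# Toy shot (assumption): three frames with COCO-style class labels.
shot = [["person", "car"], ["person"], ["person", "traffic light"]]
print(sorted(shot_object_set(shot, fake_detect)))
# ['car', 'person', 'traffic light']
```

Although "person" is detected in all three frames, it appears once in the resulting set, which is the purpose of taking the union.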

Scene boundaries detection using set theory and sliding window approach

As discussed in the introduction, a scene is a group of consecutive shots that are semantically correlated and share similar coherence in time and space to present an entire event. Low-level features can give some useful information about the connection between shots, but they are unable to describe the semantic correlation between shots. Therefore, we efficiently utilize the contents of the shots to measure similarity using set theory in the proposed scheme. An example is given in Figure 4 to explain the concept of the proposed shots matching mechanism.

Figure 4.

Scene boundary detection using sliding window approach.

In the previous step, each shot is represented as a set of detected objects, as shown in Figure 4. A window of four shots is used to determine the similarity between the shots. In Figure 4, suppose that the sets of shots 2, 3, 4, and 5 are in the red window. We assume that shots 2, 3, and 4 belong to the same scene, and to check shot 5, we compare its set with the other sets in the window. The comparison is based on set theory: the intersection of set 5 is taken with sets 2, 3, and 4 separately. If the number of elements in any of these intersections is greater than or equal to the total number of elements of the corresponding set (2, 3, or 4), the shot is considered part of the running scene and the window moves to the next shot, as highlighted with the green window. Now the intersection of set 6 with each of the other sets in the green window gives an empty set, which means that set 6 has no elements in common with these shots. Hence, a new scene is started, and the window restarts from set 6 (blue window). The overall mechanism is given in Algorithm 1.

Algorithm 1. Scene boundary detection
Input: Sets of objects detected in shots S_set_(x), total number of elements in each set T_(x)
1. Select the first four sets as the window (S_set_(1), S_set_(2), S_set_(3), and S_set_(4))
2. For i = 4 to N, where N is the total number of shots
     If (|S_set_(i – 3) ∩ S_set_(i)| ≥ T_(i – 3)) OR (|S_set_(i – 2) ∩ S_set_(i)| ≥ T_(i – 2)) OR (|S_set_(i – 1) ∩ S_set_(i)| ≥ T_(i – 1))
     Then
       Scene_x ← Scene_x ∪ S_set_(i)
       Increment window
     Else
       Scene_y ← S_set_(i)
       Start window from S_set_(i)
     End
Output: Scene boundaries
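The following Python sketch shows one possible reading of Algorithm 1 and should not be taken as the exact implementation. It follows the "greater than or equal to" wording of the text, so a shot joins the running scene when its intersection with an earlier set in the window is at least as large as that earlier set. Three points are our assumptions for illustration: the name `detect_scenes`, restricting the comparison to shots of the running scene after the window restarts, and treating an empty object set as never matching.

```python
from typing import FrozenSet, List, Sequence


def detect_scenes(shot_sets: Sequence[FrozenSet[str]], window: int = 4) -> List[List[int]]:
    """Group shot indices into scenes using a sliding window of shots."""
    if window < 2:
        raise ValueError("window must cover at least two shots")
    scenes: List[List[int]] = []
    for i, current in enumerate(shot_sets):
        if scenes:
            # Earlier shots of the running scene that still fall in the window.
            recent = scenes[-1][-(window - 1):]
            matched = any(
                # Assumption: an empty set carries no evidence and never matches.
                len(shot_sets[k]) > 0
                and len(shot_sets[k] & current) >= len(shot_sets[k])
                for k in recent
            )
            if matched:
                scenes[-1].append(i)  # same scene, window moves on
                continue
        scenes.append([i])  # new scene, window restarts at shot i
    return scenes


# Toy shots (assumption): object sets for six consecutive shots.
shots = [
    frozenset({"person", "chair"}),
    frozenset({"person", "chair", "cup"}),
    frozenset({"person", "chair"}),
    frozenset({"person", "chair", "laptop"}),
    frozenset({"car", "truck"}),
    frozenset({"car", "truck", "bus"}),
]
print(detect_scenes(shots))  # [[0, 1, 2, 3], [4, 5]]
```

The first index of each group is a scene boundary. In the toy input, shot 4 shares no object with shots 1 to 3, so a new scene starts there, as in the blue window of Figure 4.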

Experimental evaluation

In this section, we present the experimental details of the proposed movie scene segmentation scheme. The experiments were conducted on a desktop computer with an Intel® Core™ i7-3770 CPU and 16 GB of RAM. The programming environment was Python 3.5 running on a Windows 10 x64 operating system. The dataset information is given in the subsequent section.

Dataset and ground truth

A total of five movies of different genres were used in our experiments to evaluate the scene segmentation performance. These movies included You’ve Got Mail, The Devil Wears Prada, Salt, Notting Hill, and The Lake House. Detailed information about the testing data is given in Table 1. Due to the variation found in the definition of a scene, there is no benchmark dataset with common ground truth in the literature for movie scene segmentation. In our experiments, we generated ground truth by selecting scene boundaries with the help of movie scripts. The last column of Table 1 shows the total number of scenes as the ground truth for all movies.

Table 1.

Detailed description of the testing data.

Movie ID	Movie title	Genre	Length (min)	No. of shots	No. of scenes
YGM	You’ve Got Mail	Comedy/Romance	119	1049	62
TDWP	The Devil Wears Prada	Comedy/Drama	109	1614	96
SA	Salt	Action/Adventure	100	2339	78
NH	Notting Hill	Comedy/Romance	124	1623	42
TLH	The Lake House	Fantasy/Drama	105	865	76

Scene change detection

The common metrics used by most of the researchers to evaluate scene change accuracy are Precision (P), Recall (R), and F-measure (F). We followed the criteria used in Chen et al.³³ to evaluate scene change accuracy using equations (5)–(7)

P = \frac{N_{c}}{N_{c} + N_{f}}

(5)

R = \frac{N_{c}}{N_{c} + N_{m}}

(6)

F = \frac{2 \times P \times R}{P + R}

(7)

where N_c is the number of correctly detected scene boundaries, N_m is the number of missed boundaries, and N_f is the number of falsely detected scene boundaries. Table 2 presents the overall performance of the proposed scheme. The results for the movie NH are lower because most of its shots are close-ups of the actors' faces, so that in most shots only a single class, such as person, is detected. This kind of error is discussed in section “Analysis of missed boundaries.”

Table 2.

Overall results of the proposed scheme.

Movie ID	Total no. of scenes	Correct detection	Missed detection	False detection	Precision (%)	Recall (%)	F-measure (%)
YGM	62	56	6	8	87.50	90.32	88.88
TDWP	96	83	16	19	81.37	83.83	82.58
SA	78	72	11	10	87.80	86.74	87.27
NH	56	41	15	18	69.49	73.21	71.30
TLH	76	68	8	11	86.07	89.47	87.74
Total	368	320	56	66	82.44	84.71	83.55
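As a worked example of equations (5)–(7), the short Python sketch below recomputes the three metrics from the detection counts of the YGM row in Table 2. The function name `scene_metrics` and the zero-denominator handling are illustrative assumptions rather than part of the evaluation protocol.

```python
from typing import Tuple


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division (assumption): return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def scene_metrics(n_correct: int, n_missed: int, n_false: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure as in equations (5)-(7)."""
    precision = _ratio(n_correct, n_correct + n_false)
    recall = _ratio(n_correct, n_correct + n_missed)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


# Counts from the YGM row of Table 2: 56 correct, 6 missed, 8 false.
p, r, f = scene_metrics(n_correct=56, n_missed=6, n_false=8)
print(f"P={100 * p:.2f}%  R={100 * r:.2f}%  F={100 * f:.2f}%")
# P=87.50%  R=90.32%  F=88.89%
```

These values agree with the YGM row of Table 2 up to rounding in the last digit of the F-measure.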

Analysis for window size determination

To determine the window size for shot comparison, we repeated our experiments with windows of different sizes. We noticed during the experiments that our algorithm took more time as the window size increased. Figure 5 shows the average performance over all the movies for window sizes from 2 to 7 shots, where the abscissa represents the window size and the ordinate represents the average F-measure over all the movies. The experiments indicate that the preferable window size is 4. The movie NH gave its best results with a window size of 5, but the average results over all the movies were better with size 4.

Figure 5.

F-measure values for different sizes of the sliding window.

Analysis of missed boundaries

Missed scene boundaries in our approach are caused by two types of errors. The first is the appearance of the same objects in two adjacent scenes. For example, almost 95% of the shots in a movie contain an actor, who is predicted as the person class in our proposed scheme. Hence, detecting a common object in all shots leads to missed scene boundaries. Examples of such scenes can be found in the movie NH and are given in Figure 6(a). This type of error is difficult to recover using visual cues without causing many false detections. The second cause of missed scene boundaries is the gradual transition between two shots. Although there are very few gradual transitions between shots in movies, most of them occur at scene changes. A robust shot transition detection scheme can recover such errors. Figure 6(b) shows a gradual transition between two scenes.

Figure 6.

Samples of missed scene boundaries: (a) common object detection problem in two different scenes, where the first two frames represent one scene and the rest represent the second; (b) gradual transition between shots of different scenes.

Conclusion

In the computer vision domain, movie data is considered one of the most challenging types of data due to its length, diversity, high-level semantics, and camera motion. However, its hierarchical structure helps to segment it into shots and scenes for better understanding. In this article, we proposed an efficient three-fold framework for scene segmentation in movies. First, the shots are segmented using a fusion of low-level features. Next, a CNN model is employed for object detection in each shot, so that each shot is represented by the set of objects detected in it. Finally, set theory with a sliding window approach is used to measure the similarity between shots and assign them to the appropriate scene. Our approach is not only an effective tool for scene boundary detection in movies, but it is also applicable to other movie-related tasks, such as keyframe extraction for indexing and retrieval, video abstraction, skim selection, and trailer generation. In future research, we intend to apply a movie summarization algorithm to the segmented scenes for movie trailer generation.

Footnotes

Handling Editor: Fuyuan Xiao

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1D1A1B07043302).

ORCID iD

Ijaz Ul Haq

References

1. Xiang, Wang, Pickering, et al. Big video data for light-field-based 3D telemedicine. IEEE Network 2016; 30: 30–38.

2. Muhammad, Ahmad, et al. Efficient deep CNN-based fire detection and localization in video surveillance applications. IEEE Trans Syst Man Cyber Syst. Epub ahead of print 14 June 2018. DOI: 10.1109/TSMC.2018.2830099.

3. Gao, Xiang. Rate-distortion optimized mode switching for error-resilient multi-view video plus depth based 3-D video coding. IEEE Trans Multimedia 2014; 16: 1797–1808.

4. Nasir, Muhammad, Mauri, et al. Fog computing enabled cost-effective distributed summarization of surveillance videos for smart cities. J Para Distrib Comput 2018; 126: 161–170.

5. Ahmad, Muhammad, Baik. Medical image retrieval with compact binary codes generated in frequency domain using highly reactive convolutional features. J Med Syst 2018; 42: 24.

6. Tsai C-M, Kang L-W, Lin C-W, et al. Scene-based movie summarization via role-community networks. IEEE Trans Circuit Syst Video Techn 2013; 23: 1927–1940.

7. Muhammad, Hussain, Baik. Efficient CNN based summarization of surveillance videos for resource-constrained devices. Pattern Recog Lett. Epub ahead of print 7 August 2018. DOI: 10.1016/j.patrec.2018.08.003.

8. Tao, Tan Y-P. Efficient clustering of face sequences with application to character-based movie browsing. In: International conference on image processing, San Diego, CA, 12–15 October 2008, pp.1708–1711. New York: IEEE.

9. Choi, De Neve. Towards an automatic face indexing system for actor-based video services in an IPTV environment. IEEE Trans Consum Electron 2010; 56: 147–155.

10. Cour, Jordan, Miltsakaki, et al. Movie/script: alignment and parsing of video and text transcription. In: European conference on computer vision, Marseille, 12–18 October 2008, pp.158–171. Berlin: Springer.

11. Sang. Robust face-name graph matching for movie character identification. IEEE Trans Multimedia 2012; 14: 586–596.

12. Haq, Muhammad, Ullah, et al. DeepStar: detecting starring characters in movies. IEEE Access 2019; 7: 9265–9272.

13. Kender, Yeo B-L. Video scene segmentation via continuous video coherence. In: Computer society conference on computer vision and pattern recognition, Santa Barbara, CA, 25 June 1998, pp.367–373. New York: IEEE.

14. Rui, Huang, Mehrotra. Constructing table-of-content for videos. Multimedia Syst 1999; 7: 359–368.

15. Rasheed, Shah. Scene detection in Hollywood movies and TV shows. In: Computer society conference on computer vision and pattern recognition, Madison, WI, 18–20 June 2003, pp.343–350. New York: IEEE.

16. Rasheed, Shah. Detection and representation of scenes in videos. IEEE Trans Multimed 2005; 7: 1097–1105.

17. Zhai, Shah. A general framework for temporal video scene segmentation. In: Proceedings of the 10th international conference on computer vision (ICCV), Beijing, China, 17–21 October 2005, pp.1111–1116. New York: IEEE.

18. Mei, Hua X-S, et al. EMS: energy minimization based video scene segmentation. In: International conference on multimedia and expo, Beijing, China, 2–5 July 2007, pp.520–523. New York: IEEE.

19. Weng C-Y, Chu W-T, J-L. RoleNet: movie analysis from the perspective of social networks. IEEE Trans Multimedia 2009; 11: 256–271.

20. Badrinarayanan, Kendall, Cipolla. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 2015; 39: 2481–2495.

21. Baraldi, Grana, Cucchiara. Scene segmentation using temporal clustering for accessing and re-using broadcast video. In: International conference on multimedia and expo (ICME), Turin, 29 June–3 July 2015, pp.1–6. New York: IEEE.

22. Kalith, Mohanapriya, Mahesh. Video scene segmentation: a novel method to determine objects, 2018, http://ijsrst.com/IJSRST184121

23. Hooshyar, Kim, et al. A semantic-based video scene segmentation using a deep neural network. J Inform Sci. Epub ahead of print 19 December 2018. DOI: 10.1177/0165551518819964.

24. Lin, Casas, Pardàs. Temporally coherent 3D point cloud video segmentation in generic scenes. IEEE Trans Image Process 2018; 27: 3087–3099.

25. Redmon, Farhadi. Yolov3: an incremental improvement, 2018, https://arxiv.org/abs/1804.02767

26. Lin T-Y, Maire, Belongie, et al. Microsoft COCO: common objects in context. In: European conference on computer vision, Zurich, 6–12 September 2014, pp.740–755. Cham: Springer.

27. Sajjad, Ullah, Ahmad, et al. Integrating salient colors with rotational invariant texture features for image representation in retrieval systems. Multimedia Tool Appl 2018; 77: 4769–4789.

28. Ullah, Muhammad, Del Ser, et al. Activity recognition using temporal optical flow convolutional features and multi-layer LSTM. IEEE Trans Indus Electron. Epub ahead of print 22 November 2018. DOI: 10.1109/TIE.2018.2881943.

29. Sajjad, Khan, Muhammad, et al. Multi-grade brain tumor classification using deep CNN with extensive data augmentation. J Comput Sci 2018; 30: 174–182.

30. Ding, Lin C-T, Cao. Deep neuro-cognitive co-evolution for fuzzy attribute reduction by quantum leaping PSO with nearest-neighbor memeplexes. IEEE Trans Cybernet 2018; 49: 2744–2757.

31. Zhu, Liu. Video scene segmentation and semantic representation using a novel scheme. Multimedia Tool Appl 2009; 42: 183–205.

32. Sakarya, Telatar. Graph-based multilevel temporal video segmentation. Multimedia Syst 2008; 14: 277–290.

33. Chen L-H, Lai Y-C, Liao H-YM. Movie scene segmentation using background information. Pattern Recog 2008; 41: 1056–1065.