Abstract
Entertainment venues like theaters are key to social engagement, but the lack of assistive technologies for people with visual impairments limits their participation. Our research aims to enhance theater accessibility using computer vision techniques to detect visual information about objects and actor gestures and convey it to blind audiences through non-visual modalities. In this work, we focus on providing a novel dataset, TS-RGBD, containing theater scenes to guide the development of computer-vision-based systems for theater scene description. It includes RGB, depth, and skeleton sequences captured with a Microsoft Kinect in a theatrical setting. The dataset features untrimmed theater scenes as well as trimmed sequences consisting of individual gestures for actor gesture recognition. We test state-of-the-art image captioning models on untrimmed scenes, revealing that dense captioning models generate a fixed number of captions with redundancies, leading to imprecise descriptions and a lack of context. These challenges hinder visually impaired individuals from comprehending theater scene descriptions effectively. Additionally, we assess the performance of skeleton-based graph convolutional networks for human action recognition in a theater environment using trimmed skeleton sequences. The results highlight limitations in recognizing human actions in this setting. Based on these findings, we propose solutions for overcoming these challenges, paving the way for future improvements in making theater performances more accessible to individuals with visual impairments.
Keywords
Introduction
Accessibility research aims to create tools that assist individuals with disabilities, particularly those with visual impairments. Recently, technologies such as artificial intelligence and computer vision have been applied in various contexts to assist blind and visually impaired (BVI) individuals in everyday tasks, including navigation and wayfinding (Bouzit et al., 2004; Lavanya Narayani et al., 2021; Nazri et al., 2020), education (Kim et al., 2020; Shilkrot et al., 2015; Srmist & Student, 2018), and face and object recognition (Francis et al., 2020; Ijaz et al., 2021). However, entertainment venues, especially cinemas and theaters, remain largely inaccessible to BVI audiences due to their heavy reliance on visual cues, with few solutions available to make these experiences fully inclusive. Despite the growing use of computer vision in accessibility, little research has focused on its application to live performances. Our work addresses this gap by exploring how computer vision and deep learning models can detect visual cues in theater performances and convey them to BVI audience members. Specifically, we work on two key objectives, although further details of these works are not within the scope of this article:
Theater scene description: providing BVI audience members with a textual description of the important regions in a theater scene, including the position of every object or region on stage relative to them (left, right, front), using image captioning deep learning models.
Actors' gesture recognition: conveying actors' gestures to the BVI audience via tactile stimuli, by recognizing the actions performed on stage with human action recognition deep learning models and rendering the recognized actions on a tactile device (Benhamida & Larabi, 2022).
Therefore, in this article, we present a novel dataset, TS-RGBD, created to contribute to the works defined above by evaluating state-of-the-art deep learning models for scene description and action recognition in theater environments. The dataset consists of untrimmed videos of theater scenes and trimmed sequences of actors' gestures, captured with a depth sensor. The inclusion of depth data provides rich spatial information, enhancing the accuracy of detection and recognition.
This paper is organized as follows: Section 2 reviews existing approaches and benchmarks for both image captioning and human action recognition. Section 3 introduces the proposed theatre dataset, TS-RGBD, its structure, annotation process, and detailed statistics. Section 4 presents our solution for egocentric captioning, followed in Section 5 by the experimental results of skeleton-based human action recognition models on the proposed dataset.
Related Works
Image Captioning
Image captioning consists of describing the content of a given image in text. The automatically generated captions are expected to be grammatically correct and logically ordered. Image captioning relies on deep learning models based either on retrieval (auto-encoders, feature extraction, etc.), templates (sentence generation after object detection and recognition), or end-to-end learning (Liu et al., 2019). Generated captions can be a single sentence or multiple sentences that form a paragraph.
There are various architectures for single-sentence captioning models, from scene description graphs (Aditya et al., 2018; Shambharkar et al., 2021) to attention mechanisms (Tan et al., 2022; Wei et al., 2021; Xian et al., 2022; Zha et al., 2022; Zhang et al., 2021), transformers, and even CNN-LSTM and GAN networks (Che et al., 2020; Li et al., 2021; Su et al., 2019).
Solutions for paragraph captioning rely on end-to-end dense captioning models, which build on single-sentence captioning to generate a set of sentences that are then combined into a coherent paragraph (Liu et al., 2019). These solutions use encoder-decoder architectures and recurrent networks (Johnson et al., 2016; Li et al., 2020; Xu et al., 2021; Zha et al., 2022).
Kong et al. (2014) proposed a solution for RGB-D image captioning, but it focuses only on enriching descriptions with positional relationships between objects, and the model was trained on a dataset that does not include theatre images.
Whether for single sentences or paragraphs, image captioning models have achieved remarkable results on various metrics (BLEU, ROUGE, METEOR, CIDEr, etc.). However, they do not generate detailed captions for complex scenes: single-sentence models focus on moving objects and ignore the background, while paragraph captioning models do not consider positional descriptions.
Giving blind and visually impaired people sentences that lack descriptions of static objects and the background, or paragraphs that lack positional descriptions of those objects, makes it difficult or even impossible for them to re-imagine and rebuild the scene in their minds.
We also highlight that most models are trained only on ordinary indoor or outdoor scenes, which leads to poor captions for images extracted from theatre scenes.
Well-known computer vision datasets, even widely acclaimed ones, lack theatre images, let alone comprehensive RGB-D data specifically capturing theatre scenes. Table 1 summarizes the available RGB datasets.
Table 1. RGB Datasets.
Table 2 gives a similar summary of the available RGB-D datasets.

Table 2. RGB-D Datasets.
From both tables, we conclude that no available dataset includes theatre plays.
Human Action Recognition

Human action recognition is a fundamental task in computer vision with numerous applications, ranging from surveillance and human-computer interaction to robotics and virtual reality. Owing to this wide range of applications, many methods have been proposed that achieve considerable performance. The earliest methods were based on RGB sequences (Bregonzio et al., 2009; Laptev et al., 2008), but their performance is relatively low due to factors such as illumination and clothing colors. After the release of the Microsoft Kinect sensor, many RGB-D human action benchmarks emerged (Rahmani et al., 2016; Shahroudy et al., 2016; Wang et al., 2012), providing richer information through the depth modality and thus more accurate action features. They mostly consist of three modalities: RGB, depth, and skeleton sequences.
Some of the well-known RGB-D human action benchmarks include:
UWA3D Activity Dataset (Rahmani et al., 2016) contains 30 activities performed at different speeds by 10 people of varying heights in cluttered settings. It has high inter-class similarity and frequent self-occlusions.
MSR Daily Activity3D (Wang et al., 2012) includes 16 daily activities in a living room. It can be used to assess the modeling of human-object interactions as well as the robustness of algorithms to pose changes.
MSR Action Pairs (Oreifej & Liu, 2013) provides 6 pairs of actions in which the two actions of a pair involve interactions with the same object in distinct ways. It can be used to evaluate an algorithm's ability to model the temporal structure of actions.
NTU-RGBD (Shahroudy et al., 2016) initially contained 56,880 sequences of 60 action classes; the extended version (Liu et al., 2020) added 57,367 sequences and 60 further action classes, making it the largest action benchmark so far.
Consequently, action recognition methods developed on these RGB-D datasets surpass the earliest approaches. Some methods use depth maps only (Li et al., 2018; Zhang et al., 2018), achieving better performance than RGB methods, but they remain very sensitive to viewpoint variations.
Recently, the skeleton-based approach has been widely investigated using skeleton sequences, achieving considerable performance compared to the other approaches, especially after the rise of graph convolutional networks (GCNs) (Li et al., 2019; Xu et al., 2021; Yan et al., 2018). GCNs are designed to extract features from graph-structured data such as skeleton sequences, which can be modeled as graphs by linking body joints; a minimal sketch of such a graph convolution is given below. However, all existing datasets were captured in ordinary outdoor or indoor environments (e.g., kitchens, rooms, offices), with none specifically addressing a theater setting. As a result, recognition performance may decline when models are applied in a theater context. To address this gap, our proposed dataset provides a means to evaluate these models specifically in a theater environment.
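To make the graph-convolution idea concrete, the following minimal sketch (our illustration, not taken from any cited model) propagates per-joint features along a hand-picked subset of skeleton edges; real ST-GCN-style models add temporal convolutions, learned edge weights, and many stacked layers.

```python
# Minimal sketch of a single spatial graph convolution over skeleton joints,
# in the spirit of GCN-based action recognition. The edge list below is only
# a partial illustration, not the exact graph used by any cited model.
import torch
import torch.nn as nn

NUM_JOINTS = 20  # Kinect v1 skeleton

def normalized_adjacency(edges, n=NUM_JOINTS):
    """Build A_hat = D^-1/2 (A + I) D^-1/2 from a list of joint index pairs."""
    A = torch.eye(n)                      # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(dim=1)
    d_inv_sqrt = torch.diag(d.pow(-0.5))
    return d_inv_sqrt @ A @ d_inv_sqrt

class SpatialGraphConv(nn.Module):
    """One graph-convolution layer: features are mixed along skeleton edges."""
    def __init__(self, in_ch, out_ch, A_hat):
        super().__init__()
        self.register_buffer("A_hat", A_hat)
        self.linear = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (batch, frames, joints, channels)
        x = torch.einsum("vw,btwc->btvc", self.A_hat, x)  # propagate along edges
        return torch.relu(self.linear(x))                 # per-joint features

# Hypothetical partial edge list (spine and one arm only, for illustration).
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (2, 8), (8, 9), (9, 10)]
layer = SpatialGraphConv(3, 64, normalized_adjacency(edges))
out = layer(torch.randn(8, 170, NUM_JOINTS, 3))  # 170-frame clip of 3D joints
print(out.shape)  # torch.Size([8, 170, 20, 64])
```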
In conclusion, in this work, we make the following contributions:
To the best of our knowledge, we are the first to collect and provide RGB-D sequences captured in a theatrical setting.
We provide image captions that contain the direction of each region, with a captioning model retrained on our theatre scene dataset.
We analyze and discuss the performance of skeleton-based GCN models for human action recognition when deployed in a new environment, different from the source domain, using our proposed dataset.
TS-RGBD Dataset Description
In this section, we describe the data collection process and dataset statistics in detail, as well as the annotation and cleaning methodologies.
Setup
In order to collect samples in a theatre environment, we sought cooperation with national theaters. We first contacted the UK National Theatre, but the terms of the actors' contracts did not allow us to use their visual content. Our local National Theater, on the other hand, was open to a partnership with the laboratory. However, the limited range of the Kinect sensor prevented us from accurately capturing the depth information of actors situated more than four meters away.
Finally, we opted to film various scenarios at the auditorium of the university (Figure 1), where the distances are convenient for the Kinect sensor.
Figure 1. Scene capturing at the auditorium of the university.
Two Kinect v1 sensors were used, positioned at the same height at two different viewpoints (front view and side view), as shown in Figure 2. We also used more than 76 objects in total to vary the setups, props, and backgrounds. The use of two sensors at different positions with varying background setups results in diverse collected samples.
Figure 2. Illustration of the two-Kinect setup.
We enlisted a team of 8 male students with different body shapes and heights to perform the prepared scenarios on stage. The students signed a legal document granting us permission to use and distribute their visual content within the scientific community.
Data Modalities
The Microsoft Kinect v1 provides three data modalities: RGB images, depth maps, and skeleton data. Each captured RGB and depth sequence has a resolution of 640 × 480 pixels.
The skeleton data, on the other hand, consist of the three-dimensional positions of 20 body joints for each tracked human body; Kinect v1 can detect and track at most two bodies. Figure 3 illustrates the configuration of the 20 captured joints, and a sketch of a possible in-memory representation follows the figure.
Figure 3. Joint configuration provided by Kinect v1.
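For readers who want to work with the skeleton modality, the sketch below shows one plausible in-memory representation of a sequence; the text-file layout parsed here is a hypothetical example, not the dataset's actual distribution format.

```python
# Sketch of a skeleton sequence as a (frames, joints, 3) array of 3D joint
# positions. The per-line file layout below is hypothetical: a body id
# followed by twenty (x, y, z) joint triples.
import numpy as np

NUM_JOINTS = 20  # Kinect v1 joint count

def load_skeleton_txt(path):
    frames = []
    with open(path) as f:
        for line in f:
            values = [float(v) for v in line.split()]
            joints = np.array(values[1:1 + NUM_JOINTS * 3]).reshape(NUM_JOINTS, 3)
            frames.append(joints)
    return np.stack(frames)  # shape: (num_frames, 20, 3)

# A synthetic stand-in sequence at the dataset's average clip length.
sequence = np.random.randn(170, NUM_JOINTS, 3).astype(np.float32)
print(sequence.shape)  # (170, 20, 3)
```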
Our dataset consists of two categories of data: segmented theatre actions and untrimmed theatre scenes.
Segmented Theatre Actions
This category contains 36 action classes that are common in theatre scenes, such as walking, sitting down, drinking, jumping, eating, throwing, etc. We collected actions similar to the action classes of existing benchmarks, as the purpose of this part of the dataset is to evaluate the performance of existing trained models in a theater environment, not to train models on new action classes.
Each viewpoint comprises around 230 sequences, with an average of 170 frames for each sequence. Each action was carried out by 3 males and was repeated at least 3 times with varying speeds.
The TS-RGBD action classes and their numbers of samples are listed in Table 3.

Table 3. Number of Samples Per Action Class.
Untrimmed Theatre Scenes

This category includes 38 written theatre scene scenarios, with a total of 75 sequences for each viewpoint and a mean of 1119 frames per sequence. The scenes are divided into three types: solo, two-person, and group scenarios (Figures 4 to 6).
Figure 4. Example of an interpreted solo scenario.
Figure 5. Example of an interpreted two-person scenario.
Figure 6. Example of an interpreted group scenario.
In summary, with 8 male actors (no female actors were available), we gathered 610 sequences with an average of 373 frames per sequence (at 25 frames per second), for a total of 123,149 frames.
Table 4 presents a summary:
Table 4. Captured RGB-D Data from Written Scenarios.
Figure 7 shows the number of sequences per type of scenario. There are more solo scenes since the Kinect v1 range is limited to four meters.
Figure 7. Pie chart of the number of sequences by type of scenario.
For the image captioning task, we created an application to manually select frames that mark a transition in the video, avoiding redundancies. We then reviewed all selected frames and kept only those with smooth corresponding depth maps. In the end, 1480 key-frames were retained.
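Although our selection was manual, a simple automatic pre-filter could approximate it. The sketch below is our illustration only: it keeps frames whose depth maps contain few invalid pixels and that differ sufficiently from the last kept frame; the thresholds are arbitrary, not the criteria used to obtain the 1480 key-frames.

```python
# Hedged sketch of an automatic key-frame pre-filter: "smooth" depth maps
# (few invalid pixels) and sufficient change from the previous kept frame.
import numpy as np

def select_keyframes(rgb_frames, depth_frames,
                     max_invalid_ratio=0.05, min_diff=20.0):
    kept, last = [], None
    for i, (rgb, depth) in enumerate(zip(rgb_frames, depth_frames)):
        invalid = np.mean(depth == 0)        # Kinect marks missing depth as 0
        if invalid > max_invalid_ratio:
            continue                         # noisy depth map: skip
        if last is not None:
            diff = np.mean(np.abs(rgb.astype(float) - last.astype(float)))
            if diff < min_diff:
                continue                     # too similar: redundant frame
        kept.append(i)
        last = rgb
    return kept
```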
Data Annotation
Many data annotation applications available today offer powerful functionality, but they often come with a trade-off: either the annotated data become publicly accessible, or the application is not free of charge.
Even so, we found a multi-platform desktop application developed by Wada and LabelMe (2021), available to download and install from GitHub. The developer was inspired by the original "LabelMe" application created by MIT for manually annotating data for object detection/recognition and instance or semantic segmentation, allowing the user to draw a box or a polygonal envelope around a region of interest and to attach labels to these boxes and envelopes.
We have annotated 50 images so far, resulting in 504 regions and 109 words, as shown in Table 5.
Table 5. Numbers of Annotated Data.
Figure 8 shows the interface of the "LabelMe" application as well as the polygonal annotation process. For each region of interest, we create a polygonal boundary by enclosing it with relevant points (green points in Figure 8). Instead of assigning a single-word label, we provide a full descriptive sentence. All information, including the region ID, point coordinates, and descriptive label, is stored in a .json file; a minimal reader for these files is sketched after Figure 8. The selected regions of interest include every area of the image that is meaningful for describing its content.
Figure 8. LabelMe interface.
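For illustration, a minimal reader for these annotation files might look as follows; it assumes the standard labelme JSON layout with a `shapes` list, and the file name is hypothetical.

```python
# Sketch of reading one LabelMe annotation file: each polygonal region stores
# a full descriptive sentence in its "label" field together with the
# polygon's point coordinates.
import json

def load_regions(json_path):
    with open(json_path) as f:
        data = json.load(f)
    regions = []
    for shape in data.get("shapes", []):
        regions.append({
            "caption": shape["label"],   # full descriptive sentence
            "polygon": shape["points"],  # [[x1, y1], [x2, y2], ...]
        })
    return regions

for region in load_regions("scene_0001.json"):  # hypothetical file name
    print(region["caption"], len(region["polygon"]), "points")
```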
Egocentric Scene Description

In this article, we propose an approach that offers blind and visually impaired people detailed descriptions of the environment they are in, giving them the opportunity to attend theatre plays. These descriptions are generated by the DenseCap module, which outputs captions for both mobile and static objects and regions in a given scene. The generated captions alone are not enough for users to re-imagine the scene: they also need to know where each object or region is situated relative to their own position (egocentric description). To provide this information, we need depth data alongside the RGB images of the scenes, specifically theatre scenes.
An example of the expected description is shown in Figure 9.
Figure 9. Example of egocentric scene description.
To do that, we retrained the DenseCap model on our dataset. Proposed by Johnson et al. (2016), DenseCap is based on a fully convolutional localization network (FCLN) that outputs boxes around detected regions, each with a caption and a confidence score. DenseCap was trained on the Visual Genome (VG) dataset, a collection of images from the MS-COCO and Flickr datasets in which each image is associated with a rich set of information: regions of interest are delimited with bounding boxes, and each box has multiple captions describing its content. We chose DenseCap because it does not focus only on salient objects and also provides background descriptions, as the VG annotations describe every region present in the image, be it an object, a person, an animal, or the surrounding environment.
After detecting regions and generating the corresponding captions, we applied the algorithm proposed in our previous work (Delloul & Larabi, 2022) to obtain the directions.
Since depth information is not available for the VG dataset, we used the AdaBins model to estimate depth maps for VG images.
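To illustrate how directions can be derived, the sketch below is a simplified stand-in for the algorithm of Delloul and Larabi (2022), which we do not reproduce here: the horizontal position of a region's box selects left/front/right, and the median depth inside the box gives a rough near/far qualifier. Function name and thresholds are ours.

```python
# Simplified stand-in for egocentric direction assignment (not the published
# algorithm): split the image into horizontal thirds and qualify distance by
# the median depth inside the region's box.
import numpy as np

def egocentric_direction(box, depth_map, near_m=2.5):
    """box = (x, y, w, h) in integer pixels; depth_map in meters."""
    x, y, w, h = box
    cx = x + w / 2.0
    width = depth_map.shape[1]
    if cx < width / 3:
        side = "on your left"
    elif cx > 2 * width / 3:
        side = "on your right"
    else:
        side = "in front of you"
    region_depth = np.median(depth_map[y:y + h, x:x + w])
    distance = "near" if region_depth < near_m else "far"
    return f"{side}, {distance}"

# e.g., appended to a DenseCap caption:
# "a man sitting on a chair" + " (" + egocentric_direction(box, depth) + ")"
```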
We modified the DenseCap code provided on GitHub to support training on custom data, and applied transfer learning by reusing the weights provided by the authors and training the model on our data for 10 more epochs.
Table 6 shows the evaluation results of DenseCap on our data before and after retraining, using the METEOR, BLEU, ROUGE, and CIDEr metrics to measure the quality of the generated phrases against human-written references.
Table 6. Captions Evaluation (higher is better).
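As an illustration of how such scores are computed, the snippet below evaluates a single generated caption against one reference with BLEU from NLTK; the numbers in Table 6 come from the full metric suite, and the example sentences here are invented.

```python
# Toy example: n-gram overlap between a generated caption and a reference.
# Smoothing avoids zero scores on short sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man sitting on a wooden chair on the stage".split()
candidate = "a man sitting on a chair".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```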
We then chose 20 random images from VG and from our dataset and manually annotated the direction of each generated region.
Table 7 summarizes the results, where accuracy is the ratio of correct directions to the total number of directions, expressed as a percentage. Qualitative results are shown in Figure 10.
Figure 10. Multiple examples from the TS-RGBD dataset.
Table 7. Egocentric Description Evaluation on Our Images.
Three limitations stand out: captions are redundant because DenseCap generates a fixed number of region captions per image; the egocentric description lacks precision for some regions; and the final description does not mention that the image depicts a theatre play.
Human Action Recognition: Experimental Evaluations with TS-RGBD
In this part, we evaluate the performance of three Graph Convolution Networks: ST-GCN (Yan et al., 2018), 2S-AGCN (Shi et al., 2019), and MS-G3D (Liu et al., 2020) for skeleton-based human action recognition, in a theater environment using the trimmed skeleton sequences from our TS-RGBD dataset (Figure 11).
Figure 11. Examples of skeleton data sequences from the TS-RGBD dataset.
These are spatio-temporal GCN models that extract both spatial and temporal features from skeleton sequences, and they are known for their high recognition performance relative to the state of the art. They were trained on two challenging human action benchmarks, NTU-RGBD (Shahroudy et al., 2016) and Kinetics (Kay et al., 2017); the accuracies they attained are reported in Table 8.
Table 8. Accuracies Obtained by ST-GCN, 2s-AGCN, and MS-G3D on NTU-RGBD and Kinetics.
We load the pre-trained weights of each model trained on NTU-RGBD as the source domain. The choice of the NTU-RGBD benchmark as source domain is motivated by the fact that it is the most similar to TS-RGBD in skeleton data structure, and both were captured in indoor environments.
The obtained accuracies of the pre-trained models on TS-RGBD are presented in Table 9.
Table 9. Test Results of ST-GCN, 2s-AGCN, and MS-G3D With TS-RGBD.
We observe that the performance of all models decreased compared to the source domain. This demonstrates the need for such a new dataset captured in a theater for the development of a theater action recognition system. However, training a model from scratch requires a large dataset, which demands considerable effort and time. A potential solution is to apply transfer learning to these models using our provided sequences. Transfer learning has proved its effectiveness, particularly with CNNs (convolutional neural networks), but few works have investigated it with GCNs, which presents a potential area for future research.
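A minimal sketch of this transfer-learning setup is shown below, using a toy stand-in network rather than the actual ST-GCN/2s-AGCN/MS-G3D code: the pre-trained backbone is frozen and only a new 36-class head is trained on TS-RGBD sequences. Checkpoint path and architecture are hypothetical.

```python
# Hedged sketch: fine-tune only a new classification head on TS-RGBD.
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    """Toy stand-in for a pre-trained skeleton model (backbone + class head)."""
    def __init__(self, num_classes=60):  # 60 = NTU-RGBD action classes
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(170 * 20 * 3, 256), nn.ReLU())
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, frames, joints, 3)
        return self.fc(self.backbone(x))

model = TinyGCN(num_classes=60)
# In practice, load the published NTU-RGBD weights here (path is hypothetical):
# model.load_state_dict(torch.load("ntu_pretrained.pt"), strict=False)

for p in model.backbone.parameters():  # freeze the pre-trained layers
    p.requires_grad = False
model.fc = nn.Linear(256, 36)          # new head: 36 TS-RGBD action classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

skeletons = torch.randn(4, 170, 20, 3)  # dummy batch of joint sequences
labels = torch.randint(0, 36, (4,))
loss = criterion(model(skeletons), labels)  # fine-tune only the new head
loss.backward()
optimizer.step()
```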
On the other hand, we also observe that the MS-G3D model outperformed the other models, so we examined its results in more detail in order to discuss the struggles and challenges of skeleton-based recognition. We analyzed its confusion matrix and extracted the most well-classified as well as the most misclassified action classes (Table 10 and Figure 12).
Figure 12. Confusion matrix of MS-G3D with TS-RGBD.
Table 10. Most Well-Classified and Misclassified Action Classes.
Based on Table 10 and Figure 12, we observe that the model struggles to recognize actions that require details about specific body parts, such as the hand shape, or about the involved object in the case of human-object interaction. For instance, the action "write" requires additional information on the hand's form and the used object, which is not included in the skeleton representation. As a result, it was frequently confused with the action "play with phone" due to the similarity of their skeleton motion trajectories. The same holds for the action "drop," which the model failed to recognize because of missing information about the dropped object and similarities in skeleton motion with other actions, making it difficult to differentiate them based solely on joint positions.
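The kind of analysis behind Table 10 can be reproduced from a confusion matrix, as in the sketch below; the listed predictions are illustrative stand-ins, not our actual test outputs.

```python
# Extract the most frequent off-diagonal confusion from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only, not real test results.
y_true = ["write", "write", "drop", "drink", "drop", "write"]
y_pred = ["play with phone", "write", "throw", "drink", "throw", "play with phone"]
classes = ["write", "play with phone", "drop", "throw", "drink"]

cm = confusion_matrix(y_true, y_pred, labels=classes)  # rows: true, cols: predicted
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)                      # ignore correct predictions
i, j = np.unravel_index(off_diagonal.argmax(), cm.shape)
print(f"Most frequent confusion: '{classes[i]}' predicted as '{classes[j]}' "
      f"({off_diagonal[i, j]} times)")
```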
In conclusion, two major factors have a large impact on skeleton-based recognition performance. The first is the precision of the provided joint positions: recognition performance can drop if the skeleton joints are not well captured or are cluttered. The second is the number of characteristics that can be extracted from the skeleton modality alone: the tested models were unable to recognize some actions that require details about specific body parts, such as the hands, or about the involved object in human-object interactions. As a possible solution, future work may combine the skeleton modality with other modalities to counter this lack of information, which may help differentiate confusing actions with similar skeleton motions, as sketched below.
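As a minimal illustration of such multimodal fusion (our sketch, assuming a hypothetical RGB model alongside a skeleton GCN), the class probabilities of both streams can simply be averaged so that appearance cues help disambiguate actions with similar joint trajectories.

```python
# Late fusion of two classifiers over the 36 TS-RGBD action classes.
import torch

skeleton_logits = torch.randn(1, 36)  # from a skeleton GCN
rgb_logits = torch.randn(1, 36)       # from a hypothetical RGB model

probs = 0.5 * skeleton_logits.softmax(-1) + 0.5 * rgb_logits.softmax(-1)
prediction = probs.argmax(-1)         # fused class decision
print(prediction.item())
```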
Further work on RGB-D human action recognition using the TS-RGBD dataset would boost the development of an assistive system that accurately recognizes human actions on stage. Furthermore, future work might explore the untrimmed scenes within TS-RGBD to develop systems that detect the temporal boundaries of actions. Such advances would enable real-time action recognition, facilitating a more dynamic and interactive experience for blind and visually impaired audience members.
Conclusion

We introduced the TS-RGBD dataset, a novel RGB-D dataset designed specifically for theater scene description. Captured in a theatrical setting using the Microsoft Kinect sensor, the dataset includes synchronized RGB, depth, and skeleton sequences. It contains two types of data: trimmed sequences for actor gesture recognition and untrimmed theater scenes, primarily for image captioning.
By incorporating depth data, the TS-RGBD dataset enhances the performance of image captioning and human action recognition models in live performances, particularly in theaters. The results of testing image captioning models and skeleton-based human action recognition models on the TS-RGBD dataset demonstrate its potential to expand the range of environment types where visually disabled individuals can navigate with the aid of computer vision technology. The combination of accurate human action recognition and textual description of theatre scenes can provide valuable assistance to visually impaired individuals in accessing entertainment places and enjoying theatrical experiences.
The availability of the TS-RGBD dataset opens the door for the development of more inclusive assistive technologies, expanding the accessibility of entertainment venues for BVI individuals and promoting their full integration into society.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
