Abstract
In this paper we present the development of an interactive, content-aware and cost-effective digital signage system. Using a monocular camera installed within the frame of a digital signage display, we employ real-time computer vision algorithms to extract temporal, spatial and demographic features of the observers, which are then used for observer-specific broadcasting of digital signage content. The number of observers is obtained by the Viola and Jones face detection algorithm, whilst facial images are registered using multi-view Active Appearance Models. The distance of the observers from the system is estimated from the interpupillary distance of registered faces. Demographic features, including gender and age group, are determined using SVM classifiers to achieve individual observer-specific selection and adaptation of the broadcast content. The developed system was evaluated in a laboratory study and in a field study performed for audience measurement research. Comparison of our monocular localization module with the Kinect stereo system reveals a comparable level of accuracy. The facial characterization module is evaluated on the FERET database with 95% accuracy for gender classification and 92% for age group classification. Finally, the field study demonstrates the applicability of the developed system in real-life environments.
1. Introduction
Digital signage flat-panel displays are emerging as a new, efficient method for providing targeted information [1,2]. They are found in airports, hotels, universities, retail stores and various outdoor public spaces (Figure 1), where they present multimedia content optimized for both information value and visual appeal. Today, the large majority of digital signage applications serve as interfaces to public or internal information, or are used for advertising, brand building and influencing customer behaviour by enhancing the customer's experience [3].

Digital signage displays are permeating public spaces and replacing static signs.
Modern studies of digital signage are actively exploring mechanisms for engaging users with interactive content [4,5]. Various interaction modalities have been proposed, including speech, facial expression, gaze, touch and hand gestures [6]. Interactive digital signage is appearing in urban life and architecture [7] as well as in ubiquitous computing [1]. Ojala
Digital signage displays are advantageous compared to static signs because they can display varying multimedia content such as images, animations, video and audio. Content can be changed in real-time, which, in principle, allows for full context and audience adaptation [9]. However, the high potential of digital signage displays has not yet been fully exploited, as the displayed content is most often generic and uninteresting for observers, causing the
Such interactive, audience-adaptive digital signage systems have the potential to be deployed on a large number of displays. The hardware cost of a single display is therefore important, as is the cost of the software, which favours algorithms without copyright limitations. These considerations were our main motivation for using a monocular camera for the interaction of the system with the observers, since such a camera is very frequently already built into the frame of flat digital displays. Alongside this, we applied reliable state-of-the-art computer vision solutions, selected for their functionality, real-time processing speed, ease of integration and license-free implementation.
The outline of the paper is as follows: Section 2 presents the architecture of our interactive and audience adaptive digital signage system, Section 3 evaluates the performance of the system in a laboratory setting, Section 4 describes the use of the system in a real-world audience measurement study and Section 5 gives a discussion and conclusions.
2. Interactive and audience adaptive digital signage
To address the problems of display and interaction blindness, we designed our digital signage system to sense the presence, activity and characteristics of the observers and to interact with them. A scheme of the proposed system is given in Figure 2.

Schematic of the developed interactive and audience-adaptive digital signage system. Media content is managed at the central content server and then dispatched via a local or global network to each broadcasting location. A camera-enhanced display tracks the observers and their characteristics, and broadcasts adaptive content in real-time, possibly at multiple locations (A and B). Arrows denote a circular information flow, which notably differs from the one-way information flow in common digital signage systems.
We propose a novel and reliable method for the spatial localization of observers that uses a single monocular camera and measures the interpupillary distance of detected faces. Since the interpupillary distance in human faces is fairly stable [12], it can serve as a reference measure for determining the position of detected faces in 3D space, provided that the optical parameters of the camera are known.
Digital images captured by the camera are processed by the digital signage software in real-time, extracting temporal, spatial and demographic features of the observers. By comparing the determined features with the predefined content descriptors, the display software automatically selects and broadcasts content relevant to the specific detected observer. For example, this could be information targeted at young adult males in the age group 25–34 years.
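The matching of observer features against content descriptors can be illustrated with a minimal sketch. It assumes that a descriptor is simply a set of target genders and a set of target age groups, and that the first matching item is chosen; the names `ContentItem`, `select_content` and `PLAYLIST` are hypothetical and the sketch is not the selection rule used in the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentItem:
    """A broadcast item with a hypothetical descriptor of its target audience."""
    name: str
    genders: frozenset      # target genders, e.g. frozenset({"male"})
    age_groups: frozenset   # target age groups, e.g. frozenset({"25-34"})


def select_content(items, gender, age_group, default=None):
    """Return the first item whose descriptor covers the observer, else default."""
    for item in items:
        if gender in item.genders and age_group in item.age_groups:
            return item
    return default


# Hypothetical playlist used only to illustrate the matching step.
PLAYLIST = [
    ContentItem("sports-watch-ad", frozenset({"male"}), frozenset({"25-34", "35-44"})),
    ContentItem("toy-store-ad", frozenset({"male", "female"}), frozenset({"1-14"})),
]

chosen = select_content(PLAYLIST, gender="male", age_group="25-34")
```

A generic item can be passed as `default` so that the display still broadcasts content when no descriptor matches the detected observer.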
The proposed system is implemented in C++, using the Qt application framework and the OpenCV library [14]. A 24-inch Sony Vaio VPCL135FX/B computer display enhanced with a Logitech WebCam Pro 9000 camera is used as a broadcasting unit prototype.
Several computer vision methods are combined to determine the features of the observer. The advantages of this approach and a performance analysis are discussed separately in Section 3. The computer vision methods are organized in a pipeline of three modules (see Figure 3). The first module detects the observers and determines their temporal (e.g. dwell time) and spatial parameters (e.g. distance from the digital signage system). The second module determines the observer's demographic information from the registered facial image, i.e. the gender and the age group of the observer. Finally, the combined features determined by Modules 1 and 2 are used by Module 3 to broadcast observer-profile-specific content on the digital signage display. Below, each module is presented separately and the privacy aspects of the system are explained.

Modules of the developed interactive and audience adaptive digital signage system. Modules 1 and 2 provide localization and demographic information about currently present observers, which is then used by Module 3 for content-aware broadcasting.
2.1 MODULE 1: Temporal and spatial localization of observers
Pre-processing of captured images includes background segmentation. We use Mixture-of-Gaussians based background modelling [13] to extract foreground regions and define the possible presence of observers.
The number of observers and their presence time are estimated by face detection and tracking algorithms. The Viola & Jones face detection algorithm [14] is applied to the pre-processed foreground regions to distinguish true observers facing the display from non-observer patterns. Using a frontal face detector, we can determine the location of the observers' faces down to a size of 20×20 pixels per face in real-time, regardless of their actual position and physical scale. The positively identified segmented regions, i.e. the observers, are tracked using a Fast Match Template algorithm, specially adapted for real-time video processing and supplied in the OpenCV library [15]. The observer's upper body is used as the template image.
Facial images are registered using the multi-view Active Appearance Model (AAM) method. The AAM simultaneously models the intrinsic variation in shape and texture of a deformable visual object as a linear combination of the basis modes of the variation [16]. Although linear in both shape and appearance, overall, the AAMs are nonlinear parametric models in terms of pixel intensity. Fitting an AAM to an image consists of minimizing the error between the input image and the closest model instance; i.e. solving a nonlinear optimization problem [17]. More specifically, the face registration determines the position of 66 facial feature points, where for example, each eye is described with 6 feature points that form a convex polygon around the eye orbit. The centroid of this polygon is calculated in order to determine the centre of the observer's eye. We denote the centroid points of the left eye and the right eye as
and is inversely proportional to the distance between the face and the camera. Therefore, we can use the following distance estimation function
where x is the estimated
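The geometric idea behind the distance estimate can be sketched as follows. The sketch computes each eye centre as the centroid of its feature-point polygon and then applies a pinhole camera model, in which the interpupillary distance in pixels is inversely proportional to the distance from the camera. The pinhole model, the function names and the constants `MEAN_IPD_MM` and `FOCAL_LENGTH_PX` are assumptions made for illustration; they are not the estimation function or the calibration values of the described system.

```python
import math

# Assumed values for illustration only; they are not taken from the paper.
MEAN_IPD_MM = 63.0        # assumed average adult interpupillary distance
FOCAL_LENGTH_PX = 800.0   # assumed camera focal length expressed in pixels


def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon given as ordered (x, y) vertices."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # shoelace term for edge i
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0.0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def estimate_distance_mm(left_eye_points, right_eye_points,
                         focal_length_px=FOCAL_LENGTH_PX, ipd_mm=MEAN_IPD_MM):
    """Pinhole-model distance estimate from the pixel distance between eye centres."""
    lx, ly = polygon_centroid(left_eye_points)
    rx, ry = polygon_centroid(right_eye_points)
    ipd_px = math.hypot(rx - lx, ry - ly)
    if ipd_px == 0.0:
        raise ValueError("eye centres coincide")
    return focal_length_px * ipd_mm / ipd_px
```

Under these assumptions, a face whose eye centres are twice as far apart in the image is estimated to be at half the distance.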
2.2 MODULE 2: Obtaining demographic features
Observers are classified by gender (male or female) and into one of 7 age groups: 1–14, 15–24, 25–34, 35–44, 45–54, 55–64 and 65 years or over. We use the AAM facial registration method of Module 1 to register a face and warp it to a normalized frontal form of 50×50 pixels. The FERET database [18] is used to train Support Vector Machines (SVM) as classifiers of gender and age group.
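The age grouping can be made concrete with a small sketch that maps an age in years to one of the seven groups, for example when preparing class labels for training. The function name, the ASCII label strings and the rejection of ages below 1 are assumptions made for illustration.

```python
from bisect import bisect_right

# Inclusive lower bounds (in years) of the seven age groups listed in the text.
AGE_GROUP_LOWER_BOUNDS = [1, 15, 25, 35, 45, 55, 65]
AGE_GROUP_LABELS = ["1-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_group_index(age_years):
    """Return the index (0..6) of the age group that contains age_years."""
    if age_years < AGE_GROUP_LOWER_BOUNDS[0]:
        raise ValueError("age below the lowest group bound")
    # bisect_right counts how many lower bounds are <= age_years.
    return bisect_right(AGE_GROUP_LOWER_BOUNDS, age_years) - 1


def age_group_label(age_years):
    """Return the label of the age group that contains age_years."""
    return AGE_GROUP_LABELS[age_group_index(age_years)]
```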
2.3 MODULE 3: Content-aware broadcasting
Upon setting up the digital signage broadcasting, each broadcasting item is described with a content descriptor
where α
where
Finally, to choose the actual content item to be broadcasted
For
2.4 Privacy aspects
The architecture of our interactive and audience adaptive digital signage system is designed according to the Privacy-by-design principles [20], to ensure the appropriate handling of personal data. Image capturing and processing are performed by the display unit in real-time; therefore, no visual records are stored or distributed over the network. Video images of actual observers are discarded immediately after processing, storing only the observers' audience
3. Laboratory evaluation of the system
The performance analysis of the proposed digital signage system was first conducted in a laboratory study. The modules were monitored separately to obtain clear information about their performance.
Five observers were asked to walk along a straight path, facing the camera of our interpupillary distance (IPD) based system and the Kinect (aligned and positioned one above the other). Markers were placed on the floor along the path at 10 cm intervals, at distances from 0.5 m to 8 m. When the observer reached a marker, the estimated distance was recorded by both systems. Figure 4 shows the root mean square error (RMSE) of the distances obtained in this experiment.

Root mean square error (RMSE) of the IPD-based (red) and Kinect (blue) distance measurement systems. The errors of both systems increase with distance but stay in general agreement.
The comparison shows an overall RMSE of 18.7 cm for the Kinect and 19.3 cm for the IPD estimator. The Kinect gives stable results up to a distance of 4 m, with the error increasing exponentially with distance, in agreement with [21]. The relative standard error of our IPD estimator for a given distance range is 4.3%. On average, the IPD distance estimator gives results almost equivalent to those of the Kinect stereo system, and at distances of over 4 m its results are even more stable, which indicates that Module 1 performs accurately.
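For reference, the RMSE used in this comparison can be computed from paired distance estimates and marker distances as in the following sketch. The function name and the sample values in the usage lines are hypothetical and are not measurements from the study.

```python
import math


def rmse(estimated, reference):
    """Root mean square error between paired estimates and reference values."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical readings (in cm) at three floor markers, for illustration only.
marker_cm = [100.0, 200.0, 300.0]
estimate_cm = [104.0, 197.0, 300.0]
error_cm = rmse(estimate_cm, marker_cm)
```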
Comparison of evaluated machine learning methods for gender and age group classification within Module 2.
The highest accuracy for both gender and age group classification was achieved using SVM, at 95.2% and 91.7%, respectively. Following this accuracy study, SVM is used in our system as the main method for gender and age group classification.
4. Field application of the system
We used the developed system in a real-world environment for an audience measurement study [22], which is part of a larger marketing study. With the described system we measured the attention time that shoppers gave to a digital signage display in a clothing store. Normally, such studies rely on qualitative assessments based on interviews or questionnaires [9-11].
The described system, however, provided fully quantitative data on the number, age and gender of the visitors, as well as on the attention time they gave to the digital signage display. The accuracy and volume of the data collected with our system are difficult to match with traditional methods of market analysis, which creates the possibility of analyzing the collected data with machine learning methods. Besides gathering data in relation to the display blindness hypothesis [10] in general, the goal of the audience measurement study was also to determine what type of content (static or dynamic, such as video) attracts more attention.
The digital signage system was installed in a clothing boutique in Ljubljana, Slovenia. The shop sells higher priced sports fashion and apparel, which can affect demographic and behavioural characteristics. The floor plan of the store consisted of a main area situated between the entrance and the cashier's desk, with additional room in the back used for changing. To make the most out of the floor plan, the display was mounted at eye-level on a special shelf next to the cashier's desk, directly facing the entrance (see Figure 5a).

Real-life audience measurement study with the developed system. a) A typical image captured by the digital signage screen unit. b) Image after segmentation described in Section 2.1.
The audience measurement study was performed in 23 daily sessions, recording a total of 214 hours. The system detected 1294 people and determined their gender and age group as well as their attention time given to the display.
The audience measurement study reveals that 61% of all detected customers in the store were female and 39% were male. The age distribution of customers was as follows: 7% in the category 1–14 years, 10% in 15–24 years, 20% in 25–34 years, 25% in 35–44 years, 19% in 45–54 years, 12% in 55–64 years and 7% in the category 65+ years. Dwell time is the time a customer spends in the same room as the display, and attention time is the time a customer spends looking at the digital signage display. The attention time data reveal that, on average, men pay attention to the digital signage display for 1.2 s, whereas women do so for only 0.4 s. The age group comparison shows that attention time is highest (2.4 s) for children (1–14 years), while the average attention time of all customers is 0.7 s. Interestingly, the average attention time is lowest in the 35–44 years age group (0.42 s). Comparing content types shows that broadcasting dynamic rather than static digital signage content increased attention time by 43%.
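As a quick consistency check of the reported figures, the overall average attention time can be recovered as the share-weighted mean of the per-gender averages. The sketch below assumes that the reported percentages and times are rounded values and that every detected customer was assigned to one of the two gender classes; the function name is hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of values, with weights that need not sum to one."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Reported gender shares and average attention times (seconds).
shares = [0.61, 0.39]          # female, male
attention_s = [0.4, 1.2]       # female, male

overall_attention_s = weighted_mean(attention_s, shares)
```

The result is approximately 0.71 s, which agrees with the reported overall average of 0.7 s.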
We performed a correlation analysis of the audience measurement data acquired with the described digital signage system using Spearman's rank correlation coefficient

Distance map of audience measurement data using absolute Spearman's dissimilarity ρ
As expected, the distance map reveals large distances among gender, age group and content type since they are independent. Small distances between attention time and independent variables (gender, age group, content type) confirm statistical dependency and this validates trends observed in the audience measurement study.
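The distance underlying such a map can be sketched as follows. Spearman's coefficient is computed here as the Pearson correlation of average ranks, which handles tied values, and the dissimilarity is taken to be 1 − |ρ|. This particular form of the dissimilarity and all function names are assumptions made for illustration; the exact definition used for the distance map is not recoverable from the text above.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_distance(x, y):
    """Assumed dissimilarity 1 - |rho|: 0 for perfectly monotonic relations."""
    return 1.0 - abs(spearman_rho(x, y))
```

With this definition, both a perfectly increasing and a perfectly decreasing relation between two variables give a distance of 0, while unrelated variables give a distance close to 1.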
For the marketing study, which also required additional parameters describing customers who arrived in groups, the algorithm-determined audience measurement parameters (gender, age group, dwell time and attention time) were also determined manually. Comparing the algorithm-determined and manually determined parameters from the field study provides a direct means of assessing the accuracy of the automatically determined parameters, i.e. the overall performance of the digital signage system. The comparison shows that the system performs well in the field, with a classification accuracy of 86.6% for gender and 77.1% for age group.
5. Conclusion
In conclusion, we have described the design, implementation and use of an adaptive digital signage system. Based on functional requirements of an audience adaptive digital signage system which can perform in real time, we identified its components and overall architecture. Three main modules were designed and developed: (i) Module 1 for the spatial and temporal localization of the observers, (ii) Module 2 for the demographic features of the observers and (iii) Module 3 for content-aware broadcasting. The accuracy of the developed system was evaluated in a laboratory study achieving spatial accuracy in the tracking of observers with a relative standard error of 4.3%. With our monocular spatial localization module, we achieved results comparable to the distance estimates obtained by the stereo-based Kinect. In this way we demonstrated that it is possible to reliably determine the distances of people with a single monocular camera.
We also assessed the performance of our system in a real-world environment in the context of a marketing study. The goal was to determine the attention time that visitors of a clothing store paid to the digital signage display. The digital signage system provided accurate quantitative data on shoppers, which are otherwise typically obtained only qualitatively in marketing studies.
License-free computer vision and machine learning algorithms that operate in real-time on low-price hardware were selected. All of the algorithms and software components that we used are copyright free, which makes the proposed architecture even more suitable for practical implementation.
Clearly, more complex hardware enhancements of the standard display, e.g. infrared sensors or multiple cameras for stereo vision, could lead to even more accurate results, but a good price-to-performance ratio and simple implementation were also integral to our design. This choice is all the more reasonable because displays with built-in monocular cameras are becoming widely accessible. The use of more complex algorithms could improve observer tracking and classification accuracy but would at the same time require more processing time or more processing power. The price-performance trade-off of future interactive signage systems will probably be determined only after more experience is gained with such systems.
We believe that the main contribution of this paper is an operational, cost-effective interactive signage system where the viewing statistics and interaction with the audience are achieved with efficient and real-time capable computer vision techniques. We hope that this will contribute to the future development and design of intelligent digital signage systems.
6. Acknowledgments
This work was supported by the Slovenian Research Agency, research program Computer Vision (P2–0214).
