Abstract
A large number of modern mobile devices, embedded devices and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to find a word or phrase specified in advance (the keyword) in the audio stream.
Introduction
The voice activation task has attracted both researchers and industry for decades. Since it is difficult to formulate an explicit algorithm that determines whether a code phrase has been uttered in an audio stream, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem.
The history of voice activation models has gone through several important stages in parallel with the solution of the more general problem of automatic speech recognition. We would like to highlight the following milestones: the first use of hidden Markov models back in 1989 (Rohlicek et al., 1989); the use of neural networks since 1990 (Morgan et al., 1990, 1991; Naylor et al., 1992); the use of pattern-matching approaches, in particular dynamic time warping (Zeppenfeld and Waibel, 1992); the optimization of loss functions specific to voice activation (as opposed to common metrics such as accuracy), which makes systems more attractive in terms of user experience (Chang and Lippmann, 1994; Szöke et al., 2010); attempts to get rid of a garbage model (Junkawitsch et al., 1997); the construction of voice activation systems for non-English languages such as Chinese (Zheng et al., 1999; Hao and Li, 2002), Japanese (Ida and Yamasaki, 1998) and Persian (Shokri et al., 2011); the construction of discriminative systems (Keshet et al., 2009; Tabibian et al., 2011, 2013); publications describing voice activation systems in mass products (Chen et al., 2014a; Gruenstein et al., 2017; Guo et al., 2018; Wu et al., 2018); and the publication of open datasets for comparing different approaches (Warden, 2018).
Voice activation systems can be applied in various areas: telephony (Shokri et al., 2013; Szöke et al., 2010), crime analysis (Kavya and Karjigi, 2014), assistance systems for emergency situations (Zhu et al., 2013), automated management of airports (Tabibian, 2017) and, naturally, personal voice assistants built into mobile phones and home devices (Gruenstein et al., 2017).
The problem of voice activation is closely related to the problems of automatic speech recognition and spoken term detection. In ASR, the task is to find the most likely sequence of words spoken in the audio recording, whereas in voice activation we only need to find a predetermined set of words or to indicate that no such word was spoken. Of course, a system that solves ASR can easily solve voice activation, but at the moment most speech recognition systems consume an unacceptably large amount of resources for this purpose.
Spoken term detection is the search for a given phrase (which may vary from request to request) in a static set of audio data. In voice activation, the phrase is fixed, but the audio data arrive in real time. Therefore, spoken term detection can use offline methods such as bidirectional neural networks or audio pre-indexing.
Despite the differences between these problems, approaches and ideas often overlap: for example, audio data representation, decoding methods or the architecture of acoustic models. Additional requirements may apply to voice activation systems, for example, responding only to a keyword that was addressed to the system but not to the same keyword spoken in conversation (wake-up-word detection) (Këpuska and Klein, 2009; Zhang et al., 2016), or responding only to a keyword spoken by a registered user (Gruenstein et al., 2017; Manor and Greenberg, 2017; Kurniawati et al., 2012).
In this paper, we will focus primarily on voice activation systems that can be used in embedded systems, in particular, mobile phones. Such systems must satisfy the following properties:
high recall of keyword detection (to build a voice interface, one needs to be sure that the voice interaction can be started; with low recall, the user will have to start the interaction in a different way),
a small number of false positives (since the voice activation system is always on, a large number of false positives is unacceptable: this causes a waste of device resources, distracts the user’s attention and potentially reduces security),
the ability to work entirely on a device with limited resources (firstly, continuously forwarding audio data to remote servers is impossible due to prohibitively high requirements for resources and communication coverage, and secondly, it is undesirable from the point of view of user privacy),
consumption of a small amount of resources (given the previous property, consuming a large amount of resources would lead to rapid battery depletion and slow down other processes),
robustness to noise and to the variability of speech,
a small delay between the utterance of the keyword and system activation.
We will call systems that satisfy these properties
There have been earlier reviews of voice activation systems (Bohac, 2012; Rohlicek et al., 1993; Morgan and Scofield, 1991), but some of their information is outdated due to rapid development in the area. Also, as far as we know, our work is the first systematic literature review on the subject.
This work has the following structure. In Section 2, we describe the structure of a typical voice activation system; this description helps us state the research questions which we aim to answer in this work. Next, in Sections 3, 4, 5, 6, and 7, we provide the answers to these questions. In Section 8, we describe approaches that are difficult to relate to the typical system described in Section 2. Finally, in Section 9, we summarize the study and describe possible areas for further work.
As described in Section 1, voice activation systems have come a long way. One way to study and compare approaches is to provide the model of a system and to compare the individual components of the model. Most voice activation systems (especially modern ones) consist of the following parts:
feature extraction,
application of the acoustic model,
decoding.
For example, Chen et al. (2014a) describe a voice activation system that applies an acoustic model given by a deep neural network to extracted Log Mel-filterbank features (feature extraction) and decides whether the keyword was uttered by smoothing the network outputs and comparing them with a threshold (decoding).
Of course, not all possible voice activation systems are well described by this scheme. For instance, in pattern-matching approaches it is hard to separate the acoustic model from feature extraction. Discriminative spotters are another example. We discuss these and other systems in more detail in Section 8. Nevertheless, even in these systems it is always possible to point out the feature representation of the audio or some kind of acoustic model.
This systematic literature review aims to summarize information available in studies about voice activation systems for embedded devices by answering the following research questions: What acoustic features are used? What types of acoustic model are used? What acoustic units are used in acoustic modelling? What types of decoder are used? What metrics are used to evaluate systems’ quality?
Sound is a continuous physical phenomenon: the transmission of mechanical vibrations in the form of an acoustic wave. However, most machine learning models do not accept continuous data as input. Thus, the extraction of features from an audio recording has two main goals:
representing the audio in a way that is suitable for machine learning methods,
preserving as much as possible of the information needed to solve the problem (i.e. finding keywords) and excluding as much as possible of the information irrelevant to the task (“noise” such as background sounds or the variability of speech).
Most voice activation systems use an approach similar to speech recognition systems (Hinton et al., 2012).
The original recording is segmented into possibly overlapping frames.
In each frame, a numerical vector that describes the behaviour of the sound in this time interval is computed (usually, this vector is computed using the discrete Fourier transform). Let us say that this vector has dimension F.
The resulting numerical matrix of size T × F, where T is the number of frames, can thus be viewed as a 2D image or as a time series. The specially selected transformation used in the second step is responsible for extracting the features most discriminative for the voice activation task.
Of course, not all systems follow this scheme. For example, Kumatani et al. (2017) use the raw waveform (without any selected transformation), and Lehtonen (2005) develops a specific digital signal processing pipeline.
Sometimes, feature quantization is used to increase the speed of operation, to reduce resource consumption, or to serve specific algorithms (Feng and Mazor, 1992).
The audio is segmented into short frames (a popular choice is 25 ms frames with a 10 ms step).
For each frame, the periodogram estimate of the power spectrum is computed. This is similar to the way the human cochlea processes information (different nerves fire depending on the frequency of the sound). To get the estimate, the Discrete Fourier Transform of each frame is computed first:
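For reference, the standard form of this computation is as follows (notation assumed here: s_i(n) is the n-th sample of frame i, h(n) is an analysis window such as the Hamming window, and N is the frame length):

```latex
S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2\pi k n / N},
\qquad
P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2 ,
```

where P_i(k) is the periodogram estimate of the power spectrum of frame i.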
The Mel filterbank is applied to the power spectrum, summing the energy in each filter. The Mel scale relates perceived frequency to actual measured frequency: the human ear is more sensitive to small changes at low frequencies than at higher ones. To convert a frequency f to the Mel scale, the following formula is used:
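One common variant of this conversion (the constants differ slightly between sources) is:

```latex
M(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700\left(10^{M/2595} - 1\right).
```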
The logarithm of the filterbank energies is taken. This also relates to human perception: loudness does not change linearly with energy. The logarithm is a good approximation and also allows channel normalization to be performed with a simple subtraction (e.g. cepstral mean normalization).
The discrete cosine transform is applied. This is done to decorrelate the filterbank energies, which were computed with overlapping filters.
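The four steps above can be sketched in plain NumPy (a simplified illustration, not a production feature extractor; the 512-point FFT, 26 filters and 13 coefficients are common but arbitrary choices):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_ms=25, step_ms=10,
         n_filters=26, n_ceps=13, n_fft=512):
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    step = sample_rate * step_ms // 1000         # 160 samples
    # 1. Segment into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // step
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Periodogram estimate of the power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 4. Logarithm (with a floor to avoid log(0)).
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 5. DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_energies @ basis.T   # shape: (n_frames, n_ceps)
```

Off-the-shelf implementations (e.g. in librosa or Kaldi) add details omitted here, such as pre-emphasis and liftering.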
Although the vast majority of articles use
Among the common techniques, one can use
For a detailed description of the mentioned features, you can refer to the relevant articles or reviews (Giannakopoulos, 2015). The visualization of some of the features for the phrase “Hello, world!” is shown in Fig. 1 and is computed using the framework for speech recognition

Feature visualization for audio file with “Hello, world!” pronunciation.
The acoustic features used in studied sources are presented in Table 7, in Appendix A. The number of times these features were used in the sources is presented in Table 1.
The number of times acoustic features and transformations were used in studied sources.
The task of the acoustic model is to model acoustic properties of the selected acoustic unit. For example, an acoustic model can provide a probability distribution over the vectors of MFCC-features when a certain word is pronounced. Practically, the acoustic model is used to compute
Often it is more natural or easier to compute
The most common acoustic model for voice activation is built as follows. The set of HMM states is logically divided into two parts: a part that represents the audio event of the keyword pronunciation and a garbage (filler) part that represents all other audio.
Each state of the model represents an acoustic unit (see Section 5 for details), for example, a phoneme. The model “says” that at each frame (see Section 3) the acoustic environment is in one of the states of the HMM and generates an observation (the feature vector of that frame).

Hidden Markov model example for Amazon’s keyword spotter (Guo et al., 2018).
It is necessary to be able to calculate
Note that it is the question of definitions what to consider an acoustic model in the HMM-GMM setup. You can either consider GMM (so the part which actually computes
A good acoustic model is the key to a high-quality voice activation system. Therefore, it is not surprising that the computations associated with the acoustic model usually take the biggest part of the voice activation system's runtime. This is why many studies speed up this part. For example, Fernández-Marqués et al. (2018) apply binary arithmetic (instead of floating-point arithmetic) in the model, and Sun et al. (2017) and Szöke et al. (2010) use a neural network architecture where the matrix multiplication of each layer
Another way to build a voice activation system is not to use an HMM, but to calculate some (heuristically selected) value based on the outputs of the acoustic model. For a successful use of this approach, see Chen et al. (2014a).
The acoustic models used in studied sources are presented in Table 8, in Appendix A. The number of times these models were used in the sources is presented in Table 2.
The number of times a specific acoustic model was used in studied sources.
The choice of an elementary unit for acoustic modelling (acoustic unit) affects the resulting quality. A system developer is faced with the following tradeoff: the larger the unit is (e.g. a
If a smaller unit has been chosen, for example,
Sometimes the solution for this tradeoff is to choose
Also, one can choose as a unit not a whole phoneme, but a sub-phonetic part of it, a senone.
We must note that the term senone does not have a strict definition. Some authors like Yu and Deng (2014) define senone as a tied (clustered) triphone state. Some, like authors of Janus Toolkit, call all acoustic units senones (Janus Toolkit Documentation, 2019).
The resolution of this tradeoff depends on the size of the training data (with a small dataset it is much more difficult to build a whole-word model than a phoneme model), the choice of the acoustic model, the key phrase, and the language. As far as we know, at the moment there are no rules prescribing which acoustic unit to choose under which conditions.
The acoustic units used in studied sources are presented in Table 6 in Appendix A. The number of times these units were used in the sources is presented in Table 3.
Number of times specific acoustic unit was used in studied sources.
As a result of applying the acoustic model to an audio stream, we receive values characterizing the probability that a given acoustic unit was pronounced at a given moment. The voice activation system then needs to decide, based on the obtained numeric series, whether the keyword was uttered in the audio stream. To do this, different approaches to decoding are used.
In the simplest case, it is only necessary to compare the obtained number with a threshold value to make a decision: e.g. when the acoustic unit is the whole keyword, the decision is made by comparing the computed probability with 0.5.
Smoothing is usually used to improve recognition quality when comparing with a threshold (Chen et al., 2014a; Lehtonen, 2005). The motivation for this technique is that the keyword is an acoustic event with a certain duration in time; thus, an actual keyword utterance should generate high probabilities for several frames in a row. By applying a smoothing function to the time series, we avoid false positives caused by fluctuations of the acoustic model. Silaghi and Vargiya (2005) suggested an interesting variant of smoothing in which the probability of each phoneme is normalized by the probability of the least probable phoneme.
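As an illustration, the smooth-then-threshold idea can be sketched as follows (a simplified variant in the spirit of Chen et al. (2014a); the window lengths and the threshold here are illustrative choices, not values from the paper). `posteriors` is assumed to be a (frames × keyword-units) matrix of acoustic-model outputs:

```python
import numpy as np

def keyword_hits(posteriors, w_smooth=30, w_max=100, threshold=0.85):
    T, n_units = posteriors.shape
    # Moving-average smoothing of each unit's posterior over w_smooth frames.
    smoothed = np.empty_like(posteriors)
    for t in range(T):
        lo = max(0, t - w_smooth + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    # Confidence: geometric mean of the per-unit maxima in a sliding window,
    # compared against a fixed threshold.
    fired = []
    for t in range(T):
        lo = max(0, t - w_max + 1)
        conf = np.prod(smoothed[lo:t + 1].max(axis=0)) ** (1.0 / n_units)
        if conf > threshold:
            fired.append(t)
    return fired
```

The geometric mean requires every keyword unit to have been confidently observed somewhere in the window, so isolated single-frame spikes are smoothed away.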
In systems that use comparison with a template utterance, Dynamic Time Warping (DTW) is often used. DTW is an analogue of the Levenshtein distance for numerical series. The motivation for this method is that the duration of the recorded template is likely to differ from the pronunciation in real conditions. Thus, we cannot compare two audio fragments directly; one needs “to stretch” or “to squeeze” certain intervals of the template in time. The DTW distance is usually computed with dynamic programming. For a more detailed description and various modifications, please refer to Zehetner et al. (2014).
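A minimal dynamic-programming implementation of the DTW distance looks like this (Euclidean local cost; real systems usually add path constraints such as the Sakoe–Chiba band, omitted here):

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Extend the cheapest of: insertion, deletion, match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the alignment may repeat frames of either sequence, a uniformly stretched copy of a template matches it at zero cost.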
Decoding becomes more meaningful in the case of HMMs. Indeed, in this formulation we need to solve a typical HMM problem: find the most probable sequence of hidden states (if this sequence corresponds to the key phrase then, in some approaches, it means activation) or find the total probability of passing through some set of state sequences (for example, we may not care how many frames in a row the first phoneme of the phrase was pronounced, how many the second, and so on; only the order is important).
The Viterbi algorithm uses dynamic programming to find the most likely sequence of hidden states in a hidden Markov model given the observations. Naturally, this algorithm is widely used in works on HMM voice activation systems. Many authors explore various approaches and heuristics to speed up the algorithm, adapt it to find sequences satisfying additional properties, and so on. For example, Liu et al. (2000) use various techniques of hypothesis pruning and probability rescoring with a bi-gram language model. In Zhu et al. (2013), the possibility of using the Viterbi algorithm on sliding windows of the audio stream is considered. Junkawitsch et al. (1997) consider a modification of the Viterbi algorithm that approximately finds the sequence with the highest probability normalized by the utterance length. Several additional modifications of the Viterbi algorithm are considered in Wilcox and Bush (1992).
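The core of the algorithm fits in a few lines (a textbook sketch in log-space; `log_A` is the state transition matrix, `log_pi` the initial distribution, and `log_B` the per-frame emission log-probabilities produced by the acoustic model):

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Return the most likely state sequence.
    log_A: (S, S) transition log-probabilities (from -> to),
    log_pi: (S,) initial state log-probabilities,
    log_B: (T, S) emission log-probabilities for each frame."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (S, S): predecessor -> successor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a keyword-filler HMM, activation can then be declared when the decoded path passes through the keyword states in order.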
In addition, Wilcox and Bush (1992) discuss how to use the forward–backward algorithm for quick estimation of probabilities needed in decoding.
We would also like to mention the standard technique of using HMM-derived probabilities and reducing decoding to a comparison with a threshold. This approach is conventionally called
Some authors use completely different approaches to decoding. For example, Manor and Greenberg (2017) describe an application of fuzzy logic to decoding.
The approaches to decoding used in studied sources are presented in Table 9 in Appendix A. The number of times each approach was used in studied sources is presented in Table 4.
Number of times the specific approach to decoding was used in studied sources.
A large number of metrics can be used to compare different approaches of voice activation systems. These metrics can be grouped by the aspect of the system they measure:
classification quality,
operation speed,
amount of used RAM and CPU.
Metrics for speed measurement are standard and not specific to voice activation systems. The most commonly used are the real-time factor (RTF), i.e. the total processing time of the audio stream divided by the length of the stream; latency (the average delay between the utterance and the response signal); and total processing time (this metric is less indicative than RTF).
For resource usage, it is most popular to measure the amount of RAM used and the CPU load (as a percentage of a compute core). To improve both parameters, various approaches to quantizing the parameters of the acoustic model are often used (Fernández-Marqués et al., 2018).
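A minimal sketch of the idea, for symmetric linear post-training quantization of a weight matrix to 8-bit integers (illustrative only; real deployments also quantize activations and often fine-tune the quantized model):

```python
import numpy as np

def quantize_int8(w):
    # Map the range [-max|w|, max|w|] linearly onto the int8 range [-127, 127].
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction of the original float weights.
    return q.astype(np.float32) * scale
```

The quantized matrix occupies a quarter of the float32 memory, and the per-weight reconstruction error is bounded by half a quantization step (scale / 2).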
However, at the moment there are no standard metrics to measure classification quality. Moreover, similar metrics are, unfortunately, called differently in different sources. We think it would be beneficial to have a standardized set of metrics in this area.
The main problem is that a voice activation system must satisfy two opposing properties to work well: it must be sensitive enough to react to keyword utterances, and it must be robust enough not to react to sound events that are similar to the keyword but are not actual keywords. Any system can be made arbitrarily sensitive by reacting to every event, or arbitrarily robust by reacting to none. The challenge is to choose the right balance between these two operating points. Therefore, to measure classification quality, one must either use at least two metrics (for example, precision and recall) or one combined metric (for example, F1-score). In the latter case, an unsuccessful choice of metric can lead to false conclusions, since there is no single correct balance between the importance of sensitivity and robustness.
The following metrics are often used to measure classification quality:
detection rate (precision) is the number of correctly recognized keywords relative to the total number of accepted keywords,
substitution rate is the number of mis-recognized keywords relative to the total number of accepted keywords,
deletion rate (false reject rate, opposite to recall, miss rate) is the number of un-detected keywords relative to the total number of keywords,
rejection rate is the number of keywords which are rejected relative to the total number of keywords (false reject rate – FRR),
false alarm rate (FAR) is the number of false alarms (relative to the number of utterances without keyword; sometimes per keyword or per hour of speech),
accuracy (recognition rate) is the number of correctly classified utterances relative to the total number of utterances,
true positive rate (same as recall),
true negative rate (opposite to FAR).
As one can see, there is no generally accepted pair of metrics; moreover, the same metrics are often named differently in different sources.
Figure of merit (FOM) is one of the most used metrics in voice activation research. FOM is the average detection rate over the operating points with k false positives per hour, for each natural number k from 1 to 10. This metric was especially popular until the 2010s; today, such rates of false positives per hour are unreasonably high, so FOM does not reflect the relevant operating modes of modern voice activation systems. Other common metrics are the equal error rate (the value at which FAR and FRR are equal) and ROC-AUC (the area under the ROC curve, i.e. the true positive rate plotted against the false positive rate).
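To make these definitions concrete, FAR, FRR and the equal error rate can be computed from system scores as follows (a simple sketch; `scores_pos` are detector scores on utterances containing the keyword, `scores_neg` on utterances without it):

```python
import numpy as np

def far_frr(scores_pos, scores_neg, threshold):
    # FRR: fraction of true keyword utterances scored below the threshold.
    frr = float(np.mean(np.asarray(scores_pos) < threshold))
    # FAR: fraction of non-keyword utterances scored at or above it.
    far = float(np.mean(np.asarray(scores_neg) >= threshold))
    return far, frr

def equal_error_rate(scores_pos, scores_neg):
    # Sweep every observed score as a candidate threshold and take the
    # point where FAR and FRR are closest; report their average there.
    best_gap, eer = float("inf"), None
    for t in np.unique(np.concatenate([scores_pos, scores_neg])):
        far, frr = far_frr(scores_pos, scores_neg, t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

Sweeping the threshold in this way also yields the full FAR/FRR trade-off curve from which ROC-AUC can be computed.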
Some papers suggest more complex ways of measuring classification quality. For example,
It is hard to compare results from different works not only because different metrics are used, but also because the choice of the dataset and the keyword deeply affects the results. If two works use false alarms per hour to describe system quality, but one uses a dataset of speech recordings and the other a dataset from real user devices (where speech may occupy 3–6 hours of each 24-hour recording), then they will report completely different metrics even for the same voice activation system.
We think it is safe to assume that industrial research provides the best, or close to the best, voice activation systems today because of the large amounts of audio data and computational resources available. Shan et al. (2018) report a system with
The metrics used in studied sources are presented in Table 10 in Appendix A. The numbers of times these metrics were used in studied sources are presented in Table 5.
Number of times the specific metric was used in studied sources.
Some approaches to the construction of voice activation systems are difficult to describe according to the classification proposed in Section 2.
First of all, it is worth mentioning the approaches based on comparison with a template, for example using DTW. In such systems, the user first records one or several pronunciations of the keyword, and then incoming sound fragments are compared with the recordings; activation is reported if the selected similarity measure exceeds a prespecified threshold. The advantages of this approach include the simplicity of both learning (memorization) and operation. In addition, personalization is natural in this approach: one can argue that the recorded templates reflect the specific features of the user's pronunciation, which allow the system to distinguish this user from others if an appropriate similarity metric is used. In practice, however, this approach is not very robust: the quality of its operation depends on how well the similarity measure is chosen and what features are used, and eliminating all the noise and environmental dissimilarity by an appropriate choice of features and similarity measure has proven to be difficult. DTW is one way to calculate the similarity of two time series, possibly of different lengths. Systems using such approaches are described in Morgan et al. (1991), Naylor et al. (1992), Zeppenfeld and Waibel (1992), Kosonocky and Mammone (1995), and Kurniawati et al. (2012). Zehetner et al. (2014) discuss the different underlying similarity metrics to use in the DTW framework. Szöke et al. (2015) discuss the possibility of using DTW even when the keyword can be subject to declension, conjugation, or even word order permutation.
Another interesting approach is to model the appearance (or absence) of keywords with the help of point processes, in particular Poisson processes. In such systems, the parameters of two process families are estimated for each selected feature: for sound with a keyword and for sound without one. An interesting property of such systems is the ability to adjust these parameters during operation, thereby adapting to the channel, the user and the usage scenario. For more information on this approach, see Jansen and Niyogi (2009c) and Jansen and Niyogi (2009b). Sadhu and Ghosh (2017) describe how to apply this approach in systems with limited resources using unsupervised online learning.
Finally we would like to mention the
Conclusion
In this research, we have carried out a systematic literature review of voice activation systems. We proposed the structure of a typical voice activation system and considered the main approaches described in the literature for each of its modules.
Regarding feature representation, most of the techniques are shared with automatic speech recognition. The majority of cited works use MFCC or Log Mel-filterbank features. In this area, we see a reduction of the inductive bias over time: more and more recent papers, such as Raziel and Hyun-Jin (2018) or Myer and Tomar (2018), do not use the DCT step, probably because deep neural networks work reasonably well even with correlated features. We expect further simplification: using the raw waveform or some unsupervised approach like the contrastive predictive coding of Oord et al. (2018).
GMMs, widely used in acoustic modelling, are being replaced with different types of neural networks; we are not aware of any state-of-the-art solutions for the voice activation problem that do not use deep learning. One of the main questions in this area is how to apply neural networks under limited resources. Possible answers include quantization, special network topologies such as time-delay neural networks, and cascades of models that wake up the more powerful and resource-consuming model only when the smaller model is activated.
At the moment, the most widely used systems use phonemes as acoustic units. Phonemes are stable enough to be reliably found in the audio stream and flexible enough to cover the majority (if not all) of keywords.
We believe that voice activation research could greatly benefit from creating open datasets in order to compare different systems. Today it is complicated to compare different works because of different train and test data, different keywords, and sometimes different target metrics.
As a result of the literature review, we noticed that there are some questions to which there are no clear answers in the published sources. So we would like to focus on them and conduct research in these areas:
Acoustic units used in studied sources.
Acoustic features used in studied sources.
Acoustic models used in studied resources.
The approaches to decoding used in studied sources.
The metrics used in studied sources.
