Abstract
Objective
This study aims to systematically review and synthesize studies on the application of machine learning for classifying infant cry types, identifying pathological cries, and evaluating the accuracy of infant cry recognition.
Methods
This review followed the PRISMA guidelines and was registered in PROSPERO (CRD42024600969). The literature search was conducted across four databases: PubMed, CINAHL, Embase, and IEEE Xplore. Included studies focused on machine learning-based classification of infants’ need-related cries or pathological cries and were published in English between January 1, 2014 and October 31, 2024. Study quality was assessed using the QUADAS-2 tool.
Results
Of the 919 studies identified, 17 were included in the final synthesis. Machine learning has been applied to classify infant cries into two main types: infants’ need cries and pathological cries, with some studies addressing both. Need-related cries comprised nine subtypes, while pathological cries comprised six. Classification accuracy varied by machine learning classifier and the features used, ranging from 44.5% to 99.82%. Among infants’ need cries, the highest accuracy was achieved for hunger and pain cries (99.82%) using a Gaussian mixture model (GMM) classifier with constant-Q cepstral coefficient features. For pathological cries, the highest accuracy was for detecting deafness (99.42% to 99.82%), using a genetic selection of a fuzzy model and a GMM classifier.
Conclusions
Machine learning shows strong potential for accurately classifying infant cries and detecting pathologies. Future research should prioritize developing diverse cry datasets to improve model generalizability, evaluating performance in real-world settings, and integrating cry analysis with physiological signals to enhance diagnostic accuracy.
Introduction
Infant cries are one of the cues through which infants communicate and interact with their caregivers. These signals can convey infants' needs and reflect pathological disorders. 1 Crying is a natural infant behavior and is identified according to the infant's needs, such as hunger, pain, discomfort, a wet diaper, and sleepiness.2,3 Moreover, an infant's cries can indicate health issues or illnesses, such as infections, respiratory distress syndrome (RDS), or neurological conditions.1,4,5 Since infants cannot communicate verbally, it is essential for caregivers to interpret the meaning behind their cries; this understanding is key to providing appropriate responses. However, infant cries encompass a wide range of meanings, which can make them challenging to interpret accurately. First-time mothers, who lack experience in caring for babies, are more likely than experienced mothers to misinterpret cries, leading to inappropriate responses. 6
The concept of maternal sensitivity can explain responsiveness to an infant's needs. 7 This framework describes the dynamic process of perceiving, interpreting, and responding to the infant's signals based on previous caregiving experience and the quality of caregiver–infant interaction. A responsive mother can accurately perceive the infant's signals, interpret them, and respond appropriately, enhancing infant development and attachment. Conversely, difficulties in interpreting or responding to infant cries may increase caregivers' stress and their risk of psychological or physical health problems. Petzoldt et al. 8 found that first-time mothers of excessively crying infants were more likely to develop anxiety disorders due to a lack of infant care experience. Similarly, Oberlander and Rotem-Kohavi 9 indicated that an inability to respond to infant cries can contribute to postpartum depression. In terms of physical health effects, Brand et al. 10 indicated that caregivers who struggle with infant crying may experience sleep disturbances, depressive symptoms, and family strain.
The analysis of infant cries has demonstrated that pathological conditions can be evaluated through acoustic cry analysis. Valdes et al. 1 identified specific acoustic features that can distinguish normal cries from pathological cries based on the characteristics of the infant's voice. For instance, healthy cries are typically loud and exhibit an ascending–descending melody pattern with a frequency range of 400 to 650 Hz. In contrast, abnormal cries tend to have a shorter duration, a monotonous melody, and a frequency higher than 650 Hz. Infant cries can therefore serve as a valuable tool for identifying pathological conditions in medical diagnosis. For example, in diseases affecting the central nervous system, cries exhibit extremely high frequencies of 3000–4000 Hz. Conversely, in hypothyroidism, the cries have a lower fundamental frequency than normal cries, though the spectrogram resembles that of healthy cries. While the differences between normal and pathological cries can be identified from their acoustic characteristics, distinguishing them accurately remains challenging, particularly in medical diagnostics that require precise evaluation.
Traditionally, several studies on infant cries have assumed the existence of distinct cry types, such as hunger, pain, or discomfort cries, which can be acoustically differentiated and classified.2,3,11,12 However, this typological perspective has been challenged by the graded cry hypothesis, which proposes that infant cries vary along a continuum of arousal or distress rather than representing discrete categories.13,14 According to this view, acoustic differences primarily reflect the intensity of distress rather than specific needs. This debate is critical for cry classification studies: if cries are graded rather than categorical, then labeling them as fixed types may impose artificial boundaries. In medical or clinical practice, relying solely on distress levels can make it difficult to diagnose or detect pathological conditions, since pathological cries may share similar acoustic markers with highly distressed but otherwise healthy infants. 15 Therefore, before evaluating how effectively algorithms can classify infant cries, it is essential to consider how the cries are labeled. The labeling process is crucial and should account for how each cry was identified. In clinical and research settings, the practical approach to cry labeling often relies on contextual cues, such as the action that successfully stops the cry or the stimulus preceding it. For example, Liang et al. 16 labeled cries based on the action that stopped the crying (e.g. a hunger cry was identified when the infant stopped crying after being fed) or based on the event that caused the crying (e.g. a pain cry was labeled during invasive procedures). Similarly, Parga et al. 15 ensured labeling reliability by having two medical staff independently identify each audio recording, with pain cries captured during painful stimuli. 
Identifying distinct cry types thus remains challenging under the graded cry hypothesis. 14 This underscores the need to carefully consider how labels are defined and validated before they are used to train machine learning (ML) models. Additionally, ML approaches learn from acoustic features that vary along a continuum and transform these graded patterns into categorical outputs, bridging the gap between continuous vocal variation and relevant classifications. 17
In the current era of advanced technology, ML has been utilized to recognize and differentiate infant cries, offering a helpful solution to these problems. ML is a subset of artificial intelligence that uses algorithms to learn from past data in order to forecast and make decisions within a specific domain. 18 ML models can predict, categorize, and cluster; in supervised learning, algorithms learn patterns from labeled historical data. 19 For example, in adult patients, Jian et al. 20 used ML classification algorithms to predict and classify eight diabetes complications, reaching 97.8% accuracy. In pediatrics, Tesfaye et al. 21 developed ML models to predict childhood anemia using sociodemographic, economic, and maternal and child variables, with predictive accuracy ranging from 60% to 66%.
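To make the supervised-learning workflow described above concrete, the following minimal sketch (not drawn from any reviewed study; the data and classifier choice are illustrative) trains a model on labeled examples and evaluates it on held-out data, mirroring how the cited studies report accuracy.

```python
# Minimal supervised-learning sketch: learn from labeled historical data,
# then predict labels for unseen inputs. All data here are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic "historical data": 200 samples, 5 numeric features, 2 classes
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # proportion correct on unseen data
```

The accuracy figures reported throughout this review correspond to this final evaluation step, computed on data the model did not see during training.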
In maternal and infant care, ML has been used to classify and recognize infant cries, enhancing caregivers’ understanding of an infant's signals. Although the human ability to understand an infant's cry should not be overlooked, a well-constructed computer system can classify audio more accurately. Mukhopadhyay et al. 22 compared human and ML accuracy in differentiating infant cry types: humans achieved 33.09% accuracy, whereas ML achieved 80.56% on the same dataset. ML is also utilized in healthcare systems to differentiate between the cries of healthy and sick infants. For example, Zayed et al. 23 applied ML to classify healthy, sepsis, and respiratory distress cries. Similarly, Rosales-Pérez et al. 24 used ML to distinguish between pathological cries (e.g. asphyxia and deafness) and specific need cries (e.g. hunger and pain).
From previous literature reviews on maternal and infant care, several studies have explored the application of ML in predicting neonatal mortality, 25 neonatal outcomes in neonatal intensive care units, 26 preterm birth, 27 and pregnancy complications. 28 However, there is a lack of systematic reviews specifically focusing on the application of ML to classify infant cries based on pattern recognition. Therefore, the purpose of this study is to systematically review studies on the application of ML in classifying infant cry types and identifying pathological cries, as well as to evaluate classification accuracy. This review focuses on research published between January 2014 and October 2024 to provide the most recent coverage of evidence. The research questions of this study are: (1) How has ML been used to classify infant cry types and pathological cries? (2) How accurately can ML classifiers recognize patterns in infant cries?
Methods
Protocol and registration
This systematic review was conducted following the methodological guidelines of systematic reviews and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) framework. 29 A narrative synthesis approach was employed to describe the findings. The study protocol has been registered with the PROSPERO database (registration number: CRD42024600969).
Eligibility criteria
Inclusion criteria: Studies were included if they focused on classifying infant cries based on needs or detecting pathological cries using ML algorithms; employed observational designs, such as cohort, case-control, or cross-sectional studies; involved infants aged no more than 2 years; and were published in English between January 1st, 2014 and October 31st, 2024.
Exclusion criteria: Studies were excluded if they did not specify the type of ML classifier or the features used, or did not report outcomes related to cry type and classification performance (accuracy, sensitivity, or specificity rate). To ensure the rigor and reliability of the review's findings, gray literature, conference proceedings, pilot studies, protocols, case studies, dissertations, and editorials were excluded.
Search strategies
In this review, searches were conducted on October 31st, 2024, with the support of a health science librarian, across four electronic databases: PubMed, CINAHL, Embase, and IEEE Xplore (Institute of Electrical and Electronics Engineers). The core search strategy focused on the concepts of ML and infant crying, combined with terms related to classification and pattern recognition. To ensure comprehensive coverage, the search terms included “machine learning” OR “deep learning” AND “infant cries” AND “classification” OR “pattern recognition,” along with various commonly used keywords (e.g. “convolutional neural network,” “support vector machine,” “newborn cry,” and “baby cry”). The search terms were applied using database-specific search methodologies and incorporated Boolean operators, Medical Subject Headings (MeSH), and free-text terms tailored to each database: MeSH terms or title/abstract searches for PubMed, CINAHL, and Embase, and IEEE Terms for IEEE Xplore. For example, the term “machine learning” was applied as follows: in PubMed as (“Machine learning”[mh]), in CINAHL as (MH “Machine learning+”), in Embase as (“machine learning”/exp), and in IEEE Xplore as (IEEE Terms: “Machine learning”). The full search strategy is provided in Table S1 of the Supplemental materials.
Selection process and data collection
All retrieved articles were imported into Rayyan (copyright © 2022), a web-based application for screening studies in systematic reviews, and duplicate studies were removed. Two reviewers (SS and NK) independently screened the articles by title and abstract using a practical screening table developed from the inclusion and exclusion criteria. This table guided the reviewers through the screening process and minimized selection bias. Following this initial screening phase, the selected articles were reviewed in full text to assess eligibility and relevance; articles that did not meet the eligibility criteria were excluded. Any disagreements between the two reviewers were resolved through discussion, and unresolved conflicts were referred to a third reviewer (SN). This systematic process ensured that only relevant studies were included in the further assessment.
Data extraction
Relevant data from the selected articles were extracted using a predeveloped data extraction form in Microsoft Excel. The main categories for data extraction were: (1) general characteristics, including the authors and year of publication, the number of datasets, dataset characteristics, and sample size; (2) performance, covering the ML algorithms used as classifiers; and (3) outcomes, including classifications of crying types and diagnostic performance metrics, specifically accuracy (%), sensitivity (%), and specificity (%). The characteristics of the included studies are presented in Table 2.
Quality assessment
Two reviewers (SS and NK) were assigned and independently assessed the quality of the included studies by using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. 30 This tool is designed for assessing the quality of primary diagnostic accuracy studies in systematic reviews. The QUADAS-2 consists of four domains: (1) patient selection, (2) index test, (3) reference standard, and (4) flow and timing. Each domain is assessed for risk of bias, and the first three domains are also evaluated regarding applicability. There are 10 questions for assessing the risk of bias aspect and three questions for assessing the applicability. The results are reported as “low risk,” “high risk,” and “unclear risk.”
For the interpretation of overall quality, if all questions within a domain are answered “yes,” the study is judged as having a “low risk,” indicating that appropriate methods and safeguards against bias were clearly reported. A study is considered “high risk” if at least one question within the domain is answered “no,” reflecting evident methodological flaws. An “unclear risk” is assigned when the study provides insufficient information for a judgment. 30 The detailed results of the quality assessment are provided in the Supplemental materials.
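For illustration only, the domain-level judgment rule described above can be expressed as a small hypothetical helper (the function name and input format are ours, not part of the QUADAS-2 tool):

```python
def domain_risk(answers):
    """Judge one QUADAS-2 domain from its signalling-question answers.

    answers: list of "yes", "no", or "unclear" strings.
    Rule as described in the review: all "yes" -> low risk;
    any "no" -> high risk; otherwise (insufficient information) -> unclear risk.
    """
    if all(a == "yes" for a in answers):
        return "low risk"
    if any(a == "no" for a in answers):
        return "high risk"
    return "unclear risk"

# e.g. a domain with one unanswered signalling question
judgment = domain_risk(["yes", "unclear", "yes"])  # -> "unclear risk"
```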
Data analysis and synthesis
Data were extracted into an extraction form to tabulate and visually display the results of each study. The results were categorized into three parts: (1) infant cry type, (2) classifier type, and (3) classification accuracy. A table was created to show the infant cry types and the ML classifiers used in each study, providing a clear visual representation of the different classifier types. Another table presents the classification accuracy of each classifier, helping to distinguish and synthesize evidence based on three main groups of ML models: supervised learning, unsupervised learning, and hybrid models. 31 A narrative synthesis was used to describe the results of the review. Additionally, pie charts were used to illustrate the proportions of physiological and pathological cry classifications using ML.
Results
Search overview results
Figure 1 illustrates the study selection process. A total of 919 studies were identified across the four databases, and 37 duplicate records were removed using Rayyan. The remaining 882 studies were screened by title and abstract, and 840 were excluded for not meeting the eligibility criteria. This left 42 studies for full-text retrieval, of which two32,33 were excluded because the full text was unavailable. As a result, 40 studies were assessed for eligibility, and 23 were excluded for the reasons described in the PRISMA diagram in Figure 1, leaving 17 studies included in this systematic review.

PRISMA flow diagram of the included studies.
Included study characteristics
All 17 included studies were cross-sectional, with publication years ranging from 2014 to 2024. For the infant cry datasets, 11 studies2–5,11,15,16,23,34–36 utilized self-recorded sounds, five studies12,24,37–39 relied on public databases, and one study 40 did not specify the source, as shown in Table 2. In total, there were approximately 113,677 infant cry samples, with individual datasets ranging from 300 to 54,744 sounds. Regarding classifier type, 13 studies2–5,11,15,23,34,36,37,39–41 used supervised learning, two studies35,38 used unsupervised learning, and the remaining two studies12,24 used a hybrid approach, as detailed in Table 4. Ten studies2,3,11,12,15,16,34,35,37,40 reported on specific cry types related to infants’ needs, four studies4,5,23,36 focused on pathological cries, and three studies24,38,39 addressed both categories.
The distribution of infants’ need cry types is illustrated in Figure 2, which includes nine subtypes. The hunger cry represented the largest proportion (29%), followed closely by the pain cry (27%). Other notable types were the sleepy cry (12%), the discomfort cry (10%), the wet diaper cry (7%), and the burp and fussy cries (5% each). The remaining types, holding and cold, constituted smaller proportions of 3% and 2%, respectively. Figure 3 presents the distribution of pathological cry types, divided into six subtypes. The most prevalent were the sepsis and respiratory distress cries (22% each), followed by asphyxia (21%) and deafness (21%). The smallest proportions were hypoxic-ischemic encephalopathy (HIE) and asthma, each accounting for 7%.

The proportion of infants’ need cry types using machine learning classification.

The proportion of pathological cry types using machine learning classification.
Overview and comparison of existing infant cry databases
The infant cry data from the 17 included studies can be divided into two main types: public cry databases and self-recorded datasets. Five studies12,24,37–39 used public cry databases, 11 studies used self-recorded datasets, and one study 40 did not specify the dataset source. To evaluate the quality and characteristics of the datasets used in this review, each database and self-recorded dataset is described in detail below.
Infant cry databases in this review comprise four main datasets: Donate a Cry Corpus, 12 Baby Chillanto Infant Cry,12,24,38,39 Dunstan Baby, 37 and the In-House Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) Infant Cry. 38 Each database varies in the types of cries, the number of cry samples, and infant age ranges. The Donate a Cry Corpus 42 demonstrates diversity across both cry contexts and infant ages, as it was developed through a crowdsourced mobile application that enabled global parental participation. However, because it contains 457 audio files across five cry types (belly pain, burping, discomfort, hungry, and tired) from an unspecified number of infants aged 0 to 2 years, the lack of controlled recording environments and limited identity tracking may introduce acoustic variability and background noise that can affect feature extraction accuracy. 43 The Baby Chillanto Infant Cry Database 44 was created by the Instituto Nacional de Astrofísica, Óptica y Electrónica in Mexico under the CONACYT program and provides medically supervised, structured data collection. It contains 2268 audio recordings in five classes (asphyxia, deaf, normal, hunger, and pain), collected by specialized physicians under controlled conditions from newborns up to 6 months old. Although this ensures high acoustic quality, the dataset does not fully specify the number of unique infants, limiting transparency about how many samples were obtained per individual. The Dunstan Baby database45,46 was developed for commercial purposes by Priscilla Dunstan and her research team; Dunstan's experience in opera and as a mother allowed her to recognize specific sounds in the human voice. The database contains cries from infants aged 6 months or younger in five types: hungry, burp, sleepy, pain, and discomfort. Approximately 83 cry recordings from 39 infants were captured on video in a studio to eliminate noise.
However, its small sample size restricts the dataset's variability and may lead to overfitting. 47 In-House DA-IICT Infant Cry was developed at the Dhirubhai Ambani Institute of Information and Communication Technology in India for research purposes. It includes 1190 cry samples in three categories: healthy, asthma, and HIE, but lacks documentation of the number of participating infants and their ages. Overall, existing infant cry databases differ widely in cry types, infant ages, and the number of recordings, reflecting diverse data collection methods. However, most databases lack detailed information on infant identity and standardized labeling, which limits data transparency and comparability across studies.
The self-recorded datasets created by the authors in each study included a variety of infant cry recordings that differed in several characteristics. The details of these datasets are shown in Table 1, which summarizes three key aspects: infant identity control, recording environment, and preprocessing procedures. For the infant identity control, all studies reported the number of infants in the study, except for four studies.2,3,23,34 Additionally, most studies did not specify how many cry samples were obtained from each infant; only three studies4,23,35 reported this information. This indicates limited control over infant identity and reduced transparency, as the use of multiple samples from the same infant may lead the ML model to recognize the baby's individual vocal characteristics rather than the true acoustic features of different cry types. Likewise, only one study 4 reported the number of exemplars derived from each recording, which further limits transparency. Without control, it is possible that multiple cry samples originated from the same recording session or environment, causing the classifier to rely on background or environmental noise instead of the infant's cry itself. 48 To reduce this limitation, almost all studies, except one, 34 implemented some form of preprocessing to improve the quality of cry recordings and minimize background interference, such as manual removal of noncry sounds, band-pass filtering, or noise suppression algorithms. Overall, most datasets demonstrated attention to noise reduction and preprocessing, which helped ensure that classifiers primarily relied on acoustic features of the infant cry rather than environmental or background noise (Table 2).
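To illustrate the kind of preprocessing mentioned above, the sketch below applies a band-pass filter to a synthetic signal. The 300–3000 Hz band, the sampling rate, and the filter order are illustrative assumptions, not values taken from the reviewed studies.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, sr, low_hz=300.0, high_hz=3000.0, order=4):
    """Suppress energy outside the band where cry content is expected.

    The 300-3000 Hz band is an illustrative choice, not a value
    reported by the reviewed studies.
    """
    b, a = butter(order, [low_hz, high_hz], btype="band", fs=sr)
    return filtfilt(b, a, signal)  # zero-phase filtering

# Synthetic example: a 500 Hz "cry" tone contaminated by 50 Hz mains hum
sr = 16000
t = np.arange(sr) / sr  # one second of samples
noisy = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 50 * t)
clean = bandpass(noisy, sr)  # hum attenuated, tone passes through
```

In practice, the reviewed studies combined such filtering with manual removal of noncry segments or dedicated noise suppression algorithms.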
Characteristics of self-recorded datasets: infant identity control, recording environment, and preprocessing methods.
CNN: convolutional neural network; HDR: high dynamic range; RDS: respiratory distress syndrome; MFCC: Mel-frequency cepstral coefficient.
The characteristics of the included studies.
AdaBoost: adaptive boosting; AMSI: acoustic multistage interpreter; BCRNet model: bidirectional convolutional recurrent network; Classifier: the extracted features to categorize data into classes; CNN: convolutional neural network; Feature extraction: extracting key acoustic characteristics from audio signals; F-score results: a metric that evaluates a classification model's accuracy; GMM: Gaussian mixture model; LFCC: linear frequency cepstral coefficients; LMS: least mean squares; LPCC: linear predictive cepstral coefficient; MFCC: Mel-frequency cepstral coefficients; MLP: multilayer perceptron; ResNet: residual network; SE-ResNet-transformer: squeeze-and-excitation residual network with transformer module; SVM: support vector machine; VGG: visual geometry group; WOA-VMD: whale optimization algorithm-variational mode decomposition; ZCR: zero-crossing rate.
Infant cries type classification
Across the 17 included studies, infant cries were analyzed using ML and categorized into two main types: infants’ need cries and pathological cries. Specific need (nonpathological) cries include nine types: hunger, sleepiness, pain or distress, wet diaper, discomfort, fussiness, the need to burp, a desire for holding or touch, and feeling cold. Pathological cries include six types: sepsis, RDS, asphyxia, deafness, asthma, and HIE. Tables 3 and 4 provide an overview of these cries, displaying the cry types, the ML classifiers used, and their performance rates. The two main categories are compared based on the classifiers utilized and their performance rates.
Infant cries type classification and machine learning classifier.
AdaBoost: adaptive boosting; BCRNet model: bidirectional convolutional recurrent network; Classifier: the extracted features to categorize data into classes; CQCC: constant-Q cepstral coefficients; Feature extraction: extracting key acoustic characteristics from audio signals; GFCC: gammatone frequency cepstral coefficient; HIE: hypoxic-ischemic encephalopathy; HR: harmonic ratio; LFCC: linear frequency cepstral coefficient; LMS: least mean squares; LPCC: linear predictive cepstral coefficient; LSTM: long short-term memory; ResNet: residual network; RDS: respiratory distress syndrome; SE-ResNet-transformer: squeeze-and-excitation residual network with transformer module; VGG: visual geometry group; ZCR: zero-crossing rate.
Machine learning classifier types and performance rate.
AdaBoost: adaptive boosting; AMSI: acoustic multistage interpreter; ANN: artificial neural network; BCRNet model: bidirectional convolutional recurrent network; CNN: convolutional neural network; DNN: deep neural network; GMM: Gaussian mixture model; GSFM: genetic selection of a fuzzy model; LSTM: long short-term memory; MLP: multilayer perceptron; RF: random forest; SE-ResNet-transformer: squeeze-and-excitation residual network with transformer module; Supervised learning: the model is trained on labeled data (inputs with known outputs); SVM: support vector machine; Unsupervised: the model is trained on unlabeled data.
Infant-specific need cry type or nonpathological cry
The studies on infant-specific cry types identified nine different types of crying that were classified using various classifiers, with performance varying considerably by classifier. Twelve studies2,3,11,12,15,16,24,35,37–40 focused on the hungry cry, the most commonly classified type, with accuracy ranging widely from 68.8% to 99.82%; the highest accuracy (99.82%) was achieved using a Gaussian mixture model (GMM). Five studies2,3,11,35,37 utilized ML to classify the sleepy cry, with accuracy ranging from 66.8% to 95.69%; the best performance (95.69%) was achieved using a support vector machine (SVM). In 11 studies2,3,12,15,16,24,34,35,37–39 that addressed the pain cry, performance varied from 46.0% to 99.82%; the lowest rate was obtained using an artificial neural network (ANN) and the highest using a GMM. For the wet diaper cry, three studies2,16,34 reported performance rates ranging from 53.0% to 90.15%. Interestingly, both the highest and lowest rates were obtained using long short-term memory (LSTM) networks with different features: the highest employed VGG16 features, while the lowest used Mel-frequency cepstral coefficients (MFCCs). Five studies3,11,12,37,40 reported high accuracy in discomfort cry classification, ranging from 86.1% to 97.34%, with an SVM achieving the highest rate. Additionally, two studies35,37 addressed the burp cry, with accuracy from 68.8% to 92.1%; convolutional neural networks (CNNs) were the most effective classifier. The remaining three specific need cries of fussiness, a desire for holding, and feeling cold were each reported in a single study, with best performance rates of 92.0%, 35 59.0%, 16 and 90.15%, 34 respectively.
From the performance rates of each infant-specific cry type mentioned above, it is evident that different classifiers achieved the highest accuracy for different cry types. The best classifiers were as follows: a GMM for hungry and pain cries, an SVM for sleepy and discomfort cries, an LSTM for wet diaper cries, a CNN for burp cries, and an acoustic multistage interpreter (AMSI) for fussiness. The cries associated with a desire for holding or touch and feeling cold were each reported in only one study, so classifier accuracy could not be compared. Nevertheless, cries indicating feeling cold achieved a high accuracy rate with an LSTM classifier, whereas cries expressing a need for holding or touch showed lower accuracy with a CNN. Overall, the best accuracy for classifying nonpathological cries was observed for hunger and pain cries, both reaching 99.82% with a GMM classifier.
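As a rough illustration of the GMM approach that achieved the top accuracies above, the sketch below fits one Gaussian mixture per cry type and labels a new cry by the highest average log-likelihood. The features are synthetic stand-ins for cepstral coefficients, and the two-class setup is our simplification, not the design of any reviewed study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-frame cepstral features; real systems
# would extract these (e.g. MFCCs or CQCCs) from cry recordings.
hunger_feats = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
pain_feats = rng.normal(loc=3.0, scale=1.0, size=(200, 13))

# One GMM per cry type, fit on that type's training features
models = {
    "hunger": GaussianMixture(n_components=2, random_state=0).fit(hunger_feats),
    "pain": GaussianMixture(n_components=2, random_state=0).fit(pain_feats),
}

def classify(frames):
    """Label a cry by the class whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda c: models[c].score(frames))

new_cry = rng.normal(loc=3.0, scale=1.0, size=(50, 13))  # pain-like features
label = classify(new_cry)
```

For these deliberately well-separated synthetic features, `classify(new_cry)` returns `"pain"`; real cry features overlap far more, which is why reported accuracies vary so widely.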
Pathological cry type
In this review, seven studies focused on pathological cry types, which were classified into six distinct types using various classifiers. Three studies4,5,16 focused on the sepsis cry, with performance rates ranging from 71.03% to 97.50%; a deep neural network (DNN) achieved the highest rate. Three studies4,23,36 addressed the RDS cry and reported that classification accuracy varied substantially, from 44.50% to 97.50%; the highest accuracy was achieved using a DNN and the lowest with an SVM. Three studies24,38,39 reported findings for both asphyxia and deafness cries, revealing high accuracy rates: 90.68% to 99.82% for asphyxia and 99.42% to 99.82% for deafness, with a GMM achieving the highest accuracy for both. Other pathological cries, asthma and HIE, were reported in one study, 38 with performance rates ranging from 91.19% to 99.82%, where a GMM achieved the highest accuracy.
From the performance rates of pathological cries mentioned above, two classifiers stood out for their high classification accuracy: a DNN and a GMM. The DNN was particularly effective in detecting sepsis and RDS cries, whereas the GMM excelled in identifying cries linked to asphyxia, deafness, asthma, and HIE. Notably, the highest accuracy for classifying pathological cries was observed for deafness, ranging from 99.42% to 99.82% with a GMM classifier.
ML classifier
Performance rate across classifier types
The most commonly used type of classifier for analyzing infant cries is supervised learning, with the SVM being the most frequently used classifier. SVMs were employed in 10 of the 17 studies and showed a wide accuracy range, from 44.50% to 97.34%, indicating high sensitivity to feature quality and combination. Among these, eight studies3,5,11,23,37–40 achieved high performance rates, ranging from 85.00% to 97.34%. Five3,11,23,38,39 of these eight achieved high performance when multiple acoustic features were combined. For example, Chang et al. 11 achieved 95.69% accuracy using a combination of peak, pitch, MFCCs, ΔMFCCs, and linear predictive cepstral coefficients, while Matikolaie and Tadj 5 obtained 86% accuracy using spectrum features alone. However, when limited to single or less informative features, SVM performance decreased substantially. Two studies4,36 employed an SVM and reported medium and low performance rates: the study by Khalilzad et al. 4 relied solely on harmonic ratio features, while that by Matikolaie and Tadj 36 focused on rhythm features, achieving accuracies of 71.03% and 44.50%, respectively. These results demonstrate that SVMs can perform very well when supported by diverse, discriminative features but lack robustness when relying on narrow or low-quality inputs.
Seven other studies2,12,15,16,24,34,35 did not use an SVM but instead employed alternatives such as CNNs, GMMs, LSTMs, regression, random forest, ANNs, and hybrid classifiers. Among these, the CNN was the most commonly used for classifying infant cries. Four studies2,16,37,39 applied CNNs, but performance varied, ranging from below 70% to 92.77%. Two studies37,39 achieved high accuracy, between 92.10% and 92.77%, while the other two reported medium accuracy (70–90%) 2 and low accuracy (less than 70%). 16 The difference in performance can be attributed to the fact that the two high-accuracy studies37,39 used high-quality datasets with large numbers of infant cries, labeled by expert staff, nurses, or pediatricians experienced in identifying the meaning of infant cries. Additionally, one study 39 integrated an advanced architecture, a bidirectional convolutional recurrent network (BCRNet) model, to enhance recognition. In contrast, CNN models trained on a single feature or smaller datasets yielded moderate (70–90%) 2 or low (<70%) 16 accuracies. This pattern highlights the strong dependence of CNNs on both dataset quality and architectural complexity.
Among all the classifiers, the GMM demonstrated the highest performance rate in recognizing infant cries at 99.82%. 38 When comparing under the same feature conditions (MFCC), GMM clearly outperformed SVM (98.55% vs. 70.60%), demonstrating its superior capacity to model the probabilistic distribution of acoustic characteristics in infant cries. Similarly, hybrid or ensemble-based classifiers also achieved excellent results; for instance, genetic selection of a fuzzy model (GSFM; 99.42%) 24 and squeeze-and-excitation residual network with transformer module (SE-ResNet-transformer; 93%), 12 suggesting that combining statistical and deep-learning methods can yield near-optimal recognition accuracy.
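The GMM approach compared above models each cry class with its own distribution of acoustic features and assigns a new cry to the class under which it is most likely. As an illustrative sketch only (not any reviewed study's actual pipeline), the code below collapses each class to a single one-dimensional Gaussian over a toy summary feature; the reviewed systems instead fit multi-component GMMs over full MFCC vectors:

```python
import math
from collections import defaultdict

def fit_gaussians(samples):
    """Fit one Gaussian (mean, variance) per class from (feature, label) pairs.
    A single-component simplification of the per-class GMMs in the reviewed
    studies, which model MFCC distributions with many mixture components."""
    grouped = defaultdict(list)
    for x, label in samples:
        grouped[label].append(x)
    models = {}
    for label, xs in grouped.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) or 1e-6
        models[label] = (mean, var)
    return models

def log_likelihood(x, mean, var):
    # Log-density of a univariate Gaussian at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Assign the class whose Gaussian gives the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))

# Toy 1-D "feature" values standing in for summarized MFCCs (invented data).
train = [(1.0, "hunger"), (1.2, "hunger"), (0.9, "hunger"),
         (3.0, "pain"), (3.1, "pain"), (2.8, "pain")]
models = fit_gaussians(train)
print(classify(1.1, models))   # hunger
print(classify(3.05, models))  # pain
```

The decisive property this sketch shares with a full GMM classifier is the likelihood-based decision rule: classification compares how well each class's fitted distribution explains the input, which is what let GMMs outperform the SVM under identical MFCC features.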
Other classifiers, though less frequently applied, also showed promising results. Random forest15,23,35,40 and regression models35,40 demonstrated moderate-to-high accuracy (68.8–94.32%), with the best performance observed when integrated into ensemble frameworks. An ANN reported 46% accuracy, 16 while DNNs ranged from 92.42% 39 to 97.50%, 23 showing improvement as model depth increased. LSTM networks, which are designed to capture temporal dynamics, achieved up to 90.15% accuracy 34 but performed modestly (47%) 16 in studies with imbalanced datasets. These findings suggest that temporal models such as LSTMs can be powerful but require extensive data to realize their potential.
Overall, the comparative evidence across studies indicates that the GMM consistently delivers the highest classification accuracy, followed closely by hybrid deep-learning architectures (GSFM, SE-ResNet-transformer). The SVM remains the most commonly used classifier and can achieve competitive accuracy when multiple features are combined; for example, fusing MFCC, tilt, and rhythm features yields better performance than using any single feature. Random forest and DNN approaches offer reliable mid-to-high performance, while simpler models such as basic ANNs or regression show moderate accuracy. In summary, integrating multifeature fusion with advanced or hybrid classifiers yields the most robust and accurate recognition of infant cries, whereas models using limited features tend to underperform.
Performance rate across datasets
In this review, the datasets came from two main sources: infant cry databases and self-recorded data. In the databases, infant cry sounds were recorded in hospitals or homes and labeled with the meaning of the cry by doctors, nurses, or experts in infant vocalizations. In contrast, self-recorded cries were primarily recorded in hospitals, but many lacked clear labeling during the annotation process. To ensure data quality across sources, almost all datasets underwent standardized preprocessing, which involved removing noise and silence, segmenting the audio, and standardizing amplitude levels to minimize variability across datasets. Of the 17 studies examined, 11 used self-recorded infant cry sounds, five12,24,37–39 relied on cry databases, and one 40 did not specify the data source. To evaluate accuracy across datasets, the performance rates from both types were compared. The 11 self-recorded studies showed performance rates ranging from 44.50% to 97.50%, whereas the cry databases exhibited higher performance rates, between 86.1% and 99.82%. In a comparison using the same classifier and features (an SVM with MFCC features), the self-recorded dataset in the study by Matikolaie and Tadj 5 achieved a performance rate of 86.00%, while the cry database used by Patil et al. 38 achieved a higher rate of 88.11%. Therefore, infant cries from cry databases tend to achieve slightly higher accuracy than self-recorded datasets. This is because cry-database recordings undergo verification, validation, and preprocessing by experts (pediatricians and nurses) experienced in interpreting infant cries, labeling each cry by cause, and applying appropriate actions to stop it. This process enhances data quality and, in turn, the accuracy of interpreting the meaning behind infant cries.
Comparison of accuracy rates with the chance level across studies
When comparing accuracy rates across studies, the chance level serves as a crucial reference point for interpreting algorithmic performance. The chance level, determined by the proportion of the majority class in each dataset, reflects the baseline accuracy that a naive classifier could achieve by always predicting the most frequent category. Table 5 presents the chance-level accuracy, the reported accuracy in each study, and the improvement above chance. Across the included studies, chance levels ranged from 21.64% to 61.71% for need-based cry classification and from 25.42% to 56.43% for pathological cry classification. Ten studies2,3,11,12,24,34,35,37–39 demonstrated improvements above chance of more than 50 percentage points, indicating that their reported accuracies were meaningful beyond random or naive prediction. The highest improvements were observed in the study by Patil et al., 38 which used more balanced datasets, resulting in improvements ranging from 65.77% to 74.4% and providing more robust evidence of discriminative ability.
The chance level accuracy, reported accuracy, and the improvement above chance.
RDS: respiratory distress syndrome; Chance level: The estimated accuracy based on the largest class proportion in each dataset.
In contrast, five studies4,16,23,36,40 with high class imbalance (chance level greater than 50%) achieved high raw accuracies but relatively small improvements above chance, which may overestimate the models' actual discriminative ability. One study 5 reported performance near or below the chance level for some subsets, indicating limited practical utility. These findings emphasize that direct comparison of raw accuracy across studies is of limited value. Future work should consistently report classification performance relative to the chance level to allow meaningful cross-study comparisons.
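The chance-level baseline and the improvement above chance used in this comparison follow directly from the class proportions of each dataset. A minimal sketch, using an invented class distribution purely for illustration:

```python
def chance_level(class_counts):
    """Baseline accuracy of always predicting the majority class,
    i.e. the chance level as defined in this review."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

def improvement_above_chance(reported_accuracy_pct, class_counts):
    """Reported accuracy minus the chance-level baseline, in percentage points.
    Small values despite high raw accuracy signal an imbalanced dataset."""
    return reported_accuracy_pct - 100 * chance_level(class_counts)

# Hypothetical dataset: 300 hunger, 150 pain, 50 discomfort samples.
counts = {"hunger": 300, "pain": 150, "discomfort": 50}
print(round(100 * chance_level(counts), 2))              # 60.0
print(round(improvement_above_chance(95.0, counts), 2))  # 35.0
```

In this invented example a reported 95% accuracy corresponds to only a 35-point improvement over the 60% majority-class baseline, illustrating why raw accuracy alone can overstate discriminative ability on imbalanced data.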
Quality assessment results
The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) 30 was used to evaluate the quality of the included studies across two parts: risk of bias and applicability. Overall, 15 studies were considered low risk, while two studies34,40 were classified as high risk. In the risk of bias assessment, within the patient selection domain, 15 studies did not specify the process for random selection of patients, resulting in an unclear risk. Furthermore, two studies34,40 did not provide information about patient characteristics, which was considered a high risk that threatened internal validity. All 17 studies were considered low risk in the index test, reference standard, and flow and timing domains. For the applicability assessment, 15 studies demonstrated low risk in the selection domain, while two studies34,40 had an unclear risk, which reduces confidence in generalizing the results to the population and poses a threat to external validity. In the index test and reference standard domains, all 17 studies had low risk, as they employed a strong process for measure validation. Therefore, the results can be regarded with high confidence for generalization and application in classifying infant cry types. A quality assessment of each study is provided in Table S2 of the Supplemental materials (see Figures 4 and 5).

Quality assessment results of risk of bias by domain in QUADAS-2. QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies-2.

Quality assessment results of applicability by domain in QUADAS-2. QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies-2.
Discussion
This systematic review synthesizes findings from 17 studies on the application of ML to classify infant cry types, as well as the accuracy of this classification. The results revealed that ML can differentiate between two main types of cries: infants' needs cries and pathological cries. Various classifiers were employed for these cry types, with each suited to specific cry types and affecting performance rates differently. For specific-need cries, the highest classification accuracy achieved was 99.82%, 38 with hunger and pain cries being the most accurately classified types using GMMs. For pathological cries, GMMs also achieved the highest accuracy, with cries related to deafness being the most accurately classified, ranging from 99.42% to 99.82%. 38 These results clearly show that GMMs can effectively classify both nonpathological and pathological cries with high accuracy. This finding is consistent with the research conducted by Jebarani et al., 49 who used various ML classifiers, including GMMs, SVMs, and K-means, to detect breast cancer by categorizing magnetic resonance images as either cancerous or noncancerous; their study indicated that GMMs achieved the best classification accuracy, reaching 93.80%. Additionally, Khlaifi et al. 50 applied GMMs to classify swallowing sounds and other sounds during food intake in adult patients recovering from stroke, achieving an accuracy ranging from 85.57% to 95.94% in distinguishing swallowing sounds. Therefore, GMMs are considered the best classifier for classifying specific conditions in both sound and image data.
Nevertheless, the SVM was the most commonly used classifier for both specific-need and pathological cries; however, its performance varies depending on the features used, and multiple features are required to enhance it. When GMMs and SVMs were compared using the same features, GMMs consistently outperformed SVMs. These results align with the findings of Sen et al., 51 who compared the classification performance of SVMs and GMMs in distinguishing between healthy and pathological pulmonary conditions based on pulmonary sounds; their study revealed that GMMs achieved higher accuracy than SVMs in pulmonary sound classification. Therefore, the GMM appears to be the best classifier for classifying specific conditions.
Furthermore, dataset quality can affect the accuracy of infant cry classification. Infant cry databases yielded higher classification accuracy than self-recorded datasets. These results are consistent with the findings of Ji et al., 52 who reviewed infant cry analysis and observed that cry databases demonstrated higher accuracy than self-recorded datasets: infant cry databases achieved accuracy rates ranging from 71.68% to 97.96%, whereas self-recorded datasets ranged from 62.1% to 91.58%. This difference is attributed to the fact that infant cry databases undergo a sound validation process and are annotated by experts to accurately label the meanings of the cries. The experts mentioned in the included studies were pediatricians and nurses experienced in identifying the causes of infant crying and applying appropriate actions to stop it.
The datasets used in this review comprise four main databases: Donate a Cry Corpus, 12 Baby Chillanto Infant Cry,12,24,38,39 Dunstan Baby, 37 and In-House DA-IICT Infant Cry. 38 The Donate a Cry Corpus includes five cry types: belly pain, burping, discomfort, hunger, and tiredness. The Baby Chillanto dataset contains five types: deaf, asphyxia, normal, hunger, and pain. The Dunstan Baby dataset includes hunger, need for a burp, need to poop, discomfort, and sleepy. The In-House DA-IICT dataset consists of three types: normal, asthma, and HIE. While these datasets provide valuable resources for infant cry analysis, they also present potential biases. Several datasets are imbalanced, with some cry types (e.g. hunger or discomfort) overrepresented, while others (e.g. asthma or HIE) are limited in size. This imbalance may reduce generalizability, with models favoring frequent cry types. Future work should address this through data augmentation, balanced sampling, infant identity controls, or broader collection of representative cry samples.
There are some inconsistencies regarding the age of infants that should be considered when training and analyzing ML models. Some studies4,23,36 have indicated that newborns aged 1 to 53 days begin to gain control over their vocalizations during this period. Before this age, vocalizations are primarily regulated by independent biological rhythms and may indicate the newborn's health. Recent evidence from Lockhart-Bouron et al. 13 suggests that human infants gradually transition from producing mainly noisy and shrill cries to more tonal and melodious cries from birth up to nearly 4 months of age, with no significant differences observed between sexes. This underscores the importance of exploring the appropriate age of infants to be used in ML to enhance accuracy rates and facilitate broader development in the future. Additionally, future research on infant cry analysis should aim to develop a standard dataset that can be utilized globally as a gold standard.
The application of ML to identify and classify the meanings of infant crying is essential for enhancing real-world parenting support. Numerous initiatives are currently underway to integrate ML systems into mobile applications, facilitating their use in everyday situations. Recent research by Mekhfioui et al. 53 presented ML experiments that employed three models, CNNs, Wav2Vec 2.0, and DistilHuBERT, to classify infant crying sounds into distinct categories, including hunger, discomfort, tiredness, and belly pain. The findings demonstrated that while all models exhibited commendable performance, DistilHuBERT achieved the highest accuracy (86.95%) compared with CNNs (83.93%) and Wav2Vec 2.0 (81.52%), along with the best overall metrics. Owing to its advantageous balance of high accuracy and low computational cost, DistilHuBERT was identified as the most effective model for real-time implementation on embedded systems and mobile applications. Future research should explore ways to further tailor these techniques to real-world conditions and apply them in mobile applications.
Conclusion
ML has demonstrated significant potential in classifying infant cries, effectively differentiating between types of cries related to the infant's needs and pathological cries. The need-based cries have nine subtypes, while pathological cries are categorized into six subtypes. Various classifiers have been employed to recognize the patterns in these cries. The accuracy of recognizing infant cries varies depending on the ML classifier and the features used for analysis. According to the studies reviewed, the accuracy rates range from 44.5% to 99.82%. Among the classifiers tested, the GMM achieved the highest performance rates, reaching 99.82% accuracy for hunger and pain cries, and between 99.42% and 99.82% for deafness-related cries. These advancements indicate that ML shows strong potential for accurately classifying infant cries and detecting pathologies. This capability is crucial for healthcare and everyday life, as it supports the early detection of health issues and improves infant care. Future research should focus on developing diverse cry datasets to improve model generalizability, evaluating performance in real-world settings, and integrating cry analysis with physiological signals to enhance diagnostic accuracy.
Limitations
This systematic review excluded gray literature, such as conference proceedings, dissertations, and editorials, and was limited to studies published in English. This also omitted articles published in other languages or indexed in other databases, which may have introduced publication bias. Among the included studies, unclear patient selection processes in the datasets led to an unclear risk of bias in the selection domain. One study 40 did not report the sources or characteristics of the infant cries, limiting transparency. Additionally, the studies reviewed did not consider infant identity, including factors such as age, sex, height, and weight, nor did they account for how many infants were recorded or how many sounds came from a single infant. It is crucial to incorporate this information into classification models to avoid issues such as overfitting, pseudoreplication, and the false inflation of recognition accuracy. Regarding the comparison of accuracy rates with the chance level, the findings highlight that direct comparison of raw accuracy across studies can be misleading and reinforce the importance of reporting performance relative to the chance level to enable meaningful cross-study comparisons. The heterogeneity of datasets, including differences in cry categories, class balance, and recording environments, further limits the ability to directly compare algorithmic performance. Future research should consistently report chance-level baselines, include balanced datasets, and incorporate complementary metrics such as balanced accuracy, Cohen's kappa, or F1-score to better reflect actual model performance.
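The complementary metrics recommended above (balanced accuracy, Cohen's kappa, and F1-score) can all be derived from a confusion matrix. A minimal binary-class sketch with invented counts, chosen to show how an imbalanced dataset separates raw accuracy from the balanced metrics:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute balanced accuracy, F1-score, and Cohen's kappa from a
    binary confusion matrix (illustrative sketch, invented counts below)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return balanced_accuracy, f1, kappa

# Imbalanced example: 90 TP, 5 FP, 10 FN, 15 TN (100 positives, 20 negatives).
ba, f1, kappa = metrics_from_confusion(90, 5, 10, 15)
print(round(ba, 3), round(f1, 3), round(kappa, 3))  # 0.825 0.923 0.591
```

Here raw accuracy is 87.5%, yet balanced accuracy (0.825) and especially kappa (0.591) reveal that part of that figure comes from the dominant class, which is exactly why this review recommends reporting such metrics alongside accuracy.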
Implication
This systematic review can serve as a foundation for future research, which should include subgroup analyses or meta-analyses to compare the performance of different ML classifiers across various types of infant cries. Additionally, it should evaluate the accuracy and reliability of infant cry datasets from online databases to identify the most accurate and standardized dataset. Establishing a globally recognized standard dataset for infant cries could enhance the consistency and reliability of ML applications in this field.
Regarding cross-cultural application, there is a concern about applying infant cry analysis across different cultures. The study by Cornec et al. 54 examined the perception of infant crying across communities in the Democratic Republic of Congo, comparing it to analogous data from French and British samples to assess perception by adults across diverse cultures. The findings revealed no significant differences between Congolese and European populations in the acoustic structure of babies' cries. This finding is consistent with the study by Gustafson et al., 55 which found no difference in crying between infants born in a Mandarin Chinese-language environment and those born in an American English-language environment. Therefore, based on the evidence, it is reasonable to believe that ML can be applied to analyze infant crying across all cultures.
In clinical practice, ML can be applied to the early detection of infant conditions and the monitoring of infant health through cry analysis, enabling doctors and medical staff to promptly identify potential diseases. For example, ML systems can assist healthcare professionals in recognizing abnormal crying linked to specific medical conditions, such as asphyxia, asthma, or respiratory distress. Additionally, integrating ML-based cry analysis into bedside monitoring or mobile diagnostic devices would enable healthcare providers to receive real-time alerts for atypical cry patterns, complementing traditional clinical assessments. Beyond clinical facilities, ML-based cry analysis can also support parental education and telehealth applications, particularly for first-time parents or those in remote areas. When embedded in mobile health devices, ML algorithms can provide parents with immediate feedback, promoting appropriate caregiving responses.
However, several barriers may limit implementation in real-world settings. These include the high cost of developing and maintaining advanced ML systems, the need for well-labeled datasets, the need for validation across different clinical populations, and the need for specialized training for healthcare providers to correctly interpret and respond to system outputs. Additionally, disparities in access to technology across different healthcare settings can pose significant challenges. Therefore, effective clinical practice also necessitates training for healthcare providers and the development of technical systems for the application of infant cry analysis.
Supplemental Material
sj-docx-1-sci-10.1177_00368504251410776 - Supplemental material for The application of machine learning for infant cries classification and pathological cries detection: A systematic review
Supplemental material, sj-docx-1-sci-10.1177_00368504251410776 for The application of machine learning for infant cries classification and pathological cries detection: A systematic review by Sudhathai Sirithepmontree, Nattasit Katchamat and Sasitara Nuampa in Science Progress
Supplemental Material
sj-docx-2-sci-10.1177_00368504251410776 - Supplemental material for The application of machine learning for infant cries classification and pathological cries detection: A systematic review
Supplemental material, sj-docx-2-sci-10.1177_00368504251410776 for The application of machine learning for infant cries classification and pathological cries detection: A systematic review by Sudhathai Sirithepmontree, Nattasit Katchamat and Sasitara Nuampa in Science Progress
Acknowledgments
The authors sincerely thank Professor Dr. Hyekyun Rhee and Professor Dr. Lorraine Olszewski Walker for their invaluable guidance and support in elaborating this work.
Ethical considerations
This review was based on previously published literature available in public databases and did not involve any human participants or the collection of new data; therefore, ethical approval was not required.
Author contributions
SS conceptualized the study and designed the review protocol. SS and NK performed the literature search, screened articles, and extracted data. SN resolved any discrepancies. SS, NK, and SN contributed to data analysis and interpretation. SS drafted and revised the manuscript. All authors have read and approved the final version of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References