Abstract
Religious or spiritual struggles are clinically important to health care chaplains because they are related to poorer health outcomes, involving both mental and physical health problems. Identifying persons experiencing religious struggle poses a challenge for chaplains. One potentially underappreciated means of triaging chaplaincy effort is the prayers written in chapel notebooks. We show that religious struggle can be identified in these notebooks through instances of negative religious coping, such as feeling anger or abandonment toward God. We built a data set of entries in chapel notebooks and classified them as showing religious struggle, or not. We show that natural language processing techniques can be used to automatically classify the entries with respect to whether or not they reflect religious struggle with as much accuracy as humans. The work has potential applications to triaging chapel notebook entries for further attention from pastoral care staff.
Introduction
Spiritual and religious beliefs and practices often play a positive role in people’s lives. Although some religious beliefs and practices can confer health benefits, 1 this is not always the case, and some ways of using faith to cope have been associated with poorer health outcomes. Spiritual struggle describes the conflicts, questions, or tensions that arise around religious or spiritual issues. 2 Such struggles fall into 3 main categories: intrapersonal, interpersonal, and divine struggles. 3 Intrapersonal religious struggles are those in which the tension is internal, such as questioning of formerly helpful beliefs that are no longer adequate for the stressor or struggle due to guilt. Interpersonal religious struggles are those in which the tension exists between the individual and one or more other important people, such as a feeling of abandonment by members of one’s congregation. Divine struggles are those in which one’s relationship with the Divine is impaired, whether by anger, feelings of abandonment, or questions. These distinctions, although useful for selecting an intervention approach, are not mutually exclusive in actual experience, and a person may experience one or all forms of religious struggle simultaneously. Spiritual struggles are frequently operationalized using the negative spiritual coping scale of the Brief RCOPE. 4
Religious or spiritual struggles are clinically important to health care chaplains because they are related to poorer health outcomes, involving both mental 5 and physical 6 health problems. For example, spiritual struggle has been associated with depression (4% variance), anxiety, and negative illness adjustment in 155 patients newly diagnosed with breast cancer. 7 In a similar longitudinal study involving Orthodox Jews, spiritual struggle preceded and was a potential cause of future depression. 8 Finally, a study of 577 patients aged 55 years or older at Duke University Medical Center demonstrated that negative religious coping (reappraisals of demonic forces and punishing, spiritual discontent, and negative attitudes toward God and clergy) generated depressive symptoms and a poorer quality of life. 9
Struggles occur across religious traditions and have been described in samples composed of Jews, 10 Muslims, 11 and Hindus. 12 Spiritual struggle has been referred to as a catalyst that can either lead to further growth or become chronic. 13 Chaplaincy care is a potential catalyst to promote resolution of religious struggles in ways that lead to posttraumatic growth, as this growth is often attributed to religious coping. For example, surveys conducted with trauma survivors found that spiritual or religious coping played an important role in long-term recovery. 14 In addition, spirituality was shown to have a positive relationship with posttraumatic growth in bereaved human immunodeficiency virus (HIV)/AIDS caregivers. 15
Religious struggle can be identified through instances of negative religious coping, such as feeling anger or abandonment toward God. Measures of the prevalence of religious struggle vary considerably between studies, depending on the population sample and the way religious struggle was operationalized. A study of more than 5000 college students demonstrated that 25% reported religious struggles. 16 In a study of 48 adolescents with sickle cell disease (SCD) and some of their parents (N = 42), the adolescents more frequently used negative religious coping than did the parents, with reported rates of spiritual struggle ranging from 20% to 36% among adolescents with SCD and 4% to 12% among parents of adolescents with SCD. 17 More than half of women with breast cancer (53%) were found to have religious struggle in one hospital in the United Kingdom, which is particularly notable given the more secular environment in Europe compared with the United States. 18 Grossoehme and colleagues 19 found that 31% of parents of children (N = 22) with cystic fibrosis in one hospital setting had evidence of religious struggle. With religious struggle very relevant among patients with serious life-altering diseases, negative religious coping can affect everyday life, from treatment adherence to general mental health, creating the need for an effective way to identify this problem before it affects the patient in a long-term fashion.
Despite the prevalence of religious struggle and the emotional disease that is associated with it, identifying persons experiencing religious struggle poses a challenge for chaplains. In one study, only 31% of those with high spiritual needs and low spiritual resources requested any form of spiritual care. 20 Various methods have been proposed to identify persons who could potentially benefit from chaplaincy care. These include questionnaires, such as the Brief RCOPE 4 and others, as well as computer-delivered assessment modules, many of which have been discussed elsewhere. 6 However, each of these current methods has drawbacks. The Brief RCOPE is not practical in the clinical setting because the questionnaire is lengthy and places a high burden on participants. The assessment modules, for their part, may lack construct validity and may measure depression rather than spiritual distress. In addition to these methods, chaplaincy has historically been an itinerant, reactive service that is driven by a combination of self-referrals and referrals from others. 21–24 Thus, a feasible, efficient, and reliable means of identifying persons with potential religious struggle is important and would represent a significant shift in clinical practice for many health care chaplains.
One potentially underappreciated way of triaging chaplaincy effort is through prayers written in chapel notebooks. The role that written prayers play in coping in hospitals and other health care institutions has been described in several contexts, in both the United States and the United Kingdom. 25–27 Prayer books in chapels have been compared with psalms of lament, 25 and it has been suggested that at least some people write prayers as a means of seeking not only divine aid but also the support of other people. 25,26 A framework for conceptualizing such prayers has been used by ap Siôn and Nash, 27 framing them in terms of their reference, intention, and objective and how those constructs relate to health and communication. These prayers may also help inform chaplains about who is using the prayer books as a resource to cope and who may be expressing religious struggle. As a result, the written prayers in chapels provide a different religious perspective for the patient and the parents in their relationships with God.
An analogous use of written texts as a means of analysis is the classification of genuine and elicited suicide notes by mental health experts using natural language processing (NLP). 28,29 Natural language processing has been used to create consistent, reliable means of predicting future needs or behaviors based on written texts. Natural language processing is a field that stands at the crossroads of computer science and linguistics, where machine learning algorithms are used to analyze corpora of texts across different disciplines. Machine learning is a technique in which a computer learns from a data set to classify future data, and these techniques have been used for NLP in a variety of tasks involving the classification of free text, including identifying differences in retrospective suicide notes, newsgroups, and social media. 30-32 Although there are many approaches to this type of classification, this study seeks to identify the optimal approach to classify the severity of religious struggle. Identifying and automating this approach could provide decision support to chaplains for triaging. We follow the innovation process of design, prototype, pilot, and implement described by Provost and Hoppenjans. 33 To test the design and prototype phases, the hypothesis of this study was that machine learning algorithms using NLP can categorize written prayer texts of patients and their parents, found in a pediatric academic medical center’s chapel notebooks, as expressing religious struggle or no religious struggle.
Methods
Data
This study was approved by the Cincinnati Children’s Hospital Medical Center Institutional Review Board (2014-3384). Prayers were collected from chapels at 2 pediatric medical centers. Cincinnati Children’s Hospital Medical Center (site 1) was a 575-bed academic pediatric medical center in the US Midwest. There were 3 chapels at this site, each with its own prayer notebook. Prayers were collected weekly from those notebooks. At the top of each page read the following statement: “You are welcome to write a prayer in this book. Your prayers may be read by others. For the confidentiality of our children and families, please use initials rather than names when referring to a patient.” Birmingham Children’s Hospital (site 2) was a 364-bed pediatric hospital in the West Midlands of England in the United Kingdom. There were 3 places of worship and reflection on this site (Christian chapel, Muslim prayer room, and multifaith quiet room), and all were open continuously. The prayers used in this study were taken solely from the Christian chapel. Although the chapel is clearly a Christian place of worship and prayer, patients and families of all faiths and beliefs may have used the chapel for quiet prayer and reflection. This prayer book was perpetually open and replaced when it was full. It is explained at the beginning of each book that “These prayers will be prayed for at our weekly Holy Communion service and will also be reflected upon to inform and improve our support to patients, families and staff.”
Procedure
Following the process by Pestian et al, 28 an ontology for this study was developed by a quasi-Delphi consensus-building process. 31 A panel of 8 pediatric chaplains from site 1, all of whom were board-certified, generated a list of expressions of spiritual struggle derived from clinical experience and items from the negative religious subscale of the Brief RCOPE. 4 Items were continuously reevaluated for expressing spiritual struggle, and a consensus list was developed of expressions of spiritual struggle. The final list became the ontology for use by the annotators in this study (Table 1).
Table 1. Chaplain-generated ontology of expressions of religious struggle.
First, each prayer was transcribed and anonymized. The transcription process involved copying written prayers into a separate text (.txt) document as a virtual catalog with all misspellings and grammar errors kept verbatim. For anonymization, patient names were replaced with “[N]” (other protected health information [PHI] was not encountered during transcription). In addition, a new line in the prayer would be represented as “NEWLINE,” any writings that could not be understood were replaced with “[unintelligible],” and any drawings were replaced with a double-bracketed description of the drawing (eg, a picture of a cross would become “[[cross]]”).
The next step was to annotate a set of prayers. Seven annotators were recruited for the study. Three annotators were assigned to annotate all of the US prayers, and the other study staff annotated subsets of the prayers within the study time frame. The annotators consisted of a male board-certified chaplain, a male layperson, and 3 female board-certified chaplains from the United States, a male professional chaplain from England, and a female layperson from the United Kingdom doing graduate work on prayers. The male layperson and 2 female board-certified chaplains (all from the United States) annotated the US prayers. The annotation process called for each annotator to individually read each prayer in the block and determine, using the ontology, whether language indicative of spiritual struggle was present. If at least one annotator considered the prayer to contain religious struggle, then the prayer was tagged as “struggle.” Otherwise, the prayer was tagged as “not struggle.” This rule mirrors the clinical setting, in which a patient is considered in need of chaplain assistance if one chaplain detects religious struggle. Each prayer, paired with its “struggle” or “not struggle” label, became part of the training set to which machine learning techniques were applied.
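The labeling rule can be expressed in a few lines of Python. This is a minimal sketch for illustration only: the function name and the representation of annotations as a list of true/false flags are assumptions, not part of the study's tooling.

```python
def aggregate_label(annotations):
    """Combine per-annotator flags into one label for a prayer.

    annotations: list of booleans, one per annotator (hypothetical format);
    True means that annotator saw language indicating religious struggle.
    """
    if not annotations:
        raise ValueError("each prayer needs at least one annotation")
    # A single positive annotation is enough to tag the prayer as struggle.
    return "struggle" if any(annotations) else "not struggle"


labels = [aggregate_label(flags) for flags in ([False, True, False], [False, False])]
```

Because one positive flag decides the label, the rule favors sensitivity over specificity, which matches the clinical reasoning given for it.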
The goal of machine learning is to train a computational model from training data that can then generalize to unseen (test) data. Machine learning can be roughly divided into 3 types: supervised learning, when the training data are already labeled; semi-supervised learning, when only part of the training set is labeled; and unsupervised learning, when the challenge is to learn patterns in unlabeled data. Here, a supervised learning support vector machine (SVM) approach was used. 35
Machine learning classifier
The task of building a machine learning model to classify text can be separated into 5 parts: (1) transcribing the text, (2) preprocessing of the transcribed text, (3) extraction of features (unique word or phrase found in any of the prayers), (4) selection of those features that most differentiate the text between 2 separate classes, and (5) optimization of the classifier. Before the features were extracted, the prayers were preprocessed such that punctuation and line spaces (paragraphs) were identified and removed from the final set for the classifier. Individual words (unigrams), word pairs (bigrams), 3-word phrases (trigrams), and 4-word phrases (quadragrams) were then extracted from each prayer and used as the backbone for the classifier. In addition, the number of words in each prayer was also considered in analysis.
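As an illustration of the extraction step, the following Python sketch builds unigram through 4-word features and a word count for one transcribed prayer. It is a sketch under stated assumptions rather than the study's code: the function names, the lowercasing, the decision to keep bracketed placeholders such as “[N]” as tokens, and the __word_count__ key are hypothetical choices made for the example.

```python
import re
from collections import Counter


def tokenize(prayer):
    """Split a transcribed prayer into word tokens.

    Assumptions (not stated in the study): text is lowercased, the NEWLINE
    marker is dropped, and bracketed placeholders are kept as single tokens.
    """
    text = prayer.replace("NEWLINE", " ").lower()
    # Match [[drawing]] or [placeholder] first, otherwise runs of letters,
    # digits, and apostrophes; all other punctuation is discarded.
    return re.findall(r"\[\[?[^\]]+\]\]?|[a-z0-9']+", text)


def ngram_features(prayer, max_n=4):
    """Count every 1- to max_n-word phrase and record the prayer length."""
    tokens = tokenize(prayer)
    features = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features[" ".join(tokens[start:start + n])] += 1
    features["__word_count__"] = len(tokens)  # hypothetical feature name
    return features


example = ngram_features("Please help [N] NEWLINE Amen.")
```

For the example prayer, the four tokens yield 4 unigrams, 3 bigrams, 2 trigrams, and one 4-word phrase, plus the word count.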
Although data are frequently split into training, validation, and test sets, the sample size here was limited such that any performance measure from a “set aside” test would have unacceptable statistical uncertainty. The approach described by Kohavi and John 36 was therefore used, in which both the classification and feature selection were contained in the training set, allowing the validation set to act as a proxy for the test set. The prayers contained more than 5000 unique features. Most of those provided little or no discrimination between prayers with or without religious struggle. Most of the features were either common words, such as “the” and “a” which appeared with equal frequency in both types of prayers, or words or phrases that appeared only in 2 or fewer prayers. Therefore, to reduce “noise” in the classifier, only those features that provided the best discrimination (ie, those words that appeared with the most different frequencies in struggle and not-struggle prayers) were selected for input. This was important because the “noise,” or unnecessary words that did not distinguish whether prayers contained religious struggle or not, would cause the classifier to give a false result. Consequently, the frequency of type I errors and type II errors would increase. To reduce the “noise,” methods were fine-tuned in 2 different manners.
The first method was changing the type of feature selection test itself. One way the differences in the frequency distributions were quantified was with the Kolmogorov-Smirnov test (KS-test) P-value. 33 The KS-test was performed by first determining the largest difference in the cumulative distributions of 2 samples: struggle and not struggle. The KS-test P-value was then evaluated, which is the probability of obtaining a difference larger than the one observed if both samples came from the same distribution. The P-values from all of the features were then ranked. The other test that was used was analysis of variance (ANOVA). 38 The ANOVA test was performed in a similar manner to the KS-test. These tests were used because they appear frequently in studies of machine learning and feature selection. For example, the KS-test has been used as a common method for classification algorithms, 39,40 and ANOVA has been applied to email spam classification, 41 as well as other machine learning classifications.
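The KS-based ranking can be sketched in plain Python. The sketch below computes only the KS statistic (the largest gap between the 2 empirical cumulative distributions) and ranks features by it; because every feature is compared on the same 2 groups of prayers, a larger statistic corresponds to a smaller P-value, so the ordering should match a P-value ranking up to ties. The function names and the input layout (per-prayer counts for each feature in each class) are assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at each distinct value in the pooled data.
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def rank_features(counts_by_feature):
    """Order features from most to least discriminating.

    counts_by_feature maps a feature to a pair of lists (hypothetical layout):
    its per-prayer counts in struggle prayers and in not-struggle prayers.
    """
    return sorted(
        counts_by_feature,
        key=lambda f: ks_statistic(*counts_by_feature[f]),
        reverse=True,
    )


toy = {"the": ([1, 1, 2], [1, 2, 1]), "why": ([1, 1, 0], [0, 0, 0])}
ranking = rank_features(toy)
```

In the toy data, “the” has identical count distributions in both classes and ranks last, whereas “why” appears only in struggle prayers and ranks first.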
The other method to reduce noise was choosing the number of top features to include in the classification. The first way was by manually choosing the number of top features in powers of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024), called the wrapper method (Saeys, 2007). The other way was by using an information foraging algorithm to decide how many of the top features should be used to optimize the classifier. 42 Both methods were used in different combinations in the SVM to decide which one produced the best results.
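The search over feature counts in powers of 2 might look like the following Python sketch. It assumes a ranked feature list and a caller-supplied scoring function (for example, a cross-validated AROC); both names are hypothetical, and the information foraging alternative is not sketched here.

```python
def best_feature_count(ranked_features, score_fn, max_power=10):
    """Try the top 2, 4, 8, ... features and keep the best-scoring count.

    ranked_features: features ordered from most to least discriminating.
    score_fn: hypothetical callable that trains and evaluates a classifier
              on the given feature subset and returns a score (higher is better).
    """
    best_k, best_score = None, float("-inf")
    for power in range(1, max_power + 1):
        k = 2 ** power
        if k > len(ranked_features):
            break  # not enough features left to fill this step
        score = score_fn(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy scoring function that peaks at 16 features, for demonstration only.
toy_ranked = ["f%d" % i for i in range(100)]
toy_result = best_feature_count(toy_ranked, lambda feats: -abs(len(feats) - 16))
```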
An SVM, a type of machine learning classifier, was used to determine whether prayers contained religious struggle or not. SVMs are based on a computational learning theory called structural risk minimization, whose goal is to find a hypothesis with the lowest true error. 43 The SVM constructs a hyperplane in a high-dimensional space, which can be used for classification, regression, or other tasks. 44 The SVM is appropriate for this study’s data because its connection to computational learning enables it to be a universal learner. 45 Consequently, it tends to be fairly robust to overfitting. 46 The performance of the classifier was based on the area under the receiver operating characteristic curve, which was estimated using leave-one-out by subject cross-validation. 47,48
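The key property of this evaluation scheme, that the held-out prayer influences neither feature selection nor training, can be sketched as a generic loop. This is an illustrative outline under stated assumptions: the select_features and fit callables are hypothetical placeholders (the study used statistical feature ranking and an SVM), and each prayer is treated as its own subject because prayer writers were anonymous.

```python
def leave_one_out_scores(samples, labels, select_features, fit):
    """Return one held-out score per sample.

    select_features(train_samples, train_labels) -> chosen features
    fit(train_samples, train_labels, features) -> scoring function
    Both callables are placeholders supplied by the caller.
    """
    samples, labels = list(samples), list(labels)
    scores = []
    for i in range(len(samples)):
        # Everything except sample i forms the training fold.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Feature selection sees only the training fold, never sample i.
        features = select_features(train_x, train_y)
        model = fit(train_x, train_y, features)
        scores.append(model(samples[i]))
    return scores
```

The resulting held-out scores can then be passed to any ranking-based performance measure.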
The area under the receiver operating characteristic curve (AROC) 49 measures the probability that, given a randomly selected prayer with religious struggle and a second randomly selected prayer without religious struggle, the SVM will assign a higher probability of religious struggle to the prayer that actually expresses religious struggle. The performance of the classifier is optimized using the AROC for 2 reasons. First, the AROC provides a single statistic that summarizes the spectrum of possible sensitivities across the desired specificities for the classifier. Second, the AROC provides a measure of how accurately prayers are classified and thereby directly quantifies the increased efficiency of finding a struggling patient or family.
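The pairwise-ranking interpretation of the AROC can be computed directly by comparing every struggle prayer with every not-struggle prayer. The Python sketch below does this by brute force; the function name and the convention of counting ties as half a correct ranking are choices made for the example, not details taken from the study.

```python
def aroc(struggle_scores, not_struggle_scores):
    """Fraction of (struggle, not-struggle) pairs ranked correctly.

    A pair is ranked correctly when the struggle prayer receives the higher
    classifier score; a tie counts as half (a common convention).
    """
    correct = 0.0
    for s in struggle_scores:
        for n in not_struggle_scores:
            if s > n:
                correct += 1.0
            elif s == n:
                correct += 0.5
    return correct / (len(struggle_scores) * len(not_struggle_scores))


# Three of the four pairs below are ranked correctly.
toy_aroc = aroc([0.9, 0.4], [0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the 2 classes.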
Results
The main analysis was conducted on 243 American prayers, which were annotated only by American members of the study (2 board-certified chaplains and 1 layperson). Of these prayers, 42 contained signs of religious struggle, whereas 201 did not indicate religious struggle. Overall, interrater reliability was .41 using the Krippendorff alpha scale, which corresponds to moderate agreement. In addition, the annotators produced an overall agreement of 83.3% (SE, 5.8%). The most successful classifier had an AROC of 0.73 (±4%) and used only 12 features with information foraging. Different methods were used to tune the classifier, and the full results can be found in Table 2. The list of top features is detailed in Table 3. Figure 1 depicts the full receiver operating characteristic (ROC) curve.
Table 2. AROC for different classifiers in determining whether written prayers contain religious struggle.
Abbreviations: AROC, area under the receiver operating characteristic curve; ANOVA, analysis of variance; KS, Kolmogorov-Smirnov.
Techniques include wrapper method and information foraging (IF).
Table 3. Probabilities and P-values of top features occurring in a prayer of both classes.
P-values based on the Welch t-test comparing prayers with struggle (n = 42) and prayers without struggle (n = 201).

Figure 1. Receiver operating characteristic curve for the classifier’s discrimination of religious struggle within American prayers. The gray line is the receiver operating characteristic curve for a baseline (random) classifier.
A second analysis was performed in which a total of 528 prayers were annotated, 245 of which were obtained from site 1 and 283 of which were obtained from site 2. For this analysis, a combination of 1 to 4 annotators (from both the United States and the United Kingdom) annotated each prayer. Overall, interrater reliability was .38 using the Krippendorff alpha scale, which corresponds to moderate agreement. In addition, the annotators achieved an overall agreement of 74.1% (SE, 2.8%). The purpose of this analysis was to test whether there was a significant difference in language that would require different classifiers for different cultures. The classifier was able to classify prayers containing religious struggle with an AROC score of 80% (±3%) using 256 features. Figure 2 depicts the full ROC curve.

Figure 2. Receiver operating characteristic curve for the classifier’s discrimination of religious struggle within British and American prayers. The gray line is the receiver operating characteristic curve for a baseline (random) classifier.
Discussion
We present an exploratory attempt to use NLP to identify the spiritual struggle written in the prayer books of a pediatric academic medical center. The classifier’s performance (high AROC) supports our conclusion that an NLP classifier is a useful objective tool that clinicians and others can use to determine religious struggle in parents and patients through written prayers. The classifier can identify religious struggle by examining patterns of words and phrases within the corpus of texts. In addition, the classification algorithm used can be applied across data sets with different prayers and cultures because the classifier with just American prayers and annotators performed comparably to the classifier with both British and American prayers and annotators. Consequently, the classifier can provide additional decision support to chaplains in identifying persons with religious struggle.
The lack of significant differences in recognizing religious struggle between chaplains in the United States and the United Kingdom, and between board-certified chaplains and nonchaplains, supports the potential clinical utility of the ontology and classifier. Prayers from 2 English-speaking cultures could be identified as having expressions of religious struggle. This suggests that although there may be geographically unique expressions of religious struggle, there are also some expressions that are sufficiently common as to make the use of the classifier possible across regions. These results also suggest that it is feasible to train persons who are not board-certified chaplains to recognize expressions of religious struggle. It may therefore be possible to have prayers in chapel notebooks read in a cost-efficient manner by someone who is not a board-certified chaplain but is trained to use the ontology to recognize expressions of religious struggle. Until the classifier has undergone further development and testing, this could be a viable way to triage chaplains’ clinical efforts.
The use of the female pronoun “her” more frequently in prayers with religious struggle was statistically significant, but may not be clinically meaningful. To determine whether the context in which gendered pronouns occurred might provide a different perspective, a further review of 100 randomly selected prayers was undertaken by the senior author (DHG). Prayers containing “his” used the pronoun primarily in the context of God being asked to “look over” or “watch over.” Prayers in which the pronoun “her” occurred were related to a girl or woman going through a surgery or procedure. This could reflect language of belief reflecting personal experience, an “ordinary theology,” 50 that is more troubled when a girl or woman is in need. It could also reflect a finding that is statistically significant and yet not clinically meaningful.
Another interesting aspect of the top features is the inclusion of “I,” “me,” and “my.” These personal pronouns indicate a focus on the self rather than on the community. A study focusing on personal pronouns in social communication indicated that too much attention to one’s self is associated with negative emotional states such as depression. 51 Because these first-person pronouns were among the most distinguishing features in the classifier, they might indicate the negative emotional state contained within prayers that have religious struggle, suggesting a need for further analysis.
The study has limitations that we acknowledge and accept for an exploratory study. The classifier was built around 528 prayers and 7 annotators from 2 different contexts. Constraints of time and annotator availability meant that only 1 annotator annotated 100 British prayers in the supplementary analysis. As a result, more false-negatives and false-positives could have occurred than intended. However, with relatively stable interannotator agreement, much deviation was not expected and the chance of a type I error or type II error was low. Another potential limitation arises from the concern over the privacy of health information. This has led some institutions to discourage prayer writers from using surnames or patient room numbers, which would be helpful or necessary to correctly identify a prayer writer for follow-up. This concern need not be irremediable: Instructions at the top of each notebook page or on a bookstand could indicate that prayers are being reviewed and that a member of a chaplaincy department may follow up. This is little different in principle from similar notations that prayers may be read aloud during public worship in the chapel space. The lack of need of oversight of this or similar studies stems from the very public nature of the prayer books, the voluntary surrender of some privacy that is made in choosing to write a prayer, and the degree of self-disclosure writers choose to make. One study has even noted that some prayer writers appear to write what is in effect a plea for others to read and lift up their person in prayer. 26 The anonymity of the prayer writers also precluded linking prayers with the patient’s health status. However, it is the religious struggle of the prayer writer that is of primary clinical interest to the chaplain, which may or may not be directly related to the health status or religious struggle of the patient (or the person being prayed about).
This is one difference, perhaps more common to pediatric than adult hospitals, in which the apparent proxy person is in fact the chaplain’s “patient” as opposed to the identified patient. A limitation of the current classifier is this inability to distinguish between identified patient and the prayer writer/proxy. Another limitation is the inability to directly evaluate the individuals who wrote prayers for religious struggle by a clinically trained chaplain, rather than relying solely on chaplains’ evaluation of the written prayer itself and the classifier. Prayers written in a pediatric hospital may differ in significant ways from those written in adult hospitals. For example, prayers relating to a child’s dying may be both more prevalent in a pediatric hospital and express different sentiments and contents than those found in adult centers (especially those without a maternity unit). It is also impossible to definitively obtain demographic data about the prayer writers. More women than men attend religious services in the United States, 52 and it may well be that most of the prayer writers were women. Nevertheless, important conclusions can be drawn. The classifier can discriminate between prayers with or without religious struggle. This validates the classifier’s design and prototype and its continued development to prepare for pilot testing as a tool to assist clinical chaplains in determining how to ration a limited resource, their time, by indicating prayer writers who are using faith to cope and those who are struggling and unlikely to self-refer for chaplaincy care.
Potential future research directions should focus on preparing for the next (pilot) phase of development. This translational research is needed to address the limitations mentioned above. The first direction is increasing the number of prayers included, both to strengthen the classifier and to give more evidence regarding the interannotator reliability across the cultures of the United States and the United Kingdom. The second is identifying and operationalizing a means of linking a prayer with an identifiable person while maintaining an appropriate level of privacy. For example, a header statement in such prayer books could invite the use of surnames or room numbers as an indication of a writer’s desire for direct, follow-up contact based on the prayer’s content. After the translational barriers are overcome, the algorithm may be implemented by making it available to medical centers or by publishing it so that it can be used conveniently and with regional data (if the data presented here contrast sharply with those of the region). The issue of religious struggle occurring more frequently when a woman is the subject of prayer in a hospital notebook bears further exploration. Another intriguing use for the classifier and this methodology lies outside the medical scope: The current state of foreign affairs for the United States increasingly deals with persons who might become radicalized. The machine learning algorithm of religious struggle might be useful to identify persons at risk of radicalization at a faster or even more reliable rate, using social media outlets (eg, Facebook and Twitter).
Footnotes
Acknowledgements
The authors acknowledge all the members of the Pestian Lab, especially Lesley Rohlfs, Robert Faist, and Rachel McCourt, for their support and ongoing commitment. They also gratefully acknowledge the assistance of the following persons: Gail Pyne-Geithman, Diana Tsurov, Sofia Newaz, Sophia M Dimitriou, Joy Cheng, Sarah Lohbeck, McKenzie Roedig, Natasha Yanes, Alexis Teeters, Bill Scrivener, MaryAnn Hegner, Karen Behm, Lou Langford, Maggi Hunt, and Pawel Matykiewicz.
Peer Review:
Four peer reviewers contributed to the peer review report. Reviewers’ reports totaled 1548 words, excluding any confidential comments to the academic editor.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Partial funding for this study was provided by the Summer Undergraduate Research Fellowship Program of the University of Cincinnati and by the Divisions of Pulmonary Medicine and Biomedical Informatics at Cincinnati Children’s Hospital Medical Center.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
DHG conceived and designed the experiments. DHG and PN collected study data. JG, DHG, and BG analyzed the data. JG and PN provided annotations for data. JG wrote the first draft of the manuscript. JG, DHG, and BC contributed to the writing of the manuscript. PN reviewed the manuscript for intellectual content. All authors reviewed and approved the final manuscript.
Disclosures and Ethics
As a requirement of publication, author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including, but not limited to, the following: authorship and contributorship, conflicts of interest, privacy and confidentiality, and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
