Abstract
Study Design
A sentiment analysis of online reviews of spine surgeons.
Objectives
Physician review websites have significant impact on a patient’s provider selection. Written reviews are subjective, but sentiment analysis through machine learning can quantitatively analyze these reviews. This study analyzes online written reviews of spine surgeons and reports biases associated with demographic factors and trends in words utilized.
Methods
Online written and star-reviews of spine surgeons were obtained from healthgrades.com. A sentiment analysis package was used to analyze the written reviews. The relationship of demographic variables to these scores was analyzed with t-tests, and word and bigram frequency analyses were performed. Additionally, a multiple regression analysis was performed on key terms.
Results
8357 reviews of 480 surgeons were analyzed. There was a significant difference between the means of sentiment analysis scores and star scores for both gender and age. Younger, male surgeons were rated more highly on average (P < .01). Word frequency analysis indicated that behavioral factors and pain were the main contributing factors to both the best and worst reviewed surgeons. Additionally, several clinically relevant words, when included in a review, affected the odds of a positive review.
Conclusions
The best reviews laud surgeons for their ability to manage pain and for exhibiting positive bedside manner. However, the worst reviews primarily focus on pain and its management, as exhibited by the frequency and multivariate analysis. Pain is a clear contributing factor to reviews, thus emphasizing the importance of establishing proper pain expectations prior to any intervention.
Introduction
The field of orthopedic and neurosurgical spine surgery is one that sees constant development. Over the last two decades, there has been a continual increase in the volume of spinal procedures performed.1-3 During this same time period, major technological advancements in the field, including minimally invasive surgery, computer assisted navigation, and robotic guidance, have changed the manner in which these procedures are conducted.4,5 With these constant developments, it is necessary that we stay up to date with what patients are saying online about the surgeons who perform these procedures, as their reviews are never stagnant.
Unsurprisingly, orthopedic patients are frequently turning to the internet to both find and review medical information, clinics, and physicians.6-8 Physician review websites in particular (e.g., Vitals.com, Healthgrades.com, Google.com, RateMDs.com) have emerged as popular platforms for patients to disseminate their opinions on the providers they have seen; in short, these review websites allow prospective patients to explore former patient experiences and reviews when choosing their future care provider. Presently, studies have begun to evaluate physician review websites across orthopedics in general and the majority of orthopedic subspecialties (i.e., Spine, Hand, Foot and Ankle, Arthroplasty, Sports, Shoulder and Elbow). 8 Given that information found on the internet, including physician review websites, has the ability to impact a person’s thought process, it is vital that the orthopedic community understands what patients are saying about spine surgeons online.9,10
Multiple studies in the literature have analyzed patient reviews for spine surgeons on popular physician review websites. In terms of demographics, both younger surgeons and those who work in an academic setting were found to have more positive online reviews, while surgeon sex and region of practice had no impact on overall ratings.11-15 Surgeon personal characteristics, including surgeon trustworthiness, punctuality, confidence, character, and bedside manner, were associated with positive reviews and ratings.12,13 However, the overall number of a surgeon’s scientific publications was found to have no impact on patient ratings of surgeon characteristics such as trustworthiness. 16 Yet, practice-related characteristics such as ancillary staff friendliness, ease of scheduling, and office environments were found to be associated with worse overall ratings.11,12,17,18 Further, to highlight the changing times, it was recently shown that surgeons with a social media presence are significantly more likely to have higher ratings across physician review websites, 13 a trend not seen in previous years. 12 This last point should serve as evidence that we need to continually evaluate the comments being deposited on these dynamic websites.
Of the aforementioned physician review website studies for spine surgeons, only those of Donnally et al. and Kalagara et al. analyzed written reviews,13,17 while the rest assessed granular star-reviews or rating scores; the most recent study focused on physician reviews from 2019. 12 While there are currently eight publications that have analyzed physician review websites for spine surgeons, none have utilized a natural language processing program to analyze written reviews.11-18 With new patient comments being left daily, and as the field of spine surgery continues to change, it is imperative that these findings are continually updated. The goal of this study was to use sentiment analysis to quantitatively analyze patients’ comments and ratings of spine surgeons to determine what factors are most associated with both positive and negative reviews. Using written reviews up through 2021, we hypothesized that surgeons who are younger, who provide sufficient pain relief, and who have positive personal characteristics (e.g., punctual, confident, etc.) will have the most favorable written reviews and ratings.
Methods
Data Acquisition
The website healthgrades.com was used to collect publicly available written and star-rating reviews of spine surgeons. Healthgrades was chosen because it was one of the first few websites suggested when generally searching for providers, and because large amounts of data could be web scraped from it without restriction. Although other physician-rating websites exist, many have firewalls in place that prevent the use of code to extract large amounts of data. A web scraper was developed in order to obtain the demographic data. This code “scrapes” data from websites after being given the URL of each provider’s healthgrades.com page. It then parses the HTML in order to extract the key information needed for the study, such as provider age, gender, and reviews. Star-rating reviews refer to the reported ratings out of five stars given to surgeons on these websites. This data is publicly available, and the star-rating reviews provide an overall average star rating for each surgeon. Inclusion criteria included surgeons who were listed on the “The Physician Payments Sunshine Act” as “Spine Surgeons.” 19 This list of surgeons was then cross-checked against online review websites to confirm that they were also listed as spine surgeons within their online profiles. Exclusion criteria included those surgeons who had no online ratings or fewer than seven written reviews. Several linear regressions were performed to confirm the relationship between our calculated sentiment analysis scores and online reported star scores. These regressions were repeated at a range of minimum-review cutoffs, from the entire cohort with no surgeons excluded up to excluding surgeons with <20 reviews. Of all iterations, the cohort of surgeons with at least seven reviews provided the best fit, and as a result, this cutoff was selected in order to include as many providers as possible while accurately calculating average scores.
Sentiment Analysis
The “Valence Aware Dictionary and sEntiment Reasoner” (VADER) is a widely used Python package for obtaining sentiment analysis scores of written text. This package is built into the Natural Language Toolkit (NLTK) library. 20 The input for VADER is a written excerpt and the output is a score that represents the “sentiment” of the excerpt. Sentiment is defined as how positive or negative a sentence is based on the prose utilized, the connotations of the words, punctuation, capitalization, and word modulators such as “very” or “not.” VADER was used to obtain scores of each written review for every spine surgeon. VADER has been used to analyze Twitter tweets and other forms of social media in order to obtain insights into how the general public feels about particular subjects, and has recently been applied to healthcare-related social media.21-23 Because it has been utilized in many studies focusing on online prose, VADER was selected as well suited to the present study.
VADER Score Calculation
VADER relies on a word dictionary that was developed by ten independent human raters who were trained and assessed for inter-rater reliability. The raters assigned scores ranging from −4 to +4, with 0 representing a neutral sentiment, to each word in their dictionary. 20 VADER works by taking the inputted sentences, scanning for words included in the dictionary, summing the matching scores, and normalizing the total to between −1 and +1, where −1 represents negative sentiment and +1 represents positive sentiment. Scores are also affected by punctuation marks as well as capitalization, both of which are used for emphasis in online reviews.
Additionally, the calculation that VADER performs factors in potential modifiers to words. A negating or emphasizing adverb, such as “not” or “very,” will impact the score of an overall text excerpt. This means that “not helpful” will contribute more negatively to overall sentiment while “very helpful” contributes more positively.
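The summing and normalizing step can be illustrated with a minimal sketch. It assumes the normalization used in the open-source VADER implementation, x / sqrt(x² + α) with α = 15; that constant and the toy word valences below are not reported in this study and are included only for illustration. The sketch deliberately ignores punctuation, capitalization, and modifier handling.

```python
import math

# Toy valence lexicon on VADER's -4..+4 scale. These entries are
# hypothetical illustrations, not values from the real VADER dictionary.
TOY_LEXICON = {"caring": 2.0, "helpful": 1.8, "rude": -2.0, "pain": -2.3}


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into the interval [-1, +1].

    alpha = 15 is an assumption taken from the open-source VADER code.
    """
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))


def toy_compound(text: str) -> float:
    """Sum the valences of known words, then normalize the total."""
    words = text.lower().split()
    raw_sum = sum(TOY_LEXICON.get(word.strip(".,!?"), 0.0) for word in words)
    return normalize(raw_sum)
```

Under these assumptions, a review containing only unknown words scores 0, and each additional positive or negative word moves the score toward +1 or −1 without ever leaving that range.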
Model Validation
Linear regression analysis was implemented to compare the average sentiment analysis score for each doctor to their average star score in order to show an association between calculated sentiment scores of the written reviews and the online star rating.
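As a rough sketch of this validation step, the code below averages sentiment and star scores per surgeon and computes the coefficient of determination of a simple least-squares line. The function names, the tuple layout of `reviews`, and the `min_reviews` parameter are hypothetical; the study's actual code is not shown in the text.

```python
from collections import defaultdict
from statistics import mean


def surgeon_averages(reviews, min_reviews=7):
    """Average sentiment and star score per surgeon.

    `reviews` is an iterable of (surgeon_id, sentiment, stars) tuples
    (a hypothetical layout). Surgeons with fewer than `min_reviews`
    reviews are dropped.
    """
    grouped = defaultdict(list)
    for surgeon_id, sentiment, stars in reviews:
        grouped[surgeon_id].append((sentiment, stars))
    averages = {}
    for surgeon_id, rows in grouped.items():
        if len(rows) >= min_reviews:
            averages[surgeon_id] = (
                mean(s for s, _ in rows),
                mean(r for _, r in rows),
            )
    return averages


def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

Calling `surgeon_averages` with different `min_reviews` values and comparing the resulting `r_squared` is one way a review-count cutoff could be compared across iterations. `r_squared` raises `ZeroDivisionError` if either variable is constant.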
Data Analysis
Student t-tests were used to assess how demographic variables (age, gender) related to average sentiment scores of written reviews.
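For reference, a two-sample Student t statistic can be sketched as follows. This assumes the classic pooled-variance (equal variance) form; the text does not state which variant or software was used, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Two-sample Student t statistic with pooled variance.

    Returns (t, degrees_of_freedom). A P-value would then be read from
    the t distribution with that many degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df
```

Here `sample_a` and `sample_b` would hold the per-surgeon average scores of the two groups being compared, for example surgeons younger and older than 50.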
Word frequency analysis was performed to assess the most common content included in the surgeon reviews. Specifically, the most positive reviews (sentiment score >.75) and the most negative reviews (sentiment score <0) were independently assessed with word frequency analysis to determine what content was present in highly positive and highly negative surgeon reviews in particular. Further context was provided to the word frequency analysis by looking at the most commonly used word-pairs, or bigrams. This means that we searched every review for the frequencies of all two-word sequences.
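The bigram count described above can be sketched with the standard library alone. The tokenizer is a simplifying assumption (whitespace splitting with punctuation stripped) rather than the NLTK tokenization used in the study, and the manual removal of clinically irrelevant words is not reproduced here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a review and strip surrounding punctuation from words."""
    tokens = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [token for token in tokens if token]


def bigram_counts(reviews):
    """Count every adjacent two-word sequence across all reviews."""
    counts = Counter()
    for text in reviews:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def split_by_sentiment(scored_reviews, high=0.75, low=0.0):
    """Separate (text, score) pairs into most positive and most negative."""
    best = [text for text, score in scored_reviews if score > high]
    worst = [text for text, score in scored_reviews if score < low]
    return best, worst
```

`bigram_counts(worst).most_common(5)` would then list the five most frequent two-word sequences among the negative reviews.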
Finally, we performed a multiple logistic regression on key words and word-pairs to analyze their association with a sentiment score >.5.
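To make the odds ratio interpretation concrete, the sketch below computes an unadjusted odds ratio for a single term from a 2×2 table. This is a deliberate simplification: the study fit a multiple logistic regression, which adjusts each term for the others, so the numbers produced here would not match the reported values. The function name and the substring matching rule are assumptions.

```python
def unadjusted_odds_ratio(scored_reviews, term, threshold=0.5):
    """Odds ratio of a positive review when `term` appears in the text.

    `scored_reviews` is an iterable of (text, sentiment_score) pairs.
    A review counts as positive when its score exceeds `threshold`.
    """
    # 2x2 table: [term present / absent] x [positive / not positive]
    with_pos = with_neg = without_pos = without_neg = 0
    for text, score in scored_reviews:
        has_term = term in text.lower()  # simple substring match
        positive = score > threshold
        if has_term and positive:
            with_pos += 1
        elif has_term:
            with_neg += 1
        elif positive:
            without_pos += 1
        else:
            without_neg += 1
    return (with_pos * without_neg) / (with_neg * without_pos)
```

An odds ratio above 1 means reviews containing the term have higher odds of a positive score, and a value below 1 means lower odds. The function raises `ZeroDivisionError` when a cell in the denominator is empty, so sparse terms would need separate handling.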
Results
Surgeon Demographics
Demographic Data on Spine Surgeons Analyzed
Model Validation: Linear Regression
The linear regression analysis of average sentiment analysis scores to average star scores showed a positive association between the scores (Figure 1, r2 = .711, P-value <
Linear regression of average calculated sentiment analysis scores of each surgeon compared to their reported online star ratings.
Model Validation and Demographic Analysis: Student T-Tests
Student T-test comparing Star and Written Reviews to Gender and Age
Further T-tests were conducted to check for a significant difference between means of sentiment analysis scores given to surgeons older and younger than the age of 50. There was a significant difference between older surgeons and younger ones as, on average, older surgeons received lower sentiment analysis scores (mean sentiments: < 50y = +.486, >50y = +.396; p =
Word Frequency Analysis
Frequencies of the most used words recognized by NLTK are also reported. The most frequently used and meaningful words describing top-rated surgeons are associated with care, compassion, and comfort, whereas surgeons with the worst reviews are often characterized as rude, arrogant, and unable to relieve the pain of their patients. Words that were high frequency but not clinically or behaviorally relevant were removed in order to focus on characteristics that would be helpful in determining what factors affect patient reviews. For example, words such as “great” and “horrible” were removed because, although they describe the patient’s general experience with a physician, they do not aid in our analysis of what behavioral or practice characteristics are associated with these reviews.
Clinically Relevant Single-Word Frequency Analysis of Best and Worst Reviews.
Clinically Relevant Bigram Frequency Analysis of Best and Worst Reviews.
For the most negatively reviewed surgeons, the descriptors centered around levels of pain as well as inefficiency in pain management. Among the reviews with a negative sentiment analysis score, “pain” was used 1063 times, and the next most frequent relevant word was “rude” at 241 (Table 3). This suggests that the clear factor driving negative reviews of spine surgeons is pain and pain management. Although behavioral words appear among the most used words in the negative reviews, they are used far less frequently than “pain.” Bigram analysis of negative reviews also supports this claim, as all of the top 5 clinically relevant, highest frequency bigrams were about pain, descriptors of pain, or regions of pain (Table 4).
Multiple Logistic Regression
Multiple logistic regression analysis on clinically relevant keywords.
The results of this regression showed that words defining positive surgeon behaviors, such as “listens,” “knowledgeable,” “warm,” and “confident,” were positively associated with reviews that had positive sentiment scores. The more positive behaviors patients described, the more likely the surgeon was to receive a better review. The behaviors with the greatest odds ratios were “listens,” “knowledgeable,” “warm,” and “confident,” with odds ratios of 2.5 (P < .01), 2.97 (P < .01), 3.17 (P < .01), and 6.08 (P < .01), respectively, indicating that including these words was associated with roughly 2.5-fold, 3-fold, 3-fold, and 6-fold greater odds of a review receiving an overall positive score.
Regression analysis also showed that inclusion of “long wait” was associated with roughly half the odds (OR = .45) of a review receiving a positive sentiment score. Additionally, the inclusion of “friendly staff” did not significantly affect the odds of receiving a largely positive review.
Finally, inclusion of the words “pain” and/or “severe pain” was significantly associated with decreased odds of receiving positive reviews (OR = .372 and OR = .263, respectively), whereas the inclusion of “pain free” in surgeon reviews was associated with roughly 4 times greater odds of a positive review (OR = 3.96).
Discussion
The field of orthopedics, especially the spinal subspecialty, is very dynamic in terms of both medical practice and surgical approach. Currently, spinal procedures are seeing worldwide increases in volume in conjunction with patients utilizing the internet (i.e., Physician Review Websites) to aid in their surgeon selection process.1,2,7,8 Thus, in order for surgeons to provide the highest quality of care while successfully recruiting prospective patients, it is crucial that we understand how the spine community is being rated and reviewed online. This current study sought to appreciate the general vernacular being used to describe the highest and lowest rated spine surgeons, while simultaneously identifying characteristics associated with these scores. With 480 surgeons analyzed here, our study represents the second largest spine cohort analyzed using physician review websites. Further, our study processed the most written reviews (8,357) to date and is the only study to do so utilizing written sentiment analysis. Overall, we found that surgeons who are warm, confident, knowledgeable, and listen while providing sufficient pain relief receive the highest scores through online review platforms.
A consistent finding throughout the literature is that an orthopedic surgeon’s age is significantly associated with the positivity of their online ratings. In general, younger surgeons saw more positive reviews across various orthopedic subspecialties. 8 This point remained constant in sub-analyses of spine surgeons, with younger surgeons amassing significantly higher ratings than older surgeons across multiple physician review websites (e.g., vitals.com, Google.com, and healthgrades.com).11,12,14 Our findings agree with the current literature, as we saw that spine surgeons younger than 50 years of age have significantly more positive written reviews (<50: +.531, >50: +.423; P = .03) and star ratings (<50: 4.59/5.00, >50: 3.99/5.00; P < .01) compared to surgeons older than 50. Given this consistent finding for online reviews of spine surgeons, more experienced surgeons should consider adopting characteristics highlighted later in this study. Further, this remains an interesting point as past literature (articles not focusing on online reviews) has shown that patients indicate no preference for a physician based on their age;24-26 yet patients’ online comments suggest a different narrative. The same point can be made for patients’ preferences regarding their surgeon’s gender. Previous literature on patient review websites has shown that patients report no preference for their spine surgeons based on surgeon gender.14-16 However, offline studies have indicated that patients may prefer a male or female operating surgeon based on the subspecialty or the patient’s gender identity.27-29 Our results differ from the current physician review website literature. We saw that female surgeons received both lower written review scores (F: +.353, M: +.437; P = .04) and star ratings (F: 3.82/5.00, M: 3.97/5.00; P = .03) compared to male surgeons.
However, because our cohort included far more male surgeons (n = 440) than female surgeons (n = 40), we believe further research is warranted before drawing strong conclusions regarding the impact of physician sex on review scores. Additionally, splitting the cohort by gender may introduce a selection bias that could be affecting the results we saw.
Previous literature has indicated that patients most desire orthopedic physicians who display behavioral qualities such as good bedside manner, listening, and trustworthiness.8,30-32 Additionally, Donnally et al., Melone et al., and Kalagara et al. showed that spine surgeons who are confident, punctual, helpful, and answer questions received the most positive ratings online.13,15,17 In our sentiment analysis, we show that “warm,” “confident,” “listen,” “knowledgeable,” and “bedside manner” are all positively associated with surgeons receiving higher written review scores. These findings make intuitive sense, as patients want a surgeon who truly cares for their well-being. Elsewhere in the literature, however, non-personal and non–physician-related characteristics have been linked to alterations in post-visit patient satisfaction. For example, multiple subsidiary characteristics such as physician attire, clinic environment, and scheduling ease have all been shown to significantly influence a patient’s healthcare experience.28,33,34 For spine surgeons in particular, online ratings indicate that long wait times can drive down, while friendly staff help improve, surgeon ratings.11-13,15 While we did not note “staff,” “friendly staff,” or “long wait” as having an influence on surgeon ratings or reviews, it has been shown that these variables significantly affect patient post-visit satisfaction in spine practices.35,36 As (dis)satisfaction can impact how patients review spine surgeons online, and ultimately how future patients will perceive them, these ancillary points should be strongly considered by spine surgeons in order to optimize their overall practice.
In a similar study, Agarwal et al. studied the online ratings of emergency departments and urgent care centers utilizing co-occurring words in text analysis, similar to the word and bigram frequency analysis and multivariate regression performed in the present study. 37 They found that the most positive reviews focused on effective pain management and improved communication. Our bigram frequency analysis and multivariate regression emphasize this focus on “pain” and pain-descriptors as well. After screening over 8000 reviews for the most common word-couplings, we found that pain and its locations inundated the negative reviews. It is clear that lasting pain and patients’ experiences with pain greatly influence their perception of their providers’ abilities. Pain and pain management are an overwhelming focus in these reviews, indicating how critical it is for physicians to clarify any misconceptions that patients have prior to surgery and to perform proper pain expectation management. Soroceanu et al., in their study on preoperative expectations for patients undergoing lumbar and cervical spine surgery, found that patients whose expectations were fulfilled had higher postoperative satisfaction. 38 Those who retained more preoperative expectations also had decreased postoperative satisfaction. Both of these results further emphasize how important it is that surgeons devote proper attention to pain expectation management. Mahomed et al. also indicated that 66% of individuals expected no pain after recovery from total joint arthroplasty procedures, 39 further emphasizing the misconceptions that patients may hold. By preemptively addressing pain and the inherent complexity in treating pain, physicians may be able to resolve any preconceived notions that surgery is a guaranteed cure for a patient’s pain.
Surgeons who address these misconceptions head on are more likely to resolve them, and patients who still feel pain after the procedure will be less likely to attribute it to a lack of ability on the part of their provider and more likely to recognize that pain resolution is inherently difficult and never ensured.
Moreover, the power of the present study is such that we can screen for any descriptive words or phrases and assess how they may influence a physician’s rating. This could prove to be a useful tool for surgeons trying to improve their individual online presence, especially as comments that do not pertain to the surgeon personally can still reflect negatively on them. Further, in recent years minimally invasive and robotic-assisted procedures have gained immense traction within the orthopedic subspecialties, especially in spine. The market for robotic assistance in particular is estimated to reach $2.77 billion by 2022 as more and more surgeons adopt these technologies in their everyday practice. 40 Despite this immense increase in the popularity of emerging technologies among providers and in the literature, the interest has yet to significantly alter patient opinions. When specifically searching for the frequency of phrases, the word “robot” appeared only once in our entire set of 8357 reviews. Additionally, the multivariate regression for words associated with different types of procedures (“open,” “minimally invasive,” “robot”) returned no significant associations. Future research needs to be performed as patients begin to comment on their experiences with these new surgical methods.
We do recognize that this study is not without its limitations. Only one website, healthgrades.com, was utilized, as it was the only website from which we were able to freely extract data. This has the potential to bias the results, as they are sourced from a single site. Additionally, given the subjective nature of the ratings and reviews left on public websites, we were unable to assess a patient’s rationale for leaving specific comments. We were also not able to separate physician-specific from non-physician-specific remarks in the overall sentiment analysis scores. Further, using the “The Physician Payments Sunshine Act” to curate a list of surgeons meant that we were only able to run analysis on a subset of all spine surgeons. Future directions for this project could include the development of an app or website that would allow providers to input their own websites and reviews and receive individualized data on how patients are reviewing them online. As comments are uploaded and added daily, surgeons using such a machine learning based app would be able to actively monitor their reviews in real time and amend their practices as needed.
Conclusion
This study represents the largest single analysis of written comments left for spine surgeons on physician review websites. We saw that both surgeon age and gender were associated with the sentiment of the reviews left for providers in our analysis, although the small number of female surgeons in our cohort limits the strength of the gender finding. In general, positive behavioral attributes contributed significantly to positive reviews, as one would expect. However, although some negative behavioral characteristics contributed to the worst reviews, pain was the largest factor as seen in both the frequency analysis and the multiple logistic regression. This study thus serves to reinforce the importance of practicing proper pain expectation management prior to surgery in order to better serve patients and thus improve patient satisfaction and reviews.
Footnotes
Author Contribution
Samuel Kang-Wook Cho, MD, FAAOS: AAOS: Board or committee member; American Orthopaedic Association: Board or committee member; AO Spine North America: Board or committee member; Cervical Spine Research Society: Board or committee member; Globus Medical: IP royalties; North American Spine Society: Board or committee member; Scoliosis Research Society: Board or committee member; Stryker: Paid consultant. Jun S. Kim, MD: Aldentyfy Inc.: Stock or Stock Options.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
