Abstract
The mobile app rating scale (MARS) is a widely used instrument for evaluating smartphone app quality. We aimed to examine the reliability and validity of the Korean version of MARS (MARS-K). Two independent raters performed the assessment using the translated 23-item questionnaire. We used the intraclass correlation coefficient (ICC) to examine inter-rater reliability; omega and item-total correlation to examine internal consistency; and Pearson’s r to examine test–retest reliability and the correlations between the subscales and the total score of MARS-K. Most items showed moderate to good ICC (0.447–1.000). The MARS-K showed excellent internal consistency, and all subscales exceeded the acceptable level of omega. The results indicated that MARS-K is a valid and reliable instrument for evaluating disease management apps offered in the Korean app store. However, upgrades are recommended to further improve MARS-K’s rating accuracy and reliability.
Introduction
Mobile applications (apps) represent one of the fastest-growing technologies due to the high penetration rate of mobile phones worldwide. 1 According to a Korea Internet and Security Agency report, 82.5% of people in Korea over 5 years old are smartphone users. 2 Mobile technologies play a pivotal part in various aspects of our everyday lives. Advanced smartphone features, such as Bluetooth and location sensing, extend the usability of health applications that perform varied tasks such as providing reminders to track calorie consumption, self-management of specific health conditions, remote monitoring of a targeted disease, tracking physical activity, promoting self-care behaviors (taking medications as prescribed, maintaining a healthy diet and weight, maintaining good mental health habits), behavioral tracking, monitoring symptoms, and maintaining a dialogue with healthcare practitioners through secured text messages and video conferencing calls.3–8
Health applications render promising future directions for disease care due to their accessibility, relatively low cost, and high capacity for information storage. Some applications have already proven to be effective healthcare intervention tools. 9 Such applications may be particularly helpful for patients with chronic diseases as they offer multiple benefits. Individuals with chronic conditions can become overwhelmed due to complex treatment regimens. Poor adherence to disease management practices increases patients’ risks for complications leading to increased healthcare expenses. Health applications have clinical implications for supplementing traditional clinic-based treatment with real-time assessments, monitoring, and data collection. 10
For this reason, several apps have been developed that specifically target populations with chronic conditions. The mobile health market has reported a 41% annual growth rate since 2015, the highest growth rate in the digital health sector. 11 Mobile health devices, including smartphone apps, were designed primarily for managing chronic conditions such as diabetes, hypertension, and depression. This market has expanded rapidly, which has generated interest in the effectiveness of the apps and a need for additional information regarding app quality. National public health agencies have begun incorporating mobile technologies to ensure quality in healthcare services. 12 Researchers have demonstrated that perceived usefulness is closely linked to quality, which may lead to an increase in available effective health apps.13,14
In Korea, new free or low-cost health applications are released daily. 11 While apps provide consumers with a potential advantage by offering interactive tools that improve access to health information and support treatment adherence, 15 they can also have downsides. For example, some apps provide incorrect or misleading information, and consumers may use this faulty information to make their own health-related decisions. 16 Moreover, every user has a different app adoption process based on circumstantial factors such as society’s values, environment, and friends, and on individual characteristics such as age, gender, race, health history, and educational background. 17 Consequently, some people adapt faster and gain significant benefits, while others may struggle to learn and use the apps, and incorrect information can be passed on to and followed by users. Given their potential effects, both good and bad, on individuals with chronic conditions, mobile apps should be required to meet standards that guarantee their quality. However, apps do not have to demonstrate their safety or disclose if they are not evidence-based. 18 Therefore, there is a need for a thorough health application assessment process to optimize their effects on public health.
The Mobile Application Rating Scale (MARS) was recently developed by Stoyanov and colleagues. 19 It uses 23 items, covering four objective domains (engagement, functionality, aesthetics, and information) and one subjective domain, to evaluate an app’s quality. The MARS has been widely translated into other languages, including Italian, 20 German, 21 Spanish, 22 and Arabic, 23 and earlier studies have demonstrated its reliability. Previous studies also verified its applicability by using MARS to assess various kinds of apps, such as those which focus on smoking cessation, disease management for patients with epilepsy, anxiety, diabetes, obesity and associated disorders, health and fitness, and disease prevention.20–25 In addition, a recent validation study using MARS rating scores of 1,299 mobile health apps confirmed the validity and reliability of MARS, concluding that MARS is a suitable tool to assess the quality of health apps. In particular, this study included a large number of specific target groups across various conditions, including anxiety, low back pain, cancer, depression, diet, elderly, gastrointestinal diseases, medication adherence, mindfulness, pain, physical activity, post-traumatic stress disorder, rheumatism, weight management, and internalizing disorder MHA for children and youth. 26
In Korea, most health applications have not been based on sound evidence and theoretical frameworks; they have been designed by various individuals or organizations rapidly developing multifarious health applications that are easy to market for public usage. 27 Concerns have been raised not only about these apps lacking any theoretical basis but also about how few studies have been done to provide systematic evaluations of their effectiveness. 28 There are no existing instruments for measuring the quality of mobile applications in Korean. Thus, we aimed to develop a Korean version of MARS and test its validity and reliability.
Methods
We used content analyses 29 to chart a range of health-related mobile applications currently being used in Korea. We first collected data from various types of publicly available sources to detect the presence of frequently used mobile applications. To counterbalance potential bias in the findings, we used investigator triangulation. 30 Each researcher independently examined the data. The results were compared and a level of common agreement was achieved.
Search strategy: Phase one
The analysis involved four phases. The first phase was identifying health-related mobile applications targeting the general public in Korea. We selected the Google Play Store and the Apple App Store, as their apps work with the mobile operating systems most used in Korea. Both stores were searched from March 1 to March 3, 2020. The search keywords included 46 chronic conditions from a list of diseases provided by the Korea Institute for Health and Social Affairs. 31
Selection criteria: Phase two
List of applications tested.
Instrument, translation and back translation: Phase three
The third phase involved using the Mobile Application Rating Scale (MARS) developed by Stoyanov et al. 19 to validate the instrument with a sample of frequently used health-related mobile applications in Korea. This phase aimed at reviewing the quality of healthcare services offered by mobile applications and identifying patterns in the data. This resulted in a series of implications and recommendations for the validation of MARS in Korean.
The original version of the Mobile Application Rating Scale (MARS) consists of 23 items rated on a 5-point Likert scale, with scores from one (poor) to five (excellent). An additional option, “not applicable,” exists for five items: 14–17 and 19. MARS has two parts, the objective and the subjective quality assessments. The objective quality of mobile applications is evaluated across four subscales: engagement (five items), functionality (four items), aesthetics (three items), and information (seven items). Four items (items 20–23) are used for subjective quality assessment.
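The item structure described above can be sketched in code. The following is a minimal illustration, not the scoring procedure used in this study: it assumes that items are numbered contiguously by subscale (with engagement taken as items 1–5 so that the counts sum to 23), that a subscale score is the mean of its rated items with “not applicable” responses left out, and that an overall objective score is the mean of the four objective subscale means. The names `SUBSCALES`, `subscale_means`, and `objective_quality` are hypothetical.

```python
from statistics import mean

# Assumption: items are numbered contiguously by subscale.
SUBSCALES = {
    "engagement": range(1, 6),      # items 1-5
    "functionality": range(6, 10),  # items 6-9
    "aesthetics": range(10, 13),    # items 10-12
    "information": range(13, 20),   # items 13-19
    "subjective": range(20, 24),    # items 20-23
}
NA_ALLOWED = {14, 15, 16, 17, 19}   # items that offer "not applicable"
OBJECTIVE = ("engagement", "functionality", "aesthetics", "information")


def subscale_means(ratings):
    """Return the mean score per subscale.

    ratings maps item number (1-23) to a score from 1 to 5,
    or to None where the rater chose "not applicable".
    """
    result = {}
    for name, items in SUBSCALES.items():
        scores = []
        for item in items:
            score = ratings[item]
            if score is None:
                if item not in NA_ALLOWED:
                    raise ValueError(f"item {item} has no N/A option")
                continue  # N/A items are left out of the mean
            if not 1 <= score <= 5:
                raise ValueError(f"item {item}: score {score} outside 1-5")
            scores.append(score)
        result[name] = mean(scores) if scores else None
    return result


def objective_quality(means):
    """Mean of the four objective subscale means (assumed convention)."""
    return mean(means[name] for name in OBJECTIVE)


if __name__ == "__main__":
    # Hypothetical ratings for one app by one rater.
    example = {item: 4 for item in range(1, 24)}
    example[14] = None  # rater marked item 14 as not applicable
    per_subscale = subscale_means(example)
    print(per_subscale, objective_quality(per_subscale))
```

Keeping “not applicable” out of the mean, rather than coding it as zero, avoids penalizing an app for an item that could not be rated.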
The Korean version of MARS was developed after obtaining permission from the developer of the original MARS. A forward–backward translation procedure was performed with the consensus of the developer. First, MARS was translated into Korean by the author. Then, a blind backward translation was performed by a bilingual translator who has been in an English-speaking country (the USA) for longer than 20 years. The developer of the original MARS reviewed the results and discrepancies were resolved.
Assessment of apps: Phase four
Based on previous studies,19,20,34 the sample size was determined as a minimum of 41 apps for a two-rater assessment. For the intraclass correlation coefficient (ICC), the sample size was determined using Zou’s sample size calculation. This number was needed to obtain an agreement of at least 80% while keeping the half-width of the two-sided confidence interval below 0.15, with an assurance probability of 0.90. Standard online training was made available for raters 19 to ensure consistent interpretation of terminology regardless of the researchers’ background or country and to improve the quality of app evaluation using MARS. Following the training program, three reviewers (KH, SK, and YH) piloted MARS on apps not included in this study to verify the reliability of the results. Any conflicts regarding interpretation and subtle nuances were resolved through discussion to improve alignment. In the assessment process, two authors (KH and SK) served as raters and independently evaluated all included apps using MARS-K. One of the two researchers was based in Australia and used a Samsung Galaxy A5 and an iPhone Pro Max, while the other was based in Korea and used a Samsung Galaxy S10 and an iPhone 8. Both researchers downloaded and tested each mobile application.
Statistical analysis
All statistical analyses were performed using SPSS statistics software, version 27.0 (SPSS Inc., Chicago, Illinois), and statistical significance was determined at p < 0.05. The rating scores of individual raters were pooled, and descriptive statistics such as the mean and standard deviation (SD) for total scores and subscale scores were produced.
Objectivity
The intraclass correlation coefficient (ICC) was calculated to examine the consistency between the ratings of the two raters. The results were interpreted according to the following criteria: excellent (ICCs above 0.90), good (ICCs between 0.76 and 0.89), moderate (ICC between 0.51 and 0.75), and poor (ICC below 0.50). 35
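To make the calculation concrete, the sketch below computes an ICC from an apps-by-raters matrix and maps it to the interpretation bands listed above. The text does not state which ICC model was used, so the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption here; the analysis in this study was run in SPSS, not with this code. The function names are hypothetical, and because the published bands leave small gaps (for example between 0.75 and 0.76), the cut points used in `interpret_icc` are also an assumption.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is an (n_apps, n_raters) array-like of scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between apps
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_total = ((x - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def interpret_icc(icc):
    """Map an ICC to the bands used in the text (cut points assumed)."""
    if icc > 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc > 0.50:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical scores: five apps rated by two raters on one item.
    scores = [[4, 4], [3, 4], [5, 5], [2, 3], [4, 3]]
    value = icc_2_1(scores)
    print(round(value, 3), interpret_icc(value))
```

An absolute-agreement form penalizes a rater who is systematically more lenient, which a consistency-only form would not.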
Reliability
Reliability was assessed using omega, which is known to provide a less biased estimate of reliability than the widely used Cronbach’s alpha.36,37,38 Omega scores of the individual subscale scores and total scores of MARS-K were calculated to verify internal consistency. The omega reliability coefficient was interpreted according to the following criteria: acceptable (0.70–0.79), good (0.80–0.89), and excellent (>0.90). 39 In addition, test–retest reliability was used to assess the stability of the measurement, using Pearson correlations between ratings made at two points in time. The two assessments were taken approximately 2 weeks apart.
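As an illustration of the two quantities described above, the sketch below computes omega from standardized factor loadings under a single-factor model and a Pearson correlation between two rating occasions. It assumes the loadings have already been estimated elsewhere (the study used SPSS; this code does not fit the factor model), and the function names and the handling of band edges are hypothetical choices.

```python
import numpy as np


def omega_total(loadings):
    """Omega for a single-factor model with standardized loadings.

    omega = (sum of loadings)^2 /
            ((sum of loadings)^2 + sum of unique variances),
    where each unique variance is 1 - loading^2.
    """
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2
    unique = (1.0 - lam ** 2).sum()
    return float(common / (common + unique))


def interpret_omega(omega):
    """Map omega to the bands used in the text (band edges assumed)."""
    if omega >= 0.90:
        return "excellent"
    if omega >= 0.80:
        return "good"
    if omega >= 0.70:
        return "acceptable"
    return "below acceptable"


def retest_correlation(first, second):
    """Pearson r between scores from two rating occasions."""
    return float(np.corrcoef(first, second)[0, 1])


if __name__ == "__main__":
    # Hypothetical standardized loadings for a four-item subscale.
    w = omega_total([0.82, 0.75, 0.68, 0.71])
    print(round(w, 3), interpret_omega(w))

    # Hypothetical total scores for six apps, rated two weeks apart.
    time_1 = [3.2, 4.1, 2.8, 3.9, 4.4, 3.0]
    time_2 = [3.4, 4.0, 2.9, 3.7, 4.5, 3.1]
    print(round(retest_correlation(time_1, time_2), 3))
```

Unlike Cronbach’s alpha, this form of omega does not require every item to load equally on the underlying factor.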
Validity
For item analysis, the item-total correlation (ITC) was analyzed. Correlations were calculated to investigate whether each subscale measures a distinct construct. 40 Pearson’s correlation coefficients (r) were obtained for the subscales and total scores of MARS-K to evaluate the construct of the measurement. A coefficient of 0.70 was used as the threshold for acceptability.
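The item analysis described above can be sketched as follows. This is an illustrative example only: it uses the corrected form of the item-total correlation, in which each item is correlated with the total of the remaining items, and the text does not say whether the corrected or uncorrected form was used. The function names are hypothetical, and the 0.70 threshold is taken from the text.

```python
import numpy as np

ACCEPTABLE_R = 0.70  # acceptability threshold stated in the text


def corrected_item_total(scores):
    """Correlate each item with the sum of all other items.

    scores is an (n_apps, n_items) array-like of ratings.
    Returns one Pearson r per item.
    """
    x = np.asarray(scores, dtype=float)
    total = x.sum(axis=1)
    return np.array(
        [np.corrcoef(x[:, j], total - x[:, j])[0, 1] for j in range(x.shape[1])]
    )


def subscale_correlations(subscale_scores):
    """Pearson correlation matrix between subscale (or total) scores.

    subscale_scores is an (n_apps, n_subscales) array-like.
    """
    return np.corrcoef(np.asarray(subscale_scores, dtype=float), rowvar=False)


def below_threshold(correlations, threshold=ACCEPTABLE_R):
    """Indices of items whose correlation falls below the threshold."""
    return [i for i, r in enumerate(correlations) if r < threshold]


if __name__ == "__main__":
    # Hypothetical ratings: six apps on a four-item subscale.
    ratings = [
        [4, 4, 3, 5],
        [3, 3, 3, 2],
        [5, 4, 4, 4],
        [2, 2, 3, 3],
        [4, 5, 4, 1],
        [3, 2, 2, 4],
    ]
    itc = corrected_item_total(ratings)
    print(np.round(itc, 3), below_threshold(itc))
```

Removing the item from the total before correlating avoids inflating the coefficient, which matters most for short subscales such as the three-item aesthetics subscale.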
Results
A total of 284 apps were initially retrieved; then, the duplicates were removed (n = 17). Sixty apps were considered for downloading after applying the exclusion criteria stated above. Fourteen apps were further excluded due to being unavailable in Australia, inadequate contents, and functionality problems. Forty-six apps were included for the MARS-K validation study (Figure 1). The mean scores and distribution of scores on the five subscales by rater are shown in Table 2. Except for the skewness score on the information subscale by Rater 2 (−1.026), all skewness scores were within the ±1 range.
Flow chart for app selection.
Mean scores and distribution by rater and subscale.
a Items 18 and 19 were excluded from calculation because of lack of ratings.
Objectivity
Inter-rater, test–retest reliability and item-total correlation of MARS-K.
Reliability
Omega, by rater and subscale.
a Items 18 and 19 were excluded from calculation because of lack of ratings.
Validity
Pearson’s correlation coefficient, by rater and subscale (Rater 1: upper right; Rater 2: lower left).
a p < .05.
b p < .01.
Discussion
This study developed and evaluated the Korean version of MARS (MARS-K). A comprehensive search identified 46 health-related apps targeting the management of diseases that are prevalent in Korea. The MARS-K showed good objectivity and overall reliability and validity, indicating that it is suitable for quality evaluation of health applications in Korea. MARS-K could replace previous app ratings, including star ratings, whose reliability has been a constant concern for health apps.41,42 Since end-users require information to gauge the reliability of an app, obtaining quality ratings from experts will help reassure end-users, leading to good adherence to the app.
Similar to previous studies,20–23 relatively higher inter-rater reliability and internal consistency scores were obtained for subscale 1 (engagement) of MARS-K, whose content can easily be judged objectively. The questions ascertained whether the app offers diverse features to enhance user engagement and included examples such as gamification, customized settings for individual needs, options to add feedback, alerts, reminders, and the ability to share gathered information with others. Raters were able to decide based on clear evidence of the apps having those features.
The “functionality” subscale showed relatively lower inter-rater reliability and internal consistency, which is in line with the results of previous studies.20,22 Because some words were highly abstract, the low ICCs likely resulted from inconsistent interpretation of items. With respect to words such as “appropriate” and “uninterrupted” (item eight) and “intuitive” (item nine), raters might have applied different standards, which could increase the subjectivity of the ratings. Since several objective items were unclear, the ratings could hardly be made objective. There were numerous aspects to consider when measuring options for customization and complete tailoring to the individual’s characteristics/preferences. It might be beneficial to count the number of features to clarify the level of each functionality and customization, for example ranging from 0 (none) to 10 or above (maximum). Incorporating additional explanations with examples would potentially enhance consistency, transparency, and inter-rater reliability where necessary.
Findings indicated moderate to good correlation among subscales, while “functionality” showed relatively low correlation with other subscales. These results support previous studies20,22 and could be explained by the commonly shared characteristics regarding apps' performance. Using MARS, apps were likely to obtain higher scores in functionality when they consisted of features that are simple and easy to navigate. 20 Indeed, both raters gave their highest average scores on the functionality subscale. Most apps included in this study were information and education focused rather than disease management focused; thus, theoretically, they are easy to use and uncomplicated for the target population, which includes older adults with potential illness. The majority of health applications demonstrated good functionality by valuing simplicity and offering limited features for good usability, which also led to the separation of functionality from the other subscales.
It was hard to obtain matching results between raters for certain aspects of MARS-K, for example, dimension 6 (subjective quality). Although the raters reached a consensus on how to rate, there was an inevitable difference between their results due to the subjective nature of the items. 19 In addition, unexpectedly, item six, which asks about app performance, showed a low ICC. Studies have reported that the factors influencing app performance are processor, RAM, storage, software, battery, and temperature. 43 The two raters used different smartphones within different network environments (Rater 1 was based in Australia, and Rater 2 was based in Korea), which may have caused the divergence of ratings regarding app performance. Indeed, Rater 2 used a newer smartphone than Rater 1. Generally, apps are updated to suit the latest smartphone models and operating systems, which may explain why Rater 2 scored higher on apps’ performance than Rater 1. Also, the network environment was not identical, which might have affected the functionality of apps.
All apps provided a certain level of information detailing what their app offered in the Google Play app store; some were comprehensive while others were deficient. The inter-rater reliability for the sufficiency of the app description, in both quality and quantity, showed considerable disagreement between raters. One explanation could be the different academic backgrounds of the raters. Unlike the other rater, the rater with a nursing background may be better able to judge whether or not the information provided is adequate for users’ understanding. Evaluation of health apps could involve multiple stakeholders from different knowledge backgrounds, for example, healthcare professionals or technicians. Considering that MARS-K measures diverse aspects of an app’s quality, it is more important to involve individuals with relevant backgrounds than to obtain identical rating scores.
The following are some considerations for developing app rating scales. Patient experience has become a crucial part of the quality of any healthcare service, and the user-centered design method understands users’ experience and prioritizes the needs of end-users.44,45 Furthermore, the user-centered design is an evidence-based approach and proven method of mHealth intervention that engages end-users to enable more appropriate app design.46,47 The World Health Organization (WHO) recommends this approach and promotes this within the lifecycle of mHealth (i.e., app design) to ensure an effective outcome. 48 “User-centered,” “human-centered,” and “patient-centered” can be used interchangeably.
When apps fundamentally target disease management, it is important to embrace the entire user (patient) journey, emphasize human-centered design, and encompass holistic well-being. 49 A human-centered design brings users' experience to the core of the design process. It uses techniques to communicate, interact, and empathize with users, to elicit and understand their desires and experiences, and potentially to uncover latent needs. 50 Therefore, the app rating scale should evaluate whether the app is well designed from a human-centered point of view and comprehensively consider improving patient experience.
It is difficult for the current MARS-K to measure whether an app motivates users to use it, provides supportive and meaningful interaction, helps users achieve their goals effectively, changes behavior in a more positive direction, and ultimately improves their health and well-being. Furthermore, a previous validation study suggested the inclusion of a therapeutic alliance domain, as it could strengthen the quality of assessment when it comes to health apps. 26 The ENLIGHT instrument, for example, contains a section to assess therapeutic alliance, 51 measuring the feasibility of health apps as an effective means of disease management.
Another concern is the possibility that individual items carry different weights within each dimension. 26 Although it was confirmed that the individual dimensions measure different aspects of app quality, the current calculation methods should be reconsidered. Presently, most studies use the sum score of each dimension, which poses a risk of faulty analysis. Thus, future studies should propose new calculation methods for weighting items to ensure the metric quality of the measurement.
While validating MARS-K, two items showed complete agreement between raters: items 18 and 19. These two items address professional credibility and evidence-based practice, checking whether the app has been tested and whether positive results have been verified by evidence. Notably, only two apps could be rated on item 18 (developed by a reliable organization) and item 19 (evidence-based research), which raised concerns regarding the quality of health apps in Korea. A substantial number of new health-related apps are made available daily, and it is widely acknowledged that there are no specific regulations for registering a newly developed app on the app stores. 52 Therefore, it is hard to determine which health-related apps truly work, especially among those targeted for disease management. Potentially, apps providing false information could aggravate a user’s health condition. Future work should include improvements to the MARS, particularly to meet the requirements for targeting disease management. Also, a system could be developed to register health-related apps.
In addition, health apps store private and disease-related data; hence, it is crucial to follow the rules and regulations to ensure public safety. 42 The Food and Drug Administration (FDA) classifies health apps as medical devices with potential risk to users, therefore requiring a process of approval. Given that this approval focuses on the safety aspects of the apps, a quality check using MARS, a quality evaluation tool, would provide end-users with sufficient information. As the first quality assessment tool in Korean, MARS-K would not only help improve the awareness of researchers and developers regarding the factors that influence the quality of apps but also help them understand what features and information can assist in improving the effectiveness of the app.
Conclusion
The MARS-K is a valid measurement for assessing the quality of health apps targeting disease management. However, further revisions of items would enhance the tool’s reliability. Additionally, incorporating user-centered factors and increasing the credibility of health apps would help ensure their safety for clinical and personal use. Given the advantages and drawbacks identified in this study, thorough upgrading of the tool might help improve the accuracy, reliability, and validity of ratings.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1G1A1006737).
