Abstract
Background:
As the aesthetics field continues to innovate, it is important that outcomes are carefully evaluated.
Objectives:
To develop item libraries to measure how skin looks and feels from the patient perspective, that is, SKIN-Q.
Methods:
Concept elicitation interviews were conducted and data were used to draft the SKIN-Q, which was refined with patient and expert feedback. An online sample (i.e., Prolific) provided field-test data.
Results:
We conducted 26 qualitative interviews (88% women; 65% aged ≥40 years). Draft SKIN-Q item libraries were formed and revised with input from 12 experts, 11 patients, and 174 online participants who provided 180 survey responses. The psychometric sample of 657 participants (82% women; 36% aged ≥40 years) provided 713 completed surveys (facial, n = 595; body, n = 118). After removing 14 items, the psychometric analysis provided evidence of reliability (≥0.85) and validity for a 20-item set that measures how skin feels and a 46-item set that measures how skin looks. Short-form scales were tested to provide examples of how to use the item sets.
Conclusion:
The SKIN-Q represents an innovative way to measure satisfaction with skin (face and body) in the context of minimally invasive treatments.
Introduction
As aesthetic treatments continue to evolve, an increasing number of people are accessing an expanding range of procedures to tighten, slim, reshape, and rejuvenate the skin on their face and body. 1 To measure outcomes of aesthetic treatments, our research team developed patient-reported outcome measures (PROMs) for people having surgical and nonsurgical procedures. These PROMs include FACE-Q Aesthetics2–9 and BODY-Q.10–12 To measure outcomes specific to the skin, FACE-Q Aesthetics includes a 12-item scale that measures satisfaction with facial skin appearance. BODY-Q includes a 7-item scale that measures how bothered someone is by excess skin on their body.
Since FACE-Q Aesthetics and the BODY-Q were developed, PROM science has continued to evolve. While standard practice for PROM design involves the development of short forms composed of a limited set of items (i.e., questions), more recently, item libraries and item banks have been developed to provide a flexible approach that addresses a limitation of short forms, that is, they may not include the most important concepts for a specific patient population or context of use.13,14
To evaluate outcomes for aesthetic treatments, it is vital that PROMs have high content validity.15–18 To this end, we aimed to develop a comprehensive set of items to measure how skin feels and looks and provide a way to customize fit-for-purpose short-form scales. More specifically, subsets of items can be selected from an item set and scored by calibrating scores to the full set of items (i.e., item-bank approach19,20), or scored using estimates from independent samples (i.e., item-library approach). 21 The short-form approach aims to reduce patient burden, while retaining reliable and valid measurement.
The specific aims of our study were as follows: (1) to elicit skin-related concepts important to patients having minimally invasive treatments that target the face and body; (2) to use the concepts to develop and refine a PROM (i.e., SKIN-Q); and (3) to determine how well the SKIN-Q performs psychometrically.
Methods
Our study used a mixed-methods approach 22 and followed internationally established guidelines for PROM development and validation.15–18 The study was coordinated at McMaster University (Canada). Ethics board approval (#13603) was obtained from the Hamilton Integrated Ethics Board (Canada).
An interpretative description qualitative approach was followed. 23 Adult participants were recruited from six plastic surgery clinics located in Canada (three sites) and the USA (three sites) between October 22, 2021 and March 31, 2022. Clinic staff identified patients who varied by age, gender, race, and minimally invasive treatment. Patients who agreed to take part in the study provided written informed consent. Interviews took place by phone or on a secure web conferencing platform (i.e., Zoom) with experienced qualitative interviewers who followed an interview guide (Supplementary Data S1).
Interviews were audio-recorded, transcribed, and coded line-by-line. Coding was performed independently by two coders who achieved consensus on their initial set of codes. Codes were refined through constant comparison in Excel. 24 Interviews continued until saturation of concepts was reached. 25 Participants were provided a $100 USD gift card.
An item pool was developed and refined through a series of steps. In October 2022, participants from the qualitative interviews were invited to provide feedback online in REDCap. 26 For each SKIN-Q item, participants selected one answer: (1) I do not understand the question; (2) I understand the question, but it could be worded better; (3) I understand the question, but it is not relevant to me; and (4) I understand the question and it is relevant to me. An open-text box was provided for suggestions. Participants were provided a $30 USD gift card.
Cognitive debriefing interviews were performed with survey participants using Zoom by an experienced interviewer. Participants were asked to provide feedback on the SKIN-Q, and to suggest missing content. Interviews were audio-recorded, transcribed, and analyzed. A gift card of $70 USD was provided. Experts in aesthetics, and representatives from the aesthetics industry, were emailed a copy of the SKIN-Q with instructions to point out items they thought were not relevant to patients and to suggest missing concepts.
Content validity was further explored using an online crowd-working platform, that is, Prolific (www.prolific.co). We conducted a screening survey in December 2022. At that time, residents of Canada and the USA fluent in English in the Prolific sample totaled 121,170. We paid participants the equivalent of 10.80 GBP per hour. We included participants who had the treatments listed in Supplementary Data S2 and excluded anyone who had not been to a plastic surgery or dermatology clinic for treatment in the past 12 months, anyone who chose “none” or “other” for the treatment type, and anyone who chose “other” for the location of their body treatment.
For each item, participants chose one answer from the following: (1) I do not understand the question; (2) I understand the question, but it is NOT relevant to me, and (3) I understand the question and it is relevant to me. An open-text box was provided for suggestions. Data for scale refinement were downloaded into SPSS Version 28 (IBM Corporation, Armonk, NY, USA) for analysis.
A pilot field test was conducted using the Prolific sample described earlier. Survey invitations were sent in February 2023. A new Prolific sample was identified for the field-test study. The denominator for residents of Canada and the USA fluent in English for our screening survey in February 2023 was 121,448. Supplementary Data S2 inclusion criteria were used. We excluded participants who reported no treatment or chose “other” for the type of treatment, and for body treatments, anyone who chose “other” for treatment location and those whose treatment had worn off. Prolific participants were invited to complete a REDCap survey starting on March 3, 2023. At the end of the survey, participants were asked (yes/no) if they would be willing to complete the survey again in 7 days for a test–retest (TRT) study.
The pilot and field-test data were analyzed with Rasch Measurement Theory (RMT) analysis 27 using RUMM2030 software 28 and the unrestricted Rasch model for polytomous data. The pilot study was used to identify and remove items with extreme misfit to the Rasch model. For the field test, analysis was used to identify the best subset of items to retain for each item set based on a set of psychometric tests (Table 1).
Psychometric tests performed
DIF, differential item functioning; ICC, intraclass correlation coefficients; RMT, Rasch Measurement Theory; TRT, test–retest.
Results
Tables 2 and 3 show patient and treatment characteristics. The 26 participants from the qualitative sample had one or more minimally invasive facial treatments, and six participants had one or more minimally invasive body treatments involving the abdomen, chest, and thighs. Coding and analysis identified skin-specific concepts that were developed into items measuring how skin looks and how skin feels.
Participant characteristics
Treatment history reported by the qualitative sample and Prolific participants based on number of surveys completed
Number of unique participants.
PRP, platelet-rich plasma.
Supplementary Data S3a,b shows item-level changes made after each round of patient/expert input. In Round 1, 11 of the 26 participants completed the 79-item survey, providing 869 ratings. Of these, 0.1% of ratings were “I do not understand”; 10.9% of ratings were “I understand this question, but it could be worded better”; 9.7% of ratings were “I understand this question, but it is not relevant to me”; and 79.3% of ratings were “I understand this question and it is relevant to me.”
Seven of the 11 participants took part in a cognitive debriefing interview. Round 1 also included three aesthetic plastic surgeons and one plastic surgery resident from Canada. Based on feedback, 67 items were retained, 7 were revised, 5 items were dropped, and 6 items were added resulting in a total of 80 items.
Round 2 included five plastic surgeons, one dermatologist, and two industry experts from Denmark, Canada, Sweden, and the USA. Based on this round, 20 items were retained, 58 items were revised, 2 items were dropped, and 8 items were added. At this point, the word “facial” was removed from all appearance items so that they were applicable to the face and body. At the end of Round 2, there were 86 items.
In Round 3, 939 Prolific participants accessed the screening survey. We invited the 281 people who met the study criteria to complete the survey. After exclusions, of the 174 respondents, 6 had both face and body treatments providing 180 (face = 129; body = 51) survey responses (Tables 2 and 3). Results for item comprehension and relevance are shown in Supplementary Data S3a,b and summarized in Supplementary Data S4. For the 86 items, the option “I do not understand the question” was chosen 1.3% of times, and the option “I understand the question and it is relevant to me” was chosen 74.6% of times. Based on this round, 79 items were retained, 2 items were revised, and 5 items were dropped. The pilot field test included 81 items.
The 174 participants were invited to complete the pilot field test, and 161 respondents provided 167 assessments: 123 had facial treatments, and 44 had body treatments. Based on the RMT analysis, one skin item with poor fit was dropped. The field-test version had 80 items, that is, 58 measuring how the skin looks (17 face-specific) and 22 measuring how the skin feels.
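The item counts reported across the three refinement rounds and the pilot field test can be checked with a short ledger. The sketch below is illustrative only: the function and variable names are hypothetical, and the figures are those stated in the text (79 items entering Round 1, then the retained, revised, dropped, and added counts for each round).

```python
# Item counts reported in the text for each refinement round:
# (round label, retained, revised, dropped, added)
ROUNDS = [
    ("Round 1", 67, 7, 5, 6),
    ("Round 2", 20, 58, 2, 8),
    ("Round 3", 79, 2, 5, 0),
]


def track_item_counts(start, rounds):
    """Return the item count after each round, checking each round's inputs."""
    counts = [start]
    for label, retained, revised, dropped, added in rounds:
        # Items entering a round are either retained, revised, or dropped.
        entering = retained + revised + dropped
        if entering != counts[-1]:
            raise ValueError(f"{label}: expected {counts[-1]} items, got {entering}")
        # Items leaving a round are the retained, revised, and newly added ones.
        counts.append(retained + revised + added)
    return counts


counts = track_item_counts(79, ROUNDS)
field_test_items = counts[-1] - 1  # one item dropped after the pilot RMT analysis
print(counts, field_test_items)  # [79, 80, 86, 81] 80
```

The final count of 80 matches the split reported for the field-test version (58 items on how the skin looks plus 22 on how the skin feels).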
For the field test, we screened 2500 Prolific participants. After removing duplicates and incompletes, 2419 remained, of which 904 met the inclusion criteria (face = 878; body = 157; both = 137). Of the 702 responses, 66 were incomplete, 51 had no treatment, 32 answered “other” for type of treatment, and 7 provided unreliable answers. The field-test sample included 546 surveys (face = 472; body = 74) from 496 participants.
For the RMT analysis, pilot (N = 167) and field-test (N = 546) data for the 657 participants were combined (total surveys = 713). Tables 2 and 3 show participant characteristics and treatment history. The 118 surveys from participants who had a body treatment covered the following locations: abdomen = 60, thighs = 40, buttocks = 33, hips = 25, chest = 23, arms = 22, and lower legs = 8. RMT results are shown in Table 4 (scale-level) and Supplementary Data S5 (item-level).
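The survey flow from field-test responses to the combined psychometric sample is simple arithmetic, sketched below using only the counts stated in the text. The variable names are hypothetical.

```python
# Field-test responses and the exclusions reported in the text.
field_test_responses = 702
exclusions = {
    "incomplete": 66,
    "no treatment": 51,
    "other treatment type": 32,
    "unreliable answers": 7,
}
field_test_surveys = field_test_responses - sum(exclusions.values())

# Surveys by treatment area in the pilot and field-test samples.
pilot = {"face": 123, "body": 44}
field = {"face": 472, "body": 74}

# Combined sample used for the RMT analysis.
combined = {area: pilot[area] + field[area] for area in pilot}
total_surveys = sum(combined.values())

print(field_test_surveys, combined, total_surveys)
# 546 {'face': 595, 'body': 118} 713
```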
Rasch measurement theory scale-level statistics and other psychometric results
α, Cronbach alpha; χ 2 , chi-square; CI, confidence intervals; DF, degrees of freedom; PSI, person separation index; +extr, with extremes; −extr, without extremes; ICC, intraclass correlation coefficient; LB, lower bound; UB, upper bound; RMT, Rasch Measurement Theory.
For the 58 items measuring how the skin looks, 12 items were dropped due to poor item fit to the Rasch model. Of the remaining 46 items, all are relevant to facial skin and 33 are relevant to body skin. The 46 items had ordered thresholds (Supplementary Data S6a). After the Bonferroni adjustment, all items fit the Rasch model with nonsignificant chi-square p-values, and most (i.e., 28/46) had fit residuals within ±2.5. Differential item functioning (DIF) was identified for 19 items, with stable DIF (i.e., evident in all three random samples) evident for only four items in the age-group analysis. Pearson correlations between person locations before and after items were split for DIF showed negligible impact on the scoring, that is, all correlations were 1.000.
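To illustrate the item-fit criteria described above, the sketch below applies a Bonferroni-adjusted significance threshold to item chi-square p-values and checks whether fit residuals fall within ±2.5. It is a minimal illustration, not the RUMM2030 procedure: the nominal alpha of 0.05 and the division by the number of items are assumptions (neither is stated in the text), and the item names, p-values, and residuals are hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Bonferroni-adjusted significance level (alpha = 0.05 is an assumption)."""
    return alpha / n_tests


def flag_items(item_stats, n_tests, alpha=0.05, residual_limit=2.5):
    """Flag items with significant chi-square misfit or large fit residuals.

    item_stats maps an item name to (chi-square p-value, fit residual).
    """
    threshold = bonferroni_threshold(n_tests, alpha)
    flags = {}
    for name, (p_value, fit_residual) in item_stats.items():
        flags[name] = {
            "significant_misfit": p_value < threshold,
            "residual_outside_range": abs(fit_residual) > residual_limit,
        }
    return flags


# Hypothetical values for three items in a 46-item set.
example = {
    "item_01": (0.20, 1.1),
    "item_02": (0.004, -3.0),
    "item_03": (0.0005, 2.7),
}
flags = flag_items(example, n_tests=46)
for name, result in flags.items():
    print(name, result)
```

With 46 tests the adjusted threshold is roughly 0.0011, so in this hypothetical example a p-value of 0.004 is not flagged as significant even though the same item's fit residual lies outside ±2.5.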
Most of the sample scored on the scale (96.8%). Data fit the Rasch model. Reliability was high with Person Separation Index (PSI) and Cronbach alpha values ≥0.98. A total of 30 pairs of items evidenced local dependency with residual correlations >0.30. PSI dropped 0.01 and Cronbach alpha dropped 0.05 after a subtest was performed. Supplementary Data S6b shows the Person-Item Threshold Distribution. The samples (face and body) were targeted to the scale. The floor (0.3) and ceiling (1.7) effects were low.
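The local dependency check described above flags item pairs whose residual correlation exceeds 0.30. The sketch below shows one way such a screen could be implemented, assuming standardized person-item residuals are already available from the Rasch analysis. The residual values and item names are hypothetical, and the code is not the RUMM2030 implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def locally_dependent_pairs(residuals, cutoff=0.30):
    """Return item pairs whose residual correlation exceeds the cutoff."""
    flagged = []
    for item_a, item_b in combinations(sorted(residuals), 2):
        r = pearson(residuals[item_a], residuals[item_b])
        if r > cutoff:
            flagged.append((item_a, item_b, round(r, 3)))
    return flagged


# Hypothetical residuals for five respondents on three items.
residuals = {
    "item_a": [0.5, -0.2, 0.1, -0.4, 0.3],
    "item_b": [0.4, -0.1, 0.2, -0.5, 0.2],
    "item_c": [-0.3, 0.4, -0.1, 0.2, -0.2],
}
pairs = locally_dependent_pairs(residuals)
print(pairs)
```

Flagged pairs would then be combined into subtests, and the reliability statistics recomputed, to gauge how much the dependency inflates reliability.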
For the 22 items measuring how the skin feels, 2 items were dropped due to poor item fit to the Rasch model. The remaining 20 items are relevant to face and body. These items had ordered thresholds (Supplementary Data S7a), fit the Rasch model with nonsignificant chi-square p-values, and most (i.e., 13/20) had fit residuals within ±2.5. DIF was identified for 13 items, with stable DIF evident for 8 items in the age-group analysis. Pearson correlations between person locations before and after the items that evidenced DIF were split showed negligible impact on scoring, that is, correlations ≥0.994. Most of the sample scored on the scale (93.4%).
Data had slight misfit to the Rasch model (chi-square = 196.89, df = 160, p = 0.03). Reliability was high with PSI and Cronbach alpha values ≥0.95. A total of seven pairs of items evidenced local dependency. The impact on the reliability statistics was marginal, with a drop of 0.01 in the PSI values and 0.06 in the Cronbach alpha values after subtests were performed. Supplementary Data S7b shows the Person-Item Threshold Distribution. The samples (face and body) were targeted to the scale. Floor (0.8) and ceiling (5.9) effects were low.
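For readers less familiar with the reliability and targeting statistics reported here, the sketch below computes Cronbach alpha from a respondent-by-item score matrix and floor and ceiling effects from scale scores. It is illustrative only: the response data are hypothetical, and treating floor and ceiling effects as the percentage of respondents at the minimum and maximum score is an assumption about how the reported values were derived.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def floor_ceiling(scores, min_score, max_score):
    """Percentage of respondents scoring at the minimum and at the maximum."""
    n = len(scores)
    floor = 100 * sum(s == min_score for s in scores) / n
    ceiling = 100 * sum(s == max_score for s in scores) / n
    return floor, ceiling


# Hypothetical responses: five respondents, three items rated 1 to 4.
responses = [
    [4, 3, 4],
    [2, 2, 3],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]
alpha = cronbach_alpha(responses)

# Hypothetical transformed scale scores on a 0 to 100 range.
floor_pct, ceiling_pct = floor_ceiling([0, 50, 100, 100], 0, 100)
print(round(alpha, 3), floor_pct, ceiling_pct)
```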
Five example short-form scales were created: Skin Rejuvenation, Skin Quality, and Facial Movement. All items in Skin Rejuvenation and Skin Quality are relevant to both facial and body skin. Psychometric results are shown in Table 4. Data fit the Rasch model for the five short forms. All 43 items had ordered thresholds and nonsignificant p-values after Bonferroni adjustment. Fit residuals for 25/43 items were within ±2.5. Reliability was high with PSI and Cronbach alpha values ≥0.87. Items in two short-form scales evidenced local dependency. Subtests performed led to drops of ≤0.11 in reliability statistics, with all values ≥0.83. The short forms were well targeted; ≥87.5% of the sample scored on the scales' range of measurement.
Figure 1 shows the construct validity results. As hypothesized, SKIN-Q scores were incrementally lower the more treatment was reported as having worn off, being more bothered by lax skin, and looking older than one's actual age. Our hypothesis that SKIN-Q scores would be incrementally lower for deeper dynamic and static lines was also supported (Supplementary Data S8).

Figure 1. Mean scale scores for construct validation hypotheses.
Table 4 shows the TRT results. We excluded one participant who completed the TRT on day 15 and 20 participants who reported change. Intraclass correlation coefficients were ≥0.81.
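As an illustration of how a test–retest intraclass correlation coefficient can be computed, the sketch below implements a two-way random-effects, absolute-agreement, single-measurement ICC. The choice of this ICC form is an assumption, since the text does not state which model was used, and the paired scores are hypothetical.

```python
def icc_two_way_absolute(pairs):
    """ICC (two-way random effects, absolute agreement, single measurement).

    pairs is a list of per-participant score tuples, one score per occasion.
    """
    n = len(pairs)          # participants
    k = len(pairs[0])       # measurement occasions (2 for test-retest)
    grand = sum(sum(p) for p in pairs) / (n * k)
    row_means = [sum(p) / k for p in pairs]
    col_means = [sum(p[j] for p in pairs) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for p in pairs for x in p)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical test and retest scores for four participants.
scores = [(50, 52), (60, 58), (70, 73), (80, 79)]
icc = icc_two_way_absolute(scores)
print(round(icc, 3))
```

Other ICC forms (for example, consistency rather than absolute agreement) would give somewhat different values for the same data, so the model should be reported alongside the coefficient.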
Discussion
We followed best practice guidelines for PROM development to create content to measure outcomes of skin treatments for the face and body. Our qualitative efforts elicited a set of skin-related concepts important to participants, which they deemed comprehensive, relevant, and easy to understand. When tested in a large sample of people who had a minimally invasive treatment, the psychometric evidence supported the reliability and validity of the SKIN-Q item sets and five example short forms.
These findings add to the published literature on PROMs for aesthetic treatments by providing a means to measure satisfaction with how skin feels and looks anywhere on the face or body. The BODY-Q skin scale differs from SKIN-Q in that the BODY-Q measures how bothered people are by excess skin, an important concept in the context of body contouring after massive weight loss.10,11
The SKIN-Q has some conceptual overlap with the FACE-Q Aesthetics 12-item skin scale. 8 Both scales measure satisfaction with skin. However, the FACE-Q version includes five unique items, and the word “facial” in each item. It is important to note that the FACE-Q Aesthetics skin scale is qualified as a Medical Device Development Tool (MDDT) by the US Food and Drug Administration for use as a co-primary or secondary endpoint in clinical trials. 37 The SKIN-Q, in contrast, is new and has not undergone similar qualification.
As aesthetic treatments continue to expand, it is vital that PROM development keeps pace. PRO item sets used as banks or libraries provide a flexible approach that addresses limitations inherent in short forms. In the choice of which PROM to use, potential users would be wise to maximize content validity and ensure the PROM is fit for purpose. Recommendations on the use of item libraries have been published. 13
Our team has created the first available item sets for measuring satisfaction with skin in the context of minimally invasive aesthetic treatments. When used in clinical trials, these item sets can provide clinicians and researchers with the opportunity to pick a set of items relevant to measuring outcomes for their procedure or product. Short forms can either be calibrated in relation to the full set of items using the Rasch model (i.e., item bank approach), or stand-alone scoring can be created (i.e., item library approach).
These findings must be interpreted in the context of the study design. First, the sample had fewer participants who underwent an aesthetic treatment for the body than for the face. Second, some treatments in our sample were represented by a small number of participants, and surgical treatments were not included. Third, our sample included only English-speaking people in Canada and the USA. Fourth, the treatment and clinical data were self-reported and were not verified clinically. Finally, the online platform includes participants who self-select to take part and are paid for their involvement. There is evidence that the data provided through Prolific are high quality compared with other platforms.38,39
To conclude, the SKIN-Q is an innovative PROM that can be used to measure outcomes for minimally invasive treatments that target the face or body skin. The design of SKIN-Q makes it possible for end users to customize fit-for-purpose short-form scales to maximize content validity and reduce patient burden. Future research should examine psychometric properties not addressed in this article, such as responsiveness and minimally important differences. SKIN-Q can be accessed via www.qportfolio.org.
Footnotes
Authors' Contributions
Conceptualization, methodology, data curation, formal analysis, and writing—original draft by A.F.K. Conceptualization, methodology, and writing—review and editing by A.L.P. Conceptualization, methodology, investigation, and writing—review and editing by M.K. Investigation and writing—review and editing by J.M. and E.T. Resources, writing—review and editing by S.D., J.K., K.A., K.S., and L.P. Data curation, formal analysis, and writing—review and editing by C.R. Conceptualization, methodology, formal analysis, and writing—review and editing by S.C.
Author Disclosure Statement
A.F.K., S.C., and A.L.P. are codevelopers of the SKIN-Q PROM and as such would receive a share of any license revenues as royalties based on their institutions' inventor sharing policy. A.F.K. is an owner of EVENTUM Research, which provides consulting services to the pharmaceutical industry.
Funding Information
The authors received no financial support for the research, authorship, and publication of this article.
References
