Abstract
Introduction
Zhou and Chan [1] proposed a universal method of usability evaluation for products that combines the analytic hierarchy process (AHP) with fuzzy evaluation methods to synthesize performance data and subjective response data. The method aims to derive a two-layer comprehensive evaluation index that is structured hierarchically within the framework of ISO 9241 part 11 [2], which defines usability in terms of three major components: effectiveness, efficiency, and user satisfaction. User satisfaction is scored with the Post-Study System Usability Questionnaire (PSSUQ) with respect to System Usefulness, Information Quality, and Interface Quality [3, 4]. As shown in Fig. 1, the weights of the usability components at the corresponding layers were elicited using AHP [1]. After data were collected for the corresponding metrics in the framework, the evaluation appraisals were computed using the fuzzy comprehensive evaluation technique to characterize fuzzy human judgments. Another goal of the Zhou and Chan paper [1] was to demonstrate theoretically the generality of the fuzzy usability evaluation method by showing that any set of standard usability attributes can be adopted and that the same process can be applied to obtain a comprehensive evaluation. However, a theory alone is not enough; it is also necessary to test how successfully it can be applied in practical cases and to test the strength of the general methodological framework.
According to the Zhou and Chan study [1], the fuzzy comprehensive evaluation technique can combine usability metrics for objective performance data with subjective data from scale questionnaires. To illustrate the effectiveness of the model, a case study based on summative usability testing is presented in this study. The test used a specific network management software product that was designed and developed using an integrated user-centered design approach [5]. Before the software was launched, a standard summative usability test was carried out in a standard usability testing lab to benchmark the overall usability of the product [6–8]. In line with the comprehensive usability evaluation framework proposed by Zhou and Chan [1], the first part of the study was to collect data on effectiveness, efficiency, and user satisfaction. Then, based on the data from the summative usability test, the fuzzy method was used to incorporate both the usability scores and the uncertainties involved in the multiple components of the evaluation.
In the next part of the current study, a comparison was made between the proposed fuzzy evaluation framework and conventional methods widely used in usability practice. Conventionally, one simple and useful technique for combining metric scores on different scales is based on percentages and is called Combining Metrics Based on Percentages in Tullis and Albert’s book [9]. With this method, the evaluated factors or measures are in many cases weighted equally (averaging percentage with equal weights), but sometimes different weights are used to calculate the averages so as to reflect the business goals of the product or of the usability activities (weighted percentage averages). According to the previous study of Zhou and Chan [1], the universal framework integrates two main points: weighting evaluation factors or metrics with the analytic hierarchy process (AHP), and combining them into a comprehensive or single score with the fuzzy approach. Therefore, the two percentage-based methods of combining metrics, i.e. averaging percentage and weighted percentage averages, were selected for comparison with the fuzzy usability evaluation framework in this paper.
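The two percentage-based combinations named above can be sketched as follows. This is only a minimal illustration: it assumes each metric has already been converted to a percentage in [0, 1], and the example percentages and weights are hypothetical values chosen for the sketch, not data or weights from the study.

```python
def average_percentage(percentages):
    """Averaging percentage with equal weights."""
    return sum(percentages) / len(percentages)


def weighted_percentage_average(percentages, weights):
    """Weighted percentage average; weights are normalized to sum to 1."""
    if len(percentages) != len(weights):
        raise ValueError("percentages and weights must have the same length")
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(percentages, weights)) / total_weight


# Hypothetical per-participant percentages for three metrics.
example_percentages = [0.90, 0.60, 0.75]
# Hypothetical weights, for illustration only.
example_weights = [0.5, 0.2, 0.3]

equal_score = average_percentage(example_percentages)
weighted_score = weighted_percentage_average(example_percentages, example_weights)
print(equal_score, weighted_score)
```

The only difference between the two methods is the weight vector, which is why the choice of weights matters when the metrics disagree.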
Application of the fuzzy evaluation technique: A case study
This section shows how to use the fuzzy evaluation technique proposed by Zhou and Chan [1] to benchmark the overall usability of a network management software product. The test was a lab-based usability test conducted for a usability team in a telecom company. The case study focuses on the application of the fuzzy usability evaluation model. Because the fuzzy evaluation technique proposed in [1] was aimed at solving practical usability evaluation issues, the current study attempts to link that work to the job of usability professionals in real-world applications. Details about the fuzzy evaluation model and process in usability or user experience practice can be found in the theoretical study of Zhou and Chan [1].
Methods
Participants
All participants were informed about the study by the experimenter reading a pre-prepared introduction, and they were all required to sign an informed consent form if they agreed to participate in the study. Sixteen users, who were all familiar with and used the test software and had more than two years of professional experience, participated in the tests. They were all males, aged 22 to 32 (mean = 27.19, standard deviation = 3.66), and were considered to be target users of the software. The participants all took part voluntarily and were assured that their responses would be anonymous. The tests took approximately one hour to complete, and each participant was paid one hundred Chinese Yuan for participating.
Experimenters
There were three experimenters for the tests; one was a facilitator for conducting the tests, and the other two were observers. The facilitator had more than three years of experience in conducting usability testing, and the observers had at least six months of usability professional experience. They were all trained on use of the product by the software development team.
Equipped usability laboratory
The tests were conducted in a typical usability laboratory with two soundproof rooms (one for testing, the other for observation). The rooms were separated by one-way mirrors. All performance activities of the participants were captured by video cameras.
Test tasks
Based on results from a task analysis, which was conducted as an important usability activity in an earlier phase of design, fifteen tasks, e.g. selection of the interface to be used, were chosen for testing by the user-centered design team. The team consisted of the system developers, marketing specialists, and usability engineers for the product. Each task was allocated the shortest or ideal completion time as well as the longest allowable time for performance by the team. The tasks were selected to cover the typical functions of the software, and were organized as five test scenarios such as “log in and user management”, “parameter set and modify”, and so on [7, 8].
Procedure
In each test, the participant was asked to complete the tasks as shown in the scenarios. At the end of the tasks, each participant was instructed to fill out the Post-Study System Usability Questionnaire (PSSUQ). He or she was then debriefed, and any usability problems that the participant reported were recorded. The complete test procedure lasted approximately one hour.
Data collection
Data on task success, task completion time, and user satisfaction were collected. According to Zhou and Chan [1], task success was defined as a combination of accuracy, errors, and completeness. Using the proposed operational definitions shown in Table 1, task success was rated by the two observers independently with a numerical score ranging from 0 to 1. The task time was also recorded separately by the two observers. In addition, possible usability problems were recorded.
Results
Preparatory statistics
In this case, average success was computed by dividing the sum of all the task successes by the number of tasks, and then averaging over the two observers. Task completion time was obtained by summing all task times, and then averaging over the two observers. The absolute values of the total task time exhibited more variability and were therefore converted using a transformation. In line with the theoretical framework proposed by Zhou and Chan [1], the converted task time can be calculated using the formula: 2 - (original task time / expected shortest task time). The result will be a value in one of the intervals (–∞, 0), [0, 1], or (1, 2).
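The transformation above can be illustrated with a short sketch. The function and variable names are hypothetical, and the example times are chosen only to show the three intervals; they are not data from the test.

```python
def converted_task_time(task_time: float, shortest_time: float) -> float:
    """Return 2 - (original task time / expected shortest task time)."""
    if shortest_time <= 0:
        raise ValueError("shortest_time must be positive")
    return 2.0 - task_time / shortest_time


def interval_of(value: float) -> str:
    """Name the interval from the text that a converted value falls into."""
    if value < 0:
        return "(-inf, 0)"
    if value <= 1:
        return "[0, 1]"
    return "(1, 2)"


# Hypothetical times in seconds, with an expected shortest time of 300 s.
for observed in (150.0, 300.0, 600.0, 900.0):
    value = converted_task_time(observed, 300.0)
    print(observed, value, interval_of(value))
```

A participant who matches the expected shortest time scores exactly 1, one who takes twice as long scores 0, and faster-than-expected performance gives a value above 1.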
Based on the study by Lewis [4], the PSSUQ rules for calculating the score for user satisfaction were as follows: System Usefulness was scored by averaging the responses to eight items (for example, “It was simple to use this system”), Information Quality was scored by averaging the responses to seven items (for example, “The system gave error messages that clearly told me how to fix problems”), and Interface Quality was scored by averaging the responses to three items (for example, “I liked using the interface of this system”). Each item was rated on a 7-point scale from “strongly disagree” to “strongly agree”.
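The scoring rules above amount to three subscale means. The sketch below assumes that the 18 scored items are stored in questionnaire order, with the eight System Usefulness items first, then the seven Information Quality items, then the three Interface Quality items; that ordering, and all names in the code, are assumptions made for illustration.

```python
from statistics import mean

# Assumed item positions (0-based) for each subscale; the counts 8, 7, 3
# come from the text, the contiguous ordering is an assumption.
SUBSCALES = {
    "system_usefulness": range(0, 8),
    "information_quality": range(8, 15),
    "interface_quality": range(15, 18),
}


def score_pssuq(responses):
    """Average the 7-point responses within each subscale."""
    if len(responses) != 18:
        raise ValueError("expected 18 item responses")
    if any(not 1 <= r <= 7 for r in responses):
        raise ValueError("responses must lie on the 7-point scale")
    return {
        name: mean(responses[i] for i in items)
        for name, items in SUBSCALES.items()
    }


# Hypothetical responses from one participant.
example = [6] * 8 + [5] * 7 + [7] * 3
print(score_pssuq(example))
```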
After processing as above, the original data could be converted to preparatory data as presented in Table 2.
Fuzzy comprehensive evaluation
According to the theoretical evaluation framework proposed by Zhou and Chan [1], the mappings from U (i.e., evaluation vector) to V (i.e., appraisal vector) should be calculated first. In the proposed fuzzy usability evaluation framework [1], semi-trapezoidal and trapezoidal distributions were used to construct mapping functions to characterize fuzzy measure values. Using Equations (9)-(13) in the Zhou and Chan paper [1], with the threshold parameters of the framework, i.e. the values of v_i and c_i, the membership functions of task success and converted task time could be plotted as shown in Fig. 2. Thus, the value of average task success was ranked as very poor, poor, medium, good, and excellent with corresponding degrees ranging in the interval [0, 1].
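To make the idea of semi-trapezoidal and trapezoidal mappings concrete, the sketch below implements a generic trapezoidal membership function. It is not a reproduction of Equations (9)-(13) in [1]: the breakpoints are hypothetical values chosen only for illustration and are not the v_i and c_i thresholds of the framework.

```python
INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c], falling on [c, d].

    Setting a = b = -inf or c = d = +inf gives a semi-trapezoid (shoulder).
    """
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# HYPOTHETICAL breakpoints for a metric on [0, 1]; not taken from [1].
GRADES = {
    "very poor": (-INF, -INF, 0.2, 0.3),
    "poor": (0.2, 0.3, 0.4, 0.5),
    "medium": (0.4, 0.5, 0.6, 0.7),
    "good": (0.6, 0.7, 0.8, 0.9),
    "excellent": (0.8, 0.9, INF, INF),
}


def memberships(x):
    """Degree of membership of x in each appraisal grade."""
    return {grade: trapezoid(x, *params) for grade, params in GRADES.items()}


print(memberships(0.85))
```

With overlapping neighbours like these, a value between two plateaus belongs partly to both grades, which is how the mapping expresses uncertainty rather than forcing a single label.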
For example, the process illustrated in Table 3 shows the membership degree for each of the corresponding grades for task success. After the processes of rank summing and normalization, r_j was calculated as the appraisal vector in the appraisal matrix for the corresponding cluster in the evaluated hierarchy. As shown in Table 3, the effectiveness of the system was calculated as B_effectiveness = (0, 0, 0, 0.371, 0.629). Similarly, efficiency was calculated as B_efficiency = (0.227, 0.163, 0.253, 0.073, 0.284).
With respect to user subjective satisfaction, the proposed evaluation model identified the threshold values v_i as being (1, 2, 3.5, 5.5, 6.5, 7) [1]. In a similar way, the relationship mapping for the three factors, i.e. system usefulness, information quality, and interface quality, was plotted (Fig. 3). Thus, with these three mappings the appraisal matrix for user satisfaction was calculated as:
According to Zhou and Chan [1], the weight vector of user satisfaction was determined as W = (0.312, 0.198, 0.490). In line with Equations (2) and (3) in [1], the fuzzy evaluation of user satisfaction can therefore be calculated as follows:
By combining the evaluation vectors of effectiveness, efficiency, and user satisfaction, the appraisal matrix for the overall usability could be obtained. With the weight vector of the elements in the usability evaluation matrix as W = (0.443, 0.170, 0.387), the top-cluster evaluation for overall usability was also calculated using Equations (2) and (3) in Zhou and Chan [1] as follows:
This is the final appraisal vector. According to the maximum membership principle, the conclusion was that the usability quality of the product was “good”. However, stakeholders of a user experience project usually want an evaluation “score” for benchmarking or comparing products in practice. In addition, the membership degree to “excellent” was also high, so the maximum membership principle may lead to a loss of information about the membership degrees to the other four grades. Therefore, the appraisal vector could be defuzzified to a comprehensive score [10]. In this study, we assigned the appraisal grades “very poor”, “poor”, “medium”, “good”, and “excellent” the scores 31, 50, 67, 82, and 95, respectively, so the appraisal vector B can be defuzzified according to the following formula [10]:
where a is the defuzzified score, a_1 = 31, a_2 = 50, a_3 = 67, a_4 = 82, a_5 = 95, and b_i is the i-th element of the appraisal vector [10]. So the overall usability of the software evaluated in this study can be presented as:
This shows that the usability of the software was between good and excellent.
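The top-cluster composition and the defuzzification step described above can be sketched in a few lines. Two assumptions are made here and should be checked against [1] and [10]: the composition is taken to be a weighted-average operator (b_j = sum of w_i * r_ij, then normalized), and the defuzzified score is taken to be the membership-weighted average of the grade scores. The effectiveness and efficiency vectors, the weights, and the grade scores are the values given in the text; the satisfaction vector is a hypothetical placeholder, so the printed result is not the score reported for the software.

```python
# Grade scores from the text: very poor, poor, medium, good, excellent.
GRADE_SCORES = [31, 50, 67, 82, 95]


def compose(weights, appraisal_matrix):
    """Assumed weighted-average composition, followed by normalization."""
    n_grades = len(appraisal_matrix[0])
    b = [
        sum(w * row[j] for w, row in zip(weights, appraisal_matrix))
        for j in range(n_grades)
    ]
    total = sum(b)
    return [value / total for value in b]


def defuzzify(appraisal_vector, grade_scores=GRADE_SCORES):
    """Assumed defuzzification: membership-weighted average of grade scores."""
    weighted = sum(a * b for a, b in zip(grade_scores, appraisal_vector))
    return weighted / sum(appraisal_vector)


b_effectiveness = [0, 0, 0, 0.371, 0.629]            # from the text
b_efficiency = [0.227, 0.163, 0.253, 0.073, 0.284]   # from the text
b_satisfaction = [0.0, 0.0, 0.1, 0.6, 0.3]           # HYPOTHETICAL placeholder

w_usability = [0.443, 0.170, 0.387]                  # from the text

b_overall = compose(w_usability, [b_effectiveness, b_efficiency, b_satisfaction])
score = defuzzify(b_overall)
print(b_overall, score)
```

Because the score uses all five membership degrees, it keeps the information that the maximum membership principle discards.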
To test the reliability of the fuzzy usability evaluation framework, the fuzzy method of Zhou and Chan [1] was compared with two typical conventional methods based on Combining Metrics Based on Percentages [9]. Confidence intervals are extremely important to usability professionals [6, 11] and have been used to illustrate the reliability of small-sample usability tests, e.g. for usability problem discovery and for user performance measures such as task completion rate [12]. According to Sauro and Lewis [12], a confidence interval provides a measure of both location and precision; that is, an estimate with a narrower confidence interval is more precise than one with a wider interval. Generally, the confidence level and the sample characteristics (i.e., the variability of the sample and the sample size) affect the width of a confidence interval [12]. When the confidence level and sample size are held constant, the method of data analysis can also affect the width of a confidence interval. The data from the usability test case are used again in this section. Since only sixteen participants were tested, the reliability of the three methods is compared using confidence interval widths for different sample sizes. In the usability community, Lewis used the Monte Carlo method to simulate usability problem discovery rates in order to examine how to adjust problem-discovery rates estimated from small samples [11]. Similarly, this method was used here to produce usability testing data for different sample sizes, and comparisons were then made.
Method
In this Monte Carlo simulation procedure, Matlab was used to sample data from the case study to produce each metric, i.e. task success, task time, system usefulness, information quality, and interface quality, independently for each simulated participant, with a sample size of 16. Within the data simulation procedure, the ranges of the true values measured in the usability test case were used as boundaries for each metric, i.e. task success ranged from 0.93 to 0.99, task time ranged from 310.5 seconds to 814.5 seconds, information quality ranged from 4.00 to 6.43, interface quality ranged from 4.33 to 6.67, and system usefulness ranged from 5.38 to 7.0. The simulation procedure generated a total of 100 cases, each with a sample size of 16.
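A simulation of this shape can be sketched as follows. The study used Matlab and does not state the sampling distribution, so this Python sketch assumes, purely for illustration, that each metric is drawn independently and uniformly between the boundaries given above. The function names and the fixed seed are also assumptions.

```python
import random

# Boundaries for each metric, taken from the text.
BOUNDS = {
    "task_success": (0.93, 0.99),
    "task_time": (310.5, 814.5),          # seconds
    "system_usefulness": (5.38, 7.0),
    "information_quality": (4.00, 6.43),
    "interface_quality": (4.33, 6.67),
}


def simulate_case(n_participants=16, rng=None):
    """Draw one simulated test: one dict of metric values per participant."""
    rng = rng or random.Random()
    return [
        {metric: rng.uniform(low, high) for metric, (low, high) in BOUNDS.items()}
        for _ in range(n_participants)
    ]


def simulate(n_cases=100, n_participants=16, seed=0):
    """Generate n_cases simulated tests (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    return [simulate_case(n_participants, rng) for _ in range(n_cases)]


cases = simulate()
print(len(cases), len(cases[0]))
```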
Results
As the first step in preparing the analysis, the methods of “averaging percentage with equal weights” and “weighted percentage averages” were used to convert the metrics to a percentage for each participant in each simulation case. For example, Table 4 shows the two methods being used to combine the usability data for the actual case. In each simulation case, the three evaluation methods, including the fuzzy approach, can then be used to calculate each participant’s overall score for the product. Following this preparatory work, the CONFIDENCE function in Excel was used to compute the confidence interval width at the 95% confidence level for each method. As illustrated in Table 5, confidence interval widths were thus obtained for every sample size from 1 to 16 with each evaluation method.
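For readers without Excel, the calculation can be sketched in Python. Excel's CONFIDENCE(alpha, standard_dev, size) returns the normal-distribution margin of error z * standard_dev / sqrt(size), i.e. the half-width of the interval. The helper below mirrors that calculation. How the standard deviation was obtained for each sample size is not stated in the text, so the use of the sample standard deviation of the first n scores is an assumption, and the sketch starts at n = 2 because a sample standard deviation needs at least two observations. The scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, stdev


def confidence(alpha, standard_dev, size):
    """Normal-based margin of error, as computed by Excel's CONFIDENCE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / sqrt(size)


def widths_by_sample_size(scores, alpha=0.05):
    """Margin of error for the first n scores, for n = 2 .. len(scores)."""
    return {
        n: confidence(alpha, stdev(scores[:n]), n)
        for n in range(2, len(scores) + 1)
    }


# Hypothetical overall scores for 16 simulated participants.
scores = [88.1, 90.4, 86.7, 91.2, 89.5, 87.9, 90.0, 88.8,
          89.9, 90.7, 87.2, 88.5, 91.0, 89.1, 90.2, 88.3]
print(widths_by_sample_size(scores))
```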
Figure 4 shows plots of the average of confidence interval width by evaluation methods and sample size. Overall, the figure shows that (1) confidence interval width at 95% confidence level tends to be the smallest for all sample sizes when conducting fuzzy evaluation, and the confidence interval width is greatest for the averaging percentage method, (2) as the sample size increases, the differences amongst the confidence interval widths tend to reduce especially for the methods of weighted averaging percentage and fuzzy evaluation.
The significance of the above observations was examined by t-test. The result showed that statement (1) above was supported, and the t-test indicated that the width difference between any two methods was significant for any sample size, t(198) ≥ 14.74, p < 0.001. Statement (2) above was partially supported. For the methods of fuzzy evaluation and weighted percentage averages, the confidence interval width differences between sample sizes of N12 and N13, N13 and N14, N14 and N15, and N15 and N16 were not significant, t(198) ≤ 1.87, p > 0.05. For the method of averaging percentage, no significant differences were found between sample sizes of N14 and N15, and N15 and N16, t(198) ≤ 1.92, p > 0.05. For any other two sample sizes, the differences of confidence interval width were significant for any evaluation method, t(198) ≥ 2.03, p < 0.05.
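The 198 degrees of freedom are consistent with an independent-samples t-test on two sets of 100 simulated widths (100 + 100 - 2 = 198). The text does not name the test variant, so the sketch below assumes a pooled-variance Student's t-test and computes only the statistic and its degrees of freedom. The function name and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Return (t, df) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical small samples, only to show the call.
t_value, degrees = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_value, degrees)
```

The p-value would then be read from a t distribution with the returned degrees of freedom.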
Discussion
By following the procedures described in the fuzzy evaluation technique by Zhou and Chan [1], this study succeeded in combining summative usability test data to achieve an overall usability quality for the specific network management software used for the tests here. The two-layer evaluation structure used in this evaluation technique tends to be a common usability index, which may improve the technique’s applicability and universality. As discussed in Zhou and Chan [1], the calculations in the proposed technique are apt to be rather complex for practical use. To overcome this, a usability team in industry may use automatic procedures to run the computations, including the processes to identify parameters in the technique. The current case study indicated that the fuzzy evaluation technique would be particularly useful for comparing the usability or usability quality among different products.
Another goal of this study was to illustrate the advantages of the fuzzy evaluation technique in measuring usability uncertainty. Overall, the fuzzy approach can capture the uncertainties inherent in usability evaluation, and the advantage of the method over the percentage methods was verified here by its significantly smaller confidence interval widths when combining different usability metrics.
Firstly, unlike existing usability evaluation methods such as Combining Metrics Based on Percentages, which combine usability metrics rigidly [9], the fuzzy method used a trapezoidal membership function for structuring the fuzzy evaluation matrix, as well as weighting the relative importance of the evaluated elements at the corresponding evaluated layer. Determining the weights of different evaluation factors should be a prerequisite for almost all usability methods, but the advantages of weighting evaluation factors have not been explained well to the usability community. In the current method, the weights of elements were quantified systematically by the analytic hierarchy process (AHP), which has been shown to be successful in other areas of evaluation [13–15]. The greater differences in confidence interval widths between the method of averaging percentage with equal weights and the weighted evaluation methods, including the method of weighted percentage averages, indicated that it is very important to weight the evaluated factors when combining different metrics into a comprehensive usability evaluation score.
The proposed approach combined the AHP, fuzzy evaluation, and the trapezoidal mapping function to compute the overall usability. The comparisons of confidence interval widths indicated that the proposed fuzzy evaluation technique can decrease the margin for possible evaluation errors. Furthermore, the fuzzy evaluation method imposes no specific requirements on data samples or system types. This is very desirable for usability evaluation in the real world, because usability is often evaluated based on different measurements, and small samples are used frequently even when summative usability testing is conducted in usability practice. These findings indicate that the fuzzy approach is beneficial for estimating the true population value when combining metrics for the overall usability of a single product. This study illustrated that the fuzzy approach can benefit usability practice in the various fields of usability evaluation [16, 17].
Conflict of interest
The authors have no conflict of interest to report.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (NSFC, 31271100 and 71420107025). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Thanks to the anonymous reviewers for their very helpful comments and suggestions regarding an earlier version of this paper.
