Abstract
This paper explores the potential of combining online guest reviews with hotel classification systems, focusing on the feasibility of incorporating such reviews at the star-classification level. Leveraging machine learning techniques on a database of Portuguese hotels and their corresponding ratings on Booking.com, this study reveals a weak association between official hotel star categories and mean review scores of satisfaction items rated by users during the review process, suggesting a discrepancy between official star-classifications and consumer expectations and experiences. Based on the results, a new classification model is proposed, which integrates a classification system based on Booking.com reviews alongside traditional star categories, aiming to complement hotel star-classifications with a further quality dimension as perceived by customers through online reviews. This model provides travellers with richer and more reliable information, facilitating decision-making in the hotel selection process.
Introduction
Hotel classification systems are widely used in the accommodation sector to provide measurable indicators for consumers and intermediaries (UNWTO, 2015). Using these indicators makes it easier to compare hotels’ service levels and equipment standards. Therefore, any information that can help to better understand and compare the characteristics and expected quality of the accommodation experience can be critical (Arzaghi et al., 2023). For marketing purposes, classification systems are particularly useful in promoting the most varied types of tourist accommodation. However, establishing a classification system for tourist accommodation is a complex task due to the diversity of types of accommodation and the cultural, environmental, and economic contexts in which the systems are applied (UNWTO, 2015). Despite hotels’ efforts to provide consumers with reliable, comparable, and relevant information, the industry continues to struggle with the problem of asymmetric information. This is mainly because hotel classification systems may differ between countries and regions (Cser and Ohuchi, 2008; Rhee and Yang, 2015) and because they may be insufficient to convey the quality of service and experience that hotels offer, which is more subjective and does not always correspond to the expected level.
With the increasing prevalence of the Internet, guest reviews, ratings, and scores have become increasingly accessible to travellers. Electronic word of mouth (eWOM) has been reducing the asymmetry of information, and the scores provided by travel sites, such as Booking.com, Expedia, TripAdvisor or Google Travel, among others, have contributed to reducing the subjectivity of service quality and clarifying its relationship with the characteristics of hotels (Li et al., 2017). As travel-related online searches rise, official hotel classifications and guest reviews play complementary roles. Traditionally, official classification systems focus on facilities and level of service, while guest reviews and ratings are based on expectations and quality of experience. Therefore, in this paper, the term “classification system” refers to formal hotel categorisation schemes (e.g., star-classifications assigned by official bodies based on technical criteria), whereas “rating system” is used to describe consumer-based evaluations such as review scores on online travel platforms.
Hensens (2015) anticipated the shift towards more dynamic, consumer-focused classification models, foreshadowing the integration of digital guest feedback with traditional classification systems. While formal classification systems already ensure that required facilities are present and meet predefined standards, guest reviews provide insight into how those facilities are perceived and experienced by customers – offering an additional, experiential layer of quality assessment. Therefore, the growing reliance on online travel-related content has reshaped how quality is perceived and communicated.
Before making an online hotel reservation, consumers visit on average almost 14 different travel-related websites, with about three visits per website, and perform nine travel-related searches on search engines (UNWTO, 2014). Official hotel classifications are often used by consumers as a filtering mechanism in the booking process, with guest reviews used to make a final selection from a narrower group of hotels. More recently, there has been interest in integrating classification processes into the digital and social era, with regions considering the use of online guest reviews in traditional hotel classification methods. According to the UNWTO (2014), there is a consensus among suppliers and consumers about the advantages of integrating guest reviews into hotel classification systems, provided that an appropriate methodology is developed to do so.
Despite the widespread use of both official hotel star-classifications and online guest reviews, few academic studies have explored how these systems can be meaningfully integrated into a unified classification framework. Most research either compares the two systems (Martin-Fuentes, 2016; Martin-Fuentes et al., 2018) or analyses the sentiment behind reviews (Krey et al., 2024) without proposing operational models for combining institutional and consumer perspectives. This study addresses that gap by developing a hybrid classification model that uses guest review data to complement and enrich formal hotel classification through machine learning techniques. It also examines the feasibility of incorporating guest reviews into large-scale hotel classification systems, specifically at the star-classification level. Thus, the proposed integrated approach aims to add a further consumer-based quality dimension to hotel classification, thereby refining the classification process by complementing the existing expert-led criteria with experiential guest perspectives.
The research was conducted using a database of Portuguese hotels and their corresponding ratings on Booking.com, based on the satisfaction items rated by users during the review process after checking out. The satisfaction items are Overall (the global accommodation experience and overall satisfaction level), Value for money, Cleanliness, Location, Facilities, Comfort, Staff, and Wi-Fi. It is important to note that these satisfaction items are based on guests’ subjective evaluations of their experience, as provided on the Booking.com platform. As such, they reflect perceived quality rather than a formal audit of infrastructure. Consequently, this study does not aim to reproduce the full technical criteria used in official star-classification systems (e.g., availability of lifts, pools, or specific surface areas), but rather to model how guests interpret and evaluate their stay.
From the guest’s perspective, perceived quality is shaped by several specific factors that contribute to their overall satisfaction. Value for money refers to balancing perceived costs (primarily monetary) with perceived benefits. Customers seek accommodations offering the highest value at the lowest possible price (Gupta and Kim, 2009), irrespective of the accommodation type. Some authors found value for money to be the most critical factor in hotel selection after price (Zaman et al., 2016). Cleanliness applies to rooms and other hotel areas, such as restrooms, entrances, parking areas, lobbies, and dining places. It is associated with safety and low health hazards and is one of the primary causes of dissatisfaction during a stay (Lockyer, 2002). Cleanliness and safety have become even more critical in the aftermath of the COVID-19 pandemic, with recent studies confirming that guests increasingly prioritise hygiene protocols and perceived health security when evaluating hotel quality (Pennington-Gray and Lee, 2024; Tiwari and Mishra, 2023; Tiwari and Omar, 2023). Hotel comfort is usually related to sleep quality, which includes factors like a cozy bed, noise level, adequate room temperature, lighting, and scent (Zaman et al., 2016). Location is related to the proximity to points of interest, transportation convenience, and the surrounding environment. Location may be essential for tourists who want easy access to the sites they plan to visit and the events they plan to attend (Masiero et al., 2019; Yang et al., 2015). Hotel facilities, also frequently referred to as amenities, are typically supplementary services (e.g., in-room coffee maker or kettle, safe, luggage storage, recreational equipment storage, complimentary parking, etc.). These facilities may be included in the accommodation price or require an additional fee, depending on the hotel and its standard.
To stay competitive, hotels strive to provide an increasing number of facilities (Chu and Choi, 2000). Hotel staff are often a crucial factor in customer satisfaction with hotel services (Chu and Choi, 2000; Kim et al., 2020, 2022). Finally, free and reliable Wi-Fi in hotels is now considered an essential part of modern hospitality and is often viewed by guests as the most important technology that should always be available. While most hotels offer complimentary access, issues such as limited coverage or slow connection speeds are common sources of dissatisfaction (Cain et al., 2024).
By leveraging machine learning techniques, this work proposes a hybrid classification model that analyses all items rated by users during the review process on Booking.com and assigns a refined star-classification that accurately reflects the quality and services provided by each hotel. This methodology ensures a comprehensive and objective assessment of the hotels, enhancing the reliability and usefulness of the complementary classifications.
By applying supervised and unsupervised learning techniques to Booking.com review data and comparing outcomes with official star-classifications, this study advances the conceptual understanding of classification as a multidimensional construct – both facility and experience-based (Koutoulas and Vagena, 2023). This aligns with current academic interest in reconciling objective and subjective indicators of quality in service contexts (Nilashi et al., 2022).
Literature review
Hotel classification systems
Classification systems categorise hotels and services by assigning them distinct grading levels. These systems provide comparative information about hotel facilities, such as view, room quality, room service, food, spa, and fitness services, and more recently, on the surrounding area’s public services and facilities (Arzaghi et al., 2023). Classifications are attributed based primarily on the types of facilities and services offered, rather than on the subjective quality of service delivery. In systems such as the Automobile Association (AA) in the UK, classifications focus on the presence, scope, and consistency of services rather than their experiential quality. Some hospitality brands also have different classifications for their properties across geographies, sometimes under a different brand name, to target specific customer segments (Claver et al., 2006). This is because classifications indicate not only the facilities provided by the hotel but also the price levels (Nilashi et al., 2022). Potential guests with different needs consider various criteria to make stay-related decisions, and the classification systems may serve as a credible and trustworthy signal of the hotel’s services to make that decision easier (Masiero et al., 2015).
Classification systems can also be divided into those that only evaluate objective criteria and those that evaluate both objective and subjective criteria, and they can be either statutory (or official) or voluntary (UNWTO, 2015). Most statutory or official systems are government or state-owned classification systems and focus mainly on physical attributes and services, relying more on quantitative and technical aspects than on service quality. However, the combination of private and public systems is oriented more towards guests, their needs, and expectations (Minazzi, 2010). Hence, public authorities must be more guest-oriented and interested in regulating properties to increase international competitiveness (Khan et al., 2022). These distinctions are reflected in the diversity of classification approaches adopted around the world. Although there is no internationally centralised hotel classification system, several prominent national and regional systems have emerged. These are applied across different parts of the world and use symbols such as diamonds, stars, crowns, suns, coffee pots, letters, and even feathers to categorize hotels (Vallen and Vallen, 2017).
In 1900, Michelin Tyres introduced pictorial symbols to point out the facilities of French establishments (Khan et al., 2022), giving rise to what is now one of the most famous travel guides, the Michelin Guide. In 1912, the AA launched the hotel star-classification in the UK, and today, it is the most used grading system in the country, rating and awarding stars to hotels based on quality, facilities, and services (Blomberg-Nygard and Anderson, 2016). AA has worked closely with VisitBritain, VisitEngland, VisitScotland and Wales Tourist Board to implement Common Quality Standards for hotel inspections, ensuring consistent ratings across the UK (AA Hotel and Hospitality Services, 2024). In addition to standard “Black Star” classifications, the AA awards Silver Stars (for hotels exceeding quality expectations) and distinguished Red Stars, which recognise properties that deliver exceptional hospitality and service levels across all star categories, thereby providing an additional layer of recognition above the traditional star classification (AA Hotel and Hospitality Services, 2024).
In 1958, the oil and gas company Mobil, through its magazine Mobil’s Travel Guide (known today as Forbes Travel Guide), rated hotels using a 1-to-5-star system (Arzaghi et al., 2023). In 1962, the International Union of Official Travel Organizations (IUOTO, the predecessor of the UNWTO) developed a consensus on using five categories of hotel classification (Vine, 1981). In 1976, the American Automobile Association (AAA) started rating hotels and restaurants using the Diamond Grading system, which is nowadays considered the most extensive classification system, as it grades more lodging properties than any other system in the world based on facilities and services offered (Nalley et al., 2019).
More recently, in 2009, a joint initiative led to the founding of the Hotelstars Union, under the patronage of HOTREC Hospitality Europe (The Confederation of National Associations of Hotels, Restaurants, Cafés and Similar Establishments in the European Union and European Economic Area). This platform aimed to harmonise European hotel classification based on a standard criteria catalogue. Although not all HOTREC members joined the initiative, more than 22,000 hotels are classified within the Hotelstars Union. This system has 247 harmonised criteria (mandatory plus optional criteria), uses a 1-to-5-star grading system and requires the criteria to be revised every 5 to 6 years. It aims to improve transparency for guests and hoteliers, as well as quality control and fair competition (https://www.hotelstars.eu/).
Within the 1-to-5-star grading system, variants have also emerged. The European Hotelstars Union has a higher “Superior” mark to account for some extra features in each star category. Another example is the Australian classification system, which has half-star increments for its hotels, making it possible to find 1.5-star hotels (Arzaghi et al., 2023). According to Vallen and Vallen (2017), some other differences and similarities can be pointed out. In Sweden, Germany, Switzerland and France, the “Hotel Garni” means no restaurant but includes continental breakfast. Besides the 1-to-5 classifications in Switzerland, a luxury class, “Gran Tourism” or “Gran Especial,” has been added. The same happens in Italy, India, and Spain with an extra classification of 5-star “Deluxe” (UNWTO, 2015). The Irish Tourist Board takes a different approach, listing the facilities available (e.g., elevator, air conditioning, laundry) rather than grading them. Directories of the European Community follow yet another approach and classify by location: seaside/countryside, small town/large city. European auto clubs go further by distinguishing privately owned from government-run accommodations (Vallen and Vallen, 2017). In this context, Spain has standardised the rating system of its Paradores, a government-operated chain of charming hotels in historic buildings. Portugal also has its Pousadas, which can be compared to the Spanish Paradores. In Japan, the traditional inns, the Ryokans, are rated according to their rooms, baths and gardens.
Although the 1-to-5-star classification is the most common and most widely recognised system (Tiwari and Omar, 2023), it is still necessary to work on a universal, more credible and more customer-oriented system so that international travellers can have a more accurate picture of what hotels are offering (Núñez-Serrano et al., 2014). Classification systems serve hotels, hotel guests, and the travel trade, such as tour operators and travel agencies (Narangajavana and Hu, 2008; Nunkoo et al., 2020). In some cases, online travel agencies (OTA) show the official star-classifications side by side with their guest rating scores of the hotels displayed on their online platforms (Koutoulas and Vagena, 2023). The main limitation in using star-classifications for comparing hotels is the fragmentation of hotel classification systems, as each country, and sometimes each region, uses its own system with a distinct set of criteria, thus creating confusion among hotel guests about what level of quality and comfort to expect (Núñez-Serrano et al., 2014).
In addition to being based primarily on facilities and services, traditional classification systems also face criticism for other limitations. These include the reliance on scheduled inspections, which may not reflect the hotel’s continuous performance. Moreover, consumers are frequently unaware of the criteria underlying star-classification, leading to misunderstandings or mistrust (UNWTO, 2015).
More recently, new classification and rating initiatives have emerged. In 2024, Michelin introduced the “Michelin Keys” for hotels, aiming to recognise outstanding establishments worldwide based on consistent excellence and guest experience (Guide, 2024). At the same time, several hotels – particularly in the Middle East – have adopted unofficial “6-star” or “7-star” labels as part of branding strategies, despite the absence of formal global standards. Among the most notable examples are the Burj Al Arab and the Jumeirah Marsa Al Arab, both in Dubai, which are often marketed as “7-star hotels” (Forbes, 2023; Jumeirah Group, 2024). These differences among classification systems reflect the respective countries’ cultural, economic, or national traditions (Maravić, 2017).
Online guest reviews and integrated approaches
With the continued growth of social media and online reservation platforms which allow and encourage guest feedback, the playing field for hotel classification is changing rapidly. The information on hotels’ characteristics and attributes and the customers’ experiences, reviews, and scores have become increasingly available directly to travellers (Arzaghi et al., 2023). A recent systematic review by Pestana et al. (2024) mapped the growing body of literature on online hotel reviews, emphasizing their rising influence on service quality assessment and classification methods. Consumers are giving more importance to ratings given by other consumers, and less importance to official classifications. Recent studies have demonstrated that sentiment analysis applied to guest reviews can effectively forecast hotel performance, providing valuable predictive insights complementary to traditional star-classifications (Krey et al., 2024). Therefore, eWOM can significantly impact the reputation of a hotel and booking rates. Positive reviews attract potential guests, while negative feedback deters them from booking (Hensens, 2015). Most of the online reviews focus on service quality. At the same time, conventional classification systems tend to focus primarily on objective, tangible criteria such as the availability and size of facilities and services, occasionally on subjective tangible criteria such as cleanliness and state of maintenance, and rarely on service quality (Hensens et al., 2010).
The customers’ view of hotel quality is largely subjective and depends on their perceptions of its characteristics, facilities, services, location, and even the price. For instance, Kim et al. (2022) found systematic differences in online reviews between distinct traveller segments, highlighting the importance of incorporating varied consumer perspectives into classification frameworks. eWOM is a staple feature of online customer-to-customer communication, reducing information asymmetry of lesser-known hotels more than higher-quality hotels (Yang et al., 2018). Specialised sites, such as Tripadvisor.com, and customer reviews and scores provided by travel sites, like Booking.com and Expedia.com, have significantly contributed to resolving the quality information problem in the travel and hotel industries (Li et al., 2017) while also providing review scores that simplify comparisons.
Nowadays, many online platforms generate a substantial number of reviews and user-generated content, including hotel reviews and ratings. This amount of new data may play a crucial role in decision-making by providing additional information and ultimately influencing the traveller. Besides ratings and textual comments, most review platforms allow users to upload photos, offering visual evidence that enhances the credibility and richness of guest feedback. Recent studies show that the consistency between visual and verbal content can significantly impact consumer perception and hotel ratings (Liu et al., 2024). Moreover, hotel managers can publicly respond to reviews, a practice that has been shown to positively influence booking behaviour when responses are timely and customer-focused (Krey et al., 2024; Lopes et al., 2024). These systems can offer an independent and trusted reference on the standard and quality of hotel service and facilities, thereby facilitating consumers in choosing their accommodation. They also provide a framework for accommodation providers to market, position themselves appropriately, and leverage their investments in the quality of their product-service offers (UNWTO, 2014).
However, one of the preconditions for this is that the information shared is accurate, which is not the case when customers give biased or superficial reviews or base them on inadequate observation (Hensens, 2015). The presence of fake or manipulated reviews further undermines the credibility of user-generated content. As Tuomi (2021) points out, the emergence of deepfake consumer reviews in tourism makes it increasingly difficult for other users to assess the authenticity and reliability of the feedback they read.
Additionally, the large number of reviews makes it time-consuming for customers to read and draw conclusions. Since these issues can make it difficult for customers to make decisions confidently, authenticating such reviews and scores may constitute a basis for future demand, as the experience of past customers is a key criterion for choosing a hotel (Arzaghi et al., 2023). Travellers can become overwhelmed by the sheer volume of reviews and struggle to extract relevant and valuable information for their selection process. This issue of information overload can make decision-making more difficult and time-consuming for potential guests. As a result, there is a need for the compilation and summarisation of this data to aid travellers in overcoming the discrepancies between the star-classification system and guest satisfaction. Therefore, conventional classification systems and online travel platforms, such as Booking.com or TripAdvisor, may complement each other through integrated classification models. Several countries are moving towards integrated models, which can be grouped into two types: full integration and comparative performance.
Full integration implies that the hotel can adjust its star level up or down, depending on its perceived quality, as measured by guest reviews, compared to other hotels. In a comparative performance model, the aggregated guest review rating is displayed separately from the hotel classification. However, integrating consumer reviews into hotel classification is not new; some travel sites have been doing so for the past few years, such as Hotwire.com and Priceline.com, which primarily operate in the United States. These sites sell rooms not in specific hotels but in classes of hotels in general areas, such as a 4-star hotel in Times Square, New York City, for example. The accuracy of the star information is, therefore, critical to the success of these sites. Consumers may not revisit the travel site if they purchase a 4-star hotel but feel it is a 3-star hotel due to the quality of service or facilities. Norway and Switzerland have established models for integrating guest reviews into hotel classification, and regions such as the United Arab Emirates, Germany, and Australia are also developing integrated platforms. The model in Norway, developed by QualityMark Norway and yet to be implemented due to resistance from major hotel chains, is an example of full-scale integration. On the other hand, the system currently being used in Switzerland, which uses Hotelstars Union criteria for its official classification, involves instead a parallel presentation of aggregated guest review information alongside traditional hotel classifications (UNWTO, 2014; UNWTO, 2015).
In this context, the integration of guest reviews from online platforms into traditional hotel classification systems focuses on the feasibility of incorporating such reviews at the star-classification level. On the one hand, this approach respects the traditional characteristics and classification models of each country or region (in this case, star-classification) and, on the other, it integrates a classification obtained through machine learning techniques based on Booking.com reviews. These recent contributions underscore the need for a model that unites institutional classifications with user perception data – an integration still underexplored in empirical research, and which this study aims to address. Therefore, this model is considered innovative in that it can be adapted to the specific contexts of any region or country.
Methods
Data
For this study, a database combining hotel star-classifications from the Portuguese National Tourism Board (Turismo de Portugal) with online guest review ratings from Booking.com was compiled. All data were collected in October 2021, with star-classifications sourced from the official database and review scores gathered manually from Booking.com. To build our sample, we began by consulting the list of 1,426 hotels registered with Turismo de Portugal at the time. This registry gathers all information regarding official registration number, star classification, number of rooms and beds, and other available facilities. This search revealed that 226 hotels did not have all the information available and were therefore excluded due to their inconsistency. Thus, our final sample consisted of 1,200 hotels.
Booking.com guest review scores are collected after a guest has checked out of a property booked through the platform. The platform emails the guest a questionnaire containing one mandatory question on the overall score of the property, six optional questions on cleanliness, comfort, value, facilities, location and staff, and a few further optional ratings on breakfast and Wi-Fi. Guests are invited to rate the property by attributing scores from 1 to 10. The platform also encourages guests to provide feedback through an open question, which is likewise optional. After that, the average values are recalculated and, together with the characteristics, prices and photographs of the hotel, the ratings for each of the 8 items previously mentioned are presented (Overall, Value for money, Cleanliness, Location, Facilities, Comfort, Staff, Wi-Fi). The score metric only considers reviews from the previous 36 months and is continuously updated.
Software and libraries
The RStudio program 2023.0301 with R-4.3.0 was used to analyse the data. To fulfil the objectives of this work several R libraries were used, including: MASS for support functions; dplyr for data manipulation; e1071 for support vector machines training; psych and gtsummary for summary statistics; caret for classification and regression training; cluster for cluster analysis; randomForest for Random Forests classification and regression; factoextra to extract and visualise the results of multivariate data.
Support vector machine
The Support Vector Machine (SVM) is a popular machine learning algorithm (Bishop, 2006; Cervantes et al., 2020) originally derived for binary classification problems. In its simplest form, known as hard margin SVM, the model seeks the optimal (linear) decision boundary, that is, the separating hyperplane that maximises the margin between the two classes.
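For reference, the hard margin problem is commonly stated as the optimisation below. The notation is an assumption of this sketch rather than the paper's own: training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $w$, and intercept $b$.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n .
```

Under this formulation the decision boundary is the hyperplane $w^{\top} x + b = 0$ and the margin between the two classes has width $2 / \lVert w \rVert$, so minimising $\lVert w \rVert$ maximises the margin.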
Gradient Boosting Machines
Gradient Boosting Machines (GBM) are a class of ensemble methods whose rationale is that combining several weak models (possibly only slightly better than random guessing) can produce a single stronger model (Bishop, 2006; Mienye and Sun, 2022; Natekin and Knoll, 2013). GBM usually takes low-depth decision trees as weak models and trains them in sequence, so that each new weak model is trained to correct the errors of the previous ones. Formally, GBM is an additive model: its prediction is the sum of the outputs of the sequentially fitted weak models.
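A common textbook way of writing this additive structure is sketched below. The symbols are assumptions of this sketch, not notation from the paper: $F_0$ is an initial constant prediction, $h_m$ the $m$-th weak model, $\nu$ a constant learning rate, and $L$ a differentiable loss.

```latex
F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x),
\qquad
h_m \approx \arg\min_{h} \sum_{i=1}^{n} \bigl( r_{im} - h(x_i) \bigr)^{2},
\qquad
r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}} .
```

Each weak model is fitted to the pseudo-residuals $r_{im}$ of the current ensemble, which is the sense in which it corrects the errors of its predecessors. For squared-error loss the pseudo-residuals reduce to the ordinary residuals $y_i - F_{m-1}(x_i)$.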
K-means clustering
Clustering algorithms leverage the underlying structure of a data distribution by partitioning the dataset into clusters based on specified criteria without prior knowledge of the dataset. Each cluster contains similar data instances, distinct from those in other clusters, with dissimilarity measured according to the algorithm’s objective and the data characteristics. Clustering is crucial in many data-driven applications and is extensively studied in fields like optimization, bioinformatics, computational geometry, statistics, pattern recognition, and image processing (Bishop, 2006; Ikotun et al., 2023). In this work, k-means clustering is used to discover cluster structures within Portuguese hotels based on their Booking.com online scores.
k-means is a popular partitioning clustering algorithm based on the distances between data points and cluster centroids. The algorithm starts by initializing k centroids (representatives of the k clusters), either randomly or through advanced techniques like density-based initialization. Each data point is then assigned to the nearest centroid, and the centroids are recalculated. This process is repeated until the centroids stabilize (convergence), reaching a local minimum of the objective function (Bishop, 2006; Ikotun et al., 2023; Steinley and Brusco, 2007).
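The assign-and-update loop described above can be sketched in a few lines. The study itself was run in R, so the dependency-free Python below is only an illustration; the function names, the `init` parameter, and the six score pairs are hypothetical and are not data or code from the study.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k distinct random data points.
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical (cleanliness, staff) scores for six hotels; not data from the study.
scores = [(8.0, 8.1), (8.2, 7.9), (8.1, 8.0), (9.4, 9.5), (9.5, 9.3), (9.6, 9.4)]
centroids, labels, sse = kmeans(scores, k=2, init=[scores[0], scores[3]])
print(labels, round(sse, 3))
```

Because the procedure only reaches a local minimum, different starting centroids can end in different partitions. A common safeguard is to run several random starts and keep the solution with the lowest SSE.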
Choosing the optimal number of clusters k is a fundamental problem for k-means. Incorporating domain knowledge about the data can provide valuable insights into a reasonable range for k. In this work, for example, a sensible k would be 5, corresponding to the number of hotel star categories. Other strategies for estimating k include the elbow method and the Gap statistic. In the former, the total within-cluster sum of squared errors (SSE), which measures how tightly the data points in a cluster are grouped around the cluster centroid, is computed for several values of k, and the point where the rate of decrease in SSE sharply slows down (the “elbow point”) is chosen. This point represents a balance between the compactness of the clusters (low SSE) and the simplicity of the model (fewer clusters). The Gap statistic (Tibshirani et al., 2001) compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The goal is to identify the number of clusters that significantly improves clustering performance over random noise, which corresponds to a higher Gap value.
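The two criteria can be written compactly as follows. The notation is an assumption of this sketch: $C_j$ is the $j$-th cluster, $\mu_j$ its centroid, and $E^{*}$ the expectation under the null reference distribution, usually estimated by clustering simulated reference datasets.

```latex
\mathrm{SSE}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mathrm{Gap}(k) = E^{*}\bigl[ \log \mathrm{SSE}(k) \bigr] - \log \mathrm{SSE}(k) .
```

Tibshirani et al. (2001) suggest selecting the smallest $k$ for which $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is a standard-error term obtained from the simulated reference datasets.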
Random forests and variable importance
Random forest is another class of ensemble models that builds many trees and combines their predictions into a single one (Bishop, 2006; Mienye and Sun, 2022). Unlike boosting models, random forests draw many bootstrap samples (sampling with replacement) from the original set and train a full tree on each sample, so training can be performed in parallel. Full trees have higher variance but lower bias; combining their predictions reduces the variance and yields a more robust model. In the construction of each tree, only a random subset of the available input features is allowed to compete at each node, which fosters variability among trees (reducing the effect of stronger variables that would otherwise consistently win the first nodes of every tree). Random forests also allow us to track and measure the importance of each feature in the construction of the model. Two approaches to measure importance are usually provided: the mean decrease in accuracy (MDA) and the mean decrease in impurity (MDI) as measured by the Gini index. MDA is generally preferred as it directly measures how much permuting a variable reduces prediction accuracy, reflecting its true contribution to cluster discrimination, while MDI can be biased toward variables with more categories or continuous scales (Louppe et al., 2013; Sikdar et al., 2025).
In this work, random forests are used to measure feature importance in predicting cluster membership for the clustering solutions obtained with k-means. This makes it possible to identify which Booking.com scores are most important in defining the clusters and, therefore, in characterising the corresponding hotels.
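The permutation idea behind MDA can be illustrated independently of the forest itself. The sketch below is a simplification under explicit assumptions: a nearest-centroid classifier stands in for the random forest, accuracy is measured on the training data rather than on out-of-bag samples, and the function names and toy data are hypothetical. It only shows how shuffling one feature at a time and recording the drop in accuracy yields an importance score.

```python
import random


def fit_nearest_centroid(X, y):
    """Stand-in classifier (NOT a random forest): one mean vector per class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(model, row):
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, model[label])),
    )


def accuracy(model, X, y):
    return sum(predict(model, row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _repeat in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [
                list(row[:j]) + [value] + list(row[j + 1:])
                for row, value in zip(X, column)
            ]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (assumed): feature 0 separates the two clusters, feature 1 does not.
X = [[9.0, 7.0], [9.2, 7.1], [8.8, 6.9], [6.0, 7.0], [6.2, 7.1], [5.8, 6.9]]
y = ["high", "high", "high", "low", "low", "low"]
model = fit_nearest_centroid(X, y)
print(permutation_importance(model, X, y))
```

In this toy setting, shuffling the uninformative feature leaves accuracy unchanged, whereas shuffling the informative one lowers it, which is the behaviour MDA is designed to capture.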
Results
Description of the hotel sample
Median scores (interquartile range) for Booking.com’s 8 review items, by hotel star category. p-value obtained for the Kruskal-Wallis test for significant differences between categories.
Based on the results, higher star-classifications generally correspond to higher review ratings. However, this trend is not uniform across all review items, as Figure 1(a) illustrates. Specifically, the relationship between star-classification and review score is less straightforward for location and value for money. In terms of location, 1-star hotels receive higher ratings than 2- and 3-star hotels. As mentioned by Masiero et al. (2019), location is related to the proximity to points of interest, transportation convenience, and the surrounding environment. In this context, 1-star hotels are generally associated with smaller hotels, sometimes located in pre-existing buildings in historic centres and, therefore, close to transport infrastructures such as metro and train stations or bus stops. Also, in value for money, 1-star hotels have the same rating as 2-star hotels and a higher rating than 3-star hotels, for example, while no significant differences are found (p = 0.13) between the five categories. As Gupta and Kim (2009) note, value for money refers to seeking accommodations offering the highest value at the lowest possible price. In this context, 1-star hotels – typically positioned in the budget segment – tend to have lower prices than the rest and, by focusing on the quality of service regardless of the existing level of facilities, they may find a strategic advantage here over hotels with higher star-classifications. Within this 5-star classification system, it is also interesting to note that, regardless of the star category, Cleanliness, Location and Staff are rated with higher values than the other items. This is clear from the normalized heatmap in Figure 1(b).
(a) Heatmap of Booking.com’s median values for the 5-star hotel ranking. (b) Normalized heatmap (by column) of Booking.com’s median values for the 5-star hotel ranking.
Predicting star-classifications based on guest review scores
We applied two predictive models to assess whether Booking.com’s guest scores constitute a good set of predictors of the official hotel star-classification and, therefore, whether customer perceptions of the quality of a stay align with the parameters that accredit a given star-classification.
Accuracy of the SVM model for hotel star prediction based on guest reviews using linear and RBF kernel functions and different values of C.
Confusion matrix of the registered hotel star categories and the predicted categories using GBM.
Performance measures of the GBM model on the use of user-generated content from Booking.com in the prediction of hotel categories. PPV – Positive predictive value; NPV – Negative predictive value; F1-measure = 2 * Precision * Recall / (Precision + Recall).
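For reference, the per-class measures named in this caption can be derived from a multi-class confusion matrix in a one-versus-rest fashion. The sketch below assumes that rows hold the registered categories and columns the predicted ones; the function name and the 3-class counts are hypothetical and are not the results of this study.

```python
def per_class_metrics(cm, labels):
    """One-vs-rest PPV, recall, NPV and F1 for each class.

    cm[i][j] = number of hotels registered as class i and predicted as class j
    (this orientation is an assumption of the example).
    """
    total = sum(sum(row) for row in cm)
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # class i predicted as another class
        fp = sum(row[i] for row in cm) - tp   # other classes predicted as class i
        tn = total - tp - fn - fp
        ppv = tp / (tp + fp) if tp + fp else 0.0      # precision
        recall = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
        npv = tn / (tn + fn) if tn + fn else 0.0
        f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
        metrics[label] = {"ppv": ppv, "recall": recall, "npv": npv, "f1": f1}
    return metrics


# Hypothetical counts for three star categories (not the paper's data).
labels = ["3-star", "4-star", "5-star"]
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 8],
]
for label, values in per_class_metrics(cm, labels).items():
    print(label, {name: round(v, 3) for name, v in values.items()})

overall_accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(round(overall_accuracy, 3))  # 0.733
```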
Hotel segmentation based on Booking.com average review scores
The previous results have shown that the perception of the quality of a stay by Booking.com customers does not reflect the hotel segmentation that the star categorization currently provides. Therefore, we used k-means clustering to investigate the existence of a different segmentation structure that could better reflect such perceptions. We started by estimating the optimal number of clusters for this data using both the elbow method and the Gap statistic. As Figure 2 shows, both approaches point to a hotel segmentation solution with 3 to 4 clusters as the best option. Interestingly, there seems to be no advantage in using 5 clusters, as the star-classification system would suggest.
(a) Graphical representation of the elbow method, where a noticeable inflection of the curve is observed at 3 to 4 clusters. The addition of more clusters does not generate a significant reduction in the total within-cluster sum of squares; (b) Graphical representation of the Gap statistic analysis, also showing an optimal number of clusters of 3 or 4.
Mean scores of Booking.com reviews obtained for each cluster in the 3- and 4-cluster solutions. The Between / Within SS ratio is the ratio of the between-cluster sum of squares to the within-cluster sum of squares, providing a measure of cluster separation relative to cluster compactness.
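The Between / Within SS ratio mentioned in this caption can be computed directly from the data points and their cluster labels. The following sketch is only illustrative: the function name and the one-dimensional toy data are assumptions, and returning infinity when the within-cluster sum of squares is zero is a choice made for the example.

```python
import math


def between_within_ratio(points, labels):
    """Between-cluster SS divided by within-cluster SS for a given partition."""
    n = len(points)
    grand_mean = [sum(col) / n for col in zip(*points)]
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        # Between SS: cluster size times squared distance centroid -> grand mean.
        between += len(members) * sum(
            (c - g) ** 2 for c, g in zip(centroid, grand_mean)
        )
        # Within SS: squared distances of members to their own centroid.
        within += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid)) for point in members
        )
    return between / within if within > 0 else math.inf


# One-dimensional toy example (assumed): two clusters centred at 2 and 8.
points = [(1.0,), (3.0,), (7.0,), (9.0,)]
labels = [0, 0, 1, 1]
print(between_within_ratio(points, labels))  # 9.0
```

Here the between-cluster sum of squares is 36 and the within-cluster sum of squares is 4, which together add up to the total sum of squares of 40 around the grand mean of 5.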
Figure 3 plots each of the cluster solutions along with the star category of each hotel in the principal components space. Although, in general, hotels in higher star categories tend to have higher review scores, each cluster is composed of hotels from different star categories, showing that this classification is not completely associated with the customer perception of quality. Taking the 4-cluster solution as an example, we see that cluster 4 (the cluster with the lowest review scores) contains hotels from 1 to 4 stars, while cluster 1, which is essentially populated with 5- and 4-star hotels, also contains 1-, 2- and 3-star hotels. A similar behaviour is observed in the solution with 3 clusters.
(a) Graphical representation of the cluster solution with 3 clusters; (b) Graphical representation of the cluster solution with 4 clusters. Each data point (hotel) is represented by a specific symbol associated with the star category. Clusters are represented with different colors, and a bivariate Gaussian 95% confidence ellipse is plotted to approximate the shape and spread of each cluster in the principal components space.
Considering the segmentation structures obtained from k-means, we further investigated the contribution of each Booking.com item to their characterization. To that end, a random forest model was built for each of the cluster solutions using the cluster assignments as the target variable, and variable importance was measured. Figure 4 presents the variable importance plots for both cases.
(a) Variable importance plots for the 3-cluster solution; (b) Variable importance plots for the 4-cluster solution.
For the 3-cluster solution, the Overall, Cleanliness, Facilities and Comfort scores present the highest mean decrease in the Gini impurity index (191.70, 130.54, 123.33, and 112.67, respectively), indicating that these are the items that contribute the most to the purity of nodes in the random forest and are therefore strongly present in its trees. Additionally, Wi-Fi is the score that provides the highest mean decrease in accuracy (57.21) in the model, indicating its importance in predicting cluster membership (Figure 4(a)).
The results are similar in the 4-cluster solution, with the same variables ranked in the same order of importance. Thus, the Overall, Cleanliness, Facilities and Comfort scores present a mean decrease in the Gini impurity index of 202.11, 157.68, 149.73, and 117.33, respectively, and Wi-Fi is again identified as the score with the highest mean decrease in accuracy (62.15), as shown in Figure 4(b).
Notably, Value for money, Location and Staff consistently rank at the bottom in importance in all situations, thus not contributing significantly to the reduction of impurity, cluster distinction or cluster membership prediction.
Discussion
This study analysed the agreement between hotel star categories in Portugal and the corresponding mean review scores of Booking.com on 8 items: Overall satisfaction, Cleanliness, Comfort, Facilities, Staff, Value for money, Location and Wi-Fi. The results, regardless of the machine learning method employed (SVM or ensemble GBM), indicate a weak association between the hotel star category and the mean review scores. In fact, the maximum accuracy obtained with an SVM was 59%, whereas for GBM it was 53%. Other authors (Soifer et al., 2020) have also reported such a discrepancy between hotel attributes and facilities and online user ratings, which has been found to be particularly relevant in 5-star hotels in Lisbon (Rita et al., 2022) and reflects the disconformity between consumer expectations and consumer experience (Li et al., 2020).
One of the reasons for the discrepancy between the hotel star-classification and the guest review scores might be that the hotel star-classification is reviewed only once every 5 years, whereas the Booking.com scores retain the reviews of the previous 36-month period and are updated whenever new reviews are added. Another reason is that star-classification systems rely more on facilities and level of service, while guest reviews are based on expectations and quality of experience (UNWTO, 2014); therefore, the star-classification level does not always match the appreciation expressed in guest reviews.
A previous study of 1,500 reviews of 50 small and medium hotels in the region of Lisbon found that guests pay more attention to room conditions (including cleanliness and comfort for resting) when writing a review and attributing a score (Chaves et al., 2012). This is in line with the results presented here for the 3- and 4-cluster solutions, in which Cleanliness is the second most important variable for the categorization of hotel review scores.
As some other countries are working to implement integrated systems (UNWTO, 2014; UNWTO, 2015), the 3- or 4-cluster solution derived from the Booking.com scores could be considered an option to complement the current hotel star system, combining the best of both worlds. On the one hand, the hotel star category would provide detailed information regarding the type of facilities and services guests should expect. On the other hand, the integration of a categorization system based on reviews provided by other guests would generate trustworthy, easy-to-interpret information, directed mainly at service quality.
According to the predicted cluster membership, each hotel could receive a quality rating of Bronze, Silver or Gold if the 3-cluster solution is adopted, or Bronze, Silver, Gold or Platinum in the case of the 4-cluster solution. This would be very helpful for the traveller searching for a hotel in the context of a multicriteria decision process. For instance, if the traveller is looking for a 3-star hotel, providing information on the experience of previous guests, such as a very good experience (3-star Gold) or a not-so-remarkable experience (3-star Bronze), clearly facilitates the process without preventing the traveller from choosing based on other features such as price or location.
In fact, online platforms such as TripAdvisor or Booking.com have employed a similar quality rating system that serves as a guide for a variety of alternative accommodations such as apartments or villas. This rating system includes information on both the facilities and the average review score as well as anonymized and aggregated historical data, corroborating the importance of adding summarized information on review scores to the hotel star-classification. Such a system is welcomed by the hotel management sector (Koutoulas and Vagena, 2023; Vagena and Manoussakis, 2021) and would overcome the limitation of the non-universality of the star-classification system. Additionally, it would add up-to-date and reliable information to the star-classification system.
This proposal also responds to current demands for integrating real-time, consumer-centred insights into regulatory systems in tourism, reflecting broader shifts toward digital trust and user empowerment.
Therefore, this research moves beyond technical modelling by proposing an applied framework that can bridge the current disconnection between institutional classification systems and real guest experience. Whereas professional inspectors assess hotels based on standardised criteria related to physical infrastructure and service provision, guest reviews offer insights into how those services are perceived and experienced. Integrating both perspectives enables a more comprehensive and accurate reflection of hotel quality in today’s digital environment. This conceptual innovation expands the theoretical understanding of how quality is communicated and perceived in digital hospitality environments. On a practical level, the model introduces a dual-layer classification which offers clearer signals for travellers, supports data-driven decision-making for platforms and policymakers, and encourages hotels to prioritise experiential quality, not just infrastructure compliance. For formal classification organizations, this model provides a path to modernise and enhance the credibility of national classification systems without undermining their traditional structure.
Conclusion
This study contributes to the ongoing debate about combining online guest reviews with traditional hotel classification systems. By leveraging data from Portuguese hotels and analysing guest satisfaction ratings on Booking.com, this paper highlights the misalignment between the existing star-classification system and the actual experiences of hotel guests. The weak association between hotel star-classifications and guest satisfaction scores emphasizes the need for a more dynamic and comprehensive approach to hotel classification. One of the most striking findings of this research is that hotel star-classifications, which primarily reflect facilities and services, do not consistently align with guest satisfaction scores related to factors such as cleanliness, comfort, and value for money, among others. This discrepancy indicates that the traditional star-classification system may not fully capture the subjective, experience-based aspects of hotel quality that modern travellers prioritize. As the analysis shows, while 5-star hotels generally score higher on most satisfaction items, lower-tier hotels, mainly 1- and 2-star establishments, can sometimes outperform higher-tier hotels on specific dimensions like location and value for money. This observation suggests that travellers are not solely motivated by the amenities a hotel provides but also by the quality of the service and the overall experience. Integrating guest reviews into the star-classification system offers a potential solution to this gap, providing a dual framework that complements the objectivity of traditional classification with the subjectivity of guest reviews. Moreover, the introduction of machine learning techniques, such as Support Vector Machines (SVM) and Gradient Boosting Machines (GBM), in this context demonstrates that predictive models can be used to assess the service quality offered by hotels more accurately.
Although the prediction models in this study showed moderate accuracy (around 53%–59%), they represent an important first step in developing a refined classification system that could better serve both travellers and hoteliers.
This paper proposes an innovative hybrid classification model incorporating guest reviews alongside traditional star-classification. The clustering analysis results suggest the potential for 3- or 4-cluster solutions to complement existing star-classification and point to the feasibility of categorizing hotels by facilities and the quality of service as perceived by guests. For example, a hotel currently classified as 3-star could receive additional “quality” distinctions such as Bronze, Silver, or Gold based on guest review scores. This would result in a dual classification – e.g., 3-Star Bronze – that clearly communicates both the technical compliance of the hotel (in terms of infrastructure and services) and the experiential quality as perceived by guests. Such a label would be simple for consumers to interpret and could serve as an intuitive decision-making aid, particularly when comparing hotels within the same star category.
Theoretical implications
From a theoretical standpoint, this study contributes to reconceptualising hotel classification as a dynamic system that combines experiential guest data with formal institutional criteria. It addresses a known limitation in the literature, which often treats these sources as separate or incompatible. By proposing a hybrid model that integrates both perspectives, this research presents a new conceptual framework for understanding hotel quality. The use of machine learning reinforces this contribution, demonstrating how data-driven techniques can support more nuanced and multidimensional classification systems in the hospitality industry. Furthermore, the model introduces a clustering-based distinction—Bronze, Silver, or Gold (and Platinum in a four-cluster solution)—that complements existing star ratings and reflects guest-perceived service quality in a structured and interpretable format.
Managerial implications
At a practical level, the proposed model enables consumers to differentiate more effectively between hotels within the same star category, facilitating informed booking decisions that are based not only on technical classification but also on perceived service quality. This dual system provides additional transparency and clarity in a highly competitive digital marketplace. For hotel managers, the framework offers an opportunity to monitor and enhance performance based on guest perceptions, thereby improving satisfaction and competitive positioning. Public rating agencies and online travel agencies (OTAs) may also benefit, as this model presents a scalable and adaptable solution to modernise traditional classification systems without compromising their institutional structure. By integrating guest feedback into official frameworks, rating bodies can enhance public trust and provide a more accurate, up-to-date reflection of hotel quality – particularly relevant in a post-pandemic context where travellers increasingly prioritize hygiene, comfort, and staff engagement.
Limitations and future research
While this study provides a solid foundation for combining guest reviews with hotel classification systems, several limitations suggest new paths for future research. First, the data collected for this analysis refer to October 2021, during the recovery phase following the peak of the COVID-19 pandemic. Traveller expectations and behaviours may still have been influenced by pandemic-related concerns – particularly regarding hygiene and safety – which could have shaped review patterns. Future research should consider how guest priorities evolve in fully post-pandemic contexts, and whether the discrepancies between star-classifications and guest reviews persist. In addition, future studies could benefit from incorporating data from other review platforms, such as TripAdvisor or Google Reviews, to validate results and explore platform-specific variations in user behaviour and satisfaction assessment.
Future studies could also benefit from incorporating larger datasets and using more advanced machine-learning techniques to improve the accuracy of predictive models. It is also crucial to explore how the combination of guest reviews with classification systems could be standardized across different regions and hotel types, ensuring the development of a universally applicable system.
Finally, a deeper examination of other factors influencing guest satisfaction, such as cultural differences, price sensitivity, and the role of loyalty programs, would further refine the proposed model and its effectiveness in capturing the full spectrum of hotel quality.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by national funds through FCT – Fundação para a Ciência e a Tecnologia, I.P., under the support UID/05105: REMIT – Investigação em Economia, Gestão e Tecnologias da Informação, and by CIDMA under the Portuguese Foundation for Science and Technology (FCT) Multi-Annual Financing Program for R&D Units.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
