Abstract
This study applies machine learning methods to Nielsen Homescan Panel data to understand the segmentation of U.S. vitamin consumers. Consumer segmentation is crucial for understanding purchasing behavior and optimizing strategies to maximize profits. We employ the RFM (Recency–Frequency–Monetary) framework with demographic attributes, and apply the K-means algorithm to classify households into distinct segments. Considering COVID-19 as an exogenous economic shock, we then compare segment composition and purchasing behavior before and during the pandemic and estimate segment-specific price elasticities. We provide targeted marketing recommendations tied to each segment, offering guidance for businesses to enhance consumer engagement in the post-pandemic dietary supplement market.
Introduction
Consumer segmentation plays a crucial role in enabling firms to target specific consumer segments and allocate resources efficiently in retail marketing and decision-making. Segmentation is commonly based on a combination of four types of criteria: geographic factors, which classify consumers by location; demographic attributes, such as age, gender, income level, education, and household size; behavioral factors, which capture purchasing patterns, brand loyalty, purchase frequency, and product usage; and psychographic factors, which focus on lifestyle choices, values, interests, and personality traits (Tynan & Drayton, 1987). The core idea behind market segmentation is that consumers grouped in the same segment exhibit similar characteristics, which translate into aligned buying behavior and demand patterns. Moreover, firms need a comprehensive understanding of how to identify the segments for potential new customers to successfully expand their market and strengthen their competitive position.
Machine learning (ML) techniques have become increasingly popular in customer segmentation. These techniques aid in analyzing purchase behavior, classifying customers, and designing effective marketing strategies. For example, Qadadeh and Abdallah (2018) applied multiple clustering algorithms to segment customers using an insurance company dataset, enabling businesses to identify customer characteristics and implement more effective marketing strategies. Similarly, Abdulhafedh (2021) combined K-means and hierarchical clustering to categorize customers based on their transaction data, helping credit card companies refine their strategic planning. Swenson et al. (2016) took a different approach by applying both unsupervised and supervised learning to segment the healthcare market. Using patient medical records and demographic data, their study aimed to improve efficiency in value-based healthcare services.
Several studies have focused on applying ML methods in the agriculture and food market. Harding and Lovenheim (2017) used the K-Median algorithm to group products based on nutritional levels and examined how nutrition taxes impact food consumption using Nielsen Homescan data. Varela (2013) explored consumer preferences in the orange juice market. The study applied multiple ML techniques to analyze how consumers responded to different flavors, providing valuable insights for product development. Bargoni et al. (2022) used K-means clustering to study strategies adopted by agri-food businesses in Italy during COVID-19. The key contribution of their study is a qualitative segmentation that provides firms with insights into emerging strategic approaches for maintaining competitive advantage.
A common approach in ML-based consumer segmentation is the use of RFM (Recency–Frequency–Monetary) attributes derived from transaction data (Cheng & Chen, 2009; Khajvand et al., 2011; Rahim et al., 2021; Rungruang et al., 2024). Sarvari et al. (2016) found that integrating RFM attributes with demographic information leads to more accurate customer segmentation. Because the Nielsen Homescan data includes both consumer purchase histories and demographics, we use this dataset to construct a market segmentation model for vitamin C purchases with the K-means algorithm, incorporating RFM attributes and demographic factors. We analyze the characteristics of each segment to provide insights for firms developing business strategies. To understand how segments evolve under economic shocks, we exploit COVID-19 as an exogenous demand shock and compare purchasing patterns in the pre- and during-pandemic periods. This setting is relevant because demand for vitamin C rose notably during the COVID-19 pandemic, given its well-known immune-boosting properties (Ahmed et al., 2023).
This study contributes to market research practice in three ways. First, we use a machine learning approach to categorize vitamin C consumers into segments, which, to our knowledge, has not previously been done in the literature for similar products. Second, we link the clustering to segment-specific own-price elasticities, connecting who the segments are to how they respond to price and providing data-driven guidance for targeted pricing and promotion. Third, we examine COVID-19 as an economic shock to evaluate how consumer segment composition differs before and during the disruption. Using the centroid distance metric, we develop a practical tool that managers can adopt to monitor and respond to changes in consumer behavior under other market disturbances.
Empirical Analysis
Methodology
K-means clustering is an unsupervised machine learning method that divides
K-means clustering is applicable when the features are either continuous or binary variables. Therefore, categorical variables are converted into dummy (0/1) variables and standardized using z-scores. The algorithm proceeds iteratively. First, households are randomly distributed into distinct clusters, and the centroid of each cluster is computed as the average of the household characteristics within that cluster. Next, each household is reassigned to the cluster with the nearest centroid, and the centroids are updated based on the latest cluster assignments. This process repeats until the centroid positions no longer change significantly. At the final stage, each household is assigned to a unique cluster, ensuring exclusivity and preventing overlapping memberships.
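The iterative procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the implementation used in the study: the function names (`zscore_columns`, `kmeans`), the round-robin random start, the fixed seed, and the stopping rule (assignments no longer change) are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100, seed=0):
    """K-means starting from a random partition of the rows (illustrative)."""
    rng = random.Random(seed)
    # Random initial assignment: shuffle, then deal rows round-robin so that
    # no cluster starts empty.
    order = list(range(len(rows)))
    rng.shuffle(order)
    labels = [0] * len(rows)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k
    cents = []
    for _ in range(max_iter):
        # Centroid = average of the features of the cluster's members.
        cents = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                cents.append([mean(dim) for dim in zip(*members)])
            else:  # re-seed an emptied cluster with a random row
                cents.append(list(rng.choice(rows)))
        # Reassign every row to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, cents[c]))
                      for r in rows]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
    return labels, cents
```

Calling `kmeans(zscore_columns(features), 4)` on a list of household feature vectors would return one label per household and four centroids. Production work would more likely rely on a tested library with multiple random restarts.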
To analyze potential changes in the profile of each market segment before and during COVID-19, we applied K-means separately to the pre-COVID and during-COVID datasets, using March 11, 2020 (the WHO pandemic declaration) as the cut-off date. Each cluster has a centroid, the mean vector of its member observations across all features. Cluster labels are ordered within each period by monetary and then frequency scores; the labels are not intended to imply one-to-one continuity across periods. We present side-by-side centroid profiles and segment shares for each period. As a descriptive summary, we report an order-based centroid distance, which measures how far apart the centroids of same-ranked clusters are across the two periods.
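One way to compute an order-based centroid distance is sketched below. The exact metric is not recoverable from the text here, so the sketch assumes a Euclidean distance and an ascending sort by monetary and then frequency score; the function names and the column indices passed in are hypothetical.

```python
import math


def order_clusters(centroids, monetary_idx, frequency_idx):
    """Sort centroids by monetary score, then frequency score (ascending)."""
    return sorted(centroids, key=lambda c: (c[monetary_idx], c[frequency_idx]))


def centroid_distances(pre, during, monetary_idx, frequency_idx):
    """Euclidean distance between same-rank centroids of the two periods.

    Assumption: both periods have the same number of clusters and the same
    feature ordering; the distance metric itself is an illustrative choice.
    """
    pre_sorted = order_clusters(pre, monetary_idx, frequency_idx)
    during_sorted = order_clusters(during, monetary_idx, frequency_idx)
    return [math.dist(a, b) for a, b in zip(pre_sorted, during_sorted)]
```

Because clusters are matched by rank rather than by membership, a large distance signals that the profile occupying a given rank has changed, not that particular households have moved.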
To obtain segment-specific price elasticities, we estimate a log–log specification with brand tier fixed effects. For each period
With the estimated elasticities, the revenue change for segment
Data
Our study uses Nielsen Consumer Panel data from 2018 to 2022, including consumer purchase histories and demographic information. The dataset comprises a total of 22,025 purchase records, which have then been aggregated for household-level analysis. The demographic variables incorporated in this market segmentation analysis are race, household size, female head education level, and income level. Information on race, household size, and female head education level is provided directly in the data set. However, Nielsen Consumer Panel data does not provide exact income values for each household, but instead offers income ranges. We used the median value within each range as a proxy for the actual income. For instance, for households in the income range between $70,000 and $99,999, we used $84,999.5 as their representative income. To classify households into income groups, we follow the federal poverty guidelines (FPL) issued annually by the U.S. Department of Health and Human Services (HHS), which adjust thresholds by household size (Creamer et al., 2022). Households with income above 400% of FPL are defined as “high”, those between 200% and 400% as “middle”, and those below 200% as “low” income group.
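The income construction can be illustrated with a short sketch. The function names are hypothetical, the treatment of households exactly at the 200% and 400% boundaries is an assumption, and the guideline parameters must be taken from the HHS table for the relevant year; the usage note below uses what we understand to be the 2022 figures for the 48 contiguous states.

```python
def bracket_midpoint(low, high):
    """Representative income for a reported income range."""
    return (low + high) / 2


def poverty_line(household_size, base, per_additional):
    """Poverty guideline for a household: base amount for one person plus a
    fixed increment for each additional member."""
    return base + per_additional * (household_size - 1)


def income_group(income, household_size, base, per_additional):
    """Classify a household as 'high', 'middle', or 'low' income.

    Assumption: a ratio of exactly 2.0 or 4.0 falls in the 'middle' group.
    """
    ratio = income / poverty_line(household_size, base, per_additional)
    if ratio > 4.0:
        return "high"
    if ratio >= 2.0:
        return "middle"
    return "low"
```

For example, with a base of $13,590 and an increment of $4,720, a four-person household in the $70,000 to $99,999 range has a representative income of $84,999.5 against a guideline of $27,750, a ratio of about 3.1, and is classified as middle income.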
In addition, we extracted the recency, frequency, and monetary (RFM) metrics from the customer purchase records. Recency (R) measures the time elapsed since a consumer’s most recent purchase, frequency (F) is the number of transactions within a specific period, and monetary (M) is the total amount spent by a consumer. We then normalized the RFM attributes and assigned RFM scores ranging from 1 to 5. A higher recency score indicates a longer time since the last purchase, a higher frequency score reflects more frequent purchasing, and a higher monetary score represents greater spending (Hughes, 2005). Variables used in the price elasticity estimation include quantity purchased, unit price paid, brand type, and date of purchase (see Supplementary Material for estimation details).
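A minimal sketch of the RFM construction follows. The text does not specify how the 1 to 5 scores were assigned, so rank-based quintile binning is assumed here; the record layout `(household_id, purchase_date, amount)` and the function names are likewise hypothetical.

```python
from datetime import date


def rfm_table(purchases, as_of):
    """Aggregate purchase records to household-level R, F, and M values.

    purchases: iterable of (household_id, purchase_date, amount).
    Recency is the number of days between the last purchase and `as_of`.
    """
    table = {}
    for hh, day, amount in purchases:
        rec = table.setdefault(hh, {"last": day, "frequency": 0, "monetary": 0.0})
        rec["last"] = max(rec["last"], day)
        rec["frequency"] += 1
        rec["monetary"] += amount
    return {hh: {"recency": (as_of - r["last"]).days,
                 "frequency": r["frequency"],
                 "monetary": r["monetary"]}
            for hh, r in table.items()}


def quintile_scores(values):
    """Map each household's value to a 1-5 score by rank (assumed binning).

    Larger values receive higher scores; tied values share the lowest rank
    among them. Applied to recency, a higher score therefore means a longer
    time since the last purchase, matching the convention in the text.
    """
    ordered = sorted(values.values())
    n = len(ordered)
    return {hh: 1 + (5 * ordered.index(v)) // n for hh, v in values.items()}
```

Scores for each dimension would be obtained by passing, for instance, `{hh: r["monetary"] for hh, r in table.items()}` to `quintile_scores`.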
Empirical Results
To determine the optimal number of clusters for the K-means algorithm, we employed the elbow method. This technique is widely used in clustering analysis, including consumer segmentation (Kansal et al., 2018; Syakur et al., 2018). It identifies the point at which adding more clusters no longer substantially reduces the within-cluster sum of squares (WCSS). As shown in Figure 1, the elbows suggest that three or four clusters are optimal: WCSS declines sharply up to that point and flattens thereafter. To keep the segmentation comparable and meaningful across both periods, we selected four clusters as the final segmentation.
Figure 1. Elbow Plots (WCSS vs. number of clusters)
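The elbow curve itself is straightforward to compute: run K-means for each candidate number of clusters and record the WCSS. The sketch below is illustrative only; it uses a deterministic start from the first k rows (an assumption made so the example is reproducible), and the function names are hypothetical.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wcss(rows, k, max_iter=100):
    """Run a basic K-means and return the within-cluster sum of squares."""
    cents = [list(r) for r in rows[:k]]  # assumed start: the first k rows
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, cents[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stable, so centroids are too
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                cents[c] = [mean(dim) for dim in zip(*members)]
    return sum(sq_dist(r, cents[lab]) for r, lab in zip(rows, labels))


def elbow_curve(rows, k_max):
    """WCSS for k = 1..k_max; the 'elbow' is where the decline flattens."""
    return {k: kmeans_wcss(rows, k) for k in range(1, k_max + 1)}
```

Plotting the returned values against k for each period gives curves of the kind shown in Figure 1. Reading the elbow remains a judgment call, which is why the choice between three and four clusters is made on comparability grounds.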
Description of Demographic and RFM Variables by Cluster Pre- and During-COVID
Households per segment: Pre–COVID
Cluster 3 underwent the most notable shift in the profiling. Pre-COVID, it combined very low frequency with relatively high monetary score. During COVID, the profile suggests that Cluster 3 appears to capture a new group of high-frequency, high-spending shoppers, whose characteristics may reflect more engaged or health-motivated supplement shoppers during the pandemic. Cluster 4 remained the most stable in its behavioral profile: frequency remained high and monetary score was consistently elevated. Pre-COVID, it included a high proportion of White households and college educated female heads and the lowest representation of low-income households among the clusters. During COVID, households assigned in Cluster 4 exhibited similar high spending and frequency, representing a behaviorally consistent, high-value consumer segment. Hence, the segment likely comprises routine or brand-loyal supplement shoppers whose purchasing patterns were less affected by external disruptions.
The centroid difference reflects how the average profile of each segment has changed between the pre- and during-COVID periods. Clustering is estimated separately by period, so each solution reflects contemporaneous purchasing patterns. This measure is descriptive and does not imply one-to-one tracking of households or segments. It summarizes how far apart the cluster centers are in the RFM and demographic feature space. Cluster 3 exhibits the largest centroid difference (2.74), indicating a substantial transformation in the characteristics of the households assigned to this cluster during COVID, consistent with the higher purchase frequency and spending reported earlier. Cluster 4 also shows a notable difference (1.89), reflecting moderate changes in purchasing intensity and a slightly broader demographic composition. By contrast, Clusters 1 and 2 display relatively small centroid differences (0.66 and 0.48), aligning with their more stable RFM and demographic profiles. Overall, the more substantial profile changes during the pandemic are concentrated in the higher-spending segments.
Estimated Price Elasticities by Cluster: Pre-COVID vs. During-COVID
Notes. Clusters 1–4 are derived separately across periods; labels do not imply direct equivalence.
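For readers who wish to reproduce estimates of this kind, the sketch below shows one way to obtain an own-price elasticity from a log–log regression with brand-tier fixed effects. It is a simplified illustration, not the specification used in the study (see the Supplementary Material for that): it includes no controls other than the fixed effects, and the function name and record layout are hypothetical. Demeaning within brand tier is numerically equivalent to including a dummy for each tier.

```python
import math
from collections import defaultdict


def own_price_elasticity(obs):
    """Slope of log(quantity) on log(price) with brand-tier fixed effects.

    obs: iterable of (quantity, unit_price, brand_tier), with quantity and
    price strictly positive. Both logs are demeaned within brand tier, and
    the pooled OLS slope of the demeaned variables is returned.
    """
    groups = defaultdict(list)
    for q, p, tier in obs:
        groups[tier].append((math.log(p), math.log(q)))
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx
```

Running the function separately on the observations of each cluster and period yields one elasticity per cell, which is the structure of the table above. Standard errors are omitted from the sketch.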
Under the constant-elasticity (log–log) specification, a uniform 10% reduction in list price lowers revenue in every segment, whereas a uniform 5% increase raises it. Because demand is inelastic in every segment (all estimated elasticities are below one in absolute value), the implied segment-level effects are small: a revenue decrease of roughly 1.6% to 2.9% for the 10% cut and an increase of about 0.8% to 1.4% for the 5% increase. From a managerial perspective, inelastic demand argues against across-the-board discounting. If price changes are pursued, modest list-price increases are expected to raise revenue, and any promotional activity should be short-lived and tightly targeted to the segment with the largest absolute elasticity within the relevant period.
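The revenue arithmetic can be made explicit. If quantity follows Q = A·P^ε, revenue is R = A·P^(1+ε), so a proportional price change δ scales revenue by (1+δ)^(1+ε). The sketch below applies this exact constant-elasticity formula; whether the reported figures use this form or a first-order approximation is not stated here, and the elasticity of -0.8 in the usage note is a hypothetical value chosen for illustration, not an estimate from the study.

```python
def revenue_change(elasticity, price_change):
    """Proportional revenue change under constant-elasticity demand.

    elasticity: own-price elasticity (negative for ordinary demand).
    price_change: proportional price change, e.g. -0.10 for a 10% cut.
    Returns (1 + price_change) ** (1 + elasticity) - 1.
    """
    return (1.0 + price_change) ** (1.0 + elasticity) - 1.0
```

With a hypothetical elasticity of -0.8, a 10% price cut changes revenue by about -2.1% and a 5% increase by about +1.0%. At an elasticity of exactly -1 the change is zero, and the sign flips once demand becomes elastic, which is why the direction of the recommendation depends on demand being inelastic in every segment.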
Discussion and Conclusion
This study applies K-means clustering to segment U.S. vitamin C buyers. The analysis reveals distinct shopping patterns, with clear differences in price sensitivity across four segments. In addition, we show that economic disruption (e.g., the COVID-19 pandemic) is associated with within-period shifts in expenditure composition and modest changes in price responsiveness. We link behavioral clustering to segment-specific own-price elasticities by period and translate those elasticities into expected revenue changes under common price moves, so recommendations are anchored in financial relevance. Because market conditions differ across periods, segments and elasticities are estimated separately for the pre- and during-COVID windows, providing a shock-aware template that can be reused for other disruptions without assuming one-to-one segment continuity. The results offer actionable guidance for firms in the dietary supplement market.
Having established the segment structure and price responsiveness, we translate these findings into managerial actions. For lower-engagement or occasional segments (limited spending; infrequent purchases), the objective is reactivation and habit formation. Effective tactics for stimulating frequency include win-back outreach based on loyalty histories, targeted coupons, limited-time offers, and bundle or threshold promotions (e.g., buy-one-get-one). Where feasible, memberships or subscriptions (auto-replenishment, reorder reminders) can nudge repeat purchasing and reduce reliance on deep discounts (Jayaraman et al., 2013; Meyer-Waarden, 2008; Peker et al., 2017).
For higher-value segments (those concentrating spend within a period), the priority is to protect margin while reinforcing perceived value. Pair price decisions with value drivers beyond price, such as reliable assortment, service benefits, and clear quality or efficacy cues (Grewal et al., 2011; Steenkamp et al., 2010). Personalized recognition can help sustain frequency without conditioning the segment on deep discounts (Peker et al., 2017). Among habitual, health-oriented buyers, premium presentation, product innovation, and relevant wellness content support willingness to pay (Camanzi et al., 2024; Valls et al., 2012). Given inelastic demand across all segments, broad discounts are not expected to raise revenue; if prices are adjusted, modest list-price increases or narrowly targeted, limited-duration promotions should be directed to segments with larger absolute elasticities and meaningful expenditure shares within each period.
This study also addresses the lack of structured consumer segmentation evidence in the vitamin category by combining RFM- and demographic-based clustering with segment-specific elasticities to inform pricing and promotion. A limitation of this study is temporal coverage: only Nielsen Homescan data through 2022 had been integrated when the analysis was finalized; consequently, the study captures behavior during and immediately after the pandemic but not longer-run post-pandemic dynamics. Additionally, while we account for brand-type differences using brand-tier fixed effects, our ability to estimate more granular brand-level effects was limited by sample size within each segment. Future work could track patterns beyond 2022, incorporate larger panel subsets, examine brand loyalty and segment migration, and consider alternative weighting schemes (e.g., WRFM) when recency is less informative during shocks (Vaidya & Kumar, 2006).
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
