
Abstract

In this issue, the authors of the article “Nutritional Status, Daily Nutrition Intake, and Dietary Patterns of Korean Adults With Low Vision and Blindness” use a large-scale database (the Korea National Health and Nutrition Examination Survey, or KNHANES) to examine differences in nutrition as a function of visual status. The authors looked at many variables related to nutrition and began their analyses by using t-tests to compare participants with visual impairments to participants without visual impairments on body measures and daily nutrient intake. These analyses simply compare the average value of a dependent variable in one group with the average value in the other group. The authors then used K-means clustering to investigate patterns in the nutritional data.

K-means clustering, as the name implies, is a process of dividing a dataset into a given number of clusters, which are groups of datapoints that are more similar to each other than to datapoints in other groupings. The “K” stands for the number of clusters, which is not fixed by the method and must be chosen by the analyst. The technique is called “K-means” because each cluster is centered on the mean of the datapoints in that cluster. The goal is to make the amount of variability within each cluster as small as possible while using as few clusters as possible.
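To make “variability within a cluster” concrete, one common measure is the sum of squared distances from each datapoint to the mean of its cluster. The sketch below is only an illustration: the function name `within_cluster_ss` and the four two-variable datapoints are invented for this example and do not come from the article.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # The cluster mean is computed variable by variable.
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, mean))
    return total

# Hypothetical data: two low scorers and two high scorers on two variables.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

sensible = within_cluster_ss(points, [0, 0, 1, 1])   # similar points grouped together
scrambled = within_cluster_ss(points, [0, 1, 0, 1])  # dissimilar points grouped together
one_each = within_cluster_ss(points, [0, 1, 2, 3])   # every point is its own cluster
```

The sensible grouping gives a much smaller value than the scrambled one. Giving every datapoint its own cluster drives the value to zero, which is why the number of clusters has to be kept small for the result to be informative.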

There are several methods of performing K-means clustering, but the general idea is as follows. First, a guess is made at how many clusters will best suit the data, and an initial mean is chosen for each cluster. Each observation is then assigned to the cluster whose mean is closest to it, which defines an initial set of clusters and the observations within them. Those observations, or scores, are then used to calculate a more precise mean for each cluster, and doing so may cause some observations on the fringes of the clusters to move from one cluster to another. This rearrangement leads to another recalculation. The process is repeated until no changes occur, at which point the clusters and their final means are defined. There are other procedures for clustering data that rely on different underlying mathematical constructs, but those methods are not the focus of this sidebar.
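The assign-and-recalculate loop described above can be sketched in a few lines. This is a minimal illustration of one common variant (often called Lloyd's algorithm), not the procedure used in the article. The function names, the choice of random observations as starting means, and the six datapoints are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Starting guess (an assumption): k randomly chosen observations act as the means.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each observation joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster, so the process stops
        labels = new_labels
        # Recalculation: each mean becomes the average of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous mean
                means[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, means

# Hypothetical observations measured on two variables.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, means = k_means(points, k=2)
```

With data this well separated, the first three observations end up in one cluster and the last three in the other. Which cluster is numbered 0 and which is numbered 1 depends on the starting guess.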

In the article under discussion, the authors determined that two clusters would suffice in each of the participant groups. The authors used different “seeds,” or starting values, for the cluster means and let each attempt run for 500 iterations. The means used in this process were the average values of the measures of consumption of 19 different food groups previously identified in the study. As you can imagine, it would be a nearly impossible task to figure out by inspection which participants were most similar to each other based on 19 different scores, which is why a process like K-means clustering is so useful. In the end, the authors were able to define four clusters (two in each participant group) made up of participants who were more similar to each other on the 19 scores than to participants in the other cluster of their group.
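The role of seeds and an iteration limit can be illustrated with a small sketch. It runs K-means from several random starting points, caps each run at 500 iterations, and keeps the run with the smallest within-cluster sum of squares. The selection rule, the function names, and the data (three made-up food-group scores per participant rather than 19) are assumptions for illustration. The article does not state how the authors chose among seeds.

```python
import random

def run_k_means(points, k, seed, max_iter=500):
    """One K-means run from a random start; returns (within-cluster SS, labels)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # the seed determines the starting means
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sum((x - m) ** 2 for x, m in zip(p, means[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged before reaching the iteration limit
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                means[j] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(
        sum((x - m) ** 2 for x, m in zip(p, means[lab]))
        for p, lab in zip(points, labels)
    )
    return wss, labels

# Hypothetical participants, each scored on three invented food groups.
points = [
    (2.0, 5.0, 1.0), (2.5, 4.5, 1.5), (1.5, 5.5, 0.5),
    (6.0, 1.0, 4.0), (6.5, 1.5, 3.5), (5.5, 0.5, 4.5),
]

runs = [run_k_means(points, k=2, seed=s) for s in range(10)]
best_wss, best_labels = min(runs, key=lambda run: run[0])
```

On clearly separated data like this, every seed may arrive at the same solution. On messier data, different seeds can settle on different clusterings, which is the reason for trying more than one.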

The participants with visual impairments and the participants without visual impairments each had two clusters, so the individual variables making up the clusters could be compared within a participant group by using t-tests. If only a small number of variables (say, two) had been used to create the clusters, then a t-test on those variables would have produced obvious statistical differences. Those variables would, by definition, differ between the two clusters, since they were used to define the clusters. Since the authors of this article used 19 measures, however, a series of t-tests was warranted to illustrate which of the 19 measures were most effective in defining the clusters.
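A variable-by-variable comparison of two clusters might look like the sketch below. It computes Welch's t statistic, which is an assumption, since the sidebar does not say which form of the t-test the authors used. No p-value is calculated. The variable names and scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (statistic only, no p-value)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical variable names and scores for the members of two clusters.
variables = ["grains", "vegetables", "fruit"]
cluster_a = [(2.0, 5.0, 3.0), (2.5, 4.5, 3.4), (1.5, 5.5, 2.9), (2.2, 5.1, 3.2)]
cluster_b = [(6.0, 1.0, 3.1), (6.5, 1.5, 3.3), (5.5, 0.5, 3.0), (6.1, 0.9, 3.2)]

# zip(*cluster) turns a list of participants into one sequence of scores per variable.
t_by_variable = {
    name: welch_t(scores_a, scores_b)
    for name, scores_a, scores_b in zip(variables, zip(*cluster_a), zip(*cluster_b))
}
```

In this made-up example, the first two variables separate the clusters sharply and yield t statistics that are large in absolute value, while the third variable barely differs between clusters and yields a t statistic close to zero. With many variables, this kind of comparison shows which ones contribute most to the separation.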

K-Means Clustering Explained
