Abstract
This study adopts the quantile regression method to analyze the influencing factors of low-income, middle-income, and high-income groups at the national, urban, and rural levels in China, respectively, at the quantiles of 0.05, 0.20, 0.50, 0.80, and 0.95. Subsequently, the GM(1, 1) model within the gray-system theory is utilized to predict the proportion of the middle-income group in China in the future. By comparing the prediction results from 2010 to 2017 with the data of the proportion of the middle-income group at the national, urban, and rural levels obtained through kernel density estimation, it is found that the gray-system prediction exhibits high accuracy and satisfactory results. The implications of this study for future social development may lie in providing a certain degree of data reference for social governors.
Introduction
In sociology, the middle-income group holds significant academic importance. When the middle-income group accounts for a relatively large proportion in society, the social structure takes on an olive shape, that is, large in the middle and small at both ends. This structure is considered the most stable form of social structure.1–4 Compared with the pyramid structure (where the low-income group accounts for the majority), the olive structure can reduce the contradictions and conflicts between the lower and upper classes of society, making society more harmonious and stable. The middle-income group usually has a certain economic foundation and social status, and they tend to solve problems in a peaceful and rational way rather than taking extreme measures. When social conflicts and problems arise, the middle-income group can play a buffering role and avoid the intensification and spread of conflicts.
Enhancing the proportion of the middle-income group is the ultimate goal of the research on the middle-income group. Given the conditions of mobility and uncertainty, the income status of the middle-income group is not static, which leads to fluctuations in the proportion of the middle-income group. Jenkins 5 pointed out that human capital has a significant stock effect and structural effect on the widening of income gaps and the increase in the proportion of the middle-income group. Martinez and Parent 6 found through research that generally, women focusing on family life had limited opportunities for skill improvement. Coupled with their relatively low-income levels, in case of divorce, this group of women would find it difficult to remain in the middle-income group. Therefore, they believed that the divorce rate was an important factor causing changes in the proportion of the middle-income group.
The measurement of the proportion of the middle-income group is also an important area of research for scholars around the world.7–12 For common income-related problems, ordinary least squares (OLS) regression is generally used to construct models.13–16 This is because OLS is a simple and effective method for estimating model coefficients.15,16 However, only when multiple assumptions of OLS regression are met, such as zero-mean, homoscedasticity, no autocorrelation, and the error term following a normal distribution, can OLS regression possess excellent properties like unbiasedness, efficiency, and consistency. In real-life situations and routine research, it is rare to ensure that these assumptions hold simultaneously. Therefore, when the sample exhibits severe heteroscedasticity, heavy-tailed, or leptokurtic distributions, the estimates obtained by the least-squares method will no longer possess the abovementioned excellent properties.
Research on the middle-income group has been approached from diverse perspectives and using various methods. This diversity underscores the profound practical and theoretical significance of conducting systematic research on the group's evolution. It is precisely this significance that has drawn extensive attention to this research topic within the academic community. Regarding the study of factors influencing the middle-income group, current research predominantly adopts a macroperspective, with relatively few studies exploring microinfluencing factors. This imbalance renders the research incomplete. Existing research on issues related to the middle-income group typically focuses on isolated aspects, and there is a scarcity of systematic, multilevel research, such as investigations into the microinfluencing factors and predictive analyses of the middle-income group. Consequently, it fails to comprehensively and clearly depict the current state of the middle-income group.
Building on the strengths of various previous studies, this article utilizes the microdatabase of the Chinese General Social Survey (CGSS) 17 from 2010 to 2017. From a microperspective, it introduces quantile regression to analyze the factors influencing the proportion of China's middle-income group and changes in their income. Additionally, the gray prediction method is employed to forecast the future proportion of the middle-income group. Therefore, compared with traditional research, this article conducts a comprehensive study on issues related to China's middle-income group from multiple directions, perspectives, and levels, which endows it with certain research value.
Model construction
Unlike the coefficients of ordinary least squares (OLS) regression, which describe the impact of changes in the independent variables on the conditional mean of the dependent variable, the coefficients of quantile regression describe their impact on a chosen conditional quantile of the dependent variable. Whereas OLS estimates the regression coefficients by minimizing the sum of squared deviations, quantile regression minimizes an asymmetrically weighted sum of absolute deviations. Like OLS, quantile regression has methods for assessing the model, which can be roughly divided into three types: the goodness-of-fit test, the likelihood ratio test, and the Wald test. Because the theory and applications of quantile regression are relatively mature and well documented, this article does not elaborate on them further.
As can be seen, the quantile regression model not only has a complete model-building system but also has the advantage of reflecting the changes in the degree of influence of independent variables on the dependent variable at different quantiles. Therefore, this article will use quantile regression to model and analyze the microfactors affecting the change in the proportion of the middle-income group.
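As an illustration of the estimation criterion only (not the estimation code used in this study), the following minimal Python sketch shows the check loss that quantile regression minimizes. For an intercept-only model, the minimizer is a sample quantile. The function names and the income values are hypothetical.

```python
def check_loss(residual, tau):
    """Check ("pinball") loss of one residual at quantile level tau."""
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def sample_quantile_by_check_loss(values, tau):
    """Return the observed value that minimizes the total check loss.

    For an intercept-only model, this minimizer is a tau-th sample quantile.
    """
    return min(values, key=lambda q: sum(check_loss(y - q, tau) for y in values))


# Made-up incomes with one extreme value (illustration only).
incomes = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = sample_quantile_by_check_loss(incomes, 0.5)
mean_fit = sum(incomes) / len(incomes)
```

In this toy example the check-loss fit at tau = 0.5 stays at the middle observation, whereas the mean is pulled toward the extreme value, which is why quantile-based estimates are less sensitive to heavy tails.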
The variables used are explained as follows. According to the survey content of the CGSS 2010–2017 questionnaire and the research questions of this article, the total annual income of the respondents is used as the dependent variable. On this basis, the nine variables with the most important influence on income were selected as independent variables: the respondents’ region, urban–rural differences, age, gender, education, father's education, mother's education, happiness level, and health level. The regional variable is defined by economic development level. Specifically, the provinces and municipalities in eastern China are categorized as developed regions, the central provinces as moderately developed regions, and the western provinces as underdeveloped regions. This classification makes it possible to examine how the influence of each factor on the middle-income group varies across regions. The coding of the specific influencing factors is as follows.
Region: 1, underdeveloped area; 2, moderately developed area; 3, developed area.
Urban–rural differences: 1, urban; 0, rural.
Gender: 1, male; 0, female.
Education: 0, no education, private school, literacy class; 1, primary school; 2, junior high school; 3, technical school, technical secondary school; 4, vocational high school, general high school; 5, college (adult higher education), undergraduate (adult higher education); 6, college (formal higher education), undergraduate (formal higher education), graduate and above.
Happiness level: 1, extremely unhappy; 2, relatively unhappy; 3, indifferent (neither happy nor unhappy); 4, relatively happy; 5, extremely happy.
Health level: 1, very unhealthy; 2, relatively unhealthy; 3, average; 4, relatively healthy; 5, very healthy.
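The coding scheme above can be written as a small lookup structure, which is convenient for checking records before modeling. The following Python sketch is an assumption for illustration: the field names (`region`, `urban`, and so on) and the helper `invalid_fields` are hypothetical and are not CGSS variable names.

```python
# Hypothetical field names; the admissible codes follow the scheme described above.
CODEBOOK = {
    "region": {1: "underdeveloped area", 2: "moderately developed area", 3: "developed area"},
    "urban": {1: "urban", 0: "rural"},
    "gender": {1: "male", 0: "female"},
    "education": set(range(0, 7)),  # 0 (no education) to 6 (formal higher education and above)
    "happiness": set(range(1, 6)),  # 1 (extremely unhappy) to 5 (extremely happy)
    "health": set(range(1, 6)),     # 1 (very unhealthy) to 5 (very healthy)
}


def invalid_fields(record):
    """Return the names of coded fields whose value is outside the codebook."""
    return [
        name
        for name, allowed in CODEBOOK.items()
        if name in record and record[name] not in allowed
    ]


# A made-up record with one out-of-range code (education = 7).
example_record = {"region": 3, "urban": 1, "gender": 0,
                  "education": 7, "happiness": 5, "health": 2}
problems = invalid_fields(example_record)
```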
A two-step sample-selection strategy is used to support the comparative analysis. Samples from the 2010 and 2017 survey waves were chosen. First, the income variable was preprocessed: inapplicable samples, which may fall outside the research scope because of abnormal data values or inconsistent data collection methods, were eliminated. Second, to ensure the reliability and validity of the data, samples with responses such as “missing,” “refusal to answer,” and “don't know” were excluded, because such responses can introduce bias or uncertainty into the analysis. As a result, the numbers of valid samples for 2010 and 2017 are 9638 and 10,911, respectively, which provide the data foundation for the subsequent statistical inference and model building.
Factor analysis
We construct a quantile regression model with the annual total income of residents as the dependent variable, and age, region, urban–rural differences, gender, education, father's education, mother's education, happiness level, and health level as independent variables. When this model is applied separately to the incomes of urban residents and rural residents, the urban–rural variable is excluded from the analysis. Additionally, the group around the 0.05 quantile is approximated as the low-income group, the group around the 0.95 quantile is approximated as the high-income group, and the groups around the 0.20, 0.50, and 0.80 quantiles are approximated as the middle-income group for analysis.
The quantile regression results for 2010 and 2017 in Tables 1 and 2 reflect the magnitude of the contribution of different variables to residents’ income at different quantiles, as well as the changes in the degree of this influence over time. Here, * means significant at the 0.1 level, ** means significant at the 0.05 level, and *** means significant at the 0.01 level.
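For readers reproducing the tables, the star notation can be expressed as a small helper. This is a sketch that assumes the conventional cutoffs of 0.1, 0.05, and 0.01; the function name `significance_stars` is hypothetical.

```python
def significance_stars(p_value):
    """Map a p-value to the star notation, assuming cutoffs 0.1, 0.05, and 0.01."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""  # not significant at the 0.1 level
```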
At the 0.05 quantile, the variables of father's educational attainment, mother's educational attainment, and happiness level are not significant in either 2010 or 2017. This indicates that these variables have no statistically significant effect on the income of this group. In contrast, gender, the respondent's educational attainment, and health level significantly influenced the income of this group in both 2010 and 2017.
Table 1. Quantile regression results for 2010.
Table 2. Quantile regression results for 2017.
From the coefficients of the quantile regression, it can be observed that, at this quantile, the income of men is higher than that of women. There is a positive correlation between educational attainment and income, meaning that the higher the educational level, the higher the income. Additionally, healthier individuals tend to have relatively higher incomes.
Comparing the coefficients across the nine variables, gender and educational attainment have the largest coefficients. This suggests that, at this quantile, gender and educational attainment are the two most important variables influencing income.
At the three quantiles of 0.20, 0.50, and 0.80 (representing the middle-income group), the contributions of each variable are generally significant. First, judging from the absolute values of the variable coefficients, four variables, namely region, urban–rural area, gender, and educational attainment, are the most important factors influencing the middle-income group. Among them, the absolute value of the coefficient of the region variable is the largest, indicating that the income level of the middle-income group in the eastern region is higher than that in the central and western regions. There are significant regional disparities among the middle-income groups in China.
In terms of the coefficients of the other three relatively important variables affecting the middle-income group, the income level of the urban middle-income group is higher than that of the rural group, men have a higher income level than women, and the higher the educational attainment, the higher the income level. This shows that there are urban–rural and gender disparities within the middle-income group.
Moreover, as the quantile increases, the coefficients of urban–rural area, gender, educational attainment, and the absolute value of the coefficient of the region variable also become larger. This implies that within the middle-income group, the increase in the income level increasingly depends on the changes in variables such as region, urban–rural area, gender, and educational attainment. The higher the income, the more obvious the effects of these four influencing factors.
From a temporal perspective, the absolute values of the coefficients of urban–rural area, gender, educational attainment, and the region variable in 2017 are significantly larger than those in 2010. This indicates that over time, the variables of region, urban–rural area, gender, and educational attainment are playing an increasingly important role in the income changes of the middle-income group.
At the 0.95 quantile (representing the high-income group), the influencing factors are the same as those affecting the middle-income group at the 0.20, 0.50, and 0.80 quantiles. The absolute values of the coefficients of the variables of region, urban–rural area, gender, and educational attainment are the largest. Therefore, these four variables still play the most significant roles in influencing the income of the high-income group.
Regarding the changes over time, it is notable that the age variable became insignificant in 2017 compared to 2010. Meanwhile, the influence of the father's educational attainment variable on the income of this group changed from being insignificant in 2010 to significant in 2017.
These changes imply that, over time, the high-income group is becoming increasingly younger. This might be attributed to the rise of emerging industries such as e-commerce, computer technology, and artificial intelligence. These industries have enabled a large number of young people with relevant skills to rapidly increase their incomes and achieve financial freedom, thereby joining the high-income group. Meanwhile, the intergenerational transmission effect of parental education is becoming more prominent. In an environment where educational attainment plays an increasingly important role in determining income, parents with higher educational levels have access to high-quality educational resources. They can pass these advantages directly to their children, allowing the younger generation to continue reaping the benefits of education and even to be the first to do so. This enables them to secure better job opportunities and higher incomes. As a result, the educational attainment of fathers is becoming increasingly significant.
Prediction of the proportion of middle-income groups
Related concepts
The GM(1,1) model in gray-system theory can effectively address the contradiction between sample size and accuracy. Owing to its advantages such as simple calculation methods, requirement for a relatively small amount of sample data, and relatively stable prediction results, it has been widely applied in disciplines like agriculture, geology, and meteorology, and has become one of the most extensively used gray models.18–21
In this study, the calculation results of the proportion of the middle-income group at the national, urban, and rural levels from 2010 to 2017 are used as the original sequence. Given the relatively small sample size, if only traditional prediction models were employed, the prediction results might lack credibility. Therefore, the GM(1,1) model is used to predict the proportion of the middle-income group at the national, urban, and rural levels for the next 3 years (from 2018 to 2020).
Prediction model
Generally speaking, our understanding of things can be classified into three states: complete knowledge, complete ignorance, and partial knowledge (a semi-known state). These three states of understanding correspond to white, black, and gray systems, respectively. Therefore, gray prediction involves forecasting systems that contain both known and uncertain information, that is, predicting gray processes that change within a certain range and are related to time.
Gray system theory defines gray derivatives and gray differential equations based on concepts such as the associated space and smooth discrete functions. Then, a dynamic model in the form of a differential equation is established using discrete data. Since this is a basic model established based on gray-system theory, and the model is approximate and nonunique, it is called a gray-system model, denoted as GM(n, h) (Grey Model), where n represents the order of the model and h represents the number of variables included in the model. In practical research, the GM(1, 1) model is mainly used. It represents a first-order gray system model containing only one variable.
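For reference, the GM(1,1) model is commonly written in the following textbook form. The notation is an assumption introduced here, not taken from the article: $x^{(0)}$ is the original sequence of length $N$, $x^{(1)}$ is its accumulated sequence, $z^{(1)}$ is the mean of adjacent accumulated values, $a$ is the development coefficient, and $b$ is the gray input, with $a$ and $b$ estimated by least squares from the gray differential equation.

```latex
\begin{align}
x^{(1)}(k) &= \sum_{i=1}^{k} x^{(0)}(i), & k &= 1, \dots, N \\
z^{(1)}(k) &= \tfrac{1}{2}\left[ x^{(1)}(k) + x^{(1)}(k-1) \right], & k &= 2, \dots, N \\
x^{(0)}(k) + a\, z^{(1)}(k) &= b, & k &= 2, \dots, N \\
\frac{\mathrm{d}x^{(1)}}{\mathrm{d}t} + a\, x^{(1)} &= b \\
\hat{x}^{(1)}(k+1) &= \left( x^{(0)}(1) - \frac{b}{a} \right) e^{-ak} + \frac{b}{a}, & k &= 0, 1, 2, \dots \\
\hat{x}^{(0)}(k+1) &= \hat{x}^{(1)}(k+1) - \hat{x}^{(1)}(k), & k &= 1, 2, \dots
\end{align}
```

The third line is the gray differential equation, the fourth is its whitening (continuous) counterpart, and the last two lines give the fitted and forecast values after inverse accumulation.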
There are two sequence generation methods in gray prediction: accumulated generation and weighted adjacent-value generation. The specific process of prediction using the GM(1,1) model will not be elaborated here. The testing methods for the prediction results include the residual error test method and the grade-ratio deviation value test method.
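Although the procedure is not elaborated in this article, a minimal Python sketch of the commonly used GM(1,1) steps may help readers. It assumes the standard textbook formulation (accumulated generation, mean generation of adjacent values, least-squares estimation of the two parameters); the function names and the example series are hypothetical, and this is not the R code used in the study.

```python
import math


def gm11_fit(x0):
    """Estimate (a, b) in x0(k) + a * z1(k) = b by ordinary least squares."""
    n = len(x0)
    if n < 3:
        raise ValueError("at least three observations are needed")
    # Accumulated generation: x1(k) is the running sum of x0.
    x1, total = [], 0.0
    for value in x0:
        total += value
        x1.append(total)
    # Mean generation of adjacent accumulated values.
    z1 = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = x0[1:]
    m = n - 1
    sz, sy = sum(z1), sum(y)
    szz = sum(z * z for z in z1)
    szy = sum(z * v for z, v in zip(z1, y))
    # Regress y on z1: y = b - a * z1, so the slope is -a and the intercept is b.
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    return -slope, (sy - slope * sz) / m


def gm11_predict(x0, steps):
    """Return (fitted values for x0, forecasts for `steps` further periods)."""
    a, b = gm11_fit(x0)  # assumes a != 0, i.e. the series is not constant
    n = len(x0)

    def x1_hat(k):
        # Time response of the whitening equation; k = 0 is the first observation.
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    # Inverse accumulation recovers the original scale.
    values = [x0[0]] + [x1_hat(k) - x1_hat(k - 1) for k in range(1, n + steps)]
    return values[:n], values[n:]


# Made-up declining series (illustration only, not the study's data).
example_series = [30.0 * 0.97 ** k for k in range(6)]
example_fitted, example_forecast = gm11_predict(example_series, 3)
```

Because the model's time response is exponential, the sketch tracks a smoothly declining series closely; in practice, the residual and grade-ratio tests mentioned above would be applied before trusting the forecasts.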
Case study
In this study, the calculated results of the proportion of the middle-income group at the national, urban, and rural levels from 2010 to 2017, which are derived through the application of the kernel function, 22 serve as the original sequence. Subsequently, the gray-prediction method is utilized to generate the predicted results of the proportion of the middle-income group. These predicted results are then compared with the proportion results of the middle-income group obtained via the kernel function.
Building upon this comparison, predictions are made regarding the changes in the proportion of the middle-income group over the next 3 years (from 2018 to 2020). This approach enables the observation of the scale changes within this group.
The prediction results are presented in Table 3.
Table 3. Predicted results of the proportion of middle-income groups.
Table 3 shows the predicted proportions of the middle-income group for the whole country, urban areas, and rural areas obtained using gray prediction. Comparing the kernel density estimates with the predictions of the GM(1,1) model shows that the maximum error between the two is 7.99%, the minimum error is only 0.03%, and the average error is 2.84%, so the values are relatively close. The absolute correlation degree of the 6-year gray forecast given by the R software is above 99%; that is, the correlation degree is first-level and the prediction accuracy is excellent, indicating that the gray prediction results are credible.
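The comparison above can be sketched as follows. This sketch assumes that the reported errors are relative errors in percent, which the text does not state explicitly; the function names and the input values are hypothetical and are not the values in Table 3.

```python
def relative_errors(observed, predicted):
    """Relative error in percent of each prediction against its observed value."""
    return [abs(p - o) / abs(o) * 100.0 for o, p in zip(observed, predicted)]


def summarize_errors(errors):
    """Maximum, minimum, and mean of a list of errors."""
    return {
        "max": max(errors),
        "min": min(errors),
        "mean": sum(errors) / len(errors),
    }


# Made-up proportions (percent), for illustration only.
kernel_estimates = [20.0, 25.0, 40.0]
gray_predictions = [21.0, 25.0, 38.0]
error_summary = summarize_errors(relative_errors(kernel_estimates, gray_predictions))
```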
Discussions
National level
In 2010, the proportion of the middle-income group nationwide stood at 30.35%. By 2017, this proportion had decreased to 27.04%. From 2010 to 2017, the trends in the proportion of the middle-income group derived from kernel density estimation and from gray prediction were generally similar, both showing a continuous downward trajectory. The prediction results indicate that this downward trend persists from 2018 to 2020: the predicted proportion is 26.25% in 2018 and only 24.55% in 2020. At the national level, during the period from 2011 to 2020, the proportion of the middle-income group decreased by approximately 0.85 percentage points per annum. This indicates that, overall, the proportion of the middle-income group across the country was continuously diminishing.
This implies that, at the present stage, a certain degree of income polarization has emerged in China. As residents’ incomes continue to rise, this situation may threaten social harmony and stability in the coming period. Social governors should therefore pay sufficient attention to the widening income gap. Through a series of policies and macrocontrol measures, they should continuously increase the income of low- and middle-income earners and regulate the income of high-income groups, thereby expanding the middle-income group and working toward a stable social income structure.
Urban and rural level
The proportion of the middle-income group in both urban and rural areas follows a downward trend, similar to that at the national level. In 2010, the proportion of the middle-income group in urban areas was 40.37%. The forecast for 2017 was 38.68%. From 2010 to 2011, there was a slight increase in the proportion of the middle-income group in urban areas; from 2011 to 2017, it declined consistently. Overall, from 2010 to 2017, the change in this proportion was relatively minor. The predicted proportion for urban areas was 37.57% in 2018 and 35.37% in 2020. Calculations indicate that from 2011 to 2020, the proportion of the middle-income group in urban areas decreased by approximately 1.16 percentage points per annum, a slightly larger change than the national average.
In 2010, the proportion of the middle-income group in rural areas was 32.73%, which decreased to 22.13% in 2017. The predicted value was 20.27% for 2018 and only 17.02% for 2020. From 2011 to 2020, the proportion of the middle-income group in rural areas shrank by more than 1.73 percentage points per year. This decline is considerably steeper than that at the national and urban levels.
Overall, from 2010 to 2020, the proportion of the middle-income group in urban areas decreased by around 4.58 percentage points, while in rural areas it decreased by nearly 15.63 percentage points. In terms of the predicted urban–rural structure, the gap in the proportion of the middle-income group between urban and rural areas was 7.64 percentage points in 2010, and it widened to 18.35 percentage points in 2020.
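The urban–rural gap can be reproduced directly from the proportions reported above. The short Python sketch below uses only the 2010 and 2020 figures given in the text; the variable names are illustrative.

```python
# Proportions of the middle-income group (percent), as reported in the text.
urban_share = {2010: 40.37, 2020: 35.37}
rural_share = {2010: 32.73, 2020: 17.02}

# Urban-rural gap in percentage points, rounded to two decimals.
urban_rural_gap = {
    year: round(urban_share[year] - rural_share[year], 2) for year in urban_share
}
```

The resulting gaps match the 7.64 and 18.35 percentage points stated above for 2010 and 2020.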
This indicates that there are significant urban–rural disparities in residents’ incomes, and this gap is continuously widening. The main reasons for this disparity may be as follows: Urban areas have a higher economic level and faster development, resulting in relatively higher incomes. As a result, residents’ incomes in urban areas increase more rapidly, naturally attracting more workers. This means that although urban areas are facing the situation of the loss of the middle-income group and the emergence of income polarization, there are still enough people to supplement and become middle-income earners. Therefore, the proportion of the middle-income group in urban areas is higher and the decrease is smaller.
In contrast, rural areas lag far behind urban areas in economic development. Moreover, rural areas close to economically developed regions gradually transform into urban areas under the impact of urbanization, so rural areas with relatively better economic conditions no longer belong to the rural category. As a result, the remaining rural areas are the truly underdeveloped regions: they lose their development vitality, have even lower income levels, and suffer more severe labor force outflows, which creates a vicious cycle.
At the same time, rural areas that have not been urbanized are facing increasingly serious social problems such as aging, population loss, and development stagnation. As a result, most of the residents remaining in rural areas are elderly people who are no longer in the labor force, and their income levels are naturally far lower than those in urban areas. Therefore, with the continuous increase of the elderly population and the continuous decrease of the labor force, the middle-income group in rural areas has decreased significantly, and the gap with urban areas has been continuously expanding.
Conclusions
This study employed the quantile regression method at the 0.05, 0.20, 0.50, 0.80, and 0.95 quantiles to analyze the influencing factors of low-income, middle-income, and high-income earners at the national, urban, and rural levels in China, respectively. It was found that the variables significantly affecting the income of the middle-income group vary across the national, urban, and rural levels. At the national level, four variables, namely region, urban–rural status, gender, and educational attainment, are the most crucial factors influencing the middle-income group. In urban areas, the four key factors are region, gender, educational attainment, and happiness level. For rural areas, the four significant factors are region, gender, educational attainment, and health status. Evidently, the three variables of region, gender, and educational attainment have a significant impact regardless of the level. Subsequently, the GM(1,1) model in the gray system was utilized to predict the proportion of the middle-income group in China in the future. By comparing the prediction results from 2010 to 2017 with the data on the proportion of the middle-income group at the national, urban, and rural levels obtained through kernel density estimation, it was found that the gray-system prediction has high accuracy and good results. Therefore, the proportions of the middle-income group in China predicted by this method for 2018, 2019, and 2020 are reliable. By observing the changing trend of this proportion, it was found that the proportion of the middle-income group in China will continue to decline in the next three periods.
The implications of this study for future social development may lie in providing a certain degree of data reference for social governors, enabling them to consider social fairness while promoting economic development and to make the income structure more rational. Because of limited data and the issue of timeliness, this study has certain limitations. In the future, attempts will be made to apply big-data technology for improvement.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Research Initiation Project of Wenzhou Polytechnic and the National Natural Science Foundation of China (grant numbers RC202307, 52165061).
Conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
