Abstract
Prediction techniques are important in plant breeding as they provide a tool for selection that is more efficient and economical than traditional phenotypic and pedigree based selection. The conventional genomic prediction models include molecular marker information to predict the phenotype. With the development of new phenomics techniques we have the opportunity to collect image data on the plants, and extend the traditional genomic prediction models where we incorporate diverse set of information collected on the plants. In our research, we developed a hybrid matrix model that incorporates molecular marker and canopy coverage information as a weighted linear combination to predict grain yield for the soybean nested association mapping (SoyNAM) panel. To obtain the testing and training sets, we clustered the individuals based on their marker and canopy information using 2 different clustering techniques, and we compared 5 different cross-validation schemes. The results showed that the predictive ability of the models was the highest when both the canopy and marker information was included, and it was the lowest when only the canopy information was included.
Keywords
Introduction
Plant breeding is a science that uses multiple scientific disciplines to develop improved cultivars. Some of the scientific fields involved in plant breeding are genetics, agronomy, soil science, botany, biochemistry, plant pathology, computer science, statistics, and engineering. The goal of plant breeding is to combine the advantages of these disciplines to increase food security and face the environmental challenges, thus producing high-yielding cultivars with agronomically desired traits for humans in a sustainable way.
One of the biggest tasks of plant breeders is to select individuals for crossing that will produce the desired phenotype. Breeders have to cross individuals and test the progeny in several cycles. They not only measure their success regarding the amount of increase in the performance of the cultivar that is achieved in a cycle (called genetic gain) but also in the amount of time they can release a cultivar to the farmers. One of the ways in which the development process can be shortened is by decreasing the amount of cycles used to develop the cultivar. To achieve this goal, several methods were developed that help in the selection procedure. The development of inexpensive genotyping strategies such as genotype by sequencing (GBS) and single-nucleotide polymorphism (SNP)1,2 made it feasible for plant breeders to take advantage of molecular marker information and develop techniques that improve the efficiency of selection in plant breeding. Molecular markers were first used for prediction purposes in the 1970s when it was assumed that only a few genes affect the phenotype, and it is financially feasible to develop progeny with the desired genes. This approach is called quantitative trait locus (QTL) mapping.3,4 However, the success of QTL mapping depends on the number of QTLs affecting the phenotype. Agronomically important traits (eg, yield) usually are influenced by many small-effect QTLs instead of by few large-effect QTLs. As a result, the adoption of the technology into breeding programs is problematic because it is very expensive to develop a population using this method.
Another method that takes advantage of the dense molecular marker information is called genomic prediction (GP) or genomic selection (GS). The distinction between the 2 terms relies on the notion that genomic information is used for predicting the phenotype, and the prediction results are used by the plant breeders to select individuals for advancing into the next generation and make the desired crosses. Although there are earlier articles describing the foundation of this method,5–7 the first article defining GS using dense markers was presented by Meuwissen et al. 8 Since then, there has been a large amount of effort dedicated to developing methods for GP and evaluating them under various conditions.9–13 GP has the potential to increase the expected genetic gain by reducing the cycle length, and it uses fewer resources compared with traditional phenotypic selection; thus, it has the ability to improve complex traits that are influenced by many small-effect QTLs and their interactions, and traits with low heritability. Evidently, the improvement is more significant for simple traits that are controlled by fewer QTLs and traits with high heritability. 14
Phenomics is an area of research that opens new avenues in the field of prediction for plant breeding. With this new technique, physical and biochemical traits of the plants can be measured in a cost-effective way, and the measurements can be monitored over time. However, it also introduces new challenges in plant breeding because an extensive amount of data can be collected, and the existing statistical methods have difficulty incorporating the large dimensionality of the collected data. Thus, new statistical techniques and algorithms have to be developed to efficiently use the given data to increase the predictive ability of complex traits.
In this article, we introduce a model that combines molecular marker and canopy coverage image information via a combined relationship matrix for the soybean nested association mapping (SoyNAM) population. Christensen et al 15 and Legarra et al 16 have developed similar models by combining marker and pedigree data. To our knowledge, this is the first time that phenotypic and genomic data are used to compute a hybrid matrix. For the model, we create a hybrid matrix that combines both the molecular marker and canopy coverage data. We predict the yield of the lines and compare the predictive ability of the model with that of the models where only the molecular marker information and only the canopy coverage information are included. The ultimate goal is to help improve the selection tools used by breeders. In this article, first, we introduce the SoyNAM population and discuss the structure, composition of the population, and molecular marker and canopy data. Then, we discuss how the hybrid matrix is developed from the base GP model. For the prediction, we considered 5 different ways to split the data into training and testing sets based on 2 clustering algorithms. We describe the clustering algorithms used to divide the data into testing and training sets, and how the data were divided. Finally, we present the results and some future avenues to improve the model.
Material and Methods
The SoyNAM dataset
For the evaluation of our model, we used data collected from the SoyNAM population (https://www.soybase.org/SoyNAM/), which is a nested association mapping population originally consisting of 5600 F5-derived recombinant inbred lines (RILs). The RILs were derived by crossing a common, high-yielding parent (IA3023) to 40 other parents. Out of the 40 parents, 17 were high-yielding, elite lines from 8 different states from the United States, 15 lines had diverse ancestry, and 8 were considered as exotic lines as their origins were South Korea, China, Russia, and Serbia. Originally, the development of the SoyNAM population started in 2011, but the phenotypic and canopy coverage data used in this study were observed in 2013. 17 Details of the experimental design can be found in Xavier et al. 17 The molecular marker information consisted of 5305 SNP markers but, after employing quality control (removing markers with minor allele frequency less than 0.05), about 4600 SNP markers were included into the model. 17
Canopy coverage information, which is the area covered by the plant, can easily be measured on soybean plants. The canopy coverage is correlated with canopy light interception, which is a trait that is positively correlated with grain yield but difficult to measure. 18
The canopy image data used for evaluating our model were ground-based red-green-blue (RGB) images. Then, the raw images were converted into a metric that describes the percentage of image pixels that are classified as canopy pixels. The images were collected at regular intervals from 2 to 8 weeks after planting; thus, only 6 measurements were obtained. To extrapolate the data, allowing estimations of daily measures, a logistic growth function via the logit link was used to model longitudinal data of canopy coverage for each genetic line. In this way, at the end for each RIL, there were (6 × 7) 42 canopy coverage measurements (14-56 days after planting) available. The data and models employed in this research are described in detail by Xavier et al. 17
The phenotypic trait that we predicted using the molecular marker and canopy coverage information was grain yield (in kg/ha). There are 8 other traits that were observed for the SoyNAM population, but we only considered the agronomically most important trait. For more information about the development of the SoyNAM population, the genotypic, phenotypic, and canopy coverage information, the reader can refer to Xavier et al.17,19
The hybrid matrix model development
The goal of this study was to develop a model to predict grain yield that incorporates molecular marker and canopy coverage image information using the SoyNAM population. To build a hybrid matrix model, the following model was used as a base
where
Because the marker and canopy data were available, these 2 sources of information were used for computing covariance structures as follows:
Here,
The hybrid matrix could be built combining both covariance structures as a linear combination that depends on the weight
In this study, values between 0 and 1 in steps of 0.01 were used for
Cross-validation methods
To evaluate the prediction models, 3 cross-validation (CV) schemes were employed. The first one was the leave-one-observation-out CV where, for predicting an unobserved genotype, we used all of the remaining RILs. The other ones were based on 2 clustering techniques. We clustered the RILs based on the Euclidean distance among the combined molecular marker and canopy coverage information. We compared 2 clustering algorithms: Clustering Large Applications (CLARA) and a Hierarchical Linkage (HL) method.
CLARA
22
is based on
HL method is a clustering technique that is based on the dissimilarity of the objects, and it builds the clusters starting with the objects being individual clusters (which also called as a “bottom-up” approach). The dissimilarity of the objects is based on the Euclidean distance among the molecular markers and the canopy coverage information of the RILs. For dividing the objects into clusters, the Ward method was implemented which minimizes the total within-cluster variance.
Essentially, the foundations of methods for creating clusters are different. Whereas CLARA attempts to find clusters based on similarities within randomly selected subsets of the data, the HL method starts with single-element clusters and builds the clusters in a “bottom-up” manner.
For both the CLARA and the HL methods, the RILs were partitioned into 2, 3, 4, and 5 clusters. The predictions were performed using the leave-one-observation-out CV within each of the 2, 3, 4, or 5 clusters. Once the predictions for all clusters were computed, these values were integrated into a single vector across clusters. Then, the predictive ability for each of the 99 strategies was evaluated using the Pearson correlation coefficient between the entire vectors of the observed and the predicted values.
Results and Discussion
In this study, we developed a hybrid matrix approach for incorporating molecular marker and canopy coverage information to predict grain yield for the SoyNAM population. For the hybrid matrix, we evaluated 11 different weights. One of the extremes was when absolutely no canopy coverage information was included in the training-testing sets and these were composed only of marker data. The other extreme was the opposite case, where no marker information was used but only canopy coverage information was available. The intermediate cases had both canopy and marker information at different intensities. We clustered the RILs using the CLARA and HL methods, and the numbers of clusters we considered were 2, 3, 4, and 5. Within each cluster, for dividing the data into training and testing sets, we used the leave-one-observation-out CV scheme, where 1 RIL was predicted using all of the remaining RILs. The predictive ability was evaluated based on the correlation between the observed and the predicted grain yield.
We considered prediction strategies based on the combination of 11 weights
Predictive ability values for the CLARA and HL methods using 2, 3, 4, and 5 clusters compared with the leave-one-observation-out CV for the different weights of the hybrid matrix model.
CLARA, Clustering Large Applications; CV, cross-validation; HL, Hierarchical Linkage.
From Table 1, we can see that the highest predictive ability when no clusters were considered (i.e., only 1 cluster, the entire population) was 0.599. This value was obtained for different weights
In the figures, we only show the results using the CLARA method due to the similarity of the clustering techniques regarding predictive ability. The top row of the figures shows the correlation plots when the leave-one-observation-out CV is implemented for the whole dataset without clustering, and the bottom row of the figures shows the results for the CLARA method when 5 clusters were considered. The left panels are the correlation plots when only canopy coverage information is included (
The plots in Figure 1 show the predictive ability as the correlation between the observed and predicted values of grain yield. The title of the plots includes the overall correlation coefficients (i.e., the prediction between the entire vector of the predicted and observed values across clusters), and the plots include the correct classification rates for 16 combinations between 4 different intervals based on empirical percentiles ([0%, 20%], [20%, 50%], [50%, 80%], and [80%, 100%]) for the predictions and 4 intervals ([0%, 20%], [20%, 50%], [50%, 80%], and [80%, 100%]) for the observed values. Within each plot, the top right value represents the proportion of the top 20% of the observed values that were also predicted as the top 20% in grain yield. On the other hand, if we were interested in screening or discarding the lowest 20% of the lines based on the predictions, the bottom left value shows the proportion of correctly discarded lines based on the observed values. In this case, the objective would be to increase the classification rate values in the diagonal grid of the plot and to reduce the classification error rate addressed in the off diagonal. In general, we observed that the classification rate in the diagonal of the grid improved when the marker and canopy image data were combined using the hybrid matrix and the classification error rate decreased (i.e., for cases in the off diagonal).

Scatter plots showing the prediction accuracy between the predicted (
When we compare the overall correlation coefficients and the classification error rate (the numbers within the grid in the off diagonal), we observed that the predictive ability was the lowest when we only included canopy coverage information into the prediction model, and thus the classification error rate was larger than those in the other 2 cases. When only marker information was considered, the predictive ability was improved and the classification error rate was reduced.
Prediction techniques are valuable tools for plant breeders as they can evaluate a larger number of candidates and select a less number of lines with the same level of efficiency and less cost than with the traditional phenotypic selection; thus, it is important to improve the predictive ability of the available models. With the advancements of collecting image information on plants, we have the opportunity to develop prediction models where we integrate multiple sources of information. The hybrid matrix model can integrate the diverse information as a weighted linear combination of the different sources of information, and it increases the predictive ability compared with models that only incorporated a single source of variation into the model (e.g., only molecular marker or only canopy coverage information). In our future work, we plan to develop a model that also includes the interaction of the hybrid matrix with environmental covariates.
Footnotes
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding:
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Author Contributions
RH participated in the design of the study, performed the genomic predictions, statistical analysis, and drafted the manu-script. DJ participated in the design of the study, performed the statistical analysis, and participated in drafting the manuscript.
